This repository contains the code and examples for my Medium article, which explains how to optimize the computation of data statistics in Apache Spark jobs using the Observations feature. You can read the full article here:
Optimize Computing Data Statistics in Spark Jobs with Observations
This guide demonstrates how to optimize the computation of data statistics in Spark jobs using Observations. Key topics covered include:
- Introduction to Observations in Spark: Learn how Observations work in Spark and how they can be used to capture data statistics during job execution.
- Optimizing Data Collection: Practical techniques for collecting statistics without significant performance overhead.
- Implementing Observations in Spark Jobs: Step-by-step examples of how to use Observations to collect metrics and analyze Spark job performance.
The code in this repository lets you follow along with the article's examples and gain hands-on experience using Observations to optimize data statistics in Spark.