Apache Spark

Saurabh Sharma

So, as part of the academic curriculum, we just started re-exploring Apache Spark. It’s one of those technologies I had long lost touch with. Ironic but true: in this age of big data and speed, Spark was left somewhere way behind in my past projects at Pitney Bowes.

History

Apache Spark was created in 2009 at UC Berkeley’s AMPLab as a research project led by Matei Zaharia. The goal was to overcome the limitations of Hadoop’s MapReduce, a popular but often slow data processing framework. MapReduce was designed for a linear workflow in which each step reads data from disk and writes the results back to disk, which made it inefficient for tasks that require multiple passes over the same data, such as machine learning algorithms.

Spark’s key innovation was the Resilient Distributed Dataset (RDD), a fault-tolerant collection of elements that can be processed in parallel. RDDs can be cached in memory, which significantly speeds up iterative computations and interactive queries by avoiding constant disk reads and writes. This in-memory processing is what makes Spark much faster than MapReduce.
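
To make the caching idea concrete, here is a minimal PySpark sketch (my own illustration; the app name and toy data are made up): the RDD is cached once, and the subsequent actions scan it from memory instead of recomputing it.

    # Illustrative sketch: cache an RDD so repeated passes stay in memory.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RddCacheDemo").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    # Toy dataset; in a real job this might be loaded from HDFS or S3.
    numbers = sc.parallelize(range(1_000_000)).cache()

    # Two passes over the same data: without cache(), each action
    # would rebuild the RDD from scratch.
    total = numbers.sum()
    count = numbers.count()
    print("mean:", total / count)

    spark.stop()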

In 2013, the project was donated to the Apache Software Foundation, and it has since become a top-level project with a massive global community of developers.

Purpose

The core purpose of Apache Spark is to provide a fast and unified engine for big data workloads. It can handle a wide variety of tasks, including:

  • Batch Processing: Analyzing large amounts of static data (e.g., analyzing all sales data from the past year).
  • Real-time Stream Processing: Analyzing data as it’s generated (e.g., monitoring a live feed of social media posts).
  • Machine Learning: Training and running machine learning models on vast datasets.
  • SQL Queries: Performing structured data analysis using a familiar language (SQL); see the sketch after this list.
  • Graph Processing: Analyzing network-like data, such as social connections.
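
As a quick taste of the SQL side, here is a small hypothetical PySpark sketch (the table and column names are invented for illustration): a DataFrame is registered as a temporary view and then queried with ordinary SQL.

    # Illustrative sketch: query structured data with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SqlDemo").master("local[*]").getOrCreate()

    # A tiny in-memory table of (region, amount) rows.
    sales = spark.createDataFrame(
        [("north", 100.0), ("south", 250.0), ("north", 75.0)],
        ["region", "amount"],
    )
    sales.createOrReplaceTempView("sales")

    # A familiar SQL query over the registered view.
    spark.sql(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    ).show()

    spark.stop()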

Spark’s power lies in its ability to do all these things on a single, unified platform, eliminating the need to use separate tools for different tasks. It can run on a variety of cluster managers like Hadoop YARN, Apache Mesos, or Kubernetes, and can read data from a multitude of sources, including local files, Amazon S3, and HDFS.
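
The “many sources, one API” point can be sketched as follows; all paths here are hypothetical, and the S3 read assumes the appropriate Hadoop connector and credentials are configured.

    # Illustrative sketch: the same reader API covers different backends.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SourcesDemo").getOrCreate()

    # Local filesystem (hypothetical path).
    local_df = spark.read.csv("file:///tmp/sales.csv", header=True)

    # Amazon S3 (hypothetical bucket; requires the hadoop-aws connector).
    s3_df = spark.read.csv("s3a://some-bucket/sales.csv", header=True)

    # HDFS (hypothetical cluster path).
    hdfs_df = spark.read.csv("hdfs:///data/sales.csv", header=True)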

Example
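
Below is a minimal word count in PySpark, run in local mode as a self-contained sketch; the input strings are hardcoded, and the console output that follows is exactly what this sketch prints.

    # Classic word count over a tiny in-memory dataset.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

    lines = spark.sparkContext.parallelize([
        "spark is fast",
        "spark is unified",
    ])

    # Split lines into words, pair each word with 1, then sum per word.
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in sorted(counts.collect()):
        print(word, count)

    spark.stop()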

Console output:
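
    fast 1
    is 2
    spark 2
    unified 1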