{"id":2881,"date":"2025-09-14T16:42:42","date_gmt":"2025-09-14T16:42:42","guid":{"rendered":"https:\/\/blog.samarthya.me\/wps\/?p=2881"},"modified":"2025-09-14T16:42:45","modified_gmt":"2025-09-14T16:42:45","slug":"apache-spark","status":"publish","type":"post","link":"https:\/\/blog.samarthya.me\/wps\/2025\/09\/14\/apache-spark\/","title":{"rendered":"Apache Spark"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/09\/Big-data.png\" alt=\"\" class=\"wp-image-2882\" srcset=\"https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/09\/Big-data.png 1024w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/09\/Big-data-150x150@2x.png 300w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/09\/Big-data-150x150.png 150w, https:\/\/blog.samarthya.me\/wps\/wp-content\/uploads\/2025\/09\/Big-data-300x300@2x.png 600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>As part of my academic curriculum, we recently started re-exploring Apache Spark. It&#8217;s one of the technologies I had long lost touch with. Ironically, in this age of big data and speed, Spark was left far behind in my past projects at Pitney Bowes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">History<\/h2>\n\n\n\n<p>Apache Spark was created in 2009 at UC Berkeley&#8217;s AMPLab as a research project led by Matei Zaharia. The goal was to overcome the limitations of <strong>Hadoop&#8217;s MapReduce<\/strong>, a popular but often slow data processing framework.
MapReduce was designed for a linear workflow in which each step reads its input from disk and writes its results back to disk, which makes it inefficient for tasks that need multiple passes over the same data, such as machine learning algorithms.<\/p>\n\n\n\n<p>Spark&#8217;s key innovation was the <strong>Resilient Distributed Dataset (RDD)<\/strong>, a fault-tolerant collection of elements that can be processed in parallel. RDDs can be cached in memory, which significantly speeds up iterative computations and interactive queries by avoiding repeated disk reads and writes. This in-memory processing is what makes Spark much faster than MapReduce for such workloads.<\/p>\n\n\n\n<p>In 2013, the project was donated to the Apache Software Foundation, and it has since become a top-level project with a massive global community of developers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Purpose<\/h2>\n\n\n\n<p>The core purpose of Apache Spark is to provide a <strong>fast and unified engine for big data workloads<\/strong>. It can handle a wide variety of tasks, including:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Batch Processing:<\/strong> Processing large volumes of static data (e.g., all sales data from the past year).<\/li>\n\n\n\n<li><strong>Real-time Stream Processing:<\/strong> Analyzing data as it&#8217;s generated (e.g., monitoring a live feed of social media posts).<\/li>\n\n\n\n<li><strong>Machine Learning:<\/strong> Training and running machine learning models on vast datasets.<\/li>\n\n\n\n<li><strong>SQL Queries:<\/strong> Performing structured data analysis using a familiar language (SQL).<\/li>\n\n\n\n<li><strong>Graph Processing:<\/strong> Analyzing network-like data, such as social connections.<\/li>\n<\/ul>\n\n\n\n<p>Spark&#8217;s power lies in its ability to do all of this on a single, unified platform, eliminating the need for separate tools for different tasks.
It can run on a variety of cluster managers like Hadoop YARN, Apache Mesos, or Kubernetes, and can read data from a multitude of sources, including local files, Amazon S3, and HDFS.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Example<\/h2>\n\n\n\n<pre class=\"wp-block-code has-black-color has-vivid-green-cyan-background-color has-text-color has-background has-link-color has-medium-font-size wp-elements-672694f738b4e46d01ab6be7e60de99c\"><code>import requests\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, explode, split, lower, regexp_replace, length\n\n# 1. Fetch the content from a URL using a separate library\nurl = \"https:\/\/cloud.google.com\/learn\/what-is-artificial-intelligence\"\ntry:\n    response = requests.get(url, timeout=10)\n    response.raise_for_status()  # This will raise an HTTPError if the response was an error\n    url_content = response.text\nexcept requests.exceptions.RequestException as e:\n    print(f\"Error fetching URL: {e}\")\n    url_content = \"\"\n\n# 2. Create a SparkSession\nspark = SparkSession.builder.appName(\"CountWordsFromURL\").getOrCreate()\n\n# 3. Create a DataFrame with the fetched content\ndata = &#91;(url_content,)]\ndf = spark.createDataFrame(data, &#91;\"text_content\"])\n\n# 4. Use Spark to process the text\nwords = df.select(explode(split(df.text_content, \"\\\\s+\")).alias(\"word\"))\ncleaned_words = words.withColumn(\"word\", col(\"word\").cast(\"string\")) \\\n    .withColumn(\"word\", regexp_replace(col(\"word\"), \"&#91;^a-zA-Z]\", \"\")) \\\n    .filter(length(col(\"word\")) > 0) \\\n    .withColumn(\"word\", lower(col(\"word\")))\n\n# 5. Count the occurrences of \"ai\"\nai_count = cleaned_words.filter(col(\"word\") == \"ai\").count()\n\n# 6. Print the result\nprint(f\"The word 'AI' appears {ai_count} times on the page.\")\n\n# 7. 
Stop the SparkSession\nspark.stop()<\/code><\/pre>\n\n\n\n<p>Console output<\/p>\n\n\n\n<pre class=\"wp-block-code has-white-color has-black-background-color has-text-color has-background has-link-color has-small-font-size wp-elements-c543ce1ee37b384eda01523ad73a0a0a\"><code>25\/09\/14 12:38:18 WARN Utils: Your hostname, Samarthya resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)\n25\/09\/14 12:38:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\nSetting default log level to \"WARN\".\nTo adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n25\/09\/14 12:38:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n25\/09\/14 12:38:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n25\/09\/14 12:38:22 WARN TaskSetManager: Stage 0 contains a task of very large size (1946 KiB). The maximum recommended task size is 1000 KiB.\nThe word 'AI' appears 228 times on the page.<\/code><\/pre>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As part of my academic curriculum, we recently started re-exploring Apache Spark. It&#8217;s one of the technologies I had long lost touch with. Ironically, in this age of big data and speed, Spark was left far behind in my past projects at Pitney Bowes.
History Apache Spark was [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2883,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"image","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[34],"tags":[343],"class_list":["post-2881","post","type-post","status-publish","format-image","has-post-thumbnail","hentry","category-technical","tag-spark","post_format-post-format-image"],"_links":{"self":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2881","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/comments?post=2881"}],"version-history":[{"count":1,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2881\/revisions"}],"predecessor-version":[{"id":2884,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/2881\/revisions\/2884"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media\/2883"}],"wp:attachment":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media?parent=2881"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/categories?post=2881"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/tags?post=2881"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}