Spark Introduction
Spark is an open-source, lightning-fast unified analytics engine for large-scale data processing. Its foundation is the resilient distributed dataset (RDD), a read-only collection of records partitioned across a cluster of machines. RDDs are fault tolerant: each one records the transformations used to build it, so if one or more machines fail during processing, Spark can recompute the lost partitions instead of failing the whole job. On a cluster, Spark needs a cluster manager and a distributed storage system; it can also be run on a single machine. The main components of Spark are Core, SQL, Streaming, Machine Learning (MLlib), and GraphX.
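To make the fault-tolerance idea concrete, here is a minimal single-process sketch in plain Python. It is not Spark's API: the ToyRDD class and its method names are hypothetical, invented only to illustrate how a read-only dataset that remembers its lineage can rebuild a lost partition.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """A toy stand-in for an RDD (hypothetical, not Spark's real API)."""
    source: Optional[List[List[int]]] = None   # partitioned input, set only on a root dataset
    parent: Optional["ToyRDD"] = None          # lineage: the dataset this one derives from
    fn: Optional[Callable[[int], int]] = None  # per-element transformation applied to the parent

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Transformations never modify data; they return a new dataset that remembers its parent.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self) -> int:
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute_partition(self, index: int) -> List[int]:
        # Walk the lineage back to the source and replay each transformation.
        if self.parent is None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.compute_partition(index)]


base = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
derived = base.map(lambda x: x * x).map(lambda x: x + 1)

# Pretend each partition was computed and held by a different worker.
materialized = {i: derived.compute_partition(i) for i in range(derived.num_partitions())}

# Simulate losing the worker that held partition 1, then rebuild only that partition.
del materialized[1]
materialized[1] = derived.compute_partition(1)
```

Only the lost partition is recomputed, and the source data is never changed. Real Spark also handles scheduling, shuffles, and checkpointing, none of which this sketch attempts to model.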
Berkeley 2008 to 2013
Spark started as an academic project at the Algorithms, Machines and People Lab (AMPLab) at the University of California, Berkeley. The team was led by Matei Zaharia, who developed the initial codebase in 2009 during his PhD studies. It was made open source in 2010 under a Berkeley Software Distribution (BSD) license. While at Berkeley, Matei published many papers, and more papers are available from Stanford.
Spark was designed to address limitations in MapReduce cluster computing. MapReduce worked well on a distributed file system, but it slowed down on iterative algorithms that had to read the same data many times, because each pass went back to storage. Keeping the working set in memory as RDDs greatly reduced training times for machine learning algorithms; the original Spark paper reported roughly a 10x speedup over Hadoop MapReduce on iterative machine learning jobs. Spark was released on GitHub on October 3, 2010. Over the next few years the project grew and drew attention from many companies.
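The cost of re-reading data on every iteration can be sketched in a few lines of plain Python. This is a toy model, not Spark or MapReduce code: CountingStore and fit_mean are hypothetical names, and the "storage read" is just a counter standing in for an expensive pass over a distributed file system.

```python
class CountingStore:
    """Hypothetical stand-in for slow storage; counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def fit_mean(get_data, iterations, learning_rate=0.5):
    """Iteratively estimate the mean of the data with gradient descent."""
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()  # every iteration needs the full dataset
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate


records = [1.0, 2.0, 3.0, 4.0]

# MapReduce-style: go back to storage on every iteration.
disk_store = CountingStore(records)
estimate_reloading = fit_mean(disk_store.load, iterations=20)

# Spark-style: read once, keep the data in memory, and reuse it.
cached_store = CountingStore(records)
in_memory = cached_store.load()
estimate_cached = fit_mean(lambda: in_memory, iterations=20)
```

Both runs arrive at the same estimate, but the first touches storage 20 times and the second only once. In real Spark the comparable step is persisting an RDD in memory before iterating over it.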
During development, Spark was tested at different companies and on Amazon Web Services. I have tried to find information about the initial development and testing of Spark and have not found much. If any readers have information and would send it to me, I would greatly appreciate it. I am just trying to bridge the gap on how it went from a research project to a full company.
Research papers from 2010 through 2016 – https://spark.apache.org/research.html
Apache Software Foundation 2013 – 2014
The release of version 0.8.0 shows the changes made while the project was under incubation. On June 10, 2013, Spark was submitted to the Apache Incubator. After a short five months it was released as Apache Spark, version 0.9.1. Spark graduated from the incubator on February 15, 2014, and four days later became an Apache Top-Level Project.
http://incubator.apache.org/projects/spark.html
Databricks 2013
During the Apache incubation of Spark, seven people from Berkeley, a mix of professors, students, and a board member, founded Databricks. The main goal of the company is to help users run Spark on cloud-based clusters. Databricks developed a web-based notebook platform, similar to Jupyter Notebooks (IPython), to help with the management and development of cluster processing.
Microsoft Azure 2018
On November 15, 2017, during Microsoft Connect, Azure Databricks was announced. This collaboration between Microsoft and Databricks allows for deep integration with Azure services. According to the announcement, it marks the first time Apache Spark has been part of a partnership to optimize data analytics workloads from the ground up. More can be read in the Databricks blog announcement.
https://databricks.com/blog/2017/11/15/a-technical-overview-of-azure-databricks.html
Future
Spark’s ten-year history has been a quick climb to the top in analytics, and it is certainly only the beginning for what is a great tool for analyzing large datasets. Combining multiple tools into a single unified platform was a strong move. Increasing involvement with multiple cloud vendors will allow Spark to continue to grow in the enterprise.