Apache Spark is a fast in-memory data processing engine for big data analytics. It combines SQL querying, stream processing, machine learning, and graph processing in a single engine, and it supports multiple programming languages, including Java, Scala, R, and Python.
The core of Apache Spark is the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of elements partitioned across worker nodes and processed in parallel. RDDs hide the details of partitioning from end users and are fault-tolerant: each RDD tracks the lineage of operations used to build it, so a lost partition can be recomputed from its lineage rather than replicated. Spark uses this architecture to process large data sets in a highly scalable and efficient manner.
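A minimal sketch of the RDD abstraction in PySpark may make this concrete; the master URL, application name, and partition count here are illustrative, not prescriptive:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-demo")

# Parallelize a collection into an RDD split across 4 partitions.
numbers = sc.parallelize(range(1_000_000), numSlices=4)

# Transformations only record lineage; nothing executes yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString shows the lineage Spark would replay to
# recompute a lost partition.
print(evens.toDebugString().decode())

# An action triggers the actual distributed computation.
print(evens.count())

sc.stop()
```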
In a driver program, developers declare what they want to do with their data as a series of transformations and actions. Spark translates this program into a directed acyclic graph (DAG) of stages and schedules the resulting tasks across the cluster's workers. Because transformations are evaluated lazily, Spark can automatically optimize the execution plan before any work runs; results are computed only when an action requests them, and are then returned to the driver program for viewing and further manipulation.
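The DataFrame API makes this declarative style and the plan optimization easy to see. In this sketch, the application name, columns, and rows are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 41), ("carol", 29)],
    ["name", "age"],
)

# Declarative query: we describe *what* we want...
result = df.filter(F.col("age") > 30).select("name")

# ...and Spark's Catalyst optimizer decides *how* to run it.
result.explain()  # prints the optimized physical plan

# Only an action such as collect() triggers execution on the workers.
print(result.collect())

spark.stop()
```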
The Spark MLlib library offers out-of-the-box machine learning functionality, including classification and regression, collaborative filtering, distributed linear algebra, decision trees, gradient-boosted trees, frequent pattern mining, evaluation metrics and statistics, and more. When combined with the other capabilities of Spark, this library makes Spark an indispensable big data tool. Despite its powerful abilities, some limitations exist. One major limitation is streaming latency: Spark processes streams in micro-batches, which delivers near-real-time rather than true record-at-a-time processing, so for latency-critical stream processing you should turn to frameworks like Apache Flink.
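As a taste of MLlib, the sketch below trains a classifier with the DataFrame-based pyspark.ml API; the application name, feature columns, and toy training rows are all invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.0, 0.1), (1.0, 3.0, 2.5), (0.0, 0.5, 0.3), (1.0, 2.8, 2.0)],
    ["label", "f1", "f2"],
)

# MLlib estimators expect features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train_vec = assembler.transform(train)

# Fit a logistic regression classifier on the assembled features.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(train_vec)

# Apply the model and inspect predictions next to the true labels.
model.transform(train_vec).select("label", "prediction").show()

spark.stop()
```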