Apache Spark is an open-source unified analytics engine designed for large-scale data processing, providing implicit data parallelism and fault-tolerant computing. The Spark codebase was originally developed at the University of California, Berkeley’s AMPLab and later donated to the Apache Software Foundation. Spark organizes each job’s work as a directed acyclic graph (DAG), with nodes representing RDDs and edges representing the operations that derive one RDD from another.
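As a rough sketch of that lineage graph (assuming a local Spark installation; the input path and word-count logic are placeholders), the chain of transformations below only builds the DAG, and nothing runs until an action is called:

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[*]")                       // run locally for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Each transformation defines a new RDD node; the edges of the DAG are
    // the operations that derive one RDD from another. Nothing executes yet.
    val lines  = sc.textFile("data/input.txt")  // placeholder path
    val words  = lines.flatMap(_.split("\\s+"))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // toDebugString prints the recorded lineage without running the job.
    println(counts.toDebugString)

    // take is an action: only now does Spark hand the DAG to the scheduler.
    counts.take(5).foreach(println)

    spark.stop()
  }
}
```

Until the action at the end, Spark has only recorded how each RDD is derived, which is also what allows lost partitions to be recomputed later.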
The underlying hardware is a cluster of machines contributing CPU cores and memory, and each worker node runs its own Spark processes. The Apache Spark ecosystem consists of a number of key components that integrate with one another. Spark Core is the main component of the project: it contains the framework’s core data structures and programming abstractions and provides task scheduling, fault tolerance, and in-memory computation. The rest of the Spark ecosystem is built on top of it.
Spark Core provides the execution engine for the Spark platform. It also performs basic I/O functions, scheduling, and monitoring. In addition, Spark Core provides effective memory management and can reference external storage systems such as HDFS or cloud object stores. At runtime, Spark Core executes queries and performs the other tasks a job requires. Apache Spark can be used for high-performance analytics because keeping working data in memory reduces the number of disk operations. It can also process streaming data and run graph algorithms.
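A minimal sketch of that in-memory behavior, assuming a hypothetical JSON dataset in HDFS with a status column (both the path and the column name are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .getOrCreate()

    // Spark Core can reference external storage systems; the HDFS path
    // below is a placeholder.
    val events = spark.read.json("hdfs:///data/events")

    // Keep the dataset in executor memory so repeated queries avoid
    // re-reading it from disk.
    events.persist(StorageLevel.MEMORY_ONLY)

    // Two queries over the same data: the first materializes the cache,
    // the second is served from memory instead of external storage.
    println(events.count())
    println(events.filter("status = 'error'").count())

    spark.stop()
  }
}
```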
Apache Spark offers developers rich APIs in Scala, Java, Python, and R. Its in-memory processing model boosts speed. Spark code can be reused across batch processing, joining streams with historical data, and running ad-hoc queries on stream state. Apache Spark also provides fault tolerance: the RDD, its core abstraction, is designed to survive worker-node failure by recomputing lost partitions from lineage, which reduces the risk of data loss.
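As an illustration of that reuse, the sketch below applies one transformation function to both a historical batch dataset and a live stream; the paths, schema, and column names (level, service) are assumptions for the example, not anything prescribed by Spark:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ReuseSketch {
  // One piece of business logic, written once.
  def errorCountsByService(logs: DataFrame): DataFrame =
    logs.filter(col("level") === "ERROR")
      .groupBy(col("service"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reuse-sketch")
      .getOrCreate()

    // Batch: historical logs read from external storage (placeholder path).
    val history = spark.read.json("hdfs:///logs/history")
    errorCountsByService(history).show()

    // Streaming: the identical function applied to newly arriving files.
    val live = spark.readStream
      .schema(history.schema)                    // reuse the batch schema
      .json("hdfs:///logs/incoming")

    val query = errorCountsByService(live)
      .writeStream
      .outputMode("complete")                    // full aggregate each trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```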
Unlike traditional batch-only systems, Apache Spark can process data on a cluster in near real time as well as in batches. It uses a distributed, scale-out computing environment that can grow to thousands of nodes. And while it can be run on-premises, it is also available in the cloud. If you have an Apache Spark cluster, you can take advantage of its performance and scalability for big data processing; downloading the latest version and testing it against your own workload is the quickest way to find out whether it is the right choice.
Developed at UC Berkeley and maintained by the Apache Software Foundation, Spark boasts the largest open source community in big data. With over 1,000 contributors, it is now included as a core component in several commercial big data offerings. Spark’s architecture is hierarchical: a driver program coordinates each application through a cluster manager, and worker nodes execute tasks and return their results to the driver. That combination of community and architecture is the basis for Apache Spark’s popularity.
A new feature of Apache Spark 3.0 is GPU scheduling. With GPU scheduling, Spark can launch executors with a specified number of GPUs, and you can configure a discovery script that detects which GPUs a node has. This makes it easier to run ML applications that require GPUs. For example, you can run production ETL jobs on the cluster and have them hand their output to a distributed deep-learning training job built on a framework such as Horovod.
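A hedged sketch of what that might look like, assuming Spark 3.0+ on a GPU-equipped cluster; the resource amounts and the discovery-script path are placeholders, and these settings are normally passed at submit time rather than hard-coded:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object GpuSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gpu-scheduling-sketch")
      .config("spark.executor.resource.gpu.amount", "1")    // GPUs per executor
      .config("spark.task.resource.gpu.amount", "1")        // GPUs per task
      .config("spark.executor.resource.gpu.discoveryScript",
        "/opt/spark/scripts/getGpus.sh")                    // placeholder path
      .getOrCreate()

    // Inside a task, the GPU addresses Spark assigned can be read from the
    // TaskContext and handed to a GPU-aware library (this only succeeds on
    // executors that actually received GPU resources).
    val assignedGpus = spark.sparkContext
      .parallelize(1 to 4, numSlices = 4)
      .map(_ => TaskContext.get().resources()("gpu").addresses.mkString(","))
      .collect()

    assignedGpus.foreach(println)
    spark.stop()
  }
}
```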