The Apache Software Foundation has just announced Apache Spark 2.3, the fourth release in the 2.x line. It is not just the point release you might expect, with a few minor upgrades and improvements; instead, it ships with over 1,000 changes from the last version, including three major new features: continuous stream processing, vectorized Python UDFs and support for Kubernetes. Apache Spark 2.3 also forms the underpinning for version 4.0 of the Databricks Runtime.
Apache Spark is a unified analytics engine used by data scientists and data engineers for big data processing. Internet powerhouses like Netflix, eBay and Yahoo have all deployed Apache Spark at huge scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes.
One of Spark’s long-standing challenges has been Spark Streaming. The micro-batching approach it takes to processing data has meant it could not guarantee low-latency responses. Customers who demanded sub-second latencies had to rely on other frameworks, such as Apache Storm or Apache Flink, to get that guarantee. Apache Spark 2.3 adds an alternative to the micro-batch architecture: a new, experimental continuous processing execution mode for Structured Streaming that can run queries with millisecond-level latencies. Micro-batching remains the default; the continuous mode is opted into per query.
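From the API side, switching to the new mode is a small change: the query is written with the usual Structured Streaming API, and a continuous trigger is passed when the query starts. The sketch below assumes a local PySpark 2.3+ environment and uses the built-in "rate" test source; the application name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-sketch").getOrCreate()

# The built-in "rate" source generates rows continuously, which is
# handy for trying out streaming features without external systems.
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Passing `continuous=` to trigger() selects the new execution mode.
# "1 second" is the checkpoint interval, not the processing latency.
query = (stream.writeStream
         .format("console")
         .trigger(continuous="1 second")
         .start())
```

Note that in Spark 2.3 only map-like operations (projections and selections) are supported in continuous mode; queries with aggregations still run on the micro-batch engine.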
The new support for vectorized user-defined functions (UDFs) in Python, built on Apache Arrow, will also be an exciting development for data scientists who use Python and Python-based libraries to build data processing pipelines or machine learning models that execute in Spark. Instead of invoking a Python function once per row, a vectorized UDF receives a batch of rows at a time, which massively increases the speed at which custom Python-based functions run in Spark; users can delegate work to Python-based tools such as pandas and NumPy for processing, then bring the data back into Spark in a parallel manner.
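The core idea can be sketched with plain pandas: the function below operates on a whole pandas Series at once rather than one value at a time, which is exactly the shape of function that Spark 2.3's `pandas_udf` wraps. The function name and column name here are illustrative, not from the Spark API.

```python
import pandas as pd

# A vectorized UDF body: it receives a whole pandas Series (a batch of
# rows) and returns a Series of the same length, so pandas/NumPy can
# process the batch without per-row Python overhead.
def plus_one(batch: pd.Series) -> pd.Series:
    return batch + 1

# In Spark 2.3 the same function would be registered and applied like:
#   from pyspark.sql.functions import pandas_udf
#   spark_plus_one = pandas_udf(plus_one, returnType="long")
#   df.select(spark_plus_one(df["v"]))

result = plus_one(pd.Series([1, 2, 3]))
print(result.tolist())  # [2, 3, 4]
```

Because the data crosses the JVM/Python boundary in Arrow-encoded batches, the serialization cost per row drops sharply compared with the older row-at-a-time UDFs.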
The final major new feature in Spark 2.3 is support for Kubernetes, the open source Linux container orchestration system. Kubernetes automates much of the manual work previously needed to deploy and scale containerized applications. The last few years have brought what InfoWorld describes as “the astonishing rise of Kubernetes” and mass adoption by the industry. Also known as “kube”, the platform was originally developed by Google and is now maintained by the Cloud Native Computing Foundation.
Spark customers can now use Kubernetes as the resource manager for their Spark deployments. The new Kubernetes scheduler supports native submission of Spark jobs to a cluster that Kubernetes manages. “We’re seeing a lot more Kubernetes coming up now,” Databricks Chief Architect Reynold Xin told Datanami. “It used to be the case that people used primarily YARN and Mesos, and now we see a lot more Kubernetes.”
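In practice, native submission means pointing the familiar `spark-submit` command at the Kubernetes API server with a `k8s://` master URL; Spark then launches the driver and executors as pods. A hedged sketch, where the API-server address, image name and executor count are placeholders:

```shell
bin/spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=my-registry/spark:2.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

The `local://` scheme refers to a path inside the container image rather than on the submitting machine, which is why Spark 2.3 ships tooling to build Spark container images.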
InfoWorld said that Kubernetes support in Apache Spark is “likely to be game-changing as it moves out of experimental status”, and points out that enterprises have largely had to run Apache Spark on YARN, which has meant running an Apache Hadoop stack, and Hadoop stacks can be “temperamental beasts”. Most businesses will no longer need YARN, as they can move their Apache Spark deployments onto Kubernetes, either on-premises or in the cloud. If they are running their clusters in the cloud, they will likely be able to replace HDFS with a managed storage system such as Google Cloud Storage, Amazon S3 or Azure Data Lake.
Reynold Xin, however, advises customers to test their Spark-on-Kubernetes integration before deploying it, or possibly to wait a couple more releases until the integration is complete. The Spark 2.3 release notes state that “this [Kubernetes] support is currently experimental and behavioral changes around configurations, container images and entrypoints should be expected.”