Spark AI Summit just concluded this week and, as always, there were plenty of great announcements. (Note: I was one of the speakers at the event, but this post is more about the announcements and the areas of Spark I'm personally interested in. The whole art of virtual public speaking is another topic.) The ML enhancements and their impact are a bigger topic, probably for another day, as I catch up with all the relevant conference talks and try out the new features.
Firstly, I think the online format worked for this instance. This summit (and I've been to it every year since its inception) was way more relaxing and didn't leave me exhausted physically and mentally with information overload. Usually held at the Moscone in San Francisco, the event is a great opportunity to network with former colleagues, friends and industry experts, which is the most enjoyable part yet taxing in many ways given the limited time. The virtual interface was way better than most of the online events I've been to before – engaging and convenient. The biggest drawback was the networking aspect; the online networking options just don't cut it. The video conferencing fatigue probably didn't hit since it was 3 days, and the videos were available instantly online, so plenty of them are in my "Watch Later" list. (Note: the talks I refer to below are only the few I watched, so there are plenty more interesting ones out there.)
The big announcement was the release of Spark 3.0 – hard to believe, but it's been 10 years of evolution. I remember 2013 as the year I was adapting to the Hadoop ecosystem, writing map-reduce using Java/Pig/Hive for large-scale data pipelines, when Spark started emerging as a fledgling competitor with an interesting distributed computational engine built on Resilient Distributed Datasets (RDDs). Fast-forward to 2020 and Spark is the foundation of large-scale data implementations across the industry, and its ecosystem has evolved to include frameworks and engines like Delta and MLflow, which are also becoming foundational to the enterprise across cloud providers. More importantly, smart investment in its DataFrames API and SQL access patterns has lowered the barrier to entry.
There were tons of new features introduced, but I'll focus on the ones I paid attention to. There has not been a major release of Spark for years, so this is pretty significant (2.0 was in 2016).
Spark 3.0
- Adaptive Query Execution: At the core, this helps change the number of reducers at runtime. It divides the SQL execution plan into stages earlier instead of the usual RDD graph, and later stages can inject optimizations before the queries execute, since they have a global picture of the entire query plan and all of its shuffle dependencies. Execution plans can be auto-optimized at runtime, for example changing a SortMergeJoin to a BroadcastJoin where applicable. This is huge in large-scale implementations, where I see tons of poorly formed queries eating a lot of compute thanks to skewed joins. More specifically, the number of shuffle partitions set via spark.sql.shuffle.partitions, which has defaulted to 200 since inception, can now be tuned automatically based on the map-stage output – higher for larger data and lower for smaller data (a configuration sketch follows at the end of this feature list).
- Dynamic partition pruning: Enables filter pushdowns instead of full table scans by adding a partition pruning filter. Consider a broadcast hash join between a fact and a dimension table: the enhancement intercepts the results of the broadcast and plugs them in as a dynamic filter on the fact table, as opposed to the earlier approach of pushing the broadcast hash table derived from the dimension table out to every worker to evaluate the join against the fact. This is huge for avoiding scans of irrelevant data. This session explains it well, and a small query sketch follows at the end of this feature list.
- Accelerator-aware scheduler: Traditionally, the bottlenecks have been small data partitions that GPUs handle poorly, cache-processing inefficiencies, slow disk I/O, UDFs that need CPU processing, and more. But GPUs are massively useful for high-cardinality datasets, matrix operations, window operations and transcoding. Originally termed Project Hydrogen, this feature makes Spark GPU-aware. Cluster managers now have GPU support that schedulers can request from, and the scheduler understands GPU allocations to executors and assigns GPUs to tasks appropriately. The GPU resources still need to be set up through configuration, and resources can be requested at the executor, driver and task level. This also allows the resources on the nodes, and their assignments, to be discovered. It is supported in YARN, Kubernetes and standalone modes (see the configuration sketch after this feature list).
- Pandas UDF overhaul: Extensive use of Python type annotations – this becomes more and more imperative as codebases scale, so that newer engineers can understand and maintain the code effectively instead of writing hundreds of test cases or, worse, finding out about problems from irate users. Great documentation and examples here; a small typed-UDF example follows this feature list.
- PySpark UDF: Another feature I've looked forward to is enabling PySpark to handle Pandas vectorized UDFs as an array. In the past, we needed to jump through god-awful hoops like writing Scala helper functions and then switching over to Python in order to read these as arrays. ML engineers will welcome this.
- Structured Streaming UI: Great to see more focus on the UI, with additional features appearing in the workspace interface, which frankly had gotten pretty stale over the last few years. The new tab shows more statistics for running and completed queries and, more importantly, will help developers debug exceptions quickly rather than poring through log files.
- Proleptic Gregorian calendar: Spark switched to this from the previous hybrid (Julian + Gregorian) calendar. It uses the Java 8 classes from the java.time packages, which are based on ISO chronology. The "proleptic" part comes from extending the Gregorian calendar backward to dates before 1582, when it was officially introduced.
Fascinating segue here –
The Gregorian calendar (named after Pope Gregory XIII, not the guy who gave us the awesome Gregorian chants – that was Gregory I) is what we use today as part of ISO 8601:2004. It replaced the Julian calendar because of the latter's inaccuracies in determining the actual length of a year, plus its inability to handle the complexities of adding a leap year roughly every 4 years. Catholics liked this and adopted it, while Protestants held out with suspicion for 200 years (!) before England and the colonies switched over, advancing the date from September 2 to September 14, 1752! Would you hand over 11 days of your life as a write-off? In any case, you can thank Gregory XIII for playing a part in this enhancement.
- Also a whole lot of talk on better ANSI SQL compatibility that I need to look at more closely. For someone working with a large base of SQL users, this can only be good news.
- A few smaller super useful enhancements:
- “Show Views” command
- “Explain” output formatted for better usability instead of a jungle of text
- Better documentation for SQL
- Enhancements on MLlib, GraphX
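Circling back to a few of the items above with some sketches. First, the Adaptive Query Execution settings: a minimal PySpark snippet using the Spark 3.0 property names (the data sizes and thresholds you would actually tune are workload-specific):

```python
from pyspark.sql import SparkSession

# Minimal sketch: turning on Adaptive Query Execution in Spark 3.0.
# With AQE enabled, the post-shuffle partition count is tuned at runtime,
# so hand-tuning spark.sql.shuffle.partitions matters much less.
spark = (
    SparkSession.builder
    .appName("aqe-sketch")
    .config("spark.sql.adaptive.enabled", "true")                     # master switch for AQE
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
    .getOrCreate()
)
```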
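Next, a hypothetical star-schema query to show where dynamic partition pruning kicks in, reusing the session from the sketch above; the table and column names are made up purely for illustration:

```python
# Dynamic partition pruning is on by default in Spark 3.0; shown here for clarity.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

pruned = spark.sql("""
    SELECT f.order_id, f.amount
    FROM   sales_fact f                      -- large table, partitioned by date_key
    JOIN   date_dim  d ON f.date_key = d.date_key
    WHERE  d.year = 2020                     -- selective filter on the dimension side
""")

# The filter derived from the broadcast dimension side is applied to the fact scan,
# visible as a dynamic pruning expression in the physical plan.
pruned.explain()
```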
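For the accelerator-aware scheduler, a sketch of the resource configs involved, assuming a cluster that actually exposes GPUs; the discovery script path below is a placeholder (Spark ships a sample script for this purpose):

```python
from pyspark.sql import SparkSession

# Sketch of accelerator-aware scheduling configuration in Spark 3.0.
spark_gpu = (
    SparkSession.builder
    .appName("gpu-sketch")
    .config("spark.executor.resource.gpu.amount", "1")   # GPUs requested per executor
    .config("spark.task.resource.gpu.amount", "1")       # GPUs required per task
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpusResources.sh")
    .getOrCreate()
)

# Inside a running task, the assigned GPU addresses can be read off the TaskContext:
#   from pyspark import TaskContext
#   TaskContext.get().resources()["gpu"].addresses
```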
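And the new type-annotated Pandas UDF style, as a minimal example (the column and function names are made up):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 3.0 style: the UDF's behaviour is inferred from Python type hints
# rather than an explicit PandasUDFType argument.
@pandas_udf("double")
def cents_to_dollars(cents: pd.Series) -> pd.Series:
    return cents / 100.0

# Hypothetical usage: df.withColumn("dollars", cents_to_dollars("cents"))
```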
Useful talks:
- Deep Dive into the New Features of Apache Spark 3.0
- How Adobe Does 2 Million Records Per Second Using Apache Spark!
- Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large Datasets with Apache Spark
Delta Engine/Photon/Koalas
As a big Delta proponent, I found this important, especially as adoption grows and large-scale implementations need continuous improvements in the product to justify rising cloud storage costs at scale.
The Delta Engine now has an improved query optimizer and a native vectorized execution engine written in C++. This builds on the optimized reads and writes of today's NVMe SSDs, which eclipse the SATA SSDs of previous generations and offer faster seek times. Squeezing these efficiencies out of the CPU at the bare-metal level is significant, especially as data teams deal with more and more unstructured, high-velocity data. The C++ implementation helps exploit data-level and instruction-level parallelism, as explained in detail in the keynote by Reynold Xin, with some interesting benchmarks on regex-heavy string processing to demonstrate the speedups. Looking forward to more details on how the optimization works under the hood, along with implementation guidelines.
Koalas 1.0 now implements 80% of the Pandas API. We can invoke accessors to use the PySpark APIs from Koalas. Better type hinting and a ton of enhancements on DataFrames, Series and Indexes, with support for Python 3.8, make this another value proposition on Spark.
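A minimal sketch of the accessor pattern, assuming the databricks.koalas package is installed; the data here is throwaway:

```python
import databricks.koalas as ks

# Build a small Koalas DataFrame, then drop down to PySpark via the accessor.
kdf = ks.DataFrame({"city": ["Austin", "Boston"], "sales": [10, 20]})

sdf = kdf.to_spark()        # plain conversion to a Spark DataFrame
sdf2 = kdf.spark.frame()    # the .spark accessor exposes the same underlying frame
sdf.printSchema()           # from here on it's regular PySpark
```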
The focus on Lakehouse in ancillary meetings was encouraging and augurs well for data democratization on a single stack versus fragmenting data across data warehouses and data lakes. The Redash acquisition will give large-scale enterprises another option for easy-to-use dashboarding and visualization on these curated data lakes. Hope to see more public announcements on that topic.
Useful talks:
- Building Data Quality Audit Framework using Delta Lake at Cerner
- A Thorough Comparison of Delta Lake, Iceberg and Hudi
- Automatic Forecasting using Prophet, Databricks, Delta Lake and MLflow
- Machine Learning Data Lineage with MLflow and Delta Lake
- Patterns and Operational Insights from the First Users of Delta Lake
MLflow
More announcements around the MLflow model-serving side, with the Model Registry (announced in April) that lets data scientists track a model's lifecycle across stages such as Staging, Production, or Archived. With MLflow now in the Linux Foundation, a vendor-independent non-profit manages the project, which helps evangelize it to a larger audience.
- Autologging: Enables automatic logging of Spark datasource information at read time, without the need for explicit log statements. mlflow.spark.autolog() turns on autologging for Spark datasources, picking up the relevant paths and versions when using Delta Lake, so the managed Databricks implementation definitely looks slicker with the UI. Implementation is as easy as attaching the mlflow-spark JAR and then calling mlflow.spark.autolog() (see the sketch after this list). More significantly, this enables the cloning of models.
- On Azure – the updated mlflow.azureml.deploy API for deploying MLflow models to AzureML. This now uses the up-to-date Model.package() and Model.deploy() APIs (a deployment sketch follows this list).
- Model schemas for input and output, plus custom metadata tags for tracking – more metadata to track, which is great (see the signature sketch after this list).
- Model Serving: Ability to deploy models via a REST endpoint on hosted MLflow, which is great. Would have loved to see more turnkey methods to deploy to an agnostic endpoint, say a managed Kubernetes service – the current implementation is for Databricks clusters from what I noticed.
- Lots of cool UI fixes, including highlighting differing parameter values when comparing runs and UI plots that scale to thousands of points.
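A sketch of what the autologging hookup looks like, assuming an existing SparkSession named spark, the mlflow-spark JAR attached to the cluster, and a tracking server (or Databricks workspace) configured; the path is a placeholder:

```python
import mlflow
import mlflow.spark

# Enable Spark datasource autologging once per session.
mlflow.spark.autolog()

with mlflow.start_run():
    # Reading a datasource here records path, format (and version for Delta sources)
    # on the active run, without any explicit logging statements.
    df = spark.read.format("delta").load("/mnt/raw/events")
    df.count()
```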
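For the Azure deployment path, a hedged sketch of mlflow.azureml.deploy; the workspace config, run id and service name are placeholders, and the optional deployment_config is omitted:

```python
from azureml.core import Workspace
import mlflow.azureml

# Assumes an AzureML workspace config file is present and a model was logged
# to MLflow in a previous run; <run_id> and the service name are placeholders.
workspace = Workspace.from_config()

webservice, azure_model = mlflow.azureml.deploy(
    model_uri="runs:/<run_id>/model",
    workspace=workspace,
    service_name="churn-scorer",
)
print(webservice.scoring_uri)
```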
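And a sketch of logging a model with an input/output schema and custom tags, using a throwaway scikit-learn model purely to have something to sign; the tag names are made up:

```python
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a toy model just to demonstrate signatures and tags.
X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=500).fit(X, y)

# Infer the input/output schema from sample data and predictions.
signature = infer_signature(X, model.predict(X))

with mlflow.start_run():
    mlflow.set_tags({"team": "growth", "use_case": "demo"})   # custom metadata tags
    mlflow.sklearn.log_model(model, "model", signature=signature)
```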
Useful talks:
- Advertising Fraud Detection at Scale at T-Mobile
- Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on Quick-Insight Analytics and Demand Modelling
- The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Production
- Productionizing Machine Learning Pipelines with Databricks and Azure ML
- Children Safety Retrieval (CENSER) System for Retrieval of Kidnapped Children from Brothels in India – a thought-provoking talk on a real application of image recognition for societal change
Looking forward to trying out the new MLflow features, which go into public preview later in July.