Highlights from Spark + AI Summit 2018 (SAIS 2018)

Are you into cluster computing with Apache Spark? This year’s SAIS 2018 conference covered data engineering and data science best practices for productionizing AI. In a nutshell, you should keep your training data fresh with stream processing, monitor data quality, and test and serve models (at massive scale, since we are talking about Spark). The conference also offered deep-dive sessions on Spark integration with popular machine learning frameworks, such as the well-known TensorFlow, scikit-learn, Keras, PyTorch, Deeplearning4j, BigDL and Deep Learning Pipelines.

Here is a list of several interesting talks (in case you couldn’t join ;-)):

Spark Experience and Use Cases

CERN’s Next Generation Data Analysis Platform with Apache Spark

Great talk about using Spark for HEP (high energy physics) data processing and analysis as a complementary tool to the current grid computing at CERN.

Scaling Genomics on Apache Spark by 100x

Brilliant Spark use case describing end-to-end analytics pipelines for genomic data.

Research and Spark Ecosystem

Streaming Random Forest Learning in Spark and StreamDM

How do you train a machine learning model on streaming data? Sometimes the model needs to adapt in real time to evolving data streams. Also, check out the open-source StreamDM library.

HEP Data Processing with Apache Spark

The massive amounts of HEP (high energy physics) data, such as those produced at the LHC (Large Hadron Collider) at CERN, will require new approaches to physics data processing and analysis. This talk explores the possibility of ingesting HEP data directly with Spark, which allows processing physics data stored in the specialized format used in HEP.

Hudi: Large-Scale, Near Real-Time Pipelines at Uber

Have you heard of Hudi? Are you interested in the state of the art in managing petabytes of analytical data on distributed storage, while supporting fast ingestion and queries at Uber?

Data Science

Building an Implicit Recommendation Engine with Spark

Interesting introduction to implicit Alternating Least Squares (ALS), which adds preference and confidence terms to the loss function. We also learned how to build a recommendation engine in an Apache Spark environment.
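To make the preference and confidence terms a bit more concrete, here is a minimal plain-Python sketch of the implicit-feedback loss. It is not code from the talk: the function names, the toy data, and the defaults `alpha=40.0` and `lam=0.1` are assumptions chosen for illustration, and the formulation follows the commonly used one where preference is 1 for any observed interaction and confidence grows linearly with the interaction count.

```python
# Sketch of the implicit ALS loss (illustrative names and defaults, not from the talk).

def preference(r):
    """Binary preference: 1 if the user interacted with the item at all."""
    return 1.0 if r > 0 else 0.0


def confidence(r, alpha=40.0):
    """Confidence grows linearly with the raw interaction count r."""
    return 1.0 + alpha * r


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def implicit_loss(R, X, Y, alpha=40.0, lam=0.1):
    """Confidence-weighted squared error on preferences plus L2 regularization.

    R: interaction counts, R[u][i] for user u and item i
    X: user factor vectors, Y: item factor vectors
    """
    fit = 0.0
    for u, row in enumerate(R):
        for i, r in enumerate(row):
            err = preference(r) - dot(X[u], Y[i])
            fit += confidence(r, alpha) * err * err
    reg = lam * (sum(dot(x, x) for x in X) + sum(dot(y, y) for y in Y))
    return fit + reg


# Toy data: 2 users x 2 items, e.g. play counts.
R = [[3, 0], [0, 1]]
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 0.0], [0.0, 1.0]]
loss = implicit_loss(R, X, Y)
```

Note that every user-item pair contributes to the loss, including the zeros: an unobserved pair is treated as a low-confidence "no preference" rather than as missing data. In Spark itself you would not write this by hand; `pyspark.ml.recommendation.ALS` exposes the same idea through its `implicitPrefs` and `alpha` parameters.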

Deep Learning Techniques

Geospatial Analytics at Scale with Deep Learning and Apache Spark

Spark can be used to blend many different libraries together. See how to easily create complex queries over satellite images at scale by combining Deep Learning Pipelines and Magellan (a geospatial package) within Spark.

Developers’ Sessions

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Love to Scale

The title says it all: experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. And yes, that’s right, one more use case from HEP!

Running Apache Spark in production is definitely not an easy thing to do, but it is almost essential for large-scale data processing and analytics.

