Are you into cluster computing with Apache Spark? This year’s Spark + AI Summit (SAIS 2018) covered great data engineering and data science best practices for productionizing AI. In a nutshell, you should keep your training data fresh with stream processing, monitor quality, and test and serve models (at massive scale, since we are talking about Spark). The conference also offered some deep-dive sessions on integrating Spark with popular machine learning frameworks such as TensorFlow, scikit-learn, Keras, PyTorch, DeepLearning4j, BigDL and Deep Learning Pipelines.
Here is a list of several interesting topics (in case you couldn’t join ;-)):
Spark Experience and Use Cases
A great talk about using Spark for HEP (high energy physics) data processing and analysis as a complement to the grid computing currently used at CERN.
A brilliant Spark use case describing end-to-end analytics pipelines for genomic data.
Research and Spark Ecosystem
How do you train a machine learning model on streaming data? Sometimes we need to adapt the model in real time as the data streams evolve. Also, check out the open source StreamDM library.
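To make the idea of adapting a model to an evolving stream more concrete, here is a minimal sketch of online learning: a linear model that takes one stochastic gradient step per incoming record. This is plain Python written for illustration only. It is not the StreamDM API, and the class name `OnlineLinearModel`, its `partial_fit` method and the learning rate are all assumptions.

```python
class OnlineLinearModel:
    """Linear regression updated one record at a time (hypothetical example)."""

    def __init__(self, n_features, learning_rate=0.05):
        self.w = [0.0] * n_features   # feature weights
        self.b = 0.0                  # bias term
        self.lr = learning_rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, x, y):
        """Take a single gradient step on the squared error for one record."""
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
        return err


def synthetic_stream(n_records):
    """Simulated stream of (features, label) pairs following y = 2 * x."""
    for k in range(n_records):
        x = float(k % 3 + 1)
        yield [x], 2.0 * x


model = OnlineLinearModel(n_features=1)
for features, label in synthetic_stream(3000):
    model.partial_fit(features, label)  # the model is usable after every record
```

Because each update touches only the current record, the model never needs the full history in memory, and it keeps tracking the stream if the underlying relationship drifts over time.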
The massive amounts of HEP (high energy physics) data, such as those produced at the LHC (Large Hadron Collider) at CERN, will require new approaches to physics data processing and analysis. This talk explores ingesting HEP data directly with Spark, which makes it possible to process physics data stored in the specialized format used in HEP.
Have you heard of Hudi? Are you interested in the state of the art in managing petabytes of analytical data on distributed storage, while supporting fast ingestion and queries at Uber?
An interesting introduction to implicit Alternating Least Squares (ALS), which adds preference and confidence terms to the loss function. We also learned how to build a recommendation engine in the Apache Spark environment.
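For readers who have not seen the preference and confidence terms before, here is a small sketch of the loss in its commonly used form: a binary preference p_ui (did the user interact with the item at all?) weighted by a confidence c_ui = 1 + alpha * r_ui that grows with the interaction count. The talk's exact formulation is not reproduced here, so take the function name `implicit_als_loss` and the default values of `alpha` and `lam` as assumptions.

```python
def implicit_als_loss(R, X, Y, alpha=40.0, lam=0.1):
    """Weighted squared-error loss for implicit-feedback matrix factorization.

    R: interaction counts r_ui as a list of rows (users x items)
    X: user factor vectors, one list per user
    Y: item factor vectors, one list per item
    alpha: how fast confidence grows with the count (assumed default)
    lam: L2 regularization strength (assumed default)
    """
    loss = 0.0
    for u, row in enumerate(R):
        for i, r_ui in enumerate(row):
            p_ui = 1.0 if r_ui > 0 else 0.0   # preference: any interaction?
            c_ui = 1.0 + alpha * r_ui         # confidence in that preference
            pred = sum(xf * yf for xf, yf in zip(X[u], Y[i]))
            loss += c_ui * (p_ui - pred) ** 2
    # L2 penalty on all user and item factors
    reg = sum(v * v for vec in X for v in vec)
    reg += sum(v * v for vec in Y for v in vec)
    return loss + lam * reg


# Two users, two items, one latent factor each (toy data)
R = [[2, 0], [0, 1]]
X = [[1.0], [0.0]]
Y = [[1.0], [0.0]]
toy_loss = implicit_als_loss(R, X, Y, alpha=1.0, lam=0.5)
```

Alternating Least Squares minimizes this loss by fixing the item factors and solving for the user factors in closed form, then swapping roles and repeating. Note that unobserved pairs still contribute to the loss, only with the minimum confidence of 1.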
Deep Learning Techniques
Spark can be used to blend many different libraries together. See how to easily create complex queries over satellite images at scale by combining Deep Learning Pipelines and Magellan (a geospatial package) within Spark.
The title says it all: sharing experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. And yes, that’s right, one more use case from HEP!
Running Apache Spark in production is definitely not an easy thing to do, but it is almost essential for large-scale data processing and analytics.