Are you into cluster computing with Apache Spark? This year’s Spark + AI Summit (SAIS 2018) covered great data engineering and data science best practices for productionizing AI. In a nutshell: keep your training data fresh with stream processing, monitor quality, and test and serve models (at massive scale, when we are talking about Spark). The conference also offered deep-dive sessions on Spark integration with popular machine learning frameworks such as TensorFlow, scikit-learn, Keras, PyTorch, Deeplearning4j, BigDL, and Deep Learning Pipelines.
Here is a list of several interesting talks (in case you couldn’t join ;-):
Spark Experience and Use Cases
CERN’s Next Generation Data Analysis Platform with Apache Spark
A great talk about using Spark for HEP (high energy physics) data processing and analysis as a complementary tool to the current grid computing at CERN.
Scaling Genomics on Apache Spark by 100x
Brilliant Spark use case describing end-to-end analytics pipelines for genomic data.
Research and Spark Ecosystem
Streaming Random Forest Learning in Spark and StreamDM
How do you train a machine learning model on streaming data? Sometimes we need to adapt the model in real time as the data stream evolves. Also check out the open source StreamDM library; a minimal sketch of the idea follows below.
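StreamDM itself is a Scala library with its own incremental learners, so purely as an illustration of the same idea in PySpark, here is a minimal sketch using Spark’s built-in StreamingKMeans, whose cluster centers are updated with every micro-batch (the input path is a placeholder):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="AdaptiveStreamingModel")
ssc = StreamingContext(sc, batchDuration=10)

# Each line in the (hypothetical) directory is a whitespace-separated
# feature vector; newly arriving files become new micro-batches.
stream = ssc.textFileStream("hdfs:///data/stream").map(
    lambda line: Vectors.dense([float(x) for x in line.split()]))

# decayFactor < 1 down-weights older batches, so the clusters drift
# along with the evolving data, which is the point of streaming learning.
model = StreamingKMeans(k=3, decayFactor=0.9).setRandomCenters(3, 1.0, 42)
model.trainOn(stream)              # centers are updated on every batch
model.predictOn(stream).pprint()   # cluster assignments for the same stream

ssc.start()
ssc.awaitTermination()
```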
HEP Data Processing with Apache Spark
The massive amounts of HEP (high energy physics) data, such as from the LHC (Large Hadron Collider) at CERN, will require new approaches to physics data processing and analysis. This talk explores the possibility of ingesting HEP data with Spark directly, allowing physics data stored in HEP’s specialized formats to be processed natively.
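For a flavour of what direct ingestion can look like, here is a rough sketch assuming the DIANA-HEP spark-root package, which exposes ROOT files as a Spark data source; the file path and branch names are made up:

```python
from pyspark.sql import SparkSession

# Assumes the spark-root package is on the classpath, e.g.
#   spark-submit --packages org.diana-hep:spark-root_2.11:<version> ...
spark = SparkSession.builder.appName("HEPIngest").getOrCreate()

# Read a ROOT file straight into a DataFrame; path and schema are hypothetical.
events = spark.read.format("org.dianahep.sparkroot").load("hdfs:///hep/events.root")
events.printSchema()

# From here on it is ordinary Spark SQL over the physics objects.
events.filter("nMuon > 0").selectExpr("Muon_pt[0] as leading_muon_pt").show()
```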
Hudi: Large-Scale, Near Real-Time Pipelines at Uber
Have you heard of Hudi? Are you interested in the state of the art in managing petabytes of analytical data on distributed storage, while supporting fast ingestion and queries, at Uber?
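As a taste of the API, here is a hedged upsert sketch against the (then) com.uber.hoodie Spark datasource; the input path, field names, and exact option keys may differ across Hudi versions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HudiUpsert").getOrCreate()
updates = spark.read.json("hdfs:///incoming/trips.json")  # hypothetical input

# Upsert a micro-batch into a Hudi dataset: records with an existing key are
# updated in place, new keys are inserted, and queries stay fast.
(updates.write
    .format("com.uber.hoodie")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.table.name", "trips")
    .mode("append")
    .save("hdfs:///warehouse/trips"))
```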
Data Science
Building an Implicit Recommendation Engine with Spark
An interesting introduction to implicit Alternating Least Squares (ALS), which adds preference and confidence terms to the loss function. We also learned how to build a recommendation engine on top of Apache Spark; see the sketch below.
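For reference, a minimal sketch of implicit ALS in Spark MLlib; the toy data and hyperparameters (alpha, rank, regParam) are purely illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ImplicitALS").getOrCreate()

# Implicit-feedback events, e.g. click counts; column names are illustrative.
events = spark.createDataFrame(
    [(0, 10, 5.0), (0, 11, 1.0), (1, 10, 2.0), (1, 12, 8.0)],
    ["userId", "itemId", "clicks"])

# implicitPrefs=True switches MLlib to the implicit-ALS loss: raw counts
# become a binary preference plus a confidence weight c = 1 + alpha * count.
als = ALS(implicitPrefs=True, alpha=40.0, rank=10, regParam=0.1,
          userCol="userId", itemCol="itemId", ratingCol="clicks")
model = als.fit(events)
model.recommendForAllUsers(3).show(truncate=False)
```

A larger alpha makes repeated interactions count for more, which is the knob that distinguishes implicit ALS from the explicit-ratings variant.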
Deep Learning Techniques
Geospatial Analytics at Scale with Deep Learning and Apache Spark
Spark can be used to blend many different libraries together. See how to easily run complex queries over satellite images at scale by integrating Deep Learning Pipelines and Magellan (a geospatial package) within Spark.
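A rough sketch of the transfer-learning half of that story, using DeepImageFeaturizer from Deep Learning Pipelines; the tile path and label column are assumptions, and the Magellan spatial join that would attach labels is omitted:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.image import ImageSchema
from sparkdl import DeepImageFeaturizer  # Deep Learning Pipelines package

spark = SparkSession.builder.appName("SatelliteDL").getOrCreate()

# Load satellite tiles; the path is hypothetical, and we assume a "label"
# column has already been joined on (e.g. from a Magellan spatial lookup).
tiles = ImageSchema.readImages("hdfs:///satellite/tiles/labeled")

# Transfer learning: a pre-trained InceptionV3 acts as a fixed featurizer
# and a plain logistic regression is trained on top of its features.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(labelCol="label", featuresCol="features", regParam=0.05)
model = Pipeline(stages=[featurizer, lr]).fit(tiles)
```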
Developers’ Sessions
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Love to Scale
The title says it all – sharing experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. And yes, that’s right, one more use case from HEP!
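Staying with the RDBMS theme, this is roughly what offloading a heavy analytical scan from a database into Spark over JDBC looks like; the connection details, table, and partition bounds are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDBMSOffload").getOrCreate()

# Read a large table in parallel over JDBC; partitionColumn splits the scan
# into numPartitions range queries between lowerBound and upperBound.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
      .option("dbtable", "accelerator_logs")
      .option("user", "reader").option("password", "...")
      .option("partitionColumn", "id")
      .option("lowerBound", 1).option("upperBound", 1000000)
      .option("numPartitions", 16)
      .load())

# The aggregation now runs on the Spark cluster, not inside the database.
df.groupBy("device").count().show()
```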
Running Apache Spark in production is definitely not an easy thing to do, but it is almost essential for large-scale data processing and analytics.