Are you into cluster computing with Apache Spark? This year’s SAIS 2018 conference covered great data engineering and data science best practices for productionizing AI. In a nutshell, you should keep your training data fresh with stream processing, monitor quality, and test and serve your models (at massive scale, since we are talking about Spark). The conference also offered deep-dive sessions on Spark integration with popular machine learning frameworks such as TensorFlow, scikit-learn, Keras, PyTorch, Deeplearning4j, BigDL and Deep Learning Pipelines.
Here is a list of several interesting topics (in case you couldn’t join ;-)):
This year’s ECCV 2018 conference experienced an unprecedented growth of community and brought to light the most recent advances in computer vision. As expected, all the sessions were dominated by Deep Learning with Convolutional Neural Networks (CNNs).
For those who couldn’t join, I picked out a few interesting topics that caught my attention. Here is the list:
One of the main topics at ECCV 2018 was autonomous driving. Can you compete against LIDAR? Can you detect and reconstruct cars as 3D objects from video? Check out some of the ECCV challenges!
Clustering is a hugely important step of exploratory data analysis and has plenty of great applications. Typically, a clustering technique identifies different groups of observations in your data. For example, if you need to perform market segmentation, cluster analysis helps you label each segment so that you can evaluate its potential and target the most attractive ones. Your marketing program and positioning strategy therefore rely heavily on this very fundamental step: grouping your observations and creating meaningful segments. There are many more use cases in computer science, biology, medicine and social science. However, it often turns out to be quite difficult to define properly what a well-separated cluster looks like.
Today, I will discuss some technical aspects of hierarchical cluster analysis, namely Agglomerative Clustering. One great advantage of this hierarchical approach is the fully automatic selection of the appropriate number of clusters. This matters because in a genuine unsupervised learning problem, we have no idea how many clusters we should look for! Also, in my view, this clever clustering technique resolves some of the ambiguity caused by the vague definition of a cluster and is thus more than suitable for automatic detection of such structures. At the same time, the agglomerative clustering process relies on standard metrics to describe clustering quality, so it is fairly easy to observe what is going on.
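To make the bottom-up idea concrete, here is a minimal sketch in plain Python. It starts with every observation in its own cluster and repeatedly merges the two closest clusters. The choice of single linkage, the "largest gap between consecutive merge distances" stopping rule, and the toy points are all my assumptions for illustration; they are one simple way to pick the number of clusters automatically, not necessarily the method used in the full post.

```python
import math

def agglomerate(points):
    """Bottom-up clustering with single linkage (assumed linkage, for illustration).

    Returns a history of (merge_distance, partition) pairs, starting with
    every point in its own cluster and ending with one big cluster.
    """
    clusters = [[i] for i in range(len(points))]
    history = [(0.0, [list(c) for c in clusters])]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        history.append((d, [list(c) for c in clusters]))
    return history

def cut_at_largest_gap(history):
    """Heuristic (an assumption): stop merging right before the biggest jump
    in merge distance and return the partition reached at that point."""
    dists = [d for d, _ in history]
    gaps = [dists[k + 1] - dists[k] for k in range(len(dists) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return history[k][1]

# Made-up toy data: two visibly separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
history = agglomerate(points)
segments = cut_at_largest_gap(history)
print(len(segments), "clusters:", segments)
```

The merge history plays the role of a dendrogram: each entry records how far apart the two merged clusters were, which is what makes the process easy to inspect. The nested loops are fine for a handful of points but far too slow for real data, where a library implementation is the better choice.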
Irrespective of whether the underlying data comes from e-shop customers, your clients, small businesses or large profit and non-profit organizations, market segmentation analysis always brings valuable insights and helps you leverage otherwise hidden information in your favor, for example to achieve greater sales. Therefore, it is vitally important to utilize an efficient analytical pipeline, which not only helps you understand your customer base, but also serves you when planning tailored offers, advertising, promos or strategy. Let us play with some advanced analytics in order to provide a simple example of the efficiency gained by using segmentation techniques, namely clustering, projection pursuit and t-SNE.
As your goal might be improving your sales through tailored customer contact, you need to discover homogeneous groups of people. Different groups of customers behave and respond differently, so it is only natural to treat them differently. The idea is to increase profit in each segment separately, through a strategy tailored to that segment. Thus, we need to accomplish two fundamental tasks:
identify homogeneous market segments (i.e. which people are in which group)
identify important features (i.e. what is decisive for customer behavior)
In this post, I am focusing on the first problem from the technical point of view, using some advanced analytic methods. For the sake of a brief demonstration, I will work with a simple dataset describing the annual spending of a wholesale distributor’s clients on diverse product categories. Judging from the figure below, it would be difficult to detect any well-separated clusters of clients at first sight.
In this post, I am going to talk about an exceptional ensemble method for improving classification accuracy (boosting) called AdaBoost. The AdaBoost algorithm efficiently converts a weak classifier, defined as a classifier that achieves only slightly better accuracy than random guessing, into a strong classifier, which performs significantly better. AdaBoost is fast, requires almost no parameter tuning (apart from the number of boosting rounds), and can be combined with any weak learner, for example a Decision Tree.
Imagine you are dealing with a classification task critical to your underlying business. For example, you may want to identify two different groups of customers in order to make a targeted offer, suitable only for one of the groups. In this case, the more accurately you classify your two major groups of customers, the more profit you gain on your targeted offers.
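To illustrate how boosting turns weak learners into a strong one for such a two-group problem, here is a minimal sketch in plain Python. It assumes the classic discrete AdaBoost update with labels +1 and -1, and uses decision stumps (one-split decision trees) as the weak learner. The helper names and the tiny one-dimensional toy dataset are made up for illustration; this is a sketch under those assumptions, not the implementation from the full post.

```python
import math

def train_stump(X, y, w):
    """Pick the (feature, threshold, polarity) stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for polarity in (1, -1):
                pred = [polarity if x[f] <= t else -polarity for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best

def stump_predict(stump, x):
    _, f, t, polarity = stump
    return polarity if x[f] <= t else -polarity

def adaboost(X, y, rounds):
    """Discrete AdaBoost: returns a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)         # guard against division by zero
        if err >= 0.5:                     # no better than chance: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Made-up toy data: no single threshold separates the two groups.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

weak = adaboost(X, y, rounds=1)
strong = adaboost(X, y, rounds=3)
weak_errors = sum(predict(weak, x) != yi for x, yi in zip(X, y))
strong_errors = sum(predict(strong, x) != yi for x, yi in zip(X, y))
print("errors with 1 stump:", weak_errors, "| with 3 stumps:", strong_errors)
```

Each round focuses the next stump on the customers the previous ones got wrong, and the final decision is a weighted vote. Note that the errors here are measured on the training data only; on a real task you would evaluate on held-out data.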