
10 essential skills: (6) machine learning / deep learning algorithms.

  • Writer: Pamela Kinga Gill
  • Jan 29, 2019
  • 7 min read

Updated: Jan 31, 2019


If you've adopted the principles and knowledge behind posts (1), (2), (3), (4), (5), you're really going to enjoy the content here!


Preamble:


I got pretty enthusiastic the other day after reading one of my go-to bloggers on DataScienceCentral describe the variety of skills deployed by the data scientist, so I'd like to share. The list below is just one of the myriad mentions of the many hats worn by the data scientist professional as described by Vincent Granville, and if these don't generate some enthusiasm, well, then I just don't know what to do anymore!

A nice visual representation of the many disciplines of data science, by Brendan Tierney 2012

The data scientist skillset:

  1. data integration

  2. distributed architecture

  3. automating machine learning

  4. data visualization

  5. dashboards and BI

  6. data engineering

  7. deployment in production mode

  8. automated, data-driven decisions


Overview


In this post I describe the essential skill of machine learning and deep learning algorithms in a way that I hope both inspires and encourages you. It's impossible to completely cover this topic, so I limit the scope of the discussion to the merits toward professional development and incorporating an ML/DL skillset into your toolkit. Quoted from a fantastic resource, here is a high-level definition to kick things off:

Machine learning is about extracting knowledge from data. It is a research field at the intersection of statistics, artificial intelligence, and computer science and is also known as predictive analytics or statistical learning. - Introduction to Machine Learning with Python

Remember, data is everywhere around us! We can even generate it (i.e. feature engineering / feature representation). Once we are able to capture, store, and process data, the analytical benefits are truly endless, and their applications are bounded only by our imagination.


At the present moment, machine learning (ML) can't deliver the sort of artificial intelligence seen in productions such as Ex Machina or Westworld - ML is a subset of artificial intelligence, where the latter aims to mimic human abilities - and yet today's ML algorithms are already proving invaluable. They can filter spam, recommend movies or products to users, predict traffic, recognize faces, detect fraud, and chat with customers. On the other hand, deep learning is a subset of machine learning and is inspired by the neural networks of the brain. Deep learning (DL) works to give machines the capacity to learn and solve problems on their own. DL inspires much of the AI-hype around autonomous-robotics and even dares imagine their evolution toward sentient life.


Milestones in machine learning beating humans at games include:

Deep Blue versus Garry Kasparov, 1996 and 1997 (Game: Chess)

AlphaGo versus Fan Hui, 2015 (Game: Go)

AlphaGo versus Lee Sedol, 2016 (Game: Go)

AlphaGo versus Ke Jie, 2017 (Game: Go)

AlphaStar versus MaNa, 2019 (Game: StarCraft II) - across two rounds of games between the two, MaNa lost the first round 0-5 but won the latest live match.


If bringing AI to life is the end goal, there is still no escaping the necessity that we humans must first train ourselves in machine learning. Such a pity. To get started, there are four paradigms of machine learning: supervised, unsupervised, semi-supervised, and reinforcement learning. Each deals specifically with the nature of the data, the mapping of inputs to outputs, the level of human instruction, and the ML techniques deployed. Supervised versus unsupervised learning is essentially a distinction between learning from labeled versus unlabeled data. Reinforcement learning (RL) is unique in nature and a growing area of research. RL uses rewards and punishments as signals for positive and negative behaviour in order to incentivize some action without explicitly instructing outcomes. In 2018, the research lab OpenAI developed an AI that played games where the reward signal wasn't extrinsic but intrinsically generated: the agent was driven to avoid boredom or, simply, to be curious. It's part of what is called curiosity-driven AI, and they wrote about it here. If you're worried about having the proper background to understand this material, let me assure you: a background in biology, chemistry, psychology, neuroscience, or economics is ample preparation for a committed learner.


Add Machine Learning to your Repertoire


Data science, machine learning/deep learning, data mining, and artificial intelligence more generally are all composite disciplines that lean heavily on machine learning as a pillar of the work. In the context of big data, industries from retail, insurance, financial services, communications, health care, government, marketing and sales, to transportation are employing machine learning technology to analyze large data sets, build models and predictions, find relationships, uncover insights, and ultimately generate value. SAS (originally "Statistical Analysis System") - I've mentioned SAS previously as a proprietary, enterprise-grade software suite and programming language - has a great breakdown of these three big data tools (I abbreviate the definitions for simplicity):


Data Mining

"Data mining applies methods from many different areas to identify previously unknown patterns from data. This can include statistical algorithms, machine learning, text analytics, time series analysis and other areas of analytics. Data mining also includes the study and practice of data storage and data manipulation."

Machine Learning

"Machine learning has developed based on the ability to use computers to probe the data for structure, even if we do not have a theory of what that structure looks like. [...] Because machine learning often uses an iterative approach to learn from data, the learning can be easily automated. Passes are run through the data until a robust pattern is found."


Deep learning

"Deep learning combines advances in computing power and special types of neural networks to learn complicated patterns in large amounts of data."


Using Machine Learning Tools


As an example, SAS will embed AI tools into its existing software, making it accessible to enterprises wanting to benefit from AI's endless array of applications. The downside is that SAS is prohibitively expensive for the individual and truly an enterprise-level solution (it does support Hadoop and the Hadoop Distributed File System, for any curious folk). However, open-source languages such as Python, R, and Scala offer machine learning libraries at your disposal, for free, and these languages and environments can be integrated into business workflows!


Machine learning libraries are essentially packages that you can import into the program you are writing, letting you quickly reference tools, algorithms, and so on, and employ them for your purposes without building the underlying methodologies from scratch. It's akin to asking your computer "Hey, what's for dinner?" with print(dinner), where you've previously defined dinner as the list of strings: ['lentil soup', 'toasted baguette', 'kale salad', 'fruit']. Dinner could be just one of many labels in a library holding many instructions; this library could be titled meals. The benefits become more obvious when your set of procedures is quite complex.
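The analogy can be sketched in a few lines of Python; the meals dictionary and whats_for function are purely hypothetical names for illustration:

```python
# A toy "library": one label ("dinner") mapping to a stored set of items.
meals = {
    "dinner": ["lentil soup", "toasted baguette", "kale salad", "fruit"],
}

def whats_for(label):
    """Look up a prepared list of items by its label."""
    return meals[label]

print(whats_for("dinner"))
```

A real machine learning library works the same way at a much larger scale: you import a name and reuse the complex procedures stored behind it.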


With machine learning libraries, there are many algorithms you can borrow, like a book from the library. What you'll need is an understanding of your data and your desired outcome, plus an understanding of the different algorithms: their purpose, design, and implementation.


Below is an infographic of the libraries that each of these environments supports. P.S. the popular scikit-learn package in Python is second from the top; it's easy to miss. Also note Spark's ML library, discussed in an earlier post, under Scala, also second from the top.


Machine Learning Libraries


I've given extensive coverage in this series to the Python programming language and environment. When it comes to learning machine learning algorithms, something I really like to do is go directly to the source. Here I show Python's scikit-learn library. In scikit-learn's User Guide you can go right "under the hood" to look at the code and see how algorithms are built and applied. I recommend this.


Let's take the example: Supervised Learning > Decision Trees > Classification problem.


A classification problem means you are using predictive modeling to approximate a mapping function from input variables (X) to discrete output variables (y). The output variables in this context are referred to as labels or categories, and the full set of possible labels is given in advance. The mapping function is learned by iterating over training data that includes both (X, y), and is then used to predict the class or category of each new observation.


In Python, you can import this classifier as easily as typing: from sklearn.tree import DecisionTreeClassifier. The User Guide gives you all the information you need to apply these tools to your classification problem and dataset, including practical notes, mathematical formulations, examples, and code. Then it's a matter of practicing the techniques on a dataset of your choosing.
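As a minimal sketch of that workflow (the train/test split and default hyperparameters here are my own choices, not prescribed by the User Guide):

```python
# Fit a decision tree classifier on the Iris dataset and measure accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)        # 150 samples, 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)                # learn the mapping X -> y
accuracy = clf.score(X_test, y_test)     # fraction of correctly labeled test points
```

The fitted model's predict method then assigns a class to any new observation with the same four features.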


scikit-learn builds an example from a dataset called 'Iris' to show the effect of a decision tree classifier. I will use this data to give a visual representation of both classifying and clustering ML techniques.


This is an example of training a decision tree classifier on pairs of features from the dataset. The four features are sepal length, sepal width, petal length, and petal width; the three classes are Setosa, Versicolor, and Virginica. The plot shows how the decision tree classifier finds the best class for each data point, where each region is a recommended classification. Take a moment to appreciate how simplified this is in the grand scheme of classification: real methodologies can involve far more pre- and post-processing before results are delivered, which may drive decisions in a corporate context.


Let's take the example: Unsupervised Learning > Clustering > K-Means


Clustering is a common unsupervised machine learning technique: we are given a set of data points but no pre-set categories/labels, and we must work out how to group them. In theory, data points in the same group share similar properties, while data points in different groups have highly dissimilar properties. Using statistical techniques, select algorithms can find correlations or similarities between data points and decide which groups form, without the user needing to explicitly define these outputs. K-means is one popular clustering algorithm; it groups data by attempting to separate the samples into n groups of equal variance, where the number of clusters n must be specified up front (i.e. choose 2, 3, 4, etc.). The rule is to minimize the within-cluster sum of squares, i.e. the squared distances from each point to the centroid of its cluster.
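Here is a minimal sketch of K-means on the Iris measurements. Choosing k = 3 is my own assumption (we happen to know there are three species); in a genuinely unlabeled setting, picking k is part of the analysis:

```python
# K-means clustering on the Iris measurements, ignoring the true labels.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)     # unsupervised: the labels are discarded
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)            # a cluster index (0, 1, or 2) per sample
centroids = km.cluster_centers_       # one centroid per cluster, 4-D each
inertia = km.inertia_                 # the within-cluster sum of squares being minimized
```

The centroids are the cluster "centres" the algorithm converged on; the inertia is the quantity K-means tries to minimize.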


Below is the K-means clustering algorithm applied to the 'Iris' dataset discussed above:

KMeans
K-means clustering on the Iris dataset for the three classes of data. Example from scikit-learn.

Here is a 2011 example run in Python using the numpy, matplotlib, and scipy libraries. scikit-learn is built on all three of these libraries today. I like this example because it shows the centroids that determine the clusters. This line of code shows how Maciej specifies three clusters:

# kmeans for 3 clusters
res, idx = kmeans2(numpy.array(list(zip(xy[:, 0], xy[:, 1]))), 3)

Output:

Maciej Pacula shows the K-means cluster centres in this representation. They look appropriate!
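The same idea can be sketched in modern Python with scipy's kmeans2 directly. Note the xy array below is random stand-in data of my own, not Maciej's original points:

```python
# scipy's kmeans2 returns the centroids and a cluster index for each point.
import numpy
from scipy.cluster.vq import kmeans2

rng = numpy.random.default_rng(0)
xy = rng.normal(size=(150, 2))        # stand-in for the original xy data

res, idx = kmeans2(numpy.column_stack((xy[:, 0], xy[:, 1])), 3, minit="++")
# res holds the 3 cluster centres; idx holds a cluster index per point
```

Plotting res over a scatter of xy, coloured by idx, reproduces the kind of centroid picture shown above.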

Simple classification and clustering problems are foundational to machine learning, but they only hint at the wealth of analytics that can be achieved by applying ML/DL techniques to the variety, volume, and velocity of today's data and its countless applications.


That being said, gaining familiarity with machine learning libraries, going through sample code and exercises, and applying various algorithms, is a great starting point. You may even choose to specialize in only a handful of techniques that are common in a domain of your choosing. This is a fantastic recommendation if you're looking to immediately apply ML/DL to your work. With a little research, and some exposure to developments in AI, it won't be hard to imagine or discover which ML techniques are commonplace today for a suite of applications in the workplace. Pretty soon, you may even be helping to build recommendation engines like Netflix!




