A long time ago, when I first started to study machine learning models, the sheer number of different models and techniques was bewildering.
What I needed was a bird’s eye view of what all these different models and approaches do, what they’re good at, and where they fall short.
So today, I’m going to break it down at a very simple level.
In the broadest terms
A machine learning model is built either through supervised learning, or unsupervised learning.
Supervised learning
When a machine learns through supervised learning, it means that it takes a bunch of sample data (a dataset) and figures out which inputs map to which outputs.
As an example, say we have a whole bunch of data that maps distance and traffic conditions to the time taken to get to a destination. We can then feed that data to a machine learning algorithm and have it learn to estimate the time taken to reach a destination, given any distance and any traffic condition.
The basic idea here is that you have data that you already have the answer to.
So let’s say we have a bunch of existing data that maps a bunch of variables (or features):
- Distance
- Traffic conditions
to a result:
- The time taken to get to a destination
We also call this dataset “training data”. And the machine learning algorithm uses the training data as “truth”, and learns how it can map those variables (features) to the result (output).
Because the machine learning algorithm learns from training data that already contains the answers, we call this supervised learning.
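To make this concrete, here is a minimal sketch of the idea, assuming scikit-learn is available; the distances, traffic levels, and travel times are made up purely for illustration, and the choice of learner is just one of many possibilities.

```python
# A minimal sketch of supervised learning, assuming scikit-learn.
# The numbers below are invented purely for illustration.
from sklearn.linear_model import LinearRegression

# Training data: each row is [distance_km, traffic_level], the features.
X_train = [[5, 1], [10, 2], [20, 3], [8, 1], [15, 2]]
# The known answers ("truth"): travel time in minutes.
y_train = [12, 30, 65, 18, 42]

model = LinearRegression()       # one possible learner among many
model.fit(X_train, y_train)      # learn how the features map to the output

# Estimate the travel time for a new trip: 12 km in moderate traffic.
print(model.predict([[12, 2]]))
```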
Supervised learning is further subdivided into regression models and classification models.
Regression models
A regression model is one whose output is continuous.
In other words, there is a whole continuous spectrum of outputs: how tall a person is, how high a student will score on a test, how likely you are to sell an investment at a profit.
Classification models
A classification model is one whose output is discrete, or categorical.
In other words, there is a finite number of outputs — it’s a dog or a cat, it’s either a red, blue, or green ball, it’s rock music or jazz music, it’s a tree or it’s not.
And that’s it in a nutshell.
Let’s see some of the algorithms and models below.
Algorithms that support both regression and classification models
Decision trees
Decision trees are very popular because they are simple to implement and easy to inspect.
They are intuitive and quick to build, but they sometimes fall short when it comes to accuracy.
A decision tree is something very familiar to us. In fact, we use decision trees almost every day to navigate our lives. For example, when deciding whether to buy a new computer, you might ask: is my current computer too slow? Do I have enough money? Is a new model about to come out? Each answer sends you down a different branch, until you reach a final buy-or-don’t-buy decision.
Of course, this example is very simple. In reality, if you’re considering whether or not to buy a new computer, you would weigh many other factors as well, not just the three (slowness, finances, a new model coming out) described above.
In other words, what you see above is probably just a branch in a much, much larger tree.
For a decision tree algorithm to grow a tree, it needs to decide which features to split on and which condition to attach to each feature.
Every condition results in a split in the tree, and the algorithm also has to decide when to stop splitting.
But the idea is really simple.
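Here is a toy sketch of the buy-a-computer example as a decision tree, assuming scikit-learn; the yes/no data is invented for illustration.

```python
# A toy decision tree for the "should I buy a new computer?" example,
# assuming scikit-learn; the data is made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [computer_is_slow, have_enough_money, new_model_coming_out], 1 = yes, 0 = no.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]]
y = [1, 0, 0, 1, 0, 0]  # 1 = buy, 0 = don't buy

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Being able to read the learned splits is exactly why decision trees are easy to inspect.
print(export_text(tree, feature_names=["slow", "money", "new_model"]))
print(tree.predict([[1, 1, 0]]))  # slow computer, money in hand, no new model coming
```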
Random forests
Random forests build on decision trees; the difference is that they grow many decision trees, each considering only a random subset of the features at each step.
What? How?
In our example, there were 3 factors:
- slowness of current computer
- how much money I have
- whether there is a new model coming out
In reality, there may be 30, or 300 factors (the features).
A random forest will take a random subset of those features (e.g. any 10 of them for the 1st decision tree, any 6 of them for the 2nd tree, and so on), and grow each tree.
The random forest then runs all the little decision trees, and takes the “majority vote”.
If most of the little decision trees say “buy a new computer”, then the random forest says that too.
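The voting step itself is simple enough to sketch in plain Python; the votes below are made up.

```python
# A sketch of the "majority vote" idea on its own, in plain Python.
# Imagine each small tree has already made its prediction:
from collections import Counter

tree_votes = ["buy", "buy", "don't buy", "buy", "don't buy"]

# The forest's answer is simply the most common vote.
majority, count = Counter(tree_votes).most_common(1)[0]
print(majority)  # "buy", because 3 of the 5 trees said so
```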
Why bother with random forests?
If an ordinary decision tree grows too deep, it can suffer from what we call overfitting: it fits the training data so closely that it performs poorly on new, unseen data.
Most of the time, a random forest prevents this by creating random subsets of the features, building smaller trees from those subsets, and then combining the trees’ predictions.
It’s important to note that this doesn’t always work, and it does make the model somewhat more computationally expensive.
Hyperparameters of a random forest
There are quite a few components of a random forest that can be adjusted to modify how (well) it behaves. Here are some (a sketch of how they might be set in code follows the list):
- Maximum depth of tree
- Minimum sample split
- Maximum number of leaf nodes
- Minimum number of samples in each leaf node
- Maximum number of trees
- Maximum amount of data to use for each tree
- Maximum number of features for each tree
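A sketch of how these knobs might be set, assuming scikit-learn’s RandomForestClassifier, whose parameter names differ slightly from the list above; the values are arbitrary.

```python
# Mapping the hyperparameters above onto scikit-learn's RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,        # maximum number of trees
    max_depth=5,             # maximum depth of each tree
    min_samples_split=4,     # minimum samples needed to split a node
    max_leaf_nodes=20,       # maximum number of leaf nodes
    min_samples_leaf=2,      # minimum number of samples in each leaf node
    max_samples=0.8,         # fraction of the data used for each tree
    max_features=3,          # number of features considered at each split
)
# forest.fit(X, y) would then grow all the trees and take their majority vote.
```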
Neural networks
A neural network is a model inspired by how our human brains work.
Let’s start with a really high level overview so we know what we are working with.
Neural networks are multi-layer networks of neurons (the dots in the picture below) that we use to classify things, make predictions, etc.
Below is a diagram of a simple neural network with 2 inputs, 1 output, and 1 hidden layer of neurons (the blue dots).
To make it clear:
- Left-most dots (red): The input layer of our neural network
- Middle dots (blue): The hidden layer of our neural network (there can be more than 1 hidden layer).
- Right-most dots (green): The output layer of our neural network (there can be more than 1 output).
The arrows show how the dots, which are the neurons, interconnect.
The direction of the arrow shows how information flows from neuron to neuron, from the input layer, to the output layer.
How does it learn?
Each of the arrows has an associated weight, which the neural network can change as it learns.
Every time the neural network is presented with a new piece of training data, it adjusts those weights to account for the new data, while retaining its understanding of the old data.
After it has finished learning, the weights are what define the neural network’s ability to produce its output (a classification or a regression).
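To see what a single neuron and its incoming weights actually compute, here is a tiny sketch in plain Python; the numbers are arbitrary.

```python
# A sketch of what one neuron does with its incoming weights, in plain Python.
import math

inputs = [0.5, 0.8]       # values arriving along the incoming arrows
weights = [0.4, -0.6]     # one learned weight per arrow
bias = 0.1

# Weighted sum of the inputs, squashed by an activation function (sigmoid here).
z = sum(x * w for x, w in zip(inputs, weights)) + bias
output = 1 / (1 + math.exp(-z))
print(output)             # this value is passed along to the next layer

# Learning consists of nudging `weights` and `bias` so that the network's
# final outputs match the training data more closely.
```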
Hyperparameters of a neural network
There are quite a few components of a neural network that can be adjusted to modify how (well) it learns. Here are some (a sketch of how they might be set follows the list):
- Learning rate
- Number of epochs
- Batch size
- Number of hidden layers
- Number of neurons in a given layer
- Activation function
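A sketch of how these knobs might be set, assuming scikit-learn’s MLPClassifier; dedicated deep learning frameworks expose the same ideas under different names.

```python
# Mapping the hyperparameters above onto scikit-learn's MLPClassifier.
from sklearn.neural_network import MLPClassifier

net = MLPClassifier(
    hidden_layer_sizes=(16, 8),  # two hidden layers, with 16 and 8 neurons
    activation="relu",           # activation function
    learning_rate_init=0.001,    # learning rate
    batch_size=32,               # batch size
    max_iter=200,                # number of epochs (passes over the training data)
)
# net.fit(X, y) would then train the network by adjusting its weights.
```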
Algorithms that support regression models
Remember that one type of supervised learning is a regression model? A regression model is one whose output is continuous.
In other words, there is a whole continuous spectrum of outputs: how tall a person is, how high a student will score on a test, how likely you are to sell an investment at a profit.
Linear regression
Intuitively, linear regression tries to find the best-fit straight line through all the training data.
Suppose we plot the height of a person (y-axis) against his or her age (x-axis) and draw the best-fit line through those points.
That best-fit line then more or less predicts the height of any person, given his or her age.
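As a quick sketch, assuming scikit-learn, with invented ages and heights:

```python
# A sketch of the height-vs-age example; the data is made up.
from sklearn.linear_model import LinearRegression

ages = [[2], [5], [8], [12], [16]]     # independent variable (x-axis)
heights = [85, 110, 130, 150, 172]     # dependent variable in cm (y-axis)

line = LinearRegression().fit(ages, heights)
print(line.predict([[10]]))            # estimated height of a 10-year-old
print(line.coef_, line.intercept_)     # slope and intercept of the best-fit line
```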
We can push linear regression further, and consider:
- Polynomial regression
- Multiple linear regression (or MLR)
Polynomial regression
Instead of trying to best-fit a line, we best-fit a curve, which may be more suitable for certain datasets.
If the data points themselves follow a curve rather than a straight trend, a line simply won’t fit very well no matter how we try, but a curve will.
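A sketch of the same idea in code, assuming scikit-learn; here PolynomialFeatures turns the line-fitter into a curve-fitter, and the data is invented:

```python
# Fitting a curve instead of a line: PolynomialFeatures expands x into [x, x^2, ...]
# so an ordinary linear model can fit a curve.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = [[1], [2], [3], [4], [5]]
y = [1, 4, 9, 16, 25]                  # clearly a curve (y = x^2), not a line

curve = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
curve.fit(X, y)
print(curve.predict([[6]]))            # close to 36
```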
Multiple linear regression
So far in our (simple) linear regression, what we discussed is what you can use when you have one independent variable and one dependent variable. In our example, the dependent variable is the person’s height, and the independent variable is the person’s age.
Multiple linear regression is what you can use when you have more than 1 independent variable.
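A sketch, assuming scikit-learn; the second feature (a parent’s height) is purely hypothetical, added just to show more than one independent variable:

```python
# Multiple linear regression: same idea, more than one independent variable per row.
from sklearn.linear_model import LinearRegression

# Features: [age, parent_height_cm] (the second feature is hypothetical).
X = [[2, 165], [5, 170], [8, 180], [12, 175], [16, 168]]
y = [85, 112, 135, 152, 170]           # child's height in cm

mlr = LinearRegression().fit(X, y)
print(mlr.predict([[10, 172]]))
```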
If you’re interested, a good next stop for more info might be:
- https://towardsdatascience.com/linear-regression-explained-d0a1068accb
- https://towardsdatascience.com/introduction-to-linear-regression-and-polynomial-regression-f8adc96f31cb
Algorithms that support classification models
Remember that one type of supervised learning is a classification model? A classification model is one whose output is discrete, or categorical.
In other words, there is a finite number of outputs — it’s a dog or a cat, it’s either a red, blue, or green ball, it’s rock music or jazz music, it’s a tree or it’s not.
Logistic regression
To be fair, logistic regression is not a classification model on its own. It outputs a probability given some variables.
But it is usually paired with a decision rule: if the probability is above a certain threshold, the input belongs to class A; otherwise it belongs to class B. In combination with that rule, it becomes a classification model.
Intuitively, logistic regression tries to find an S-shaped curve that divides the training data into two groups, as cleanly as possible.
Why an S-shaped curve? For one, it limits the output to a value between 0 and 1, which has all sorts of conveniences (not least that it can be read as a probability). It also handles outliers in a fairly elegant way, among other benefits.
Data on one side of the S-shaped curve belongs to class A, and the rest to class B. Various algorithms are used to find the best S-shaped curve, two of which are (a sketch of the threshold rule follows the list):
- Gradient descent
- Maximum likelihood
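A sketch of the probability-plus-threshold idea, assuming scikit-learn; the one-feature dataset is invented:

```python
# Logistic regression outputs a probability; a threshold rule turns it into a class.
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1]                     # 0 = class A, 1 = class B

clf = LogisticRegression().fit(X, y)

prob_b = clf.predict_proba([[4.5]])[0][1]  # probability of class B for a new point
label = "B" if prob_b > 0.5 else "A"       # the decision rule on top of the probability
print(prob_b, label)
```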
Support vector machine
Again, a support vector machine, or SVM, is a classification model, which means we are trying to decide if something falls into Class A, or Class B.
Such as, is this email spam, or not spam?
The idea behind an SVM is to find a nice point, line, plane and so on (we’ll call them collectively “hyperplane”) that best divides the dataset into the 2 classes. That’s the optimum hyperplane.
What’s the meaning of best?
The line (or hyperplane) that best separates the dataset is the one with the maximum margin between the hyperplane and the points closest to it. This optimum hyperplane is also called the margin-maximizing hyperplane.
But what if there isn’t a hyperplane that can cleanly divide the dataset?
For some datasets (for example, when points of one class form a ring around points of the other), there is simply no hyperplane (in 2-dimensional space, a line) that can separate the two classes.
Now what?
The kernel trick
What we want is a way to transform such data into something that can be linearly separated.
We do this with something called the kernel trick: a mechanism that maps the original, non-linearly separable data (in our example, the two classes of dots in 2-dimensional space) into a higher-dimensional space in which they are separable.
Types of kernels are:
- Linear
- Polynomial
- Radial basis function
- Sigmoid
Then we can use the SVM as we understand it.
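A sketch of the kernel trick paying off, assuming scikit-learn; make_circles produces exactly the ring-around-a-cluster situation described above:

```python
# The kernel trick in action: a linear SVM struggles on ring-shaped data,
# while an RBF kernel maps it into a space where it becomes separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # no straight line can separate this
rbf_svm = SVC(kernel="rbf").fit(X, y)         # the kernel lifts it into a separable space

print(linear_svm.score(X, y))                 # noticeably below 1.0
print(rbf_svm.score(X, y))                    # close to 1.0
```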
Regularization parameter
When a clean split isn’t possible (or desirable), we need to be a bit more lax and allow our classification model to tolerate a few outliers, while still dividing as much of the dataset as possible correctly. We can expose this trade-off as a parameter of the model, called the regularization parameter.
A high regularization parameter tells the algorithm that it’s less okay to misclassify points, and that it should accept smaller margins in order to classify more points correctly.
A low regularization parameter tells the algorithm that it’s okay to misclassify more points in exchange for a wider margin.
Needless to say, high regularization parameters can be computationally expensive.
Gamma
The gamma parameter defines how far the influence of a single piece of training data reaches when deciding where the hyperplane goes.
A high gamma value indicates to the algorithm that only data that is close to the hyperplane should be considered.
A low gamma value indicates to the algorithm that data that is far away from the hyperplane should also be considered.
Summary
In summary, there are parameters that affect how the SVM behaves; the most common ones are (a sketch of setting them follows the list):
- Kernel
- Regularization
- Gamma
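A sketch of setting these three, assuming scikit-learn’s SVC, where the regularization parameter is called C; the values are arbitrary.

```python
# Mapping the three SVM parameters above onto scikit-learn's SVC.
from sklearn.svm import SVC

svm = SVC(
    kernel="rbf",   # which kernel to use: "linear", "poly", "rbf", or "sigmoid"
    C=1.0,          # regularization: higher = fewer misclassifications tolerated, narrower margin
    gamma=0.5,      # reach of a single training example: higher = only nearby points matter
)
# svm.fit(X, y) would then search for the margin-maximizing hyperplane.
```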