If you’re new to machine learning or have no idea what it is, this article is for you. The machine learning field is huge and evolving rapidly, but fear not, we will walk through the process of machine learning to get your feet wet. We’ll cover the following topics:
- What is machine learning?
- The process of machine learning.
- Ensemble techniques.
What is Machine Learning?
First of all, machine learning is a sub-field of artificial intelligence. More specifically, machine learning is training a model to recognize patterns in existing data with known outcomes. That learned pattern recognition can then be used to take new input data and predict the outcome. A trained model can also be thought of as a generated function that takes inputs and produces an output. Many machine learning models are considered a “black box”: it’s hard to see how the model arrives at a prediction. However, some algorithms, such as decision trees, are “white boxes” where you can trace exactly how a predicted value is reached.
Machine learning models are trained using various algorithms designed to recognize patterns. Essentially, these algorithms use optimization: during training, the model is adjusted to minimize its prediction error against the known outputs.
Two of the primary tasks of machine learning are classification and regression.
Classification
Classification is identifying an item’s class based on input data. For example, suppose we want to identify whether a person is a credit risk based on age, income, and FICO score. This is overly simplistic for ease of explanation, but you get the point. In this problem, we use historical data where past customer ages, incomes, and FICO scores are known. We also know what kind of credit risk they were based on payment history. Credit risk falls into the following classes:
- Poor
- Fair
- Good
- Excellent
We train a machine learning model to predict a person’s credit risk based on the input factors. The output of the model picks one of the four risk classes as its prediction.
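To make this concrete, here is a minimal sketch in Python using scikit-learn; the customer data and feature values are made up purely for illustration:

```python
# Hypothetical example: predict a credit risk class from age, income, and FICO score.
from sklearn.tree import DecisionTreeClassifier

# Made-up historical data: [age, income, FICO score]
X_train = [
    [25,  35000, 580],
    [40,  72000, 690],
    [52, 110000, 760],
    [34,  48000, 810],
]
# Known credit risk classes for those customers
y_train = ["Poor", "Fair", "Good", "Excellent"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Predict the risk class for a new applicant
print(model.predict([[30, 55000, 700]]))
```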
Regression
Regression is determining a numerical output based on input data. An example would be a machine with sensors whose readings are used as inputs to determine a setting on the machine. For our example, the setting is the output and can be adjusted from 0 to 100. Based on machine testing, we have acquired input data readings and determined the appropriate setting to use. This data is used to train the machine learning model. When the machine goes online, the model periodically reads the sensor values and predicts the setting to use.
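Here is a similar sketch for regression; the sensor names, readings, and settings are hypothetical:

```python
# Hypothetical example: predict a machine setting (0-100) from sensor readings.
from sklearn.linear_model import LinearRegression

# Made-up sensor readings: [temperature, vibration, pressure]
X_train = [
    [70, 0.2, 30],
    [85, 0.5, 45],
    [90, 0.7, 50],
    [60, 0.1, 25],
]
# Settings that worked well during machine testing
y_train = [20, 55, 70, 10]

model = LinearRegression()
model.fit(X_train, y_train)

# Predict the setting for new sensor readings
print(model.predict([[80, 0.4, 40]]))
```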
What is the Process of Machine Learning?
Now that we have an idea of what machine learning is, let’s cover the process of machine learning. For our purposes, the process is broken down into six steps:
- Preprocessing data
- Data partitioning
- Selecting a model to use
- Training
- Testing
- Prediction
Preprocessing Data
Data is often dirty, that is, it’s missing values, has superfluous characters, or is not in a form that the machine learning algorithm can handle. Data must be cleaned by removing cases with missing data or replacing missing values with artificial data. Categorical data must be encoded in a way that the machine learning algorithm can understand. Numeric input variables that have vastly different scales are often rescaled so they are all of approximately similar magnitude, which helps some models better predict outcomes.
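A minimal sketch of these cleaning steps with pandas and scikit-learn; the column names and values are hypothetical:

```python
# Hypothetical preprocessing: handle missing values, encode categories, rescale numbers.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [25, 40, None, 34],                    # contains a missing value
    "income": [35000, 72000, 110000, 48000],
    "region": ["north", "south", "south", "east"],   # categorical column
})

# Replace the missing age with the column median (or drop the row instead)
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the categorical column so the algorithm can use it
df = pd.get_dummies(df, columns=["region"])

# Rescale numeric columns so they have similar magnitudes
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
print(df)
```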
Data Partitioning
Before training a model, the data needs to be split into different sets. Some of the data is used to train the model. While training, a separate set can be used to test the model periodically and see how well it predicts on data it hasn’t seen during training. After training is complete, a third set can be used to test the final model on more unseen data and confirm that it will do a sufficient job of predicting future data.
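One common way to make this three-way split is to call scikit-learn’s `train_test_split` twice; the 60/20/20 proportions below are just an illustrative choice:

```python
# Hypothetical split: 60% training, 20% validation, 20% final testing.
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]    # stand-in features
y = [i % 2 for i in range(100)]  # stand-in labels

# First carve out the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```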
Select a Model to Use
There are numerous machine learning algorithms to choose from, and each has strengths and weaknesses depending on the situation. Experience and experimentation are two primary tools to select which algorithm to use.
Training
Once the data is preprocessed and partitioned, we’re ready to train the algorithm to create a prediction model. When our training data contains known outcomes, we are employing supervised training. There are also unsupervised learning and reinforcement learning, but we will not cover those here.
One of the big concerns in training is overtraining (also called overfitting). This is where the model learns the unique quirks of the training data so well that it does a poor job of predicting new data. There are various techniques that can be used to avoid overtraining, and you can see some of them in this article.
Machine learning is rarely about picking an algorithm and training the model in one try. Each algorithm has parameters that control how training is conducted. These training parameters have an impact on the performance of the model, and tuning them can involve a lot of trial and error. This means there will be multiple training runs while tuning the model for best performance.
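Here is one way such tuning might look, a minimal sketch using scikit-learn’s grid search; the parameter values and the synthetic data are arbitrary and purely illustrative:

```python
# Hypothetical tuning run: try several parameter combinations and keep the best.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)

param_grid = {
    "max_depth": [3, 5, None],       # limiting depth can reduce overtraining
    "min_samples_leaf": [1, 5, 10],  # requiring more samples per leaf does too
}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```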
Testing
Once a model is trained, we need to verify that it generalizes, meaning it can reliably predict new, unseen data. To do this, a separate testing data set, one the model never saw during training, is used to measure how well the model predicts unseen data.
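A quick sketch of that check, again with synthetic data for illustration; a large gap between training and test accuracy would suggest the model has not generalized well:

```python
# Hypothetical check of generalization: score the trained model on held-out test data.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Compare accuracy on data the model trained on vs. data it has never seen
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```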
Prediction
Now that the model is trained and validated, it’s ready to be put into production. We now use the model to predict outcomes of new data. Once in production, the model should be tested routinely to make sure it’s still valid for the incoming data.
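A minimal sketch of putting a model to work: save it to a file, reload it where predictions are needed, and feed it new data. The file name and the stand-in data below are hypothetical:

```python
# Hypothetical production use: save the trained model, reload it, and predict new data.
import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

joblib.dump(model, "credit_model.joblib")    # ship this file to production

loaded = joblib.load("credit_model.joblib")  # load it where predictions are made
new_data = X[:3]                             # stand-in for incoming data
print(loaded.predict(new_data))
```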
If One Model is Good (or not), Why Not More?
When we can’t find a model that does well, we’re still not out of luck. A powerful technique is to combine multiple models into an ensemble. Ensembles often improve performance by using aspects learned by the individual models to reach a more accurate conclusion. For classification tasks, the ensemble determines a prediction through a voting system, such as majority rule. For regression, the ensemble determines a prediction through some form of averaging the outputs of the individual models. If you would like to learn more about ensembles, you can start with this page.
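As a sketch of a voting ensemble for classification, here are three different models combined with scikit-learn’s `VotingClassifier`; the choice of models and the synthetic data are illustrative only (for regression, `VotingRegressor` averages the outputs instead):

```python
# Hypothetical ensemble: three different models vote on each classification.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("knn", KNeighborsClassifier()),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # majority rule across the three models
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```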
Conclusion
In this article we have barely scratched the surface of machine learning, but we have learned the basic process of machine learning and what it does. If you want to dig deeper into machine learning, have a look through other articles and resources in the Knowledge Base.