Machine Learning (ML) is a field of computer science that allows machines (computers) to learn from data without being explicitly programmed. ML is considered a subfield of artificial intelligence (AI) and it involves algorithms that improve their performance on a specific task as they are exposed to more and more data.
Machine learning projects can be broadly categorized into a) Supervised Learning, b) Unsupervised Learning, and c) Reinforcement Learning. From a mainstream business standpoint, the last type of ML problem is not as important as the first two.
Machine learning projects are often confused with creating AI, while in practice ML is about ‘empowering’ machines to learn and adapt to new situations based on data. Imagine software that analyzes millions of emails to identify spam messages, a self-driving car that learns to navigate different road conditions, or an algorithm that predicts which customer visiting your store will actually buy. These are just a few examples of the power of machine learning.
Machine learning focuses on implementing, and sometimes developing, algorithms and techniques that enable computers to learn from data and improve their performance on specific tasks without being explicitly programmed. ML tasks often utilize computationally heavy algorithms to model the relationship between a variable or outcome of interest, called the target variable, and various other inputs, or attributes, that are presumed to have an impact on the outcome. For instance, an ML engineer might hypothesize that the average cart value of an online shopper (i.e., the target variable) is linked to attributes such as their age, location, income bracket, gender, purchase history, time of shopping, landing page, search queries, click-through rate and dwell time. With sufficient data, a well-implemented ML model can be immensely helpful in shaping the business strategy of an e-commerce site: refining marketing spend, targeting the right products to the right customer groups, preempting customer churn, and offering proactive measures to delight customers, among others.
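In code, this framing boils down to separating the target variable from the attributes. Here is a minimal sketch; the tiny dataset and column names are made up purely for illustration:

```python
import pandas as pd

# Tiny hypothetical shopper dataset (column names are purely illustrative).
df = pd.DataFrame({
    "age": [24, 37, 52, 29],
    "income_bracket": [3, 5, 4, 2],
    "click_through_rate": [0.12, 0.08, 0.22, 0.05],
    "dwell_time_seconds": [310, 95, 540, 60],
    "average_cart_value": [42.0, 18.5, 87.0, 12.0],
})

# Attributes (features) presumed to influence the outcome.
X = df.drop(columns=["average_cart_value"])

# Target variable: the outcome of interest we want to model.
y = df["average_cart_value"]
```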
Almost any ML problem can be approached in a systematic and structured manner. Before tackling ML from a technical standpoint, the preliminary step from a business standpoint is to make sure that machine learning is the right tool for the job. In business it is almost never the case that the process owner defines a problem in analytical terms. For instance, a manager in the finance function may flag concerns about inefficient use of working capital because high inventory costs are eating into profit margins over time, or a marketing manager may lament a deteriorating leads-to-conversion metric over the last six quarters. In both cases the data science or analytics function can help ameliorate the situation. In the first case the analytics function may explore ways to reduce inventory carrying costs by offering a better forecasting model; it can also help the marketing function by deploying ML for better lead generation.
First, the stakeholders must spend sufficient time exploring and understanding the business context and take the time to define the problem. During these discussions it should become clear that an analytics-based solution can help address the problem at hand. In this ideation stage, everyone should agree that, at least in theory, analytical models can be effectively deployed for better decision making and business management. This is also the stage where stakeholders decide how to measure success, or what defines success for this project. Assuming the project is green-lit, the stakeholders will work together to extract the available data and perform exploratory data analysis (EDA). During this stage the necessary data preprocessing steps are identified and implemented. The data is cleaned up and readied for ML activities. While most of these steps are iterative, not all of them are.
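As a rough illustration of what the EDA and preprocessing stage can look like in practice, here is a short sketch; the tiny extract and its column names are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical extract (in practice this would come from a database or CSV).
df = pd.DataFrame({
    "order_date": ["2023-01-04", "2023-01-04", "2023-02-11", None],
    "sku": ["A12", "A12", "B07", "C33"],
    "unit_cost": [4.5, 4.5, np.nan, 7.2],
    "units_on_hand": [120, 120, 45, 300],
})

# Basic exploratory data analysis.
print(df.shape, df.dtypes, sep="\n")
print(df.describe(include="all"))
print(df.isna().sum())  # where is data missing?

# Typical preprocessing steps identified during EDA.
df = df.drop_duplicates()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["unit_cost"] = df["unit_cost"].fillna(df["unit_cost"].median())
```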
Once the data is ready, here is how one can approach a machine learning problem:
- Identify Supervised vs. Unsupervised Learning: The data science team, which also includes ML engineers, will identify the type of ML problem they are dealing with. In most cases, we are dealing with either supervised or unsupervised data. Supervised data always has at least one target variable associated with it, whereas unsupervised data has no target variable to begin with. For example, trying to predict the resale value of a used car or the insurance payout for a particular person, or trying to tell whether a picture is that of a cat or a dog, are all supervised problems. In a business context, a supervised problem is often better understood and is considerably easier to deal with than an unsupervised problem. If the objective is to predict a number (a real-valued output) then the problem belongs to the ‘regression’ category, and if the objective is to identify a state or an outcome (e.g. cat or dog, 0 or 1, black or white) then the problem is called a classification problem. It is worth noting that ML practitioners sometimes recast regression problems as classification problems depending on the success metric used and the available data. Plenty of ML algorithms are available to deal with regression and classification problems, and more are being added every year.
Unsupervised problems do not have a target variable associated with them and are, in general, more difficult to deal with. Whereas in the case of supervised datasets humans have played a role in marking the target variable, there is no such contribution from human actors in the case of unsupervised data. Unsupervised problems can be tackled using one of the following approaches: clustering algorithms, association rule mining, anomaly detection or outlier analysis, and Principal Components Analysis (PCA). From a visual analytics standpoint ML practitioners can use Self-Organizing Maps (SOMs) or t-SNE (t-Distributed Stochastic Neighbor Embedding) to explore relationships in unsupervised datasets. (Reinforcement learning is beyond the scope of this post.)
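A minimal sketch of two of these approaches, clustering and PCA, using scikit-learn; the synthetic data is generated only so the snippet is self-contained:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, unlabeled data standing in for a real unsupervised dataset.
X, _ = make_blobs(n_samples=500, centers=4, n_features=6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Clustering: group similar observations together without a target variable.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# PCA: project the data onto two components for visual exploration.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```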
- Decide on Cross-Validation and Model Evaluation Metrics: Once the ML problem has been defined by the team, it may seem that model building is the next logical step. However, it is important that the team first spend time deliberating on what would define their “ideal” model, so that they can choose among the dozens of ML models they will build. And if at first the team doesn’t come close to the predefined success metric (which is not unusual), it will have a path forward on how to tweak the best models.
After apportioning the data into training and test sets, the team will have to decide how to validate model results. The most commonly used approach is cross-validation, in which the training dataset is split into a number of smaller subsets (folds); the model is trained on all but one fold and validated on the held-out fold, rotating until every fold has served as the validation set. A lot can be said about the technicalities behind the use of cross-validation to strengthen ML model performance, but its most important outcome is that it reduces model overfit. One must be acutely cognizant of mitigating overfit especially when using neural networks, which by design tend to adapt to nuances in the training data (i.e. the loss function keeps decreasing) with every additional pass over the data. In ML we want our model to fit the data as accurately as possible without overfitting it. In layperson’s terms, the accuracy achieved by the ML model on the training dataset should be reflected in the validation or the test dataset. Cross-validation helps achieve this. The most common approaches include k-fold cross-validation (and its variants) and Leave-One-Out cross-validation (LOOCV). Cross-validation is an essential ingredient in building models that generalize well.
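A minimal k-fold cross-validation sketch with scikit-learn; the data here is synthetic, purely to make the snippet self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the prepared training set.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold,
# rotating until every fold has served as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print("Mean RMSE:", -scores.mean(), "Std:", scores.std())
```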
Model evaluation metrics tell us how well the ML model is performing on a given task or dataset. While it is not unusual for data scientists to come up with their own metrics when confronted with unique problems, in most cases the standard metrics do a good job of indicating model performance.
For regression problems the following metrics are the most important ones (see the sketch after this list for how they are computed):
- Root mean squared error (RMSE)
- Root mean squared logarithmic error (RMSLE)
- Mean absolute error (MAE)
- Mean absolute percentage error (MAPE)
- Mean squared error (MSE), and
- R-squared
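A quick sketch showing how these regression metrics can be computed with scikit-learn, using made-up true and predicted values:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error,
                             mean_squared_log_error,
                             r2_score)

# Made-up true values and model predictions, purely for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE:  ", mse)
print("RMSE: ", np.sqrt(mse))
print("RMSLE:", np.sqrt(mean_squared_log_error(y_true, y_pred)))
print("MAE:  ", mean_absolute_error(y_true, y_pred))
print("MAPE: ", mean_absolute_percentage_error(y_true, y_pred))
print("R^2:  ", r2_score(y_true, y_pred))
```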
For classification problems the following metrics are the most important ones (again, see the sketch after this list):
- Accuracy: overall correct predictions (correct / total)
- Precision: ratio of true positives to all predicted positives [TP / (TP + FP)]
- Recall: ratio of true positives to all actual positives [TP / (TP + FN)]
- F1-score: the harmonic mean of Precision and Recall, indicating the balance of performance between the two. Lies between 0 (worst) and 1 (best).
- ROC AUC: the area under the Receiver Operating Characteristic curve, measuring the model’s ability to discriminate between classes.
- Log Loss: measures the model’s ability to assign probabilities to classes, penalizing it more for confident wrong predictions.
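And a corresponding sketch for the classification metrics, again on made-up labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

# Made-up binary labels and predicted probabilities for the positive class.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.6, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # hard predictions at a 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("Log loss :", log_loss(y_true, y_prob))
```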
Evaluation metrics used for unsupervised problems are intricately linked to the business context and the problem at hand, and are therefore situation-dependent.
- Shortlist and Train the Model: ML practitioners are obsessed with achieving useful accuracy for a given task. This involves selecting and training different prediction models on the dataset at hand. A number of models are available for use in regression and classification tasks. The most commonly used ML algorithms which work on both regression and classification include: linear models, decision trees, Bayesian techniques (Naïve Bayes), kNN, random forests (also known as bootstrap forests), boosted trees, neural networks, and support vector machines (SVMs).
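A rough sketch of how a few candidate models might be shortlisted by cross-validated score; the data is synthetic and only three of the algorithms above are shown, for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for the prepared dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Score every candidate with 5-fold cross-validation and keep the results for shortlisting.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```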
The explainability and interpretability of the models listed above vary with the problem at hand, the underlying data, and the evaluation metrics. Some models, such as linear models and decision trees, can help ML practitioners generate many insights into the business aspects of the problem, whereas neural networks and random forests rarely offer that benefit.
However, there are instances where the ML task is all about achieving the highest possible prediction accuracy, and explainability and interpretability are not a consideration. This is where ML practitioners can pull out gradient boosting techniques. Until a few years ago the most powerful gradient boosting algorithm was XGBoost, which was seen as a “miracle” algorithm for breaking the accuracy ceiling often encountered with mainstream algorithms. Although newer variants of the algorithm are on the market, XGBoost is still seen as a solid choice by practitioners. Other boosting algorithms include: AdaBoost, CatBoost, multiple additive regression trees (MART), gradient boosting machines (GBM), and LightGBM. Going by empirical evidence, gradient boosting techniques work very well on classification tasks.
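A minimal sketch of training a gradient boosted classifier with XGBoost, assuming the xgboost package is installed; the data is synthetic and the hyperparameter values are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # requires the xgboost package

# Synthetic classification data for illustration.
X, y = make_classification(n_samples=2000, n_features=25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Gradient boosted trees; the settings below are placeholders, not tuned values.
model = XGBClassifier(n_estimators=300, learning_rate=0.05,
                      max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]
print("Test ROC AUC:", roc_auc_score(y_test, y_prob))
```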
Undertaking an ML project often involves multiple iterations across different steps. In particular, ML practitioners go back and forth between this step and the next (feature selection and engineering).
- Select and Engineer Features: The principle of model parsimony dictates that, when forced to choose between two models of comparable accuracy, one should always choose the model with lower complexity. Models with lower complexity (but similar accuracy) often translate to better generalizability and greater robustness. A necessary step towards achieving model parsimony is feature selection. After building first-cut models in the previous step, ML practitioners routinely evaluate the information worthiness of each attribute in the dataset. The question is simple: how much is a particular attribute contributing to the predictive power of the model? The lowest contributors are dropped, which lowers model complexity.
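One common way to do this is to rank attributes by a fitted model’s feature importances and drop the weakest contributors; a sketch on synthetic data (the median-importance cutoff is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 30 features, only 8 of which are actually informative.
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=8, random_state=42)

# Fit a first-cut model and inspect how much each attribute contributes.
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = model.feature_importances_
print("Top features by importance:", np.argsort(importances)[::-1][:8])

# Drop the lowest contributors: keep only features above the median importance.
keep = importances > np.median(importances)
X_reduced = X[:, keep]
print("Reduced from", X.shape[1], "to", X_reduced.shape[1], "features")
```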
The process of feature engineering is far more nuanced and demands high levels of domain knowledge. Engineered features are attributes which the ML practitioner constructs from the data on hand, or from data that is readily available. An engineered feature can be as simple as a new ratio attribute or a transformed variable, or as involved as an embedding of geolocation data or cookie-tracking data sourced from a third party. For example, take the simple case of a health insurance company wanting to predict insurance payouts for policy holders using ‘age’ as one of the features. Knowing that there is a nonlinear relationship between the age of a patient and the deterioration of their health enables the modeler to create a new feature, in this case a transformed variable, that captures that relationship more strongly. Engineering a new feature such as ‘age-squared’ accounts for this relationship, in effect boosting model accuracy.
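In pandas this kind of feature engineering is a one-liner; the tiny dataset below is made up for illustration:

```python
import pandas as pd

# Hypothetical policy-holder data.
df = pd.DataFrame({"age": [23, 35, 47, 59, 68],
                   "annual_payout": [400, 650, 900, 1800, 3100]})

# Transformed variable capturing the nonlinear effect of age.
df["age_squared"] = df["age"] ** 2

# A simple ratio attribute as another engineered feature.
df["payout_per_year_of_age"] = df["annual_payout"] / df["age"]

print(df)
```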
Feature engineering of nominal and ordinal variables is trickier, and small changes can lead to a loss of model precision. A detailed discussion of categorical feature engineering is beyond the scope of this post; in practice, however, ML engineers assess the impact of encoding measures such as frequency encoding, target encoding, and hashing encoding.
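As a flavor of two of these encodings, here is a sketch of frequency and target encoding on a tiny hypothetical dataset; in a real project, target encoding must be computed on training folds only to avoid leakage:

```python
import pandas as pd

# Hypothetical data with a nominal attribute and a binary target.
df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "SF", "LA", "NYC"],
                   "converted": [1, 0, 1, 0, 0, 1]})

# Frequency encoding: replace each category with how often it occurs.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Target encoding: replace each category with the mean of the target.
# (Compute on training data only in practice, to avoid target leakage.)
target_means = df.groupby("city")["converted"].mean()
df["city_target_enc"] = df["city"].map(target_means)

print(df)
```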
- Tune Hyperparameters and Compare Models: ML practitioners build the first-cut versions of models using off-the-shelf (default) settings to get a feel for the best-fitting model; there is no hyperparameter optimization at this stage. The next important component is tuning the model hyperparameters. Hyperparameters are settings innate to a particular model which can be tuned to extract the best performance from that model; they determine how closely the model fits (or ‘learns’) the training dataset. For example, take the simple case of training a linear regression model with stochastic gradient descent (SGD). In this case the parameters of the model are the slope and the bias, and the learning rate is a hyperparameter. The best performing version of this model will have the “correct” combination of the three. Fortunately for us, there is usually more than one combination which results in the best performing model, or something very close to it. So how do we find a good combination? We implement a grid search that iterates across different values within set boundary conditions, evaluates every combination, and returns the one with the best validation score. Easy. However, there are certain nuances which call for the modeler’s attention. While finding the best combination of hyperparameters can certainly improve the model score, the model will also tend to fit the ‘noise’ in the training data. Modelers must prevent this from happening. This is done by adding a term to the loss function that penalizes overfit; this is called regularizing the machine learning model. Regularization also helps with model generalization and robustness. In the case of linear regression with SGD, commonly used regularization methods include the L1 (LASSO) and L2 (ridge) penalties. During the course of an ML project, the best scoring models are stored in “model depots” until a sufficient number of models are ready to be compared. Models are scored on the validation data and the best model is selected.
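A compact sketch of this step with scikit-learn: a grid search over learning rate, penalty type (L1/L2) and regularization strength for an SGD-trained linear regressor; the data is synthetic and the grid values are arbitrary placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration.
X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=42)

# Scale features, then fit a linear model with SGD.
pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=2000, random_state=42))

# Hyperparameter grid: learning rate, regularization type (L1/L2) and strength.
param_grid = {
    "sgdregressor__eta0": [0.001, 0.01, 0.1],
    "sgdregressor__penalty": ["l1", "l2"],
    "sgdregressor__alpha": [1e-4, 1e-3, 1e-2],
}

# Evaluate every combination with 5-fold cross-validation and keep the best.
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```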
- Model Selection and Deployment: Once the best ML model for the task is selected, the next step is to serialize it for interoperability. If the model is developed in a Python environment, this can be done using ‘pickling’, where the trained model is serialized to a file that can be deployed in a variety of OS environments. After ensuring model interoperability, the data engineers, software architects and application development teams, working together, would set up the model on a server or on cloud infrastructure to handle incoming requests. During the beta testing phase, the deployed model and its serving setup are refined to handle multiple concurrent requests efficiently and to ensure the required performance metrics are met. Model performance will be logged and monitored by the data science support team or the model maintenance team. When the model performs as expected, we have achieved successful model deployment.
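A minimal sketch of the serialization step using Python’s built-in pickle module; for large scikit-learn models, joblib.dump is a common alternative:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data, standing in for the selected best model.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize ("pickle") the trained model to a file for deployment.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, on the serving infrastructure, load the model and make predictions.
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:5]))
```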
(Note: This post was about handling ML problems on structured datasets. ML projects on structured data have mainstream applications across different types of businesses and are the most frequently encountered type of project. There are special cases of machine learning which involve processing images, text and audio, and which lie more in the domain of tech-heavy businesses. Those will be covered in a separate post.)
Written by
Prof. Abhijith Seetharam
Assistant Professor – Analytics & Data Science