Why is model tuning so important?
Implementing machine learning models, whether in a distributed big data ecosystem or on individual servers, has become extremely easy in the past few years, in no small measure thanks to high-level libraries such as sklearn in Python and deep learning interface libraries such as Keras running on TensorFlow, CNTK, or Theano.
Let's say that you want to run a nearest neighbor algorithm on a corpus of text documents to identify the documents most similar to each other. It takes only three lines of code if you use the sklearn library.
>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> nbrs = NearestNeighbors(n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None).fit(X)  # where X is your dataset
Code block 1
Parameters such as metric='minkowski', which you set before running the model, are referred to as hyperparameters.
metric='minkowski' is the vector distance metric, which defaults to Minkowski (with p=2, this is the Euclidean distance). There are at least 12 options here, including cosine, so you need to know something about these metrics to tune the model. The same goes for leaf_size, radius, and the rest, which accept an even wider range of values.
The default hyperparameter values set by sklearn's authors are usually appropriate for toy examples, but they can start producing out-of-memory errors and other failures when you move to real-world datasets.
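To see a hyperparameter change in action, here is a minimal sketch that swaps the default metric for cosine distance. The four documents are made up, and using CountVectorizer to turn text into vectors is an assumption for illustration, not part of the original three-line example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Tiny made-up corpus (an assumption for illustration only).
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets and stocks fell today",
]

# Turn each document into a sparse vector of word counts.
X = CountVectorizer().fit_transform(docs)

# metric is the hyperparameter being changed; brute-force search
# supports cosine distance and sparse input.
nbrs = NearestNeighbors(n_neighbors=1, metric="cosine", algorithm="brute").fit(X)

# Calling kneighbors() with no argument queries the training documents
# themselves and leaves each document out of its own neighbor list.
distances, indices = nbrs.kneighbors()
nearest = indices.ravel().tolist()
print(nearest)  # index of the most similar other document, per document
```

Here the two cat sentences pair up with each other, as do the two market sentences. On a real corpus the ranking can change noticeably with the metric, which is why it is worth tuning.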
Even if you get away with no errors and actually get the model to run with default hyperparameters, what’s the guarantee that the model you got is the best one?
For a quantitative firm such as ours, we need hard numbers and a process for identifying and tuning our models before we can say for sure that the model we are running is indeed the “best model”.
This is what separates an experienced data science consulting firm such as ours from an inexperienced freelancer who charges $5 on Fiverr to develop a machine learning model.
We will explain all the steps a good data science consultant should follow to tune the models and give you the best possible analysis of your dataset within the budget and time you set.
Define the problem area and the right data science model
In your initial consulting hour after you describe project requirements and show the consultant the dataset you want to work with, your data science consultant should tell you what kind of data science solution is appropriate for your project needs.
Features are independent variables that describe a given dataset (they are also known as predictors). For example, in the equation Y = a1x1 + a2x2 + ... + c, the independent variables x1, x2, etc. are called features. Targets or labels, Y in the equation above, are the dependent variables that the model predicts.
For example, if our dataset is a collection of text documents, then the individual words (known as “tokens”) are features, and whether a given document is spam or not spam is the label. Similarly, for an image, the pixel intensities are features, and the object in the image, such as a table or a chair, is the target variable.
Supervised learning: These training algorithms require both features and targets. Classification and regression are both types of supervised learning.
- Classification algorithms assign a new observation to one of a set of sub-populations, using a training set of observations (or instances) whose category membership is known. For example, categorizing a text document or an email as spam or not spam is a typical classification problem.
Unsupervised learning: These algorithms draw inferences from datasets containing only features, without any targets or labels.
- Clustering and density estimation algorithms are types of unsupervised learning algorithms. These typically try to group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters. The clusters can be anything; the canonical example is grouping a set of text documents into topic clusters.
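The difference between the two families can be sketched on a handful of made-up messages. The texts, the spam labels, and the choice of MultinomialNB and KMeans are all assumptions for illustration; any classifier or clustering algorithm would make the same point.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up messages; 1 = spam, 0 = not spam (labels are assumptions).
texts = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch meeting with the team",
]
labels = [1, 1, 0, 0]

# Features: one column per token, one row per message.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Supervised: the classifier is trained on features AND labels.
clf = MultinomialNB().fit(X, labels)
new_X = vectorizer.transform(["claim your free money", "team meeting tomorrow"])
predicted = clf.predict(new_X).tolist()
print(predicted)

# Unsupervised: the clustering algorithm sees only the features.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters.tolist())
```

Note that the cluster ids returned by KMeans are arbitrary group numbers; unlike the classifier's output, they carry no "spam" meaning until someone inspects the groups.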
What to optimize? Identifying the best algorithms and hyperparameter space
Next, ask your data science consultant how many hyperparameters the model that fits your dataset and domain area has.
They have already told you whether you have a supervised or an unsupervised problem; now is the time for them to tell you what kind of algorithms they are thinking of applying first.
With new algorithms and neural networks being published all the time, this is a rapidly expanding field. However, they should be able to tell you whether they plan to apply a shallow machine learning technique, such as a support vector machine or a random forest, or skip that and move directly to a deep learning model such as a recurrent neural network.
As a client, you should try to restrict your options here to 2-3 shallow techniques first, see their results, and use them as a baseline before moving to more computationally intensive (and expensive) but typically more accurate neural networks. If there is only a 2-3% improvement, then many of our clients forgo the additional accuracy to save computation time and money.
Let us take the example of clustering similar text documents using the nearest neighbor algorithm. Typically, text has to be preprocessed, tokenized, and transformed into a vector using a vectorization algorithm before you are ready to apply nearest neighbor algorithms.
Preprocessing: In this step, you are trying to get rid of punctuation and convert all word inflections into a root word. For example, the inflections of rain, such as raining and rained, all have the same semantic meaning, and you need to convert them to their root word before you start comparing sentences and documents for their similarity to one another.
The algorithms that convert words into root forms are called lemmatization (lemma for short) and stemming algorithms. Lemma algos typically give you real words, whereas stemming simply cuts off the last part of the word, so it's faster but less accurate.
There are a multitude of options out there, but among the most common are the Porter stemmer and the Snowball stemmer. The choice of the right stemmer/lemmatizer is a hyperparameter, and you have the option of choosing among 3-4 different lemma and stemming algorithms here; let's call them [p1,p2,p3,p4].
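As a rough illustration of treating the preprocessor as a hyperparameter, the sketch below compares two candidates on the rain example. The suffix-stripping function is a deliberately crude toy written for this example, not the Porter or Snowball algorithm; in practice you would plug real stemmers and lemmatizers into the same dictionary.

```python
def identity(word):
    """Candidate p1: leave the word untouched."""
    return word


def toy_suffix_stemmer(word, suffixes=("ing", "ed", "s")):
    """Candidate p2 (toy): cut off the first matching suffix,
    as long as at least three letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Each entry is one value of the "preprocessor" hyperparameter.
preprocessors = {"p1_none": identity, "p2_toy_stemmer": toy_suffix_stemmer}

words = ["rain", "raining", "rained", "rains"]
stems = {name: [fn(w) for w in words] for name, fn in preprocessors.items()}

for name, result in stems.items():
    print(name, result)
```

With the toy stemmer all four inflections collapse to the same root, so documents using any of them would be counted as sharing a word.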
Tokenizer: Converting documents into sets of words and phrases is called tokenization. The most important decision at this stage is the length of the phrases: you can choose length one (called a unigram) or phrases of two or three words, called bigrams and trigrams respectively. It's almost standard practice to run your hyperparameter optimization with at least three different n-gram ranges, such as (1,1), (1,2) and (1,3).
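The effect of the n-gram range can be seen on a single made-up sentence. This sketch assumes sklearn's CountVectorizer as the tokenizer, with its ngram_range parameter standing in for the three settings mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["the quick brown fox"]

# Candidate n-gram ranges to try during hyperparameter optimization.
ngram_ranges = [(1, 1), (1, 2), (1, 3)]

vocab_by_range = {}
for ngram_range in ngram_ranges:
    vectorizer = CountVectorizer(ngram_range=ngram_range).fit(sentence)
    # vocabulary_ maps each extracted word or phrase to a column index.
    vocab_by_range[ngram_range] = sorted(vectorizer.vocabulary_)
    print(ngram_range, vocab_by_range[ngram_range])
```

Each wider range adds phrases on top of the single words, so the feature space (and the training cost) grows with it.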
Count Vectorization: At this stage you are trying to convert the sets of words or phrases into numerical vectors. The hyperparameters to consider are:
- Maximum number of words: since this is a numeric variable that can range up to the number of unique words, I say pick 7-8 values such as (50, 500, 2500, 5000, 10,000, 50,000, none). none here just means that the entire vocabulary is considered.
- Stop word list: here you can try a custom list (of only a few words) that you want excluded from your similarity analysis. However, there are already multiple stop word lists out there from different domain areas, including technology, legal, and medical, as well as general-purpose ones. Let's say you use 4 different lists.
- max_df: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). Let's say you use 4 different thresholds here.
- min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. Use 4 different cutoffs here.
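The vectorizer settings listed above map onto sklearn's CountVectorizer parameters. A minimal sketch on a made-up four-document corpus follows; the corpus, the one-word stop list, and the threshold values are assumptions chosen so the effect of each setting is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "rain" appears in every document.
docs = [
    "rain in spain",
    "rain on the plain",
    "rain and sun in spain",
    "rain on sunday",
]

vectorizer = CountVectorizer(
    max_features=None,  # None keeps the entire remaining vocabulary
    stop_words=["on"],  # custom stop word list (hypothetical)
    max_df=0.75,        # drop terms found in more than 75% of documents
    min_df=2,           # drop terms found in fewer than 2 documents
)
vectorizer.fit(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

"rain" is dropped by max_df because it appears everywhere, the one-off words are dropped by min_df, and "on" is removed by the stop list, leaving only the terms that help tell documents apart.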
Nearest neighbor: You have 12 major choices available under the metric hyperparameter in code block 1 for computing document/resume similarity.
Total hyperparameter combinations: If you change one factor at a time per iteration, the number of runs is roughly the sum of the option counts discussed above (4 + 3 + 7 + 4 + 4 + 4 + 12 = 38). Testing every combination means multiplying them instead (4 × 3 × 7 × 4 × 4 × 4 × 12 = 64,512). Ideally you want to go through all of them and accumulate the results, so that you can find the combination that gives your model the highest predictive power.
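As a rough check on the size of the search space, the sketch below tallies the option counts suggested above and compares changing one factor at a time with testing every combination. The dictionary keys are labels made up for this example, and 7 values are assumed for the maximum number of words.

```python
import math

# Number of candidate values per hyperparameter, taken from the text.
option_counts = {
    "stemmer_or_lemmatizer": 4,
    "ngram_range": 3,
    "max_words": 7,
    "stop_word_list": 4,
    "max_df": 4,
    "min_df": 4,
    "metric": 12,
}

# Changing one factor at a time: roughly one run per candidate value.
one_factor_at_a_time = sum(option_counts.values())

# Exhaustive search: every value combined with every other value.
full_grid = math.prod(option_counts.values())

print(one_factor_at_a_time, full_grid)
```

The gap between the two numbers is the reason the search strategy matters: one-factor-at-a-time is cheap but misses interactions between hyperparameters, while the full grid covers everything at a much higher cost.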
How to optimize? Implementing hyperparameter optimization to find the best combination for your use case
An exhaustive search where you go through all combinations is known as grid search. For complicated models, the number of hyperparameter combinations may run into 10,000-100,000, and at that point it's almost impossible to go through all of them. Instead, you can sample the hyperparameter grid at random and evaluate only a subset of the possible combinations; this is known as random search.
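The two strategies can be sketched with sklearn's ParameterGrid and ParameterSampler helpers. The search space below is a hypothetical, cut-down version of the one discussed earlier; the keys are just labels, and scoring each candidate is left out.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space; keys and values are for illustration only.
search_space = {
    "ngram_range": [(1, 1), (1, 2), (1, 3)],
    "max_features": [50, 500, 2500, 5000, 10000, 50000, None],
    "min_df": [1, 2, 5, 10],
    "metric": ["cosine", "euclidean", "manhattan"],
}

# Grid search: enumerate every combination.
grid = ParameterGrid(search_space)
n_grid = len(grid)

# Random search: draw a fixed budget of combinations at random.
random_candidates = list(
    ParameterSampler(search_space, n_iter=20, random_state=0)
)

print(n_grid, len(random_candidates))
print(random_candidates[0])
```

Each candidate is a plain dictionary, so the same training-and-scoring loop can be run over either the full grid or the random sample; only the budget changes.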
What to optimize for? Capturing the performance of your data science model
If we were working on a standard supervised learning problem, the answer would perhaps be F-score, precision-recall, R^2, or RMSE, evaluated with cross-validation.
If it is a clustering-based unsupervised learning problem, I would recommend metrics such as Fowlkes-Mallows or Calinski-Harabasz (even the simple Silhouette score works well).
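All three clustering metrics are available in sklearn.metrics. The sketch below computes them on synthetic, well-separated blobs; the data, the cluster centers, and the use of KMeans are assumptions for illustration. Note that Fowlkes-Mallows compares predicted clusters against known ground-truth labels, whereas Silhouette and Calinski-Harabasz need only the features and the predicted clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    fowlkes_mallows_score,
    silhouette_score,
)

# Synthetic data: three tight, well-separated groups of points.
X, true_labels = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
predicted = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: features + predicted clusters only.
silhouette = silhouette_score(X, predicted)         # ranges from -1 to 1
calinski = calinski_harabasz_score(X, predicted)    # higher is better

# External metric: requires ground-truth labels.
fowlkes = fowlkes_mallows_score(true_labels, predicted)  # ranges from 0 to 1

print(silhouette, calinski, fowlkes)
```

On real text data the scores will be far less clean than on these blobs, so the useful signal is how the score moves as you change hyperparameters, not its absolute value.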
The top metric we use in text analytics and search engine evaluation is Mean Average Precision (MAP) at n results. Precision here is the fraction of retrieved recommendations that are relevant. But calculating just that is not useful if the relevant result is the 100th result, on the 10th page.
Hence we restrict MAP to the relevant recommendations among the first n results. As you have probably guessed from this definition, someone has to manually mark recommendation results as relevant or not relevant, and we use those ground-truth labels to compute MAP.
Selecting a good metric to optimize is vital if you want to see any noticeable improvement in the predictive power of your model. This topic could fill a textbook, although a good place to start would be this intro to machine learning metrics article.
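A minimal sketch of MAP at n follows. Definitions of average precision at n vary in how they normalize; this version assumes the precision values are averaged over the relevant results found within the first n, which is one common convention and not necessarily the one used in any particular tool. The relevance judgments are made up.

```python
def average_precision_at_n(relevance, n):
    """relevance: 0/1 judgments for one query, in ranked order.
    Averages precision@k over the ranks k <= n that hold a relevant result."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:n], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_n(relevance_lists, n):
    """Mean of the per-query average precision values."""
    scores = [average_precision_at_n(r, n) for r in relevance_lists]
    return sum(scores) / len(scores)


# Made-up manual judgments for two queries (1 = relevant, 0 = not relevant).
judgments = [
    [1, 0, 1, 0, 0],  # relevant results at ranks 1 and 3
    [0, 1, 0, 0, 1],  # relevant results at ranks 2 and 5
]

map_at_3 = mean_average_precision_at_n(judgments, n=3)
print(round(map_at_3, 4))
```

The relevant result at rank 5 in the second query contributes nothing to MAP at 3, which is exactly the behavior described above: results the user never sees do not count.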
This is just a primer on the most important factors when tuning your data science model. A good data science consultant should also tell you about the tradeoffs of performance vs speed in a production setting and steer you towards a solution tailored to your needs.
If all you want is a content recommendation system for your ecommerce site or blog, then there is little point in running GPU-based AWS servers to train your models and CPU-based servers for inference. A much better idea is to stick with low-cost, CPU-based models for both training and inference.