Choosing an ML estimator
Approaching the problem
Problem
Do you want to predict a category (classification), predict a quantity (regression), detect an anomaly (anomaly detection), find relationships between variables in different databases (association rules or recommendation), or discover structure in unexplored data (clustering)?
Available resources
The quantity, quality, and variety of the available data play an important role. A common rule of thumb is that with more than 100,000 data points you can apply almost any algorithm (SAP Conversational AI, n.d.). The number of classes and the amount of labeled data also matter.
According to scikit-learn, at least 50 samples are needed to start with an ML approach.[1]
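The problem types listed earlier and the 50-sample minimum can be combined into a first triage step. The sketch below is only an illustration: the `triage` function, the goal keywords, and the `TASK_FOR_GOAL` mapping are hypothetical names introduced here, not part of scikit-learn.

```python
# Minimum sample count cited in the text (attributed to scikit-learn).
MIN_SAMPLES = 50

# Hypothetical mapping from the goal described in the text to a task family.
TASK_FOR_GOAL = {
    "category": "classification",
    "quantity": "regression",
    "anomaly": "anomaly detection",
    "relationship": "association rules or recommendation",
    "structure": "clustering",
}


def triage(goal: str, n_samples: int) -> str:
    """Return the task family for a goal, or a note to collect more data."""
    if goal not in TASK_FOR_GOAL:
        raise ValueError(f"unknown goal: {goal!r}")
    if n_samples < MIN_SAMPLES:
        return "collect more data"
    return TASK_FOR_GOAL[goal]


print(triage("category", 1_000))
print(triage("structure", 20))
```

A real selection process would go on to weigh data quality, the number of classes, and the constraints discussed next; this only captures the first two questions.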
Constraints
What is your data storage capacity? Depending on your system's storage capacity, you might not be able to store gigabytes of classification or regression models, or gigabytes of data to cluster. This is the case, for instance, in embedded systems.
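One rough way to check the storage question is to serialize a trained model and compare its size with the space you have. The sketch below uses Python's standard `pickle` module; the `toy_model` dictionary and the 64 KiB budget are made-up stand-ins for a real model and a real device limit.

```python
import pickle


def serialized_size_bytes(model) -> int:
    """Size in bytes of the model once pickled."""
    return len(pickle.dumps(model))


def fits_budget(model, budget_bytes: int) -> bool:
    """True if the pickled model fits within the storage budget."""
    return serialized_size_bytes(model) <= budget_bytes


# Hypothetical stand-in for a trained model: a plain dict of parameters.
toy_model = {"weights": [0.0] * 10_000, "bias": 0.0}

# Hypothetical storage budget for an embedded target.
budget = 64 * 1024

size = serialized_size_bytes(toy_model)
print(f"model size: {size} bytes, fits budget: {fits_budget(toy_model, budget)}")
```

The pickled size is only a proxy: the on-device format and the memory needed at prediction time can differ.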
Does the prediction have to be fast? In real-time applications, predictions must be produced as quickly as possible. In autonomous driving, for instance, road signs must be classified fast enough to avoid accidents.
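Prediction speed is easy to measure before committing to a model. The following is a minimal sketch using only the standard library; `toy_predict` and the 10 ms budget are hypothetical placeholders for a real model's prediction call and a real application requirement.

```python
import statistics
import time


def median_latency_ms(predict, sample, n_runs: int = 100) -> float:
    """Median wall-clock time of predict(sample), in milliseconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def toy_predict(features):
    # Hypothetical stand-in for a real model's prediction function.
    return sum(features) > 0


# Hypothetical latency requirement for the application.
LATENCY_BUDGET_MS = 10.0

latency = median_latency_ms(toy_predict, [0.1] * 1000)
print(f"median latency: {latency:.4f} ms, "
      f"within budget: {latency <= LATENCY_BUDGET_MS}")
```

The median is used rather than the mean so that a few slow runs do not dominate the figure; for hard real-time systems the worst case matters more.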
Does learning have to be fast? In some circumstances, training models quickly is necessary: sometimes you need to update your model rapidly, on the fly.
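Updating a model on the fly usually means learning incrementally, one sample or small batch at a time, instead of retraining from scratch. As a sketch of the idea, here is a tiny linear regressor updated by stochastic gradient descent. The `OnlineLinearRegressor` class is a hypothetical illustration written for this article; in scikit-learn, some estimators such as `SGDClassifier` offer a comparable `partial_fit` method.

```python
class OnlineLinearRegressor:
    """Linear model updated one sample at a time with SGD on squared error."""

    def __init__(self, n_features: int, learning_rate: float = 0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def update(self, x, y):
        """Take one gradient step using a single (x, y) sample."""
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Stream of toy samples drawn from y = 2x + 1 (made-up data).
model = OnlineLinearRegressor(n_features=1)
for step in range(5000):
    x = [(step % 10) / 10.0]
    model.update(x, 2.0 * x[0] + 1.0)

print(f"weight: {model.weights[0]:.3f}, bias: {model.bias:.3f}")
```

Each update costs a handful of arithmetic operations, so the model can keep up with incoming data, at the price of being limited to algorithms that support this style of training.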