Artificial Intelligence (AI) and the various learning algorithms associated with it may still seem obscure and mysterious to non-experts. Yet their presence in our daily lives is becoming more and more common, from navigation systems to sentence autocompletion in messaging services or recommendations in your favorite streaming app. This perceived opacity is generally explained by a lack of understanding of these technologies, and in particular of how they work. Among other things, it is not always clear what is needed to train and build these kinds of algorithms, both in terms of data and infrastructure. In this article, we will briefly clarify the actual data needs of an “AI algorithm”. Note, however, that there is no universal rule for this purpose: the answer depends on many factors, such as the problem at hand (weather forecasting, automatic filling of tax documents, prediction of the shortest route in a navigation application) or the type of learning algorithm envisaged (linear regression, classical machine learning, neural networks, etc.).
As mentioned before, the complexity of the problem is the first factor affecting the amount of data needed to achieve a given accuracy. The main purpose of a learning algorithm is to capture the relationships between the different features (attributes) of the training data. The more complicated those relationships, the more training samples the algorithm needs. For example, predicting the number of people on holiday according to the time of year and identifying various impurities and defects on the surface of metallurgical products do not require the same amount of training data.
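A simple way to get a concrete feel for this on your own problem is to plot a learning curve. The sketch below is only an illustration, assuming scikit-learn is installed and using a synthetic dataset as a stand-in for real data: it shows how cross-validated accuracy evolves as the training set grows.

```python
# Sketch: estimate how accuracy depends on training-set size with a learning curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic classification problem standing in for your real dataset.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% to 100% of the training folds
    cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} training samples -> mean CV accuracy {score:.3f}")
```

If the validation score is still climbing at the largest size, collecting more data is likely to pay off; if it has plateaued, extra observations bring diminishing returns.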
The complexity of the algorithm itself is also an important factor. Artificial intelligence ranges from simple algorithms such as linear regression, through more “classical” machine learning algorithms such as random forests, to complex algorithms such as neural networks. The more complex the algorithm, the greater the amount of data required to train it. The same applies to the architecture of neural networks, whose complexity is reflected in the number of layers of neurones and the number of neurones per layer.
Even if there is no absolute truth for the amount of data required, there are basic rules to estimate the order of magnitude of the appropriate sample size. One of the most widespread (but controversial, and rightly so) is the rule of 10: consider that a model needs about ten times more observations than it has degrees of freedom (a degree of freedom can be a parameter of the model, a feature of the data, or a class in the case of classification). If we apply this principle to linear problems, the number of degrees of freedom is essentially the number of attributes of the data (the dimensionality of the problem). In a neural network, on the other hand, the number of degrees of freedom grows considerably, due to the high number of parameters to be estimated. This number varies according to the exact type of algorithm, but in any case depends on the number of layers and the size of each layer of neurones. For a complex problem such as high-definition object recognition, thousands of images per class can easily be required.
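As a back-of-the-envelope illustration of this heuristic (a sketch, not a sizing guarantee; the layer sizes below are invented for the example), one can count a model's parameters and multiply by ten:

```python
# Sketch: apply the "rule of 10" to a linear model and a small fully connected network.

def linear_model_parameters(n_features: int) -> int:
    """One weight per feature plus an intercept."""
    return n_features + 1

def mlp_parameters(layer_sizes: list[int]) -> int:
    """Weights + biases of a fully connected network, e.g. [20, 64, 32, 3]."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

for name, dof in [
    ("Linear regression, 20 features", linear_model_parameters(20)),
    ("Small MLP 20-64-32-3", mlp_parameters([20, 64, 32, 3])),
]:
    print(f"{name}: ~{dof} degrees of freedom -> rule of 10 suggests ~{10 * dof} samples")
```

With roughly 3,500 parameters, the small network in this example would already call for tens of thousands of observations under the rule of 10, whereas the linear model would only call for a couple of hundred.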
As one might easily guess, the richer and more diverse the dataset, the better, in general. However, alternatives exist if you do not have enough data to train your algorithm to reach high accuracy levels. Among these, one can leverage pre-trained models by extending a model trained for a generic task to fulfil a new (but sufficiently similar) task through a process called “transfer learning”. For example, a pre-trained model for image-based object edge or text detection could be refined for license plate detection. Other techniques include the use of data coming from external (possibly public) data sources (e.g., weather data), or the use of data augmentation methods to generate synthetic data based on transformations of real data (e.g., enriching image datasets with randomly cropped, flipped or rotated images). This last method is regularly used for object recognition/detection problems, for which it is not always easy to collect a large labelled database.
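To illustrate that last point, the sketch below generates several synthetic variants of a single image through flips, rotations and random crops. It uses plain NumPy with a random array standing in for a real labelled photo; dedicated libraries such as torchvision or Keras offer richer, ready-made augmentation pipelines.

```python
# Sketch: simple image augmentation with NumPy (flip, rotate, random crop).
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly flipped, rotated and cropped copy of an (H, W, C) image."""
    out = image
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)               # horizontal flip
    out = np.rot90(out, k=rng.integers(0, 4))    # rotate by 0/90/180/270 degrees
    h, w = out.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)          # crop to 90% of the original size
    top, left = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    return out[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
image = rng.random((128, 128, 3))                     # stand-in for a real photo
augmented = [augment(image, rng) for _ in range(5)]   # five synthetic variants
```

Each variant keeps the label of the original image, so a modest labelled set can be stretched into a much larger effective training set.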
Last but not least, one has to remember that quality is at least as important as quantity (if not more so). A hundred clean and varied observations will always be much better than a million very noisy or perfectly uniform ones, hence the distinction between “data” and “information”. Adding data, even of good quality, can sometimes lead to a saturation in model performance, or even to a reduction of it (e.g., due to class imbalance).
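The class-imbalance caveat is easy to illustrate with a toy example (the 95%/5% split below is made up): on a heavily imbalanced dataset, a “model” that always predicts the majority class scores an impressive accuracy while carrying no information at all.

```python
# Toy sketch: accuracy alone rewards a model that ignores the rare class entirely.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.05).astype(int)   # ~5% positive class
y_pred = np.zeros_like(y_true)                     # always predict the majority class

print(f"Accuracy:          {accuracy_score(y_true, y_pred):.3f}")           # ~0.95
print(f"Balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")  # 0.50
```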
In conclusion, the complexity of the problem and the number of explanatory variables are decisive criteria in determining the amount of data needed, but other aspects, such as data quality and variety, also play a key role. While rules of thumb can provide rough estimates, nothing can (at this stage) replace human expertise in that process.
Wondering how AI can bring you new opportunities in your business? Don't hesitate to contact our AI specialists.