What happens when we run out of data for AI models

Join top executives in San Francisco on July 11-12 to hear how leaders are integrating and optimizing AI investments for success. Learn more

Large Language Models (LLMs) are among the most important innovations today. With companies like OpenAI and Microsoft launching impressive new NLP systems, no one can deny the importance of access to vast amounts of quality data.

However, according to recent research by Epoch, we may soon run short of data to train AI models. The team investigated the amount of high-quality data available on the internet ("high quality" meaning resources like Wikipedia, as opposed to low-quality data such as social media posts).

The analysis shows that high-quality data will be exhausted soon, probably before 2026. While low-quality data sources may last decades longer, it is clear that the current trend of endlessly scaling models to improve results could slow down soon.

Machine learning (ML) models are known to improve their performance with an increase in the amount of data on which they are trained. However, simply feeding more data to a model is not always the best solution. This is especially true in the case of rare events or niche applications. For example, if we want to train a model to detect a rare disease, we may have very little data to work with. Yet we still want such models to become more accurate over time.



This suggests that if we want to prevent technological development from slowing down, we need to develop other paradigms for building machine learning models that are independent of the amount of data.

In this article, we will look at what these approaches are and weigh their pros and cons.

The limitations of scaling AI models

One of the biggest challenges of scaling machine learning models is the diminishing returns of increasing model size. As the size of a model continues to grow, the improvement in its performance becomes marginal. This is because the more complex the model becomes, the more difficult it is to optimize and the more prone it is to overfitting. Also, larger models require more computational resources and time to train, making them less practical for real-world applications.
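The diminishing returns described above can be illustrated with a hypothetical power-law scaling curve, in the spirit of published scaling laws of the form loss ≈ a·N^(−α) + c. Note that the constants below are made up for demonstration, not fit to any real model:

```python
def loss(n_params: float, a: float = 406.4, alpha: float = 0.34, c: float = 1.69) -> float:
    """Hypothetical loss as a function of parameter count (illustrative constants)."""
    return a * n_params ** -alpha + c

# Each 10x increase in model size buys a smaller absolute improvement.
sizes = [1e8, 1e9, 1e10, 1e11]
gains = [loss(sizes[i]) - loss(sizes[i + 1]) for i in range(len(sizes) - 1)]
print([round(g, 4) for g in gains])  # each gain is smaller than the last
```

Because the exponent is negative and less than one in magnitude, every additional order of magnitude of parameters shaves off less loss than the previous one, which is exactly the marginal-improvement problem the paragraph describes.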

Another important limitation of scaled models is the difficulty of guaranteeing their robustness and generalizability. Robustness refers to the ability of a model to perform well even when faced with noisy or adversarial inputs. Generalizability refers to the ability of a model to perform well on data it has not seen during training. As models become more complex, they become more susceptible to adversarial attacks, making them less robust. Also, larger models tend to memorize training data instead of learning the underlying patterns, leading to poor generalization performance.
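The memorization effect can be shown with a classic toy example (my own illustration, not from the article): a high-degree polynomial drives training error to nearly zero by fitting the noise in the data, while a simpler model captures the underlying pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)  # noisy training set
x_new = np.linspace(0.03, 0.97, 50)                     # unseen inputs
y_new = np.sin(2 * np.pi * x_new)                       # true underlying pattern

def fit_errors(degree: int) -> tuple[float, float]:
    """Mean squared error on the training points and on unseen inputs."""
    coeffs = np.polyfit(x, y, degree)
    train = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    test = float(np.mean((np.polyval(coeffs, x_new) - y_new) ** 2))
    return train, test

simple_train, simple_test = fit_errors(3)
complex_train, complex_test = fit_errors(14)
# The degree-14 model "memorizes": its training error is far below the
# simpler model's, but that advantage does not carry over to unseen data.
```

The over-parameterized fit passes through the noisy training points, so its near-zero training error says nothing about how it behaves between and beyond them.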

Interpretability and explainability are essential to understanding how a model makes predictions. However, as models become more complex, their inner workings become increasingly opaque, making their decisions more difficult to interpret and explain. This lack of transparency can be problematic in critical applications like healthcare or finance, where the decision-making process needs to be explainable and transparent.

Alternative approaches to building machine learning models

One approach to overcoming the problem would be to reconsider what we classify as high-quality and low-quality data. According to Swabha Swayamdipta, a professor of machine learning at the University of Southern California, creating more diversified training datasets could help overcome the limitations without reducing quality. Also, according to her, training a model on the same data more than once could help reduce costs and reuse data more efficiently.

These approaches might postpone the problem, but the more times we use the same data to train a model, the more prone it becomes to overfitting. We need strategies that overcome the data problem in the long term. So what are some alternatives to simply feeding more data into a model?

JEPA (Joint Embedding Predictive Architecture) is a machine learning approach proposed by Yann LeCun that differs from traditional methods in what the model is asked to predict.

In traditional generative approaches, the model is trained to reconstruct or predict raw data, which forces it to capture every low-level detail, including details that are noisy or irrelevant to the task. In JEPA, by contrast, an encoder maps inputs to abstract representations (embeddings), and a predictor learns to predict the embedding of one part of the input (the target) from the embedding of another part (the context). Because prediction happens in representation space rather than in raw data space, the model can ignore unpredictable detail, handle complex, high-dimensional data, and potentially learn more from less of it.
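A heavily simplified sketch of the embedding-prediction idea behind LeCun's JEPA: a predictor is fit to map the embedding of a context view to the embedding of a correlated target view. Everything here is a toy assumption, and the encoders are fixed random linear maps purely for demonstration; real JEPA variants (e.g. I-JEPA) learn deep encoders jointly with the predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
context = rng.normal(size=(500, 4))                        # one "view" of each sample
M = rng.normal(size=(4, 4))
target = context @ M + 0.05 * rng.normal(size=(500, 4))    # a correlated second view

enc_c = rng.normal(size=(4, 3))            # context encoder (fixed, random)
enc_t = rng.normal(size=(4, 3))            # target encoder (fixed, random)
z_c, z_t = context @ enc_c, target @ enc_t # embeddings, not raw data

# Fit the predictor in embedding space by least squares.
W, *_ = np.linalg.lstsq(z_c, z_t, rcond=None)
residual = float(np.mean((z_c @ W - z_t) ** 2))
baseline = float(np.mean((z_t - z_t.mean(axis=0)) ** 2))   # predict the mean embedding
```

The predictor never touches the raw target values; it only has to account for the structure that survives the encoding, which is the point of predicting in representation space.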

Another approach is to use data augmentation techniques. These techniques involve modifying existing data to create new data. This can be done by flipping, rotating, cropping, or adding noise to the images. Data augmentation can reduce overfitting and improve the performance of a model.
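The transforms listed above can be sketched in a few lines of NumPy. This is a minimal illustration on a synthetic grayscale "image"; each transform yields a new, label-preserving training example from an existing one:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32)).astype(np.float32)  # one grayscale "image"

augmented = [
    np.fliplr(image),                         # horizontal flip
    np.rot90(image),                          # 90-degree rotation
    image[4:28, 4:28],                        # center crop
    image + rng.normal(0, 5.0, image.shape),  # additive Gaussian noise
]
# One original image now yields several distinct training examples.
```

Production pipelines typically apply such transforms randomly on the fly during training rather than materializing the augmented copies up front.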

Finally, you can use transfer learning. This involves taking a previously trained model and fine-tuning it for a new task. This can save time and resources, since the model has already learned valuable features from a large dataset. The pretrained model can be fine-tuned with a small amount of data, making it a good solution for data-scarce problems.
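The transfer-learning recipe can be sketched as follows. This is a toy stand-in, not a real pretrained network: the "pretrained" feature extractor below is just a frozen random projection representing layers learned elsewhere on a large dataset, and only a small linear head is fit on a handful of new-task examples:

```python
import numpy as np

def pretrained_features(x: np.ndarray) -> np.ndarray:
    """Frozen feature extractor; stands in for a network trained on big data."""
    W = np.random.default_rng(7).normal(size=(x.shape[1], 16))  # fixed weights
    return np.tanh(x @ W)

rng = np.random.default_rng(2)
x_small = rng.normal(size=(12, 8))           # only 12 labeled examples
y_small = (x_small[:, 0] > 0).astype(float)  # the new task's labels

feats = pretrained_features(x_small)
head, *_ = np.linalg.lstsq(feats, y_small, rcond=None)  # train only the head

preds = (pretrained_features(x_small) @ head > 0.5).astype(float)
train_accuracy = float(np.mean(preds == y_small))
```

Because the extractor stays frozen, only a 16-parameter head is learned, which is why a dozen examples suffice here; in practice one would also hold out data to check that the head generalizes.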


Today we can still use data augmentation and transfer learning, but these methods do not solve the problem once and for all. That is why we need to keep searching for methods that can help us overcome it in the long run. We still don't know exactly what the solution might be. After all, a human being needs to see only a couple of examples to learn something new. Maybe one day we'll invent AI that can do that too.

What is your opinion? What would your company do if it ran out of data to train its models?

Ivan Smetannikov is the data science team leader at Serokell.


Welcome to the VentureBeat community!

DataDecisionMakers is where experts, including the technical people doing data work, can share data-related insights and innovation.

If you want to read about cutting-edge ideas and up-to-date information, best practices, and the future of data and data technology, join us at DataDecisionMakers.

You might even consider contributing an article of your own!

Read more from DataDecisionMakers

