Mon Oct 14 2024

The Opportunities and Risks of Synthetic Data

Large tech companies and startups are increasingly using synthetic data to train their artificial intelligence models. However, this strategy carries certain risks.

The possibility of training an artificial intelligence (AI) system exclusively with data generated by another AI has been debated for some time. While it may initially seem far-fetched, the approach has gained traction as real data becomes harder to obtain. Anthropic used synthetic data to train its Claude 3.5 Sonnet model, and Meta fine-tuned its Llama 3.1 models with AI-generated data. Furthermore, OpenAI is rumored to be sourcing reasoning-focused synthetic data from its o1 model for the upcoming Orion model.

To understand why AI requires data, it helps to recognize that AI systems are statistical machines: they learn patterns from many examples and use those patterns to make predictions, such as learning that in emails the phrase "to whom" is typically followed by "it may concern." Annotations, the textual labels that describe the meaning or parts of the data, are critical to this process. For example, if a photo classification model is shown many images of kitchens labeled "kitchen," it will learn to associate that word with the typical features of such spaces.
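The pattern-learning idea above can be sketched with a toy next-word predictor. This is a minimal illustration, not how production models work: it simply counts which word follows which in a few hypothetical example emails (the corpus is invented for the sketch) and predicts the most frequent continuation.

```python
from collections import Counter, defaultdict

# Hypothetical training examples: a few email openings (illustrative only).
emails = [
    "to whom it may concern",
    "to whom it may concern i am writing about my order",
    "dear team to whom it may concern",
]

# Learn, for every word, which word most often follows it (a bigram count).
following = defaultdict(Counter)
for email in emails:
    words = email.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often after `word` in the examples."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(predict_next("whom"))  # -> it
```

The same statistical principle, scaled up enormously and applied to labeled examples rather than raw bigrams, is what makes annotations so valuable: the labels tell the model which patterns go with which meaning.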

As a result of the growing demand for AI, the market for data annotation services has also expanded, currently valued at $838.2 million, with projections suggesting it will reach $10.34 billion within ten years. While it is difficult to estimate precisely how many people do this work, estimates put the number in the millions. Many companies rely on workers from data annotation firms to create labels for their training sets. Pay varies significantly, however: specialized sectors offer well-paying jobs, while others pay very low wages, especially in developing countries.

There are humanitarian reasons, but also practical ones, driving the search for alternatives to human-generated labels. Humans can only label data so quickly, and annotators may introduce biases that carry over into the models trained on their annotations. Moreover, accessing real data has become costly and difficult, both because of stronger personal-data protections and because data owners, wary of plagiarism or of going uncredited, are imposing restrictions. It is estimated that more than 35% of the top 1,000 websites have blocked OpenAI's crawlers.

In light of this scarcity, synthetic data could be a solution. According to some experts, if "data is the new oil," synthetic data can be thought of as biofuel: generated without the negative side effects of real data. The AI industry has embraced this approach. For example, Writer, a generative AI company, introduced a model called Palmyra X 004 trained almost entirely on synthetic data, at a development cost significantly lower than that of comparable OpenAI models.

However, the use of synthetic data is not without risks. Synthetic data can inherit the limitations and biases of the original data used to train the models that generate it: if certain groups are underrepresented in the source data, they will be misrepresented in the synthetic output as well, undermining the quality and diversity of the resulting models. Research has shown that over-reliance on synthetic data can reduce both the diversity and accuracy of models over successive training iterations.
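The diversity loss this research describes can be illustrated with a toy simulation (an illustrative sketch under simplified assumptions, not any lab's actual experiment). Here a "model" that merely estimates token frequencies is retrained, generation after generation, only on its own synthetic output. Once a rare token fails to appear in one generation's output, the next model assigns it zero probability and it vanishes forever, so the vocabulary can only shrink.

```python
import random
from collections import Counter

def train_on_samples(samples):
    """'Train' a toy model: estimate token probabilities by counting."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(model, n, rng):
    """Draw n synthetic tokens from the fitted model."""
    tokens = list(model)
    weights = [model[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)

rng = random.Random(42)

# "Real" data: 50 token types with a long-tailed (Zipf-like) frequency profile.
vocab = [f"tok{i}" for i in range(1, 51)]
real_weights = [1 / i for i in range(1, 51)]
data = rng.choices(vocab, weights=real_weights, k=200)

diversity = [len(set(data))]
# Each generation is trained solely on the previous generation's output.
for _ in range(30):
    model = train_on_samples(data)
    data = generate(model, 200, rng)
    diversity.append(len(set(data)))

print(f"distinct token types: {diversity[0]} -> {diversity[-1]}")
```

Because each generation can only emit tokens the previous one produced, the count of distinct types is monotonically non-increasing, and rare types disappear first, which mirrors how over-reliance on synthetic data erodes diversity across training iterations.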

It is crucial that synthetic data be properly reviewed, curated, and filtered, and supplemented with real-world data, to prevent potential model failure. While some experts believe AI models will eventually generate synthetic data good enough to train themselves, no technology currently does this reliably. Human intervention therefore remains critical to ensure that model training does not drift from its purpose.