Synthetic data is a dangerous teacher

In April 2022, when DALL-E, a vision-language text-to-image model, was released, it reportedly attracted over a million users within its first three months. This was followed by ChatGPT, in January 2023, which apparently reached 100 million monthly active users just two months after launch. Both mark significant moments in the development of generative AI, which in turn has triggered an explosion of AI-generated content on the web. The bad news is that, in 2024, this means we will see a proliferation of fabricated nonsense and misinformation, and the amplification of the negative social stereotypes encoded in these AI models.

The generative AI revolution has not been spurred by any recent theoretical breakthrough (in fact, most of the foundational work behind artificial neural networks dates back decades) but by the "availability" of massive datasets. Ideally, an AI model captures a given phenomenon, whether human language, cognition, or the visual world, as faithfully as possible.

For a large language model (LLM) to produce humanlike text, for example, it must be fed huge volumes of data that somehow represent human language, interaction, and communication. The prevailing belief is that the bigger the dataset, the better it captures human affairs in all their inherent beauty, ugliness, and even cruelty. We are in an era obsessed with scaling models, datasets, and GPUs: current LLMs have entered the territory of trillion-parameter models, which demand billion-scale datasets. Where can such data be found? On the web.

This web-sourced data is assumed to capture the "ground truth" of human communication and interaction, a proxy from which language and behavior can be modeled. Although numerous researchers have now shown that online datasets are often of poor quality, tend to exacerbate negative stereotypes, and contain problematic content such as racial slurs and hateful speech, often directed at marginalized groups, this has not stopped the big AI companies from using such data in the race to scale.

With generative AI, this problem is getting worse. Rather than representing the social world in an objective way from input data, these models encode and amplify societal stereotypes. Indeed, recent work shows that generative models encode and reproduce racist and discriminatory attitudes toward historically marginalized identities, cultures, and languages.

It is difficult, if not impossible, even with state-of-the-art detection tools, to know for certain how much text, image, audio, and video data is currently being generated, and at what pace. Stanford researchers Hans Hanley and Zakir Durumeric estimate a 68 percent increase in the number of synthetic articles posted to Reddit and a 131 percent increase in misinformation news articles between January 1, 2022 and March 31, 2023. Boomy, an online music-generation company, claims to have produced 14.5 million songs (or 14 percent of recorded music) so far. In 2021, Nvidia predicted that by 2030 there will be more synthetic data than real data in AI models. One thing is certain: the web is being swamped by synthetically generated data.

The worrying thing is that these vast quantities of generative AI output will, in turn, be used as training material for future generative AI models. As a result, in 2024, a very significant portion of the training material for generative models will itself be synthetic data produced by generative models. Soon, we will be trapped in a recursive loop in which we train AI models using only synthetic data produced by AI models. Most of it will be contaminated with stereotypes that continue to amplify historical and social inequities. Unfortunately, this will also be the data used to train generative models applied to high-stakes sectors such as medicine, therapy, education, and law. We have yet to grapple with the disastrous consequences of this. By 2024, the generative AI explosion we find so fascinating now risks becoming instead a massive toxic dump that comes back to bite us.
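The recursive loop described above can be made concrete with a toy simulation (this sketch is not from the article; the model, sample sizes, and generation counts are illustrative assumptions). A trivial "model" repeatedly fits a Gaussian to its data, generates synthetic samples from the fit, and then refits to those samples. Over many generations the fitted distribution degenerates, a failure mode the research literature often calls "model collapse":

```python
import random
import statistics

def fit_and_generate(data, n_samples, rng):
    """Fit a toy 'model' (a Gaussian) to data, then generate synthetic samples from it."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]

rng = random.Random(0)
real_data = [rng.gauss(0.0, 1.0) for _ in range(100)]  # generation 0: "real" data

data = real_data
stds = []
for generation in range(1000):
    # Each generation is trained only on the previous generation's synthetic output.
    data = fit_and_generate(data, 100, rng)
    stds.append(statistics.pstdev(data))

# The spread of the data collapses across generations: the model gradually
# forgets the tails of the original distribution it was meant to capture.
print(f"std of real data:           {statistics.pstdev(real_data):.3f}")
print(f"std after 1000 generations: {stds[-1]:.3f}")
```

Even this idealized loop, with no stereotypes or misinformation involved at all, loses information at every generation; real generative models trained on web-scraped synthetic content face the same dynamic with far messier data.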

Source: https://www.wired.com/story/synthetic-data-is-a-dangerous-teacher/
