The world’s fascination with artificial intelligence (AI) is undeniable. However, it is crucial to examine the ingredients behind these AI marvels. While tech companies proudly showcase their generative AI models, such as the large language models (LLMs) that can compose fluent sentences, they tend to be vague about their production processes.
The typical narrative revolves around the technology’s brilliance. The machines are trained on a massive dataset composed of human-published content in machine-readable form. The training process combines powerful algorithms, including the “transformer” architecture developed at Google, with neural networks. These components enable the machines to predict the most likely next word in a sentence, essentially functioning as “stochastic parrots.”
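The “predict the most likely next word” objective can be illustrated with a deliberately tiny sketch: a bigram model that simply picks the most frequent follower of a word seen in its training text. This is an invented toy, not a transformer, but the statistical objective is the same.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = model.get(word.lower())
    return counts.most_common(1)[0][0] if counts else None

# Toy corpus; a real LLM trains on terabytes of crawled text.
model = train_bigram("the cat sat on the mat and the cat slept")
print(predict_next(model, "the"))  # "cat" (seen twice after "the")
```

A transformer replaces these raw counts with learned, context-sensitive probabilities over an entire vocabulary, but it is still selecting likely continuations rather than reasoning about meaning.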
From their designers’ perspective, the machines are mere statistical models. Yet the way they are presented invites people to attribute intelligence to them, fueling fears that they could threaten humanity. Those fears, in turn, distract from the concrete harms already caused by existing deployments of AI technology.
LLMs rely heavily on training data gathered by web crawlers, programs that systematically traverse the internet. Common Crawl, a nonprofit whose crawl archives are among the most widely used, has made vast datasets freely available to the public. However, there are concerns about copyright infringement: this training data contains copyrighted works collected under the banner of “fair use,” and the extent to which LLMs have been trained on pirated material remains unclear.
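What a crawler is permitted to fetch is governed largely by each site’s robots.txt file, a convention that copyright law does not directly address. A minimal sketch using Python’s standard library, parsing a rule set offline (the rules and URLs below are invented for illustration):

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed offline for illustration.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# A polite crawler checks before fetching each URL.
print(parser.can_fetch("MyBot", "https://example.com/articles/ai.html"))   # True
print(parser.can_fetch("MyBot", "https://example.com/private/data.html"))  # False
```

Crucially, robots.txt expresses a site operator’s preferences about crawling; complying with it says nothing about whether the crawled content may lawfully be used as training data.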
Moreover, the environmental impact of these systems is a growing concern. Training an early LLM in 2019 was estimated to emit 300,000 kg of CO2, equivalent to 125 round-trip flights between New York and Beijing. Companies attempt to rationalize these emissions by purchasing offsets, yet remain secretive about the true environmental costs of their AI technologies.
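The flight comparison is easy to sanity-check with back-of-the-envelope arithmetic; the numbers below are the cited estimate, not independent measurements:

```python
# The cited figures: 300,000 kg of CO2 equated to 125 round-trip flights.
total_emissions_kg = 300_000
round_trip_flights = 125

kg_per_round_trip = total_emissions_kg / round_trip_flights
print(kg_per_round_trip)  # 2400.0
```

That works out to 2,400 kg of CO2 per round trip, roughly 2.4 tonnes per passenger, which is a plausible order of magnitude for a long-haul return flight.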
In light of these issues, transparency and regulation are crucial. AI developers and operators should be required to disclose in detail how they measure and control these harms. Understanding how these AI sausages are made is essential if we are to navigate the transformative technology we have invented. While we may have limited control over the machines themselves, regulation can hold their owners accountable.
In conclusion, as the world’s appetite for AI grows, it is vital to have a clear understanding of the ingredients that make up these technologies. Transparency, disclosure, and regulation are necessary to address concerns surrounding training data, copyright infringement, environmental impact, and accountability.