(Original image from the Bell System Technical Journal, 1922)
The world we live in is awash with information — awash with content. There are more documents being generated, and more data to sort through, than anyone can make sense of on their own. One of the ironies of the digital revolution is that computers, the internet, search engines, etc., were all developed to wrangle this information, but in turn became tools to generate even more of it, cheaper and faster.
Enter artificial intelligence: the latest and greatest cause of (and solution to) all your content problems.
In the broadest sense of automating tasks that traditionally have required human labor, artificial intelligence has been around for a long time, first in the research and industrial fields, then in the world of the consumer. It’s an umbrella term, a family resemblance concept, used to group things that aren’t necessarily identical. (We used AI transcription technology to help compose this newsletter.) But recently there’s been a hyperfocus on the section of digital automation that uses large language models to generate documents. Since creating, understanding, categorizing, and presenting digital documents is very much in Autogram’s wheelhouse, we figured we should address the subject.
“The world of content marketing and content publishing over the past decade has generated such a huge glut of content that is put out there, some of it of very low value,” says Jeff Eaton. Consider a company that needs to generate product descriptions for thousands of products from specifications stored in a database. “We’re in a phase of enthusiasm and excitement for turning all of that work over to robots. And it’s hard to understand the ripple effects of all of that, other than that it’s harming a lot of people who work in content.”
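To make that scenario concrete, here is a minimal sketch of the rote version of the work: a template filled from a database rather than a language model. The table name, columns, and wording are hypothetical, chosen only for illustration. The point is less the tooling than the shape of the task: structured inputs, formulaic outputs, repeated thousands of times. An LLM-based pipeline would swap the template for a prompt, but the inputs and outputs look much the same.

```python
# Illustrative sketch only: product descriptions generated from structured
# specifications. The "products" table and its columns are hypothetical.
import sqlite3

TEMPLATE = (
    "The {name} is a {category} with {capacity} of storage, "
    "measuring {width} x {height} cm and weighing {weight} kg."
)

def describe_products(db_path):
    """Render a one-sentence description for every product row."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets us address columns by name
    rows = conn.execute(
        "SELECT name, category, capacity, width, height, weight FROM products"
    )
    return [TEMPLATE.format(**dict(row)) for row in rows]

if __name__ == "__main__":
    for description in describe_products("catalog.db"):
        print(description)
```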
On the one hand, then, this feels like a classic disruption scenario: a new technology doesn’t need to match the quality of its competition because it creates a “good enough” substitute in huge quantities at a much lower cost. On the other hand, at the scales we’re talking about, even the language of disruption might be understating its effects.
You could imagine a scenario, says Karen McGrane, where “99 percent of the content on the internet will be AI generated. And that doesn’t mean that there will actually be less human content. It just means that the sheer volume of AI generated content will go up so dramatically that it will change the entire ecosystem of what having human-generated content means.”
Recipes, sports scores, job descriptions, product specifications, and more — any well-structured, generic (i.e., literally written within the constraints of a genre) content that needs to be produced over and over again — are all candidates for automation. That’s setting aside automation’s use for bulk spam or scam posts, reviews, and comments, and it’s before we get to machines’ increasing ability to freely generate narrative or argumentative content from unstructured inputs.
Another divide that AI widens is between content that’s primarily created for humans to read, understand, and enjoy, and content that’s ultimately designed to feed other machines: not just spam and SEO bait but also database fillers and even code and markup. One of the funny things about automatically generated content is that even when it’s “good enough”… people notice. They learn to recognize the new content, find its telltale marks, and treat it as illegitimate or low-value. Furthermore, machines notice. And when both humans and machines have cause, for whatever reason, to distrust or deprioritize content generated by other machines, all that automated content — content whose cost is not yet vanishingly small — turns into so much static.
These are just some of the unintended consequences of turning the dial all the way over to full automation. Another is entrenching (again) a handful of big companies that have both the means and the motivation to enrich themselves at everyone else’s expense.
“Other Web 2.0 ‘disruptions’ included a cheap, unregulated, consumer-facing aspect to them,” says Ethan Marcotte, citing ride- or home-sharing apps. “These new tools like ChatGPT (or Google’s version, or Microsoft’s, etc.) are defined by how centralized they are. They’re sitting on corporate-owned server racks, and their business model is lucrative licensing deals for other big corporations… They’re actually very conservative in a lot of ways.”
Just as we must conceptually disentangle all the different strands of AI, we also have to disentangle the capabilities of the technology from the companies looking to sell a product. They are not identical, and any rush to pick winners and losers is sure to be premature.
We have to remember too that AI doesn’t operate in a vacuum. There’s a technological context, a business and employment context, a social and user context, and a legal and regulatory context. The last is one we still know too little about. Are technology companies going to continue to be free to train their machine learning models on anything they come into contact with, from web pages and public data sets to user content and copyrighted and trademarked material? Will there be public and political pressure both to make automatically generated results more accurate and to protect proprietary data? We don’t have to recount the litany of disruptive innovations that eventually ran afoul of regulatory regimes that reined in their freewheeling potential. Organizations, too, are increasingly making the business decision to keep their information out of the AI data mines — out of any platform they don’t themselves control.
There’s also an open secret about most forms of AI: they’re still dependent on huge amounts of human labor, either congealed in their data sets or, more directly, in the form of often underpaid workers who train, correct, edit, and yes, tag and sort the output of the machines. We’re still in the world of the so-called “mechanical Turk,” the chess-playing pseudo-robot hiding a human under the table, whose name Amazon took for its own service offering inexpensive human labor behind a digital front end. The odds are good that human beings will be entangled with the machines they’ve made for a long, long time.
AI Is Writing Code Now. For Companies, That Is Good and Bad, by Isabelle Bousquette
“People have talked about technical debt for a long time, and now we have a brand new credit card here that is going to allow us to accumulate technical debt in ways we were never able to do before,” said Armando Solar-Lezama, a professor at the Massachusetts Institute of Technology’s Computer Science & Artificial Intelligence Laboratory. “I think there is a risk of accumulating lots of very shoddy code written by a machine,” he said, adding that companies will have to rethink methodologies around how they can work in tandem with the new tools’ capabilities to avoid that.
"We Have No Moat, And Neither Does OpenAI" (leaked Google document sourced by Dylan Patel and Afzal Ahmad)
While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly. Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months. This has profound implications for us.
Will Google’s AI Plans Destroy the Media? by John Herrman
This is a facet of the larger AI story — which is to say it’s about automation. But it’s also a story of a large platform deciding to compete more aggressively in the marketplace it controls. With snapshots, Google is pushing into some of the most lucrative parts of the content business over which it already exerts enormous influence. That the sorts of content it seems to be automating first are explainers, guides, and product rankings is no coincidence — these are styles of content that publishers currently produce with Google traffic in mind.
Signal’s Meredith Whittaker: ‘These are the people who could actually pause AI if they wanted to’, interview by Ian Tucker
There is no Cartesian window of neutrality that you can put an algorithm behind and be like, “This is outside our present and history.” These algorithms are trained on data that reflects not the world, but the internet – which is worse, arguably. That is going to encode the historical and present-day histories of marginalisation, inequality etc. There isn’t a way to get out of that and then be like, “This is a pristine, unbiased algorithm,” because data is authored by people. It’s always going to be a recycling of the past and then spitting that out, projecting that on to the present.