GPT-2, the Destruction of the Web and Artificially Intelligent Textbooks

GPT-2 is a deep-learning language model released by OpenAI a few months ago which is causing a bit of controversy in the deep learning community.

OpenAI explained the decision in their announcement: “Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.”

OpenAI actually took a conservative position and decided to hold off from releasing the model due to the potential impact on society.

This is one of the first emerging technologies that really frightens me. There’s a massive potential for abuse here and I don’t think society is ready for it just yet.

With many disruptive technologies we generally have a few years to come to grips with the problems they might cause. Robotics, CRISPR, and other technologies are still a few years away from really impacting our lives (however it plays out).

Not true with GPT-2 - it’s here today.

Background

GPT-2 is a deep learning model released by OpenAI back in February which they built by training on 40GB of text derived from a web crawl.

They collected about 8M pages and then fed the text into a neural network.

It builds an internal linguistic model of the underlying text represented as a neural network.

This helps it understand the actual text, but it’s also generative. This means it can build NEW text based on a prompt.

Here’s a short example:

System Prompt

This part was generated by a human as the input prompt:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Model Completion

This was generated by GPT-2, a computer:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

… this is just a short excerpt.

You can read more here

Could it Destroy the Web?

The web is built on a number of assumptions which aren’t necessarily true any longer.

First, that only humans can write legitimate content.

In the past, machine-generated text seemed, well, robotic. If you used a Markov chain to generate text it sounded like you would expect - like canned text.

It went off topic. Didn’t make sense. They were words. They were a subset of language but they didn’t have much meaning.

You can get a sense of how this works by using auto-complete on your mobile phone.

If you start off with a sample sentence and let it auto-complete and just select the first match it will sound rather - insane.

Here’s an example.

On my phone I entered “The cat is outside” and then hit next again and again.

My phone came up with:

“The cat is outside and is never told about Jesus do they go to the emergency place of his own and they are crawling to be my friend.”

I mean, those are words. That’s technically a sentence. But it doesn’t make a ton of sense.
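The phone experiment above can be approximated with a toy Markov chain. This is only a minimal sketch: the corpus is made up, the function names `build_chain` and `generate` are hypothetical, and a real phone keyboard uses a far more elaborate model. The point it shows is that each word is chosen by looking only at the word before it, which is why the output drifts off topic so quickly.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=0):
    """Walk the chain from `start`, picking a random follower at each step."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word never had a successor
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Tiny made-up corpus, purely for illustration.
corpus = (
    "the cat is outside and the dog is inside and "
    "the cat is asleep and the dog is outside"
)
chain = build_chain(corpus)
print(generate(chain, "the", 12))
```

Every pair of adjacent words in the output appeared somewhere in the corpus, so it looks locally plausible, but nothing ties the end of the sentence to the beginning.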

GPT-2 can write entire articles of text that sound like they were written by a human.

Not only do they seem human, they’re also readable - you won’t get bored either.

This is going to lead to a massive amount of Google spam.

If you can write content cheaply you can arbitrage it: post the content on the web and then run ads against it.

If you can generate gigabytes of content for pennies you can make a lot of money just cranking out fake content and spamming search engines.

Legitimate marketers are going to do this too.

Why write new content when you could just crawl your competitor’s site and have it generate content based on their data?

Or just crawl your own content and have it generate new articles using your current articles as the system prompt.

What else could we do?

You could create fake news.

Fake scientific research.

You could spam your competitor’s support queue with fake technical support requests.

The list goes on and on.

If you have any background in SEO or search your heart is probably racing.

Take a 20 minute break and meditate. We’ll be here when you get back.

What’s the upside?

Is there any upside? There is. I think it could have a dramatic and amazing impact on spaced repetition and cognitive science.

GPT-2 could help build question and answer pairs by bulk processing large textbooks to build flashcards for spaced repetition systems like Anki and Polar.

It could also help with the raw understanding of the text so you could ask it questions and it could provide you with answers. Far better than simple full-text search!

Full-text search will get you closer to an answer but you still have to spend a bit of time reading.

With GPT-2 it could just flat out give you the answer.

If you’ve ever read Diamond Age it could help build a “Young Lady’s Illustrated Primer” or an artificially intelligent textbook that you can interact with directly!

You would literally be able to interact with a textbook and ask it questions from the material you’re reading.

Textbooks would no longer be sterile and inert. It could suggest other reading based on your interests. You could interact with it and it could actually teach you directly.

It can’t understand the text in the classical sense that you or I can. You can’t ask it mathematical questions, or questions that are more philosophical in nature, and expect answers.

Those still require humans.

Mitigation

There is some risk mitigation we could pursue. We could require public keys for users, only index content that is signed, and build a trust model similar to PageRank.

Humans can only write so much content. By using this model you throttle everyone by forcing rank to flow through the graph of the web of trust.

People can create fake keys, of course. They can create key after key for their fake robots, but they have to attract REAL humans to certify them at some point, and only so much rank would flow forward from a real human.
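Here is a rough sketch of how that kind of rank flow could work. It is not PageRank itself and not a worked-out design, just a simplified seeded variant under my own assumptions: the key names, the `propagate_trust` function, and the `damping` and `rounds` parameters are all hypothetical. Trust starts only at keys an auditor has verified as real people, and each key passes a damped share of its trust to the keys it certifies.

```python
def propagate_trust(certifies, verified, damping=0.5, rounds=20):
    """Flow trust from verified human keys along certification edges.

    certifies: dict mapping a signer key to the list of keys it certifies.
    verified:  set of keys an auditor has confirmed belong to real people.
    """
    keys = set(certifies) | set(verified)
    for targets in certifies.values():
        keys.update(targets)

    # Only verified humans start with any trust at all.
    base = {k: (1.0 if k in verified else 0.0) for k in keys}
    trust = dict(base)

    for _ in range(rounds):
        updated = dict(base)
        for signer, targets in certifies.items():
            if not targets:
                continue
            # A signer's damped trust is split evenly across what it certifies,
            # so certifying more keys gives each one a smaller share.
            share = damping * trust[signer] / len(targets)
            for target in targets:
                updated[target] += share
        trust = updated
    return trust


# Hypothetical graph: two real people, a bot one of them certified,
# and a ring of fake keys that only certify each other.
certifies = {
    "alice": ["bob", "bot1"],
    "bob": ["alice"],
    "bot1": ["bot2"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
trust = propagate_trust(certifies, verified={"alice", "bob"})
for key in sorted(trust):
    print(key, round(trust[key], 3))
```

The ring of fake keys never receives any trust because no verified human certifies it, and the bot that did fool a real person only gets a bounded share that shrinks further with every hop.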

If search engines employed auditors to find and validate real people we could keep the web from becoming owned by AI.

It’s not perfect but I think there’s a path forward here, though going in depth is a bit out of scope for this article.

What Could Stop GPT-2?

Let’s say we collectively decide that building and releasing something like GPT-2 is a bad idea.

How could we stop it?

We can’t. It’s just a matter of time until this technology is democratized and in the hands of everyone.

The prices of GPUs and cloud computing are just getting cheaper and cheaper and eventually this is going to become commoditized.

In five or so years you’ll be able to build a model like GPT-2 in a few days to weeks with very little funding and with off-the-shelf tools.

We’re going to have to deal with this one way or another.

Conclusion

GPT-2 is really frightening but if we can mitigate the risks we might be able to use these systems in the future and focus mostly on the positives.

We’re going to need to educate older voters and people naive about modern technology on the risks of trusting fake news and deep fakes.

This might be an uphill battle though. We already have fake news and we haven’t done a good job dealing with it so far. How are we going to deal with it when it gets far, far worse?