Reading the OpenAI Paper on Unsupervised Multitask Learners

On Feb 14 the team at OpenAI, a famous AI startup backed by Elon Musk and Peter Thiel, published information about their recent breakthrough. It instantly electrified the press and received significant coverage from Business Insider, The Guardian, The Verge, and others. The headlines were alarming. But what is actually behind this “revolutionary AI”, and how much is there to fear? [footnote] Just look at the headlines! Business Insider: “An Elon Musk-backed AI firm is keeping a text generating tool under wraps amid fears it’s too dangerous,” or The Guardian: “AI can write just like me. Brace for the robot apocalypse.”[/footnote]

What spins the story even more is the fact that, for now, OpenAI won’t release the full source code of its Machine Learning model. As stated by Jack Clark, policy director at OpenAI, the team wants to encourage academics and the public to have a conversation about the potential harms of this technology before it becomes widely available: it could be used for automated trolling that influences online debate, allowing existing trolls to scale their efforts overnight.

Readings from the source for the diligent reader

A good starting point is to understand what has actually been published. Let’s find out what a diligent yet not strictly scientific reader can learn about the so-called GPT-2 model from OpenAI by reading the source paper.

Here’s a breakdown of the piece for the sake of simplicity, allowing for a high-level understanding:

It goes without saying that researchers are looking to find a path toward general artificial intelligence. The piece in question is a step on that path.

“Current ML (machine learning) systems need hundreds to thousands of examples to induce functions which generalize well.”

It’s a well-known limitation of existing Machine Learning models: not only are they data-hungry, they are also domain-specific. However, some scientists have started to suggest 1 that task-specific architectures are no longer necessary, as the so-called self-attention block-based architecture makes it possible to escape both the narrow scope of use and the tedious data preparation that limits progress. That’s big news!

What is “self-attention”? It’s a mechanism used in a model conceived by Google Brain specialists 2, which the OpenAI team reused with some modifications. Seeking progress by building on the existing findings of others is common practice in science.
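For readers who want to see the mechanics, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of that architecture. Everything here (sizes, random weights) is purely illustrative; real Transformer blocks add multiple attention heads, residual connections, and layer normalization.

```python
# A minimal sketch of scaled dot-product self-attention.
# Illustrative only - not the full Transformer block used by GPT-2.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # each token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # each output mixes all positions

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The key property is that every output position is a weighted mixture of all input positions, which is what lets the model use arbitrarily distant context without a task-specific architecture.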

But the real breakthrough comes with the suggestion that a general (domain-independent) system should be able to use language to learn without supervision and, based on the acquired knowledge, perform different tasks. More specifically: researchers suggest that systems could use the syntax of human language for their own conditioning, as it provides a flexible way to specify tasks, inputs, and outputs as a sequence of symbols. The team expressed the claim:

“Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement.”

The work is therefore a quest to answer a question: can a neural network learn by itself using human language and, as a result, perform a wide range of tasks?

In order to learn from a language, the OpenAI model interprets word constructs as byte chunks. The team reports this as a “practical middle ground between character and word level language modeling.” The team used four variations of the model, which differ by capacity: starting from 117 million parameters, 3 all the way up to 1,542 million.
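To get an intuition for what “byte chunks” means, here is a toy sketch: text is first turned into a sequence over a 256-symbol byte alphabet, and frequent adjacent pairs are then merged into larger units (the idea behind byte-pair encoding). This is a deliberately simplified illustration, not GPT-2’s actual tokenizer.

```python
# Toy illustration of byte-level modeling with one byte-pair merge step.
# Simplified sketch - not the actual GPT-2 tokenizer.
from collections import Counter

text = "lower lowest"
tokens = list(text.encode("utf-8"))   # words become plain byte sequences
print(tokens[:5])                     # [108, 111, 119, 101, 114] for "lower"

# One BPE merge step: find the most frequent adjacent byte pair...
pairs = Counter(zip(tokens, tokens[1:]))
best = max(pairs, key=pairs.get)      # here (108, 111), the bytes of "lo"

# ...and replace every occurrence with a single new symbol id.
merged, i, NEW_ID = [], 0, 256
while i < len(tokens):
    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
        merged.append(NEW_ID)
        i += 2
    else:
        merged.append(tokens[i])
        i += 1
print(len(tokens), "->", len(merged))  # 12 -> 10
```

Repeating such merges builds a vocabulary that sits between raw characters and whole words, which is exactly the “middle ground” the paper describes.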

All models were trained on a dataset consisting of 40GB of raw text snippets from over 8 million documents, all scraped from web pages linked on social media. The team kept only those texts that could be assumed to have decent quality, thanks to being marked with at least three karma points.

What has been demonstrated?

A Machine Learning system trained using language should, in practice, provide answers to questions as a result of effective symbol processing. The hope is, of course, that successful systems will demonstrate increased comprehension of a question’s context, even independently of the language the system was trained on. To measure how well a system is doing, researchers have, over time, come up with sets of tests that make it possible to examine its answers.

This makes results quantifiable, allowing both system-to-system comparisons and a point of reference against human performance. What’s interesting is that the OpenAI model’s results in text generation are better than previous attempts, sometimes even close to the results expected from humans.

Scores of GPT-2

One of the tests designed by scientists measures how well the system finds an omitted word in a sentence. GPT-2 performed great on this so-called ‘Children’s Book Test,’ with a 93.3% hit rate for common nouns and 89.1% for named entities.
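The evaluation recipe behind such cloze tests is simple: substitute each candidate word into the blank, score the completed sentence with the language model, and predict the highest-scoring candidate. The sketch below illustrates this with a tiny bigram-count “model” standing in for GPT-2; the corpus and candidates are invented for the example.

```python
# Toy sketch of cloze-style evaluation: score each candidate filling
# and pick the best. A tiny bigram counter stands in for a real LM.
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def score(sentence):
    """Sum of bigram counts - a crude stand-in for an LM's log-probability."""
    words = sentence.split()
    return sum(bigrams[(a, b)] for a, b in zip(words, words[1:]))

cloze = "the cat sat on the ___"
candidates = ["mat", "sky", "idea"]
best = max(candidates, key=lambda w: score(cloze.replace("___", w)))
print(best)  # mat
```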

Another test was designed to measure understanding of difficult context, where humans need at least 50 tokens of context to successfully predict the final word of an unfinished sentence. The test is called LAMBADA 4. Here GPT-2 increased machine accuracy from the previous record of 19% to a whopping 52.6%.

Researchers also tested reading comprehension, using a set of documents from seven different domains and asking questions related to their content. It’s worth noting that highly specialized systems already exist that are on par with humans in this isolated task. However, systems designed and trained to excel only at a single task cannot really be compared to the more general GPT-2, trained without supervision. From that perspective, its score of 55 F1 5 is about halfway to that of humans and specialized systems.
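To make the F1 number concrete, here is a minimal token-overlap computation of the kind used for reading-comprehension answers: the harmonic mean of precision and recall. The predicted and reference answers below are invented for illustration.

```python
# Minimal token-level F1: harmonic mean of precision and recall.
# The example answers are hypothetical, for illustration only.
def f1(predicted, reference):
    pred, ref = predicted.split(), reference.split()
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)   # how much of the prediction is relevant
    recall = overlap / len(ref)       # how much of the reference was found
    return 2 * precision * recall / (precision + recall)

print(round(f1("in the park", "in the city park"), 2))  # 0.86
```

A score of 55 F1 therefore means the model’s answers share, on balance, roughly half their tokens with the reference answers.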

The score was obtained by conditioning GPT-2 on specific documents, so the result should be seen as that of an isolated test. A more general measure came from testing what answers the model provides when asked factoid-style questions. Here the success ratio was only 4.1%, ten times worse than systems designed for this task alone.

One curious finding relates to how well the system performs translation between languages. Given an English dataset containing some French text, the OpenAI model performed well in translation between English and French, achieving 11.5 BLEU points 6. In comparison, the best unsupervised algorithms achieve 33.5 BLEU.
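For a feel of what a BLEU point is, here is a much-simplified sketch of the idea behind the metric: clipped n-gram precision (unigrams only here, whereas real BLEU combines 1- to 4-grams and uses multiple references) scaled by a brevity penalty for translations shorter than the reference. The sentences are invented for the example.

```python
# Much-simplified unigram BLEU: clipped precision times a brevity penalty.
# Real BLEU combines 1- to 4-gram precisions; this is illustration only.
import math
from collections import Counter

def unigram_bleu(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Clip each word's count so repeating a correct word cannot inflate the score.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * precision   # scaled to the 0-100 range quoted above

print(round(unigram_bleu("the cat is on the mat",
                         "the cat sat on the mat"), 1))  # 83.3
```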

The OpenAI team also reports a score of 70.70% for common sense reasoning, tested using the Winograd Schema Challenge. These results should be taken with a pinch of salt, as the task was to identify the antecedent of an ambiguous pronoun in a statement, over a limited set of 273 text samples.

This only demonstrates how high our expectations are in relation to what is actually tested. It also shows how the actual results of Machine Learning models are often misinterpreted. OpenAI’s GPT-2 clearly demonstrates progress on the path of symbol-processing systems, with results closer to those of our cognition. But it’s a long way to what newspapers already make us fear with their bloated headlines.

Read more about the history of symbol processing and the hopes of making computers think in this article.

  1. See Radford et al., 2018 and Devlin et al., 2018

  2. See “Attention Is All You Need” by Ashish Vaswani et al. and “Improving Language Understanding by Generative Pre-Training” by Alec Radford et al. 

  3. A parameter is a learnable weight in a building block of the network

  4. LAMBADA stands for LAnguage Modeling Broadened to Account for Discourse Aspects

  5. F1 is a measure of a test’s accuracy. It considers both precision (how useful the answers are) and recall (how complete the results are). High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.

  6. BLEU (Bilingual Evaluation Understudy) measures the correspondence between a machine’s output and that of a human on a scale from 0 to 100. The closer a machine translation is to a professional human translation, the better it is.