The new OpenAI tool tries to explain the behavior of language models

Large language models (LLMs) along the lines of OpenAI’s ChatGPT are often said to be a black box, and certainly, there is some truth to that. Even for data scientists, it’s hard to know why a model responds the way it does in any given case, for instance when it invents facts out of thin air.

In an effort to peel back the layers of LLMs, OpenAI is developing a tool to automatically identify which parts of an LLM are responsible for which of its behaviors. The engineers behind it stress that it’s in the early stages, but the code to run it is available as open source on GitHub as of this morning.

“We are trying to [develop ways to] anticipate what the problems will be with an AI system,” William Saunders, manager of the OpenAI interpretability team, told TechDigiPro in a phone interview. “We really want to be able to know that we can trust what the model is doing and the response it produces.”

To that end, the OpenAI tool uses a language model (ironically) to discover the functions of components of other architecturally simpler LLMs, specifically OpenAI’s own GPT-2.

The OpenAI tool attempts to simulate the behaviors of neurons in an LLM.

How? First, a quick explanation of LLMs for background. Like the brain, they are made up of “neurons,” which observe some specific pattern in the text to influence what the overall model “says” next. For example, when given a prompt about superheroes (e.g., “Which superheroes have the most useful superpowers?”), a “Marvel superhero neuron” might increase the probability that the model will name specific superheroes from Marvel movies.
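To make the idea concrete, here is a toy sketch of how one neuron could shift a model's next-word probabilities. Everything in it is an illustrative assumption: the candidate names, the baseline scores, the neuron's weights and the helper `next_token_probs` are invented for the example and are not taken from GPT-2 or from OpenAI's tool.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits.values())
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical baseline scores: the model has no preference yet.
base_logits = {"Iron Man": 1.0, "Batman": 1.0, "Superman": 1.0}

# Hypothetical "Marvel superhero neuron": it only pushes Marvel names up.
neuron_weights = {"Iron Man": 2.0, "Batman": 0.0, "Superman": 0.0}

def next_token_probs(activation):
    """Add the neuron's contribution, scaled by how strongly it fires."""
    logits = {
        tok: base_logits[tok] + activation * neuron_weights[tok]
        for tok in base_logits
    }
    return softmax(logits)

quiet = next_token_probs(0.0)   # neuron silent: all names equally likely
firing = next_token_probs(1.0)  # neuron active: the Marvel name gains probability

print(quiet["Iron Man"], firing["Iron Man"])
```

When the neuron is silent, the three names are equally likely; when it fires, probability moves toward the Marvel name at the expense of the others.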

The OpenAI tool exploits this setup to break models down into their individual pieces. First, the tool runs text sequences through the model being evaluated and looks for cases where a particular neuron activates frequently. It then “shows” these highly active neurons to GPT-4, OpenAI’s latest text-generating AI model, and has GPT-4 generate an explanation. To determine how accurate the explanation is, the tool provides GPT-4 with text sequences and has it predict, or simulate, how the neuron would behave. It then compares the behavior of the simulated neuron with the behavior of the real neuron.

“Using this methodology, we can basically, for each neuron, generate some kind of preliminary natural language explanation of what it’s doing and also have a score of how well that explanation matches actual behavior,” said Jeff Wu, who leads the scalable alignment team at OpenAI. “We use GPT-4 as part of the process to produce explanations of what a neuron is looking for, and then rate how well those explanations match the reality of what it’s doing.”
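The explain, simulate and compare loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenAI's code: the activation values are made up, the "explanation" is reduced to a keyword set, the `simulate` function is a hypothetical stand-in for the step where the real tool queries GPT-4, and using Pearson correlation as the score is an assumption, since the article only says the simulated and real behaviors are compared.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal cannot be correlated with anything
    return cov / (var_x * var_y) ** 0.5

def top_activating(tokens, activations, k=3):
    """Step 1: find the tokens on which the real neuron fired most strongly."""
    ranked = sorted(zip(tokens, activations), key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

def simulate(explanation_keywords, tokens):
    """Step 3 stand-in (hypothetical): predict activations from an explanation.
    The real tool asks GPT-4; here a token "fires" if it is a listed keyword."""
    return [1.0 if tok.lower() in explanation_keywords else 0.0 for tok in tokens]

# Made-up recording of one neuron over one text sequence.
tokens = ["Which", "heroes", "beat", "Thor", "and", "Hulk", "?"]
real = [0.0, 0.4, 0.0, 0.9, 0.0, 0.8, 0.0]

examples = top_activating(tokens, real)       # what would be shown to the explainer
good_explanation = {"thor", "hulk"}           # step 2 stand-in: "fires on Marvel names"
bad_explanation = {"which"}                   # a deliberately wrong explanation

good_score = pearson(real, simulate(good_explanation, tokens))
bad_score = pearson(real, simulate(bad_explanation, tokens))
print(examples, good_score, bad_score)
```

An explanation that predicts the real activations well earns a score close to 1, while one that misses the pattern scores near or below zero. That gap is what lets the tool rank its own explanations.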

The researchers were able to generate explanations for the 307,200 neurons in GPT-2, which they compiled into a dataset that was published along with the tool’s code.

Tools like this could one day be used to improve an LLM’s performance, the researchers say, for example by reducing bias or toxicity. But they acknowledge that it has a long way to go before it’s genuinely useful. The tool was confident in its explanations for only about 1,000 of those neurons, a small fraction of the total.

A cynical person might also argue that the tool is essentially an advertisement for GPT-4, since it requires GPT-4 to work. Other LLM interpretability tools are less dependent on commercial APIs, such as DeepMind’s Tracr, a compiler that translates programs into neural network models.

Wu said that this isn’t the case: the fact that the tool uses GPT-4 is merely “incidental” and, if anything, shows GPT-4’s weaknesses in this area. He also said that it was not created with commercial applications in mind and could, in theory, be adapted to use LLMs other than GPT-4.
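For a sense of scale, the two figures quoted above can be put side by side. The layer breakdown in the comment is an assumption: the article does not say which GPT-2 variant was studied, but 307,200 matches the largest one if it has 48 layers with 6,400 feed-forward neurons each.

```python
total_neurons = 307_200   # neurons explained, as reported in the article
well_explained = 1_000    # approximate count with trustworthy explanations

# Assumption (not stated in the article): 48 layers x 6,400 neurons per layer.
assumed_layers, assumed_neurons_per_layer = 48, 6_400
matches_assumption = assumed_layers * assumed_neurons_per_layer == total_neurons

fraction = well_explained / total_neurons
print(f"{fraction:.2%} of neurons")  # prints "0.33% of neurons"
```

Roughly one neuron in 300 has an explanation the tool can stand behind.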

The tool identifies neurons that fire across layers in the LLM.

“Most of the explanations score pretty low or don’t explain much of the behavior of the actual neuron,” Wu said. “A lot of the neurons, for example, fire in a way where it’s very hard to tell what’s going on, like they’re firing on five or six different things, but there’s no discernible pattern. Sometimes there is a discernible pattern, but GPT-4 can’t find it.”

That’s to say nothing of bigger, newer, more complex models, or models that can browse the web for information. But on that second point, Wu thinks web browsing wouldn’t change the underlying mechanics of the tool much. It could simply be modified, he says, to find out why neurons decide to make certain search engine queries or access particular websites.

“We hope this opens up a promising avenue to address interpretability in an automated way that others can build on and contribute to,” Wu said. “The hope is that we really have good explanations of not just what neurons respond to but more generally for the behavior of these models: what kinds of circuits they’re computing, and how certain neurons affect other neurons.”

