New tool to explain the behavior of AI models
What actually happens inside an AI model after a prompt is entered, and how exactly are the texts produced that ChatGPT and its kind deliver on request? An OpenAI development team is working on a tool intended to answer these questions transparently.
Among other things, the company’s own language model GPT-2 serves as the test subject, while GPT-4 is meant to assist with the analysis.
Project lead William Saunders and his team want better insight into how large language models (LLMs) work. The overall goal: “We really want to be able to know that we can trust what the model is doing and the response it’s producing,” Saunders told TechCrunch. With his team, he is looking for ways to better “anticipate what the problems with an AI system will be.”
A system that automatically determines which components of an LLM trigger which behaviors is meant to help. Until now, the neurons in language models have been examined individually and by hand, which is very time-consuming. “This process does not scale well: it is difficult to apply to neural networks with tens or hundreds of billions of parameters,” OpenAI writes in a blog post.
The project is still in its infancy: as of this week, initial results are available not only on the blog but also on GitHub.
Similar to neurons in the human brain, the neurons in an AI model recognize certain patterns in text and draw conclusions for the model’s own output. Saunders’s team exploits this as follows: the model under examination, GPT-2 in OpenAI’s demo, is presented with various text sequences to process.
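What this recording step could look like in practice can be sketched with standard tools. The following is a minimal illustration, not OpenAI’s actual code, using the Hugging Face transformers library to capture the MLP-neuron activations of GPT-2 as it reads a sentence:

```python
# Illustrative sketch (not OpenAI's tooling): record GPT-2's MLP-neuron
# activations with forward hooks from the Hugging Face transformers library.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

activations = {}  # layer index -> tensor of shape (tokens, neurons)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Output of the MLP's nonlinearity: one value per neuron per token.
        activations[layer_idx] = output.detach().squeeze(0)
    return hook

# Attach a hook to the activation function inside each transformer block's MLP.
for i, block in enumerate(model.h):
    block.mlp.act.register_forward_hook(make_hook(i))

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**inputs)

# Example: the three tokens on which neuron 0 in layer 5 fires most strongly.
layer, neuron = 5, 0
top = torch.topk(activations[layer][:, neuron], k=3).indices.tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print([tokens[i] for i in top])
```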
The new OpenAI tool monitors that processing and registers which neurons are activated particularly often. This record is handed to GPT-4, which is asked to analyze and explain what the individual neurons respond to. A counter-check then tests how accurate the explanation is: GPT-4 is given new text sequences and asked to simulate, based only on the explanation, how the observed neurons would react to them. Finally, the same sequences are run through the original model, GPT-2 in this example, and the result is compared with the simulation.
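Reduced to its essentials, that loop could be sketched roughly like this. The prompts below are invented for illustration; OpenAI’s actual prompts and pipeline are the ones published on GitHub:

```python
# Hypothetical sketch of the explain-and-simulate steps using the OpenAI
# Python client. The prompts are made up; OpenAI's real ones differ.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Step 1: GPT-4 proposes an explanation from (token, activation) pairs
# recorded while GPT-2 processed text (see the sketch above).
records = [("fox", 4.1), ("dog", 3.8), ("cat", 3.5), ("the", 0.1)]
explanation = ask(
    "A neuron in a language model fired with these strengths on these tokens: "
    f"{records}. In one sentence, what concept does this neuron detect?"
)

# Step 2: given only that explanation, GPT-4 simulates the neuron on new text.
new_tokens = ["horse", "table", "rabbit", "and"]
simulated = ask(
    f"A neuron is described as follows: '{explanation}'. For each token in "
    f"{new_tokens}, predict its activation on a 0-10 scale, as a Python list."
)
# Step 3, the comparison with GPT-2's real activations, is shown further below.
```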
“Using this method, we can basically find, for each individual neuron, a sort of tentative natural-language explanation of what it’s doing, and also have an assessment of how well that explanation matches actual behavior,” said Jeff Wu, who leads OpenAI’s scalable alignment team.
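The assessment Wu mentions boils down to comparing the simulated activations with the real ones: OpenAI scores an explanation by how well the simulation correlates with the neuron’s actual behavior. A minimal version of that comparison, with invented numbers, might look like this:

```python
# Minimal sketch of the scoring step: correlate GPT-4's simulated
# activations with GPT-2's real ones on the same text (values invented).
import numpy as np

real = np.array([4.2, 0.3, 3.9, 0.1, 2.8])       # measured in GPT-2
simulated = np.array([3.5, 0.5, 4.0, 0.2, 3.1])  # predicted by GPT-4

score = np.corrcoef(real, simulated)[0, 1]
print(f"explanation score: {score:.2f}")  # near 1.0 = a faithful explanation
```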
The explanations for all 307,200 neurons in GPT-2, along with the tool’s code, are now publicly available via the OpenAI API. By the developers’ own account, however, the new method does not yet work particularly well: only about 1,000 of the neuron explanations proved genuinely reliable.
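The figure of 307,200 lines up with the largest GPT-2 variant, assuming its published architecture of 48 transformer layers, each with an MLP four times as wide as the 1,600-dimensional hidden state:

```python
# Where 307,200 comes from, assuming GPT-2 XL's architecture
# (48 layers, hidden size 1600, MLP width 4x the hidden size).
layers, hidden = 48, 1600
print(layers * 4 * hidden)  # 307200 MLP neurons
```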
“For example, a lot of the neurons are active in a way where it’s very hard to tell what’s going on — they’re activated on five or six different things, but there’s no discernible pattern,” Wu told TechCrunch. And: “Sometimes there is a recognizable pattern, but GPT-4 isn’t able to find it.”
Additional problems arise with models that are larger and more complex than GPT-2. Accordingly, there is still “a long way” to go before a tool exists that can automatically break down the processes inside language models.