The Problem with OpenAI's New Models
The aspect of ChatGPT which made it so incredible is also a flaw, and OpenAI is doubling down on it.
There is already debate over the problem of overly anthropomorphizing physical robots. But can the same debate be had with text-generating models? Absolutely.
Upon its release, we were collectively and immediately impressed by ChatGPT. Our amazement was fuelled by the model’s ability to carry a conversation like a real person. In short, it passed the Turing test.
The Turing test is a test of anthropomorphization. By presenting all output as complete written prose, ChatGPT poses as a person instead of software. Until ChatGPT, none of us had received such responses from anything other than a human.
This is problematic. The use of natural language implies intelligence, and humans are predisposed to be fooled by this, to varying degrees. Natural, conversational and friendly language skews the expectations of the person interacting with the chatbot: they are likely to ascribe to the model qualities equivalent to those of a human producing the same language.
OpenAI knows this. They know the same model was much less impressive before being packaged as a conversational chatbot. This may be why they are continuing down this path with o1, one of OpenAI's latest flagship models. In fact, they are doubling down on this theme.
There are two aspects I’ll mention. The first is chain-of-thought reasoning. It was quickly observed with the original ChatGPT that prompting it to “think step by step” often yielded higher-quality responses. Many assume the improvement comes from the model being coaxed to “think” more carefully or thoroughly. But there is a simpler explanation: such language in the prompt seeds the output more heavily towards the portions of the training data in which responses are written in a stepwise fashion. These tend to be of higher quality, for reasons which would be interesting to explore further.
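To make this concrete, here is a minimal sketch of the prompt change involved, using the OpenAI Python client. The model name and the exact wording of the step-by-step instruction are illustrative assumptions; the point is only that a phrase is appended to the prompt, and nothing about the model itself changes.

```python
# Minimal sketch of "think step by step" prompting with the OpenAI Python
# client (pip install openai). Model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Plain prompt: the model answers in whatever style it defaults to.
plain = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: the appended phrase steers generation towards the
# stepwise, worked-solution style found in the training data.
stepwise = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": question + " Let's think step by step."}],
)

print(plain.choices[0].message.content)
print(stepwise.choices[0].message.content)
```

The only difference between the two requests is the trailing sentence in the user message; any improvement in the second response comes from where that phrase steers the generation, not from any extra deliberation.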
OpenAI has now baked this chain-of-thought prompting into the model’s output. The model is very verbose in this way: the output consists of a plan, followed by some amount of recursion as the model executes that plan. It is a nice trick, and in tandem with the conversational packaging it lends a further illusion of human-like behaviour.
The second misleading aspect is the display of “thinking time” as a header above the output. A ChatGPT session powered by the o1 model will inform you that it “thought” for a certain number of seconds. This is a user-experience design decision which has nothing to do with the model itself. The user is left to assume this thinking is analogous to human thinking.
I’ve seen one user exclaim: “Imagine if you let it think for 6 months!” Anecdotally, there appears to be no correlation between thinking time and output quality. In fact, long thinking times appear to indicate a time-out mechanism triggering when the model fails to reach a desired output. Perhaps this is due to the recursion introduced by the chain-of-thought prompting mentioned previously. If so, it’s interesting to observe that these two design decisions interfere with each other in unexpected ways.
It’s clear that OpenAI is making design decisions with the purpose of lending humanness to their models. These are decisions made to influence how users interact with and perceive their products. Meanwhile, humans are predisposed to perceive human-like qualities in objects, machinery and animals. Or, in this case, mathematical models.