OpenAI released GPT-4 yesterday, the highly anticipated text-generating AI model, and it’s a curious piece of work.
GPT-4 improves on its predecessor, GPT-3, in key ways, such as producing more factually accurate statements and making it easier for developers to prescribe its style and behavior. It is also multimodal in the sense that it can understand images, allowing it to caption and even explain in detail the contents of a photo.
But GPT-4 has serious shortcomings. Like GPT-3, the model “hallucinates” facts and makes basic reasoning errors. In one example on OpenAI’s own blog, GPT-4 describes Elvis Presley as the “son of an actor.” (Neither of his parents was an actor.)
To better understand GPT-4’s development cycle and its capabilities, as well as its limitations, TechCrunch spoke with Greg Brockman, one of OpenAI’s co-founders and president, via video call on Tuesday.
Asked to compare GPT-4 to GPT-3, Brockman had one word: Different.
“It’s just different,” he told TechCrunch. “There are still a lot of problems and mistakes that [the model] makes … but you can really see the jump in skill in things like numeracy or law, where it went from being really bad in certain areas to actually pretty good relative to humans.”
The test results support his case. On the AP Calculus BC exam, GPT-4 scores a 4 out of 5, while GPT-3 scores a 1. (GPT-3.5, the intermediate model between GPT-3 and GPT-4, also scores a 4.) And on a simulated bar exam, GPT-4 passes with a score around the top 10% of test takers; GPT-3.5’s score hovered around the bottom 10%.
Changing gears, one of the more intriguing aspects of GPT-4 is the aforementioned multimodality. Unlike GPT-3 and GPT-3.5, which could only accept text prompts (e.g. “Write an essay about giraffes”), GPT-4 can take both an image and a text prompt to perform some action (for example, an image of giraffes in the Serengeti with the prompt “How many giraffes are shown here?”).
That’s because GPT-4 was trained on image as well as text data, whereas its predecessors were trained only on text. OpenAI says the training data came from “various licensed, authored and publicly available data sources, which may include publicly available personal information,” but Brockman demurred when I asked for specifics. (Training data has gotten OpenAI into legal trouble before.)
GPT-4’s image understanding capabilities are quite impressive. For example, fed the prompt “What’s funny about this image? Describe it panel by panel” along with a three-panel image showing a fake VGA cable being plugged into an iPhone, GPT-4 gives a breakdown of each panel and correctly explains the joke (“The humor in this image comes from the absurdity of plugging a big, outdated VGA connector into a small, modern smartphone charging port”).
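To make the image-plus-text idea concrete, here is a minimal sketch of what a combined request might look like, assuming a chat-style JSON API whose message content can mix text and image parts. The field names and model identifier are illustrative assumptions, since the image-input API was not publicly available at launch, and the image URL is a placeholder:

```python
# Sketch: structuring a combined image-and-text prompt for a multimodal
# chat model. Field names mirror a chat-style JSON API but are illustrative
# assumptions; the image-input API was not publicly available at launch.
import json


def build_multimodal_request(question: str, image_url: str) -> dict:
    """Assemble a request that pairs a text question with an image reference."""
    return {
        "model": "gpt-4",  # hypothetical identifier for an image-capable model
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


request = build_multimodal_request(
    "How many giraffes are shown here?",
    "https://example.com/giraffes-serengeti.jpg",  # placeholder URL
)
print(json.dumps(request, indent=2))
```

The point of the structure is that a single user turn carries both modalities, so the model can ground its text answer in the attached image.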
Only one launch partner has access to GPT-4’s image analysis capabilities at the moment: an assistive app for visually impaired people called Be My Eyes. Brockman says the wider rollout, whenever it happens, will be “slow and intentional” as OpenAI weighs the risks and benefits.
“There are policy issues like facial recognition and how to deal with images of people that we need to address and resolve,” Brockman said. “We need to figure out, for example, where the danger zones are – where the red lines are – and then clarify that over time.”
OpenAI has grappled with similar ethical dilemmas around DALL-E 2, its text-to-image system. After initially disabling the capability, OpenAI allowed customers to upload people’s faces to edit them using the AI-powered image generation system. At the time, OpenAI claimed that upgrades to its safety system made the face-editing feature possible by “minimizing the potential for harm” from deepfakes as well as attempts to create sexual, political and violent content.
Another perennial challenge is preventing GPT-4 from being used in unintended ways that could cause harm, whether psychological, monetary or otherwise. Hours after the model was released, Israeli cybersecurity startup Adversa AI published a blog post demonstrating methods to bypass OpenAI’s content filters and get GPT-4 to generate phishing emails, offensive descriptions of gay people and other highly objectionable material.
This isn’t a new phenomenon in the language model domain. Meta’s BlenderBot and OpenAI’s ChatGPT have also been tricked into saying wildly offensive things, and even revealing sensitive details about their inner workings. But many, including this reporter, had hoped that GPT-4 might bring significant improvements on the moderation front.
Asked about GPT-4’s robustness, Brockman stressed that the model had undergone six months of safety training and that, in internal tests, it was 82% less likely to respond to requests for content disallowed by OpenAI’s usage policies and 40% more likely to produce “factual” responses than GPT-3.5.
“We spent a lot of time trying to understand what GPT-4 is capable of,” Brockman said. “Putting it out into the world is how we learn. We’re constantly making updates, including a bunch of improvements, so that the model is much more adaptable to whatever personality or sort of mode you want it to be in.”
Early real-world results aren’t all that promising, frankly. Beyond the Adversa AI tests, Bing Chat, Microsoft’s GPT-4-powered chatbot, has been shown to be highly susceptible to jailbreaks. Using carefully crafted inputs, users have been able to get the bot to profess love, threaten harm, defend the Holocaust and invent conspiracy theories.
Brockman didn’t deny that GPT-4 falls short here. But he emphasized the model’s new mitigation and steering tools, including an API-level capability called “system” messages. System messages are essentially instructions that set the tone, and establish boundaries, for GPT-4’s interactions. For example, a system message might read: “You are a tutor that always responds in the Socratic style. You never give the student the answer, but always try to ask just the right question to help them learn to think for themselves.”
The idea is that the system messages act as guardrails to prevent GPT-4 from veering off course.
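In practice, a system message is just the first entry in the conversation sent to the chat API. Here is a minimal sketch of what the Socratic-tutor example above looks like as a message list; the sample student question is invented for illustration, and the actual API call is shown commented out since it requires an API key:

```python
# A system message sets the tone and boundaries for the whole conversation.
# Below: the message list for the Socratic-tutor persona described in the
# article, followed by a (hypothetical) student question.
messages = [
    {
        "role": "system",
        "content": (
            "You are a tutor that always responds in the Socratic style. "
            "You never give the student the answer, but always try to ask "
            "just the right question to help them learn to think for themselves."
        ),
    },
    # Illustrative user turn, not from the article:
    {"role": "user", "content": "How do I solve 3x + 5 = 14?"},
]

# Sent to the chat completions API, this would look roughly like:
# response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(messages[0]["role"])  # system
```

Because the system message travels with every request, it constrains all of the model’s subsequent replies in that conversation rather than just the next one.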
“Really understanding the tone, style and substance of GPT-4 has been a big priority for us,” Brockman said. “I think we’re starting to understand a little bit more about how to do engineering, how to have a repeatable process that allows you to get predictable results that will be really useful to people.”
Brockman also pointed to Evals, OpenAI’s newly open-sourced software framework for evaluating the performance of its AI models, as a sign of OpenAI’s commitment to “robustifying” its models. Evals lets users develop and run benchmarks for evaluating models like GPT-4 while inspecting their performance, a sort of crowdsourced approach to model testing.
“With Evals, we can see the [use cases] that users care about in a systematic form that we’re able to test against,” Brockman said. “Part of the reason we [open-sourced] it is that we’re moving away from releasing a new model every three months, as we did before, to making constant improvements. You don’t make what you don’t measure, right? As we make new versions [of the model], we can at least be aware of what those changes are.”
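The benchmark-as-data idea behind Evals can be sketched with a stripped-down loop: a list of prompt/ideal-answer pairs plus a scoring function. The real Evals framework organizes this differently (with a registry and JSONL sample files), so treat this as an illustration of the approach, not its API; the sample questions and the canned stand-in “model” are invented so the sketch runs offline:

```python
# Minimal sketch of an eval: run a model over prompt/ideal-answer pairs and
# score exact matches. This illustrates the idea behind Evals, not its actual
# API; the samples and the canned stand-in model are invented for the demo.
from typing import Callable

samples = [
    {"input": "2 + 2 =", "ideal": "4"},
    {"input": "Capital of France?", "ideal": "Paris"},
]


def run_eval(model: Callable[[str], str], samples: list) -> float:
    """Return the fraction of samples the model answers exactly right."""
    correct = sum(model(s["input"]).strip() == s["ideal"] for s in samples)
    return correct / len(samples)


# A canned stand-in "model" so the sketch runs without an API call.
canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
accuracy = run_eval(lambda prompt: canned[prompt], samples)
print(f"accuracy: {accuracy:.0%}")  # accuracy: 100%
```

The appeal of the format is that contributing a benchmark means contributing data, not code: anyone can add cases they care about, and every new model version can be re-scored against the same fixed set.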
I asked Brockman if OpenAI would ever compensate people for testing its models with Evals. He wouldn’t commit to it, but he noted that – for a limited time – OpenAI is granting certain Evals users early access to the GPT-4 API.
My conversation with Brockman also touched on GPT-4’s context window, which refers to the text the model can consider before generating additional text. OpenAI is testing a version of GPT-4 that can “remember” roughly 50 pages of content, or five times as much as the vanilla GPT-4 can hold in its “memory” and eight times as much as GPT-3.
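To get a feel for what “50 pages” means in context-window terms, here is a back-of-the-envelope sketch. The common rule of thumb of roughly four characters per token for English text, the 500-words-per-page figure, and the headroom reserve are all approximations of my own, not official numbers:

```python
# Back-of-the-envelope check of whether a document fits in a context window.
# The ~4-characters-per-token ratio and the page-size figure are rough
# heuristics, not official numbers.
def approx_tokens(text: str) -> int:
    """Estimate token count with the common ~4-characters-per-token rule."""
    return max(1, len(text) // 4)


def fits_in_window(text: str, window_tokens: int, reserve: int = 500) -> bool:
    """Leave `reserve` tokens of headroom for the model's reply."""
    return approx_tokens(text) + reserve <= window_tokens


page = "word " * 500           # ~500 words, roughly one dense page of text
fifty_pages = page * 50        # the "50 pages" scale mentioned above
print(approx_tokens(fifty_pages))  # on the order of tens of thousands of tokens
```

Run on these figures, 50 pages lands in the tens of thousands of tokens, which is why expanding the window changes what kinds of documents the model can digest in one pass.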
Brockman believes the expanded context window leads to new, previously unexplored applications, particularly in the enterprise. He envisions an AI chatbot built for a company that leverages context and knowledge from different sources, including employees across departments, to answer questions in a highly informed yet conversational way.
It’s not a new concept. But Brockman argues that the answers from GPT-4 will be far more useful than those from today’s chatbots and search engines.
“Before, the model didn’t have any knowledge of who you are, what you’re interested in and so on,” Brockman said. “Having that kind of history [with the larger context window] is definitely going to make it more capable… It’s going to supercharge what people can do.”