GPT-4’s perceived decline in intelligence | ENBLE

GPT-4’s Decline: Is OpenAI’s Language Model Losing its Edge?

As impressive as GPT-4 was at launch, some onlookers have observed that it has lost some of its accuracy and power. These observations have been posted online for months now, including on the OpenAI forums.

These concerns now appear to be supported by a study from researchers at Stanford University and UC Berkeley. The study, titled “How Is ChatGPT’s Behavior Changing over Time?”, set out to measure GPT-4’s capabilities against those of its predecessor, GPT-3.5. The researchers ran the tests between March and June, using a dataset of 500 problems.

The results were striking. In March, GPT-4 answered 488 of the 500 problems correctly, an accuracy of 97.6%. After subsequent updates, however, its accuracy on the same task plummeted to just 2.4% in June: only 12 correct answers out of 500. A drop of that magnitude in a few months is a significant cause for concern.
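As a quick sanity check, the percentages the study reports line up with those raw counts out of 500:

```python
# Reported figures from the study: counts of correct answers out of 500.
total = 500
march_correct = 488
june_correct = 12

print(f"March accuracy: {march_correct / total:.1%}")  # 97.6%
print(f"June accuracy:  {june_correct / total:.1%}")   # 2.4%
```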

The researchers also used chain-of-thought prompting, asking GPT-4, “Is 17,077 a prime number?” and expecting it to reason step by step toward an answer. The question requires basic mathematical reasoning. Surprisingly, GPT-4 not only got the answer wrong but also failed to provide any explanation for its incorrect response.
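For reference, 17,077 is indeed prime, which a few lines of trial division confirm (a minimal sketch, not the study’s methodology):

```python
from math import isqrt

def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); plenty fast for small inputs."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(17077))  # True: no divisor exists up to isqrt(17077) == 130
```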

These findings come on the heels of an attempt by an OpenAI executive to dismiss suspicions that GPT-4 was getting worse. The executive suggested that the degradation in quality of answers is a psychological phenomenon resulting from heavy usage.

It’s important to note that, at present, GPT-4 is available only to developers and paid subscribers through ChatGPT Plus. By comparison, asking GPT-3.5 the same question through the free ChatGPT research preview yields both the correct answer and a detailed explanation of the mathematical reasoning behind it. This discrepancy raises questions about the value and efficacy of GPT-4.

However, the decline in GPT-4’s performance does not seem limited to just question answering. Code generation has also been impacted, with developers at LeetCode witnessing a significant drop in GPT-4’s accuracy from 52% to 10% on their dataset of 50 easy problems between March and June.

Adding insult to injury, Twitter commentator @svpino relayed rumors that OpenAI might be running smaller, specialized versions of GPT-4 that behave similarly to the full model but are cheaper to operate. If true, this cost-saving measure could be contributing to the drop in the quality of GPT-4’s responses. That would be particularly concerning given how many large organizations have built collaborations around OpenAI’s technology.

However, not everyone is convinced that the study proves GPT-4’s decline in capability. Some argue that a change in behavior does not necessarily equate to a reduction in capability. The study itself acknowledges this, stating that “a model that has a capability may or may not display that capability in response to a particular prompt.” In other words, obtaining the desired result may require different types of prompts from the user.

When GPT-4 was first announced, OpenAI touted its use of Microsoft Azure AI supercomputers to train the model over six months, claiming the result was a 40% higher likelihood of producing the desired information in response to user prompts. ChatGPT, which is based on GPT-3.5, already had known information gaps, such as limited knowledge of world events after 2021. A regression in answer quality is a new problem on top of that, disappointing users who had been eagerly anticipating updates to address the earlier issues.

In this context, it is worth mentioning that OpenAI’s CEO, Sam Altman, recently expressed his disappointment in a tweet following the launch of an investigation by the Federal Trade Commission into whether ChatGPT has violated consumer protection laws. Altman emphasized OpenAI’s transparency about the limitations of their technology and their capped-profits structure, which ensures they aren’t incentivized to prioritize unlimited returns over quality.

The concerns raised by the study, coupled with the ongoing investigation, indicate that OpenAI needs to address the decline in the performance of GPT-4. Maintaining the cutting-edge quality of their language models is crucial for OpenAI’s reputation and the trust it has established with its users and partners. By acknowledging the issues and taking steps to rectify them, OpenAI can ensure that their language models continue to deliver accurate and reliable results, fostering innovation and collaboration in the technological world.