The video discusses the recent release of GPT-5.2, highlighting both its impressive benchmark performance and its notable shortcomings. The creator points out bizarre errors the model makes, such as miscounting letters in words and drawing illogical financial comparisons. Despite these lapses, the model excels on traditional benchmarks, especially the high-level research tasks covered by GDPval. When tested on more realistic and complex benchmarks, however, including those built by the creator and by other YouTubers such as AI Explained, GPT-5.2 often underperforms competitors like Claude 4 and Gemini 2.5 Pro. The discrepancy suggests that while GPT-5.2 is smart on paper, its practical usability and reliability are questionable.

One of the creator’s key insights is that GPT-5.2 seems to have been “benchmaxed”: optimized heavily for benchmark scores at the expense of real-world performance and usability. This shows up in its inconsistent behavior, such as poor spatial reasoning on certain tests and a tendency to stray from instructions. The creator contrasts GPT-5.2 with models like Kimi K2, which, although underrated, offers a more pleasant and reliable user experience. The video also highlights the cost inefficiency of running GPT-5.2 at higher settings, where marginal quality gains come at significantly higher computational expense.

The creator introduces a new testing project called the “writing arena,” where multiple models write essays, review each other’s work, and then revise their essays based on feedback. This iterative process reveals that GPT-5.2 is particularly strong at following instructions and applying feedback, outperforming models like Gemini 3 Pro in essay quality after revisions. The feedback from other models, especially Claude, helps GPT-5.2 improve its writing by adding concrete evidence and refining arguments. In contrast, Gemini 3 Pro struggles with coherence and structure, producing weaker essays regardless of feedback.
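The video does not show the arena's actual harness, but the described loop (draft, cross-review, revise) is simple to picture in code. Below is a minimal sketch under stated assumptions: `generate(model, prompt)` is a hypothetical stand-in for whatever completion API the creator uses, and the model name strings are purely illustrative.

```python
# Sketch of a "writing arena" round: each model drafts an essay, reviews
# every other model's draft, then revises its own draft using the pooled
# feedback. `generate` is a hypothetical helper, not a real API.

MODELS = ["gpt-5.2", "gemini-3-pro", "claude-opus-4.5"]  # illustrative names

def generate(model: str, prompt: str) -> str:
    # Plug in your completion API of choice here.
    raise NotImplementedError

def writing_arena(topic: str, rounds: int = 1) -> dict[str, str]:
    # Round 0: every model produces an initial draft.
    essays = {m: generate(m, f"Write a persuasive essay on: {topic}") for m in MODELS}
    for _ in range(rounds):
        # Cross-review: every model critiques every other model's draft.
        feedback: dict[str, list[str]] = {m: [] for m in MODELS}
        for reviewer in MODELS:
            for author, essay in essays.items():
                if reviewer == author:
                    continue
                critique = generate(
                    reviewer,
                    f"Give concrete, actionable feedback on this essay:\n\n{essay}",
                )
                feedback[author].append(critique)
        # Revision: each model rewrites its own draft against the feedback.
        essays = {
            m: generate(
                m,
                "Revise your essay using this feedback. Keep the thesis and "
                "strengthen the evidence.\n\nESSAY:\n" + essays[m]
                + "\n\nFEEDBACK:\n" + "\n---\n".join(feedback[m]),
            )
            for m in MODELS
        }
    return essays
```

In this framing, the video's finding is that GPT-5.2's revised drafts improve the most between rounds, particularly when the pooled feedback includes Claude's critiques, while Gemini 3 Pro's drafts stay weak regardless of what feedback it receives.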

Speed and efficiency are also major points of comparison. The creator demonstrates that GPT-5.2 is significantly slower than other models like Composer and Opus when performing tasks such as generating to-do lists or making code changes. This sluggishness hampers productivity and contrasts sharply with the fast, responsive experience offered by models integrated into tools like Cursor. The creator emphasizes that while intelligence is important, the ability to quickly and accurately follow instructions without unnecessary distractions is equally critical for practical use.
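The video's speed comparisons come from hands-on use rather than a formal benchmark, but quantifying them is straightforward: time identical prompts across models and average over a few trials. A rough sketch follows, again assuming the hypothetical `generate` helper and illustrative model names; note this measures wall-clock latency only, which also depends on network and server load, not output quality.

```python
import time

def generate(model: str, prompt: str) -> str:
    # Plug in your completion API of choice here.
    raise NotImplementedError

def time_model(model: str, prompt: str, trials: int = 3) -> float:
    """Average wall-clock seconds for `model` to complete `prompt`."""
    elapsed = []
    for _ in range(trials):
        start = time.perf_counter()
        generate(model, prompt)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / trials

PROMPT = "Generate a 10-item to-do list for refactoring a small web app."
for model in ["gpt-5.2", "composer", "claude-opus-4.5"]:  # illustrative names
    print(f"{model}: {time_model(model, PROMPT):.1f}s")
```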

In conclusion, the creator expresses mixed feelings about GPT-5.2. While it is undeniably one of the smartest models in terms of raw capability, it falls short in usability, speed, and consistent performance. The video suggests that GPT-5.2’s focus on benchmark optimization has led to a model that is smart but not necessarily good for everyday tasks. The creator prefers models like Opus 4.5 for their balance of intelligence and usability and views Gemini 3 Pro as a model with potential but significant flaws. Ultimately, the video calls for a shift in focus from purely chasing benchmark scores to developing models that are faster, more reliable, and better at following user instructions.


