Grok 4: Elon Musk's AI outperforms competitors, not yet AGI

  • Grok 4 delivers significant improvements over previous iterations through reinforcement learning with verifiable rewards (RLVR).
  • Grok 4 excels on benchmarks, surpassing Google and OpenAI models.
  • Grok 4 shows promise in real-world tasks like game development.

Elon Musk's AI startup xAI has launched Grok 4, the latest and most capable version of its Grok chatbot. Musk boldly claims it is “smarter than almost all graduate students in all disciplines simultaneously,” an assertion grounded in Grok 4's strong performance on a range of benchmarks, which xAI says demonstrates a significant leap in large language model (LLM) capabilities.

Grok 4's main advance over its predecessors is its heavy use of reinforcement learning with verifiable rewards (RLVR). In reinforcement learning, an AI agent receives rewards or penalties based on its actions and gradually learns to make better decisions. The original Grok focused on next-token prediction, the fundamental objective of language modeling. Grok 3 scaled pre-training with roughly a tenfold increase in compute, and Grok 3.5 added reasoning capabilities trained with reinforcement learning; Grok 4 pushes this further with a heavy emphasis on RLVR. The core idea is to train the model repeatedly on problems with known answers, such as math equations or scientific facts, and to reward each correct solution, steadily improving its reasoning. During the demonstration, xAI's engineers said they are running out of such training problems, suggesting that real-world environments, with their effectively unlimited verifiable feedback, will become the next training ground.
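xAI has not published its training code, but the verifiable-rewards idea itself is simple enough to sketch. The Python below is purely illustrative: the Problem type, the verifier, and the update hook are hypothetical stand-ins, not xAI's implementation. The point is only that the reward comes from checking a generated answer against known ground truth rather than from a learned judge.

```python
# Illustrative sketch of reinforcement learning with verifiable rewards (RLVR).
# All names here are hypothetical stand-ins, not xAI's training code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    prompt: str                      # e.g. a math question with a known answer
    verify: Callable[[str], bool]    # returns True only for a provably correct answer


def verifiable_reward(answer: str, problem: Problem) -> float:
    """Binary reward: 1.0 for a verifiably correct answer, 0.0 otherwise."""
    return 1.0 if problem.verify(answer) else 0.0


def rlvr_step(generate: Callable[[str], str],
              update: Callable[[str, str, float], None],
              batch: List[Problem]) -> float:
    """One training step: sample answers, score them, reinforce the correct ones."""
    total = 0.0
    for problem in batch:
        answer = generate(problem.prompt)             # model proposes a solution
        reward = verifiable_reward(answer, problem)   # checked, not guessed
        update(problem.prompt, answer, reward)        # e.g. a policy-gradient update
        total += reward
    return total / len(batch)                         # fraction of problems solved


# A toy verifier: the answer to "What is 17 * 24?" can be checked by computation.
example = Problem(prompt="What is 17 * 24?",
                  verify=lambda a: a.strip() == str(17 * 24))
```

In practice the update callback would apply an RL step such as a policy-gradient update to the model's weights; the sketch only shows where the verifiable reward enters the loop.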

The claim that Grok 4 is the smartest LLM rests on its high scores on popular benchmarks that test question answering, logical problem solving, pattern recognition, and coding. The tech industry has made a habit of greeting each new model as the 'best and most advanced AI yet'; benchmark scores are a useful yardstick, but real-world performance and practical utility can differ substantially.

Against that backdrop, Grok 4's results are remarkable across categories, especially on the 'Humanity's Last Exam' benchmark, which tests knowledge and understanding in academic fields like biology, physics, computer science, and engineering and is designed to challenge even bright human experts. Without tools, Grok 4 scored 26.9%, surpassing Google's Gemini 2.5 Pro (21.6%) and OpenAI's o3 (around 21%). With tools such as web browsing, memory, and coding environments, it scored 41%. Grok 4 Heavy, which uses multiple AI agents to solve problems collaboratively, reached 50.7%. This multi-agent architecture is what distinguishes Grok 4 Heavy: the agents share insights and refine a collective response (a generic sketch of the pattern follows the benchmark rundown below).

Another key benchmark is ARC-AGI, which assesses abstract reasoning and problem-solving. Grok 4 scored 15.9% on ARC-AGI-2, roughly double the previous top score of 8% set by Claude Opus 4. Greg Kamradt, president of the ARC Prize Foundation, suggested the result points to a breakthrough, demonstrating non-zero levels of fluid intelligence.

The demonstration also showcased visualizations, sports predictions, and game design. Grok 4 produced a scientifically plausible visual of colliding black holes, and its access to real-time data let it organize timelines of reactions and news developments. On GPQA it scored 88.9%, and on MathArena it achieved 96.7%. It also performed exceptionally well on the USA Math Olympiad (79.4%) and on AIME 2025, the American Invitational Mathematics Examination (100%). Its LiveCodeBench results suggest Grok 4 is a top-tier coder.
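As noted above, Grok 4 Heavy's gains come from running several agents on the same problem. xAI has not published how the agents coordinate, so the Python below is only a generic sketch of the 'parallel drafts plus a judge' pattern; the solve_with_agents function, the stub agents, and the trivially simple judge are hypothetical placeholders, not xAI's design.

```python
# Hypothetical sketch of a "several agents, one refined answer" pattern, in the
# spirit of what xAI describes for Grok 4 Heavy. Not xAI's actual architecture.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def solve_with_agents(question: str,
                      agents: List[Callable[[str], str]],
                      judge: Callable[[str, List[str]], str]) -> str:
    """Run several agents on the same question in parallel, then merge their drafts."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        drafts = list(pool.map(lambda agent: agent(question), agents))
    # The judge sees every draft, so an insight found by one agent can correct
    # or refine the others before a single answer is returned.
    return judge(question, drafts)


# Toy usage with stub agents; in practice each agent would be a separate model call.
if __name__ == "__main__":
    agents = [lambda q, i=i: f"draft {i}: an answer to {q!r}" for i in range(3)]
    pick_longest = lambda q, drafts: max(drafts, key=len)   # trivial "judge"
    print(solve_with_agents("Why is the sky blue?", agents, pick_longest))
```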

Grok 4 was also put through Vending-Bench, a simulation of running a vending machine under budget and inventory constraints: the AI agent must handle orders, manage stock and pricing, and ultimately make money, which makes the test a measure of long-horizon coherence. Grok 4 finished with a net worth of $4,700, ahead of GPT-3.5 ($1,800) and human test takers ($844), demonstrating an ability to reason, plan, and act in unpredictable situations.

Users have also showcased distinctive use cases. An xAI team member built a first-person shooter game in just four hours, automating tasks such as asset sourcing, game logic, and visuals. Musk has claimed that AI will eventually generate full-fledged AAA titles, and the demo shows how far AI-assisted game development has come. xAI is currently training its Foundation Model v7 and plans to unveil a coding-specialized model in August, a multimodal agent in September, and a video generation model in October.

Musk's claim that Grok 4 outperforms graduate students deserves context, however. Grok 4 is still an LLM, which means it is prone to hallucinations like other AI models. It excels at structured tasks such as math and code but struggles with spatial reasoning and nuanced visual understanding. Musk himself clarified that the “graduate-level” intelligence remark was based on academic tests, and some users noted that the charts xAI shared may exaggerate the gaps between models; on full multimodal benchmarks, Grok 4 is only a modest improvement over Gemini 2.5 Pro. Grok 4 is a step forward, but it is not yet Artificial General Intelligence (AGI), the theoretical AI system with human-level cognitive abilities. It lacks agency, goals, and the ability to learn from its mistakes; it mimics thinking but is not yet an autonomous thinker.

xAI launched Grok 4 and Grok 4 Heavy on July 10. Grok 4 is based on xAI's Foundation Model v6 and can be accessed through xAI's platform or via an API, with a 256K-token context window, multimodal reasoning, real-time web access, and enterprise-grade security. Access to Grok 4 is priced at $30 a month, while Grok 4 Heavy, under the SuperGrok Heavy plan, costs $300 a month or $3,000 a year.
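For developers, xAI documents its API as compatible with the OpenAI SDK. Assuming that, a minimal request could look like the sketch below; the base URL and model identifier are assumptions drawn from xAI's public documentation at launch and should be verified against the current docs.

```python
# Minimal sketch of calling Grok 4 through xAI's API with the OpenAI Python SDK.
# The base URL and model name are assumptions; check xAI's current documentation.

import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],   # key issued from the xAI console
    base_url="https://api.x.ai/v1",      # assumed xAI endpoint
)

response = client.chat.completions.create(
    model="grok-4",                       # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",
         "content": "In two sentences, what is reinforcement learning with verifiable rewards?"},
    ],
)

print(response.choices[0].message.content)
```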

Source: Is Grok 4 the smartest AI model yet? Why Elon Musk’s new model is winning praise
