"*indicating fundamental differences with human intelligence.*"

Dude! Human intelligence is the capacity of consciousness to bring new 
ideas into existence out of nothing. You cannot simulate such a thing. 
Omg... so many children! When will you people ever grow up?

On Saturday, 21 December 2024 at 12:14:05 UTC+2 PGC wrote:

> Not enough detail for any conclusions. According to the blog post on 
> arcprize.org (see here: https://arcprize.org/blog/oai-o3-pub-breakthrough), 
>
> *“OpenAI’s new o3 system—**trained on the ARC-AGI-1 Public Training set**—has 
> scored a breakthrough 75.7% on the Semi-Private Evaluation set at our 
> stated public leaderboard $10k compute limit. A high-compute (172x) o3 
> configuration scored 87.5%.”*
>
> This excerpt is central to the discussion. The blog is announcing what it 
> calls a “breakthrough” result, attributing the model’s performance on an 
> evaluation set to the new “o3 system.” The mention of the “$10k compute 
> limit” probably refers to a constraint or budget allocated for training 
> and/or inference on the public leaderboard. Additionally, there is a 
> statement that when the system is scaled up dramatically (172 times more 
> compute resources), it manages to score 87.5%. The difference between the 
> 75.7% result and 87.5% result is thus explained by a large disparity in the 
> computational budget used for training or inference.
>
> More significantly, as I’ve highlighted in bold, the model was explicitly 
> trained on the very same data (or a substantial subset of it) against which 
> it was later tested. The text itself says: “trained on the ARC-AGI-1 Public 
> Training set” and then, in the next phrase, reports scores on “the 
> Semi-Private Evaluation set,” which is presumably meant to be the test 
> portion of that same overall dataset (or at least closely related). While 
> it is possible in machine learning to maintain a strictly separate portion 
> of data for testing, *the mention of “Semi-Private” invites questions 
> about how distinct or withheld that portion really is.* If the 
> “Semi-Private Evaluation set” is derived from the same overall data 
> distribution used in training, or if it contains overlapping examples, then 
> the resulting 75.7% or 87.5% scores might reflect overfitting/memorization 
> more than genuine progress toward robust, generalizable intelligence.
>
> The separation of training data from test or evaluation data is critical 
> to ensure that performance metrics capture generalization, rather than the 
> model having “seen” or memorized the answers in training. When a blog post 
> highlights a “breakthrough” but simultaneously acknowledges that the data 
> used to measure said breakthrough was closely related to the training set, 
> skeptics like yours truly naturally question whether this milestone is more 
> about tuning to a known distribution than about a leap in fundamental 
> capabilities. Memorization is not reasoning, as I’ve stated many times 
> before. But this falls on deaf ears here all the time.
>
> Beyond the bare mention of “trained on the ARC-AGI-1 Public Training set,” 
> there is an implied process of repeated tuning or hyperparameter searches. 
> If the developers iterated many times over that dataset—adjusting 
> parameters, architecture decisions, or training strategies to maximize 
> performance on that very same distribution—then the reported results are 
> likely inflated. In other words, repeated attempts to boost the benchmark 
> score can lead to an overly optimistic portrayal of performance. Taken far 
> enough, this becomes “benchmark gaming,” or in milder terms, accidental 
> overfitting to the validation/test data that was supposed to be isolated.
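The score inflation described above is easy to demonstrate with a toy simulation (a purely illustrative Python sketch; the tasks, the random-guesser "models," and all the numbers here are made up, not anything from the blog post):

```python
import random

random.seed(0)

# Purely illustrative: 100 binary "tasks" whose answers are random coin
# flips, so no model can genuinely beat 50% on unseen tasks.
val_answers = [random.randint(0, 1) for _ in range(100)]   # reused during tuning
test_answers = [random.randint(0, 1) for _ in range(100)]  # truly held out

def score(guesses, answers):
    return sum(g == a for g, a in zip(guesses, answers)) / len(answers)

# "Tuning": try 1000 random configurations and keep whichever looks best
# on the validation set -- accidental benchmark gaming in miniature.
best_val, best_model = -1.0, None
for _ in range(1000):
    candidate = [random.randint(0, 1) for _ in range(100)]  # a random guesser
    s = score(candidate, val_answers)
    if s > best_val:
        best_val, best_model = s, candidate

print(f"validation score of selected model: {best_val:.2f}")  # well above chance
print(f"held-out test score of same model:  {score(best_model, test_answers):.2f}")
```

With enough selection pressure against the same evaluation data, the "best" configuration scores well above 50% on the set it was tuned against, while its score on a genuinely held-out set falls back toward chance. That gap is exactly what a strictly withheld test set is supposed to expose.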
>
> The blog post mentions two different performance figures: one achieved 
> under the publicly stated $10k compute limit, and another, much higher 
> score (87.5%) when the system was scaled up 172 times in terms of compute 
> expenditure. Consider the costliness of such experiments for so small a 
> performance bump! That’s not a positive sign for optimists regarding this 
> question.
>
> AI models can indeed be improved—sometimes dramatically—by throwing more 
> compute at them. However, from the perspective of practicality or genuine 
> progress, it may be less impressive if the improvement depends purely on 
> scaling up hardware resources by a large factor, rather than demonstrating 
> a new or more efficient approach to learning and reasoning. Is the expense 
> warranted if the tasks the model is solving do not require advanced 
> reasoning at all, and a much smaller system (or a human child) can handle 
> them in a simpler way?
>
> If a large-scale AI system with a massive compute budget is merely 
> matching or modestly exceeding the performance that a human child can 
> achieve, it undercuts the notion of a major “breakthrough.” Additionally, 
> children’s ability to adapt to novel tasks and generalize without being 
> artificially “trained” on the same data is a key part of the skepticism: 
> the kind of intelligence the AI system is demonstrating might be narrower 
> or more brittle compared to natural, human-like intelligence.
>
> All of this underscores how these “breakthrough” claims can be misleading 
> if not accompanied by rigorous methodology (e.g., truly held-out data, 
> minimal overlap, reproducible results under consistent compute budgets). 
> While the raw numbers of 75.7% and 87.5% might look impressive at face 
> value, the context provided—including the fact that the ARC-AGI-1 dataset 
> was also used for training—casts doubt on the significance of those scores 
> as an indicator of robust progress in AI or alignment research.
>
> I leave you with a quote from the blog: 
>
> *Passing ARC-AGI does not equate to achieving AGI, and, as a matter of 
> fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, 
> indicating fundamental differences with human intelligence.*
>
> *Furthermore, early data points suggest that the upcoming ARC-AGI-2 
> benchmark will still pose a significant challenge to o3, potentially 
> reducing its score to under 30% even at high compute (while a smart human 
> would still be able to score over 95% with no training). This demonstrates 
> the continued possibility of creating challenging, unsaturated benchmarks 
> without having to rely on expert domain knowledge. You'll know AGI is here 
> when the exercise of creating tasks that are easy for regular humans but 
> hard for AI becomes simply impossible.*
>
> And even with that, the memory vs. reasoning problem doesn’t vanish. Either 
> you believe your interlocutor is generally intelligent, or you don’t. But 
> I’ve repeated this so many times that it’s getting too time-consuming to 
> keep responding in detail. Thanks to John anyway for posting past all the 
> narcissism-needs-therapy spam here. It’s getting tedious; I have to agree 
> with Quentin. Fewer and fewer posts with a big picture and a good level of 
> nuance. 
>
> On Saturday, December 21, 2024 at 5:17:52 PM UTC+8 Cosmin Visan wrote:
>
>> @Brent. Shut up you woke communist! In case you don't know, you are a 
>> straight white male. If that woke feminazi had won the election, you 
>> would have been the first to be exterminated. Be glad that Trump won! 
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Everything List" group.
To view this discussion visit 
https://groups.google.com/d/msgid/everything-list/f758d4c3-6232-4fa1-a6da-b0f6cb482713n%40googlegroups.com.