"*indicating fundamental differences with human intelligence.*" Dude! Human intelligence is the property of consciousness of being able to bring new ideas into existence out of nothing. You cannot simulate such a thing. Omg... so many children! When will you people ever grow up ?
On Saturday, 21 December 2024 at 12:14:05 UTC+2 PGC wrote:

> Not enough detail for any conclusions. According to the blog post on arcprize.org (see here: https://arcprize.org/blog/oai-o3-pub-breakthrough ),
>
> *“OpenAI’s new o3 system—**trained on the ARC-AGI-1 Public Training set**—has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.”*
>
> This excerpt is central to the discussion. The blog is announcing what it calls a “breakthrough” result, attributing the model’s performance on an evaluation set to the new “o3 system.” The mention of the “$10k compute limit” probably refers to a constraint or budget allocated for training and/or inference on the public leaderboard. Additionally, there is a statement that when the system is scaled up dramatically (172 times more compute resources), it manages to score 87.5%. The difference between the 75.7% result and the 87.5% result is thus explained by a large disparity in the computational budget used for training or inference.
>
> More significantly, as I’ve highlighted in bold, the model was explicitly trained on the very same data (or a substantial subset of it) against which it was later tested. The text itself says: “trained on the ARC-AGI-1 Public Training set” and then, in the next phrase, reports scores on “the Semi-Private Evaluation set,” which is presumably meant to be the test portion of that same overall dataset (or at least closely related).
> While it is possible in machine learning to maintain a strictly separate portion of data for testing, *the mention of “Semi-Private” invites questions about how distinct or withheld that portion really is.* If the “Semi-Private Evaluation set” is derived from the same overall data distribution used in training, or if it contains overlapping examples, then the resulting 75.7% or 87.5% scores might reflect overfitting/memorization more than genuine progress toward robust, generalizable intelligence.
>
> The separation of training data from test or evaluation data is critical to ensure that performance metrics capture generalization, rather than the model having “seen” or memorized the answers during training. When a blog post highlights a “breakthrough” but simultaneously acknowledges that the data used to measure said breakthrough was closely related to the training set, skeptics like yours truly naturally question whether this milestone reflects tuning to a known distribution rather than a leap in fundamental capabilities. Memorization is not reasoning, as I’ve stated many times before. But this falls on deaf ears here all the time.
>
> Beyond the bare mention of “trained on the ARC-AGI-1 Public Training set,” there is an implied process of repeated tuning or hyperparameter searches. If the developers iterated many times over that dataset—adjusting parameters, architecture decisions, or training strategies to maximize performance on that very same distribution—then the reported results are likely inflated. In other words, repeated attempts to boost the benchmark score can lead to an overly optimistic portrayal of performance. This is what everybody calls “benchmark gaming,” or in milder terms, accidental overfitting to the validation/test data that was supposed to be isolated.
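To make the memorization-vs-generalization point concrete, here is a minimal, purely illustrative sketch (synthetic data and a toy lookup-table "model" of my own invention; nothing to do with o3's actual architecture or the ARC data): a system that merely memorizes its training examples scores perfectly on data it has seen and near chance on a genuinely held-out set.

```python
import random

random.seed(0)

# Synthetic "tasks": input is an integer, label is a noisy function of it,
# so the label cannot be predicted for unseen inputs without real structure.
def make_examples(xs):
    return [(x, (x * 7 + random.randint(0, 9)) % 10) for x in xs]

train = make_examples(range(100))
held_out = make_examples(range(100, 200))

# A pure memorizer: look the answer up in the training set, guess 0 otherwise.
table = dict(train)
def memorizer(x):
    return table.get(x, 0)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

train_acc = accuracy(memorizer, train)        # perfect by construction
held_out_acc = accuracy(memorizer, held_out)  # roughly chance (~10%)
print(f"train: {train_acc:.2f}, held-out: {held_out_acc:.2f}")
```

The gap between the two numbers is exactly what a strictly withheld evaluation set is supposed to expose; if the evaluation data overlaps the training data, the reported score drifts toward the first number rather than the second.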
> The blog post mentions two different performance figures: one achieved under the publicly stated $10k compute limit, and another, much higher score (87.5%) when the system was scaled up 172 times in compute expenditure. Consider the costliness of such experiments for so small a performance bump! That’s not a positive sign for optimists regarding this question.
>
> AI models can indeed be improved—sometimes dramatically—by throwing more compute at them. However, from the perspective of practicality or genuine progress, it is less impressive if the improvement depends purely on scaling up hardware resources by a large factor, rather than on a new or more efficient approach to learning and reasoning. Is this warranted if the tasks the model is solving do not require advanced reasoning and could be handled in a simpler way by a much smaller system (or a human child)?
>
> If a large-scale AI system with a massive compute budget is merely matching or modestly exceeding the performance that a human child can achieve, that undercuts the notion of a major “breakthrough.” Additionally, children’s ability to adapt to novel tasks and generalize without being artificially “trained” on the same data is a key part of the skepticism: the kind of intelligence the AI system demonstrates may be narrower and more brittle than natural, human-like intelligence.
>
> All of this underscores how these “breakthrough” claims can be misleading if not accompanied by rigorous methodology (e.g., truly held-out data, minimal overlap, reproducible results under consistent compute budgets). While the raw numbers of 75.7% and 87.5% might look impressive at face value, the context provided—including the fact that the ARC-AGI-1 dataset was also used for training—casts doubt on the significance of those scores as an indicator of robust progress in AI or alignment research.
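Some rough back-of-envelope arithmetic on the scaling claim, using only the figures quoted from the blog (75.7%, 87.5%, the $10k limit, the 172x multiplier) and assuming — my assumption, not something the blog states — that dollar cost scales linearly with compute:

```python
low_score, high_score = 75.7, 87.5
low_budget = 10_000        # USD: the stated public leaderboard compute limit
scale_factor = 172         # the blog's high-compute multiplier

# Assumption: cost scales linearly with compute, so 172x compute ~ 172x dollars.
high_budget = low_budget * scale_factor          # ~ $1.72M

gain = high_score - low_score                    # ~ 11.8 percentage points
cost_per_point = (high_budget - low_budget) / gain
print(f"{scale_factor}x compute buys about {gain:.1f} points, "
      f"i.e. roughly ${cost_per_point:,.0f} per extra percentage point")
```

Under that (admittedly crude) linear-cost assumption, each additional percentage point in the high-compute run costs on the order of $145k — which is the quantitative shape of the "so small a performance bump" complaint above.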
> I leave you with a quote from the blog:
>
> *Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.*
>
> *Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.*
>
> And even with that, the memory vs. reasoning problem doesn’t vanish. Either you believe your interlocutor is generally intelligent or you don’t. But I’ve repeated this so many times that it’s getting too time-consuming to keep responding in detail. Thanks to John anyway for posting past all the narcissism-needing-therapy spam here. It’s getting tedious; I have to agree with Quentin. Fewer and fewer posts with big picture + a good level of nuance.
>
> On Saturday, December 21, 2024 at 5:17:52 PM UTC+8 Cosmin Visan wrote:
>
>> @Brent. Shut up you woke communist! In case you don't know, you are a straight white male. If that woke feminazi would have won the election, you would have been the first to be exterminated. Be glad that Trump won!

-- 
You received this message because you are subscribed to the Google Groups "Everything List" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion visit https://groups.google.com/d/msgid/everything-list/f758d4c3-6232-4fa1-a6da-b0f6cb482713n%40googlegroups.com.

