On Sun, 17 May 2026 12:42:12 -0700 Roman Gushchin <[email protected]> wrote:
>
> > On May 17, 2026, at 11:57 AM, Theodore Tso <[email protected]> wrote:
> >
> > On Sun, May 17, 2026 at 11:17:06AM -0700, Roman Gushchin wrote:
> >>
> >> I actually tried to run it with ollama on my
> >> personal Framework 13. Adding nominal support is trivial, but the
> >> whole thing is not really useful: I can get maybe a few hundred
> >> tokens per second using a quantized model with reduced quality; an
> >> average sashiko review consumes 3.5 million tokens (with Gemini
> >> 3.1 pro, it’s also model-dependent).
> >
> > I'm curious. What hardware and LLM model were you using? A few
> > hundred tokens per second seems surprisingly high. My initial
> > research[1] shows that an M5 Max Macbook Pro costing 5 or 6 kilobucks
> > can do 31.6 tokens/second on a 27B 4-bit quantized model (Qwen 3.5).
>
> I have a Framework 13 with an AMD 7840U. I’ve tried several models both
> on CPU and GPU.
> Sorry, it was a couple of months ago and I don’t remember all the
> details, so I won’t claim any specific numbers, but as I remember the
> best numbers were around a hundred tokens per second. In any case it’s
> a few orders of magnitude slower than what is realistically required.
>
> If someone has powerful hardware and is willing to benchmark sashiko
> with open-source models, I’m very interested in the results.

If you post the patch you used with ollama somewhere, I can try running
it here and do some benchmarks, assuming that it won't try to consume
3.5 million tokens.

>
> > [1]
> > https://www.reddit.com/r/LocalLLaMA/comments/1rzkw4x/m5_max_128g_performance_tests_i_just_got_my_new/
> >
> > The model matters of course. With Gemma 3 27B and a 6-bit
> > quantization, it's 21 tokens/s, and with Deepseek R1 8B Q6_K, it's
> > 72.8 tokens/second. But unless you're using a really low-end model,
> > or a really expensive, splufty hardware platform, I haven't seen
> > reports of hundreds of tokens per second on hardware costing a
> > reasonable amount of memory. (I'll set aside the question of whether
> > spending $6k for a fully spec'ed out M5 Max Macbook Pro, or $15k for a
> > fully spec'ed out M3 Ultra Mac Studio is "reasonable".)
> >
> > As a result I'm not entirely sure how realistic it is to do reviews
> > using "free" (you still have to pay $$$ for the hardware) local,
> > open-weight LLM's if an average review requires around 3.5 million
> > tokens.
>
> Fully agree. But it might change in a few years, things are moving
> quickly.

Thanks,
Mauro
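
For a rough sense of scale, the figures quoted in this thread can be turned
into wall-clock time per review. This is only a back-of-the-envelope sketch:
it assumes every one of the 3.5 million tokens is processed at the quoted
tokens/second rate, which is probably pessimistic, since prompt (input)
tokens are usually processed much faster than generated ones. The helper
name review_hours() and the labels are illustrative, not part of sashiko
or ollama.

```python
# Back-of-the-envelope estimate of wall-clock time for one review.
# ASSUMPTION: all tokens are processed at the quoted tokens/second rate.
# Real runs mix fast prompt processing with slower generation, so treat
# the output as a rough upper bound, not a benchmark.

TOKENS_PER_REVIEW = 3_500_000  # average review size quoted in the thread

# Token rates (tokens/second) as quoted in the thread; labels are informal.
QUOTED_RATES = {
    "Qwen 3.5 27B 4-bit (M5 Max)": 31.6,
    "Gemma 3 27B 6-bit": 21.0,
    "Deepseek R1 8B Q6_K": 72.8,
    "Framework 13, best recalled (~100)": 100.0,
}


def review_hours(tokens: int, tokens_per_second: float) -> float:
    """Return the hours needed to process `tokens` at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second / 3600.0


if __name__ == "__main__":
    for label, rate in QUOTED_RATES.items():
        hours = review_hours(TOKENS_PER_REVIEW, rate)
        print(f"{label:36s} {rate:6.1f} tok/s  {hours:5.1f} h")
```

Under that assumption, even the best recalled rate of about a hundred
tokens per second works out to several hours per review.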

