Minor correction. On 2025-01-27 01:44, Cordell Bloor wrote:
> After a single message, the prompt is going to be at least as long as a message, and I think the 6 t/s gain in PP will offset the 6 t/s loss in TG. From that point on, the tradeoff is a complete win.
My brain is clearly a little fried. The implied math is nonsense. It's not until prompt processing (PP) and token generation (TG) take the same amount of _time_ that trading 6 t/s from TG to PP becomes a net benefit. Since PP is so much faster than TG, that probably won't happen until 10-20 messages into the conversation.
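For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes a very simple model in which the time for one turn is just prompt tokens divided by the PP rate plus generated tokens divided by the TG rate. The function names and the example rates (240 t/s for PP, 60 t/s for TG) are made up for illustration and are not measurements from my workstation.

```python
def total_time(prompt_tokens, gen_tokens, pp_rate, tg_rate):
    """Seconds for one turn: process the prompt, then generate the reply."""
    return prompt_tokens / pp_rate + gen_tokens / tg_rate


def break_even_prompt_ratio(pp_rate, tg_rate, shift):
    """Ratio of prompt tokens to generated tokens above which moving
    `shift` t/s from TG to PP lowers the total time.

    Time saved in PP:  N_p * shift / (pp_rate * (pp_rate + shift))
    Time lost in TG:   N_g * shift / (tg_rate * (tg_rate - shift))
    Setting the two equal and solving for N_p / N_g gives the ratio.
    """
    if shift >= tg_rate:
        raise ValueError("shift must be smaller than tg_rate")
    return (pp_rate * (pp_rate + shift)) / (tg_rate * (tg_rate - shift))


# Hypothetical rates, chosen only to illustrate the shape of the tradeoff.
pp_rate, tg_rate, shift = 240.0, 60.0, 6.0
ratio = break_even_prompt_ratio(pp_rate, tg_rate, shift)
print(f"prompt must be about {ratio:.1f}x the generated tokens to break even")
```

With rates like these, the prompt has to grow to many times the length of a single reply before the trade pays off, which is why the benefit only shows up well into a conversation. With different rates the break-even point moves accordingly.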
And, frankly, I'm probably extrapolating a bit too much from a 10% performance difference on one workstation. In any case, the important bit was really that OpenBLAS brings the Prompt Processing step back up to rough parity. Maybe that's enough.
Sincerely, Cory Bloor