karl3@writeme.com wrote:
> karl3@writeme.com wrote:
> > there was some energy around making a network-based inference engine,
> > maybe by modifying deepseek.cpp (don't quite recall why not staying in
> > python, some concern arose)
> > task got weak, found cinatra as a benchmark leader for c++ web engines
> > (although pico.v was the top! (surprised all c++ http engines were beaten
> > by java O_O very curious about this, wondering if it's a high-end test
> > system) never heard of V language but it's interesting it won a leaderboard)
> > inhibition ended up discovering a concern somewhat like ... on this 4GB ram
> > system it might take 15-33GB of network transfer for each forward pass of
> > the model ... [multi-token passes ^^
> karl3@writeme.com wrote:
> > the concern resonates with difficulty making the implementation, and some
> > form of inhibition or concern around using python. notably, i've made
> > offloading python hooks a lot and they never last due to the underlying
> > interfaces changing (although those interfaces have stabilized much more
> > now that hf made accelerate their official implementation) (also i think
> > the issue is more severe dissociative associations than the interface, if
> > one considers the possibility of personal maintenance and use rather than
> > usability for others). don't immediately recall the python concern
> >
> > seem to be off task or taking break, but it would make sense to do disk
> > caching too. there is also the option of quantizing. basically, LLMs and AI
> > in general place r&d effort between the user and ease, smallness, cheapness,
> > power, etc
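(aside, for scale: a back-of-envelope that lands near that 15-33GB figure, assuming the model is DeepSeek-V3 with its published ~37B activated parameters per token; the derivation is a guess at how the quoted range was reached, not a measurement:)

active_params = 37e9  # DeepSeek-V3 activates ~37B of its 671B params per token
ram = 4e9             # bytes this 4GB machine could keep resident
for bits in (4, 8):
    touched = active_params * bits / 8  # weight bytes read per forward pass
    fetched = touched - ram             # whatever doesn't fit locally comes over the network
    print(f"{bits}-bit: ~{fetched / 1e9:.0f} GB of network transfer per pass")
# prints ~14 GB at 4-bit and ~33 GB at 8-bit, close to the quoted 15-33GB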
I poked at python again. The existing implementations of the 8-bit quantization the model uses all require an NVIDIA GPU, which I do not presently have. It is fun to imagine making it work; maybe I can upcast the weights to float32 or something >)
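A minimal sketch of that upcast, assuming the layout DeepSeek-V3's released checkpoints use (torch.float8_e4m3fn weight tensors with a weight_scale_inv companion holding one inverse scale per 128x128 block); untested, and it needs a torch recent enough to have the float8 dtypes:

import torch

def dequant_fp8_to_fp32(w_fp8, scale_inv, block=128):
    # cast the fp8 tensor up to float32; plain CPU op, no GPU kernels involved
    w = w_fp8.to(torch.float32)
    # stretch each per-block inverse scale over its (block x block) tile
    s = scale_inv.repeat_interleave(block, dim=0)
    s = s.repeat_interleave(block, dim=1)
    # trim scale padding where the weight dims aren't multiples of block
    return w * s[: w.shape[0], : w.shape[1]]

The catch is that float32 quadruples the footprint of every fp8 tensor, so on 4GB it would have to run layer by layer, which circles back to the disk caching idea above.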