karl3@writeme.com wrote:
> karl3@writeme.com wrote:
> > there was some energy around making a network-based inference engine,
> > maybe by modifying deepseek.cpp (don't quite recall why I didn't stay in
> > python, some concern arose)
> > the task got weak; I found cinatra as a benchmark leader for c++ web
> > engines (although pico.v was the top! surprised all the c++ http engines
> > were beaten by java O_O, very curious about this, wondering if it's a
> > high-end test system; never heard of the V language before, but it's
> > interesting that it won a leaderboard)
> > inhibition ended up discovering a concern somewhat like ... on this 4GB
> > ram system it might take 15-33GB of network transfer for each forward
> > pass of the model ... [multi-token passes would help ^^]
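> >
> > (rough arithmetic behind that estimate; the 30GB total weight size is a
> > placeholder, not a measured number for the actual model:)
> >
> >     weights_gb = 30.0  # assumed total size of the model weights
> >     ram_gb = 4.0       # local RAM available to cache them
> >     # whatever doesn't fit locally is re-fetched every forward pass
> >     per_pass_gb = max(weights_gb - ram_gb, 0.0)
> >     print(f"~{per_pass_gb:.0f} GB fetched per single-token pass")
> >     # batching k tokens into one pass amortizes the fetch across tokens
> >     k = 8
> >     print(f"~{per_pass_gb / k:.1f} GB per token with {k}-token passes")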
> > karl3@writeme.com wrote:
> > the concern resonates with difficulty making the implementation, and some
> > form of inhibition or concern around using python. notably, i've made
> > offloading python hooks a lot and they never last due to the underlying
> > interfaces changing (although those interfaces have stabilized much more
> > now that hf made accelerate their official implementation; sketch below)
> > (also i think the issue is more severe dissociative associations than the
> > interface, if one considers the possibility of personal maintenance and
> > use rather than usability for others). i don't immediately recall the
> > python concern
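> >
> > (for reference, the stabilized interface looks roughly like this; a
> > sketch assuming a transformers-style checkpoint, where "model-id" and
> > "checkpoint-dir" are placeholders:)
> >
> >     from accelerate import init_empty_weights, load_checkpoint_and_dispatch
> >     from transformers import AutoConfig, AutoModelForCausalLM
> >
> >     config = AutoConfig.from_pretrained("model-id")  # placeholder id
> >     with init_empty_weights():  # build the module tree, no allocation
> >         model = AutoModelForCausalLM.from_config(config)
> >     # accelerate installs the offload hooks while dispatching layers
> >     model = load_checkpoint_and_dispatch(
> >         model, "checkpoint-dir", device_map="auto",
> >         offload_folder="offload")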
> > i seem to be off task or taking a break, but it would make sense to do
> > disk caching too (sketched below). there is also the option of
> > quantizing. basically, LLMs and AI in general place r&d effort between
> > the user and ease, smallness, cheapness, power, etc
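> >
> > (a minimal sketch of the disk-caching idea: keep weights on disk
> > per-layer and map them in only while that layer runs; the per-layer .npy
> > file layout here is hypothetical:)
> >
> >     import numpy as np
> >
> >     def run_layer(idx, acts):
> >         # hypothetical per-layer weight files; mmap avoids reading the
> >         # whole file, so resident memory stays near one layer's worth
> >         w = np.load(f"weights/layer_{idx}.npy", mmap_mode="r")
> >         return acts @ w  # stand-in for the real layer computation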

I poked at python again. The existing implementations of the 8-bit
quantization used by the model all require an NVIDIA GPU, which I do not
presently have. It is fun to imagine making it work; maybe I can upcast the
weights to float32 or something >)
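
(a sketch of what that upcast could look like, assuming the checkpoint
stores int8 weights with a per-tensor scale; the layout is an assumption,
not the model's actual format:)

    import numpy as np

    def dequantize_to_fp32(q, scale):
        # int8 -> float32 on the CPU; no GPU kernels required
        return q.astype(np.float32) * scale

    q = np.array([-128, 0, 127], dtype=np.int8)  # toy quantized tensor
    print(dequantize_to_fp32(q, scale=0.02))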
