karl3＠writeme.com wrote:
> > so http://github.com/karl3wm/httptransformer now has a resuming class
> > i can load llama 405b and process 0.00002% of it on my tiny system, then 
> > reboot the system and start right off again at 0.00002% and process up to 
> > 0.00004% ! :D
> > > you can try this with `python test_nettensor.py` but I _do not recommend 
> > > doing it unless you are spending the time working on it_ because it is 
> > > incredibly inefficient. i do expect it to correctly complete prompts if 
> > > left to run for weeks.
> > of course the intent of this project was to implement models whose 
> > completion cost amortizes into much less time, like deepseek or mistral, or 
> > any ad-hoc setup providing multi-token completion, such as an assistant 
> > model [1]
> 
> using it for llama 405b is just for fun
> 
> 1: 
> https://huggingface.co/docs/transformers/main/en/main_classes/text_generatio...
>  note that you can probably implement this much more effectively than 
> huggingface did by engaging logits and attention internals
> > it doesn't do any parallelization --
> > i tried it on google colab, but it actually ran at about the same speed 
> > because of the synchronous streaming nature --
> > so there's some interest in focusing on speeding up the network portion, 
> > the biggest bottleneck
> >

ok! so how can we make it download faster :s
ideas might include (rough sketches of each follow, numbered in the same order):
- prefetching in the background (really baseline)
- downloading compressed data and decompressing it [this opens up an 
interesting space of weight compression]
- downloading only top-k items; this could be very powerful, and works better 
if the model weights are shuffled so that similar things land in similar rows; 
alternatively it might work well with lora too
- opening multiple connections at once, possibly to multiple sources
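
1. background prefetch -- while the consumer works on chunk i, a single worker
thread is already fetching chunk i+1. a minimal sketch; the url layout and
chunk size are made up for illustration, not httptransformer's actual
interface:

import concurrent.futures
import requests

def fetch_range(url, start, end):
    # HTTP Range request for bytes [start, end], inclusive on both ends
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    r.raise_for_status()
    return r.content

def iter_chunks_prefetched(url, total_size, chunk_size=1 << 20):
    # yield the file in order while the next chunk downloads in the background
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(fetch_range, url, 0, min(chunk_size, total_size) - 1)
        offset = 0
        while offset < total_size:
            data = nxt.result()
            offset += len(data)
            if offset < total_size:
                end = min(offset + chunk_size, total_size) - 1
                nxt = pool.submit(fetch_range, url, offset, end)  # prefetch next
            yield data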
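
2. compressed transfer -- a sketch assuming a gzip-compressed mirror of the
file exists at some url_gz (hypothetical; huggingface doesn't serve weights
that way). fp16/bf16 weights only compress modestly, but when the network is
the bottleneck any ratio above 1 is free speed:

import zlib
import requests

def stream_decompressed(url_gz):
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    dec = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with requests.get(url_gz, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(1 << 20):
            yield dec.decompress(chunk)
    tail = dec.flush()
    if tail:
        yield tail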
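
3. top-k rows only -- if you know where a tensor lives inside the remote file,
Range requests can pull just the rows you want. base_offset and n_cols are
placeholders here; with safetensors you'd read them out of the JSON header at
the front of the file:

import numpy as np
import requests

def fetch_rows(url, base_offset, n_cols, rows, dtype=np.float16):
    # assumes a row-major matrix of the given dtype at byte offset base_offset
    row_bytes = n_cols * np.dtype(dtype).itemsize
    out = np.empty((len(rows), n_cols), dtype=dtype)
    for i, row in enumerate(rows):
        start = base_offset + row * row_bytes
        r = requests.get(
            url, headers={"Range": f"bytes={start}-{start + row_bytes - 1}"}
        )
        r.raise_for_status()
        out[i] = np.frombuffer(r.content, dtype=dtype)
    return out

coalescing runs of adjacent rows into one range is the obvious next step,
which is exactly where shuffling similar rows next to each other would pay off.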
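
4. multiple connections / multiple sources -- split the file into parts and
fetch them concurrently, round-robining across whatever mirrors you have. the
mirror list is hypothetical:

import concurrent.futures
import requests

def parallel_download(urls, total_size, n_parts=8):
    # fetch n_parts byte ranges at once, spread across the given source urls
    part = -(-total_size // n_parts)  # ceiling division

    def grab(i):
        start = i * part
        end = min(start + part, total_size) - 1
        url = urls[i % len(urls)]  # round-robin across sources
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        r.raise_for_status()
        return i, r.content

    parts = [None] * n_parts
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_parts) as pool:
        for i, data in pool.map(grab, range(n_parts)):
            parts[i] = data
    return b"".join(parts)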

:s
