karl3@writeme.com wrote:
> karl3@writeme.com wrote:
> > so http://github.com/karl3wm/httptransformer now has a resuming class
> > i can load llama 405b and process 0.00002% of it on my tiny system, then 
> > reboot the system and start right off again at 0.00002% and process up to 
> > 0.00004% ! :D
> > you can try this with `python test_nettensor.py` but I _do not recommend 
> > doing it unless you are spending the time working on it_ because it is 
> > incredibly inefficient. i do expect it to correctly complete prompts if left
> > to run for weeks.
> > of course the intent of this project was to implement models that provide
> > for amortizable completion in much smaller time, like deepseek or mistral, or
> > any ad-hoc setup providing for multi-token completion such as an assistant
> > model [1]
> > using it for llama 405b is just for fun
> > 1: 
> > https://huggingface.co/docs/transformers/main/en/main_classes/text_generatio...
> >  note that you can probably implement this much more effectively than 
> > huggingface did by engaging logits and attention internals
> > it doesn't do any parallelization --
> > i tried it on google colab, but it actually ran at about the same speed
> > because of the synchronous, streaming nature of the downloads --
> > so there's some interest in focusing on speeding up the network portion,
> > the biggest bottleneck
> > ok! so how can we make it download faster :s
> ideas might include:
> - prefetching in the background (really baseline)
> - downloading compressed data and decompressing it [this opens up an 
> interesting space of weight compression]
> - downloading only the top-k items, which could be very powerful; it works
> better if the model weights are permuted so similar values end up in similar
> rows, and alternatively this might pair well with lora too.
> - opening multiple connections at once, possibly accessing multiple sources
> (see the sketch just below the quote)
> 
> :s
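
rough sketch of the multiple-connections idea, assuming the server honors
HTTP Range requests (hf's CDN does); fetch_range/fetch_parallel are made-up
names here, not httptransformer's actual API:

import concurrent.futures
import requests

def fetch_range(url, start, end):
    # HTTP Range headers are end-inclusive
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return start, resp.content

def fetch_parallel(url, start, length, workers=4):
    # split [start, start+length) into `workers` ranges, fetch concurrently
    chunk = length // workers
    bounds = []
    for i in range(workers):
        lo = start + i * chunk
        hi = start + length - 1 if i == workers - 1 else lo + chunk - 1
        bounds.append((lo, hi))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: fetch_range(url, b[0], b[1]), bounds))
    return b"".join(data for _, data in sorted(parts))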

there's a small space of ways prefetching could be calculated via lazy
evaluation, but torch hides the operator graph tho ummmm
might be easiest to do a forward pass with virtual/meta tensors, awkward tho
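
something like this, maybe: build the model on torch's "meta" device (shapes
only, no data), run one forward, and record the order modules execute in;
that order becomes the prefetch schedule. the toy model here is illustrative,
not llama:

import torch
import torch.nn as nn

def trace_module_order(model, *example_args):
    # record the order in which submodules begin their forward pass
    order = []
    hooks = [
        m.register_forward_pre_hook(
            lambda mod, args, name=name: order.append(name))
        for name, m in model.named_modules() if name
    ]
    try:
        model(*example_args)
    finally:
        for h in hooks:
            h.remove()
    return order

with torch.device("meta"):  # parameters get shapes but no storage
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
example = torch.empty(1, 16, device="meta")
print(trace_module_order(model, example))  # -> ['0', '1', '2']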

huh! i wonder how prefetching could work best?
now, prefetching would be easy inside single operators. the code breaks data
into chunks and calculates each operator over only part of it at once, so at
that point it knows clearly what's coming up.
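
toy sketch of that: a matmul done block-by-block, where the next block of the
(remote) weight matrix is fetched in the background while the current one is
multiplied. fetch_block is a hypothetical stand-in for the actual HTTP read:

import concurrent.futures
import torch

def fetch_block(weight_source, i, chunk):
    # stand-in: really an HTTP range read of rows [i*chunk, (i+1)*chunk)
    return weight_source[i * chunk:(i + 1) * chunk]

def chunked_matmul(x, weight_source, rows, chunk=1024):
    out = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(fetch_block, weight_source, 0, chunk)
        for i in range((rows + chunk - 1) // chunk):
            block = nxt.result()
            if (i + 1) * chunk < rows:
                # kick off the next fetch before computing this block
                nxt = pool.submit(fetch_block, weight_source, i + 1, chunk)
            out.append(x @ block.T)
    return torch.cat(out, dim=-1)

# sanity check against a plain matmul:
w = torch.randn(4096, 16); x = torch.randn(2, 16)
assert torch.allclose(chunked_matmul(x, w, 4096), x @ w.T)
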
another space would be module children: when a module is forwarded we can
probably guess all its children will be needed soon. similarly layers.
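
e.g. with a forward pre-hook (prefetch() is hypothetical here, whatever
actually triggers the remote tensor download in httptransformer):

import concurrent.futures
import torch.nn as nn

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def prefetch(param):
    pass  # stand-in: touch the remote tensor so its bytes start downloading

def install_child_prefetch(model: nn.Module):
    def pre_hook(module, args):
        # when a module starts forwarding, guess its direct children are next
        for child in module.children():
            for p in child.parameters(recurse=False):
                pool.submit(prefetch, p)
    for m in model.modules():
        m.register_forward_pre_hook(pre_hook)
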
... [ the umm, current resume code doesn't engage the operators --
