karl3@writeme.com wrote:
> > so http://github.com/karl3wm/httptransformer now has a resuming class
> > i can load llama 405b and process 0.00002% of it on my tiny system, then
> > reboot the system and start right off again at 0.00002% and process up to
> > 0.00004% ! :D
> > > you can try this with `python test_nettensor.py` but I _do not recommend
> > > doing it unless you are spending the time working on it_ because it is
> > > incredibly inefficient. i do expect it to correctly complete prompts if
> > > left to run for weeks.
> > of course the intent of this project was to implement models that provide
> > for amortizable completion in much smaller time, like deepseek or mistral,
> > or any ad-hoc setup providing for multi-token completion such as an
> > assistant model [1]
> > using it for llama 405b is just for fun
> 1:
> https://huggingface.co/docs/transformers/main/en/main_classes/text_generatio...
> note that you can probably implement this much more effectively than
> huggingface did by engaging logits and attention internals
> > it doesn't do any paralleliz--
> > i tried it on google colab, but it actually ran at about the same speed
> > because of the synchronous streaming nature --
> > so there's some interest in focusing on speeding up the network portion,
> > the biggest bottleneck
ok! so how can we make it download faster :s

ideas might include:

- prefetching in the background (really baseline) -- see the prefetch sketch below
- downloading compressed data and decompressing it [this opens up an interesting space of weight compression] -- see the streaming-gzip sketch below
- downloading only top-k items; this could be very powerful, and it works better if the model weights are shuffled so that similar things land in similar rows. alternatively this might work well with lora too -- see the row-fetch sketch below
- opening multiple connections at once, possibly accessing multiple sources :s -- see the parallel-range sketch below
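here's a minimal sketch of the background-prefetch idea, assuming the weights are served over plain HTTP with Range support. the `Prefetcher` class and all its names are hypothetical, not httptransformer's actual API:

```python
# background-prefetch sketch -- hypothetical names, not httptransformer's API.
# assumes the server honors HTTP Range requests.
import queue
import threading
import urllib.request

def fetch_range(url, start, end):
    # fetch bytes [start, end] inclusive with a Range header
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

class Prefetcher:
    """Fetches upcoming byte ranges on a worker thread so network
    round-trips overlap with compute on the previous chunk."""
    def __init__(self, url):
        self.url = url
        self.requests = queue.Queue()
        self.results = {}
        self.ready = threading.Condition()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            span = self.requests.get()
            data = fetch_range(self.url, *span)
            with self.ready:
                self.results[span] = data
                self.ready.notify_all()

    def prefetch(self, start, end):
        # enqueue a range to be fetched in the background
        self.requests.put((start, end))

    def get(self, start, end):
        # block until the worker has delivered this range
        with self.ready:
            while (start, end) not in self.results:
                self.ready.wait()
            return self.results.pop((start, end))
```

the pattern would be: while computing layer N, call `prefetch()` for layer N+1's ranges; then `get()` only blocks if the network is still behind the compute.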
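on compression: raw bf16 weights are close to incompressible, so this probably only pays off with a transformed representation (quantization, deltas, ...). purely as a sketch, if you controlled a server or proxy that gzips the stream, you could decompress it incrementally like this; the hosting setup and helper name are assumptions:

```python
# streaming-gzip sketch -- assumes YOU control a server or proxy that
# serves the blob gzip-compressed; stock weight hosting usually won't.
import urllib.request
import zlib

def fetch_compressed(url, chunk_size=1 << 16):
    """Yield decompressed chunks of a gzip-encoded response as they arrive."""
    decomp = zlib.decompressobj(wbits=31)  # wbits=31 expects gzip framing
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        while True:
            block = resp.read(chunk_size)
            if not block:
                break
            yield decomp.decompress(block)
    yield decomp.flush()
```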
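for the top-k idea, range requests make it possible to pull just the rows you want. a row-fetch sketch below; the layout parameters (base offset, dtype, row-major order) are illustrative assumptions -- with safetensors you'd read the real offsets out of the file's JSON header first. it coalesces runs of adjacent rows into single requests, which is exactly where shuffling weights for locality would pay off:

```python
# row-fetch sketch: pull only selected rows of a remote row-major matrix.
# layout parameters are illustrative assumptions, not a real file's header.
import urllib.request
import numpy as np

def fetch_range(url, start, end):
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def fetch_rows(url, base_offset, shape, dtype, rows):
    """Fetch the given sorted row indices, coalescing consecutive
    rows into one Range request each."""
    n_rows, n_cols = shape
    assert rows and rows[-1] < n_rows
    row_bytes = n_cols * np.dtype(dtype).itemsize
    # group consecutive indices into runs
    runs, run = [], [rows[0]]
    for r in rows[1:]:
        if r == run[-1] + 1:
            run.append(r)
        else:
            runs.append(run)
            run = [r]
    runs.append(run)
    chunks = []
    for run in runs:
        start = base_offset + run[0] * row_bytes
        data = fetch_range(url, start, start + len(run) * row_bytes - 1)
        chunks.append(np.frombuffer(data, dtype=dtype).reshape(len(run), n_cols))
    return np.concatenate(chunks, axis=0)
```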
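and the multiple-connections idea is mostly just splitting one big range across workers (multiple sources would mean pointing different workers at mirror URLs). the connection count here is a made-up default:

```python
# parallel-range sketch: split one byte range across concurrent connections.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch_range(url, start, end):
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def parallel_fetch(url, start, end, connections=8):
    """Download bytes [start, end] with `connections` concurrent Range
    requests and reassemble the chunks in order."""
    total = end - start + 1
    chunk = (total + connections - 1) // connections
    spans = [(start + i * chunk, min(start + (i + 1) * chunk - 1, end))
             for i in range(connections)]
    spans = [(s, e) for s, e in spans if s <= e]  # drop empty tail spans
    with ThreadPoolExecutor(max_workers=connections) as pool:
        parts = pool.map(lambda span: fetch_range(url, *span), spans)
    return b"".join(parts)
```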
