karl3＠writeme.com wrote:
> > https://github.com/karl3wm/httptransformer or maybe c++ or something
> > deepseek is designed with 5% evaluation size and pretrained speculative 
> > decode
> > so the next step i left was subsharding large weights.
> i have a potential bump today so i wanted to mention that subsharding looks 
> pretty easy, one approach is to use torch's __torch_function__ functionality 
> where it can treat any object as a tensor if it has a __torch_function__ 
> function (the examples show a class function but member functions may work 
> too), and it calls this function (if present) for operations rather than the 
> torch implementations.
> very good for embedding layer, a LazyTensor could store the url and offset 
> and calculate and fill only the sparse columns needed for the tokens passed, 
> saving network and memory significantly.
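
to make the __torch_function__ idea above concrete, here is a rough sketch; 
the class name, the fetch helper, and the fp16 row-major layout are all 
assumptions for illustration, not the actual httptransformer code:

    import torch
    import requests

    class LazyEmbedding:
        # stands in for an embedding weight living at url + offset as a
        # flat row-major fp16 matrix of shape (vocab, dim)
        def __init__(self, url, offset, vocab, dim, dtype=torch.float16):
            self.url, self.offset = url, offset
            self.vocab, self.dim, self.dtype = vocab, dim, dtype
            self.itemsize = torch.finfo(dtype).bits // 8

        def _fetch_row(self, idx):
            # one http range request per row; a real version would merge
            # adjacent ranges and cache rows it has already pulled
            start = self.offset + idx * self.dim * self.itemsize
            end = start + self.dim * self.itemsize - 1
            data = requests.get(self.url,
                                headers={"Range": f"bytes={start}-{end}"}).content
            return torch.frombuffer(bytearray(data), dtype=self.dtype)

        @classmethod
        def __torch_function__(cls, func, types, args=(), kwargs=None):
            kwargs = kwargs or {}
            if func is torch.nn.functional.embedding:
                ids, weight = args[0], args[1]
                # only download the rows the token ids actually touch
                uniq, inverse = ids.unique(return_inverse=True)
                rows = torch.stack([weight._fetch_row(int(i)) for i in uniq])
                return rows[inverse]
            return NotImplemented

with that, torch.nn.functional.embedding(token_ids, lazy_weight) pulls only 
the unique rows over the network instead of materializing the whole matrix.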

i spent a lot of hours playing with things around this, although most of that 
time went into generating model tracing data to validate the inference code.
there's a second, more organized implementation, a little old now but still useful:
- https://github.com/karl3wm/httptransformer/blob/main/netsafetensors.py loads 
the huggingface safetensors format remotely from their git-lfs storage with a 
variable-sized cache, so it only downloads tensors actually used in evaluation; 
it memory-maps everything to disk if there is space, and otherwise fetches 
tensors from the network on use (the header + range-request trick is sketched 
after this list)
- https://github.com/karl3wm/httptransformer/blob/main/nettensors.py wraps 
netsafetensors with pytorch tensors and presents a lazy tensor and a lazy state 
dict object, so entire models can be run off the network; it uses code similar 
to netsafetensors to provide a variable-sized cache in RAM
- https://github.com/karl3wm/httptransformer/blob/main/test_nettensors.py runs 
language models off the network and validates that the logits they produce 
compare well with recorded logits i made at 
https://huggingface.co/datasets/baffo32/llm_logits (a check of the sort 
sketched below) -- but of course only a couple of models are recorded there
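
the core trick behind the remote loading is just the safetensors header plus 
http range requests; stripped down (no cache, no mmap, simplified dtype 
handling, so not the actual netsafetensors code) it looks roughly like:

    import json, struct
    import requests
    import torch

    DTYPES = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}

    def read_header(url):
        # a .safetensors file starts with an 8-byte little-endian header length,
        # followed by a json header mapping tensor names to dtype/shape/offsets
        n = struct.unpack("<Q", requests.get(
            url, headers={"Range": "bytes=0-7"}).content)[0]
        hdr = requests.get(url, headers={"Range": f"bytes=8-{8 + n - 1}"}).content
        return json.loads(hdr), 8 + n

    def fetch_tensor(url, name):
        # pull exactly the byte range for one tensor, nothing else
        header, data_start = read_header(url)
        begin, end = header[name]["data_offsets"]  # relative to end of header
        raw = requests.get(url, headers={
            "Range": f"bytes={data_start + begin}-{data_start + end - 1}"}).content
        t = torch.frombuffer(bytearray(raw), dtype=DTYPES[header[name]["dtype"]])
        return t.reshape(header[name]["shape"])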
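
the validation itself doesn't need to be anything fancy; a check along these 
lines (not necessarily exactly what test_nettensors.py does) already catches 
gross breakage:

    import torch

    def compare_logits(got: torch.Tensor, ref: torch.Tensor):
        # do the predicted tokens agree, and how far off are the raw numbers
        return {
            "top1_agreement": (got.argmax(-1) == ref.argmax(-1)).float().mean().item(),
            "max_abs_err": (got.float() - ref.float()).abs().max().item(),
        }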

now, deepseek doesn't run correctly. it's designed to run only on the H100, a 
high-end datacenter GPU, and the huggingface quantization initialization code 
only runs under certain conditions. i haven't figured out whether there is a 
correct way to run it under other conditions. it _looks_ like it would run fine 
on any modern cpu or gpu if the relevant code path were enabled, but i could be 
wrong. as-is it loads but never initializes its quantization block scalars, and 
produces very wrong outputs.
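
for context on what those block scalars do: in block-wise quantization the 
weight is stored alongside one scale per block, and dequantization multiplies 
each block by its scale, so skipping that init trashes the numbers. a generic 
sketch (block size and tensor names are my assumptions, not lifted from the 
deepseek code):

    import torch

    def dequant_blockwise(w_q: torch.Tensor, scales: torch.Tensor, bs: int = 128):
        # w_q: quantized 2-d weight; scales: one scale per (bs x bs) block,
        # shape (ceil(rows/bs), ceil(cols/bs))
        out = w_q.float().clone()  # promote and copy so the original is untouched
        rows, cols = out.shape
        for i in range(0, rows, bs):
            for j in range(0, cols, bs):
                out[i:i+bs, j:j+bs] *= scales[i // bs, j // bs]
        return out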

meanwhile, things have gotten a little funny online around it ... i might step 
back from trying this again with this model and let them sort out ...
