> https://github.com/karl3wm/httptransformer or maybe c++ or something
> deepseek is designed with 5% evaluation size and pretrained speculative decode

so the next step i left off at was subsharding large weights.
i have a potential bump today, so i wanted to mention that subsharding looks 
pretty easy. one approach is to use torch's __torch_function__ protocol: 
torch treats any object as tensor-like if it defines a __torch_function__ 
function (the examples show a class method, but instance methods may work 
too), and for torch operations it calls that function, if present, instead 
of the torch implementation.
this is a very good fit for the embedding layer: a LazyTensor could store the 
url and offset, and fetch and fill in only the sparse rows of the embedding 
table needed for the tokens passed, saving network traffic and memory 
significantly.
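
a rough sketch of the offset arithmetic for the embedding case, in plain 
python so it runs without torch. everything here is hypothetical and not from 
httptransformer: the LazyEmbedding name, the fetch_range callback, and the 
assumption that the table is stored row-major as little-endian float32 with 
one row per token id. in a real setup fetch_range would presumably issue an 
http Range request against the stored url, and a __torch_function__ hook 
would call rows() when it intercepts the embedding lookup.

```python
import struct


class LazyEmbedding:
    """Hypothetical sketch: fetch only the embedding rows that are needed."""

    def __init__(self, fetch_range, offset, num_rows, row_len):
        self.fetch_range = fetch_range  # callable(start, length) -> bytes
        self.offset = offset            # byte offset of the table in the remote file
        self.num_rows = num_rows        # vocabulary size
        self.row_len = row_len          # floats per row (embedding dimension)
        self.row_bytes = row_len * 4    # assumption: float32 storage
        self.cache = {}                 # token id -> already fetched row

    def rows(self, token_ids):
        """Return one row per token id, fetching each distinct row at most once."""
        out = []
        for tok in token_ids:
            if not 0 <= tok < self.num_rows:
                raise IndexError(tok)
            if tok not in self.cache:
                start = self.offset + tok * self.row_bytes
                raw = self.fetch_range(start, self.row_bytes)
                # assumption: little-endian float32, row-major layout
                self.cache[tok] = struct.unpack(f"<{self.row_len}f", raw)
            out.append(self.cache[tok])
        return out


def make_fake_remote(header_len, table):
    """Stand-in for a remote file: an in-memory blob plus a log of range requests."""
    body = b"".join(struct.pack(f"<{len(row)}f", *row) for row in table)
    blob = bytes(header_len) + body
    log = []

    def fetch_range(start, length):
        log.append((start, length))
        return blob[start:start + length]

    return fetch_range, log
```

with a 5-row table, asking for tokens [4, 1, 4] should trigger only two range 
requests of one row each, rather than a download of the whole table.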
