karl3@writeme.com wrote:
> so i looked through https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main
> a little bit.
> the information is spread across the two readmes, the config file, the
> safetensors index json, and the modelling python class, all in that
> repository.
> 
> it looks like they are including a fully trained extra layer, with its own
> full set of experts, for speculatively decoding more than 1 token
> (different from the architecture i've spammed about, which treats all
> tokens as the same class; this instead dedicates a single extra layer to
> guessing further tokens). but it does not look like this layer is used in
> the modelling class.
> 
> in the safetensors index json there are some properties unique to the mtp
> (multi-token prediction) layer, such as tensor names containing
> "shared_head", which i don't find anywhere in the source that wires up the
> architecture.
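>
> a quick way to list those keys, as a minimal sketch (assumes
> model.safetensors.index.json from the repo is downloaded locally):
>
>     import json
>
>     with open("model.safetensors.index.json") as f:
>         index = json.load(f)
>
>     # weight_map maps each tensor name to the shard file that holds it
>     for name, shard in sorted(index["weight_map"].items()):
>         if "shared_head" in name:
>             print(name, "->", shard)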
> 
> i have a google cloud vm going while i'm on this windows system, but it
> could be hard to reference all the layers to test the model as-is with
> python. if i did, though -- usually when a safetensors file holds weights
> that have no place in the model architecture, the loader emits a warning
> listing them, and i could check whether they just left that warning in
> place or whether there's more information on how to use the weights.
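>
> an untested sketch of what i'd run to surface that -- transformers can
> hand back the loading report directly instead of just warning (assumes a
> machine that can actually hold the weights, or accelerate's device_map
> offloading):
>
>     from transformers import AutoModelForCausalLM
>
>     model, info = AutoModelForCausalLM.from_pretrained(
>         "deepseek-ai/DeepSeek-V3",
>         trust_remote_code=True,    # the modelling class lives in the repo
>         output_loading_info=True,  # report keys instead of only warning
>     )
>     # checkpoint tensors with no home in the architecture land here, so
>     # the mtp weights should show up if they really are unwired
>     print(info["unexpected_keys"])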
> 
> noting that, overtly, they describe the multi-token weights as being there
> mostly to improve training performance. they do also briefly express
> support for speculative decoding with them.
> 
> the normal way people have been doing speculative decoding is to run a
> second, smaller draft model alongside the main one, so it's possible the
> extra weights just need to be wired up as that parallel model.
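>
> i.e. something shaped like this greedy draft-then-verify loop (a generic
> sketch of the usual scheme, not deepseek-specific wiring; assumes two
> causal lms sharing a tokenizer, and batch size 1):
>
>     import torch
>
>     @torch.no_grad()
>     def speculative_step(target, draft, ids, k=4):
>         # 1) draft proposes k tokens autoregressively
>         proposal = ids
>         for _ in range(k):
>             logits = draft(proposal).logits[:, -1]
>             proposal = torch.cat(
>                 [proposal, logits.argmax(-1, keepdim=True)], dim=-1)
>         # 2) target scores the whole proposal in one forward pass
>         tlogits = target(proposal).logits
>         preds = tlogits[:, ids.shape[1] - 1 : -1].argmax(-1)
>         drafted = proposal[:, ids.shape[1]:]
>         # 3) keep the longest prefix where target agrees with the draft,
>         #    then append target's own pick at the first disagreement
>         n = int((preds == drafted).long().cumprod(-1).sum())
>         return torch.cat([ids, drafted[:, :n], preds[:, n:n + 1]], dim=-1)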
> 
> it could be fun to get it running and test outputs, to figure out the
> right wiring by trial. but most of my time/energy would likely be spent
> figuring out how to map that many huge safetensors files on a system
> without enough ram to load them. i have a w.i.p. project in that area
> that's hit a bump.
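>
> safetensors files are designed to be mmapped, though, so tensors can at
> least be pulled out one at a time without a whole shard ever living in
> ram. untested at this scale, and the shard filename here is illustrative:
>
>     from safetensors import safe_open
>
>     with safe_open("model-00001-of-000163.safetensors",
>                    framework="pt", device="cpu") as f:
>         for name in f.keys():
>             t = f.get_tensor(name)  # only this tensor is materialized
>             print(name, tuple(t.shape), t.dtype)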
> 
> or i could look for more media or a paper regarding this model, or a
> repository that was used to train it; that would show how the weights are
> wired for training, which would be basically the same for inference.
> 
> basically i'm curious whether it can infer more than 1 extra token at
> once. i'm guessing it would likely do that sequentially, like normal
> generation. but it might also rely on the outputs of the larger model and
> only make 1 token. or it might possibly be trained to make 4 tokens in one
> go or something, though that seems unlikely since it's not normative.

the deepseek-v3 paper is at
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf and it
describes the architecture wiring. it sounds like nobody has implemented the
inference-time wiring for this yet.

i could look into implementing it, but of course it's a challenge to
reference all those layers just to test that it can predict a token.

it seems a little interesting to think about working through the safetensors
files one-by-one and sharding them differently -- intentionally breaking them
into separate files along expert boundaries, so that only a portion of them
needs to be mapped to forward the model.
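
a rough sketch of that resharding (untested; assumes the huggingface layout
with model.safetensors.index.json, and that expert tensors are named like
"...experts.<N>...", which is worth verifying against the index first):

    import json, os, re
    from collections import defaultdict
    from safetensors import safe_open
    from safetensors.torch import save_file

    with open("model.safetensors.index.json") as f:
        weight_map = json.load(f)["weight_map"]

    # bucket tensor names by expert id; everything non-expert is "shared"
    buckets = defaultdict(list)
    for name in weight_map:
        m = re.search(r"experts\.(\d+)\.", name)
        buckets[m.group(1) if m else "shared"].append(name)

    os.makedirs("resharded", exist_ok=True)
    for bucket, names in buckets.items():
        tensors = {}
        for name in names:
            # one tensor at a time via mmap, so peak ram is ~one bucket
            with safe_open(weight_map[name], framework="pt",
                           device="cpu") as f:
                tensors[name] = f.get_tensor(name)
        save_file(tensors, f"resharded/expert-{bucket}.safetensors")

splitting per (layer, expert) pair instead would give smaller files, but
it's the same idea.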

i guess i'm thinking of that as kind of a subtask that might be a thing on
its own ... if i did succeed at it, i wonder what i would do with the
remapped weights, and what kind of auxiliary code might be needed to make use
of them. maybe none! maybe it's just interesting to remap them, unsure :s
