so i looked through https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main a 
little bit.
the information is in the two readmes, the config file, the safetensors index json, and the modelling python class, all in that repository.

it looks like they include a fully trained layer, with a full set of experts, for 
speculatively decoding more than 1 token (different from the architecture i've 
spammed about, which treats all tokens as the same class; this instead uses a 
single extra layer dedicated to guessing further tokens), but it does not look 
like this layer is used in the modelling class.

in the safetensors index json there are some properties unique to the mtp 
(multi-token prediction) layer, such as tensor names containing "shared_head", 
which i don't find anywhere in the source that wires the architecture.
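
for reference, a quick sketch of how i'd poke at that from the index file 
(assuming the usual model.safetensors.index.json layout on the hub; the 
"shared_head" substring is the one from the repo, everything else is just 
illustrative):

    import json

    # the index file maps every tensor name to the shard it lives in
    with open("model.safetensors.index.json") as f:
        weight_map = json.load(f)["weight_map"]

    # tensor names unique to the mtp layer, like the "shared_head" ones
    for name in sorted(n for n in weight_map if "shared_head" in n):
        print(name, "->", weight_map[name])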

i have a google cloud vm going while i'm on this windows system, but it could 
be hard to reference all the layers to test the model as-is with python. but if 
i did -- loading usually emits a warning when a safetensors file contains 
weights that have no place in the model architecture, and i could look into 
that to see whether they just left the warning in or whether there's more 
information on how to use the weights.
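
for example, transformers can hand back that loading info directly instead of 
just printing the warning -- a sketch only, since actually instantiating 
deepseek-v3 this way needs an enormous amount of memory:

    from transformers import AutoModelForCausalLM

    # output_loading_info returns, among other things, checkpoint tensors
    # that had no place in the instantiated architecture
    model, loading_info = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-V3",
        trust_remote_code=True,
        output_loading_info=True,
    )

    # if the mtp weights are simply unused, they should show up here
    for key in loading_info["unexpected_keys"]:
        if "shared_head" in key:
            print("unused mtp weight:", key)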

noting that, overtly, they describe the multi-token weights as mostly there to 
improve training performance. they do also briefly express support for 
speculative decoding with them.

the usual way people do speculative decoding seems to be running a second, 
smaller draft model in parallel with the first, so it's possible the extra 
weights just need to be wired up as a parallel model.
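
transformers already exposes that pattern as assisted generation: a small draft 
model proposes a few tokens and the big model verifies them in one forward 
pass. a sketch with placeholder model names (the mtp layer isn't a standalone 
model, so whether it can slot into this is exactly the open question):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    main_id = "big-model"     # placeholder
    draft_id = "small-model"  # placeholder, must share the tokenizer

    tokenizer = AutoTokenizer.from_pretrained(main_id)
    main = AutoModelForCausalLM.from_pretrained(main_id, torch_dtype=torch.bfloat16)
    draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16)

    inputs = tokenizer("speculative decoding works by", return_tensors="pt")

    # the draft model guesses ahead; the main model accepts or rejects the guesses
    out = main.generate(**inputs, assistant_model=draft, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))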

it could be fun to make it run and test outputs to figure out the right wiring 
by trial. but most of my time/energy would likely be spent figuring out how to 
memory-map that many huge safetensors files on a system without enough ram to 
load them. i have a w.i.p. project that has hit a bump.
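
on the mapping question, the safetensors library can open a shard lazily and 
pull out one tensor at a time without reading the whole file into memory, which 
is roughly where i'd start (the shard filename is illustrative):

    from safetensors import safe_open

    # opening a shard doesn't read the tensor data up front; each
    # get_tensor call only materializes the slice it asks for
    with safe_open("model-00001-of-000163.safetensors",
                   framework="pt", device="cpu") as f:
        for name in f.keys():
            if "shared_head" in name:
                t = f.get_tensor(name)
                print(name, tuple(t.shape), t.dtype)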

or i could look for more media or a paper regarding this model, or a repository 
that was used to train it; that would show how the weights are wired for 
training, which would be basically the same for inference.

basically i'm curious whether it can infer more than 1 extra token at once. my 
guess is it would do that simply in a sequential way, like normal generation. 
but it might also rely on the outputs of the larger model and only make 1 
token. or it might possibly be trained to make 4 tokens in one go or something, 
though that seems unlikely since it's not normative.
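
to make the sequential guess concrete, here's a toy sketch of the shape of the 
idea -- none of this is deepseek's actual wiring, just an illustration of an 
extra layer rolling forward from the main model's hidden state one guessed 
token at a time:

    import torch
    import torch.nn as nn

    # toy sizes; stand-ins for the extra mtp layer and its "shared_head"
    hidden, vocab = 64, 1000
    embed = nn.Embedding(vocab, hidden)
    mtp_layer = nn.Linear(hidden * 2, hidden)
    shared_head = nn.Linear(hidden, vocab)

    def draft_tokens(last_hidden, first_token, n_extra=3):
        """sequentially guess n_extra tokens beyond the one the main model chose."""
        h, tok = last_hidden, first_token
        guesses = []
        for _ in range(n_extra):
            # fold the embedding of the latest token into the running hidden state
            h = torch.tanh(mtp_layer(torch.cat([h, embed(tok)], dim=-1)))
            tok = shared_head(h).argmax(dim=-1)
            guesses.append(int(tok))
        return guesses

    # pretend the main model produced this hidden state and chose token 42
    print(draft_tokens(torch.randn(1, hidden), torch.tensor([42])))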
