karl3@writeme.com wrote:
> so i looked through https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main
> a little bit.
> information is in the two readmes, the config file, the safetensors json
> file, and the modelling python class, all in that repository.
>
> it looks like they are including a fully trained layer with a full set of
> experts for speculatively decoding more than 1 token (different from the
> architecture i've spammed about, which treats all tokens as the same class;
> this instead uses a single extra layer dedicated to guessing further
> tokens), but it does not look like this layer is used in the modelling
> class.
>
> in the safetensors json there are some properties unique to the mtp
> (multi-token prediction) layer, such as the name "shared_head", which i
> don't find in the source that wires the architecture.
>
> i have a google cloud vm going while i'm on this windows system, but it
> could be hard to reference all the layers to test the model as-is with
> python. but if i did -- usually when there are weights in a safetensors
> file that don't have a place in the model architecture, loading emits a
> warning, and i could look into that to see if they just left that warning
> or if there's more information on how to use the weights.
>
> noting that overtly they describe the multi-token weights as mostly there
> to improve training performance. they do also briefly express support for
> speculative decoding with them.
>
> the normal way people were doing speculative decoding looks like they run
> a second, smaller draft model in parallel with the first, so it's possible
> the extra weights just need to be wired up as a parallel model.
>
> it could be fun to make it run and test outputs to figure out the right
> wiring by trial. but most of my time/energy would likely be spent figuring
> out how to map that many huge safetensors files on a system without the
> ram to load them. i have a w.i.p. project that has a bump.
>
> or i could look for more media or a paper regarding this model, or a
> repository that was used for training it; this would show the wiring of
> the weights for training, which would be basically the same for inference.
>
> basically i'm curious whether it can infer more than 1 extra token at
> once. i'm guessing it would likely do that simply in a sequential way,
> like normal generation. but it might also rely on the outputs of the
> larger model and only make 1 token. or it might possibly be trained to
> make 4 tokens in one go or something, though this seems unlikely since
> that's not normative.
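a cheap way to check that orphan-weights question without actually loading anything: the index json in the repo maps every tensor name to its shard file, and the modelling class can be instantiated on the meta device (shapes only, no memory), so the two sets of key names can just be diffed. a rough python sketch, assuming the standard hf filenames and apis plus the repo's trust_remote_code modelling file; the "shared_head" filter string is just taken from the names mentioned above:

    # sketch: diff the checkpoint's tensor names against what the modelling
    # class actually wires up, without loading any weights into ram.
    import json
    from huggingface_hub import hf_hub_download
    from accelerate import init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    repo = "deepseek-ai/DeepSeek-V3"

    # the index json maps every tensor name to the shard file holding it
    index_path = hf_hub_download(repo, "model.safetensors.index.json")
    with open(index_path) as f:
        ckpt_keys = set(json.load(f)["weight_map"])

    # instantiate the architecture on the meta device: shapes only, no ram
    config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
    arch_keys = set(model.state_dict().keys())

    # anything left over is a weight with no home in the wired architecture
    # (this is what from_pretrained warns about as "unexpected keys"; note
    # quantization scale tensors could show up in the diff too)
    orphans = sorted(ckpt_keys - arch_keys)
    print(len(orphans), "checkpoint tensors not referenced by the model class")
    print([k for k in orphans if "shared_head" in k][:10])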
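and for reference on the "parallel model" wiring: the usual two-model speculative decoding loop looks roughly like the greedy sketch below -- the draft proposes k tokens, the big model scores them all in one forward pass, and you keep the longest agreeing prefix plus one free token. the model handles here are placeholders, not the deepseek wiring; with the mtp layer, the draft step would presumably instead run the extra layer's shared_head off the main model's hidden states, which is the "rely on the outputs of the larger model" case:

    # greedy draft-and-verify speculative decoding, minimal sketch.
    # target, draft: causal lms returning .logits; ids: (1, seq) token ids.
    import torch

    @torch.no_grad()
    def speculative_step(target, draft, ids, k=4):
        # 1. the cheap draft model proposes k tokens autoregressively
        proposal = ids
        for _ in range(k):
            logits = draft(proposal).logits[:, -1]
            proposal = torch.cat(
                [proposal, logits.argmax(-1, keepdim=True)], dim=-1)

        # 2. the big target model scores the whole proposal in ONE pass
        tlogits = target(proposal).logits
        # target's greedy pick at each of the k proposed positions
        tpred = tlogits[:, ids.shape[1] - 1:-1].argmax(-1)
        drafted = proposal[:, ids.shape[1]:]

        # 3. keep the longest prefix where draft and target agree
        agree = (tpred == drafted)[0].long()
        n = int(agree.cumprod(0).sum())
        accepted = drafted[:, :n]
        # the target's own next token after that prefix comes for free,
        # so even n == 0 still yields one normal-generation token
        bonus = tlogits[:, ids.shape[1] - 1 + n].argmax(-1, keepdim=True)
        return torch.cat([ids, accepted, bonus], dim=-1)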
the DeepSeek-V3 paper is at
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf and it
describes the architecture wiring. it sounds like nobody has implemented the
inference wiring for this yet. i could look into implementing it, but of
course it's a challenge to reference all those layers to test that it can
predict a token.

it seems a little interesting to think about working through the safetensors
files one-by-one and sharding them differently -- intentionally breaking them
into separate files based on experts, so that only a portion of them needs to
be mapped to forward the model. (a sketch of that is below.)

i guess i'm thinking of that as kind of a subtask that might be a thing on
its own ... if i did succeed at this, i wonder what i would do with the
remapped weights, and what kind of auxiliary code might be needed to make use
of them. maybe none! maybe it's just interesting to remap them.

unsure :s
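here's roughly what i mean by the resharding subtask -- safe_open memory-maps the shards, so the tensor names can be scanned for free and the tensors pulled out one bucket at a time. the ".experts.N." key pattern is an assumption from typical moe checkpoints, i haven't verified deepseek-v3's actual names:

    # regroup safetensors shards into per-expert files without loading the
    # whole model. safe_open mmaps, so pass 1 touches no tensor data.
    import os, re
    from collections import defaultdict
    from safetensors import safe_open
    from safetensors.torch import save_file

    def reshard_by_expert(shard_paths, out_dir):
        os.makedirs(out_dir, exist_ok=True)

        # pass 1: read only key names, plan which output file each goes to.
        # tensors with no expert index land in a "dense" bucket (attention,
        # embeddings, routers) -- that one may itself need further splitting.
        plan = defaultdict(lambda: defaultdict(list))  # bucket -> path -> names
        for path in shard_paths:
            with safe_open(path, framework="pt", device="cpu") as f:
                for name in f.keys():
                    m = re.search(r"\.experts\.(\d+)\.", name)
                    bucket = f"expert_{m.group(1)}" if m else "dense"
                    plan[bucket][path].append(name)

        # pass 2: materialize one bucket at a time, so peak memory is one
        # expert's worth of tensors rather than the whole model
        for bucket, by_path in plan.items():
            tensors = {}
            for path, names in by_path.items():
                with safe_open(path, framework="pt", device="cpu") as f:
                    for name in names:
                        tensors[name] = f.get_tensor(name)
            save_file(tensors, os.path.join(out_dir, f"{bucket}.safetensors"))

as for the auxiliary code: it might just be a new weight_map index in the style of model.safetensors.index.json, pointing each tensor name at its new file, so an mmap-based loader could open only the experts a forward pass actually routes to.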