I am having trouble sending a spam to cypherpunks.

I suspect cypherpunks is temporarily down.

Here it is from this central email address which I have regained
access to but haven't returned to yet:
----
troubleshooting deepseek inference failure [on remote hardware]

transformers/modeling_utils.py line 4788
`p` is a layer-61 (MTP) weight, "model.layers.61.self_attn.q_a_proj.weight"
`param_device_map[p]` does not exist
`p` is enumerated from `weight_map`

transformers/modeling_utils.py line 4785:
- `weight_map` has the layer-61 (MTP) weights and `param_device_map` does not
- one such weight is "model.layers.61.self_attn.q_a_proj.weight"
- this is in PreTrainedModel._load_pretrained_model
0352

what conditions cause this block to execute?
where do weight_map and param_device_map come from?

`weight_map` is constructed in the previous block:
the `else` block at line 4783, indent depth 3.
which weight map is constructed?
go up the file. indent depth 3 is the weight-map condition,
indent depth 2 is the offload-code condition.

the weight-map condition is `if sharded_metadata is None`
(Pdb) p sharded_metadata is None
False

so we have `weight_map = {p: os.path.join(folder, f) for p, f in
sharded_metadata["weight_map"].items()}`

-> weight_map is constructed from `sharded_metadata`. if
sharded_metadata were None, it would be constructed from
`original_loaded_keys` and would still contain the layer-61 weights.
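
as a hedged illustration of what that comprehension produces (the
folder and shard filename here are made up, not the real ones):

import os

sharded_metadata = {"weight_map": {
    "model.layers.61.self_attn.q_a_proj.weight": "model-00160-of-000163.safetensors",
}}
folder = "/path/to/checkpoint"
weight_map = {p: os.path.join(folder, f)
              for p, f in sharded_metadata["weight_map"].items()}
# every checkpoint key, including layer 61, maps to a shard path on disk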

it looks like a good avenue would be either to figure out why
`param_device_map` does not have the layer-61 keys, or why this larger
block is being executed at all.
0358

line 4773, indent depth 2: `if device_map is not None and is_safetensors`
so basically this block only runs if there is both a device map and
is_safetensors is set.
i think i'm manually setting `use_safetensors`; maybe i'll try disabling
it and see if i can generate the data then.
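
a hedged sketch of the retry (the model id and other arguments are
placeholders for whatever i'm actually passing):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",     # assumption: the checkpoint in use
    device_map="auto",             # placeholder for my actual device_map
    torch_dtype=torch.bfloat16,
    # use_safetensors=True,        # previously forced; dropped for this test
)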
0359
0400 ok while that is loading, let's see if we can figure out where
param_device_map comes from

0402: removing `use_safetensors` did not resolve the crash.
param_device_map is set on line 4774:
4773         if device_map is not None and is_safetensors:
4774             param_device_map = expand_device_map(device_map, original_loaded_keys, start_prefix)

basically, `device_map` has entries down to `model.layers.[i]`, but no
entry for layer 61, which is the MTP layer.
so when it is expanded, none of the weights in that layer get an entry
in param_device_map.
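
a hedged toy illustration of that failure mode (not the real
expand_device_map, just the shape of the problem):

device_map = {"model.layers.60": "cpu"}   # auto map stops at layer 60
checkpoint_keys = ["model.layers.61.self_attn.q_a_proj.weight"]

param_device_map = {
    k: dev
    for prefix, dev in device_map.items()
    for k in checkpoint_keys
    if k.startswith(prefix + ".")
}                                         # -> {} ; layer 61 never matches a prefix
param_device_map["model.layers.61.self_attn.q_a_proj.weight"]  # KeyError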

this probably happens when the device map is autogenerated, which
happens outside this function.
0405
rather, it happens in the calling function, .from_pretrained(),
likely at line 4259: device_map = infer_auto_device_map(...)

right now:
(Pdb) p device_map_kwargs
{'no_split_module_classes': ['DeepseekV3DecoderLayer'],
 'special_dtypes': {'lm_head.weight': torch.bfloat16},
 'max_memory': {'cpu': 85212960085}}
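
so a hedged reconstruction of the call that produced it, using
accelerate's infer_auto_device_map (the exact call site and kwarg
plumbing inside from_pretrained may differ by version):

import torch
from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model,                                               # the empty-weights model
    no_split_module_classes=["DeepseekV3DecoderLayer"],
    special_dtypes={"lm_head.weight": torch.bfloat16},
    max_memory={"cpu": 85212960085},
)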

0407

so basically it sounds like these weights are not present in the model
enumeration
but are present on disk

i have run the model before, as have many others, so there's some way
to make it work.
it looks like the easiest way is to disable device_map, which may mean
fitting the entire model on one device, or it may mean manually
calling offload code after construction.

i could maybe put it on cpu, then set the dtype and offloading afterward.
or maybe i can set up offloading for the whole model without using a
device map somehow .... maybe not
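
a hedged sketch of the "offload after construction" idea with
accelerate's public helpers (untested; the checkpoint id and memory
budgets are placeholders, and it may not play nicely with the FP8
quantizer):

import torch
from accelerate import dispatch_model, infer_auto_device_map
from transformers import AutoModelForCausalLM

# load everything on cpu first (no device_map), then compute and apply a map
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V3",
                                             torch_dtype=torch.bfloat16)
device_map = infer_auto_device_map(model, max_memory={0: "70GiB", "cpu": "200GiB"})
model = dispatch_model(model, device_map=device_map)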

- set a breakpoint on infer_auto_device_map? (i confirmed the layer is
not in the model)
- look at the model source code again to see if the layer can be
enabled for this step
- try calling without a device map

some confusion. it looks like the checkpoint has _62_ layers, whereas ....
uhhh ...
so num_hidden_layers is 61 and num_nextn_predict_layers is 1.
the ModuleList .layers is constructed with num_hidden_layers,
and its entries are named 0 through 60.
so the layer that is named "61" is the MTP (multi-token prediction)
layer, and it's the 62nd.
confusing, because there are 61 hidden layers
and it seemed like the kind of community that might use 1-based numbering,
but nope! layer 61 is the 62nd layer, the MTP layer, and it's not in
the module list at all
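
a quick hedged check of that counting from the config (field names as
they appear in the checkpoint's config.json; trust_remote_code may or
may not be needed depending on transformers version):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
print(cfg.num_hidden_layers)         # 61 -> ModuleList indices 0..60
print(cfg.num_nextn_predict_layers)  # 1  -> the extra on-disk "layer 61" (MTP)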

so i don't see any way for layer 61 to be instantiated here :/ which
is strange because i thought i'd seen it eval'd.
maybe i can look at my logits and see what happened!
0417

0424
no, the log doesn't show layer 61 ever being used. it does show expert
61 used a lot; maybe i misread that.

ok hmm
so the huggingface device_map code assumes that what's on disk matches
what's in the model ...
but i know elsewhere in the code they often handle that kind of mismatch,
so maybe something just needs to be set to allow the mismatch ...?

0425
0427
looks like the mismatched-key handling might come after this code; the
implicit assumption might be that sharded, device-mapped checkpoints
already match the model they're loaded into.

hmm, there's an unused function _load_pretrained_model_low_mem that
looks intended for people like me to try out

the keys come from the state_dict parameter. so i could either look
into the function for loading that, or preload a custom state dict, or
not use a device map.
it looks like it might work to call
transformers.modeling_utils.load_state_dict in advance and filter out
the unused keys.
oh no, that function is only used if the checkpoint is not sharded.

the key list comes from get_checkpoint_shard_files

hrm >(
ok options:
- likely a way by passing a custom state dict
- likely a way by not using a device map
- likely a way by engaging internals, one option is get_checkpoint_shard_files
- likely a way by modifying the model to add the unused layers in

that last option might be _easiest and quickest_ here, since this is
kind of a one-off quirk just for generating test data.
i'd just list all the keys in the weights that are on layer 61 and
patch them in, i guess.
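
a hedged sketch of the key-listing part (the index filename is the
standard sharded-safetensors name; the device_map patch at the end is
just one guess at how to wire them in):

import json

with open("model.safetensors.index.json") as f:   # lives in the checkpoint folder
    weight_map = json.load(f)["weight_map"]

layer61_keys = [k for k in weight_map if k.startswith("model.layers.61.")]
print(len(layer61_keys), layer61_keys[:3])
# e.g. device_map.update({k: "cpu" for k in layer61_keys})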

when i run without a device map the warning says i'm supposed to use
"device_map = 'cuda'".

it seems happy to load on cpu

hmm device_map='cuda' seems to work. why is this?
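
for reference, a hedged sketch of the call that seems happy (the model
id is an assumption, and i haven't dug into why the plain string
behaves differently yet):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    device_map="cuda",   # a single device target, not an auto-generated map
)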

ok, i'll try on an H100 again. last time i checked i had $6 on vast.ai,
and an H100 is maybe $2.50/hr.
0516
ok device_map='cuda' works fine but then i run out of gpu memory ...

0526
so i stepped through loading with device_map='cuda'; i'm around line
4586, and it did actually enumerate missing_keys and unexpected_keys
way back on line 4582 ...
there is also a list of unexpected keys to accept:

4620            # Some models may have keys that are not in the state by design, removing them before needlessly warning
4621            # the user.
4622 ->         if cls._keys_to_ignore_on_load_missing is not None:
4623                for pat in cls._keys_to_ignore_on_load_missing:
4624                    missing_keys = [k for k in missing_keys if re.search(pat, k) is None]
4625
4626            if cls._keys_to_ignore_on_load_unexpected is not None:
4627                for pat in cls._keys_to_ignore_on_load_unexpected:
4628                    unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]
4629            if hf_quantizer is not None:
4630                missing_keys = hf_quantizer.update_missing_keys(model, missing_keys, prefix)

however, layer 61 is still in loaded_keys afterward, despite being
detected as unexpected

ok, so on line 4773 is_safetensors is _false_ and the failing block
isn't executed. that's basically why it worked.
so why is is_safetensors false?

looks like, from line 4534, is_safetensors is only set if device_map
contains "disk".

it sounds like deepseek will run if i offload to cpu and not to disk.
maybe if i can get a VM running i can use swap. i haven't gotten VMs
working on vast.ai, it won't let me connect to them. hrm
maybe i'll just patch those lines to run the model! i can add a check
for the key to be present. lemme see how that works. line 4788 of
modeling_utils.py 0535
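
a hedged sketch of that guard, paraphrasing from memory what the block
around line 4788 builds (the offload_index comprehension); the names
are the function's locals, and the dict values aren't verbatim:

offload_index = {
    p: {"safetensors_file": f, "weight_name": p}
    for p, f in weight_map.items()
    if p in param_device_map and param_device_map[p] == "disk"   # added `in` guard
}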
0556 well now i get an error in get_disk_only_shard_files

i might want to just capture some weights manually at this point

- initially config.quantization_config = {'activation_scheme': 'dynamic',
  'fmt': 'e4m3', 'quant_method': 'fp8', 'weight_block_size': [128, 128]}
- then config.quantization_config =
  AutoHfQuantizer.merge_quantization_configs(config.quantization_config,
  quantization_config=None) =
  FineGrainedFP8Config(quant_method=<QuantizationMethod.FP8: 'fp8'>)
- then
3691 ->             hf_quantizer = AutoHfQuantizer.from_config(
3692                    config.quantization_config,
3693                    pre_quantized=pre_quantized,  # = True
3694                )

3699                hf_quantizer.validate_environment(
3700                    torch_dtype=torch_dtype,
3701                    from_tf=from_tf,
3702 ->                 from_flax=from_flax,
3703                    device_map=device_map,
3704                    weights_only=weights_only,
3705                )
3706                torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
3707                device_map = hf_quantizer.update_device_map(device_map)

(... the model is constructed with empty weights ...)

4200 ->             hf_quantizer.preprocess_model(
4201                    model=model, device_map=device_map, keep_in_fp32_modules=keep_in_fp32_modules
4202                )

it looks like preprocess_model is replacing Linear modules with
FP8Linear modules, before weights are loaded.

so that's likely a really important step my code was missing
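
so, a hedged sketch of adding that step to my own loading path,
mirroring the call quoted above (argument names are taken from it;
whether this alone is sufficient is an assumption):

from transformers.quantizers import AutoHfQuantizer

hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config,
                                           pre_quantized=True)
# swaps nn.Linear modules for FP8Linear before any weights are loaded
hf_quantizer.preprocess_model(model=model, device_map=device_map,
                              keep_in_fp32_modules=keep_in_fp32_modules)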

... now it's doing the weight-loading code i've engaged with so much ...

[hey, one thing i could do is run a forward pass while saving weights,
but only save them for e.g. the first layer]

it looked like some of the param quantization initialization could
have been in _load_state_dict_into_meta_model or somesuch

so here's this, but it doesn't look properly initialized:
(Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight.cpu()
tensor([[ -22.0000,  -72.0000,   88.0000,  ...,   -9.0000, -208.0000,
          -28.0000],
        [ 128.0000,   14.0000,   16.0000,  ...,  104.0000,  -64.0000,
           26.0000],
        [  72.0000,  -36.0000,   64.0000,  ..., -120.0000,   80.0000,
          -72.0000],
        ...,
        [-144.0000,   80.0000,   48.0000,  ...,  -72.0000,  -96.0000,
           72.0000],
        [ -80.0000,  120.0000,   72.0000,  ...,  -44.0000,  112.0000,
          112.0000],
        [ 224.0000,    4.5000,  -56.0000,  ...,  160.0000,  -64.0000,
           36.0000]], dtype=torch.float8_e4m3fn)

these are much higher-magnitude numbers than i'd expect; i don't think
they've been scaled here.
ok, the scale is in weight_scale_inv:

(Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight_scale_inv.cpu()
tensor([[0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003,
0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002,
0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0001,
0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
0.0003, 0.0004,
         0.0002, 0.0002],
        [0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
0.0001, 0.0002,
         0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002,
0.0003, 0.0002,
         0.0002, 0.0001],
        [0.0003, 0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004,
         0.0003, 0.0002, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0001, 0.0002,
         0.0002, 0.0004, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002,
0.0002, 0.0002,
         0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0001, 0.0001,
         0.0002, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002,
0.0005, 0.0004,
         0.0002, 0.0002],
        [0.0004, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0003,
         0.0002, 0.0003, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0003, 0.0004,
         0.0003, 0.0001, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0002, 0.0003,
         0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0005, 0.0002, 0.0002, 0.0001,
         0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0001],
        [0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0002, 0.0003,
         0.0002, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0001, 0.0003, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0003, 0.0002, 0.0004, 0.0004,
         0.0002, 0.0002],
        [0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0001, 0.0002,
         0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0001],
        [0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0004, 0.0002, 0.0003,
         0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002,
         0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0001, 0.0002, 0.0002,
         0.0001, 0.0003, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003,
         0.0002, 0.0002],
        [0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0002, 0.0001, 0.0004, 0.0003, 0.0002, 0.0003,
         0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0003, 0.0002, 0.0001, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0002],
        [0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002,
         0.0002, 0.0002],
        [0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004,
         0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002,
         0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0002],
        [0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0001, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002,
         0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002,
         0.0002, 0.0001],
        [0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0001,
         0.0001, 0.0002]])

and of course i could have made mistakes copying that by hand from pdb
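
the magnitudes roughly check out: scales around 2e-4 times raw fp8
values around 1e2 give weights of order 1e-2. a hedged sketch of
dequantizing one of these by hand, assuming the [128, 128]
weight_block_size from the quantization_config above:

import torch

def dequant_fp8_block(weight, scale_inv, block=128):
    w = weight.to(torch.float32)
    # expand each per-block scale over its 128x128 tile, then crop to the weight shape
    s = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return w * s[: w.shape[0], : w.shape[1]]

# e.g. dequant_fp8_block(layer.self_attn.q_a_proj.weight.cpu(),
#                        layer.self_attn.q_a_proj.weight_scale_inv.cpu())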
