cypherpunks list not loading for me.
also many emails from it are missing from my inbox at the moment.

here is the last spam i tried to send during the connectivity issues:

troubleshooting deepseek inference failure [on remote hardware]

transformers/modeling_utils.py 4788
`p` is a weight in the mlp layer, "model.layers.61.self_attn.q_a_proj.weight"
`param_device_map[p]` does not exist
`p` is enumerated from `weight_map`

transformers modeling_utils.py 4785:
- `weight_map` has the mlp-layer weights and `param_device_map` does not
- one such weight is "model.layers.61.self_attn.q_a_proj.weight"
- this is in PreTrainedModel._load_pretrained_model
0352

what conditions cause this block to execute?
where do weight_map and param_device_map come from?

`weight_map` is constructed in the previous block:
the `else` block at line 4783, indent depth 3.
which weight_map is constructed?
go up the file. indent depth 3 is the weight_map condition;
indent depth 2 is the offload-code condition.

the weight_map condition is `if sharded_metadata is None`
(Pdb) p sharded_metadata is None
False

so we have `weight_map = {p: os.path.join(folder, f) for p, f in 
sharded_metadata["weight_map"].items()}`

-> weight map is constructed from `sharded_metadata`. if sharded_metadata were 
None, it would be constructed from `original_loaded_keys` and would still 
contain mlp weights.
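
for reference, here's roughly what the two branches produce (a sketch following the
notes above, not the verbatim library code; the argument names mirror the
modeling_utils locals):

import os

def build_weight_map(original_loaded_keys, sharded_metadata, archive_file, folder):
    # sketch: both branches end with a flat {param name -> checkpoint file} map
    if sharded_metadata is None:
        # single-file checkpoint: every key maps to the one archive file
        return {p: archive_file for p in original_loaded_keys}
    # sharded checkpoint: the index's "weight_map" is {param name -> shard filename}
    return {p: os.path.join(folder, f) for p, f in sharded_metadata["weight_map"].items()}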

it looks like a good avenue would be either to figure out why 
`param_device_map` does not have the mlp keys, or to figure out why the larger 
block is being executed.
0358

line 4773 indent depth 2: `if device_map is not None and is_safetensors`
so basically this block is only run if there is both a device map, and 
is_safetensors is set.
i think i'm manually setting is_safetensors; maybe i'll try disabling it and 
see if i can generate the data then.
0359
0400 ok while that is loading let's see if we can figure out where 
param_device_map comes from

0402: removing `use_safetensors` did not resolve the crash. param_device_map is 
set on line 4774:
4773         if device_map is not None and is_safetensors:
4774             param_device_map = expand_device_map(device_map, original_loaded_keys, start_prefix)

basically, `device_map` has entries down to `model.layers.[i]`, but no entry for 
layer 61, which is the mlp layer.
so when it is expanded it doesn't pick up any of the weights in that layer.
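
here's roughly what i understand that expansion to do (a behavior sketch, not
accelerate's actual implementation): each param name gets the device of the
device_map entry whose module prefix covers it, and uncovered names just never
show up.

def expand_device_map_sketch(device_map, param_names):
    # assign each param the device of the longest device_map entry that prefixes it;
    # params with no covering entry are simply absent from the result
    expanded = {}
    for p in param_names:
        matches = [m for m in device_map if m == "" or p == m or p.startswith(m + ".")]
        if matches:
            expanded[p] = device_map[max(matches, key=len)]
    return expanded

dm = {"model.layers.60": "cpu"}  # illustrative: no entry for layer 61
keys = ["model.layers.60.mlp.gate.weight",
        "model.layers.61.self_attn.q_a_proj.weight"]
print(expand_device_map_sketch(dm, keys))
# {'model.layers.60.mlp.gate.weight': 'cpu'} -- the layer-61 key is just missing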

this probably happens when the device map is autogenerated, which happened 
outside this function.
0405
rather, in the calling function: .from_pretrained(),
likely line 4259: device_map = infer_auto_device_map(...)

right now:
(Pdb) p device_map_kwargs
{'no_split_module_classes': ['DeepseekV3DecoderLayer'], 'special_dtypes': 
{'lm_head.weight': torch.bfloat16}, 'max_memory': {'cpu': 85212960085}}
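
one thing i could do is rebuild that map outside from_pretrained so i can poke at
it directly. rough sketch, with kwargs mirroring the dump above; `model` here is
the empty-weight model from the session:

import torch
from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model,  # the empty-weight model from_pretrained constructed
    no_split_module_classes=["DeepseekV3DecoderLayer"],
    special_dtypes={"lm_head.weight": torch.bfloat16},
    max_memory={"cpu": 85212960085},
)
print({k: v for k, v in device_map.items() if "layers.61" in k})  # expect nothing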

0407

so basically it sounds like these weights are not present in the model 
enumeration, but are present on disk.

i have run the model before, as have many others, so there's some way to make it 
work.
it looks like the easiest way is to disable device_map, which may mean fitting 
the entire model on one device, or manually calling offload code after 
construction.

i could maybe put it on cpu, then set the dtype and offloading afterwards.
or maybe i can set the offloading for the whole model without using a device 
map somehow .... maybe not

- set a breakpoint on infer_auto_device_map ? (i confirmed the layer is not in 
the model)
- look at the model source code again to see if the layer can be enabled for 
this step
- try calling without a device map (rough sketch after this list)
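
rough sketch of that last option; the checkpoint path and dtype are assumptions
for illustration, not what i actually ran:

import torch
from transformers import AutoModelForCausalLM

# no device_map at all: everything loads to cpu, and the offload/is_safetensors
# block never runs
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/DeepSeek-V3",       # local checkpoint dir (assumed)
    torch_dtype=torch.bfloat16,   # assumed; the fp8 quantizer may override this
)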

some confusion. it looks like the model has _62_ layers, whereas ....
uhhh ...
so num_hidden_layers is 61 and num_nextn_predict_layers is 1.
the ModuleList .layers is constructed with num_hidden_layers
and it has names that range from 0 to 60.
so the layer that is named "61" is the mlp layer. and it's #62.
confusing because there are 61 hidden layers
and it seemed like the kind of community that might use 1-based numbering
but nope! layer 61 is the 62nd layer, the mlp layer, and it's not in the list 
of layers
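
quick sketch of that, using the config values above:

import torch.nn as nn

num_hidden_layers = 61        # config.num_hidden_layers -> module names "0".."60"
num_nextn_predict_layers = 1  # the extra layer, stored on disk as "model.layers.61.*"

layers = nn.ModuleList(nn.Identity() for _ in range(num_hidden_layers))
print(len(layers), [name for name, _ in layers.named_children()][-1])
# 61 '60' -- so a module named "61" is never constructed here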

so i don't see any way for layer 61 to be instantiated here :/ which is strange 
because i thought i'd seen it eval'd
maybe i can look at my logits and see what happened !
0417

0424
no, the log doesn't show layer 61 ever used. it does show expert 61 used a lot; 
maybe i misread that

ok hmm
so the huggingface device_map code assumes that what's on disk matches what's 
in the model ...
but i know elsewhere in the code they often handle that kind of mismatch, so 
maybe something just needs to be set for the mismatch to be tolerated ...?

0425
0427
looks like the mismatched-key code might run after this code; the assumption 
here might be that sharded, device-mapped models are already tuned for use

hmm there's an unused function _load_pretrained_model_low_mem that looks 
intended for people like me to try out

the keys come from the state_dict parameter. so i could either look into the 
function for loading that, or preload a custom state dict, or not use a device 
map.
it looks like it might work to call transformers.modeling_utils.load_state_dict 
in advance and filter the unused keys.
oh no, that function is only used when the checkpoint isn't sharded

the key list comes from get_checkpoint_shard_files

hrm >(
ok options:
- likely a way by passing a custom state dict
- likely a way by not using a device map
- likely a way by engaging internals, one option is get_checkpoint_shard_files
- likely a way by modifying the model to add the unused layers in

that last option might be _easiest and quickest_ here, since this is kind of a 
unique quirk just for generating test data.
i'd just list all the keys in the weights that are on layer 61 and patch them 
in, i guess (rough sketch below)
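
rough sketch of listing those keys straight from the sharded index file (the path
is an assumption about where the checkpoint lives):

import json

with open("/path/to/DeepSeek-V3/model.safetensors.index.json") as f:
    index = json.load(f)

layer61_keys = sorted(k for k in index["weight_map"] if k.startswith("model.layers.61."))
print(len(layer61_keys))
print(layer61_keys[:5])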

when i run without a device map the warning says i'm supposed to use 
"device_map = 'cuda'".

it seems happy to load on cpu

hmm device_map='cuda' seems to work. why is this?

ok i'll try on an H100 again. last time i checked i had $6 on vast.ai. an H100 
is maybe $2.50/hr.
0516
ok device_map='cuda' works fine but then i run out of gpu memory ...

0526
so i stepped into the device_map='cuda' run and i'm around line 4586; it did 
actually enumerate missing_keys and unexpected_keys way back on line 4582 ...
there is also a list of unexpected keys to accept:

4620            # Some models may have keys that are not in the state by design, removing them before needlessly warning
4621            # the user.
4622 ->         if cls._keys_to_ignore_on_load_missing is not None:
4623                for pat in cls._keys_to_ignore_on_load_missing:
4624                    missing_keys = [k for k in missing_keys if re.search(pat, k) is None]
4625
4626            if cls._keys_to_ignore_on_load_unexpected is not None:
4627                for pat in cls._keys_to_ignore_on_load_unexpected:
4628                    unexpected_keys = [k for k in unexpected_keys if re.search(pat, k) is None]
4629            if hf_quantizer is not None:
4630                missing_keys = hf_quantizer.update_missing_keys(model, missing_keys, prefix)

however, layer 61 is still in loaded_keys after, despite being detected as 
unexpected

ok so on line 4773 is_safetensors is _false_ and the failing block isn't 
executed. that's basically why it worked.
so why is is_safetensors false?

looks like, on line 4534, is_safetensors is only set if device_map contains 
"disk".

it sounds like deepseek will run if i offload to cpu and not to disk.
maybe if i can get a VM running i can use swap. i haven't gotten VMs working on 
vast.ai, it won't let me connect to them. hrm
maybe i'll just patch those lines to run the model! i can add a check for the 
key to be present. lemme see how that works. line 4788 of modeling_utils.py 0535
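
sketch of the patch i have in mind: the failing expression around line 4788
indexes param_device_map while building the disk-offload index, so i just guard
the lookup. names follow the quoted code; the exact expression may differ by
transformers version.

offload_index = {
    p[len(start_prefix):]: {"safetensors_file": f, "weight_name": p, "dtype": str_dtype}
    for p, f in weight_map.items()
    # was: if param_device_map[p] == "disk" -- skip checkpoint-only keys like model.layers.61.*
    if p in param_device_map and param_device_map[p] == "disk"
}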
0556 well now i get an error in get_disk_only_shard_files

i might want to just capture some weights manually at this point

- initially config.quantization_config = {'activation_scheme': 'dynamic', 
'fmt': 'e4m3', 'quant_method': 'fp8', 'weight_block_size': [128, 128]}
- then config.quantization_config = 
AutoHfQuantizer.merge_quantization_configs(config.quantization_config, 
quantization_config=None) = 
FineGrainedFP8Config(quant_method=<QuantizationMethod.FP8: 'fp8'>)
- then
3691 ->             hf_quantizer = AutoHfQuantizer.from_config(
3692                    config.quantization_config,
3693                    pre_quantized=pre_quantized,  # = True
3694                )

3699                hf_quantizer.validate_environment(
3700                    torch_dtype=torch_dtype,
3701                    from_tf=from_tf,
3702 ->                 from_flax=from_flax,
3703                    device_map=device_map,
3704                    weights_only=weights_only,
3705                )
3706                torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
3707                device_map = hf_quantizer.update_device_map(device_map)

(... the model is constructed with empty weights ...)

4200 ->             hf_quantizer.preprocess_model(
4201                    model=model, device_map=device_map, 
keep_in_fp32_modules=keep_in_fp32_modules
4202                )

it looks like preprocess_model is replacing Linear modules with FP8Linear 
modules, before weights are loaded.

so that's likely a really important step my code was missing
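
rough sketch of replicating that step in my own loading code, reusing the calls
quoted above (the AutoHfQuantizer import path is an assumption about this
transformers version):

from transformers.quantizers import AutoHfQuantizer

hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config, pre_quantized=True)
torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype)
device_map = hf_quantizer.update_device_map(device_map)

# ... construct the model with empty weights here ...

hf_quantizer.preprocess_model(
    model=model, device_map=device_map, keep_in_fp32_modules=keep_in_fp32_modules
)  # swaps nn.Linear for FP8Linear before any checkpoint weights are loaded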

... now it's doing the weight loading code i engaged with so much ...

[hey one thing i could do is run a forward pass while saving weights, but only 
save them for e.g. the first layer]
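
rough sketch of that, reading "save them" as snapshotting layer 0's tensors the
first time it runs; the module path model.model.layers[0] is an assumption:

import torch

captured = {}

def snapshot_layer0(module, args, kwargs):
    # grab layer 0's tensors once, on its first forward call
    if not captured:
        captured.update({k: v.detach().to("cpu", copy=True) for k, v in module.state_dict().items()})

hook = model.model.layers[0].register_forward_pre_hook(snapshot_layer0, with_kwargs=True)
# ... run one forward pass, then hook.remove() and torch.save(captured, "layer0.pt")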

it looked like some of the param quantization initialization could have been in 
_load_state_dict_into_meta_model or somesuch

so here's this, but it doesn't look properly initialized:
(Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight.cpu()
tensor([[ -22.0000,  -72.0000,   88.0000,  ...,   -9.0000, -208.0000,
          -28.0000],
        [ 128.0000,   14.0000,   16.0000,  ...,  104.0000,  -64.0000,
           26.0000],
        [  72.0000,  -36.0000,   64.0000,  ..., -120.0000,   80.0000,
          -72.0000],
        ...,
        [-144.0000,   80.0000,   48.0000,  ...,  -72.0000,  -96.0000,
           72.0000],
        [ -80.0000,  120.0000,   72.0000,  ...,  -44.0000,  112.0000,
          112.0000],
        [ 224.0000,    4.5000,  -56.0000,  ...,  160.0000,  -64.0000,
           36.0000]], dtype=torch.float8_e4m3fn)

these are much higher-magnitude numbers than i'd expect; i don't think they've 
been scaled here.
ok, the scale is in weight_scale_inv:

(Pdb) p model_to_load.model.layers[0].self_attn.q_a_proj.weight_scale_inv.cpu()
tensor([[0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0001, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0004,
         0.0002, 0.0002],
        [0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002,
         0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0003, 0.0002,
         0.0002, 0.0001],
        [0.0003, 0.0001, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004,
         0.0003, 0.0002, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0001, 0.0002,
         0.0002, 0.0004, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0001, 0.0001,
         0.0002, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0002],
        [0.0004, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0005, 0.0002, 0.0003,
         0.0002, 0.0003, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0003, 0.0004,
         0.0003, 0.0001, 0.0002, 0.0005, 0.0002, 0.0003, 0.0005, 0.0002, 0.0003,
         0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0005, 0.0002, 0.0002, 0.0001,
         0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0001],
        [0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003, 0.0002, 0.0003,
         0.0002, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0001, 0.0003, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0003, 0.0002, 0.0004, 0.0004,
         0.0002, 0.0002],
        [0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0001, 0.0002,
         0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0001],
        [0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0003, 0.0001, 0.0003, 0.0004, 0.0002, 0.0003,
         0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002,
         0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0001, 0.0002, 0.0002,
         0.0001, 0.0003, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0004, 0.0003,
         0.0002, 0.0002],
        [0.0003, 0.0001, 0.0002, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0002, 0.0001, 0.0004, 0.0003, 0.0002, 0.0003,
         0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0004, 0.0002, 0.0004, 0.0004, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0003, 0.0002, 0.0001, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0002],
        [0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0003,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0003, 0.0002,
         0.0002, 0.0002],
        [0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002,
         0.0002, 0.0003, 0.0003, 0.0003, 0.0002, 0.0004, 0.0004, 0.0002, 0.0004,
         0.0003, 0.0002, 0.0002, 0.0004, 0.0002, 0.0003, 0.0005, 0.0002, 0.0002,
         0.0002, 0.0004, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0003, 0.0003, 0.0005, 0.0002, 0.0005, 0.0004, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0003, 0.0003, 0.0002, 0.0002, 0.0003, 0.0002, 0.0005, 0.0004,
         0.0002, 0.0002],
        [0.0001, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0001, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002,
         0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002,
         0.0002, 0.0001],
        [0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0002, 0.0001, 0.0002, 0.0001,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0002,
         0.0002, 0.0002, 0.0002, 0.0002, 0.0002, 0.0001, 0.0001, 0.0002, 0.0001,
         0.0001, 0.0002]])

and of course i could have made mistakes copying that by hand from pdb
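
if i do want dequantized values, here's a rough sketch of undoing the 128x128
block quantization by hand, assuming (per the quantization_config above) that
each scale entry covers a 128x128 tile and that dequantization is just
fp8_weight * weight_scale_inv broadcast per tile:

import torch

def dequant_block_fp8(weight_fp8, scale_inv, block=128):
    # upcast the fp8 weight and expand each per-block scale over its 128x128 tile
    w = weight_fp8.to(torch.float32)
    s = scale_inv.to(torch.float32).repeat_interleave(block, dim=0)[: w.shape[0]]
    s = s.repeat_interleave(block, dim=1)[:, : w.shape[1]]
    return w * s

# e.g. q_a_proj above: weight (1536, 7168) fp8, scale_inv (12, 56),
# so dequantized values land around weight * 2e-4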
