karl3@writeme.com wrote:
> karl3@writeme.com wrote:
> > karl3@writeme.com wrote:
> > there was some energy around making a network-based inference engine, 
> > maybe by modifying deepseek.cpp (don't quite recall why not staying in 
> > python, some concern arose)
> > task got weak, found cinatra as a benchmark leader for c++ web engines 
> > (although pico.v was the top! (surprised all c++ http engines were beaten 
> > by java O_O very curious about this, wondering if it's a high-end test 
> > system) never heard of the V language before, but it's interesting that it 
> > won a leaderboard)
> > inhibition ended up discovering a concern somewhat like ... on this 4GB ram 
> > system it might take 15-33GB of network transfer for each forward pass of 
> > the model ... [multi-token passes ^^]
> > karl3@writeme.com wrote:
> > the concern resonates with difficulty making the implementation, and some 
> > form of inhibition or concern around using python. notably, i've written 
> > offloading python hooks a lot and they never last due to the underlying 
> > interfaces changing (although those interfaces have stabilized much more 
> > now that hf made accelerate their official implementation) (also i think 
> > the issue is more severe dissociative associations than the interface, if 
> > one considers the possibility of personal maintenance and use rather than 
> > usability for others). don't immediately recall the python concern
> > i seem to be off task or taking a break, but it would make sense to do disk 
> > caching too. there is also the option of quantizing. basically, LLMs and AI 
> > in general place r&d effort between the user and ease, smallness, cheapness, 
> > power, etc
> > I poked at python again. The existing implementations of the 8 bit 
> > quantization used by the model all require an NVidia GPU which I do not 
> > presently have. It is fun to imagine making it work, maybe I can upcast it 
> > to float32 or something >)

so i have implemented a little of this.

it creates a model state_dict that lazily loads tensors off the network when 
accessed. it's quite fast! some of the layers are larger than my available ram; 
i've deferred that problem with a plan to resolve it by either storing them in 
mmap'd files or further sharding the modules that use them (ideally both). i 
did not implement cpu operators for the model's 8 bit quantization but 
prevented the check from firing so i could continue testing for now.
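
the upcast idea from the quoted message could look roughly like this on cpu. 
this is a minimal sketch, assuming the FP8 weights come with per-block inverse 
scales; the names weight and scale_inv and the 128 block size are assumptions, 
not checked against the real checkpoint:

import torch

def upcast_fp8(weight, scale_inv=None, block=128):
    # plain widening; float8_e4m3fn -> float32 casts are supported on cpu
    out = weight.to(torch.float32)
    if scale_inv is not None:
        # hypothetical block-wise dequant: each block x block tile of the
        # weight is assumed to share one entry of scale_inv
        rows, cols = out.shape
        scale = scale_inv.repeat_interleave(block, dim=0)[:rows]
        scale = scale.repeat_interleave(block, dim=1)[:, :cols]
        out = out * scale
    return out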

the vm i have has 3-4 GB of ram so it's easy to exhaust with this task, at 
which point it freezes and i have to remotely power it off and on, which it 
does slowly because it is out of ram and has no swap. kind of a downer when it 
happens.

the deepseekv3 model implementation allocates a separate buffer for the RoPE 
position encodings in each layer. i haven't looked closely at the various 
implementations the architecture can configure, but in llama these buffers were 
algorithmically defined and constant; they're not available over the network 
but could be written to a file. right now they are exhausting my ram and 
causing the vm's thrash-freeze: the model has 60+1 layers and each position 
encoding buffer takes a few hundred megabytes :/ their size grows with the 
configured max input context length, so a quick fix could be to reconfigure the 
model to expect a much smaller context.

i was trying to patch them to cache their construction so as to share the 
buffer across layers, but my attempt failed and i froze the vm again :)
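
for reference, here is a corrected sketch of what i was attempting. it is 
untested, and it assumes the rotary classes are plain attributes of the 
dynamically loaded deepseek module and take hashable constructor arguments:

import inspect, torch

def unify_rope(model_or_class):
    # wrap every *Rotary* class in the module so identical constructor calls
    # return one shared instance instead of a fresh per-layer copy
    deepseek = inspect.getmodule(model_or_class)
    cache = {}
    def wrap_rope(rope_cls):
        def factory(*params, **kwparams):
            key = (rope_cls.__name__, params, tuple(sorted(kwparams.items())))
            if key not in cache:
                cache[key] = rope_cls(*params, **kwparams)
            return cache[key]
        return factory
    for name, val in list(deepseek.__dict__.items()):
        # issubclass rather than isinstance: the module attributes are classes
        if 'Rotary' in name and isinstance(val, type) and issubclass(val, torch.nn.Module):
            setattr(deepseek, name, wrap_rope(val))

it would have to run after the remote code is imported but before from_config 
builds the layers, and anything doing isinstance checks against those classes 
would need care.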

late. sleep times. 2326

2400

ok i got to the next milestone. it constructs a model, puts it in a pipeline, 
and all the weights download from the network when used.

for it to do anything on this tiny system, more code is needed to mmap or 
subshard large weights.
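
the mmap half could be shaped like the function below, slotted into the 
commented-out branch of __getitem__ in the code further down. it is only a 
sketch: the cache directory is made up, and it assumes torch.from_file can map 
the file at the tensor's dtype.

import os, torch

def fetch_to_mmap(session, request, tensor, weight, cache_dir='weight_cache'):
    # stream the ranged download to disk once, then map the file instead of
    # holding the bytes in ram
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, weight + '.bin')
    if not os.path.exists(path) or os.path.getsize(path) != tensor.nbytes:
        with session.send(request, stream=True) as response, open(path, 'wb') as f:
            for chunk in response.iter_content(1024 * 128):
                f.write(chunk)
    flat = torch.from_file(path, shared=True, size=tensor.numel(), dtype=tensor.dtype)
    return flat.reshape(tensor.shape)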

current code is like this:

import inspect, json, psutil
import accelerate, requests, torch, tqdm, transformers

class Quirks:
    if not torch.cuda.is_available():
        # this avoids FP8 assertions on cpu during placement testing
        torch.cuda.is_available = lambda: True
        # report hopper-class capability; accept and ignore any device argument
        torch.cuda.get_device_capability = lambda *args, **kwargs: (9, 0)
    #def unify_rope(model_or_class):
    #    # deepseek generates separate rope buffers for each layer which can use significant memory
    #    deepseek = inspect.getmodule(model_or_class)
    #    cache = {}
    #    def wrap_rope(rope):
    #        def wrapper(*params, **kwparams):
    #            key = tuple(params) + tuple(kwparams)
    #            if key in cache:
    #                return cache[key]
    #            else:
    #                val = rope(*params, **kwparams)
    #                cache[key] = val
    #                return val
    #        return wrapper
    #    for key, val in deepseek.__dict__.items():
    #        if 'Rotary' in key and isinstance(val, torch.nn.Module):
    #            setattr(deepseek, key, wrap_rope(val))

class LazyStateDict(dict):
    # maps weight name -> [meta tensor, prepared ranged request];
    # indexing downloads and materializes the tensor on demand
    def __init__(self, tensor_request_by_name, device):
        super().__init__(tensor_request_by_name)
        self.session = requests.Session()
        self.device = device

    def get_meta_tensor(self, weight):
        return super().__getitem__(weight)[0]

    def __getitem__(self, weight):
        # materialize the tensor by streaming its byte range into a fresh buffer
        tensor, request = super().__getitem__(weight)
        chunk_size = 1024*128
        with tqdm.tqdm(desc=weight, leave=False, total=tensor.nbytes) as pbar:
            #if tensor.nbytes > psutil.virtual_memory().available / 2:
            #    print(weight, 'is more than half available ram, mmapping a file ...')
            #else:

                # we could also further shard the embeddings and lm_head
                # since embeddings are sparsely used
            assert tensor.nbytes < psutil.virtual_memory().available / 2
            buffer = memoryview(bytearray(tensor.nbytes))

            with self.session.send(request, stream=True) as response:
                while pbar.n < pbar.total:
                    pbar.update(
                        response.raw.readinto(
                            buffer[ pbar.n : pbar.n + chunk_size ]
                        )
                    )

        # torch.frombuffer takes no device argument; build on cpu, then move
        result = torch.frombuffer(
            buffer,
            dtype = tensor.dtype,
            count = tensor.numel(),
            requires_grad = False,
        ).reshape(tensor.shape).to(self.device)

        return result

    # iteration goes through __getitem__, so tensors are only downloaded as
    # they are consumed
    def items(self):
        for key in self:
            yield [key, self[key]]

    def values(self):
        for key in self:
            yield self[key]

    # the largest meta tensor in the dict, handy for sizing decisions
    def largest(self):
        return max([
            [key, tensor]
            for key, [tensor, _] in super().items()
        ], key = lambda keytensor: keytensor[1].nbytes)[1]

    @staticmethod
    def tensor_request_from_json(url, N, data):
        dtype = data['dtype']
        dtype = dict(
            F32 = torch.float32,
            F8_E4M3 = torch.float8_e4m3fn,
            BF16 = torch.bfloat16,
        )[dtype]
        shape = data['shape']
        tensor = torch.empty(shape, dtype=dtype, device='meta')
        start, end = data['data_offsets']
        # offsets are relative to the 8-byte length prefix plus the N-byte json
        # header; the http Range end is inclusive, hence the -1
        start += N + 8
        end += N + 8 - 1
        request = requests.Request('GET', url, dict(Range='bytes='+str(start)+'-'+str(end)))
        request = request.prepare()
        return [tensor, request]

    @classmethod
    def from_user_repo_branch(cls, user, repo, branch, device):
        base_url = f'https://huggingface.co/{user}/{repo}/raw/{branch}/'
        lfs_base_url = f'https://huggingface.co/{user}/{repo}/resolve/{branch}/'
    
        safetensors_index_url = base_url + 'model.safetensors.index.json'
    
        print(safetensors_index_url)
    
        with requests.get(safetensors_index_url, stream=True) as response:
            safetensors_index = json.load(response.raw)
    
        print(safetensors_index['metadata'])
    
        fn_by_weight = safetensors_index['weight_map']
    
        urls = [lfs_base_url + fn for fn in set(fn_by_weight.values())]
    
        #url_range_dict = {}
        #data_by_weight = {}

        tensor_request_by_name = {}
    
        with tqdm.tqdm(urls,desc='constructing tensor urls') as pbar:
          for url in pbar:
            # we could potentially also check the git-lfs sha256 from the base url and merklify the data too, this would mean downloading it all
            #[b'version https://git-lfs.github.com/spec/v1', b'oid sha256:e94d32e8649e1a5b03cc0a343c59ca5a6d80d03cd46161b482fd3bb2484adb7d', b'size 4302350824']
            #lfs = dict([ line.decode().split(' ', 1) for line in response.iter_lines() ])
            with requests.get(url, stream=True) as response:
                N = int.from_bytes(response.raw.read(8), 'little')
                header = json.loads(response.raw.read(N))
            for weight, data in header.items():
                if weight == '__metadata__':
                    continue
                #dtype = data['dtype']
                #shape = data['shape']
                #start, end = data['data_offsets']
                #start += headersize + 8
                #end += headersize + 8
                #data_by_weight[weight] = data | {'url':url,'N':headersize}
                tensor_request_by_name[weight] = cls.tensor_request_from_json(url, N, data)
        return cls(tensor_request_by_name, device)

def construct(model_id, device, config_patches = {}, attr_patches = {}):
    user, repo = model_id.split('/',1)
    branch = 'main'

    print(user, repo, branch)

    config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
    for key, val in config_patches.items():
        setattr(config, key, val)
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    # build the model skeleton on the meta device without allocating or initializing weights
    with accelerate.init_empty_weights(), transformers.modeling_utils.no_init_weights():
        model = transformers.AutoModelForCausalLM.from_config(config, trust_remote_code=True)
    for key, val in attr_patches.items():
        setattr(model, key, val)

    lazy_state_dict = LazyStateDict.from_user_repo_branch(user, repo, branch, device=device)

    # misuse cpu offloading by providing lazy_state_dict
    model = accelerate.cpu_offload(model, device, state_dict = lazy_state_dict)
    model.hf_device_map = { '': device }

    return transformers.pipeline('text-generation', model=model, config=config, tokenizer=tokenizer)

pipe = construct(
    'deepseek-ai/DeepSeek-V3',
    device = 'cpu',
    config_patches = dict(
        max_position_embeddings = 64, # drop ctx len from 163840 to 64
    ),
    attr_patches = dict(
        _supports_cache_class = False, # might be a bug that this isn't in the model
    ),
)
pipe('Once upon a time,')
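
and for the "further shard the embeddings" comment inside __getitem__, a rough 
illustration of fetching only the rows a prompt actually touches, reusing the 
same N + 8 offset math as tensor_request_from_json. purely a sketch: fp8 rows 
would likely need widening before stacking, and nothing here is wired into the 
model.

import requests, torch

def fetch_embedding_rows(session, url, N, data, token_ids):
    # one small ranged request per token row of the [vocab, dim] tensor
    # described by the safetensors header entry data
    dtype = dict(F32=torch.float32, F8_E4M3=torch.float8_e4m3fn, BF16=torch.bfloat16)[data['dtype']]
    vocab, dim = data['shape']
    row_bytes = dim * torch.empty(0, dtype=dtype).element_size()
    base = data['data_offsets'][0] + N + 8
    rows = []
    for tok in token_ids:
        start = base + tok * row_bytes
        request = requests.Request('GET', url, dict(Range=f'bytes={start}-{start + row_bytes - 1}'))
        with session.send(request.prepare()) as response:
            rows.append(torch.frombuffer(bytearray(response.content), dtype=dtype, count=dim))
    return torch.stack(rows)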
