Hi,

> The overall trend in machine learning embedding sizes has been growing
> rapidly over the last few years, from 128 up to 4K dimensions, yielding
> additional value and quality improvements. It's not clear when this
> growth trend will ease. The vectors that the leading text embedding
> models now generate exceed the index tuple size limit imposed by
> IndexTupleData.t_info.
>
> The current index tuple size is stored in 13 bits of
> IndexTupleData.t_info, which limits the maximum size of an index tuple
> to 2^13 = 8192 bytes. Vectors implemented by pgvector currently use a
> 32-bit float for elements, which limits vectors to about 2K dimensions,
> which is no longer state of the art.
>
> I've attached a patch that increases IndexTupleData.t_info from 16 bits
> to 32 bits, allowing for significantly larger index tuples. I would
> guess this patch is not a complete implementation that allows for
> migration from previous versions, but it does compile and initdb
> succeeds. I'd be happy to continue the work if the core team is
> receptive to an update in this area, and I'd appreciate any feedback
> the community has on the approach.
If I read this correctly, the patch basically adds 16 bits that are useless for all applications except ML ones... Perhaps implementing alternative storage specifically for ML via the table access method (TAM) interface would be a better approach?

--
Best regards,
Aleksander Alekseev