Re: Temporary vector file during merging

Viliam Ďurina Fri, 27 Jun 2025 14:53:11 -0700

I can confirm the temp file isn't renamed, but it's copied a second time.
I'm on vacation next week.


On Fri, Jun 27, 2025 at 9:24 PM Michael Sokolov <[email protected]> wrote:

> Right! Thanks for the pointer. It does seem like there is room for
> improvement then, maybe Viliam wants to tackle it?
>
> On Fri, Jun 27, 2025 at 12:57 PM Adrien Grand <[email protected]> wrote:
> >
> > Mike, I believe that the answer to your question is in this PR review
> > comment: https://github.com/apache/lucene/pull/601#discussion_r783711025 .
> >
> > Merging is currently implemented by looping over fields once, and merging
> > them. Writing the vec file first would require merging flat vectors for all
> > fields first, and then doing a second pass over all fields to create their
> > HNSW graph. This sounds doable, but we never got to it.
> >
> >
> >
> > On Fri, Jun 27, 2025 at 2:19 PM Michael Sokolov <[email protected]> wrote:
> >
> > > Without this temp file we would need to load the entire set of vectors
> > > for the new merged segment into RAM in order to support building an
> > > HNSW graph from it. This way we can read the vectors off the disk in
> > > the same way we would do during normal searches.  I'm not sure, but I
> > > think the temp file simply gets renamed into the new segment and
> > > doesn't have to be physically copied a second time.  It would be good
> > > to confirm that.
> > >
> > > On Thu, Jun 26, 2025 at 4:52 PM Viliam Ďurina <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I noticed that during merging in an index that contains vector fields, the
> > > > new segment contains a temporary file with ".vec_temp_N.tmp" extension,
> > > > which contains all the vectors being merged. This file is used to search
> > > > for neighbors for the new HNSW graph. It is later deleted, and the segment
> > > > will contain a ".vec" file with the same vectors. So vectors are copied two
> > > > times and more space is temporarily needed on disk.
> > > >
> > > > In my index, the ".vec" file is 98% of the index size and the index is many
> > > > GB. Is it really necessary to have the temp file? Couldn't Lucene query the
> > > > ".vec" file directly? I checked the code around it: one temp file is created
> > > > per field and the temp file is probably deleted before starting the next
> > > > field, but still, there is another copy of the vectors, so the temp file
> > > > seems unnecessary.
> > > >
> > > > Is there some specific need for the temp file? I might try to do a PR
> > > > removing the need for it.
> > > >
> > > > Viliam
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
> >
> > --
> > Adrien
>
>
>
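
To put rough numbers on the disk overhead discussed in this thread, here is a
small, self-contained sketch. It is not Lucene code: the class and method
names (MergeOrderSketch, peakWithTempFile, peakTwoPass) are made up for
illustration, graph and metadata files are ignored, and it assumes that each
field's temp file coexists with that field's copy in the ".vec" file until the
field is finished. It only compares peak vector bytes on disk for the current
one-pass-per-field approach against the two-pass idea Adrien describes.

```java
class MergeOrderSketch {

    // One pass over fields (assumed model of the current behavior):
    // for each field, write a temp copy of its vectors, copy the same
    // bytes into the ".vec" file, then delete the temp file.
    static long peakWithTempFile(long[] fieldBytes) {
        long vecFile = 0; // bytes already in the merged ".vec" file
        long peak = 0;
        for (long bytes : fieldBytes) {
            long temp = bytes;                      // temp file written
            peak = Math.max(peak, vecFile + temp);
            vecFile += bytes;                       // second copy into ".vec"
            peak = Math.max(peak, vecFile + temp);  // both copies exist here
            // temp file deleted before the next field starts
        }
        return peak;
    }

    // Two passes (the idea from the thread): first write flat vectors for
    // all fields into ".vec", then build each graph reading from ".vec".
    static long peakTwoPass(long[] fieldBytes) {
        long vecFile = 0;
        for (long bytes : fieldBytes) {
            vecFile += bytes;                       // pass 1: flat vectors only
        }
        // pass 2 would build graphs from ".vec"; no extra vector copy
        return vecFile;
    }

    public static void main(String[] args) {
        long[] fieldBytes = {100, 300}; // hypothetical per-field vector sizes
        System.out.println("with temp file: " + peakWithTempFile(fieldBytes));
        System.out.println("two-pass:       " + peakTwoPass(fieldBytes));
    }
}
```

Under this model the worst case is reached while the last (or largest) field
is being merged, when everything already in ".vec" sits next to one extra copy
of that field. For an index with a single large vector field that is roughly
twice the size of the final ".vec" file.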
