Right! Thanks for the pointer. It does seem like there is room for
improvement then. Maybe Viliam wants to tackle it?
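
For anyone following along, here is a rough sketch of the two orderings
Adrien describes below. It is a toy model, not Lucene code: every class
and method name is hypothetical, the order of steps within a field is a
guess, and it assumes the temp file is copied rather than renamed, which
nobody has confirmed yet.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these names are hypothetical and are not Lucene APIs.
class MergeOrderSketch {

    // Ordering described in the thread as the current one: a single loop
    // over fields, each field fully merged before the next one starts.
    // Assumption: the temp file is copied into .vec, not renamed.
    static List<String> singlePass(List<String> fields) {
        List<String> steps = new ArrayList<>();
        for (String field : fields) {
            steps.add("write vectors to temp file: " + field);
            steps.add("build HNSW graph reading temp file: " + field);
            steps.add("write vectors to .vec: " + field);
            steps.add("delete temp file: " + field);
        }
        return steps;
    }

    // Ordering suggested in the thread: first merge the flat vectors of
    // all fields into .vec, then make a second pass to build the graphs.
    static List<String> twoPass(List<String> fields) {
        List<String> steps = new ArrayList<>();
        for (String field : fields) {
            steps.add("write vectors to .vec: " + field);
        }
        for (String field : fields) {
            steps.add("build HNSW graph reading .vec: " + field);
        }
        return steps;
    }

    // Counts how many times a field's full vector data is written out.
    static long vectorWrites(List<String> steps) {
        return steps.stream().filter(s -> s.startsWith("write vectors")).count();
    }

    public static void main(String[] args) {
        List<String> fields = List.of("title_vec", "body_vec");
        System.out.println("single pass: " + vectorWrites(singlePass(fields)));
        System.out.println("two passes:  " + vectorWrites(twoPass(fields)));
    }
}
```

Under those assumptions, two fields cost four full vector writes in the
single-pass ordering and two in the two-pass ordering, which is the
saving Viliam is after.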

On Fri, Jun 27, 2025 at 12:57 PM Adrien Grand <jpou...@gmail.com> wrote:
>
> Mike, I believe that the answer to your question is in this PR review
> comment: https://github.com/apache/lucene/pull/601#discussion_r783711025.
>
> Merging is currently implemented by looping over fields once, and merging
> them. Writing the vec file first would require merging flat vectors for all
> fields first, and then doing a second pass over all fields to create their
> HNSW graph. This sounds doable, but we never got to it.
>
>
>
> On Fri, Jun 27, 2025 at 2:19 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
> > Without this temp file we would need to load the entire set of vectors
> > for the new merged segment into RAM in order to support building an
> > HNSW graph from it. This way we can read the vectors off the disk in
> > the same way we would do during normal searches.  I'm not sure, but I
> > think the temp file simply gets renamed into the new segment and
> > doesn't have to be physically copied a second time.  It would be good
> > to confirm that.
> >
> > On Thu, Jun 26, 2025 at 4:52 PM Viliam Ďurina <viliam.dur...@gmail.com>
> > wrote:
> > >
> > > Hi all,
> > >
> > > > I noticed that during merging in an index that contains vector fields,
> > > > the new segment contains a temporary file with ".vec_temp_N.tmp"
> > > > extension, which contains all the vectors being merged. This file is
> > > > used to search for neighbors for the new HNSW graph. It is later
> > > > deleted, and the segment will contain a ".vec" file with the same
> > > > vectors. So vectors are copied two times and more space is
> > > > temporarily needed on disk.
> > >
> > > > In my index, the ".vec" file is 98% of the index size and the index is
> > > > many GB. Is it really necessary to have the temp file? Couldn't Lucene
> > > > query the ".vec" file directly? I checked the code around it: one temp
> > > > file is created per field and the temp file is probably deleted before
> > > > starting the next field, but still, there is another copy of the
> > > > vectors, so the temp file seems unnecessary.
> > >
> > > Is there some specific need for the temp file? I might try to do a PR
> > > removing the need for it.
> > >
> > > Viliam
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
> --
> Adrien

