Right! Thanks for the pointer. It does seem like there is room for improvement, then. Maybe Viliam wants to tackle it?
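
To make the two orderings concrete, here is a rough sketch of what Adrien describes in the message quoted below. It only records the order of steps as strings. Every name in it (MergeOrderSketch, singlePass, twoPass, the field names) is hypothetical and not a Lucene API, and the exact order of steps within one field in the single-pass case is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these names exist in Lucene.
class MergeOrderSketch {

    // Single loop over fields, as merging is described in the thread:
    // each field gets a temp copy of its vectors, the graph is built
    // against that copy, and the vectors also end up in the .vec file.
    // The order of steps inside the loop body is an assumption.
    static List<String> singlePass(List<String> fields) {
        List<String> steps = new ArrayList<>();
        for (String field : fields) {
            steps.add("write temp vectors: " + field);
            steps.add("build HNSW graph from temp: " + field);
            steps.add("write .vec: " + field);
            steps.add("delete temp: " + field);
        }
        return steps;
    }

    // Two passes: first write the flat vectors of every field to .vec,
    // then loop over the fields again and build each graph from .vec,
    // so no temp copy is needed.
    static List<String> twoPass(List<String> fields) {
        List<String> steps = new ArrayList<>();
        for (String field : fields) {
            steps.add("write .vec: " + field);
        }
        for (String field : fields) {
            steps.add("build HNSW graph from .vec: " + field);
        }
        return steps;
    }

    public static void main(String[] args) {
        List<String> fields = List.of("title_vec", "body_vec");
        singlePass(fields).forEach(System.out::println);
        System.out.println("---");
        twoPass(fields).forEach(System.out::println);
    }
}
```

The point of the second variant is only that each vector is written once; whether the real writers can be restructured that way is exactly the open question.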

On Fri, Jun 27, 2025 at 12:57 PM Adrien Grand <jpou...@gmail.com> wrote:
>
> Mike, I believe that the answer to your question is in this PR review
> comment: https://github.com/apache/lucene/pull/601#discussion_r783711025.
>
> Merging is currently implemented by looping over fields once, and merging
> them. Writing the vec file first would require merging flat vectors for all
> fields first, and then doing a second pass over all fields to create their
> HNSW graph. This sounds doable, but we never got to it.
>
> On Fri, Jun 27, 2025 at 2:19 PM Michael Sokolov <msoko...@gmail.com> wrote:
>
> > Without this temp file we would need to load the entire set of vectors
> > for the new merged segment into RAM in order to support building an
> > HNSW graph from it. This way we can read the vectors off the disk in
> > the same way we would do during normal searches. I'm not sure, but I
> > think the temp file simply gets renamed into the new segment and
> > doesn't have to be physically copied a second time. It would be good
> > to confirm that.
> >
> > On Thu, Jun 26, 2025 at 4:52 PM Viliam Ďurina <viliam.dur...@gmail.com>
> > wrote:
> > >
> > > Hi all,
> > >
> > > I noticed that during merging in an index that contains vector fields,
> > > the new segment contains a temporary file with ".vec_temp_N.tmp"
> > > extension, which contains all the vectors being merged. This file is
> > > used to search for neighbors for the new HNSW graph. It is later
> > > deleted, and the segment will contain a ".vec" file with the same
> > > vectors. So vectors are copied two times and more space is temporarily
> > > needed on disk.
> > >
> > > In my index, the ".vec" file is 98% of the index size and the index is
> > > many GB. Is it really necessary to have the temp file? Couldn't Lucene
> > > query the "vec" file directly? I checked the code around it, one temp
> > > file is created per field and the temp file is probably deleted before
> > > starting the next field, but still, there is another copy of the
> > > vector, so the temp file seems unnecessary.
> > >
> > > Is there some specific need for the temp file? I might try to do a PR
> > > removing the need for it.
> > >
> > > Viliam
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
> Adrien