I can confirm that the temp file isn't renamed; it is copied a second time. I'm on vacation next week.

On Fri, Jun 27, 2025 at 21:24, Michael Sokolov <msoko...@gmail.com> wrote:

> Right! Thanks for the pointer. It does seem like there is room for
> improvement then, maybe Viliam wants to tackle it?
>
> On Fri, Jun 27, 2025 at 12:57 PM Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Mike, I believe that the answer to your question is in this PR review
> > comment:
> > https://github.com/apache/lucene/pull/601#discussion_r783711025
> >
> > Merging is currently implemented by looping over fields once, and
> > merging them. Writing the vec file first would require merging flat
> > vectors for all fields first, and then doing a second pass over all
> > fields to create their HNSW graph. This sounds doable, but we never
> > got to it.
> >
> > On Fri, Jun 27, 2025 at 2:19 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > > Without this temp file we would need to load the entire set of
> > > vectors for the new merged segment into RAM in order to support
> > > building an HNSW graph from it. This way we can read the vectors
> > > off the disk in the same way we would do during normal searches.
> > > I'm not sure, but I think the temp file simply gets renamed into
> > > the new segment and doesn't have to be physically copied a second
> > > time. It would be good to confirm that.
> > >
> > > On Thu, Jun 26, 2025 at 4:52 PM Viliam Ďurina <viliam.dur...@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I noticed that during merging in an index that contains vector
> > > > fields, the new segment contains a temporary file with
> > > > ".vec_temp_N.tmp" extension, which contains all the vectors
> > > > being merged. This file is used to search for neighbors for the
> > > > new HNSW graph. It is later deleted, and the segment will
> > > > contain a ".vec" file with the same vectors. So vectors are
> > > > copied two times and more space is temporarily needed on disk.
> > > >
> > > > In my index, the ".vec" file is 98% of the index size and the
> > > > index is many GB. Is it really necessary to have the temp file?
> > > > Couldn't Lucene query the "vec" file directly? I checked the
> > > > code around it, one temp file is created per field and the temp
> > > > file is probably deleted before starting the next field, but
> > > > still, there is another copy of the vector, so the temp file
> > > > seems unnecessary.
> > > >
> > > > Is there some specific need for the temp file? I might try to
> > > > do a PR removing the need for it.
> > > >
> > > > Viliam
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> > --
> > Adrien