Without this temp file we would need to load the entire set of vectors for the new merged segment into RAM in order to build an HNSW graph from them. With the temp file, we can read the vectors off disk the same way we do during normal searches. I'm not sure, but I think the temp file is simply renamed into the new segment rather than being physically copied a second time. It would be good to confirm that.
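
To make the idea concrete, here is a rough sketch in plain JDK I/O. It is not Lucene's code: the class TempVectorFile, its methods, and the file names are all made up for illustration. It writes the vectors to a temp file, reads one vector at a time by ordinal with a positional read (so only a single vector sits in memory), and then renames the file. The rename step only shows what "no second copy" would look like if my guess above is right; whether Lucene actually does this is still unconfirmed.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Illustrative only: hypothetical names, plain JDK I/O, not Lucene's codec classes.
final class TempVectorFile {

    // Write every vector to the temp file as consecutive little-endian float32 values.
    static void writeVectors(Path tmp, float[][] vectors, int dim) throws IOException {
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
            for (float[] v : vectors) {
                buf.clear();                        // position 0, limit = capacity
                buf.asFloatBuffer().put(v, 0, dim); // fills the bytes, leaves buf's position at 0
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
            }
        }
    }

    // Read one vector by ordinal using a positional read; nothing else is loaded.
    static float[] readVector(FileChannel ch, int ord, int dim) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        long start = (long) ord * dim * Float.BYTES;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, start + buf.position());
            if (n < 0) {
                throw new EOFException("ordinal out of range: " + ord);
            }
        }
        buf.flip();
        float[] out = new float[dim];
        buf.asFloatBuffer().get(out);
        return out;
    }

    // Rename the temp file to its final name. On the same filesystem this
    // moves the directory entry and does not copy the vector bytes.
    static void promote(Path tmp, Path dest) throws IOException {
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("vec-demo");
        Path tmp = dir.resolve("field.tmp");   // hypothetical temp file name
        Path dest = dir.resolve("field.vec");  // hypothetical final file name
        float[][] vectors = {{1f, 2f}, {3f, 4f}, {5f, 6f}};
        writeVectors(tmp, vectors, 2);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // A graph builder would issue many such reads, one neighbor candidate at a time.
            System.out.println(Arrays.toString(readVector(ch, 1, 2)));
        }
        promote(tmp, dest);
    }
}
```

The point of the positional read is that memory use stays proportional to one vector, not to the whole merged segment, and the OS page cache handles the rest.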

On Thu, Jun 26, 2025 at 4:52 PM Viliam Ďurina <viliam.dur...@gmail.com> wrote:
>
> Hi all,
>
> I noticed that during merging in an index that contains vector fields, the
> new segment contains a temporary file with ".vec_temp_N.tmp" extension,
> which contains all the vectors being merged. This file is used to search
> for neighbors for the new HNSW graph. It is later deleted, and the segment
> will contain a ".vec" file with the same vectors. So vectors are copied two
> times and more space is temporarily needed on disk.
>
> In my index, the ".vec" file is 98% of the index size and the index is many
> GB. Is it really necessary to have the temp file? Couldn't Lucene query the
> "vec" file directly? I checked the code around it, one temp file is created
> per field and the temp file is probably deleted before starting the next
> field, but still, there is another copy of the vector, so the temp file
> seems unnecessary.
>
> Is there some specific need for the temp file? I might try to do a PR
> removing the need for it.
>
> Viliam