kbobyrev added a comment. @sammccall thank you for the comments, I'll improve it. @ilya-biryukov also provided valuable feedback and suggested that the code looks complicated now. We also discussed different compression approach which would require storing `Head`s and `Payload`s separately so that binary search over `Head`s could have better cache locality. That can dramatically improve performance.
For now, I think it's important to simplify the code and I'll start with that. Also, your suggestions will help to save even more memory! Zeroed bytes are an example of that: I started the patch with the general encoding API (so that zeros could be encoded in the list), but there's no good reason to keep that assumption now. Also, I think I got my measurements wrong yesterday. I measured posting lists only: without compression size is 55 MB and with compression (current version, not optimized yet) it's 22 MB. This seems like a huge win. I try to keep myself from being overenthusiastic and double-check the numbers, but it looks more like something you estimated when we used posting list size distribution. ================ Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:262 + "Unterminated byte sequence at the end of input stream."); + return Result; +} ---------------- sammccall wrote: > here I think you're missing a memory optimization probably equal in size to > the whole gains achieved by compression :-) > > libstdc++ uses a 2x growth factor for std::vector, so we're probably wasting > an extra 30% or so of ram (depending on size distribution, I forget the > theory here). > We should shrink to fit. If we were truly desperate we'd iterate over all the > numbers and presize the array, but we're probably not. > > I think `return std::vector<DocID>(Result); // no move, shrink-to-fit` will > shrink it as you want (note that `shrink_to_fit()` is usually a no-op :-\) Great catch! I have to be careful with `std::vector`s which are not allocated with their final size in advance. ================ Comment at: clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp:1 +//===-- VByteFuzzer.cpp - Fuzz VByte Posting List encoding ----------------===// +// ---------------- sammccall wrote: > For better or worse, adding a fuzzer in the open-source project is pretty > high ceremony (CMake stuff, subdirectory, oss-fuzz configuration, following > up on bugs). > > I'm not sure the maintenance cost is justified here. Can we just run the > fuzzer but not check it in? OK, I'll leave this here until the patch is accepted for continuous testing, but I won't push it in the final version. https://reviews.llvm.org/D52300 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits