[PATCH] D52300: [clangd] Implement VByte PostingList compression

Kirill Bobyrev via Phabricator via cfe-commits Fri, 21 Sep 2018 01:24:02 -0700

kbobyrev added a comment.

@sammccall thank you for the comments, I'll improve it. @ilya-biryukov also 
provided valuable feedback and suggested that the code looks complicated now. 
We also discussed different compression approach which would require storing 
`Head`s and `Payload`s separately so that binary search over `Head`s could have 
better cache locality. That can dramatically improve performance.


For now, I think it's important to simplify the code and I'll start with that. 
Also, your suggestions will help to save even more memory! Zeroed bytes are an 
example of that: I started the patch with the general encoding API (so that 
zeros could be encoded in the list), but there's no good reason to keep that 
assumption now.

Also, I think I got my measurements wrong yesterday. I measured posting lists 
only: without compression size is 55 MB and with compression (current version, 
not optimized yet) it's 22 MB. This seems like a huge win. I try to keep myself 
from being overenthusiastic and double-check the numbers, but it looks more 
like something you estimated when we used posting list size distribution.



================
Comment at: clang-tools-extra/clangd/index/dex/PostingList.cpp:262
+             "Unterminated byte sequence at the end of input stream.");
+  return Result;
+}
----------------
sammccall wrote:
> here I think you're missing a memory optimization probably equal in size to 
> the whole gains achieved by compression :-)
> 
> libstdc++ uses a 2x growth factor for std::vector, so we're probably wasting 
> an extra 30% or so of ram (depending on size distribution, I forget the 
> theory here).
> We should shrink to fit. If we were truly desperate we'd iterate over all the 
> numbers and presize the array, but we're probably not.
> 
> I think `return std::vector<DocID>(Result); // no move, shrink-to-fit` will 
> shrink it as you want (note that `shrink_to_fit()` is usually a no-op :-\)
Great catch! I have to be careful with `std::vector`s which are not allocated 
with their final size in advance.


================
Comment at: clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp:1
+//===-- VByteFuzzer.cpp - Fuzz VByte Posting List encoding 
----------------===//
+//
----------------
sammccall wrote:
> For better or worse, adding a fuzzer in the open-source project is pretty 
> high ceremony (CMake stuff, subdirectory, oss-fuzz configuration, following 
> up on bugs).
> 
> I'm not sure the maintenance cost is justified here. Can we just run the 
> fuzzer but not check it in?
OK, I'll leave this here until the patch is accepted for continuous testing, 
but I won't push it in the final version.


https://reviews.llvm.org/D52300



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D52300: [clangd] Implement VByte PostingList compression

Reply via email to