It's interesting that you're not seeing the same slowdown on the other field. How hard would it be for you to test the performance with the names of the digest algorithms lowercased, i.e. "md5;[md5 value in hex]", etc.? I'm asking because the compression logic is optimized for lowercase ASCII, so removing the uppercase letters would help avoid having to encode exceptions, which is one reason I think the slowdown might be smaller on your other field.
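A minimal sketch of that experiment, assuming the terms are assembled as plain strings before they are indexed. `DigestTerms` and `toTerm` are hypothetical names for illustration, not part of Lucene, and the hex value is passed through unchanged:

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene API: builds the "algorithm;hex" term
// with the algorithm name lowercased.
final class DigestTerms {
    private DigestTerms() {}

    // Lowercases only the algorithm prefix; Locale.ROOT avoids
    // locale-specific case mappings (e.g. the Turkish dotless i).
    static String toTerm(String algorithm, String hexDigest) {
        return algorithm.toLowerCase(Locale.ROOT) + ';' + hexDigest;
    }

    public static void main(String[] args) {
        // prints "md5;d41d8cd98f00b204e9800998ecf8427e"
        System.out.println(toTerm("MD5", "d41d8cd98f00b204e9800998ecf8427e"));
    }
}
```

The same normalization would have to be applied to the terms passed to seekExact, and the field would need to be reindexed for the comparison to mean anything.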
In case you're using an old JRE, you might want to try with JRE 13 or more recent. Some of the logic in this lowercase ASCII compression only gets vectorized on JDK 13+.

On Mon, Jul 27, 2020 at 9:40 AM Trejkaz <trej...@trypticon.org> wrote:

> Yep, the timings posted were the best speed out of 10 runs in a row.
> The profiling was done in the middle of 1000 iterations in a row just
> to knock off any warm-up time.
>
> The sort of data we're storing in the field is quite possibly a
> worst-case scenario for the compression. The data is mixed digest info
> like
>
> "MD5;[md5 value in hex]"
> "SHA-1;[sha1 value in hex]"
> "SHA-256;[sha256 value in hex]"
>
> In fact, there's another field in the index which contains the same
> MD5s without the common prefix - the same sort of operation on that
> field doesn't get the same slowdown. (It's a bit slower. Like 5% or
> so? Certainly nothing like 100%.) So at least for looking up MD5s we
> have the luxury of an alternative option for the lookups. For other
> digests I'm afraid we're stuck for now until we change how we index
> those.
>
> What's ironic is that we originally put the prefix on to make seeking
> to the values faster. ^^;;
>
> TX
>
>
> On Mon, 27 Jul 2020 at 17:08, Adrien Grand <jpou...@gmail.com> wrote:
> >
> > Alex, this issue you linked is about the terms dictionary of doc values.
> > Trejkaz linked the correct issue which is about the terms dictionary of the
> > inverted index.
> >
> > It's interesting you're seeing so much time spent in readVInt on 8.5 since
> > there is a single vint that is read for each block in
> > "LowercaseAsciiCompression.decompress". Are these relative timings
> > consistent over multiple runs?
> >
> > On Mon, Jul 27, 2020 at 5:57 AM Alex K <aklib...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Also have a look here:
> > > https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9378
> > >
> > > Seems it might be related.
> > > - Alex
> > >
> > > On Sun, Jul 26, 2020, 23:31 Trejkaz <trej...@trypticon.org> wrote:
> > >
> > > > Hi all.
> > > >
> > > > I've been tracking down slow seeking performance in TermsEnum after
> > > > updating to Lucene 8.5.1.
> > > >
> > > > On 8.5.1:
> > > >
> > > > SegmentTermsEnum.seekExact: 33,829 ms (70.2%) (remaining time in our code)
> > > > SegmentTermsEnumFrame.loadBlock: 29,104 ms (60.4%)
> > > > CompressionAlgorithm$2.read: 25,789 ms (53.5%)
> > > > LowercaseAsciiCompression.decompress: 25,789 ms (53.5%)
> > > > DataInput.readVInt: 24,690 ms (51.2%)
> > > > SegmentTermsEnumFrame.scanToTerm: 2,921 ms (6.1%)
> > > >
> > > > On 7.7.0 (previous version we were using):
> > > >
> > > > SegmentTermsEnum.seekExact: 5,897 ms (43.7%) (remaining time in our code)
> > > > SegmentTermsEnumFrame.loadBlock: 3,499 ms (25.9%)
> > > > BufferedIndexInput.readBytes: 1,500 ms (11.1%)
> > > > DataInput.readVInt: 1,108 ms (8.2%)
> > > > SegmentTermsEnumFrame.scanToTerm: 1,501 ms (11.1%)
> > > >
> > > > So on the surface it sort of looks like the new version spends less
> > > > time scanning and much more time loading blocks to decompress?
> > > >
> > > > Looking for some clues to what might have changed here, and whether
> > > > it's something we can avoid, but currently LUCENE-4702 looks like it
> > > > may be related.
> > > >
> > > > TX
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> >
> > --
> > Adrien

--
Adrien