Re: Performance Difference between files getting opened with IoContext.RANDOM vs IoContext.READ

2024-09-30 Thread Uwe Schindler

Hi,

please also note: in Lucene 10 the checksum IndexInput will always be 
opened with IOContext.READONCE.


If you want to sequentially read a whole index file for reasons other 
than checksumming, please pass the correct IOContext. In addition, in 
Lucene 9.12 (the latest 9.x version, released today) there are some changes 
to ensure that checksumming is always done with IOContext.READONCE 
(which uses READ behind the scenes).


Uwe

On 29.09.2024 at 17:09, Michael McCandless wrote:

Hi Navneet,

With the RANDOM IOContext, on modern OSes / Java versions, Lucene will hint to
the memory-mapped segment that the IO will be random, using the madvise POSIX
API with the MADV_RANDOM flag.

For the READ IOContext, Lucene maybe hints with MADV_SEQUENTIAL, I'm not sure.
Or maybe it doesn't hint anything?

It's up to the OS to then take these hints and do something "interesting"
to try to optimize IO and page caching based on them.  I think
modern Linux kernels will read ahead (and pre-warm the page cache) for
MADV_SEQUENTIAL?  And maybe skip page cache and readahead for MADV_RANDOM?
Not certain...
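For reference, the hinting itself boils down to a single posix_madvise call on the mapped region. Below is a minimal JDK-only sketch of issuing such a hint via the FFM API (this assumes Java 22+ and a Linux libc; the constant values are taken from Linux's <sys/mman.h>, and this is an illustration, not Lucene's actual code):

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class MadviseSketch {
    // Advice values from Linux <sys/mman.h>; other platforms may differ.
    static final int POSIX_MADV_RANDOM = 1;
    static final int POSIX_MADV_SEQUENTIAL = 2;

    public static void main(String[] args) throws Throwable {
        Path tmp = Files.createTempFile("madvise", ".bin");
        Files.write(tmp, new byte[1 << 20]); // 1 MiB of zeros to map

        Linker linker = Linker.nativeLinker();
        // int posix_madvise(void *addr, size_t len, int advice)
        MethodHandle posixMadvise = linker.downcallHandle(
                linker.defaultLookup().find("posix_madvise").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT,
                        ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT));

        try (Arena arena = Arena.ofConfined();
             FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            // Tell the kernel we will scan the mapping front to back (as
            // checksumming does), so it may read ahead aggressively.
            int rc = (int) posixMadvise.invokeExact(seg, ch.size(), POSIX_MADV_SEQUENTIAL);
            System.out.println("posix_madvise returned " + rc); // 0 on success
        } finally {
            Files.delete(tmp);
        }
    }
}
```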

For computing the checksum, which is always a sequential operation, using
MADV_RANDOM (which is stupid) is indeed expected to perform worse,
since there is no readahead pre-caching.  50% worse (what you are seeing)
is indeed quite an impact ...
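To see why the access pattern is inherently sequential, here is a minimal JDK-only sketch of how an entire file's checksum is computed in fixed-size chunks, front to back (using java.util.zip.CRC32, the same algorithm Lucene's index checksums are based on; the chunk size here is arbitrary):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class SequentialChecksum {
    // Reads the file strictly front to back in fixed-size chunks -- the CRC
    // state depends on every byte seen so far, so reads cannot be reordered.
    // This is exactly the pattern a sequential readahead hint helps and a
    // MADV_RANDOM-style hint hurts.
    static long checksum(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("crc", ".bin");
        Files.write(tmp, "123456789".getBytes());
        // 0xCBF43926 is the well-known CRC-32 check value for "123456789".
        System.out.printf("crc=%08X%n", checksum(tmp)); // prints crc=CBF43926
        Files.delete(tmp);
    }
}
```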

Maybe open an issue?  At least for checksumming we should open even .vec
files for sequential read?  But then, if it's the same IndexInput that
will later be used "normally" (e.g. for merging), we would want THAT one to
be open for random access ... might be tricky to fix.

One simple workaround an application can do is to ask MMapDirectory to
pre-touch all bytes/pages in .vec/.veq files -- this asks the OS to cache
all of those bytes into page cache (if there is enough free RAM).  We do
this at Amazon (product search) for our production searching processes.
Otherwise paging in all .vec/.veq pages via random access provoked through
HNSW graph searching is crazy slow...
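If I understand the pre-touch workaround correctly, it corresponds to MMapDirectory's preload feature. A hedged sketch of what wiring that up might look like (assuming Lucene 9.x's BiPredicate-based setPreload overload; the extension filter is my own choice, not something Lucene does by default):

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.store.MMapDirectory;

public class PreloadSketch {
    public static MMapDirectory openWithPreload(Path indexPath) throws IOException {
        MMapDirectory dir = new MMapDirectory(indexPath);
        // Ask the OS to touch every page of .vec/.veq files when they are
        // mapped, so the vectors are resident in page cache before HNSW
        // search starts (only effective if there is enough free RAM).
        dir.setPreload((fileName, context) ->
                fileName.endsWith(".vec") || fileName.endsWith(".veq"));
        return dir;
    }
}
```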

Mike McCandless

http://blog.mikemccandless.com

On Sun, Sep 29, 2024 at 4:06 AM Navneet Verma wrote:


Hi Lucene Experts,
I wanted to understand the performance difference between opening and
reading a whole file using an IndexInput with the RANDOM vs the READ
IOContext.

I can see that .vec files (storing the flat vectors) are opened with RANDOM,
whereas .dvd files are opened with READ. In my testing with files close to
5GB in size (~1.6M docs, each doc 3072 bytes), full-file checksum validation
is faster for a file opened via the READ context than via RANDOM. The time
difference I am seeing is close to 50%. Hence the performance question:
is this understanding correct?
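The comparison boils down to something like the following sketch (assuming Lucene 9.x APIs; the file name is illustrative, and CodecUtil.checksumEntireFile expects a file with a proper codec footer):

```java
import java.nio.file.Path;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;

public class ChecksumBench {
    // Times a full sequential checksum of one index file opened under the
    // given IOContext; the context determines the madvise hint on the mmap.
    static long timeChecksumMillis(Directory dir, String file, IOContext ctx)
            throws Exception {
        long start = System.nanoTime();
        try (IndexInput in = dir.openInput(file, ctx)) {
            CodecUtil.checksumEntireFile(in);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        try (Directory dir = new MMapDirectory(Path.of(args[0]))) {
            String file = args[1]; // e.g. a large .vec file in that directory
            System.out.println("READ:   " + timeChecksumMillis(dir, file, IOContext.READ) + " ms");
            System.out.println("RANDOM: " + timeChecksumMillis(dir, file, IOContext.RANDOM) + " ms");
        }
    }
}
```

(Dropping the page cache between runs, e.g. via /proc/sys/vm/drop_caches on Linux, is needed for a fair cold-cache comparison.)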

Thanks
Navneet


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Facet Count strategies and common errors

2024-09-30 Thread Marc Davenport
I've been looking at the way our code gets facet counts from Lucene to
see if there are some obvious inefficiencies.  We have about 60 normal flat
facets, some of which are multi-valued, and 5 or so hierarchical,
multi-valued facets. I'm seeing cases where the call to create a
FastTaxonomyFacetCounts takes 1+ seconds when matching on
800k documents.  This leads me to believe I've got some implementation
flaw.  Are there any common errors people make when implementing facets?
Known trouble spots that I should investigate?

Right now we retrieve the counts for the facets independently from the
retrieval of matching documents.  Each facet has its own runner which
calculates its current counts as well as a more relaxed query state that
shows its other values.  Different facets share a cached facet
collector if they have the same query state.  I know the "hold one out"
pattern isn't ideal.  I am looking at how we could use
drill-sideways queries, but I'm not sure I totally understand them.
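For comparison, our baseline pattern is roughly the standard collect-once, count-per-dimension approach (a sketch against the taxonomy-facets API; the "brand" dimension name is made up):

```java
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class FacetCountSketch {
    // One collection pass over the matches, then one counting pass; sharing
    // the single FacetsCollector across all dimensions avoids re-running the
    // query per facet.
    static FacetResult topBrands(IndexSearcher searcher, TaxonomyReader taxoReader,
                                 FacetsConfig config, Query query) throws Exception {
        FacetsCollector fc = new FacetsCollector();
        FacetsCollector.search(searcher, query, 10, fc);
        Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
        return facets.getTopChildren(10, "brand"); // hypothetical dimension
    }
}
```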

The FastTaxonomyFacetCounts creation speed scales with the number and
cardinality of the facets on the documents. We have pruned off facets that
are no longer needed.  Would it make sense to start maintaining more than
one taxonomy index?

I've been looking for good books or resources to read about Lucene.  I
have the original Lucene in Action, which has been helpful in some ways,
but it covers only v3. Many newer concepts are sort of left to the Javadoc,
or to reading through the PRs.  Any suggestions on things to read to better
understand Lucene and its proper use?

Thank you,
Marc


Re: Performance Difference between files getting opened with IoContext.RANDOM vs IoContext.READ

2024-09-30 Thread Navneet Verma
Hi Uwe and Mike,
Thanks for providing such a quick response. Let me try to answer a few
things here:

*In addition, in Lucene 9.12 (latest 9.x) version released today there are
some changes to ensure that checksumming is always done with
IOContext.READONCE (which uses READ behind the scenes).*
I didn't find any such change for the flat vector readers, even though I
checked BufferedChecksumIndexInput, ChecksumIndexInput, and CodecUtil in
the 9.12 version. Please point me to the right file if I am missing
something here. I can see the same for Lucene version 10 too.

Mike, on the question of what the RANDOM vs READ context is doing, we found
this information about the madvise flags online:

MADV_RANDOM: Expect page references in random order. (Hence, read ahead may
be less useful than normally.)
MADV_SEQUENTIAL: Expect page references in sequential order. (Hence, pages
in the given range can be aggressively read ahead, and may be freed soon
after they are accessed.)
MADV_WILLNEED: Expect access in the near future. (Hence, it might be a good
idea to read some pages ahead.)

This tells me that MADV_RANDOM for checksumming is not good, as it will
consume more read cycles given the sequential nature of checksumming.

*One simple workaround an application can do is to ask MMapDirectory to
pre-touch all bytes/pages in .vec/.veq files -- this asks the OS to cache
all of those bytes into page cache (if there is enough free RAM).  We do
this at Amazon (product search) for our production searching processes.
Otherwise paging in all .vec/.veq pages via random access provoked through
HNSW graph searching is crazy slow...*
Did you mean the preload functionality offered by MMapDirectory here? I can
try it to see if that helps, but I doubt it will in this case.

On opening an issue: I am working through some reproducible benchmarks
before creating a GitHub issue. If you believe I should create the issue
first, I can do that, as it might take me some time to build reproducible
benchmarks.

Thanks
Navneet

