[
https://issues.apache.org/jira/browse/SOLR-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259787#comment-14259787
]
David Smiley commented on SOLR-6888:
------------------------------------
Erick,
I enjoyed reading what you have here. I think this issue duplicates SOLR-5478,
for which, in fact, I have a patch. I encourage you to review it and kick the
tires on it!
> Decompressing documents on first-pass distributed queries to get docId is
> inefficient, use indexed values instead?
> ------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-6888
> URL: https://issues.apache.org/jira/browse/SOLR-6888
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 5.0, Trunk
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Attachments: SOLR-6888-hacktiming.patch
>
>
> Assigning this to myself to just not lose track of it, but I won't be working
> on this in the near term; anyone feeling ambitious should feel free to grab
> it.
> Note, docId used here is whatever is defined for <uniqueKey>...
> Since Solr 4.1, stored-field compression has been based on 16K blocks; it is
> automatic and not configurable. So, to get a single stored value, one must
> decompress at least one entire 16K block.
> For SolrCloud (and distributed processing in general), we make two trips, one
> to get the doc id and score (or other sort criteria) and one to return the
> actual data.
> The first pass here requires that we return the top N docIDs and sort
> criteria, which means that each and every sub-request has to unpack at least
> one 16K block (and sometimes more) to get just the doc ID. So if we have 20
> shards and only want 20 rows, 95% of the decompression cycles will be wasted.
> Not to mention all the disk reads.
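The 95% figure above can be sanity-checked with a back-of-envelope calculation. This is only a sketch: it assumes each shard decompresses at least one block per candidate doc in its local top-N, and the class and method names are illustrative, not from Solr.

```java
// Back-of-envelope estimate of wasted decompression in the first
// distributed pass. Assumption: each shard decompresses at least one 16K
// block per candidate doc in its local top-N, but only `rows` docs
// survive the final merge across all shards.
public class DecompressionWaste {

    // Percentage of per-doc decompressions whose results are discarded.
    static int wastedPercent(int shards, int rows) {
        int decompressed = shards * rows; // each shard unpacks blocks for its top `rows` candidates
        int used = rows;                  // only `rows` docs make the merged result
        return 100 - (100 * used) / decompressed;
    }

    public static void main(String[] args) {
        // 20 shards, rows=20: 400 decompressions for 20 returned docs.
        System.out.println(wastedPercent(20, 20) + "% wasted");
    }
}
```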
> It seems like we should be able to do better than that. Can we argue that doc
> ids are 'special' and should be cached somehow? Let's discuss what this would
> look like. I can think of a couple of approaches:
> 1> Since doc IDs are "special", can we say that for this purpose returning
> the indexed version is OK? We'd need to return the actual stored value when
> the full doc was requested, but for the sub-request alone, what about
> returning the indexed value instead of the stored one? On the surface I
> don't see a problem here, but what do I know? Storing these as DocValues
> seems useful in this case.
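For what it's worth, the DocValues idea in <1> amounts to enabling docValues on the uniqueKey field so the first-pass ID can be read column-wise instead of from the compressed stored-field blocks. A minimal schema.xml sketch (field name and type are illustrative, not taken from this issue):

```xml
<!-- Illustrative schema.xml fragment: uniqueKey field with docValues
     enabled, so the ID can be fetched without decompressing a 16K
     stored-field block. Stored value is kept for full-doc retrieval. -->
<field name="id" type="string" indexed="true" stored="true" docValues="true"/>
<uniqueKey>id</uniqueKey>
```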
> 1a> A variant is treating numeric docIds specially, since there the indexed
> value and the stored value should be identical; DocValues would seem useful
> here too. But this is an unnecessary specialization if <1> is implemented
> well.
> 2> We could cache individual doc IDs, although I'm not sure how useful that
> really is. Would maintaining the cache overwhelm the savings of not
> decompressing? I really don't like this idea, but am throwing it out there.
> Populating the cache from stored data up front would essentially mean
> decompressing every doc, so that seems untenable.
> 3> We could maintain an array[maxDoc] that held document IDs, perhaps lazily
> initializing it. I'm not particularly a fan of this either; it doesn't seem
> like a Good Thing. I can see lazy loading being almost, but not quite
> totally, useless, i.e. a hit ratio near 0, especially since it'd be thrown
> out on every openSearcher.
> Really, the only one of these that seems viable is <1>/<1a>. The others would
> all involve decompressing the docs anyway to get the ID, and I suspect that
> caching would be of very limited usefulness. I guess <1>'s viability hinges
> on whether, for internal use, the indexed form of DocId is interchangeable
> with the stored value.
> Or are there other ways to approach this? Or isn't it something to really
> worry about?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]