[
https://issues.apache.org/jira/browse/SOLR-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259787#comment-14259787
]
David Smiley commented on SOLR-6888:
------------------------------------
Erick,
I enjoyed reading what you have here. I think this issue duplicates SOLR-5478,
for which, in fact, I have a patch. I encourage you to review it and kick the
tires on it!
> Decompressing documents on first-pass distributed queries to get docId is
> inefficient, use indexed values instead?
> ------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-6888
> URL: https://issues.apache.org/jira/browse/SOLR-6888
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 5.0, Trunk
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Attachments: SOLR-6888-hacktiming.patch
>
>
> Assigning this to myself to just not lose track of it, but I won't be working
> on this in the near term; anyone feeling ambitious should feel free to grab
> it.
> Note, docId used here is whatever is defined for <uniqueKey>...
> Since Solr 4.1, stored-field compression has been based on 16K blocks; it is
> automatic and not configurable. So, to get a single stored value, one must
> decompress at least one entire 16K block.
> For SolrCloud (and distributed processing in general), we make two trips, one
> to get the doc id and score (or other sort criteria) and one to return the
> actual data.
> The first pass here requires that we return the top N docIDs and sort
> criteria, which means that each and every sub-request has to unpack at least
> one 16K block (and sometimes more) to get just the doc ID. So if we have 20
> shards and only want 20 rows, 95% of the decompression cycles will be wasted.
> Not to mention all the disk reads.
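The 95% figure above can be sanity-checked with a back-of-envelope calculation. This is only a sketch: it assumes each shard decompresses at least one block per candidate doc in its local top-N, and the class and method names are illustrative, not from Solr.

```java
// Back-of-envelope estimate of wasted decompression in the first
// distributed pass. Assumption: each shard decompresses at least one 16K
// block per candidate doc in its local top-N, but only `rows` docs
// survive the final merge across all shards.
public class DecompressionWaste {

    // Percentage of per-doc decompressions whose results are discarded.
    static int wastedPercent(int shards, int rows) {
        int decompressed = shards * rows; // each shard unpacks blocks for its top `rows` candidates
        int used = rows;                  // only `rows` docs make the merged result
        return 100 - (100 * used) / decompressed;
    }

    public static void main(String[] args) {
        // 20 shards, rows=20: 400 decompressions for 20 returned docs.
        System.out.println(wastedPercent(20, 20) + "% wasted");
    }
}
```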
> It seems like we should be able to do better than that. Can we argue that doc
> ids are 'special' and should be cached somehow? Let's discuss what this would
> look like. I can think of a couple of approaches:
> 1> Since doc IDs are "special", can we say that for this purpose returning
> the indexed version is OK? We'd need to return the actual stored value when
> the full doc was requested, but for the sub-request alone, what about
> returning the indexed value instead of the stored one? On the surface I
> don't see a problem here, but what do I know? Storing these as DocValues
> seems useful in this case.
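For what it's worth, the DocValues idea in <1> amounts to enabling docValues on the uniqueKey field so the first-pass ID can be read column-wise instead of from the compressed stored-field blocks. A minimal schema.xml sketch (field name and type are illustrative, not taken from this issue):

```xml
<!-- Illustrative schema.xml fragment: uniqueKey field with docValues
     enabled, so the ID can be fetched without decompressing a 16K
     stored-field block. Stored value is kept for full-doc retrieval. -->
<field name="id" type="string" indexed="true" stored="true" docValues="true"/>
<uniqueKey>id</uniqueKey>
```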
> 1a> A variant is treating numeric docIds specially, since there the indexed
> value and the stored value should be identical; DocValues would seem useful
> here too. But this is an unnecessary specialization if <1> is implemented
> well.
> 2> We could cache individual doc IDs, although I'm not sure how useful that
> really is. Would maintaining the cache overwhelm the savings of not
> decompressing? I really don't like this idea, but am throwing it out there.
> Populating the cache from stored data up front would essentially mean
> decompressing every doc, so that seems untenable.
> 3> We could maintain an array[maxDoc] that held document IDs, perhaps lazily
> initializing it. I'm not particularly a fan of this either; it doesn't seem
> like a Good Thing. I can see lazy loading being almost, but not quite
> totally, useless, i.e. a hit ratio near 0, especially since it'd be thrown
> out on every openSearcher.
> Really, the only one of these that seems viable is <1>/<1a>. The others would
> all involve decompressing the docs anyway to get the ID, and I suspect that
> caching would be of very limited usefulness. I guess <1>'s viability hinges
> on whether, for internal use, the indexed form of DocId is interchangeable
> with the stored value.
> Or are there other ways to approach this? Or isn't it something to really
> worry about?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]