[
https://issues.apache.org/jira/browse/SOLR-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856335#comment-13856335
]
Joel Bernstein commented on SOLR-5244:
--------------------------------------
More testing of this feature shows the real challenge will be the performance
of exporting string fields. Right now the docId->BytesRef lookup is way too
slow to be interesting on a large scale, even with in-memory docValues. This
must be due to the compression on the docValues.
To get this working we'll need to have faster memory caches in place. I think
we can build segment-level caches at commit time by caching the top X terms in
a particular field based on docFrequency. The cache would be a read-only
ord-to-BytesRef map (hppc IntObjectOpenHashMap), which should be able to
perform in the neighborhood of 10 million lookups per second. The in-memory
docId->BytesRef lookup performs at less than 1 million records per second.
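A minimal sketch of the cache idea described above, to make the shape of it
concrete. Everything here is an assumption for illustration: the class and
method names (TopTermsCache, build, lookup) are hypothetical, the terms and
docFrequency values are passed in as plain arrays rather than read from a
segment, and java.util.HashMap with byte[] values stands in for the hppc
IntObjectOpenHashMap and BytesRef so the example is self-contained. It says
nothing about the lookup rates mentioned above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical read-only ord -> term bytes cache for a single segment. */
public class TopTermsCache {
    private final Map<Integer, byte[]> cache;

    private TopTermsCache(Map<Integer, byte[]> cache) {
        this.cache = cache;
    }

    /**
     * Builds the cache once (e.g. at commit time) from per-ord statistics.
     * terms[ord] holds the term bytes and docFreqs[ord] its docFrequency;
     * only the topX most frequent ords are kept.
     */
    public static TopTermsCache build(byte[][] terms, int[] docFreqs, int topX) {
        Integer[] ords = new Integer[terms.length];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = i;
        }
        // Order ords by descending docFrequency.
        Arrays.sort(ords, (a, b) -> Integer.compare(docFreqs[b], docFreqs[a]));

        int n = Math.max(0, Math.min(topX, ords.length));
        Map<Integer, byte[]> m = new HashMap<>(n * 2);
        for (int i = 0; i < n; i++) {
            m.put(ords[i], terms[ords[i]]);
        }
        return new TopTermsCache(m);
    }

    /** Returns the cached term bytes, or null if the ord is not cached. */
    public byte[] lookup(int ord) {
        return cache.get(ord);
    }

    public int size() {
        return cache.size();
    }
}
```

A null from lookup would mean the caller falls back to the slower docValues
lookup for that ord.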
I think if we also move to a threaded approach we'll be able to increase
throughput.
I'm shooting to achieve an export rate of 5+ million small records per second
from a single server. This would scale linearly with the number of servers, so
a cluster of 100 servers could export 500+ million small records per second.
> Full Search Result Export
> -------------------------
>
> Key: SOLR-5244
> URL: https://issues.apache.org/jira/browse/SOLR-5244
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 5.0
> Reporter: Joel Bernstein
> Priority: Minor
> Fix For: 5.0, 4.7
>
> Attachments: SOLR-5244.patch
>
>
> It would be great if Solr could efficiently export entire search result sets
> without scoring or ranking documents. This would allow external systems to
> perform rapid bulk imports from Solr. It also provides a possible platform
> for exporting results to support distributed join scenarios within Solr.
> This ticket provides a patch that has two pluggable components:
> 1) ExportQParserPlugin: which is a post filter that gathers a BitSet with
> document results and does not delegate to ranking collectors. Instead it puts
> the BitSet on the request context.
> 2) BinaryExportWriter: an output writer that iterates the BitSet and prints
> the entire result as a binary stream. A header is provided at the beginning
> of the stream so external clients can self configure.
> Note:
> These two components will be sufficient for a non-distributed environment.
> For distributed export a new Request handler will need to be developed.
> After applying the patch and building the dist or example, you can register
> the components through the following changes to solrconfig.xml
> Register export contrib libraries:
> <lib dir="../../../dist/" regex="solr-export-\d.*\.jar" />
>
> Register the "export" queryParser with the following line:
>
> <queryParser name="export"
> class="org.apache.solr.export.ExportQParserPlugin"/>
>
> Register the "xbin" writer:
>
> <queryResponseWriter name="xbin"
> class="org.apache.solr.export.BinaryExportWriter"/>
>
> The following query will perform the export:
> {code}
> http://localhost:8983/solr/collection1/select?q=*:*&fq={!export}&wt=xbin&fl=join_i
> {code}
> Initial patch supports export of four data-types:
> 1) Single value trie int, long and float
> 2) Binary doc values.
> The numerics are currently exported from the FieldCache and the Binary doc
> values can be in memory or on disk.
> Since this is designed to export very large result sets efficiently, stored
> fields are not used for the export.