[
https://issues.apache.org/jira/browse/SOLR-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Bernstein resolved SOLR-9240.
----------------------------------
Resolution: Resolved
> Support parallel ETL with the topic expression
> ----------------------------------------------
>
> Key: SOLR-9240
> URL: https://issues.apache.org/jira/browse/SOLR-9240
> Project: Solr
> Issue Type: Improvement
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
> Fix For: 6.2
>
> Attachments: SOLR-9240.patch, SOLR-9240.patch
>
>
> It would be useful for SolrCloud to support large-scale *Extract, Transform
> and Load* workloads with streaming expressions. Instead of using MapReduce
> for ETL, the topic expression can be used, which allows SolrCloud to be
> treated like a distributed message queue filled with data to be processed.
> The topic expression works in batches and supports retrieval of stored
> fields, so large-scale *text ETL* should work well with this approach.
> This ticket makes two small changes to the topic() expression that make this
> possible:
> 1) Changes the topic expression so it can operate in parallel.
> 2) Adds the initialCheckpoint parameter to the topic expression so a topic
> can start pulling records from anywhere in the queue.
> Daemons can be sent to worker nodes that each work on processing a partition
> of the data from the same topic. The daemon() function's natural behavior is
> perfect for iteratively calling a topic until all records in the topic have
> been processed.
> The sample code below pulls all records from one collection and indexes them
> into another collection. A Transform function could be wrapped around the
> topic() to transform the records before loading. Custom functions can also be
> built to load the data in parallel to any outside system.
> {code}
> parallel(
>     workerCollection,
>     workers="2",
>     sort="_version_ desc",
>     daemon(
>         update(
>             updateCollection,
>             batchSize=200,
>             topic(
>                 checkpointCollection,
>                 topicCollection,
>                 q=*:*,
>                 id="topic1",
>                 fl="id, to , from, body",
>                 partitionKeys="id",
>                 initialCheckpoint="0")),
>         runInterval="1000",
>         id="daemon1"))
> {code}
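
The flow described in the issue (partition a topic across workers, start from an initial checkpoint, and call the topic repeatedly until it is drained) can be illustrated with a small Python simulation. This is only a conceptual sketch, not Solr code: the names Topic, run_daemon and parallel_etl are hypothetical, the CRC32-modulo partitioning rule is an assumption standing in for whatever hashing Solr applies to partitionKeys, and a single integer checkpoint is a simplification of how a real topic tracks its position. The workers also run one after another here, and the loop stops once the topic is empty, whereas a real daemon() keeps running at each runInterval.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Topic:
    """Toy stand-in for one worker's view of a topic (not Solr's implementation)."""
    records: list        # dicts with "id" and "_version_" keys
    worker_id: int
    num_workers: int
    checkpoint: int = 0  # plays the role of initialCheckpoint

    def owns(self, record):
        # Assumed partitioning rule: hash the partition key, modulo worker count.
        return zlib.crc32(record["id"].encode()) % self.num_workers == self.worker_id

    def next_batch(self, batch_size):
        # This worker's records newer than the checkpoint, oldest first.
        pending = sorted(
            (r for r in self.records
             if r["_version_"] > self.checkpoint and self.owns(r)),
            key=lambda r: r["_version_"],
        )
        batch = pending[:batch_size]
        if batch:
            # Advance the checkpoint so the next call resumes after this batch.
            self.checkpoint = batch[-1]["_version_"]
        return batch


def run_daemon(topic, sink, batch_size=200):
    """Call the topic repeatedly until a call returns no records."""
    while True:
        batch = topic.next_batch(batch_size)
        if not batch:
            return
        sink.extend(batch)  # stands in for update(updateCollection, ...)


def parallel_etl(records, num_workers=2, initial_checkpoint=0, batch_size=200):
    """Run one simulated daemon per worker over the same topic."""
    sink = []
    for worker_id in range(num_workers):
        topic = Topic(records, worker_id, num_workers, initial_checkpoint)
        run_daemon(topic, sink, batch_size)
    return sink


if __name__ == "__main__":
    docs = [{"id": f"doc{i}", "_version_": i + 1} for i in range(10)]
    loaded = parallel_etl(docs, num_workers=2, batch_size=3)
    print(len(loaded), "records loaded")
```

Because each record's key maps to exactly one worker, every record is loaded once no matter how many workers are used, and raising initial_checkpoint skips the records at or below that version.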