On Tue, Aug 24, 2010 at 10:42 AM, Jonathan Moore <jonat...@slando.com> wrote:

> Hello there,
>
> I am new to Riak, but we are thinking of migrating some of our data from
> MySQL into it and running it for part of our website.
>
> Temporarily we would need to keep the data in sync whilst we make other
> changes, so for some time we would be using Riak in parallel and
> synchronising the data. There are two processes we need to create:
>
> 1) full data import
> 2) synchronising changes to the data
>
> We use Solr, which has a very usable DataImportHandler for getting many
> millions of MySQL rows indexed; we also use it for delta imports based on
> lists of unique IDs. Is there any similar technique for Riak? We have 16
> million documents and counting, so we would rather not open a socket and
> push over HTTP. Currently the data importer selects and indexes them in
> about two hours, which, as we don't do this often, we can live with.
> Incremental synchronisation would involve much smaller sets of documents
> (<1000 per 10 min), so I am less worried there.
>
> I have seen the PBC API, which looks promising, but I'd still need to
> fetch the rows and push. Does the node you connect to handle the
> consistent hashing in this case? Are there any benchmarks for this?
>
> Is there anything else out there for migrating this amount of data?
>
>

We specialize in enterprise search, where this sort of data
integration/consolidation is common practice. Given Riak's distributed
architecture, you should be able to achieve excellent write capacity.

I would definitely use the Protocol Buffers interface if you're concerned
about performance. You can write a simple connector that iterates over the
rows in the database and uses the PBC API to publish to Riak. If you need
more throughput, try adding more threads in your client.
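As a rough, untested sketch of what such a connector could look like
(assuming the official Python client for Riak and mysql-connector-python;
the table, bucket, and column names are all illustrative):

    # Minimal MySQL -> Riak connector over Protocol Buffers.
    # Assumes: pip install riak mysql-connector-python
    from concurrent.futures import ThreadPoolExecutor

    import mysql.connector
    import riak

    # Connect over PB (default port 8087) rather than HTTP.
    client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
    bucket = client.bucket('documents')

    def store_row(row):
        # Reuse the MySQL primary key as the Riak key so the import is
        # idempotent and can safely be re-run.
        obj = bucket.new(str(row['id']), data=row)
        obj.store()

    db = mysql.connector.connect(user='riak_import', database='site')
    cursor = db.cursor(dictionary=True)
    cursor.execute("SELECT id, title, body FROM documents")

    # More worker threads means more concurrent PB requests and higher
    # throughput, up to whatever the cluster can absorb.
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(store_row, cursor)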

We typically do a lot of data transformation from source to target (entity
recognition, classification, normalization, etc.), so our bottleneck is
usually the CPU-bound transformation pipeline. That said, we typically
write to a cluster of transformation nodes in order to distribute the work
and maintain write throughput to the target system.

We designed an asynchronous, event-driven data integration tool called
pypes <http://www.pypes.org>, which is open source. We have a Riak
publisher component that leverages the PB interface. I haven't done any
official benchmarking, but it's noticeably faster than using HTTP. At the
moment we're doing some prototype work with Riak under an NDA, so I can't
provide much detail.

On the initial bootstrap, be sure to tune your write quorum: lowering the
number of replicas that must acknowledge each write cuts per-request
latency, at the cost of weaker durability guarantees, which is usually
acceptable while MySQL is still your system of record.
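For example, with the same hypothetical Python client setup as above (the
quorum values here are illustrative, not a recommendation):

    # Relax the quorums for the one-time bulk load only; go back to the
    # defaults before serving live traffic.
    obj = bucket.new('doc-1', data={'title': 'example'})
    obj.store(w=1,   # ack after a single replica accepts the write
              dw=0)  # don't wait for any durable (on-disk) writes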

Regards,
-Eric