Hi Marcelo,

I have updated the prerelease app in this gist:

https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47

I found that it was too easy to overrun my Cassandra clusters, so I added a
throttle arg, which by default is 1000 rows per second. Fixed a few bugs too,
reworked the args, etc.

I'll be interested to hear if you find it useful and/or have any comments.
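The throttle is just a simple rate limiter around row submission. A minimal
sketch of the idea (not the exact code in the gist; the names here are
illustrative):

    import time

    class Throttle(object):
        """Cap submissions to `rate` rows per second (sketch only)."""

        def __init__(self, rate=1000):
            self.rate = float(rate)
            self.allowance = self.rate  # allow up to one second of burst
            self.last_check = time.time()

        def wait(self, rows=1):
            # Token-bucket style: accrue allowance for the elapsed time,
            # then sleep if this batch of rows would exceed it.
            now = time.time()
            elapsed = now - self.last_check
            self.last_check = now
            self.allowance = min(self.rate, self.allowance + elapsed * self.rate)
            if self.allowance < rows:
                time.sleep((rows - self.allowance) / self.rate)
                self.allowance = 0.0
            else:
                self.allowance -= rows

Calling wait() before each row submission keeps the aggregate rate near the
cap without any coordination between callbacks.

ml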
On Thu, Jun 5, 2014 at 1:09 PM, Marcelo Elias Del Valle <
marc...@s1mbi0se.com.br> wrote:

> Michael,
>
> I will try to test it by tomorrow and I will let you know all the
> results.
>
> Thanks a lot!
>
> Best regards,
> Marcelo.
>
>
> 2014-06-04 22:28 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>
>> BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
>>
>>
>> On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael <
>> michael.la...@nytimes.com> wrote:
>>
>>> Marcelo,
>>>
>>> Here is a link to the preview of the python fast copy program:
>>>
>>> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>>>
>>> It will copy a table from one cluster to another, with some
>>> transformation; the source and destination can be the same cluster.
>>>
>>> It has 3 main throttles to experiment with:
>>>
>>> 1. fetch_size: size of source pages in rows
>>> 2. worker_count: number of worker subprocesses
>>> 3. concurrency: number of async callback chains per worker subprocess
>>>
>>> It is easy to overrun Cassandra and the python driver, so I recommend
>>> starting with the defaults: fetch_size: 1000; worker_count: 2;
>>> concurrency: 10.
>>>
>>> Additionally, there are switches to set 'policies' by source and
>>> destination: retry (downgrade consistency), dc_aware, and token_aware.
>>> retry is useful if you are getting timeouts. For the others, YMMV.
>>>
>>> To use it you need to define the SELECT and UPDATE cql statements as
>>> well as the 'map_fields' method.
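>>>
>>> For your entitylookup -> entity_lookup case they might look roughly
>>> like this (illustrative only; check the gist for the exact names and
>>> the map_fields signature). One wrinkle: every column in entity_lookup
>>> is part of the primary key, so there is nothing for an UPDATE to SET;
>>> the equivalent upsert in CQL is an INSERT:
>>>
>>> select_cql = """
>>>     SELECT name, value, entity_id
>>>     FROM src_keyspace.entitylookup"""
>>>
>>> update_cql = """
>>>     INSERT INTO dest_keyspace.entity_lookup (name, value, entity_id)
>>>     VALUES (?, ?, ?)"""
>>>
>>> def map_fields(row):
>>>     # No transformation needed in this case: same columns, just a
>>>     # different primary key layout on the destination table.
>>>     return (row.name, row.value, row.entity_id)
>>>
>>> map_fields is also where you would do any real transformation of a row.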
>>>
>>> The worker subprocesses divide up the token range among themselves and
>>> proceed quasi-independently. Each worker opens a connection to each
>>> cluster, and the driver sets up connection pools to the nodes in the
>>> cluster. Anyway, there are a lot of processes, threads, and callbacks
>>> going at once, so it is fun to watch.
>>>
>>> On my regional cluster of small nodes in AWS I got about 3000 rows per
>>> second transferred after things warmed up a bit, each row about 6kb.
>>>
>>> ml
>>>
>>>
>>> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <
>>> michael.la...@nytimes.com> wrote:
>>>
>>>> OK Marcelo, I'll work on it today. -ml
>>>>
>>>>
>>>> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <
>>>> marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> For sure I would be interested in this program!
>>>>>
>>>>> I am new both to python and to cql. I started creating this copier,
>>>>> but was having problems with timeouts. Alex solved my problem here on
>>>>> the list, but I think I will still have a lot of trouble making the
>>>>> copy work well.
>>>>>
>>>>> I open sourced my version here:
>>>>> https://github.com/s1mbi0se/cql_record_processor
>>>>>
>>>>> Just in case it's useful for anything.
>>>>>
>>>>> However, I saw CQL has support for concurrency itself, and having
>>>>> something made by someone who knows the Python CQL driver better
>>>>> would be very helpful.
>>>>>
>>>>> My two servers today are at OVH (ovh.com); we have servers at AWS
>>>>> too, but in several cases we prefer other hosts. Both servers have
>>>>> SSD and 64 Gb of RAM. I could use the script as a benchmark for you
>>>>> if you want. Besides, we have some bigger clusters; I could run it on
>>>>> those just to test the speed, if that would help.
>>>>>
>>>>> Regards,
>>>>> Marcelo.
>>>>>
>>>>>
>>>>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>>>>>
>>>>>> Hi Marcelo,
>>>>>>
>>>>>> I could create a fast copy program by repurposing some python apps
>>>>>> that I am using for benchmarking the python driver. Do you still
>>>>>> need this?
>>>>>>
>>>>>> With high levels of concurrency and multiple subprocess workers,
>>>>>> based on my current actual benchmarks, I think I can get well over
>>>>>> 1,000 rows/second on my mac and significantly more in AWS. I'm using
>>>>>> variable size rows averaging 5kb.
>>>>>>
>>>>>> This would be the initial version of a piece of the benchmark suite
>>>>>> we will release as part of our nyt⨍aбrik project on 21 June for my
>>>>>> Cassandra Day NYC talk re the python driver.
>>>>>>
>>>>>> ml
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <
>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> Thanks for trying to help.
>>>>>>>
>>>>>>> Indeed, I know I can't do it using just CQL. But what would you use
>>>>>>> to migrate the data manually? I tried to create a python program
>>>>>>> using auto paging, but I am getting timeouts. I also tried Hive,
>>>>>>> but had no success.
>>>>>>> I only have two nodes and less than 200Gb in this cluster; any
>>>>>>> simple way to extract the data quickly would be good enough for me.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Marcelo.
>>>>>>>
>>>>>>>
>>>>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <jens.ran...@tink.se>:
>>>>>>>
>>>>>>>> Hi Marcelo,
>>>>>>>>
>>>>>>>> Looks like you can't do this without migrating your data manually:
>>>>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jens
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <
>>>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have some cql CFs in a 2-node Cassandra 2.0.8 cluster.
>>>>>>>>>
>>>>>>>>> I realized I created my column family with the wrong partition
>>>>>>>>> key. Instead of:
>>>>>>>>>
>>>>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>>     name varchar,
>>>>>>>>>     value varchar,
>>>>>>>>>     entity_id uuid,
>>>>>>>>>     PRIMARY KEY ((name, value), entity_id))
>>>>>>>>> WITH caching='all';
>>>>>>>>>
>>>>>>>>> I used:
>>>>>>>>>
>>>>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>>>>     name varchar,
>>>>>>>>>     value varchar,
>>>>>>>>>     entity_id uuid,
>>>>>>>>>     PRIMARY KEY (name, value, entity_id))
>>>>>>>>> WITH caching='all';
>>>>>>>>>
>>>>>>>>> Now I need to migrate the data from the second CF to the first
>>>>>>>>> one. I am using DataStax Community Edition.
>>>>>>>>>
>>>>>>>>> What would be the best way to convert the data from one CF to the
>>>>>>>>> other?
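>>>>>>>>>
>>>>>>>>> The obvious approach seems to be to page through the old table
>>>>>>>>> and rewrite every row into the new one, along the lines of this
>>>>>>>>> sketch with the python driver (untested; the node address and
>>>>>>>>> keyspace name are placeholders), but I am not sure it will behave
>>>>>>>>> well at this data size:
>>>>>>>>>
>>>>>>>>> from cassandra.cluster import Cluster
>>>>>>>>> from cassandra.query import SimpleStatement
>>>>>>>>>
>>>>>>>>> cluster = Cluster(['127.0.0.1'])
>>>>>>>>> session = cluster.connect('my_keyspace')
>>>>>>>>>
>>>>>>>>> insert = session.prepare(
>>>>>>>>>     "INSERT INTO entity_lookup (name, value, entity_id) "
>>>>>>>>>     "VALUES (?, ?, ?)")
>>>>>>>>>
>>>>>>>>> # Small pages keep the coordinator from timing out on a full scan.
>>>>>>>>> select = SimpleStatement(
>>>>>>>>>     "SELECT name, value, entity_id FROM entitylookup",
>>>>>>>>>     fetch_size=1000)
>>>>>>>>>
>>>>>>>>> for row in session.execute(select):  # driver pages transparently
>>>>>>>>>     session.execute(insert, (row.name, row.value, row.entity_id))
>>>>>>>>>
>>>>>>>>> cluster.shutdown()
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Marcelo.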