Michael,

I will try to test it by tomorrow and will let you know the
results.

Thanks a lot!

Best regards,
Marcelo.


2014-06-04 22:28 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:

> BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
>
>
> On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael <michael.la...@nytimes.com>
> wrote:
>
>> Marcelo,
>>
>> Here is a link to the preview of the python fast copy program:
>>
>> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>>
>> It will copy a table from one cluster to another with some
>> transformation. The source and destination can be the same cluster.
>>
>> It has 3 main throttles to experiment with:
>>
>>    1. fetch_size: size of source pages in rows
>>    2. worker_count: number of worker subprocesses
>>    3. concurrency: number of async callback chains per worker subprocess
>>
>> It is easy to overrun Cassandra and the python driver, so I recommend
>> starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency:
>> 10.
>>
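To make the three throttles concrete, here is a minimal sketch of how they might be exposed as command-line options with the defaults quoted above. The flag names and the `build_parser` and `total_callback_chains` helpers are hypothetical and are not taken from the gist; the only facts used are the three parameter names and their defaults. Since concurrency is per worker, the number of callback chains running at once is worker_count times concurrency.

```python
import argparse


def build_parser():
    """Build a parser for the three throttles (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="table copy throttles (sketch)")
    parser.add_argument("--fetch-size", type=int, default=1000,
                        help="size of source pages in rows")
    parser.add_argument("--worker-count", type=int, default=2,
                        help="number of worker subprocesses")
    parser.add_argument("--concurrency", type=int, default=10,
                        help="async callback chains per worker subprocess")
    return parser


def total_callback_chains(args):
    """Callback chains in flight across all workers."""
    return args.worker_count * args.concurrency


# Parsing an empty argument list yields the defaults quoted above.
defaults = build_parser().parse_args([])
```

With the defaults that is 2 x 10 = 20 chains in total, which is a useful number to keep in mind when raising either throttle.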
>> Additionally there are switches to set 'policies' by source and
>> destination: retry (downgrade consistency), dc_aware, and token_aware.
>> retry is useful if you are getting timeouts. For the others YMMV.
>>
>> To use it you need to define the SELECT and UPDATE cql statements as well
>> as the 'map_fields' method.
>>
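For the two tables in this thread, those three pieces might look like the sketch below. The `map_fields` signature (one source row in, one tuple of bind values out) is an assumption, and `SourceRow` is only a stand-in for a driver row; check the gist for the real interface. One more assumption: every column of entity_lookup is part of its primary key, so there is nothing for an UPDATE to SET, and this sketch writes with an INSERT instead.

```python
from collections import namedtuple

# Read from the table with the wrong partition key ...
SELECT_CQL = "SELECT name, value, entity_id FROM entitylookup"

# ... and write to the table with the intended (name, value) partition key.
# '?' placeholders are the bind markers used by prepared statements.
INSERT_CQL = (
    "INSERT INTO entity_lookup (name, value, entity_id) VALUES (?, ?, ?)"
)

# Hypothetical stand-in for a row returned by the driver.
SourceRow = namedtuple("SourceRow", ["name", "value", "entity_id"])


def map_fields(source_row):
    """Map one source row to the bind values of INSERT_CQL, in order.

    The columns are identical in both tables, so this is a straight copy;
    any per-row transformation would go here.
    """
    return (source_row.name, source_row.value, source_row.entity_id)
```

Keeping the transformation in one small function means the copy loop itself never needs to know the schema.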
>> The worker subprocesses divide up the token range among themselves and
>> proceed quasi-independently. Each worker opens a connection to each cluster
>> and the driver sets up connection pools to the nodes in the cluster. Anyway
>> there are a lot of processes, threads, callbacks going at once so it is fun
>> to watch.
>>
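The token range split can be pictured as follows. This is a minimal sketch, not the code from the gist: it assumes the Murmur3Partitioner (tokens from -2^63 to 2^63 - 1), and `split_token_range` and `RANGE_SELECT_CQL` are hypothetical names. Each worker would bind its own (start, end) pair to a token-restricted SELECT and page through only that slice of the ring.

```python
# Token bounds for the Murmur3Partitioner (assumed; not stated in the thread).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1

# Hypothetical per-worker query: 'name' is the partition key of entitylookup.
RANGE_SELECT_CQL = (
    "SELECT name, value, entity_id FROM entitylookup "
    "WHERE token(name) >= ? AND token(name) <= ?"
)


def split_token_range(worker_count, min_token=MIN_TOKEN, max_token=MAX_TOKEN):
    """Split [min_token, max_token] into worker_count contiguous,
    non-overlapping, inclusive (start, end) ranges of near-equal size."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    total = max_token - min_token + 1
    base, extra = divmod(total, worker_count)
    ranges = []
    start = min_token
    for i in range(worker_count):
        # The first 'extra' ranges absorb the remainder, one token each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

With the default worker_count of 2, one worker takes the negative tokens and the other takes zero and above.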
>> On my regional cluster of small nodes in AWS I got about 3000 rows per
>> second transferred after things warmed up a bit - each row about 6kb.
>>
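As a rough sanity check on that figure, the row rate and row size can be combined into a byte rate. The sketch below assumes "6kb" means about 6 kilobytes per row and uses decimal units; the function name is made up for illustration.

```python
def throughput_mb_per_s(rows_per_second, row_size_kb):
    """Approximate payload throughput in MB/s (1 MB = 1000 KB)."""
    return rows_per_second * row_size_kb / 1000.0


# 3000 rows/s at about 6 KB per row
print(throughput_mb_per_s(3000, 6))  # prints 18.0
```

So the reported run moved on the order of 18 MB/s of row data, not counting any replication traffic.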
>> ml
>>
>>
>> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <
>> michael.la...@nytimes.com> wrote:
>>
>>> OK Marcelo, I'll work on it today. -ml
>>>
>>>
>>> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <
>>> marc...@s1mbi0se.com.br> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> For sure I would be interested in this program!
>>>>
>>>> I am new to both Python and CQL. I started creating this copier,
>>>> but was having problems with timeouts. Alex solved my problem here on the
>>>> list, but I think I will still have a lot of trouble making the copy
>>>> work well.
>>>>
>>>> I open sourced my version here:
>>>> https://github.com/s1mbi0se/cql_record_processor
>>>>
>>>> Just in case it's useful for anything.
>>>>
>>>> However, I saw that CQL has support for concurrency itself, and having
>>>> something made by someone who knows the Python CQL driver better would be
>>>> very helpful.
>>>>
>>>> My two servers today are at OVH (ovh.com). We have servers at AWS, but
>>>> in several cases we prefer other hosts. Both servers have SSDs and 64 GB
>>>> of RAM. I could use the script as a benchmark for you if you want. Besides,
>>>> we have some bigger clusters; I could run it on them just to test the speed
>>>> if this is going to help.
>>>>
>>>> Regards
>>>> Marcelo.
>>>>
>>>>
>>>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>>>>
>>>> Hi Marcelo,
>>>>>
>>>>> I could create a fast copy program by repurposing some python apps
>>>>> that I am using for benchmarking the python driver - do you still need 
>>>>> this?
>>>>>
>>>>> With high levels of concurrency and multiple subprocess workers, based
>>>>> on my current actual benchmarks, I think I can get well over 1,000
>>>>> rows/second on my mac and significantly more in AWS. I'm using variable
>>>>> size rows averaging 5kb.
>>>>>
>>>>> This would be the initial version of a piece of the benchmark suite we
>>>>> will release as part of our nyt⨍aбrik project on 21 June for my
>>>>> Cassandra Day NYC talk re the python driver.
>>>>>
>>>>> ml
>>>>>
>>>>>
>>>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <
>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>
>>>>>> Hi Jens,
>>>>>>
>>>>>> Thanks for trying to help.
>>>>>>
>>>>>> Indeed, I know I can't do it using just CQL. But what would you use
>>>>>> to migrate data manually? I tried to create a Python program using auto
>>>>>> paging, but I am getting timeouts. I also tried Hive, but without success.
>>>>>> I only have two nodes and less than 200 GB in this cluster, so any simple
>>>>>> way to extract the data quickly would be good enough for me.
>>>>>>
>>>>>> Best regards,
>>>>>> Marcelo.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <jens.ran...@tink.se>:
>>>>>>
>>>>>> Hi Marcelo,
>>>>>>>
>>>>>>> Looks like you can't do this without migrating your data manually:
>>>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jens
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <
>>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have some CQL CFs in a 2-node Cassandra 2.0.8 cluster.
>>>>>>>>
>>>>>>>> I realized I created my column family with the wrong partition key.
>>>>>>>> Instead of:
>>>>>>>>
>>>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>   name varchar,
>>>>>>>>   value varchar,
>>>>>>>>   entity_id uuid,
>>>>>>>>   PRIMARY KEY ((name, value), entity_id))
>>>>>>>> WITH
>>>>>>>>     caching=all;
>>>>>>>>
>>>>>>>> I used:
>>>>>>>>
>>>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>>>   name varchar,
>>>>>>>>   value varchar,
>>>>>>>>   entity_id uuid,
>>>>>>>>   PRIMARY KEY (name, value, entity_id))
>>>>>>>> WITH
>>>>>>>>     caching=all;
>>>>>>>>
>>>>>>>>
>>>>>>>> Now I need to migrate the data from the second CF to the first one.
>>>>>>>> I am using DataStax Community Edition.
>>>>>>>>
>>>>>>>> What would be the best way to convert data from one CF to the other?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Marcelo.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
