Michael, I will try to test it by tomorrow and will let you know the results.
Thanks a lot!

Best regards,
Marcelo.

2014-06-04 22:28 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:

> BTW, you might want to put a LIMIT clause on your SELECT for testing. -ml
>
> On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael <michael.la...@nytimes.com> wrote:
>
>> Marcelo,
>>
>> Here is a link to the preview of the python fast copy program:
>>
>> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>>
>> It will copy a table from one cluster to another with some transformation - the source and destination can be the same cluster.
>>
>> It has 3 main throttles to experiment with:
>>
>> 1. fetch_size: size of source pages in rows
>> 2. worker_count: number of worker subprocesses
>> 3. concurrency: number of async callback chains per worker subprocess
>>
>> It is easy to overrun Cassandra and the python driver, so I recommend starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency: 10.
>>
>> Additionally, there are switches to set 'policies' by source and destination: retry (downgrade consistency), dc_aware, and token_aware. The retry policy is useful if you are getting timeouts; for the others, YMMV.
>>
>> To use it, you need to define the SELECT and UPDATE CQL statements as well as the 'map_fields' method.
>>
>> The worker subprocesses divide up the token range among themselves and proceed quasi-independently. Each worker opens a connection to each cluster, and the driver sets up connection pools to the nodes in the cluster. Anyway, there are a lot of processes, threads, and callbacks going at once, so it is fun to watch.
>>
>> On my regional cluster of small nodes in AWS I got about 3000 rows per second transferred after things warmed up a bit - each row about 6 KB.
>>
>> ml
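A minimal sketch of the pattern described above, for reference: worker subprocesses each take a slice of the Murmur3 token range and copy it with a bounded number of in-flight async writes. This is not the code from the gist - the contact point ('127.0.0.1') and keyspace name ('my_keyspace') are placeholders, the table and column names are borrowed from the original question further down the thread, and the real program adds the retry/dc_aware/token_aware policies and the map_fields transformation.

    # Sketch only: split the full Murmur3 token range across worker
    # subprocesses; each worker pages through its slice of the source
    # table and writes rows to the destination with bounded concurrency.
    from multiprocessing import Process

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    WORKER_COUNT = 2     # worker subprocesses
    CONCURRENCY = 10     # max in-flight async writes per worker
    FETCH_SIZE = 1000    # source page size in rows

    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3Partitioner token range

    def copy_slice(lo, hi):
        cluster = Cluster(['127.0.0.1'])               # placeholder contact point
        session = cluster.connect('my_keyspace')       # placeholder keyspace
        select = SimpleStatement(
            "SELECT name, value, entity_id FROM entitylookup"
            " WHERE token(name) >= %d AND token(name) <= %d" % (lo, hi),
            fetch_size=FETCH_SIZE)
        insert = session.prepare(
            "INSERT INTO entity_lookup (name, value, entity_id) VALUES (?, ?, ?)")
        futures = []
        for row in session.execute(select):            # driver pages automatically
            futures.append(
                session.execute_async(insert, (row.name, row.value, row.entity_id)))
            if len(futures) >= CONCURRENCY:            # throttle: wait on the oldest write
                futures.pop(0).result()
        for f in futures:                              # drain remaining writes
            f.result()
        cluster.shutdown()

    if __name__ == '__main__':
        step = (MAX_TOKEN - MIN_TOKEN) // WORKER_COUNT
        workers = []
        for i in range(WORKER_COUNT):
            lo = MIN_TOKEN + i * step
            hi = MAX_TOKEN if i == WORKER_COUNT - 1 else lo + step - 1
            workers.append(Process(target=copy_slice, args=(lo, hi)))
        for w in workers:
            w.start()
        for w in workers:
            w.join()

The throttle here is deliberately crude - it just blocks on the oldest outstanding write - whereas chaining the next request from each completion callback, as the concurrency knob above suggests, keeps a steady number of requests in flight per worker.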
>> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <michael.la...@nytimes.com> wrote:
>>
>>> OK Marcelo, I'll work on it today. -ml
>>>
>>> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> For sure I would be interested in this program!
>>>>
>>>> I am new to both python and cql. I started creating this copier, but was having problems with timeouts. Alex solved my problem here on the list, but I think I will still have a lot of trouble making the copy work properly.
>>>>
>>>> I open sourced my version here: https://github.com/s1mbi0se/cql_record_processor
>>>>
>>>> Just in case it's useful for anything.
>>>>
>>>> However, I saw the CQL driver has support for concurrency itself, and having something made by someone who knows the Python CQL driver better would be very helpful.
>>>>
>>>> My two servers today are at OVH (ovh.com); we have servers at AWS, but in several cases we prefer other hosts. Both servers have SSD and 64 GB RAM, and I could use the script as a benchmark for you if you want. Besides, we have some bigger clusters; I could run it on those just to test the speed, if that would help.
>>>>
>>>> Regards,
>>>> Marcelo.
>>>>
>>>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>>>>
>>>>> Hi Marcelo,
>>>>>
>>>>> I could create a fast copy program by repurposing some python apps that I am using for benchmarking the python driver - do you still need this?
>>>>>
>>>>> With high levels of concurrency and multiple subprocess workers, based on my current actual benchmarks, I think I can get well over 1,000 rows/second on my mac and significantly more in AWS. I'm using variable-size rows averaging 5 KB.
>>>>>
>>>>> This would be the initial version of a piece of the benchmark suite we will release as part of our nyt⨍aбrik project on 21 June for my Cassandra Day NYC talk on the python driver.
>>>>>
>>>>> ml
>>>>>
>>>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>>>
>>>>>> Hi Jens,
>>>>>>
>>>>>> Thanks for trying to help.
>>>>>>
>>>>>> Indeed, I know I can't do it using just CQL. But what would you use to migrate the data manually? I tried to create a python program using auto paging, but I am getting timeouts. I also tried Hive, but with no success. I only have two nodes and less than 200 GB in this cluster; any simple way to extract the data quickly would be good enough for me.
>>>>>>
>>>>>> Best regards,
>>>>>> Marcelo.
>>>>>>
>>>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <jens.ran...@tink.se>:
>>>>>>
>>>>>>> Hi Marcelo,
>>>>>>>
>>>>>>> Looks like you can't do this without migrating your data manually:
>>>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jens
>>>>>>>
>>>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <marc...@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have some CQL column families in a 2-node Cassandra 2.0.8 cluster.
>>>>>>>>
>>>>>>>> I realized I created my column family with the wrong partition key. Instead of:
>>>>>>>>
>>>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>   name varchar,
>>>>>>>>   value varchar,
>>>>>>>>   entity_id uuid,
>>>>>>>>   PRIMARY KEY ((name, value), entity_id))
>>>>>>>> WITH caching='all';
>>>>>>>>
>>>>>>>> I used:
>>>>>>>>
>>>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>>>   name varchar,
>>>>>>>>   value varchar,
>>>>>>>>   entity_id uuid,
>>>>>>>>   PRIMARY KEY (name, value, entity_id))
>>>>>>>> WITH caching='all';
>>>>>>>>
>>>>>>>> Now I need to migrate the data from the second CF to the first one. I am using the DataStax Community Edition.
>>>>>>>>
>>>>>>>> What would be the best way to convert data from one CF to the other?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Marcelo.
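For the original question - copying entitylookup into entity_lookup on a small two-node cluster - a minimal single-process sketch of the manual migration might look like the following. It relies on the driver's automatic result paging for the read side (a modest fetch_size makes the read timeouts mentioned above less likely) and on the driver's concurrent-execution helper for the writes. The contact point and keyspace name are placeholders; the table and column names come from the schemas above.

    # Sketch only: single-process copy from entitylookup (partition key: name)
    # to entity_lookup (partition key: (name, value)) using the Python driver's
    # automatic result paging and its concurrent-execution helper.
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args
    from cassandra.query import SimpleStatement

    cluster = Cluster(['127.0.0.1'])           # placeholder contact point
    session = cluster.connect('my_keyspace')   # placeholder keyspace

    select = SimpleStatement(
        "SELECT name, value, entity_id FROM entitylookup",
        fetch_size=1000)                       # smaller pages help avoid read timeouts
    insert = session.prepare(
        "INSERT INTO entity_lookup (name, value, entity_id) VALUES (?, ?, ?)")

    batch, copied = [], 0
    for row in session.execute(select):        # the driver fetches further pages as needed
        batch.append((row.name, row.value, row.entity_id))
        if len(batch) >= 1000:                 # flush roughly a page's worth of writes at a time
            execute_concurrent_with_args(session, insert, batch, concurrency=50)
            copied += len(batch)
            batch = []
    if batch:
        execute_concurrent_with_args(session, insert, batch, concurrency=50)
        copied += len(batch)

    print("copied %d rows" % copied)
    cluster.shutdown()

Both tables hold the same columns, so no map_fields-style transformation is needed here; only the primary key layout changes.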