Hi Marcelo,

I have updated the prerelease app in this gist:

https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47

I found that it was too easy to overrun my Cassandra clusters, so I added a
throttle arg, which by default is 1000 rows per second. Fixed a few bugs too,
reworked the args, etc.

I'll be interested to hear if you find it useful and/or have any comments.
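The throttle is just a simple rate limiter around row submission. A minimal
sketch of the idea (not the exact code in the gist; the names here are
illustrative):

    import time

    class Throttle(object):
        """Cap submissions to `rate` rows per second (sketch only)."""

        def __init__(self, rate=1000):
            self.rate = float(rate)
            self.allowance = self.rate  # allow up to one second of burst
            self.last_check = time.time()

        def wait(self, rows=1):
            # Token-bucket style: accrue allowance for the elapsed time,
            # then sleep if this batch of rows would exceed it.
            now = time.time()
            elapsed = now - self.last_check
            self.last_check = now
            self.allowance = min(self.rate, self.allowance + elapsed * self.rate)
            if self.allowance < rows:
                time.sleep((rows - self.allowance) / self.rate)
                self.allowance = 0.0
            else:
                self.allowance -= rows

Calling wait() before each row submission keeps the aggregate rate near the
cap without any coordination between callbacks.

ml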
On Thu, Jun 5, 2014 at 1:09 PM, Marcelo Elias Del Valle <
marc...@s1mbi0se.com.br> wrote:

> Michael,
>
> I will try to test it by tomorrow and I will let you know all the
> results.
>
> Thanks a lot!
>
> Best regards,
> Marcelo.
>
>
> 2014-06-04 22:28 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>
>> BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
>>
>>
>> On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael <
>> michael.la...@nytimes.com> wrote:
>>
>>> Marcelo,
>>>
>>> Here is a link to the preview of the python fast copy program:
>>>
>>> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>>>
>>> It will copy a table from one cluster to another, with some
>>> transformation; the source and destination can be the same cluster.
>>>
>>> It has 3 main throttles to experiment with:
>>>
>>> 1. fetch_size: size of source pages in rows
>>> 2. worker_count: number of worker subprocesses
>>> 3. concurrency: number of async callback chains per worker subprocess
>>>
>>> It is easy to overrun Cassandra and the python driver, so I recommend
>>> starting with the defaults: fetch_size: 1000; worker_count: 2;
>>> concurrency: 10.
>>>
>>> Additionally, there are switches to set 'policies' by source and
>>> destination: retry (downgrade consistency), dc_aware, and token_aware.
>>> retry is useful if you are getting timeouts. For the others, YMMV.
>>>
>>> To use it you need to define the SELECT and UPDATE cql statements as
>>> well as the 'map_fields' method.
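>>>
>>> For your entitylookup -> entity_lookup case they might look roughly
>>> like this (illustrative only; check the gist for the exact names and
>>> the map_fields signature). One wrinkle: every column in entity_lookup
>>> is part of the primary key, so there is nothing for an UPDATE to SET;
>>> the equivalent upsert in CQL is an INSERT:
>>>
>>> select_cql = """
>>>     SELECT name, value, entity_id
>>>     FROM src_keyspace.entitylookup"""
>>>
>>> update_cql = """
>>>     INSERT INTO dest_keyspace.entity_lookup (name, value, entity_id)
>>>     VALUES (?, ?, ?)"""
>>>
>>> def map_fields(row):
>>>     # No transformation needed in this case: same columns, just a
>>>     # different primary key layout on the destination table.
>>>     return (row.name, row.value, row.entity_id)
>>>
>>> map_fields is also where you would do any real transformation of a row.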
>>>
>>> The worker subprocesses divide up the token range among themselves and
>>> proceed quasi-independently. Each worker opens a connection to each
>>> cluster, and the driver sets up connection pools to the nodes in the
>>> cluster. Anyway, there are a lot of processes, threads, and callbacks
>>> going at once, so it is fun to watch.
>>>
>>> On my regional cluster of small nodes in AWS I got about 3000 rows per
>>> second transferred after things warmed up a bit, each row about 6kb.
>>>
>>> ml
>>>
>>>
>>> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <
>>> michael.la...@nytimes.com> wrote:
>>>
>>>> OK Marcelo, I'll work on it today. -ml
>>>>
>>>>
>>>> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <
>>>> marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> For sure I would be interested in this program!
>>>>>
>>>>> I am new both to python and to cql. I started creating this copier,
>>>>> but was having problems with timeouts. Alex solved my problem here on
>>>>> the list, but I think I will still have a lot of trouble making the
>>>>> copy work well.
>>>>>
>>>>> I open sourced my version here:
>>>>> https://github.com/s1mbi0se/cql_record_processor
>>>>>
>>>>> Just in case it's useful for anything.
>>>>>
>>>>> However, I saw CQL has support for concurrency itself, and having
>>>>> something made by someone who knows the Python CQL driver better
>>>>> would be very helpful.
>>>>>
>>>>> My two servers today are at OVH (ovh.com); we have servers at AWS
>>>>> too, but in several cases we prefer other hosts. Both servers have
>>>>> SSD and 64 Gb of RAM. I could use the script as a benchmark for you
>>>>> if you want. Besides, we have some bigger clusters; I could run it on
>>>>> those just to test the speed, if that would help.
>>>>>
>>>>> Regards,
>>>>> Marcelo.
>>>>>
>>>>>
>>>>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <michael.la...@nytimes.com>:
>>>>>
>>>>>> Hi Marcelo,
>>>>>>
>>>>>> I could create a fast copy program by repurposing some python apps
>>>>>> that I am using for benchmarking the python driver. Do you still
>>>>>> need this?
>>>>>>
>>>>>> With high levels of concurrency and multiple subprocess workers,
>>>>>> based on my current actual benchmarks, I think I can get well over
>>>>>> 1,000 rows/second on my mac and significantly more in AWS. I'm using
>>>>>> variable size rows averaging 5kb.
>>>>>>
>>>>>> This would be the initial version of a piece of the benchmark suite
>>>>>> we will release as part of our nyt⨍aбrik project on 21 June for my
>>>>>> Cassandra Day NYC talk re the python driver.
>>>>>>
>>>>>> ml
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <
>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> Thanks for trying to help.
>>>>>>>
>>>>>>> Indeed, I know I can't do it using just CQL. But what would you use
>>>>>>> to migrate the data manually? I tried to create a python program
>>>>>>> using auto paging, but I am getting timeouts. I also tried Hive,
>>>>>>> but had no success.
>>>>>>> I only have two nodes and less than 200Gb in this cluster; any
>>>>>>> simple way to extract the data quickly would be good enough for me.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Marcelo.
>>>>>>>
>>>>>>>
>>>>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <jens.ran...@tink.se>:
>>>>>>>
>>>>>>>> Hi Marcelo,
>>>>>>>>
>>>>>>>> Looks like you can't do this without migrating your data manually:
>>>>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jens
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <
>>>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have some cql CFs in a 2-node Cassandra 2.0.8 cluster.
>>>>>>>>>
>>>>>>>>> I realized I created my column family with the wrong partition
>>>>>>>>> key. Instead of:
>>>>>>>>>
>>>>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>>     name varchar,
>>>>>>>>>     value varchar,
>>>>>>>>>     entity_id uuid,
>>>>>>>>>     PRIMARY KEY ((name, value), entity_id))
>>>>>>>>> WITH caching='all';
>>>>>>>>>
>>>>>>>>> I used:
>>>>>>>>>
>>>>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>>>>     name varchar,
>>>>>>>>>     value varchar,
>>>>>>>>>     entity_id uuid,
>>>>>>>>>     PRIMARY KEY (name, value, entity_id))
>>>>>>>>> WITH caching='all';
>>>>>>>>>
>>>>>>>>> Now I need to migrate the data from the second CF to the first
>>>>>>>>> one. I am using DataStax Community Edition.
>>>>>>>>>
>>>>>>>>> What would be the best way to convert the data from one CF to the
>>>>>>>>> other?
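>>>>>>>>>
>>>>>>>>> The obvious approach seems to be to page through the old table
>>>>>>>>> and rewrite every row into the new one, along the lines of this
>>>>>>>>> sketch with the python driver (untested; the node address and
>>>>>>>>> keyspace name are placeholders), but I am not sure it will behave
>>>>>>>>> well at this data size:
>>>>>>>>>
>>>>>>>>> from cassandra.cluster import Cluster
>>>>>>>>> from cassandra.query import SimpleStatement
>>>>>>>>>
>>>>>>>>> cluster = Cluster(['127.0.0.1'])
>>>>>>>>> session = cluster.connect('my_keyspace')
>>>>>>>>>
>>>>>>>>> insert = session.prepare(
>>>>>>>>>     "INSERT INTO entity_lookup (name, value, entity_id) "
>>>>>>>>>     "VALUES (?, ?, ?)")
>>>>>>>>>
>>>>>>>>> # Small pages keep the coordinator from timing out on a full scan.
>>>>>>>>> select = SimpleStatement(
>>>>>>>>>     "SELECT name, value, entity_id FROM entitylookup",
>>>>>>>>>     fetch_size=1000)
>>>>>>>>>
>>>>>>>>> for row in session.execute(select):  # driver pages transparently
>>>>>>>>>     session.execute(insert, (row.name, row.value, row.entity_id))
>>>>>>>>>
>>>>>>>>> cluster.shutdown()
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Marcelo.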