I had considered using Spark for this, but:

1. We tried to deploy Spark, only to find out that it was missing a number of key things we need.
2. Our app needs to shut down to release threads and resources. Spark doesn't have support for this, so all the workers would have stale threads leaking afterwards. Though I guess if I can get the workers to fork, then I should be OK.

3. Spark SQL actually returned invalid data to our queries… so that was kind of a red flag and a non-starter.

On Mon, Feb 9, 2015 at 2:24 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:

> Just for the record, I was doing the exact same thing in an internal application at the startup I used to work for. We had the need to write custom code to process all rows of a column family in parallel. Normally we would use Spark for the job, but in our case the logic was a little more complicated, so we wrote custom code.
>
> What we did was to run N processes on M machines (N cores on each), each one processing tasks. The tasks were created by splitting the token range -2^63 to 2^63 - 1 into N*M*10 tasks. Even when the data was not evenly distributed across the tasks, no machine sat idle: whenever a task completed, another one was taken from the task pool.
>
> It was fast enough for us, but I am interested in knowing if there is a better way of doing it.
>
> For your specific case, here is a tool we released as open source that can be useful for simpler tests:
> https://github.com/s1mbi0se/cql_record_processor
>
> Also, I guess you probably know this already, but I would consider using Spark for doing this.
>
> Best regards,
> Marcelo.
>
> From: user@cassandra.apache.org
> Subject: Re: Fastest way to map/parallel read all values in a table?
>
> What's the fastest way to map/parallel read all values in a table?
>
> Kind of like a mini map-only job.
>
> I'm doing this to compute stats across our entire corpus.
>
> What I did to begin with was use token() and then split it into the number of splits I needed.
>
> So I just took the total token range, which is -2^63 to 2^63 - 1, and broke it into N parts.
>
> Then the queries look like:
>
> select * from mytable where token(primaryKey) >= x and token(primaryKey) < y
>
> From reading on this list I thought this was the correct way to handle this problem.
>
> However, I'm seeing horrible performance doing this. After about 1% it just flat out locks up.
>
> Could it be that I need to randomize the token order so that it's not contiguous? Maybe it's all mapping onto the first box to begin with.

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
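For reference, here is a minimal sketch of the token-range scan pattern described in this thread, using the DataStax Python driver (cassandra-driver). The keyspace, table, column, and contact-point names (mykeyspace, mytable, primaryKey, 127.0.0.1) are placeholders, the split count is illustrative, and the per-row stats work is stubbed out; it assumes the Murmur3 partitioner, whose token range is -2^63 to 2^63 - 1.

    import random
    from concurrent.futures import ThreadPoolExecutor

    from cassandra.cluster import Cluster

    MIN_TOKEN = -2**63        # Murmur3Partitioner token range
    MAX_TOKEN = 2**63 - 1
    NUM_SPLITS = 100          # e.g. N processes * M machines * 10

    def token_splits(num_splits):
        """Split the full token range into inclusive [start, end] slices."""
        step = (MAX_TOKEN - MIN_TOKEN) // num_splits
        splits, start = [], MIN_TOKEN
        for i in range(num_splits):
            end = MAX_TOKEN if i == num_splits - 1 else start + step - 1
            splits.append((start, end))
            start = end + 1
        return splits

    def process_split(session, start, end):
        """Scan one token slice and fold each row into local stats."""
        rows = session.execute(
            'SELECT * FROM mytable '
            'WHERE token(primaryKey) >= %s AND token(primaryKey) <= %s',
            (start, end))
        count = 0
        for _row in rows:     # the driver pages through results transparently
            count += 1        # compute corpus stats here instead of counting
        return count

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('mykeyspace')   # Session objects are thread-safe

    splits = token_splits(NUM_SPLITS)
    random.shuffle(splits)    # don't issue contiguous ranges in order

    with ThreadPoolExecutor(max_workers=8) as pool:
        totals = pool.map(lambda s: process_split(session, *s), splits)
        print('rows scanned:', sum(totals))

    cluster.shutdown()        # release driver threads and connections cleanly

Shuffling the splits before dispatching them speaks to the contiguity question raised above: consecutive token ranges tend to live on the same replicas, so issuing them in random order spreads the load across the cluster instead of hammering the first node.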