Good way of experimenting, Will. Share your observations :) Adding the Cassandra user group for the community's input on the num_tokens setting in cassandra.yaml.
Thanks,
Suraj

On 07-Apr-2014 6:20 PM, "William Oberman" <ober...@civicscience.com> wrote:

> If that works, it's a neat/fancy trick. But after looking into the docs for
> that setting, which relates to vnodes (which I'm not using), I'm not
> comfortable messing with it on my production cluster.
>
> I guess this implies that the # of mappers is related to the number of
> physical/virtual nodes of the Cassandra cluster, which makes sense...
>
> I think I'm just going to hack up a script that I can run locally on a
> Cassandra box, and only run queries/writes/deletes against the local token
> range (and get parallelism from Cassandra's distributed nature).
>
> will
>
>
> On Sun, Apr 6, 2014 at 7:30 AM, Suraj Nayak <snay...@gmail.com> wrote:
>
>> Hi Will,
>>
>> Try changing the value of num_tokens in conf/cassandra.yaml.
>>
>> Set it to your desired number of map tasks minus 1.
>>
>> For example, if you want 5 map tasks to run, set num_tokens to 4 (the
>> default is 256).
>>
>> I encountered almost the same situation when I was trying to load or
>> write a very small amount of data from/into Cassandra: it was launching
>> 257 map tasks. When I reduced num_tokens to 1, Pig launched only 2 map
>> tasks. Do restart the Cassandra service after the change.
>>
>> Hope it helps.
>>
>> --
>> Suraj
>> On 04-Apr-2014 11:44 PM, "William Oberman" <ober...@civicscience.com>
>> wrote:
>>
>>> Apologies for cross-posting!
>>>
>>> My core issue is unblocked, but I'm still curious about one aspect of my
>>> question to the Cassandra mailing list: how does Pig/Hadoop decide how
>>> many tasks there are? The forwarded email below has the gory details,
>>> but basically:
>>> - My Pig loadFunc was CassandraStorage.
>>> - The "table" (column family in Cassandra) has something like a billion
>>>   rows in it, and I want to say ~3TB of data.
>>> - No matter what I tried(*), Pig/Hadoop decided this was worthy of 20
>>>   tasks.
>>>
>>> (*) I changed settings in the loadFunc, I booted Hadoop clusters with
>>> more or fewer task slots, etc...
>>>
>>> I'm using AWS's EMR, which claims to be Hadoop 1.0.3 + Pig 0.11.
>>>
>>> will
>>>
>>> ---------- Forwarded message ----------
>>> From: William Oberman <ober...@civicscience.com>
>>> Date: Fri, Apr 4, 2014 at 12:24 PM
>>> Subject: using hadoop + cassandra for CF mutations (delete)
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>
>>>
>>> Hi,
>>>
>>> I have some history with Cassandra + Hadoop:
>>> 1.) Single DC + integrated Hadoop = was "ok" until I needed steady
>>> performance (the single DC was used in a production environment).
>>> 2.) Two DCs + integrated Hadoop on 1 of the 2 DCs = was "ok" until my
>>> data grew; in AWS, compute is expensive compared to data storage, e.g.
>>> running a 24x7 DC was a lot more expensive than the following solution...
>>> 3.) Single DC + a constant "ETL" to S3 = is still ok; I can spawn an
>>> "arbitrarily large" EMR cluster, and 24x7 data storage + transient EMR
>>> is cost effective.
>>>
>>> But one of my CFs has had a change of usage pattern, making a large %,
>>> though not all, of the data fairly pointless to store. I thought I'd
>>> write a Pig UDF that could peek at a row of data and delete it if it
>>> fails my criteria. And it "works" in terms of logic, but not in terms of
>>> practical execution. The CF in question has O(billion) keys, and
>>> afterwards it will have ~10% of that at most.
>>>
>>> I basically keep losing the jobs due to too many task failures, all
>>> rooted in:
>>> Caused by: TimedOutException()
>>>         at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:13020)
>>>
>>> And yes, I've messed around with:
>>> - the number of allowed failures for map/reduce/tracker (in the Hadoop
>>>   confs)
>>> - split_size (on the URL)
>>> - cassandra.range.batch.size
>>>
>>> But it hasn't helped. My failsafe is to roll my own distributed process
>>> rather than falling into a pit of internal Hadoop settings, but I feel
>>> like I'm close.
>>>
>>> The problem, in my opinion, watching how things are going, is the
>>> correlation of splits <-> tasks. I'm obviously using Pig, so this part
>>> of the process is fairly opaque to me at the moment. But "something
>>> somewhere" is picking 20 tasks for my job, and this is fairly independent
>>> of the # of task slots (I've booted EMR clusters with different #'s and
>>> always get 20). Why does this matter? When a task fails, it retries from
>>> the start, which is a killer for me because I "delete as I go": the redone
>>> work is pointless and it massively increases the odds of an overall job
>>> failure. If Hadoop/Pig chose a larger number of tasks, the retries would
>>> be much less of a burden. But I don't see where/what lets me mess with
>>> that logic.
>>>
>>> Pig gives the ability to mess with reducers (PARALLEL), but I'm in the
>>> load path, which is all mappers. I've never jumped down to the lower,
>>> raw Hadoop level before, and I'm worried that will be the "falling into
>>> a pit" issue...
>>>
>>> I'm using Cassandra 1.2.15.
>>>
>>> will
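For reference, the knobs mentioned in the thread can all be set from a Pig script. This is only a sketch with hypothetical keyspace/column-family names: split_size is a CassandraStorage URL option (roughly rows per input split, so a smaller value should, in principle, produce more map tasks), while the SET lines become Hadoop job properties. Note that, per the thread, these settings did not change the 20-task split count in Will's case.

    -- hypothetical keyspace/CF names; tune values to taste
    SET cassandra.range.batch.size 512;   -- rows per get_range_slices call, to dodge TimedOutException
    SET mapred.map.max.attempts 8;        -- per-task retry budget (Hadoop 1.x property)

    rows = LOAD 'cassandra://my_keyspace/my_cf?split_size=16384'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage();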
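The "run queries/writes/deletes only against the local token range" fallback Will describes could look something like the sketch below. Everything specific is an assumption for illustration, since the thread names no client or schema: it uses the DataStax Python driver over CQL3, assumes the partition key column is named "key" with a single "value" column, and takes the node's token range from the operator (e.g. from nodetool ring on that box) rather than discovering it.

    #!/usr/bin/env python
    # Minimal sketch: delete rows failing some criteria, scanning only the
    # token range owned by the local node. Assumes a non-wrapping range;
    # a wrap-around range would need to be split in two.
    # Usage: python local_range_cleanup.py <start_token> <end_token>
    import sys
    from cassandra.cluster import Cluster

    KEYSPACE = "my_keyspace"   # hypothetical
    TABLE = "my_cf"            # hypothetical
    PAGE = 1000                # rows fetched per round trip

    def should_delete(key, value):
        # Placeholder for the "peek at a row and decide" criteria from the thread.
        return False

    def main(start_token, end_token):
        session = Cluster(["127.0.0.1"]).connect(KEYSPACE)  # local node only
        current, end = int(start_token), int(end_token)

        while True:
            # Manual token paging: fetch the next chunk of this node's range.
            query = ("SELECT token(key), key, value FROM %s "
                     "WHERE token(key) > %%s AND token(key) <= %%s LIMIT %d"
                     % (TABLE, PAGE))
            rows = list(session.execute(query, (current, end)))
            if not rows:
                break
            for tok, key, value in rows:
                if should_delete(key, value):
                    session.execute("DELETE FROM %s WHERE key = %%s" % TABLE, (key,))
            current = rows[-1][0]  # resume just past the last token seen

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])

Run one copy per node against its own range and the parallelism comes from the cluster itself, which is the idea described above; a failed run can simply be restarted for that one range without redoing the whole job.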