Apologies for cross-posting! My core issue is unblocked, but I'm still curious about one aspect of my question to the Cassandra mailing list: how does Pig/Hadoop decide how many tasks there are? The forwarded email below has the gory details, but basically:

- My Pig LoadFunc was CassandraStorage.
- The "table" (a column family in Cassandra) has something like a billion rows in it, roughly 3 TB of data.
- No matter what I tried (*), Pig/Hadoop decided this was worthy of 20 tasks.

(*) I changed settings in the LoadFunc, I booted Hadoop clusters with more or fewer task slots, etc. I'm using AWS's EMR, which claims to be Hadoop 1.0.3 + Pig 0.11.
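For concreteness, the load looks roughly like the sketch below (the keyspace/CF names are placeholders I made up). My understanding is that split_size on the CassandraStorage URL controls how many rows go into each input split, and that each input split normally becomes one map task; changing it never moved me off 20 tasks, though.

    -- Rough shape of my load; 'my_keyspace' and 'my_cf' are placeholder names.
    -- split_size (rows per input split) should determine how many input splits
    -- the job gets, and each input split normally becomes one map task.
    rows = LOAD 'cassandra://my_keyspace/my_cf?split_size=65536'
           USING org.apache.cassandra.hadoop.pig.CassandraStorage();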
will

---------- Forwarded message ----------
From: William Oberman <[email protected]>
Date: Fri, Apr 4, 2014 at 12:24 PM
Subject: using hadoop + cassandra for CF mutations (delete)
To: "[email protected]" <[email protected]>

Hi,

I have some history with Cassandra + Hadoop:

1.) Single DC + integrated Hadoop = was "ok" until I needed steady performance (the single DC was used in a production environment).

2.) Two DCs + integrated Hadoop on one of the two DCs = was "ok" until my data grew; in AWS, compute is expensive compared to data storage, e.g. running a 24x7 DC was a lot more expensive than the following solution...

3.) Single DC + a constant "ETL" to S3 = is still ok; I can spawn an "arbitrarily large" EMR cluster, and 24x7 data storage + transient EMR is cost effective.

But one of my CFs has had a change of usage pattern that makes a large percentage of the data, though not all of it, fairly pointless to store. I thought I'd write a Pig UDF that could peek at a row of data and delete it if it fails my criteria. And it "works" in terms of logic, but not in terms of practical execution. The CF in question has O(billion) keys, and afterwards it will have at most ~10% of that. I basically keep losing the jobs due to too many task failures, all rooted in:

    Caused by: TimedOutException()
        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:13020)

And yes, I've messed around with:
- the number of allowed failures for map/reduce/tracker (in the Hadoop confs)
- split_size (on the URL)
- cassandra.range.batch.size

But it hasn't helped. My failsafe is to roll my own distributed process rather than falling into a pit of internal Hadoop settings. But I feel like I'm close.

The problem, in my opinion, watching how things are going, is the correlation of splits <-> tasks. I'm obviously using Pig, so this part of the process is fairly opaque to me at the moment. But "something somewhere" is picking 20 tasks for my job, and that number is fairly independent of the number of task slots (I've booted EMR clusters of different sizes and always get 20).

Why does this matter? When a task fails, it retries from the start, which is a killer for me because I "delete as I go": a retry redoes pointless work and massively increases the odds of an overall job failure. If Hadoop/Pig chose a larger number of tasks, the retries would be much less of a burden. But I don't see where/what lets me mess with that logic. Pig gives the ability to control reducers (PARALLEL), but I'm in the load path, which is all mappers.

I've never dropped down to the lower, raw Hadoop level before, and I'm worried that would be the "falling into a pit" issue...

I'm using Cassandra 1.2.15.

will
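P.S. for the pig list: one lead I still want to rule out is Pig's split combination. Per the Pig performance docs, Pig combines small input splits into fewer map tasks by default, so if CassandraStorage's splits report small lengths, the combiner could be what pins the job at a fixed task count no matter what split_size says. A sketch of the documented knobs (I haven't verified they change anything for CassandraStorage):

    -- Untested sketch using properties from the Pig performance docs; whether
    -- split combination is actually what's collapsing my Cassandra splits is
    -- an open question.
    SET pig.splitCombination 'false';         -- turn off split combining entirely
    -- or, instead, cap how many bytes one combined split (one map) may hold:
    SET pig.maxCombinedSplitSize '134217728'; -- 128 MB per map task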
