Re: using hadoop + cassandra for CF mutations (delete)

William Oberman Mon, 07 Apr 2014 05:46:19 -0700

That's an excellent setting to know about!  But, I believe it depends on
the InputFormat implementing CombineFileInputFormat, and Cassandra does not.


Very cool though, thanks.


On Sun, Apr 6, 2014 at 12:43 AM, Dotan Patrich <[email protected]> wrote:

> Hi Will,
>
> Have you tried setting this directive at the top of your Pig file:
> SET pig.maxCombinedSplitSize <some split size>
>
> It works for me on CDH 4.4, although my data source is HDFS files and not
> Cassandra.
>
> Regards,
> Dotan
>
>
>
>
> On Fri, Apr 4, 2014 at 9:13 PM, William Oberman <[email protected]> wrote:
>
> > Apologies for cross posting!
> >
> > My core issue is unblocked, but I'm still curious about one aspect of my
> > question to the cassandra mailing list.  How does Pig/Hadoop decide how
> > many tasks there are?  The forwarded email below has the gory details, but
> > basically:
> > -My Pig loadFunc was CassandraStorage
> > -The "table" (column family in cassandra) has something like a billion rows
> > in it, and I want to say ~3TB of data.
> > -No matter what I tried(*), Pig/Hadoop decided this was worthy of 20 tasks
> >
> > (*) I changed settings in the loadFunc, I booted hadoop clusters with more
> > or fewer task slots, etc...
> >
> > I'm using AWS's EMR, which claims to be hadoop 1.0.3 + pig 0.11.
> >
> > will
> >
> > ---------- Forwarded message ----------
> > From: William Oberman <[email protected]>
> > Date: Fri, Apr 4, 2014 at 12:24 PM
> > Subject: using hadoop + cassandra for CF mutations (delete)
> > To: "[email protected]" <[email protected]>
> >
> >
> > Hi,
> >
> > I have some history with cassandra + hadoop:
> > 1.) Single DC + integrated hadoop = Was "ok" until I needed steady
> > performance (the single DC was used in a production environment)
> > 2.) Two DC's + integrated hadoop on 1 of 2 DCs = Was "ok" until my data
> > grew and in AWS compute is expensive compared to data storage... e.g.
> > running a 24x7 DC was a lot more expensive than the following solution...
> > 3.) Single DC + a constant "ETL" to S3 = Is still ok, I can spawn an
> > "arbitrarily large" EMR cluster.  And 24x7 data storage + transient EMR is
> > cost effective.
> >
> > But, one of my CF's has had a change of usage pattern making a large %, but
> > not all of the data, fairly pointless to store.  I thought I'd write a Pig
> > UDF that could peek at a row of data and delete if it fails my criteria.
> > And it "works" in terms of logic, but not in terms of practical execution.
> > The CF in question has O(billion) keys, and afterwards it will have ~10%
> > of that at most.
> >
> > I basically keep losing the jobs due to too many task failures, all rooted
> > in:
> > Caused by: TimedOutException()
> > at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:13020)
> >
> > And yes, I've messed around with:
> > -Number of failures for map/reduce/tracker (in the hadoop confs)
> > -split_size (on the URL)
> > -cassandra.range.batch.size
> >
> > But it hasn't helped.  My failsafe is to roll my own distributed process,
> > rather than falling into a pit of internal hadoop settings.  But I feel
> > like I'm close.
> >
> > The problem in my opinion, watching how things are going, is the
> > correlation of splits <-> tasks.  I'm obviously using Pig, so this part of
> > the process is fairly opaque to me at the moment.  But, "something
> > somewhere" is picking 20 tasks for my job, and this is fairly independent
> > of the # of task slots (I've booted EMR clusters with different #'s and
> > always get 20).  Why does this matter?  When a task fails, it retries from
> > the start, which is a killer for me as I "delete as I go", making that
> > pointless work and massively increasing the odds of an overall job failure.
> > If hadoop/pig chose a larger number of tasks, the retries would be much
> > less of a burden.  But, I don't see where/what lets me mess with that
> > logic.
> >
> > Pig gives the ability to mess with reducers (PARALLEL), but I'm in the load
> > path, which is all mappers.  I've never jumped to the lower, raw hadoop
> > level before.  But, I'm worried that will be the "falling into a pit"
> > issue...
> >
> > I'm using Cassandra 1.2.15.
> >
> > will
> >
>
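
For anyone following the retry argument in the quoted message, here is a rough
back-of-envelope sketch of why the task count matters. It only does arithmetic
and does not touch Hadoop, Pig, or Cassandra. The billion-row figure and the
20-task figure come from the thread; the other task counts are hypothetical
values picked for comparison, and the even spread of rows across tasks is an
assumption (real token ranges may be skewed).

```python
# Back-of-envelope sketch: how much work a single failed map task throws away.
# TOTAL_ROWS and the 20-task case come from the thread; 200 and 2000 are
# hypothetical task counts used only for comparison.

TOTAL_ROWS = 1_000_000_000  # "something like a billion rows"


def rows_per_task(total_rows: int, num_tasks: int) -> int:
    """Rows each task scans, ASSUMING rows are spread evenly (ceiling division)."""
    return -(-total_rows // num_tasks)


def retry_cost_fraction(num_tasks: int) -> float:
    """Upper bound on the share of the whole job redone when one task restarts."""
    return 1.0 / num_tasks


if __name__ == "__main__":
    for tasks in (20, 200, 2000):
        print(
            f"{tasks:>5} tasks: {rows_per_task(TOTAL_ROWS, tasks):>11,} rows/task, "
            f"one retry redoes up to {retry_cost_fraction(tasks):.2%} of the job"
        )
```

Under those assumptions, 20 tasks means each one is responsible for roughly
50 million rows, so a single timeout late in a task can discard up to 5% of
the whole job's scanning work.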
