timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
Hi everyone, I am trying to develop a mapreduce job that does a simple selection+filter on the rows in our store. Of course it is mostly based on the WordCount example :) Sadly, while it seems the app runs fine on a test keyspace with little data, when run on a larger test index (but still on a

Re: Cassandra training on May 21 in Palo Alto

2010-05-07 Thread S Ahmed
toronto :) If not toronto, Virginia. On Thu, May 6, 2010 at 5:28 PM, Jonathan Ellis wrote: > We're planning that now. Where would you like to see one? > > On Thu, May 6, 2010 at 2:40 PM, S Ahmed wrote: > > Do you have rough ideas when you would be doing the next one? Maybe in 1 > or > > 2 mo

bloom filter

2010-05-07 Thread vineet daniel
Hi, what is the benefit of creating a bloom filter when Cassandra writes data? How does it help? ___ Vineet Daniel ___ Let your email find you

Re: bloom filter

2010-05-07 Thread David Strauss
On 2010-05-07 10:51, vineet daniel wrote: > what is the benefit of creating bloom filter when cassandra writes data, > how does it helps ? http://wiki.apache.org/cassandra/ArchitectureOverview -- David Strauss | da...@fourkitchens.com Four Kitchens | http://fourkitchens.com | +1 512 454

Re: Cassandra training on May 21 in Palo Alto

2010-05-07 Thread Matt Revelle
Reston, VA is a good spot in the DC metro area for tech events. The recent Pragmatic Programmer Clojure class sold out and already has two more return visits planned. On May 7, 2010, at 6:42 AM, S Ahmed wrote: > toronto :) > > If not toronto, Virginia. > > On Thu, May 6, 2010 at 5:28 PM, Jo

Re: bloom filter

2010-05-07 Thread Peter Schüller
> what is the benefit of creating bloom filter when cassandra writes data, how > does it helps ? It allows Cassandra to answer requests for non-existent keys without going to disk, except in cases where the bloom filter gives a false positive. See: http://spyced.blogspot.com/2009/01/all-you-ever
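As a rough illustration of the point above, here is a toy bloom filter in Python. It is only a sketch: the class name, the bit-array size, the MD5-based hashing and the `read_key` helper are assumptions made for this example and are not Cassandra's actual implementation.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: k hash probes into an m-slot array (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive k positions by salting the key with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def read_key(key, bloom, disk):
    """Hypothetical read path: touch `disk` only if the filter says the key may exist."""
    if not bloom.might_contain(key):
        return None  # answered from memory, no disk access
    return disk.get(key)  # may still be None on a false positive
```

A filter never reports a key it has seen as absent, so the only wasted disk reads come from false positives, which is the trade-off described in the reply.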

Re: bloom filter

2010-05-07 Thread vineet daniel
Thanks David and Peter. Is there any way to view the content of this file. ___ Vineet Daniel ___ Let your email find you On Fri, May 7, 2010 at 4:24 PM, David Strauss wrote: > On 2010-05-07 10:51, vineet daniel wrote:

Re: bloom filter

2010-05-07 Thread David Strauss
On 2010-05-07 10:55, Peter Schüller wrote: >> what is the benefit of creating bloom filter when cassandra writes data, how >> does it helps ? > > It allows Cassandra to answer requests for non-existent keys without > going to disk, except in cases where the bloom filter gives a false > positive. >

Re: bloom filter

2010-05-07 Thread vineet daniel
1. Peter said 'without going to disk', so does that mean bloom filters reside in memory always, or only when a request to that particular CF is made? 2. "It is also important for identifying which SSTable files to look inside even when a key is present." - David can you please throw some more light on you

Re: bloom filter

2010-05-07 Thread David Strauss
On 2010-05-07 11:03, vineet daniel wrote: > 2. "It is also important for identifying which SSTable files to look inside > even when a key is present." - David can you please throw some more > light on your point, like what are the implications, why do we need to > identify etc. A bloom filter is a

Re: bloom filter

2010-05-07 Thread David Strauss
On 2010-05-07 10:58, vineet daniel wrote: > Is there any way to view the content of this file. Which file? -- David Strauss | da...@fourkitchens.com Four Kitchens | http://fourkitchens.com | +1 512 454 6659 [office] | +1 512 870 8453 [direct]

Re: pagination through slices with deleted keys

2010-05-07 Thread Joost Ouwerkerk
+1. There is some disagreement on whether or not the API should return empty columns or skip rows when no data is found. In all of our use cases, we would prefer skipped rows. And based on how frequently new cassandra users appear to be confused about the current behaviour, this might be a more

Re: timeout while running simple hadoop job

2010-05-07 Thread Jonathan Ellis
Sounds like you need to configure Hadoop to not create a whole bunch of Map tasks at once On Fri, May 7, 2010 at 3:47 AM, gabriele renzi wrote: > Hi everyone, > > I am trying to develop a mapreduce job that does a simple > selection+filter on the rows in our store. > Of course it is mostly based

Re: timeout while running simple hadoop job

2010-05-07 Thread Joost Ouwerkerk
Huh? Isn't that the whole point of using Map/Reduce? On Fri, May 7, 2010 at 8:44 AM, Jonathan Ellis wrote: > Sounds like you need to configure Hadoop to not create a whole bunch > of Map tasks at once > > On Fri, May 7, 2010 at 3:47 AM, gabriele renzi wrote: >> Hi everyone, >> >> I am trying to

Re: timeout while running simple hadoop job

2010-05-07 Thread Joseph Stein
The problem could be that you are crunching more data than can be completed within the expiry interval setting. In Hadoop you need to tell the task tracker that you are still working, which is done by setting the status or incrementing a counter on the Reporter object. http://allthingshadoo
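The thread concerns a Java job, where progress is reported through the Reporter object. For comparison only, a Hadoop Streaming mapper written in Python can report liveness by writing `reporter:status:` and `reporter:counter:` lines to stderr. The sketch below assumes a streaming job; the function names, the counter group "sketch" and the reporting interval are made up for illustration.

```python
import sys


def status_line(message):
    # Hadoop Streaming treats stderr lines of this form as status updates.
    return f"reporter:status:{message}\n"


def counter_line(group, counter, amount=1):
    # Hadoop Streaming treats stderr lines of this form as counter increments.
    return f"reporter:counter:{group},{counter},{amount}\n"


def emit(line):
    """Default reporter: write the control line to stderr and flush it."""
    sys.stderr.write(line)
    sys.stderr.flush()


def map_records(lines, every=1000, report=emit):
    """Pass records through, reporting progress every `every` records."""
    for i, line in enumerate(lines, start=1):
        yield line.rstrip("\n")
        if i % every == 0:
            report(status_line(f"processed {i} records"))
            report(counter_line("sketch", "records", every))
```

Reporting only keeps the task itself from being declared dead; it would not help if the timeout is raised by an underlying client call rather than by the task tracker.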

Re: timeout while running simple hadoop job

2010-05-07 Thread Matt Revelle
There's also the mapred.task.timeout property that can be tweaked. But reporting is the correct way to fix timeouts during execution. On May 7, 2010, at 8:49 AM, Joseph Stein wrote: > The problem could be that you are crunching more data than will be > completed within the interval expire setti

Re: timeout while running simple hadoop job

2010-05-07 Thread Joost Ouwerkerk
Joseph, the stacktrace suggests that it's Thrift that's timing out, not the Task. Gabriele, I believe that your problem is caused by too much load on Cassandra. Get_range_slices is presently an expensive operation. I had some success in reducing (although, it turns out, not eliminating) this prob

Re: pagination through slices with deleted keys

2010-05-07 Thread Mark Greene
I like your idea about specifying it at the SP level. On Fri, May 7, 2010 at 8:29 AM, Joost Ouwerkerk wrote: > +1. There is some disagreement on whether or not the API should > return empty columns or skip rows when no data is found. In all of > our use cases, we would prefer skipped rows. And

Re: timeout while running simple hadoop job

2010-05-07 Thread Jonathan Ellis
The whole point is to parallelize to use the available capacity across multiple machines. If you go past that point (fairly easy when you have a single machine) then you're just contending for resources, not making things faster. On Fri, May 7, 2010 at 7:48 AM, Joost Ouwerkerk wrote: > Huh? Isn'

Re: timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
On Fri, May 7, 2010 at 3:02 PM, Joost Ouwerkerk wrote: > Joseph, the stacktrace suggests that it's Thrift that's timing out, > not the Task. > > Gabriele, I believe that your problem is caused by too much load on > Cassandra.  Get_range_slices is presently an expensive operation. I > had some succ

Re: timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
On Fri, May 7, 2010 at 2:53 PM, Matt Revelle wrote: > There's also the mapred.task.timeout property that can be tweaked.  But > reporting is the correct way to fix timeouts during execution. re: not reporting, I thought this was not needed with the new mapred api (Mapper class vs Mapper interf

Re: timeout while running simple hadoop job

2010-05-07 Thread Matt Revelle
On May 7, 2010, at 9:40, gabriele renzi wrote: On Fri, May 7, 2010 at 2:53 PM, Matt Revelle wrote: re: not reporting, I thought this was not needed with the new mapred api (Mapper class vs Mapper interface), plus I can see that the mappers do work, report percentage and happily terminate

Re: timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis wrote: > Sounds like you need to configure Hadoop to not create a whole bunch > of Map tasks at once Interesting, from a quick check it seems there are a dozen threads running. Yet, setNumMapTasks seems to be deprecated (together with JobConf) and

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-07 Thread Ian Kallen
On 5/6/10 3:26 PM, Stu Hood wrote: Ian: I think that as get_range_slice gets faster, the approach that Mark was heading toward may be considerably more efficient than reading the old value in the OutputFormat. Interesting, I'm trying to understand the performance impact of the different wa

Re: timeout while running simple hadoop job

2010-05-07 Thread Joseph Stein
you can manage the number of map tasks by node mapred.tasktracker.map.tasks.maximum=1 On Fri, May 7, 2010 at 9:53 AM, gabriele renzi wrote: > On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis wrote: >> Sounds like you need to configure Hadoop to not create a whole bunch >> of Map tasks at once >

Re: timeout while running simple hadoop job

2010-05-07 Thread Joost Ouwerkerk
The number of map tasks for a job is a function of the InputFormat, which in the case of ColumnFamilyInputFormat is a function of the global number of keys in Cassandra. The number of concurrent maps being executed at any given time per TaskTracker (per node) is set by mapred.tasktracker.map.tasks.ma

Re: Cassandra training on May 21 in Palo Alto

2010-05-07 Thread S Ahmed
It would be great if you could make a video of this event. Yes, it won't be like being there 1-1, but it sure would help get up to speed. On Fri, May 7, 2010 at 6:56 AM, Matt Revelle wrote: > Reston, VA is a good spot in the DC metro area for tech events. The recent > Pragmatic Programmer Clojure

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-07 Thread Jesse McConnell
does anyone have a feel for how performant m/r operations are when backed by cassandra as opposed to hdfs in terms of network utilization and volume of data being pushed around? jesse -- jesse mcconnell jesse.mcconn...@gmail.com On Fri, May 7, 2010 at 08:54, Ian Kallen wrote: > On 5/6/10 3:26

Re: Is SuperColumn necessary?

2010-05-07 Thread Eric Evans
On Wed, 2010-05-05 at 11:31 -0700, Ed Anuff wrote: > Follow-up from last weeks discussion, I've been playing around with a > simple > column comparator for composite column names that I put up on github. > I'd > be interested to hear what people think of this approach. > > http://github.com/edanuf

Re: Cassandra training on May 21 in Palo Alto

2010-05-07 Thread Todd Burruss
+1 -Original Message- From: S Ahmed [sahmed1...@gmail.com] Received: 5/7/10 7:09 AM To: user@cassandra.apache.org [u...@cassandra.apache.org] Subject: Re: Cassandra training on May 21 in Palo Alto It would be great if you could make a video of this event. Yes it won't like being there

Re: pagination through slices with deleted keys

2010-05-07 Thread Mike Malone
On Fri, May 7, 2010 at 5:29 AM, Joost Ouwerkerk wrote: > +1. There is some disagreement on whether or not the API should > return empty columns or skip rows when no data is found. In all of > our use cases, we would prefer skipped rows. And based on how > frequently new cassandra users appear t

Re: Is SuperColumn necessary?

2010-05-07 Thread Ed Anuff
On Thu, May 6, 2010 at 11:10 PM, Mike Malone wrote: > > The upshot is, the Cassandra data model would go from being "it's a nested > dictionary, just kidding no it's not!" to being "it's a nested dictionary, > for serious." Again, these are all just ideas... but I think this > simplified > data m

Re: Virtualization vs. Cassandra and Hadoop

2010-05-07 Thread Vijay
Probably only me... but we have seen higher latencies when using VMware. I also think it depends on the H/W and VM configuration; I have yet to figure out why. (You might also try mixing the applications which run on the h/w.) I think there are people who run it on Amazon's EC. Regards, On

Cassandra position in San Mateo

2010-05-07 Thread Amol Deshpande
We have a Lead Datastore Engineer position at Gazillion Entertainment, looking for someone with Cassandra (or similar) experience. Please feel free to ping me if you have any questions. Details here: http://tbe.taleo.net/NA5/ats/careers/requisition.jsp?org=NR2B&cws=1&rid= 334 Thanks,

Overfull node

2010-05-07 Thread David Koblas
I've got two (out of five) nodes on my Cassandra ring that somehow got too full (e.g. over 60% disk space utilization). I've now gotten a few new machines added to the ring, but every time one of the overfull nodes attempts to stream its data it runs out of disk space... I've tried half a dozen

key is sorted?

2010-05-07 Thread AJ Chen
I have a super column family for "topic", key being the name of the topic. When I retrieve the rows, the rows are not sorted by the key. Is the row key sorted in cassandra by default? -aj -- AJ Chen, PhD Chair, Semantic Web SIG, sdforum.org http://web2express.org twitter @web2express Palo Alto,

Re: key is sorted?

2010-05-07 Thread Roger Schildmeijer
Columns are sorted (see CompareWith/CompareSubcolumnsWith); keys are not. On 7 May 2010, at 10:10 PM, AJ Chen wrote: > I have a super column family for "topic", key being the name of the topic. > CompareSubcolumnsWith="BytesType" /> > When I retrieve the rows, the rows are not sorted by the key.

RE: key is sorted?

2010-05-07 Thread Stu Hood
Your IPartitioner implementation decides how the row keys are sorted: see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You need to be using one of the OrderPreservingPartitioners if you'd like a reasonable order for the keys. -Original Message- From: "AJ Chen" Se
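To make the difference concrete, the sketch below contrasts the two orderings in Python. It assumes that a random partitioner orders rows by an MD5-derived token and that an order-preserving partitioner orders them by the key itself; the function names are hypothetical and the token derivation is a simplification, not Cassandra's exact code.

```python
import hashlib


def md5_token(key):
    # Simplified stand-in for a hash-based token: the MD5 digest as an integer.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def random_partitioner_order(keys):
    """Rows come back in token order, which looks arbitrary to the client."""
    return sorted(keys, key=md5_token)


def order_preserving_order(keys):
    """Rows come back in key order, so range scans over keys are meaningful."""
    return sorted(keys)
```

With hashed tokens the same set of rows is returned, only in an order unrelated to the key text, which matches the behaviour reported in the original question.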

Re: Overfull node

2010-05-07 Thread Jonathan Ellis
If you're using RackUnawareStrategy (the default replication strategy) then you can "bootstrap" manually fairly easily -- copy all the data (not system) sstables from an overfull machine to a new machine, assign the new one a token that gives it about half of the old node's range, then start it wit
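Picking "a token that gives it about half of the old node's range" amounts to taking the midpoint between the overfull node's token and its predecessor's token on the ring. The Python sketch below assumes a hash-partitioned ring of size 2**127; the function name and the ring size constant are assumptions for illustration, so check them against your partitioner before relying on the result.

```python
RING_SIZE = 2 ** 127  # assumed token space for a hash-based partitioner


def midpoint_token(prev_token, node_token, ring_size=RING_SIZE):
    """Token halfway through the range (prev_token, node_token], with wraparound."""
    span = (node_token - prev_token) % ring_size
    if span == 0:
        span = ring_size  # a lone node owns the entire ring
    return (prev_token + span // 2) % ring_size
```

For example, a node at token 100 whose predecessor sits at token 0 would hand roughly half of its range to a new node placed at token 50.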

Is multiget_slice performant when you're looking for lots of keys?

2010-05-07 Thread James
Hi all, Apologies if I'm still stuck in RDBMS mentality - first project using Cassandra! I'll be using Cassandra to store quite a lot (10s of millions) of records, each of which has a type. I'll want to query the records to get all of a certain type; it's an analogous situation to the TaggedPosts

BinaryMemtable and collisions

2010-05-07 Thread Tobias Jungen
Greetings, Started getting my feet wet with Cassandra in earnest this week. I'm building a custom inverted index of sorts on top of Cassandra, in part inspired by the work of Jake Luciani in Lucandra. I've successfully loaded nearly a million documents over a 3-node cluster, and initial query test

Re: BinaryMemtable and collisions

2010-05-07 Thread Chris Goffinet
> > So my question is: If I properly flush every node after performing a larger > bulk insert, can Cassandra merge multiple writes on a single row & column > family when using the BMT interface? Or is using BMT only feasible for > loading data on rows that don't exist yet? > Yes. When you flu

Re: key is sorted?

2010-05-07 Thread AJ Chen
thanks, that works. -aj On Fri, May 7, 2010 at 1:17 PM, Stu Hood wrote: > Your IPartitioner implementation decides how the row keys are sorted: see > http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You > need to be using one of the OrderPreservingPartitioners if you'd like a

Cassandra vs. Voldemort benchmark

2010-05-07 Thread Kristian Eide
There is a benchmark comparing Cassandra to Voldemort performance here: http://blog.medallia.com/2010/05/choosing_a_keyvalue_storage_sy.html -- Kristian

Data Modeling Conundrum

2010-05-07 Thread William Ashley
List, I have a case where visitors to a site are tracked via a persistent cookie containing a guid. This cookie is created and set when missing. Some of these visitors are logged in, meaning a userId may also be available. What I’m looking to do is have a way to associate each userId with all of

RE: Cassandra vs. Voldemort benchmark

2010-05-07 Thread Todd Burruss
i did a lot of comparisons between voldemort and cassandra and in the end i decided to go with cassandra. the main reason was recovery and balancing operations. on the surface voldemort is s*** hot fast, until you need to restore a node or add nodes. BDB (the default persistence solution) isn

Re: BinaryMemtable and collisions

2010-05-07 Thread Tobias Jungen
> Yes. When you flush from BMT, its like any other SSTable. Cassandra will > merge them through compaction. > > That's good news, thanks for clarifying! A few more related questions: Are there any problems with issuing the flush command directly from code at the end of a bulk insert? The BMT exam

Benefits of using framed transport over non-framed transport?

2010-05-07 Thread 王一锋
Hi everyone, Can anyone shed light on the benefits of using framed transport over non-framed transport? We are trying to sum up some performance tuning approaches for Cassandra in our project. Can framed transport be counted among them? Thanks 2010-05-08

Re: BinaryMemtable and collisions

2010-05-07 Thread Jake Luciani
Any reason why you aren't using Lucandra directly? On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen wrote: > Greetings, > > Started getting my feet wet with Cassandra in earnest this week. I'm > building a custom inverted index of sorts on top of Cassandra, in part > inspired by the work of Jake Luc

Re: BinaryMemtable and collisions

2010-05-07 Thread Tobias Jungen
Without going into too much depth: Our retrieval model is a bit more structured than standard lucene retrieval, and I'm trying to leverage that structure. Some of the terms we're going to retrieve against have high occurrence, and because of that I'm worried about getting killed by processing large

Re: BinaryMemtable and collisions

2010-05-07 Thread Jake Luciani
Got it. I'm working on making term vectors optional and just store frequency in this case. Just FYI. On Sat, May 8, 2010 at 1:17 AM, Tobias Jungen wrote: > Without going into too much depth: Our retrieval model is a bit more > structured than standard lucene retrieval, and I'm trying to leverag

Re: Data Modeling Conundrum

2010-05-07 Thread vineet daniel
Query: why are you sorting? AFAIK Cassandra sorts the keys by itself if you are using order-preserving partitioning. And how do you store data pertaining to a single user who has several GUIDs attached? ___ Vineet Daniel ___