Hi everyone,
I am trying to develop a mapreduce job that does a simple
selection+filter on the rows in our store.
Of course it is mostly based on the WordCount example :)
Sadly, while it seems the app runs fine on a test keyspace with little
data, when run on a larger test index (but still on a
toronto :)
If not toronto, Virginia.
On Thu, May 6, 2010 at 5:28 PM, Jonathan Ellis wrote:
> We're planning that now. Where would you like to see one?
>
> On Thu, May 6, 2010 at 2:40 PM, S Ahmed wrote:
> > Do you have rough ideas when you would be doing the next one? Maybe in 1 or
> > 2 mo
Hi
What is the benefit of creating a bloom filter when Cassandra writes data? How
does it help?
___
Vineet Daniel
___
On 2010-05-07 10:51, vineet daniel wrote:
> What is the benefit of creating a bloom filter when Cassandra writes data?
> How does it help?
http://wiki.apache.org/cassandra/ArchitectureOverview
--
David Strauss
| da...@fourkitchens.com
Four Kitchens
| http://fourkitchens.com
| +1 512 454
Reston, VA is a good spot in the DC metro area for tech events. The recent
Pragmatic Programmer Clojure class sold out and already has two more return
visits planned.
On May 7, 2010, at 6:42 AM, S Ahmed wrote:
> toronto :)
>
> If not toronto, Virginia.
>
> On Thu, May 6, 2010 at 5:28 PM, Jo
> What is the benefit of creating a bloom filter when Cassandra writes data? How
> does it help?
It allows Cassandra to answer requests for non-existent keys without
going to disk, except in cases where the bloom filter gives a false
positive.
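To make the point above concrete, here is a minimal sketch in Python. It is not Cassandra's implementation: the class names `BloomFilter` and `SSTableSketch`, the filter size, and the use of SHA-1 for hashing are all assumptions chosen for illustration. It only shows the idea that a negative answer from the filter lets a read skip the disk, while a positive answer (possibly a false positive) still requires a lookup.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative bloom filter (sizes and hashing are assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class SSTableSketch:
    """Hypothetical stand-in for an on-disk table guarded by a bloom filter."""

    def __init__(self, rows):
        self._rows = dict(rows)   # pretend this dictionary lives on disk
        self.disk_reads = 0
        self.filter = BloomFilter()
        for key in self._rows:
            self.filter.add(key)

    def get(self, key):
        if not self.filter.might_contain(key):
            return None           # answered from memory, no disk access
        self.disk_reads += 1      # real hit or false positive: go to disk
        return self._rows.get(key)
```

A lookup for a key that was never written normally returns `None` without incrementing `disk_reads`; only a false positive costs a wasted read.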
See:
http://spyced.blogspot.com/2009/01/all-you-ever
Thanks David and Peter.
Is there any way to view the contents of this file?
___
Vineet Daniel
___
On Fri, May 7, 2010 at 4:24 PM, David Strauss wrote:
> On 2010-05-07 10:51, vineet daniel wrote:
On 2010-05-07 10:55, Peter Schüller wrote:
>> What is the benefit of creating a bloom filter when Cassandra writes data? How
>> does it help?
>
> It allows Cassandra to answer requests for non-existent keys without
> going to disk, except in cases where the bloom filter gives a false
> positive.
>
1. Peter said 'without going to disk', so does that mean bloom filters reside in
memory always, or only when a request to that particular CF is made?
2. "It is also important for identifying which SSTable files to look inside
even when a key is present." - David can you please throw some more light on
you
On 2010-05-07 11:03, vineet daniel wrote:
> 2. "It is also important for identifying which SSTable files to look inside
> even when a key is present." - David can you please throw some more
> light on your point, like what are the implications, why do we need to
> identify etc.
A bloom filter is a
On 2010-05-07 10:58, vineet daniel wrote:
> Is there any way to view the contents of this file?
Which file?
--
David Strauss
| da...@fourkitchens.com
Four Kitchens
| http://fourkitchens.com
| +1 512 454 6659 [office]
| +1 512 870 8453 [direct]
+1. There is some disagreement on whether or not the API should
return empty columns or skip rows when no data is found. In all of
our use cases, we would prefer skipped rows. And based on how
frequently new cassandra users appear to be confused about the current
behaviour, this might be a more
Sounds like you need to configure Hadoop to not create a whole bunch
of Map tasks at once
On Fri, May 7, 2010 at 3:47 AM, gabriele renzi wrote:
> Hi everyone,
>
> I am trying to develop a mapreduce job that does a simple
> selection+filter on the rows in our store.
> Of course it is mostly based
Huh? Isn't that the whole point of using Map/Reduce?
On Fri, May 7, 2010 at 8:44 AM, Jonathan Ellis wrote:
> Sounds like you need to configure Hadoop to not create a whole bunch
> of Map tasks at once
>
> On Fri, May 7, 2010 at 3:47 AM, gabriele renzi wrote:
>> Hi everyone,
>>
>> I am trying to
The problem could be that you are crunching more data than can be
processed within the task expiry interval.
In Hadoop you need to tell the task tracker that you are still
doing work, which is done by setting the status or incrementing a counter on
the Reporter object.
http://allthingshadoo
There's also the mapred.task.timeout property that can be tweaked. But
reporting is the correct way to fix timeouts during execution.
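As a rough illustration of reporting progress from a long-running map task: in a Java job this is done on the Reporter object mentioned above. The sketch below instead assumes a Hadoop Streaming mapper written in Python, where a task can update its status and counters by writing specially formatted lines to stderr. The function names, the counter group `sketch`, the counter name `rows_seen`, and the `report_every` interval are made up for the example.

```python
import sys


def report_status(message, stream=sys.stderr):
    # Hadoop Streaming treats this stderr line as a task status update.
    stream.write(f"reporter:status:{message}\n")
    stream.flush()


def increment_counter(group, counter, amount=1, stream=sys.stderr):
    # Hadoop Streaming treats this stderr line as a counter increment.
    stream.write(f"reporter:counter:{group},{counter},{amount}\n")
    stream.flush()


def run_mapper(lines, out=sys.stdout, err=sys.stderr, report_every=1000):
    """Emit (key, 1) per input row and report progress periodically."""
    for n, line in enumerate(lines, start=1):
        key = line.rstrip("\n").split("\t", 1)[0]
        out.write(f"{key}\t1\n")
        if n % report_every == 0:
            # Regular reports signal that the task is still making progress.
            increment_counter("sketch", "rows_seen", report_every, err)
            report_status(f"processed {n} rows", err)


if __name__ == "__main__":
    run_mapper(sys.stdin)
```

Note that this only addresses task-level timeouts; it would not help if the timeout comes from a slow call to the data store itself.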
On May 7, 2010, at 8:49 AM, Joseph Stein wrote:
> The problem could be that you are crunching more data than will be
> completed within the interval expire setti
Joseph, the stacktrace suggests that it's Thrift that's timing out,
not the Task.
Gabriele, I believe that your problem is caused by too much load on
Cassandra. Get_range_slices is presently an expensive operation. I
had some success in reducing (although, it turns out, not eliminating)
this prob
I like your idea about specifying it at the SP level.
On Fri, May 7, 2010 at 8:29 AM, Joost Ouwerkerk wrote:
> +1. There is some disagreement on whether or not the API should
> return empty columns or skip rows when no data is found. In all of
> our use cases, we would prefer skipped rows. And
The whole point is to parallelize to use the available capacity across
multiple machines. If you go past that point (fairly easy when you
have a single machine) then you're just contending for resources, not
making things faster.
On Fri, May 7, 2010 at 7:48 AM, Joost Ouwerkerk wrote:
> Huh? Isn'
On Fri, May 7, 2010 at 3:02 PM, Joost Ouwerkerk wrote:
> Joseph, the stacktrace suggests that it's Thrift that's timing out,
> not the Task.
>
> Gabriele, I believe that your problem is caused by too much load on
> Cassandra. Get_range_slices is presently an expensive operation. I
> had some succ
On Fri, May 7, 2010 at 2:53 PM, Matt Revelle wrote:
> There's also the mapred.task.timeout property that can be tweaked. But
> reporting is the correct way to fix timeouts during execution.
re: not reporting, I thought this was not needed with the new mapred
api (Mapper class vs Mapper interf
On May 7, 2010, at 9:40, gabriele renzi wrote:
On Fri, May 7, 2010 at 2:53 PM, Matt Revelle
wrote:
re: not reporting, I thought this was not needed with the new mapred
api (Mapper class vs Mapper interface), plus I can see that the
mappers do work, report percentage and happily terminate
On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis wrote:
> Sounds like you need to configure Hadoop to not create a whole bunch
> of Map tasks at once
Interesting, from a quick check it seems there are a dozen threads running.
Yet setNumMapTasks seems to be deprecated (together with JobConf)
and
On 5/6/10 3:26 PM, Stu Hood wrote:
Ian: I think that as get_range_slice gets faster, the approach that Mark was
heading toward may be considerably more efficient than reading the old value in
the OutputFormat.
Interesting, I'm trying to understand the performance impact of the
different wa
You can limit the number of concurrent map tasks per node:
mapred.tasktracker.map.tasks.maximum=1
On Fri, May 7, 2010 at 9:53 AM, gabriele renzi wrote:
> On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis wrote:
>> Sounds like you need to configure Hadoop to not create a whole bunch
>> of Map tasks at once
>
The number of map tasks for a job is a function of the InputFormat,
which in the case of ColumnFamilyInputFormat is a function of the global
number of keys in Cassandra. The number of concurrent maps being
executed at any given time per TaskTracker (per node) is set by
mapred.tasktracker.reduce.tasks.ma
It would be great if you could make a video of this event. Yes, it won't be
like being there 1-1, but it sure would help people get up to speed.
On Fri, May 7, 2010 at 6:56 AM, Matt Revelle wrote:
> Reston, VA is a good spot in the DC metro area for tech events. The recent
> Pragmatic Programmer Clojure
does anyone have a feel for how performant m/r operations are when
backed by cassandra as opposed to hdfs in terms of network utilization
and volume of data being pushed around?
jesse
--
jesse mcconnell
jesse.mcconn...@gmail.com
On Fri, May 7, 2010 at 08:54, Ian Kallen wrote:
> On 5/6/10 3:26
On Wed, 2010-05-05 at 11:31 -0700, Ed Anuff wrote:
> Follow-up from last week's discussion, I've been playing around with a simple
> column comparator for composite column names that I put up on github. I'd
> be interested to hear what people think of this approach.
>
> http://github.com/edanuf
+1
-Original Message-
From: S Ahmed [sahmed1...@gmail.com]
Received: 5/7/10 7:09 AM
To: user@cassandra.apache.org [u...@cassandra.apache.org]
Subject: Re: Cassandra training on May 21 in Palo Alto
It would be great if you could make a video of this event. Yes, it won't be like
being there
On Fri, May 7, 2010 at 5:29 AM, Joost Ouwerkerk wrote:
> +1. There is some disagreement on whether or not the API should
> return empty columns or skip rows when no data is found. In all of
> our use cases, we would prefer skipped rows. And based on how
> frequently new cassandra users appear t
On Thu, May 6, 2010 at 11:10 PM, Mike Malone wrote:
>
> The upshot is, the Cassandra data model would go from being "it's a nested
> dictionary, just kidding no it's not!" to being "it's a nested dictionary,
> for serious." Again, these are all just ideas... but I think this simplified
> data m
Probably only me... but we have seen higher latencies when using VMware.
I think it also depends on the hardware and VM configuration; I still have to
figure out why. (You might also try mixing the applications that run on the
hardware.) I think there are people who run it on Amazon's EC2.
Regards,
On
We have a Lead Datastore Engineer position at Gazillion Entertainment,
looking for someone with Cassandra (or similar) experience.
Please feel free to ping me if you have any questions. Details here:
http://tbe.taleo.net/NA5/ats/careers/requisition.jsp?org=NR2B&cws=1&rid=334
Thanks,
I've got two (out of five) nodes on my cassandra ring that somehow got
too full (e.g. over 60% disk space utilization). I've now gotten a few
new machines added to the ring, but every time one of the overfull nodes
attempts to stream its data it runs out of disk space... I've tried half
a dozen
I have a super column family for "topic", key being the name of the topic.
When I retrieve the rows, they are not sorted by key. Are row keys
sorted in Cassandra by default?
-aj
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto,
Columns are sorted (see CompareWith/CompareSubcolumnsWith); keys are not.
On 7 May 2010, at 10:10 PM, AJ Chen wrote:
> I have a super column family for "topic", key being the name of the topic.
> CompareSubcolumnsWith="BytesType" />
> When I retrieve the rows, the rows are not sorted by the key.
Your IPartitioner implementation decides how the row keys are sorted: see
http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You need to
be using one of the OrderPreservingPartitioners if you'd like a reasonable
order for the keys.
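A small sketch of why the partitioner matters for key order. This is a simplification, not Cassandra's code: it assumes a hash-based partitioner places rows by a token derived from the MD5 of the key, and that an order-preserving partitioner places rows by the key itself. The function names and the `partitioner` labels are hypothetical.

```python
import hashlib


def random_partitioner_token(key):
    # Assumed token derivation for a hash-based partitioner: the absolute
    # value of the key's MD5 digest read as a signed big-endian integer.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def scan_order(keys, partitioner):
    """Return keys in the order a range scan would visit them (sketch)."""
    if partitioner == "random":
        # Rows are ordered by token, which is unrelated to key order.
        return sorted(keys, key=random_partitioner_token)
    if partitioner == "order_preserving":
        # Rows are ordered by the key itself.
        return sorted(keys)
    raise ValueError(f"unknown partitioner: {partitioner}")


if __name__ == "__main__":
    topics = ["cassandra", "hadoop", "thrift", "bloom"]
    print(scan_order(topics, "random"))
    print(scan_order(topics, "order_preserving"))
```

With the hash-based ordering, the result is stable but looks arbitrary to the client, which matches the "rows are not sorted by the key" observation.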
-Original Message-
From: "AJ Chen"
Se
If you're using RackUnawareStrategy (the default replication strategy)
then you can "bootstrap" manually fairly easily -- copy all the data
(not system) sstables from an overfull machine to a new machine,
assign the new one a token that gives it about half of the old node's
range, then start it wit
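For the token step described above, a hedged sketch of picking a token that splits a node's range roughly in half. It assumes a ring where each node owns the range from the previous node's token (exclusive) up to its own token (inclusive), and a token space of size 2**127 as with a hash-based partitioner; the helper name `midpoint_token` is made up.

```python
RING_SIZE = 2 ** 127  # assumed token space of a hash-based partitioner


def midpoint_token(prev_token, old_token, ring_size=RING_SIZE):
    """Token halfway through the range (prev_token, old_token] on the ring."""
    span = (old_token - prev_token) % ring_size
    if span == 0:
        # A single node owns the entire ring.
        span = ring_size
    return (prev_token + span // 2) % ring_size


if __name__ == "__main__":
    # The overfull node has token 100; its predecessor has token 0.
    print(midpoint_token(0, 100))
```

A new node given this token would take over the lower half of the old node's range, leaving the old node with the upper half.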
Hi all,
Apologies if I'm still stuck in RDBMS mentality - first project using
Cassandra!
I'll be using Cassandra to store quite a lot (10s of millions) of records,
each of which has a type.
I'll want to query the records to get all of a certain type; it's an
analogous situation to the TaggedPosts
Greetings,
Started getting my feet wet with Cassandra in earnest this week. I'm
building a custom inverted index of sorts on top of Cassandra, in part
inspired by the work of Jake Luciani in Lucandra. I've successfully loaded
nearly a million documents over a 3-node cluster, and initial query test
>
> So my question is: If I properly flush every node after performing a larger
> bulk insert, can Cassandra merge multiple writes on a single row & column
> family when using the BMT interface? Or is using BMT only feasible for
> loading data on rows that don't exist yet?
>
Yes. When you flu
thanks, that works. -aj
On Fri, May 7, 2010 at 1:17 PM, Stu Hood wrote:
> Your IPartitioner implementation decides how the row keys are sorted: see
> http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You
> need to be using one of the OrderPreservingPartitioners if you'd like a
There is a benchmark comparing Cassandra to Voldemort performance here:
http://blog.medallia.com/2010/05/choosing_a_keyvalue_storage_sy.html
--
Kristian
List,
I have a case where visitors to a site are tracked via a persistent cookie
containing a guid. This cookie is created and set when missing. Some of these
visitors are logged in, meaning a userId may also be available. What I’m
looking to do is have a way to associate each userId with all of
i did a lot of comparisons between voldemort and cassandra and in the end i
decided to go with cassandra. the main reason was recovery and balancing
operations. on the surface voldemort is s*** hot fast, until you need to
restore a node or add nodes. BDB (the default persistence solution) isn
> Yes. When you flush from BMT, it's like any other SSTable. Cassandra will
> merge them through compaction.
>
>
That's good news, thanks for clarifying!
A few more related questions:
Are there any problems with issuing the flush command directly from code at
the end of a bulk insert? The BMT exam
Hi everyone,
Can anyone shed some light on the benefits of using framed transport over
non-framed transport?
We are trying to sum up some performance tuning approaches for Cassandra in our
project.
Can framed transport be counted among them?
Thanks
2010-05-08
Any reason why you aren't using Lucandra directly?
On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen wrote:
> Greetings,
>
> Started getting my feet wet with Cassandra in earnest this week. I'm
> building a custom inverted index of sorts on top of Cassandra, in part
> inspired by the work of Jake Luc
Without going into too much depth: Our retrieval model is a bit more
structured than standard lucene retrieval, and I'm trying to leverage that
structure. Some of the terms we're going to retrieve against have high
occurrence, and because of that I'm worried about getting killed by
processing large
Got it. I'm working on making term vectors optional and just store
frequency in this case. Just FYI.
On Sat, May 8, 2010 at 1:17 AM, Tobias Jungen wrote:
> Without going into too much depth: Our retrieval model is a bit more
> structured than standard lucene retrieval, and I'm trying to leverag
Query: why are you sorting? AFAIK Cassandra sorts the keys by itself if you
are using an order-preserving partitioner. And how do you store data pertaining
to a single user who has several GUIDs attached?
___
Vineet Daniel
___