Re: writes to Cassandra failing occasionally

2010-04-09 Thread Ted Zlatanov
On Thu, 8 Apr 2010 10:56:55 -0500 Jonathan Ellis  wrote: 

JE> is N::C::E possibly ignoring thrift exceptions?

I always pass them down to the user.  The user is responsible for
wrapping calls in eval().

Ted



Re: writes to Cassandra failing occasionally

2010-04-09 Thread Ted Zlatanov
On Thu, 08 Apr 2010 12:16:34 -0700 Mike Gallamore wrote:

MG> Hopefully my fix helps others. I imagine it is something you'll run
MG> into regardless of the language/interface you use; for example, I'm
MG> pretty sure that the C/C++ time function truncates values too. I'd
MG> recommend that anyone using time to generate a timestamp be careful
MG> that the timestamp is always the same length (or at least that the
MG> sub-components you are concatenating are the length you expect
MG> them to be).

This was a Perl-related bug so I doubt others will see it.  It's really
caused by the fact that 32-bit Perl doesn't have native 64-bit
pack/unpack functions, so I'm using the Bit::Vector wrappers and
consequently passing Longs around as strings.
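
To illustrate the class of bug, here is a minimal C# sketch (names are
illustrative, not from N::C::Easy): a 64-bit microsecond timestamp built
by concatenating string components goes wrong whenever the sub-second
part has leading zeros.

using System;

class TimestampPadding
{
    // BUG: when micros < 100000 its leading zeros are lost, so the
    // concatenated "64-bit" number is too short and numerically wrong.
    static string Broken(long seconds, long micros)
    {
        return seconds.ToString() + micros.ToString();
    }

    // FIX: zero-pad the sub-second component to a fixed width.
    static string Fixed(long seconds, long micros)
    {
        return seconds.ToString() + micros.ToString().PadLeft(6, '0');
    }

    static void Main()
    {
        Console.WriteLine(Broken(1270857600, 42)); // 127085760042     (wrong)
        Console.WriteLine(Fixed(1270857600, 42));  // 1270857600000042 (right)
    }
}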

MG> I've written a patch that zero-pads the numbers. I've attached it to
MG> this post, but in case attachments don't come through on this
MG> mailing list, here is the body:

Thanks so much for catching this.  I didn't notice it at all (it works
90% of the time!).  I uploaded N::C::Easy 0.10 to CPAN with the fix you
proposed, so timestamps are now produced correctly.

Ted



Re: writes to Cassandra failing occasionally

2010-04-09 Thread Ted Zlatanov
On Thu, 08 Apr 2010 11:50:38 -0700 Mike Gallamore wrote:

MG> Yes, I agree single-threaded is probably not the best. I wonder how
MG> much of a performance hit it is on a single-CPU machine though? I
MG> guess I would still be blocking on RAM writes, but it isn't like there
MG> are multiple CPUs I need to keep busy or anything.

Cassandra may have to load data from disk for a particular query but
another may already be in memory.  A third may cause a hit on another
cluster node.  So if you issue queries serially you'll see performance
drop off with the total number of queries because they are dependent on
each other's performance, while the distribution of the performance of
independent parallel queries will have skew and kurtosis much closer to
a normal distribution.  In other words, your slowest (or unluckiest)
queries are less damaging when you issue them in parallel.
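
As a concrete sketch (in C#, since that is the client language elsewhere
in this thread; RunQuery is a stand-in for whatever Thrift call you make,
and the scheduling details are an illustration, not a prescription):

using System.Linq;
using System.Threading.Tasks;

class ParallelQueries
{
    // Hypothetical stand-in for a Thrift get/insert call.
    static string RunQuery(int key)
    {
        return "value-" + key;
    }

    static void Main()
    {
        int[] keys = Enumerable.Range(0, 1000).ToArray();

        // Serial: total time is the sum of every query's latency, so a
        // few slow queries drag the whole batch down.
        foreach (int k in keys) RunQuery(k);

        // Parallel: slow queries overlap with fast ones, so batch time
        // tracks the average latency rather than the sum of the tails.
        Task[] tasks = keys.Select(k => Task.Run(() => RunQuery(k))).ToArray();
        Task.WaitAll(tasks);
    }
}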

On the client side you still have slow serialization/deserialization and
not much can be done about that.

Ted



Re: writes to Cassandra failing occasionally

2010-04-09 Thread Mike Gallamore
That makes sense. My data is coming in from the internet and is being
processed in chunks, since it uses ActiveMQ with the Stomp package. I'm
getting the log lines in 20-1000 line chunks (depending on the busyness of
customer sites), so there definitely is the potential for a lot of
parallelism. Some of my data will likely be in cache during the write
because of the nature of the work: it's a reputation system, so I first get
a query from the customer for the reputation, and then within a minute or
so I'll get feedback from them about what the current event's "score" was,
which feeds back into the system to update the value. Anyway, lots of
parallelism opportunities.

2010/4/9 Ted Zlatanov 

> On Thu, 08 Apr 2010 11:50:38 -0700 Mike Gallamore wrote:
>
> Cassandra may have to load data from disk for a particular query but
> another may already be in memory.  A third may cause a hit on another
> cluster node.  In other words, your slowest (or unluckiest) queries are
> less damaging when you issue them in parallel. ...


Re: Basic question

2010-04-09 Thread Jonathan Ellis
On Thu, Apr 8, 2010 at 12:09 AM, Palaniappan Thiyagarajan wrote:
> I am investigating how we can use Cassandra in our application.  We have
> tokens and session information stored in db now and I am thinking of moving
> to Cassandra.   Currently it’s write and read intensive and having
> performance issue.  Is it good idea to move couple of tables and integrate
> with application?

Sure.

> How do we find out which tables are best candidate for Cassandra?

Use the reporting tools in your existing database to figure out which of
your highest-volume tables don't require things like transactions.

> I live in bay area and like to know if any group meet in bay area so that I
> can participate and understand more about Cassandra.

There's http://cassandrahackathon.eventbrite.com/, but it looks like
it's full now.

-Jonathan


Re: Very new user needs some troubleshooting pointers

2010-04-09 Thread Jonathan Ellis
A single-threaded test is meaningless.  You need a multithreaded (or
multiprocess) benchmark like the one in contrib/py_stress.

Picture worth 1000 words: http://spyced.blogspot.com/2010/01/cassandra-05.html
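
If you want to stay in C#, a minimal multithreaded harness would look
something like the sketch below (CreateClient is a hypothetical helper
wrapping the generated Thrift connection setup; Thrift clients are not
thread-safe, so each thread gets its own connection):

using System.Threading;

class MultiThreadedBench
{
    const int Threads = 40;
    const int OpsPerThread = 250;   // 10,000 inserts total

    static void Main()
    {
        var workers = new Thread[Threads];
        for (int t = 0; t < Threads; t++)
        {
            workers[t] = new Thread(() =>
            {
                var client = CreateClient("cassandra-host", 9160);
                for (int i = 0; i < OpsPerThread; i++)
                {
                    // client.batch_insert(keyspace, key, mutations,
                    //                     ConsistencyLevel.ONE);
                }
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();
    }

    // Assumption: wraps TSocket/TBinaryProtocol from the generated code.
    static object CreateClient(string host, int port) { return null; }
}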

On Thu, Apr 8, 2010 at 3:59 PM, Heath Oderman  wrote:
> Hi All,
> I'm brand new to Cassandra and know absolutely nothing, so please forgive me
> in advance.
> A friend and I have each set up a few Cassandra stand-alone nodes, completely
> default.
> His: Mac OSX Snow Leopard
>      Mac Book Pro
>      Intel Duo Core
>      4GB Ram
>      5400 rpm disk
> Mine: debian 5.x (lenny) with the deb pack from
> http://www.apache.org/dist/cassandra/debian
>      2  Desktops
>      Intel duo core
>      4GB ram
>      7200 sata drives
>     1 blade
>      8gb ram
>      1 rpm disk
>      dual xeon
>     (i have a windows box too like the 2 desktops)
>
>     (each of those machines is stand alone)
>
> My debian boxes are brand new installs, nothing else running, purely console
> environments, only SSH & Cassandra installed.
> The Cassandra configs are the *default configs* with only 'ListenAddress'
> and 'ThriftAddress' changed to the ext ip for those boxes.
> We generated a C# library with Thrift to connect to these servers.  We wrote
> a simple c# app that loops 10,000 times and does a
>          _client.batch_insert(_keyspace, map.Key.GetValue(o,
> null).ToString(), dict, ConsistencyLevel.ONE);
> "batch_insert" I guess is the key bit up there.
> The reason that I'm writing is that the batch_insert call takes 400,000
> ticks every time it is called when running against the debian boxes.  Any of
> them.
> The result is that 10,000 inserts against his machine takes about 30
> seconds, and it takes about 1 min 45 seconds against any of my servers.
>  (longer against the windows 7 server.)
> The MacBookPro is faster where I would expect it to be slower.  (the macbook
> pro is his laptop and he's running mail and all kinds of other stuff
> simultaneously.)
> I'm on a gigabit network, iostat / top / bmon all show that the Cassandra
> server isn't working very hard.
> Performance mon on my windows client shows my computer running the loop is
> hardly working.
> I am writing to you to ask where I might go to get information on comparing
> the environments, improving my performance, etc.  I've been googling all day
> and haven't been able to figure anything out.
> If this is the wrong forum, sorry!
> Thanks for any help/suggestions you might have.
> Stu
>
>
>
>


Re: Worst case #iops to read a row

2010-04-09 Thread Jonathan Ellis
worst case is 2 or 3, depending on row size:

one seek to read the right row index block
one seek to read the row header (bloom filter + column index)
if it's a big row, one seek to read the column block (block size is
configurable, default is 256KB)

On Thu, Apr 8, 2010 at 5:21 PM, Scott Shealy  wrote:
> Not knowing anything about the physical layout of the data on disk or
> how it is accessed when it is read... could someone who does help
> estimate the worst-case scenario (no caching at any level) for the number of
> iops to read a row of modest size and a modest number of columns in a
> large column family.
>
> TIA,
>
> S.
>
>


Re: Very new user needs some troubleshooting pointers

2010-04-09 Thread Heath Oderman
Thanks for the reply Jonathan!

I started with multithreaded tests, but when my performance was so much
slower than my buddy's I switched to one thread to try to isolate and
identify the differences.  I got tunnel vision and kept on with the
one-thread tests.

I'll modify the tests and try again.

Thanks,
Stu

On Fri, Apr 9, 2010 at 11:34 AM, Jonathan Ellis  wrote:

> A single-threaded test is meaningless.  You need a multithreaded (or
> multiprocess) benchmark like the one in contrib/py_stress.
>
> Picture worth 1000 words:
> http://spyced.blogspot.com/2010/01/cassandra-05.html
>
> On Thu, Apr 8, 2010 at 3:59 PM, Heath Oderman  wrote:
> > Hi All,
> > I'm brand new to Cassandra and know absolutely nothing, so please forgive
> > me in advance.
> > A friend and I have each set up a few Cassandra stand-alone nodes,
> > completely default. ...


RE: Very new user needs some troubleshooting pointers

2010-04-09 Thread Mark Jones
Sounds like we are experiencing some of the same problems. (I'm using 0.6RC1) I
have a 3 node cluster with 8GB/machine (dual core CPU).  I'm peaking on inserts
at about 6000-7000/second running 40 threads.  Separate spindles for commitlog
and data.

My read speed is atrocious, 800/sec sustained (starts off at 1800+/second and
falls back to 800/sec).  Of course that is only if I read from the "correct"
node.  Depending on the moment, 2 of the nodes will return 1-2/second instead
of 800, and only one node will return 800/second.  And if I spread the reads
across many nodes, all the performance drops.   nodetool loadbalance can change
which node is the "golden" node, but I don't know why.  I have doubled the # of
concurrent read threads and seen some performance improvement (that was the
last thing I tried, and eked out another 150/second).

So much about Cassandra makes me WANT it to work. I mean, look at the fact that
all nodes are essentially equal, that it replicates from rack to rack, from DC
to DC. Now, if I could just make it perform.

My machines are basically idle (a large amount of IOWait, but the time is spent 
in the pending queue, vs the device svctime).  So far I've got little insight 
into what could be wrong, I've increased the key cache 10X using JConsole but 
the hit rate is still at times abysmal.

I'm writing 400-800 byte blobs with an 8 byte key (supercolumn) and a 12 byte 
"subkey", then a 5 byte column name, something that would seem to be right up 
Cassandra's alley.

Right now I'm reworking my test to dump it into MySQL on the same machines, so 
I can compare the two for speed, because either I've got crap for hardware, or 
there is something rotten in Denmark.

From: Heath Oderman [mailto:he...@526valley.com]
Sent: Friday, April 09, 2010 10:40 AM
To: user@cassandra.apache.org
Subject: Re: Very new user needs some troubleshooting pointers

Thanks for the reply Jonathan!

I started with multithreaded tests, but when my performance was so much slower
than my buddy's I switched to one thread to try to isolate and identify the
differences.  I got tunnel vision and kept on with the one-thread tests.

I'll modify the tests and try again.

Thanks,
Stu

On Fri, Apr 9, 2010 at 11:34 AM, Jonathan Ellis wrote:
A single-threaded test is meaningless.  You need a multithreaded (or
multiprocess) benchmark like the one in contrib/py_stress.

Picture worth 1000 words: http://spyced.blogspot.com/2010/01/cassandra-05.html
...


Re: Very new user needs some troubleshooting pointers

2010-04-09 Thread Jonathan Ellis
If you're only seeing 1-2 RPS then you should turn on debug logging to
see where the latency is.
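
(Assuming the stock 0.6 layout, that means editing conf/log4j.properties:

log4j.rootLogger=DEBUG,stdout,R

the appender names after the comma may differ in your build; restart and
watch the per-request output.)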

On Fri, Apr 9, 2010 at 11:14 AM, Mark Jones  wrote:
> Sounds like we are experiencing some of the same problems. (I'm using 0.6RC1)
> I have a 3 node cluster with 8GB/machine (dual core CPU).  I'm peaking on
> inserts at about 6000-7000/second running 40 threads.  Separate spindles for
> commitlog and data.
>
> My read speed is atrocious, 800/sec sustained (starts off at 1800+/second
> and falls back to 800/sec).  Of course that is only if I read from the
> "correct" node. ...
Re: RE: Very new user needs some troubleshooting pointers

2010-04-09 Thread Heath Oderman
What's interesting in my case is that I put a timer around the Thrift
batch_insert method.

Every iteration of that call against debian (any hardware, on the same
network or in the Amazon cloud, with a Windows client in EC2 as well) takes
400,000 ticks.  Super consistent.  One thread.

My friend's setup with cassandra on osx takes 400,000 ticks for the first
insert, then drops to 20,000 ticks for every consecutive call.

That's what is so strange.

On Apr 9, 2010 12:15 PM, "Mark Jones"  wrote:

Sounds like we are experiencing some of the same problems. (I'm using 0.6RC1)
I have a 3 node cluster with 8GB/machine (dual core CPU).  I'm peaking on
inserts at about 6000-7000/second running 40 threads.  Separate spindles for
commitlog and data. ...


Re: Worst case #iops to read a row

2010-04-09 Thread Ryan King
On Fri, Apr 9, 2010 at 8:39 AM, Jonathan Ellis  wrote:
> worst case is 2 or 3, depending on row size:
>
> one seek to read the right row index block
> one seek to read the row header (bloom filter + column index)
> if it's a big row, one seek to read the column block (block size is
> configurable, default is 256KB)

This is all per-sstable that contains the row, right?

-ryan


RE: RE: Very new user needs some troubleshooting pointers

2010-04-09 Thread Mark Jones
I'm seeing an average write time of 20-30ms/insert between the 60 and 67
million row mark.
(I think at this point I was actually running 80 threads simultaneously, two
40-thread clients.)
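
(As a sanity check on those numbers: 80 concurrent requests at ~25 ms each
works out to roughly 80 / 0.025 s ≈ 3,200 inserts/second, about half the
6000-7000/second peak from earlier in the run.)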

From: Heath Oderman [mailto:he...@526valley.com]
Sent: Friday, April 09, 2010 11:23 AM
To: user@cassandra.apache.org
Subject: Re: RE: Very new user needs some troubleshooting pointers


What's interesting in my case is that I put a timer around the Thrift
batch_insert method.

Every iteration of that call against debian (any hardware, on the same
network or in the Amazon cloud, with a Windows client in EC2 as well) takes
400,000 ticks.  Super consistent.  One thread.

My friend's setup with cassandra on osx takes 400,000 ticks for the first
insert, then drops to 20,000 ticks for every consecutive call.

That's what is so strange.
On Apr 9, 2010 12:15 PM, "Mark Jones" wrote:
Sounds like we are experiencing some of the same problems. (I'm using 0.6RC1)
I have a 3 node cluster with 8GB/machine (dual core CPU).  I'm peaking on
inserts at about 6000-7000/second running 40 threads.  Separate spindles for
commitlog and data. ...


Re: RE: Very new user needs some troubleshooting pointers

2010-04-09 Thread Jonathan Ellis
The JIT on Debian may take longer to warm up by default.

Do 100k ops first before benchmarking.

Benchmark with multiple threads.

And use a known benchmark first like py_stress.
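
For the C# harness, something like the sketch below (DoInsert is a stand-in
for the batch_insert call; the 100k figure is the warm-up suggested above):

using System;
using System.Diagnostics;

class WarmupBench
{
    static void Main()
    {
        // Warm-up: give the server-side JIT time to compile its hot paths.
        for (int i = 0; i < 100000; i++) DoInsert(i);

        // Timed phase.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000; i++) DoInsert(i);
        sw.Stop();
        Console.WriteLine("avg ticks/insert: " + sw.ElapsedTicks / 10000);
    }

    // Stand-in for the Thrift batch_insert call.
    static void DoInsert(int i) { }
}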

On Fri, Apr 9, 2010 at 11:23 AM, Heath Oderman  wrote:
> What's interesting in my case is that I put a timer around the Thrift
> batch_insert method.
>
> Every iteration of that call against debian (any hardware, on the same
> network or in the Amazon cloud, with a Windows client in EC2 as well)
> takes 400,000 ticks.  Super consistent.  One thread.
>
> My friend's setup with cassandra on osx takes 400,000 ticks for the first
> insert, then drops to 20,000 ticks for every consecutive call. ...


Re: Worst case #iops to read a row

2010-04-09 Thread Jonathan Ellis
Right.
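
In concrete terms (a sketch of the arithmetic, using the 2-or-3-seeks
figure from earlier in this thread):

// Worst-case seeks for an uncached read.
static int WorstCaseSeeks(int sstablesWithRow, bool bigRow)
{
    int perSSTable = bigRow ? 3 : 2;  // index block + row header (+ column block)
    return sstablesWithRow * perSSTable;
}
// e.g. a >256KB row present in 4 sstables: WorstCaseSeeks(4, true) == 12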

On Fri, Apr 9, 2010 at 11:23 AM, Ryan King  wrote:
> On Fri, Apr 9, 2010 at 8:39 AM, Jonathan Ellis  wrote:
>> worst case is 2 or 3, depending on row size:
>>
>> one seek to read the right row index block
>> one seek to read the row header (bloom filter + column index)
>> if it's a big row, one seek to read the column block (block size is
>> configurable, default is 256KB)
>
> This is all per-sstable that contains the row, right?
>
> -ryan
>


Re: RE: Very new user needs some troubleshooting pointers

2010-04-09 Thread Heath Oderman
Will do, thanks for the advice. :)

On Apr 9, 2010 12:28 PM, "Jonathan Ellis"  wrote:

The jit on debian may take longer to warm up by default.

Do 100k ops first before benchmarking.

Benchmark with multiple threads.

And use a known benchmark first like py_stress.


On Fri, Apr 9, 2010 at 11:23 AM, Heath Oderman  wrote:
> What's interesting fo...


Grails Cassandra plugin 0.6 compatible

2010-04-09 Thread Ned Wolpert
Folks-

  Completed an upgrade to the Grails Cassandra plugin I've been working on to
make it compatible with Cassandra 0.6. Let me know if anyone is having
trouble with it.

Thanks

-- 
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..."   --Marlowe


How to perform queries on Cassandra?

2010-04-09 Thread Onur AKTAS

Hi,

I want to use Cassandra for a new project. As you can guess I have an RDBMS
background, and I do not have any experience with NoSQL databases except
key/value in-memory data grids/caches (Oracle Coherence, Memcached).

I'm trying to find out how you perform queries with calculations done on the
fly, without inserting the data pre-calculated from the beginning.

Let's say we have the latitude and longitude coordinates of all users, and we
have a Distance(from_lat, from_long, to_lat, to_long) function which gives the
distance between lat/long pairs in kilometers.

Ex: user1_lat = 40, user1_long = 20; user2_lat = 30, user2_long = 50

So, if we want to do the same operation in a regular RDBMS, we can use this
kind of query to get users near user_1's location:

* select user from users where Distance(40, 20, user.lat, user.long) < 5

How do we do this kind of operation in Cassandra?

If we insert the data pre-calculated from the beginning, and let's say we have
1 million users, do we need to do 1 million insert operations just to update
one user's coordinates? (Of course not, but then how?)

I believe calculations of huge complexity are possible with Cassandra, but I
do not know about querying beyond accessing the data by its key.

Thanks,

Questions regarding network topography and automated adminstration

2010-04-09 Thread Todd Nine
Hi all,
  First off, thanks for putting out such a great product and documentation.
 I had a node up and running on CentOS in 10 minutes, and had our C# app
communicating with it in 10 more!

Now that I have basic prototyping working, I have a few networking and data
centre configuration questions.  We will be installing our systems in 2 data
centres.  One will be in the US, the second will be in New Zealand or Australia.
Our system processes data from a satellite network.  The data will be sent
to the data centre that the user is closest to.  North American and EU
customers to the US, and all others will go to the NZ/AU servers.

I want the data that is written from the different partitioned processing
nodes (our c# app servers) to be available in both data centres.  I'm
assuming I would need an equal number of nodes at each data centre, then use
the RackAwareStrategy so that data is replicated across both locations.
 Both locations would need the same cluster name, is this correct?


Is there a way to secure the communication between data centres?  Given that
they will be on different sides of the world, I can't guarantee a secure
channel between them.


How is authorization of a new node in a cluster accomplished (if possible)?
 Is it currently done via firewall and cluster IPs, or can that be managed
in Cassandra internally?


Is there any sort of management interface for deploying nodes and
configuring peers?


For ease of administration if I have 10 nodes or more, can I have 2 peer IP
addresses per node in its configuration, and deploy the nodes in overlapping
groups of 3?  I'm assuming once a node connects to another, it automatically
receives all node information about the cluster; is this correct?
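
(For reference, I'm assuming the peer list is the seed list in
storage-conf.xml, something like the sketch below; ListenAddress and
ThriftAddress are the elements we already changed, the IPs are made up.
Please correct me if that's the wrong knob:

<Seeds>
    <Seed>10.0.0.1</Seed>
    <Seed>10.0.0.2</Seed>
</Seeds>
<ListenAddress>10.0.0.3</ListenAddress>
<ThriftAddress>10.0.0.3</ThriftAddress>
)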


Last, are there any tools out there that allow user data mining?  We'll
obviously need to document how our application persists data well so that
external applications can read the data.  Our sales and accounting teams use
our current MS SQL system to perform some data mining via SQL.  Giving them
an interface to allow them to query data (in any query language) is a must
for our migration.

Thanks in advance,

Todd


Re: How to perform queries on Cassandra?

2010-04-09 Thread Paul Prescod
2010/4/9 Onur AKTAS :
> ...
> I'm trying to find out how you perform queries with calculations done on
> the fly, without inserting the data pre-calculated from the beginning.
> Let's say we have the latitude and longitude coordinates of all users, and
> we have a Distance(from_lat, from_long, to_lat, to_long) function which
> gives the distance between lat/long pairs in kilometers.

I'm not an expert, but I think that it boils down to "MapReduce" and "Hadoop".

I don't think that there's any top-down tutorial on those two words,
you'll have to research yourself starting here:

 * http://en.wikipedia.org/wiki/MapReduce

 * http://hadoop.apache.org/

 * http://wiki.apache.org/cassandra/HadoopSupport

I don't think it is all documented in any one place yet...

 Paul Prescod


Re: How to perform queries on Cassandra?

2010-04-09 Thread malsmith


It's sort of an interesting problem - in an RDBMS one relatively simple
approach would be to calculate a rectangle that is X km by Y km with User
1's location at the center.  So the rectangle is UserX - 10KmX ,
UserY-10KmY to UserX+10KmX , UserY+10KmY

Then you could query the database for all other users where each user
considered has curUserX > UserX-10KmX and curUserX < UserX+10KmX and
curUserY > UserY-10KmY and curUserY < UserY+10KmY
* Note: the 10KmX and 10KmY are really a translation from kilometers to
degrees of lat and longitude (which you can find via a Google search)

With the right indexes this query actually runs pretty well.   

Translating that to Cassandra seems a bit complex at first - but you
could try something like pre-calculating a grid with the right
resolution (like a square of 5KM per side) and assigning every user to a
particular grid ID.  That way you just calculate which grid ID User1 is
in, then do a direct key lookup to get a list of the users in that same
grid id.
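
Here is that grid-ID idea sketched in code (the cell size in degrees is an
assumption, roughly 5KM at mid latitudes; the key format is invented):

// Quantize lat/long into a grid cell and use the cell id as the row key.
static string GridId(decimal lat, decimal lon)
{
    const decimal cellDeg = 0.045M;   // ~5 km at mid latitudes (assumption)
    long row = (long)System.Math.Floor(lat / cellDeg);
    long col = (long)System.Math.Floor(lon / cellDeg);
    return row + ":" + col;           // e.g. "888:444" as the Cassandra key
}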

A second approach would be to have two column families -- one that maps a
latitude to a list of users who are at that latitude, and a second that
maps a longitude to the users who are at that longitude.  You could do the
same rectangle calculation above, then do a get_slice range lookup to get a
list of users from the range of latitudes and a second list from the range
of longitudes.  You would then need to do an in-memory nested loop to find
the list of users that are in both lists.  This second approach could
cause some trouble depending on where you search and how many users you
really have -- some latitudes and longitudes have many many people in
them.

So, it seems some version of a chunking / grid id thing would be the
better approach.   If you let people zoom in or zoom out - you could
just have different column families for each level of zoom.


I'm stuck on a stopped train so -- here is even more code:

static Decimal GetLatitudeMiles(Decimal lat)
{
    Decimal f = 68.99M;
    lat = Math.Abs(lat);
    if (lat >= 0.0M && lat < 10.0M) { f = 68.71M; }
    else if (lat >= 10.0M && lat < 20.0M) { f = 68.73M; }
    else if (lat >= 20.0M && lat < 30.0M) { f = 68.79M; }
    else if (lat >= 30.0M && lat < 40.0M) { f = 68.88M; }
    else if (lat >= 40.0M && lat < 50.0M) { f = 68.99M; }
    else if (lat >= 50.0M && lat < 60.0M) { f = 69.12M; }
    else if (lat >= 60.0M && lat < 70.0M) { f = 69.23M; }
    else if (lat >= 70.0M && lat < 80.0M) { f = 69.32M; }
    else if (lat >= 80.0M) { f = 69.38M; }
    return f;
}

Decimal MilesPerDegreeLatitude = GetLatitudeMiles(zList[0].Latitude);
// Note: Math.Cos expects radians, so convert the latitude from degrees first.
Decimal MilesPerDegreeLongitude = ((Decimal) Math.Abs(Math.Cos((Double)
    zList[0].Latitude * Math.PI / 180.0))) * 24900.0M / 360.0M;
Decimal dRadius = 10.0M;  // ten miles
Decimal deltaLat = dRadius / MilesPerDegreeLatitude;
Decimal deltaLong = dRadius / MilesPerDegreeLongitude;

ps.TopLatitude = zList[0].Latitude - deltaLat;
ps.TopLongitude = zList[0].Longitude - deltaLong;
ps.BottomLatitude = zList[0].Latitude + deltaLat;
ps.BottomLongitude = zList[0].Longitude + deltaLong;



On Fri, 2010-04-09 at 16:30 -0700, Paul Prescod wrote: 

> 2010/4/9 Onur AKTAS :
> > ...
> > I'm trying to find out how you perform queries with calculations done on
> > the fly, without inserting the data pre-calculated from the beginning. ...




Re: How to perform queries on Cassandra?

2010-04-09 Thread Mike Gallamore
I apologize in advance if this goes into esoteric algorithms a bit too
much, but I think this will get to an interesting idea to solve your
problem. My background is physics, particularly computer simulations of
complex systems. Anyway, in cosmology an interesting algorithm is called
an n-body tree code (it's been around for at least 20 years, so a lot is
available online about it). Since every object with mass (well, in
general relativity actually anything with energy, but I digress)
interacts with every other object with mass, you end up with the
"n-body" problem. The number of interactions in a system goes as n(n-1)
~= n^2, where n is the number of elements. This leads to a nightmare when
doing simulations of large systems, say two galaxies colliding: 1 billion
times (1 billion minus one) is huge and effectively incalculable, since
you would have to calculate this each time you wanted to increment the
simulation a tiny bit ahead in time. How do you get a reasonable
approximation to the solution? The answer, or at least one of them, is
n-body "tree codes".


You take advantage of the fact that the force that one star feels
from another falls off as 1/r^2 and, importantly, two stars far away from
the first star but relatively close together have roughly the same
magnitude and direction of the "r" vector. So you can simply clump them
together, i.e. sum their masses, and the force is GM1(M"sum")/r^2. To do
this efficiently numerically you break down the system using search
trees. Thinking in 2D just to keep it simple, you divide the space into
top left, top right, bottom left, bottom right as a first approximation,
then continually do that until you end up with each element in its own
box. When you figure out the forces you are going to apply to the system,
you just take the distance to the middle of the box that contains the
ones you are going to consider together (the closer to the star in
question, the smaller the boxes need to be, because the direction of r
changes more quickly the closer the boxes are to the star; farther away
you can use larger and larger boxes, each of which contains a 2D
tree-like structure descending to the point where each of the stars
contained is trapped in its own little box), sum the masses of the stars
in the box, and presto.


How would this help you? Well, if you encoded the "box hierarchy", say 1
for top left, 2 for top right, 3 for bottom left, 4 for bottom right,
then you could specify the box that someone is in with a string like
"14234". To find the set of stars/points/whatever that are at least x
away, you would just have to do a range search for all the points whose
location "string" is larger than or equal to the location string
corresponding to the closest corner of the biggest box such that its
corner is at least "x" units away. Quite good as a first approximation,
and the search algorithm should run as O(nlog(n)), which is a logarithmic
decrease in computation time. I.e. the 1 billion times (1 billion - 1)
problem becomes 1 billion times ~9; much, much nicer. A really difficult
thing to explain without looking over a diagram in person, I admit, but
hopefully it makes sense if you look up the algorithm online.
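
A sketch of that box-hierarchy encoding in C# (the coordinate
normalization and digit assignment are assumptions; the point is that
lexicographic range scans over the key group nearby points together):

// One digit per level: 1 = top left, 2 = top right,
// 3 = bottom left, 4 = bottom right.
static string BoxKey(double x, double y, int depth)
{
    // Assumes coordinates are normalized to [0, 1).
    double x0 = 0, y0 = 0, x1 = 1, y1 = 1;
    var key = new System.Text.StringBuilder();
    for (int d = 0; d < depth; d++)
    {
        double mx = (x0 + x1) / 2, my = (y0 + y1) / 2;
        bool right = x >= mx, bottom = y >= my;
        key.Append(right ? (bottom ? '4' : '2') : (bottom ? '3' : '1'));
        if (right) x0 = mx; else x1 = mx;    // descend into the half
        if (bottom) y0 = my; else y1 = my;   // containing the point
    }
    return key.ToString();   // e.g. "14234"
}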



On 04/09/2010 05:01 PM, malsmith wrote:



It's sort of an interesting problem - in an RDBMS one relatively simple
approach would be to calculate a rectangle that is X km by Y km with User
1's location at the center. ...

Re: How to perform queries on Cassandra?

2010-04-09 Thread Malcolm Smith

Mike are you stuck on a train too? :-)



On Apr 9, 2010, at 8:51 PM, Mike Gallamore wrote:


I apologize in advance if this goes into esoteric algorithms a bit
too much, but I think this will get to an interesting idea to solve
your problem. ...

Re: How to perform queries on Cassandra?

2010-04-09 Thread dir dir
Does Cassandra have a default query language, such as SQL in an RDBMS or
Object Query in an OODBMS?  Thank you.

Dir.

On Sat, Apr 10, 2010 at 7:01 AM, malsmith wrote:

>
> It's sort of an interesting problem - in an RDBMS one relatively simple
> approach would be to calculate a rectangle that is X km by Y km with User
> 1's location at the center.  So the rectangle is UserX - 10KmX ,
> UserY-10KmY to UserX+10KmX , UserY+10KmY ...


Re: How to perform queries on Cassandra?

2010-04-09 Thread Paul Prescod
No. Cassandra has an API.

http://wiki.apache.org/cassandra/API

On Fri, Apr 9, 2010 at 8:00 PM, dir dir  wrote:
> Does Cassandra have a default query language, such as SQL in an RDBMS or
> Object Query in an OODBMS?  Thank you.
>
> Dir. ...
>


How many KeySpace will you use in a single application?

2010-04-09 Thread Dop Sun
Hi, a question troubles me now: how many KeySpaces is it better for one
application to use?

 

The question has come up since 0.6, when Cassandra introduced a new API call
named "login", which is done against a specific KeySpace. Thanks to
org.apache.cassandra.auth.AllowAllAuthenticator, old-version clients can
still work without authentication.

 

Actually, while working with the previous version, I just took the KeySpace
as another level of the whole structure: KeySpace - ColumnFamily - Super
Column (optional) - Column - Value, considered the whole Cassandra cluster
the root of all these, and had one application control everything under the
cluster.

 

Now it looks like I need to re-think this and treat the KeySpace as a kind
of root. It may be better to have one application use only one KeySpace (a
silly question? In the old days one application usually used only one
database, but forgive me, I may be abusing the flexibility of Cassandra).
Are there any pros or cons to using multiple KeySpaces vs. a single
KeySpace, other than the authentication requirements?

 

Can anyone give me some suggestions on this?

 

Dop