Re: Read Latency

2010-10-19 Thread Aaron Morton
Not sure how pycassa does it, but it's a simple case of:
- get_slice with start="", finish="", and count = 100,001
- pop the last column and store its name
- get_slice with start as the last column name, finish="", and count = 100,001
- repeat.
A

On 20 Oct 2010, at 03:08 PM, Wayne wrote: Thanks for all of
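
The paging loop Aaron describes can be sketched in plain Python. Here `get_slice` is a hypothetical in-memory stand-in for the real Thrift call (it slices a sorted dict of columns), so only the loop structure is the point:

```python
def get_slice(columns, start, count):
    """Stand-in for Cassandra's get_slice: return up to `count`
    (name, value) pairs with name >= start, in sorted name order."""
    names = sorted(columns)
    if start:
        names = [n for n in names if n >= start]
    return [(n, columns[n]) for n in names[:count]]

def page_columns(columns, page_size=100000):
    """Yield every column exactly once: fetch page_size + 1 columns,
    pop the last one, and reuse its name as the next start (the start
    column of the next slice is the one we popped)."""
    start = ""
    while True:
        chunk = get_slice(columns, start, page_size + 1)
        if len(chunk) <= page_size:   # short page: this is the last one
            yield from chunk
            break
        last_name, _ = chunk[-1]      # pop the extra column, keep its name
        yield from chunk[:-1]
        start = last_name             # next slice starts at the popped column
```

Because the start of each slice is inclusive, popping the extra column before yielding means no column is returned twice.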

Re: Read Latency

2010-10-19 Thread Wayne
Thanks for all of the feedback. I may very well not be doing a deep copy, so my numbers might not be accurate. I will test with writing to/from disk to verify how long native python takes. I will also check how large the data coming from Cassandra is, for comparison. Our high expecta

Re: bootstrap question

2010-10-19 Thread Jonathan Ellis
I think this code has had some changes since beta2. Here is what it looks like in trunk:

    if (DatabaseDescriptor.getNonSystemTables().size() > 0)
    {
        bootstrap(token);
        assert !isBootstrapMode; // bootstrap will block until finished

Re: Read Latency

2010-10-19 Thread Aaron Morton
Hard to say why your code performs that way; it may not be creating as many objects, for example strings may not be re-created, just referenced. Are you creating new objects for every column returned? Bringing 600,000 to 10M columns back at once is always going to take time. I think any python database

bootstrap question

2010-10-19 Thread Yang
from line 396 of StorageService.java from the 0.7.0-beta2 source, it looks like when I boot up a completely new node, if there is not any keyspace defined in its storage.yaml, it would not even participate in the ring? in other words, let's say the cassandra instance currently has 10 nodes, and h

Re: Read Latency

2010-10-19 Thread Nicholas Knight
On Oct 20, 2010, at 6:30 AM, Wayne wrote: > I am not sure how many bytes, Then I don't think your performance numbers really mean anything substantial. Deserialization time is inevitably going to go up with the amount of data present, so unless you know how much data you actually have, there's

Re: How to get all rows inserted

2010-10-19 Thread Aaron Morton
The general pattern is to use get_range_slices as you describe; see also http://wiki.apache.org/cassandra/FAQ#iter_world. Note you should be using the key fields on the KeyRange, not the tokens. There have been a few issues around with using the RandomPartitioner so it may be best to get on 0.6.6 if you c

Re: How to get all rows inserted

2010-10-19 Thread Robert
For this case, the order doesn't matter, I just need to page over all of the data X rows at a time. When I use column_family.get_range from pycassa and pass in the last key as the new start key, I do not get all of the results. I have found a few posts about this, but I did not find a recommended

Re: How to get all rows inserted

2010-10-19 Thread Tyler Hobbs
I don't think I understand what you're trying to do. Do you want to page over the whole column family X rows at a time? Does it matter if the rows are in order? - Tyler On Tue, Oct 19, 2010 at 5:22 PM, Robert wrote: > I have a similar question. Is there a way to divide this into multiple > re

Re: Read Latency

2010-10-19 Thread Wayne
I am not sure how many bytes, but we do convert the cassandra object that is returned in 3s into a dictionary in ~1s and then again into a custom python object in about ~1.5s. Expectations are based on this timing. If we can convert what thrift returns into a completely new python object in 1s why

Re: How to get all rows inserted

2010-10-19 Thread Robert
I have a similar question. Is there a way to divide this into multiple requests? I am using Cassandra v0.6.4, RandomPartitioner, and the pycassa library. Can I use get_range_slices with a start_token=0, and then recalculate the token from the last value key returned until it equals it loops arou
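
The lost-rows symptom Robert describes usually comes down to two things: under the RandomPartitioner, get_range_slices returns rows in token order (not key order), and the start key is inclusive, so the last key of one page comes back again as the first row of the next and must be skipped. A toy simulation of the pattern — `token` and `get_range` are stand-ins for the partitioner and the real RPC, not pycassa APIs:

```python
import hashlib

def token(key):
    # Stand-in for RandomPartitioner's token: a hash of the key.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def get_range(rows, start_key, count):
    """Stand-in for get_range_slices: returns up to `count` keys in
    *token* order, starting at start_key's position (inclusive)."""
    ordered = sorted(rows, key=token)
    i = ordered.index(start_key) if start_key else 0
    return ordered[i:i + count]

def all_rows(rows, page_size=100):
    """Page over every row: reuse the last key of each page as the
    next start, and drop it from the next page since start is inclusive."""
    seen, start = [], ""
    while True:
        page = get_range(rows, start, page_size + 1)
        if start:
            page = page[1:]            # start key was already returned
        seen.extend(page)
        if len(page) < page_size:      # short page: we've wrapped the ring
            break
        start = page[-1]
    return seen
```

The keys come back in token order, so the caller should not expect them sorted; dropping the duplicated start key is what keeps the count right.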

Re: How to get all rows inserted

2010-10-19 Thread Aaron Morton
KeyRange has a count on it; the default is 100. For the ordering, double check you are using the OrderPreserving partitioner. If it's still out of order, send an example. Cheers, Aaron

On 20 Oct 2010, at 09:39 AM, Wicked J wrote: Hi, I inserted 500 rows (records) in Cassandra and I'm using the following c

How to get all rows inserted

2010-10-19 Thread Wicked J
Hi, I inserted 500 rows (records) in Cassandra and I'm using the following code to retrieve all the inserted rows. However, I'm able to get only 100 rows (in a random order). I'm using Cassandra v0.6.4 with OrderPreserving Partition on a single node/instance. How can I get all the rows inserted? i.

Re: Throttling ColumnFamilyRecordReader

2010-10-19 Thread Jonathan Ellis
(Moving to u...@.) Isn't reducing the number of map tasks the easiest way to tune this? Also: in 0.7 you can use NetworkTopologyStrategy to designate a group of nodes as your hadoop "datacenter" so the workloads won't overlap. On Tue, Oct 19, 2010 at 3:22 PM, Michael Moores wrote: > Does it mak

Re: Read Latency

2010-10-19 Thread Aaron Morton
Just wondering how many bytes you are returning to the client, to get an idea of how slow it is. The call to fastbinary is decoding the wire format and creating the Python objects. When you ask for 600,000 columns you are creating a lot of python objects. Each column will be a ColumnOrSuperColumn,

Re: Read Latency

2010-10-19 Thread Wayne
It is an entire row which is 600,000 cols. We pass a limit of 10million to make sure we get it all. Our issue is that it seems Thrift itself has more overhead/latency added to a read that Cassandra takes itself to do the read. If cfstats for the slowest node reports 2.25s to us it is not acceptable

Re: Read Latency

2010-10-19 Thread Wayne
Our problem is not that Python is slow, our problem is that getting data from the Cassandra server is slow (while Cassandra itself is fast). Python can handle the result data a lot faster than whatever it is passing through now... I guess to ask a specific question: what, right now, is the fastest me

Re: Read Latency

2010-10-19 Thread aaron morton
Wayne, I'm calling cassandra from Python and have not seen too many 3 second reads. Your last email with log messages in it looks like you are asking for 10,000,000 columns. How much data is this request actually transferring to the client? The column names suggest only a few. DEBUG [pool-1

Re: Dumping Cassandra into Hadoop

2010-10-19 Thread aaron morton
Depends on what you mean by dumping into Hadoop. If you want to read them from a Hadoop Job then you can use either native Hadoop or Pig. See the contrib/word_count and contrib/pig examples. If you want to copy the data into a Hadoop File System install then I guess almost anything that can r

Re: Read Latency

2010-10-19 Thread Jonathan Ellis
I would expect C++ or Java to be substantially faster than Python. However, I note that Hector (and I believe Pelops) don't yet use the newest, fastest Thrift library. On Tue, Oct 19, 2010 at 8:21 AM, Wayne wrote: > The changes seems to do the trick. We are down to about 1/2 of the original > quo

Re: Cassandra + Zookeeper, what is the current state?

2010-10-19 Thread Jeremy Hanna
That code has never existed in the public. It was taken out before it was open-sourced. On Oct 19, 2010, at 11:45 AM, Yang wrote: > Thanks guys. > > but I feel it would probably be better to refactor out the hooks and > make components like zookeeper pluggable , so users could use either > zoo

Re: Cassandra + Zookeeper, what is the current state?

2010-10-19 Thread Yang
Thanks guys. but I feel it would probably be better to refactor out the hooks and make components like zookeeper pluggable , so users could use either zookeeper or the current config-file based seeds discovery Yang On Tue, Oct 19, 2010 at 9:02 AM, Jeremy Hanna wrote: > Further, when they open-

Re: Cassandra + Zookeeper, what is the current state?

2010-10-19 Thread Jeremy Hanna
Further, when they open-sourced cassandra, they removed certain things that are specific to their environment at facebook, including zookeeper integration. In the paper it has some specific purposes, like finding seeds iirc. Apache Cassandra has no dependency on zookeeper. I think that's a go

Re: Cassandra security model? ( or, authentication docs ?)

2010-10-19 Thread Jeremy Hanna
just as an fyi, I created something in the wiki yesterday - it's just a start though - http://wiki.apache.org/cassandra/ExtensibleAuth there's also a FAQ entry on it now - http://wiki.apache.org/cassandra/FAQ#auth just for going forward - on the wiki itself, just trying to help there. On Oct 19,

Re: Hadoop Word Count Super Column Example?

2010-10-19 Thread Jeremy Hanna
It's relatively straightforward, the current mapper gets a map of column names to IColumns. The SuperColumn implements the IColumn interface. So you would probably need both the super column name and the subcolumn name to get at it, but you just need to cast the IColumn to a super column and h
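
The mapper-side shape Jeremy describes — a map of column names to IColumns, where a SuperColumn holds subcolumns of its own — can be modelled as nested dicts. This is only an illustration of the two-level lookup, not the Hadoop or Cassandra API (the names below are made up from Frank's example):

```python
# A row as the mapper might see it: super column name -> subcolumns.
row = {
    "DeviceID:Sensor": {        # the super column
        "reading": "21.5",      # subcolumns (in Frank's case, TimeUUID-named)
        "unit": "C",
    }
}

def get_subcolumn(row, super_name, sub_name):
    """Two-level lookup: first the super column by name, then the
    subcolumn inside it -- the 'cast IColumn to SuperColumn' step."""
    return row[super_name][sub_name]
```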

Hadoop Word Count Super Column Example?

2010-10-19 Thread Frank LoVecchio
I have a Hadoop installation working with a cluster of 0.7 Beta 2 Nodes, and got the WordCount example to work using the standard configuration. I have been inserting data into a Super Column (Sensor) with TimeUUID as the compare type, it looks like this: get Sensor['DeviceID:Sensor'] => (super_c

Re: Cassandra/Pelops error processing get_slice

2010-10-19 Thread Frank LoVecchio
Aaron, It seems that we had a beta-1 node in our cluster of beta-2's. Haven't had the problem since. Thanks for the help, Frank On Sat, Oct 16, 2010 at 1:50 PM, aaron morton wrote: > Frank, > > Things are a bit clearer now. Think I had the wrong idea to start with. > > The server side error me

Dumping Cassandra into Hadoop

2010-10-19 Thread Mark
As the subject implies I am trying to dump Cassandra rows into Hadoop. What is the easiest way for me to accomplish this? Thanks. Should I be looking into pig for something like this?

Re: Read Latency

2010-10-19 Thread Wayne
The changes seems to do the trick. We are down to about 1/2 of the original quorum read performance. I did not see any more errors. More than 3 seconds on the client side is still not acceptable to us. We need the data in Python, but would we be better off going through Java or something else to i

Re: Thrift version

2010-10-19 Thread Brayton Thompson
Go into the lib dir in Cassandra and look at the thrift jar. The name has in it the specific revision you need to use. Use svn to pull it down. Sent from my iPhone On Oct 18, 2010, at 10:50 PM, JKnight JKnight wrote: > Dear all, > > Which Thrift version does Cassandra 0.66 using? > Thank a

RE: Preventing an update of a CF row

2010-10-19 Thread Viktor Jevdokimov
Reverse timestamp. -Original Message- From: Sylvain Lebresne [mailto:sylv...@yakaz.com] Sent: Tuesday, October 19, 2010 10:44 AM To: user@cassandra.apache.org Subject: Re: Preventing an update of a CF row > Always specify some constant value for timestamp. Only 1st insertion with that >

Re: Cassandra + Zookeeper, what is the current state?

2010-10-19 Thread Norman Maurer
No, Zookeeper is not used in cassandra. You can use Zookeeper as some kind of add-on to do locking etc. Bye, Norman 2010/10/19 Yang : > I read from the Facebook cassandra paper that zookeeper "is used > ." for certain things ( membership and Rack-aware placement) > > but I pulled 0.7.0-beta2

Cassandra + Zookeeper, what is the current state?

2010-10-19 Thread Yang
I read from the Facebook cassandra paper that zookeeper "is used ..." for certain things (membership and Rack-aware placement) but I pulled 0.7.0-beta2 source and couldn't grep out anything with "Zk" or "Zoo", nor any files with "Zk/Zoo" in the names is Zookeeper really used? docs/blog posts

Re: Cassandra security model? ( or, authentication docs ?)

2010-10-19 Thread Yang
Thanks a lot On Mon, Oct 18, 2010 at 11:44 AM, Eric Evans wrote: > On Sun, 2010-10-17 at 21:26 -0700, Yang wrote: >> I searched around, it seems that this is not clearly documented yet; >> the closest I found is: >> http://www.riptano.com/docs/0.6.5/install/auth-config >> http://cassandra-user-in

Re: TimeUUID makes me crazy

2010-10-19 Thread Sylvain Lebresne
In your first column family, you are using a UUID as a row key (your column names are apparently strings: phone, address). The CompareWith directive specifies the comparator for *column names*. So you are providing strings where you told Cassandra you'd provide UUIDs, hence the exceptions. The s
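
The mismatch Sylvain points out reduces to a length check: a TimeUUID comparator requires every column name to be exactly 16 bytes, so a string name like "phone" is rejected before anything else happens. A minimal sketch of that validation (not Cassandra's actual code):

```python
import uuid

def validate_timeuuid_name(name: bytes):
    """A TimeUUID comparator requires column names that are exactly
    16 bytes; anything else fails, mirroring the
    InvalidRequestException(why: UUIDs must be exactly 16 bytes)."""
    if len(name) != 16:
        raise ValueError("UUIDs must be exactly 16 bytes, got %d" % len(name))
    return uuid.UUID(bytes=name)

# A real version-1 (time-based) UUID passes; b"phone" (5 bytes) does not.
```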

Re: Preventing an update of a CF row

2010-10-19 Thread Sylvain Lebresne
> Always specify some constant value for timestamp. Only 1st insertion with that > timestamp will succeed. Others will be ignored, because will be considered > duplicates by cassandra. Well, that's not entirely true. When cassandra 'resolves' two columns having the same timestamp, it will compare
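
Sylvain's point about same-timestamp resolution can be sketched as a compare function: the higher timestamp wins, and on a tie the column with the greater value wins, so writing with a constant timestamp does not guarantee later writes are ignored. This is a simplified model of the rule he describes, not Cassandra's source:

```python
def resolve(col_a, col_b):
    """Pick the surviving column. Each column is (value, timestamp).
    Higher timestamp wins; on equal timestamps the tie is broken by
    comparing values, so a 'constant timestamp' does not freeze a row."""
    (val_a, ts_a), (val_b, ts_b) = col_a, col_b
    if ts_a != ts_b:
        return col_a if ts_a > ts_b else col_b
    return col_a if val_a >= val_b else col_b
```

For example, two columns both written at timestamp 5 resolve to whichever value compares greater, regardless of which write arrived last.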

R: Re: TimeUUID makes me crazy

2010-10-19 Thread cbert...@libero.it
I am using Pelops for Cassandra 0.6.x. The error that is raised is InvalidRequestException(why: UUIDs must be exactly 16 bytes). For the UUIDs I am using the UuidHelper class provided.