Not sure how pycassa does it, but it's a simple case of:
- get_slice with start="", finish="", and count = 100,001
- pop the last column and store its name
- get_slice with start set to that last column name, finish="", and count = 100,001
- repeat.
A
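For illustration, a rough, untested sketch of that loop against the 0.6 Thrift-generated Python client (the same interface pycassa wraps); the client/keyspace arguments and the page size are placeholders, not anything from this thread:

from cassandra.ttypes import (ColumnParent, SlicePredicate,
                              SliceRange, ConsistencyLevel)

def iter_columns(client, keyspace, key, column_family, page_size=100000):
    """Yield every column of one row, page_size columns at a time."""
    parent = ColumnParent(column_family=column_family)
    start = ''
    while True:
        # ask for one extra column; its name becomes the next start
        predicate = SlicePredicate(slice_range=SliceRange(
            start=start, finish='', reversed=False, count=page_size + 1))
        result = client.get_slice(keyspace, key, parent, predicate,
                                  ConsistencyLevel.ONE)
        # assumes standard columns, not super columns
        columns = [cosc.column for cosc in result]
        if len(columns) <= page_size:
            for col in columns:
                yield col
            return
        for col in columns[:-1]:      # pop the last column...
            yield col
        start = columns[-1].name      # ...and use its name as the new start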
Thanks for all of the feedback. I may very well not be doing a deep copy, so
my numbers might not be accurate. I will test with writing to/from the disk
to verify how long native Python takes. I will also check how large the data
coming from Cassandra is, for comparison.
Our high expectations...
I think this code has had some changes since beta2. Here is what it
looks like in trunk:
if (DatabaseDescriptor.getNonSystemTables().size() > 0)
{
    bootstrap(token);
    assert !isBootstrapMode; // bootstrap will block until finished
Hard to say why your code performs that way; it may not be creating as many objects, e.g. strings may not be re-created, just referenced. Are you creating new objects for every column returned? Bringing 600,000 to 10M columns back at once is always going to take time. I think any Python database ...
From line 396 of StorageService.java in the 0.7.0-beta2 source, it
looks like when I boot up a completely new node,
if there is not any keyspace defined in its storage.yaml, it would
not even participate in the ring?
In other words, let's say the Cassandra cluster currently has 10
nodes, and h
On Oct 20, 2010, at 6:30 AM, Wayne wrote:
> I am not sure how many bytes,
Then I don't think your performance numbers really mean anything substantial.
Deserialization time is inevitably going to go up with the amount of data
present, so unless you know how much data you actually have, there's
The general pattern is to use get_range_slices as you describe; see also http://wiki.apache.org/cassandra/FAQ#iter_world. Note you should use the key fields on the KeyRange, not the tokens. There have been a few issues around using the RandomPartitioner, so it may be best to get on 0.6.6 if you c
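As a rough, untested sketch of that key-based pattern against the 0.6 Thrift API (client, keyspace and the batch/column counts below are placeholder assumptions):

from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              KeyRange, ConsistencyLevel)

def iter_rows(client, keyspace, column_family, batch_size=100):
    """Walk every row by paging on keys (not tokens), batch_size rows at a time."""
    parent = ColumnParent(column_family=column_family)
    predicate = SlicePredicate(slice_range=SliceRange(
        start='', finish='', reversed=False, count=1000))
    start_key = ''
    while True:
        key_range = KeyRange(start_key=start_key, end_key='', count=batch_size)
        slices = client.get_range_slices(keyspace, parent, predicate,
                                         key_range, ConsistencyLevel.ONE)
        for key_slice in slices:
            if key_slice.key == start_key:
                continue  # start_key is inclusive; skip the row from the last page
            yield key_slice.key, key_slice.columns
        if len(slices) < batch_size:
            return
        start_key = slices[-1].key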
For this case, the order doesn't matter, I just need to page over all of the
data X rows at a time. When I use column_family.get_range from pycassa and
pass in the last key as the new start key, I do not get all of the results.
I have found a few posts about this, but I did not find a recommended
I don't think I understand what you're trying to do. Do you want to page
over the whole column
family X rows at a time? Does it matter if the rows are in order?
- Tyler
On Tue, Oct 19, 2010 at 5:22 PM, Robert wrote:
> I have a similar question. Is there a way to divide this into multiple
> re
I am not sure how many bytes, but we do convert the cassandra object that is
returned in 3s into a dictionary in ~1s and then again into a custom python
object in about ~1.5s. Expectations are based on this timing. If we can
convert what thrift returns into a completely new python object in 1s why
I have a similar question. Is there a way to divide this into multiple
requests? I am using Cassandra v0.6.4, RandomPartitioner, and the pycassa
library.
Can I use get_range_slices with a start_token of 0, and then recalculate the
token from the last key returned, until it loops around?
The KeyRange has a count on it; the default is 100. For the ordering, double check you are using the OrderPreservingPartitioner. If it's still out of order, send an example.
Cheers
Aaron
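For example (untested; client, keyspace, and the 'Standard1' column family are placeholders), with the raw Thrift API the row limit lives on the KeyRange:

from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              KeyRange, ConsistencyLevel)

def fetch_all_rows(client, keyspace, column_family='Standard1', row_count=500):
    # bump the row count from the default 100 so all 500 inserted rows come back
    key_range = KeyRange(start_key='', end_key='', count=row_count)
    predicate = SlicePredicate(slice_range=SliceRange(
        start='', finish='', reversed=False, count=100))
    return client.get_range_slices(keyspace,
                                   ColumnParent(column_family=column_family),
                                   predicate, key_range, ConsistencyLevel.ONE)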
Hi,
I inserted 500 rows (records) in Cassandra and I'm using the following code
to retrieve all the inserted rows. However, I'm able to get only 100 rows
(in a random order). I'm using Cassandra v0.6.4 with the
OrderPreservingPartitioner on a single node/instance.
How can I get all the rows inserted? i.e. ...
(Moving to u...@.)
Isn't reducing the number of map tasks the easiest way to tune this?
Also: in 0.7 you can use NetworkTopologyStrategy to designate a group
of nodes as your hadoop "datacenter" so the workloads won't overlap.
On Tue, Oct 19, 2010 at 3:22 PM, Michael Moores wrote:
> Does it mak
Just wondering how many bytes you are returning to the client, to get an idea of how slow it is. The call to fastbinary is decoding the wire format and creating the Python objects. When you ask for 600,000 columns you are creating a lot of Python objects. Each column will be a ColumnOrSuperColumn, ...
It is an entire row, which is 600,000 cols. We pass a limit of 10 million to
make sure we get it all. Our issue is that Thrift itself seems to add more
overhead/latency to a read than Cassandra itself takes to do the read.
If cfstats for the slowest node reports 2.25s, to us it is not acceptable
Our problem is not that Python is slow; our problem is that getting data
from the Cassandra server is slow (while Cassandra itself is fast). Python
can handle the result data a lot faster than whatever it is passing through
now...
I guess to ask a specific question: what, right now, is the fastest me
Wayne,
I'm calling cassandra from Python and have not seen too many 3 second reads.
Your last email with log messages in it looks like you are asking for
10,000,000 columns. How much data is this request actually transferring to the
client? The column names suggest only a few.
DEBUG [pool-1
Depends on what you mean by dumping into Hadoop.
If you want to read them from a Hadoop Job then you can use either native
Hadoop or Pig. See the contrib/word_count and contrib/pig examples.
If you want to copy the data into a Hadoop File System install then I guess
almost anything that can r
I would expect C++ or Java to be substantially faster than Python.
However, I note that Hector (and I believe Pelops) don't yet use the
newest, fastest Thrift library.
On Tue, Oct 19, 2010 at 8:21 AM, Wayne wrote:
> The changes seems to do the trick. We are down to about 1/2 of the original
> quo
That code has never existed in the public. It was taken out before it was
open-sourced.
On Oct 19, 2010, at 11:45 AM, Yang wrote:
> Thanks guys.
>
> but I feel it would probably be better to refactor out the hooks and
> make components like zookeeper pluggable , so users could use either
> zoo
Thanks guys.
but I feel it would probably be better to refactor out the hooks and
make components like zookeeper pluggable, so users could use either
zookeeper or the current config-file based seed discovery
Yang
On Tue, Oct 19, 2010 at 9:02 AM, Jeremy Hanna
wrote:
> Further, when they open-
Further, when they open-sourced Cassandra, they removed certain things that are
specific to their environment at Facebook, including Zookeeper integration. In
the paper it serves some specific purposes, like finding seeds, iirc. Apache
Cassandra has no dependency on Zookeeper. I think that's a go
just as an fyi, I created something in the wiki yesterday - it's just a start
though - http://wiki.apache.org/cassandra/ExtensibleAuth
there's also a FAQ entry on it now - http://wiki.apache.org/cassandra/FAQ#auth
just for going forward - on the wiki itself, just trying to help there.
On Oct 19,
It's relatively straightforward: the current mapper gets a map of column names
to IColumns. SuperColumn implements the IColumn interface, so you would
probably need both the super column name and the subcolumn name to get at it,
but you just need to cast the IColumn to a super column and h
I have a Hadoop installation working with a cluster of 0.7 Beta 2 Nodes, and
got the WordCount example to work using the standard configuration. I have
been inserting data into a Super Column (Sensor) with TimeUUID as the
compare type; it looks like this:
get Sensor['DeviceID:Sensor']
=> (super_c
Aaron,
It seems that we had a beta-1 node in our cluster of beta-2 nodes. Haven't had
the problem since.
Thanks for the help,
Frank
On Sat, Oct 16, 2010 at 1:50 PM, aaron morton wrote:
> Frank,
>
> Things are a bit clearer now. Think I had the wrong idea to start with.
>
> The server side error me
As the subject implies I am trying to dump Cassandra rows into Hadoop.
What is the easiest way for me to accomplish this? Thanks.
Should I be looking into pig for something like this?
The changes seem to do the trick. We are down to about 1/2 of the original
quorum read performance. I did not see any more errors.
More than 3 seconds on the client side is still not acceptable to us. We
need the data in Python, but would we be better off going through Java or
something else to i
Go into the lib dir in Cassandra and look at the Thrift jar. The name includes
the specific Thrift revision you need to use. Use svn to pull it down.
Sent from my iPhone
On Oct 18, 2010, at 10:50 PM, JKnight JKnight wrote:
> Dear all,
>
> Which Thrift version does Cassandra 0.66 using?
> Thank a
Reverse timestamp.
-Original Message-
From: Sylvain Lebresne [mailto:sylv...@yakaz.com]
Sent: Tuesday, October 19, 2010 10:44 AM
To: user@cassandra.apache.org
Subject: Re: Preventing an update of a CF row
> Always specify some constant value for timestamp. Only 1st insertion with that
>
No, Zookeeper is not used in Cassandra. You can use Zookeeper as some
kind of add-on to do locking, etc.
Bye,
Norman
2010/10/19 Yang :
> I read from the Facebook cassandra paper that zookeeper "is used
> ." for certain things ( membership and Rack-aware placement)
>
> but I pulled 0.7.0-beta2
I read in the Facebook Cassandra paper that Zookeeper "is used" for certain
things (membership and rack-aware placement),
but I pulled the 0.7.0-beta2 source and couldn't grep out anything with
"Zk" or "Zoo", nor any files with "Zk/Zoo" in the names.
Is Zookeeper really used? Docs/blog posts
Thanks a lot
On Mon, Oct 18, 2010 at 11:44 AM, Eric Evans wrote:
> On Sun, 2010-10-17 at 21:26 -0700, Yang wrote:
>> I searched around, it seems that this is not clearly documented yet;
>> the closest I found is:
>> http://www.riptano.com/docs/0.6.5/install/auth-config
>> http://cassandra-user-in
In your first column family, you are using a UUID as a row key (your column
names are apparently strings (phone, address)). The CompareWith directive
specifies the comparator for *column names*. So you are providing strings where
you told Cassandra you would provide UUIDs, hence the exceptions.
The s
> Always specify some constant value for timestamp. Only 1st insertion with that
> timestamp will succeed. Others will be ignored, because will be considered
> duplicates by cassandra.
Well, that's not entirely true. When Cassandra 'resolves' two columns
having the same timestamp, it will compare
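For what it's worth, a sketch of the write-once trick against the 0.6 Thrift API (untested; the column family, column name and constant below are made-up placeholders). Note the caveat above: when timestamps tie, Cassandra falls back to comparing the values, so a later write with a different value can still win; the "reverse timestamp" reply presumably means using something like Long.MAX_VALUE minus the current time, so the first write carries the largest timestamp and later writes lose.

from cassandra.ttypes import ColumnPath, ConsistencyLevel

WRITE_ONCE_TIMESTAMP = 1  # constant timestamp shared by all writers (placeholder value)

def insert_once(client, keyspace, key, value):
    # Later inserts with this same timestamp look like duplicates to Cassandra,
    # but equal timestamps are tie-broken by comparing values, so this is not
    # a strict "first write wins" guarantee.
    path = ColumnPath(column_family='Docs', column='body')  # placeholder names
    client.insert(keyspace, key, path, value, WRITE_ONCE_TIMESTAMP,
                  ConsistencyLevel.QUORUM)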
I am using Pelops for Cassandra 0.6.x.
The error that is raised is InvalidRequestException(why: UUIDs must be exactly 16
bytes).
For the UUID I am using the UuidHelper class provided.