"Internal error processing get" during bootstrap

2011-07-26 Thread Rafael Almeida
Hello,

I'm evaluating Cassandra for use in my system. I was able to add approximately 16 
million items using a single node. I'm using libcassandra (I can find my way 
through its code when I need to) to connect to it, and I already have some 
infrastructure for handling and adding those items (I was using Tokyo Cabinet 
before).

I couldn't find much documentation regarding how to set up a cluster, but it 
seemed simple enough. On Cassandra server A (10.0.0.2) I had seeds: "localhost". 
On server B (10.0.0.3) I configured seeds: "10.0.0.2" and auto_bootstrap: true. 
Then I created a keyspace and a few column families in it.

I immediately began to add items and got all these "Internal error processing 
get" errors. I found it quite odd; I thought it had to do with the load I was 
putting in, seeing that a few small tests had worked before. I spent quite some 
time debugging before I finally decided to write this e-mail. I wanted to 
double-check things, so I ran nodetool to see if everything was right. To my 
surprise, only one of the nodes was available. It took a little while for the 
other one to show up as Joining and then as Normal.

After I waited that period, I was able to insert items into the cluster with no 
error at all. Is that expected behaviour? What is the recommended way to set up 
a cluster? Should it be done manually: setting up the machines, creating all 
keyspaces and column families, then checking nodetool and waiting for the ring 
to become stable?

On a side note, sometimes I get "Default TException" (that seems to happen when 
the machine is under heavier load than usual); commonly, retrying the read or 
insert right afterwards works fine. Is that what's supposed to happen? Perhaps 
I should raise a timeout somewhere?
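The retry-on-transient-failure pattern described above can be sketched as a small helper. This is illustrative only: `do_request` and the `retriable` exception tuple stand in for your client's actual call and its TException type; nothing here is part of libcassandra's real API.

```python
import time

def with_retries(do_request, retries=3, base_delay=0.1,
                 retriable=(Exception,)):
    """Call do_request(), retrying transient failures with backoff.

    `retriable` should be narrowed to the transport/timeout exception
    your Thrift client actually raises, so that real errors (such as a
    missing column family) are not masked by retries.
    """
    for attempt in range(retries):
        try:
            return do_request()
        except retriable:
            if attempt == retries - 1:
                raise                           # out of retries: propagate
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# usage sketch: with_retries(lambda: client.get(key, column_path, level))
```

Narrowing `retriable` matters: retrying on every exception would also mask permanent errors that no amount of retrying will fix.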

This is what ./bin/nodetool -h localhost ring reports:

Address     DC          Rack    Status  State   Load     Owns    Token
                                                                 119105113551249187083945476614048008053
10.0.0.3    datacenter1 rack1   Up      Normal  3.43 GB  65.90%  61078635599166706937511052402724559481
10.0.0.2    datacenter1 rack1   Up      Normal  1.77 GB  34.10%  119105113551249187083945476614048008053

It's still adding stuff. I have no idea why B owns so many more keys than A.

I'm sorry if what I'm asking is trivial, but I have been having a hard time 
finding documentation. I've found a lot of outdated material, which was 
frustrating. I hope you guys have the time to help me out or, if not, that you 
can point me at good reading material.

Thank you,
Rafael



Re: Kundera 2.0.2 Released

2011-07-30 Thread Rafael Almeida
On Saturday, July 30, 2011, Amresh Singh  wrote:
> We are happy to announce release of Kundera 2.0.2
>
>
> Kundera is a JPA 2.0 compliant, Object-Datastore Mapping Library for
> NoSQL Datastores. The idea behind Kundera is to make working with
> NoSQL Databases drop-dead simple and fun. It currently supports
> Cassandra, HBase and MongoDB. New features added in this release are:
>
>
> 1. Kundera is now JPA 2.0 compliant. 

Interesting. I thought that, in order to be JPA compliant, you must
support transactions. Does Kundera implement transactions on top of
Cassandra? I could be mixing things up; I have worked with EJB and
Hibernate inside an EJB environment. Maybe the transaction requirement
came from the EJB part of the specification, not JPA. If I recall correctly,
JPA started out as part of the EJB specification, right?

Also, you got the entire HQL to work with Cassandra, MongoDB and HBase?
That's impressive!


How tokens work?

2011-07-30 Thread Rafael Almeida
Hello,

Some computers in my cluster are better than others. In particular, there's
one which is much better, and I'd like to give it more load than the others. Is
that possible? I'm using RandomPartitioner; should I use a different one? Should
I select tokens in some particular way? How is load distribution implemented in
RandomPartitioner with respect to tokens?

Thank you,
Rafael



Re: How tokens work?

2011-07-31 Thread Rafael Almeida
On Saturday, July 30, 2011, Rafael Almeida  wrote:
> Hello,
> 
> I have computers that are better than others in my cluster. In special,
> there's one which is much better and I'd like to give it more load than the
> others.  Is it possible? I'm using RandomPartitioner, should I use other?
> Should I select tokens in some particular way? How is load distribution
> implemented in RandomPartitioner with respect to tokens?
> 

I'm answering myself this time. I think I've got things figured out, at least
for RandomPartitioner. The token space goes from 0 to 2^127, so there are 2^127
possible tokens. The load a node will receive is proportional to the number of
tokens assigned to it. If you assign 2^127 / 2 tokens to a node, it will be
responsible for half the load in the system. If you assign 2^127 / 3 tokens to
a node, it will be responsible for 1/3 of the load, and so on.

But you assign only one token in Cassandra's configuration file! True, but that
token marks the end of the range the node will accept: a node owns the tokens
from the previous node's initial_token in cassandra.yaml (exclusive) up to its
own (inclusive), wrapping around the ring.

I find it hard to explain without an example, so let's say the token space
actually goes from 0 to 100 and we have 4 nodes (to keep things manageable). In
our example, we have the following initial_tokens:

node A = 0
node B = 20
node C = 70
node D = 90

Node B would own the tokens in (0, 20], that is 20/100 = 20% of the load. Node C
would own (20, 70], that is 70 - 20 = 50 tokens (50% of the load). Node D would
own (70, 90], 20 tokens (20% of the load), and finally node A would own the
wrap-around range (90, 0], 10% of the tokens. See how that works?
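A node owns the range from its predecessor's token (exclusive) up to its own token (inclusive), wrapping around the ring. That arithmetic can be sketched as follows (illustrative only; the toy 0..100 ring stands in for the real 0..2^127 RandomPartitioner space):

```python
def ownership(tokens, ring_size):
    """Map each node's token to the fraction of the ring it owns.

    A node owns (previous_token, its_token], wrapping around, so its
    share is the distance back to its predecessor on the ring.
    """
    ordered = sorted(tokens)
    shares = {}
    for i, tok in enumerate(ordered):
        prev = ordered[i - 1]          # i == 0 wraps to the last token
        shares[tok] = ((tok - prev) % ring_size) / ring_size
    return shares

# toy ring from the example above:
shares = ownership([0, 20, 70, 90], 100)
# node with token 0 owns 10%, 20 owns 20%, 70 owns 50%, 90 owns 20%
```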

What if you mess up your configuration? Let's say you set up initial_token like
this:

node A = 10
node B = 20
node C = 70
node D = 90

At first glance the tokens between 90 and 10 look unhandled, but the ring wraps
around: the node with the lowest token picks up the wrap-around range, so no
token is ever left without an owner. I've tested this with two nodes and, with
such a configuration, each node still ends up handling 50% of the load.

I hope I've been clear. Please correct me if I misunderstood something.



Re: "Internal error processing get" during bootstrap

2011-07-31 Thread Rafael Almeida
I'm going to tell you guys the answers I could find so far.

On Tuesday, July 26, 2011, Rafael Almeida  wrote:
> I couldn't find much documentation regarding how to set up a cluster, but it 
> seemed simple enough. On Cassandra server A (10.0.0.2) I had seeds: 
> "localhost". On server B (10.0.0.3) I configured seeds: 
> "10.0.0.2" and auto_bootstrap: true. Then I created a keyspace and a 
> few column families in it.
> 
> I immediately began to add items and got all these "Internal error 
> processing get" errors. I found it quite odd; I thought it had to do with the 
> load I was putting in, seeing that a few small tests had worked before. I 
> spent quite some time debugging before I finally decided to write this 
> e-mail. I wanted to double-check things, so I ran nodetool to see if 
> everything was right. To my surprise, only one of the nodes was available. 
> It took a little while for the other one to show up as Joining and then as 
> Normal.
> 
> After I waited that period, I was able to insert items into the cluster with 
> no error at all. Is that expected behaviour? What is the recommended way to 
> set up a cluster? Should it be done manually: setting up the machines, 
> creating all keyspaces and column families, then checking nodetool and 
> waiting for the ring to become stable?


The problem I was having was mainly because I had set node A as seed of B and B
as seed of A. I don't know what possessed me! Regarding the schema
configuration: I made a schema file and I load it using:

    cassandra-cli -h localhost --batch < schema-file

It works alright.
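For reference, a schema file for that era of cassandra-cli might look like the sketch below. The keyspace and column family names are made up, and the exact strategy_options syntax varied between 0.7/0.8-era releases, so check it against your version before using it:

```
create keyspace MyKeyspace
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor: 1};

use MyKeyspace;

create column family Items
    with comparator = UTF8Type
    and default_validation_class = UTF8Type;
```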
 
> On a side note, sometimes I get "Default TException" (that seems to 
> happen when the machine is under heavier load than usual); commonly, 
> retrying the read or insert right afterwards works fine. Is that what's 
> supposed to happen? Perhaps I should raise a timeout somewhere?


I still don't get why that error was so frequent. At first I was testing on 
workstations, where people would compile stuff and run all sorts of software. I 
think that slowed things down considerably, and the system was having a hard 
time managing connections from the application. After I moved it to dedicated 
machines those problems ceased to happen.

> This is what ./bin/nodetool -h localhost ring reports:
> 
> Address     DC          Rack    Status  State   Load     Owns    Token
>                                                                  119105113551249187083945476614048008053
> 10.0.0.3    datacenter1 rack1   Up      Normal  3.43 GB  65.90%  61078635599166706937511052402724559481
> 10.0.0.2    datacenter1 rack1   Up      Normal  1.77 GB  34.10%  119105113551249187083945476614048008053
> 
> It's still adding stuff. I have no idea why B owns so many more keys than A.


It happened due to my weird double-seed configuration. Now everything is fine.
I've explained how tokens work in a different thread.

Cheers,
Rafael



Re: Problems using Thrift API in C

2011-08-04 Thread Rafael Almeida
- Original Message -

> From: Konstantin  Naryshkin 
> To: user@cassandra.apache.org
> Cc: 
> Sent: Thursday, August 4, 2011 10:36 AM
> Subject: Re: Problems using Thrift API in C
> 
> I have had similar issues when I generated Cassandra bindings for Erlang. It 
> seems that Thrift 0.6.1 (the latest stable version) does not work with 
> Cassandra. Using Thrift 0.7 does.
> 
> I had issues where it would give me run time errors when trying to send an 
> insert (it would not serialize correctly).
> 

I have a problem using Thrift from C as well. I'm using Thrift 0.5, and if I
try to add a row to a column family that doesn't exist, the exception I get is

  Default TException

which is very unspecific. Is that an issue with Cassandra? Is there perhaps
something wrong with my setup? I was hoping to get a "Column family not found"
message or something along those lines.



Re: Best indexing solution for Cassandra

2011-09-28 Thread Rafael Almeida
From Anthony Ikeda:
> Well, we go live with our project very soon and we are now looking into what 
> we will be doing for the next phase. One of the enhancements we would like to 
> consider is an indexing platform to start building searches into our 
> application.
>
>
> Right now we are just using column families to index the information 
> (different views based on what we want to find) however it is proving to be 
> quite a task to keep the index views in sync with the data - although not a 
> showstopper, it isn't something we want to be handling all the time 
> especially since operations like deletions require changes to multiple column 
> families.
>
>
> I've heard of Solandra and Lucandra but I want to understand the experiences 
> of people that may have used them or other suggestions.


I've had some experience with that. My main problem was that I had a limited 
vocabulary and a large number of documents. It seems that Solandra kept all my 
documents in the same row for a given term. That means the documents don't get 
spread out across the cluster, and search was painfully slow. We ended up 
rolling our own solution and not using Cassandra at all for that purpose 
(although we still use it for storage).



Creating column families per client

2011-12-21 Thread Rafael Almeida
Hello,

I am evaluating the usage of Cassandra for my system. I will have several 
clients who won't share data with each other. My idea is to create one column 
family per client. When a new client comes in and adds data to the system, I'd 
like to create a column family dynamically. Is that reliable? Can I create a 
column family on a node, immediately add new data to that column family, and 
be confident that the data added will eventually become visible to a read?
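One relevant detail for dynamic column family creation: the Thrift API's describe_schema_versions call reports which schema version each node is running, and clients typically poll it until all reachable nodes agree before writing to a freshly created column family. The agreement check itself is trivial; a sketch (the version-map shape mirrors what describe_schema_versions returns, a dict of schema-version id to list of node addresses, with down nodes listed under 'UNREACHABLE'):

```python
def schema_agreed(versions):
    """True when all reachable nodes share one schema version.

    `versions` maps schema-version id -> list of node addresses, the
    shape returned by Thrift's describe_schema_versions. Unreachable
    nodes are reported under the key 'UNREACHABLE' and ignored here.
    """
    live = [v for v in versions if v != 'UNREACHABLE']
    return len(live) == 1

# usage sketch: poll until schema_agreed(client.describe_schema_versions())
```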

[]'s
Rafael




Re: List all keys with RandomPartitioner

2012-02-22 Thread Rafael Almeida
>
> From: Franc Carter 
>To: user@cassandra.apache.org 
>Sent: Wednesday, February 22, 2012 9:24 AM
>Subject: Re: List all keys with RandomPartitioner
> 
>
>On Wed, Feb 22, 2012 at 8:47 PM, Flavio Baronti  
>wrote:
>
>>I need to iterate over all the rows in a column family stored with 
>>RandomPartitioner.
>>When I reach the end of a key slice, I need to find the token of the last key 
>>in order to ask for the next slice.
>>I saw in an old email that the token for a specific key can be recovered 
>>through FBUtilities.hash(). That class however is inside the full Cassandra 
>>jar, not inside the client-specific part.
>>Is there a way to iterate over all the keys which does not require the 
>>server-side Cassandra jar?
>>
>
>
>Does this help ?
>
>
> http://wiki.apache.org/cassandra/FAQ#iter_world


I don't get it. It says to use the last key read as the start key, but what
should be used as the end key?
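The FAQ's answer, restated: the end key stays empty the whole time; only the start key advances, and the first row of each page after the first is a repeat of the previous page's last key. A sketch of the loop (illustrative only; `fetch_range` stands in for a real get_range_slices call):

```python
def iterate_all_rows(fetch_range, page_size=100):
    """Yield every row once, paging with last-key-as-next-start.

    fetch_range(start_key, end_key, count) stands in for a
    get_range_slices call; with RandomPartitioner the end key stays ''
    (empty) throughout -- only the start key advances.
    """
    start = ''
    first_page = True
    while True:
        rows = fetch_range(start, '', page_size)
        if not first_page:
            rows = rows[1:]          # first row repeats the previous last key
        if not rows:
            return
        for key, columns in rows:
            yield key, columns
        start = rows[-1][0]          # last key becomes the next start key
        first_page = False
```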


Re: Please advise -- 750MB object possible?

2012-02-22 Thread Rafael Almeida
Keep them where?



>
> From: Mohit Anchlia 
>To: user@cassandra.apache.org 
>Cc: potek...@bnl.gov 
>Sent: Wednesday, February 22, 2012 3:44 PM
>Subject: Re: Please advise -- 750MB object possible?
> 
>
>In my opinion, if you are a busy site or application, keep blobs out of the 
>database.
>
>
>On Wed, Feb 22, 2012 at 9:37 AM, Dan Retzlaff  wrote:
>
>Chunking is a good idea, but you'll have to do it yourself. A few of the 
>columns in our application got quite large (maybe ~150MB) and the failure mode 
>was RPC timeout exceptions. Nodes couldn't always move that much data across 
>our data center interconnect in the default 10 seconds. With enough heap and a 
>faster network you could probably get by without chunking, but it's not ideal. 
>>
>>
>>
>>On Wed, Feb 22, 2012 at 9:04 AM, Maxim Potekhin  wrote:
>>
>>Hello everybody,
>>>
>>>I'm being asked whether we can serve an "object", which I assume is a blob, 
>>>of 750MB size?
>>>I guess the real question is how to chunk it and/or whether it's even 
>>>possible to chunk it.
>>>
>>>Thanks!
>>>
>>>Maxim
>>>
>>>
>>
>
>
>
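The chunking Dan describes above usually means splitting the blob into fixed-size pieces stored under derived keys (or as separate columns in one row), so no single transfer approaches the RPC timeout. A sketch of both directions (the key scheme and manifest shape are made up for illustration):

```python
def make_chunks(blob, chunk_size=1024 * 1024):
    """Split a blob into (chunk_key, bytes) pairs plus a manifest.

    Each piece stays well under the timeout-sized transfers the thread
    warns about; the reader reassembles them in key order, so the keys
    are zero-padded to sort correctly.
    """
    count = (len(blob) + chunk_size - 1) // chunk_size
    chunks = [('chunk-%06d' % i, blob[i * chunk_size:(i + 1) * chunk_size])
              for i in range(count)]
    manifest = {'count': count, 'size': len(blob)}
    return manifest, chunks

def reassemble(manifest, chunks):
    """Concatenate the chunks back, checking the total length."""
    data = b''.join(part for _, part in sorted(chunks))
    assert len(data) == manifest['size']
    return data
```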