Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread tijoriwala.ritesh

If it cannot protect against lost updates, isn't that an issue? How is the
client supposed to protect against concurrency? I see a lot of users mentioning
the use of Cages (i.e. using ZooKeeper), but involving locks on every write at
the application level is certainly not acceptable. And again, the application
will end up using vector clocks anyway. IMHO, this support should be built
into Cassandra, especially since it provides all the knobs for the client to
choose the right consistency level. So if the client chooses R + W > N, then it
should be possible for Cassandra to detect conflicts.
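The R + W > N condition can be sketched as a simple overlap check (illustrative Python, not Cassandra code; the function name is made up for the example):

```python
# Illustrative sketch: why R + W > N guarantees that a read touches at
# least one replica that received the latest write.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every set of R replicas read must intersect every set of
    W replicas written; the two sets share at least r + w - n replicas."""
    return r + w - n >= 1

assert quorums_overlap(n=3, r=2, w=2)       # QUORUM read + QUORUM write
assert not quorums_overlap(n=3, r=1, w=1)   # ONE + ONE: stale reads possible
```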
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/New-Chain-for-Does-Cassandra-use-vector-clocks-tp6058892p6059594.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Fill disks more than 50%

2011-02-24 Thread Thibaut Britz
Hi,

How would you use rsync instead of repair in case of a node failure?

Rsync all files in the data directories from the adjacent nodes
(which are part of the quorum group) and then run a compaction, which
will then remove all the unneeded keys?

Thanks,
Thibaut


On Thu, Feb 24, 2011 at 4:22 AM, Edward Capriolo  wrote:
> On Wed, Feb 23, 2011 at 9:39 PM, Terje Marthinussen
>  wrote:
>> Hi,
>> Given that you have always-increasing key values (timestamps) and never
>> delete and hardly ever overwrite data.
>> If you want to minimize work on rebalancing and statically assign (new)
>> token ranges to new nodes as you add them, so they always get the latest
>> data.
>> Let's say you add a new node each year to handle next year's data.
>> In a scenario like this, could you, with 0.7, safely fill disks
>> significantly more than 50% and still manage things like repair/recovery of
>> faulty nodes?
>>
>> Regards,
>> Terje
>
> All your data for a day/month/year would sit on the same server,
> meaning all your servers with old data would be idle and your servers
> with current data would be very busy. This is probably not a good way
> to go.
>
> There is a ticket open for 0.8 for efficient node moves/joins. It is
> already a lot better in 0.7. Pretend you did not see this (you can
> join nodes using rsync if you know some tricks) if you are really
> afraid of joins, which you really should not be.
>
> As for the 50% statement: in a worst-case scenario a major compaction
> will require double the disk size of your column family. So if you
> have more than one column family you do NOT need 50% overhead.
>


Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-24 Thread Daniel van Ham Colchete
Himanshi,

you could try adding your public IP address to an internal interface and
DNAT-ing the packets to it. This shouldn't give you any problems with your
normal traffic. Tell Cassandra to listen on the public IPs and it should
work.

Linux commands would be:

# Create an internal interface using bridge-utils
brctl addbr cassth0

# add the ip
ip addr add dev cassth0 50.18.60.117/32

# DNAT incoming connections
iptables -t nat -A PREROUTING -p tcp --dport 7000 -d INTERNALIP -j DNAT
--to-destination 50.18.60.117

# SNAT outgoing connections
iptables -t nat -A OUTPUT -p tcp --dport 7000 -d 175.41.143.192 -j SNAT
--to-source INTERNALIP

This should work since Amazon re-SNATs your outgoing packets to your
public IP again, so the other Cassandra instance will see your public IP as
the source address.

I didn't test this setup here, but it should work unless I forgot some small
detail. If you need to troubleshoot, use the command "tcpdump -i INTERFACE -n
port 7000", where INTERFACE is your public interface or your cassth0.

Please let me know if it worked.

Best regards,
Daniel Colchete

On Thu, Feb 24, 2011 at 4:04 AM, Himanshi Sharma wrote:

> Giving the private IP to rpc_address gives the same exception,
> and keeping it blank while providing the public IP to listen_address also
> fails. I tried keeping both blank and did telnet on 7000; I get the following output:
>
> [root@ip-10-166-223-150 bin]# telnet 122.248.193.37 7000
> Trying 122.248.193.37...
> Connected to 122.248.193.37.
> Escape character is '^]'.
>
> Similarly from another machine
>
> [root@ip-10-136-75-201 bin]# telnet 184.72.22.87 7000
> Trying 184.72.22.87...
> Connected to 184.72.22.87.
> Escape character is '^]'.
>
>
>
> -Dave Viner wrote: -
>
> To: user@cassandra.apache.org
> From: Dave Viner 
> Date: 02/24/2011 11:59AM
> cc: Himanshi Sharma 
>
> Subject: Re: Cassandra nodes on EC2 in two different regions not
> communicating
>
> Try using the private ipv4 address in the rpc_address field, and the public
> ipv4 (NOT the elastic ip) in the listen_address.
>
> If that fails, go back to rpc_address empty, and start up cassandra.
>
> Then from the other node, please telnet to port 7000 on the first node.
>  And show the output of that session in your reply.
>
> I haven't actually constructed a cross-region cluster nor have I used v0.7,
> but this really sounds like it should be easy.
>
> On Wed, Feb 23, 2011 at 10:22 PM, Himanshi Sharma < himanshi.sha...@tcs.com
> > wrote:
>
>> Hi Dave,
>>
>> I tried with the public IPs. If I put the public IP in the rpc_address
>> field, Cassandra gives the same exception, but if I leave it blank then
>> Cassandra runs; however, nodetool with the ring option still doesn't
>> show the node in the other region.
>>
>> Thanks,
>> Himanshi
>>
>>
>> -Dave Viner wrote: -
>>
>> To: user@cassandra.apache.org
>> From: Dave Viner < davevi...@gmail.com >
>> Date: 02/24/2011 10:43AM
>>
>> Subject: Re: Cassandra nodes on EC2 in two different regions not
>> communicating
>>
>> That looks like it's not an issue of communicating between nodes.  It
>> appears that the node can not bind to the address on the localhost that
>> you're asking for.
>>
>> " java.net.BindException: Cannot assign requested address  "
>>
>> I think the issue is that the Elastic IP address is not actually an IP
>> address that's on the localhost.  So the daemon can not bind to that IP.
>>  Instead of using the EIP, use the local IP address for the rpc_address (i
>> think that's what you need since that is what Thrift will bind to).  Then
>> for the listen_address should be the ip address that is routable from the
>> other node.  I would first try with the actual public IP address (not the
>> Elastic IP).  Once you get that to work, then shutdown the cluster, change
>> the listen_address to the EIP, boot up and try again.
>>
>> Dave Viner
>>
>>
>> On Wed, Feb 23, 2011 at 8:54 PM, Himanshi Sharma < himanshi.sha...@tcs.com
>> > wrote:
>>
>>>
>>> Hey Dave,
>>>
>>> Sorry, I forgot to mention the non-seed configuration.
>>>
>>> For the first node, in us-west, it is as below, i.e. its own Elastic IP:
>>>
>>> listen_address: 50.18.60.117
>>> rpc_address: 50.18.60.117
>>>
>>> and for the second node, in ap-southeast-1, it is as below, i.e. again its
>>> own Elastic IP:
>>>
>>> listen_address: 175.41.143.192
>>> rpc_address: 175.41.143.192
>>>
>>> Thanks,
>>> Himanshi
>>>
>>>
>>>
>>>
>>>
>>> From: Dave Viner < davevi...@gmail.com >
>>> To: user@cassandra.apache.org
>>> Date: 02/23/2011 11:01 PM
>>> Subject: Re: Cassandra nodes on EC2 in two different regions not communicating
>>> --
>>>
>>>
>>>
>>> internal EC2 ips (10.xxx.xxx.xxx) work across availability zones (e.g.,
>>> from us-east-1a to us-east-1b) but do not work across regions (e.g., us-east
>>> to us-west).  To do regions, you must use the public ip address assigned by
>>> amazon.
>>>
>>> Himanshi, when you log into 1 node, and telnet to port 7000 on the other
>>> node, which IP address di

Re: Understanding Indexes

2011-02-24 Thread Javier Canillas
I don't think I got the point of your question. But if you are thinking
about key indexes (like PKs), keep in mind that Cassandra manages
keys using the partitioner. By doing so, it is able to
determine on which node the row with a given key should be held.
So, in other words, inside Cassandra each column family is treated
as a big table (a hashtable). With that in mind, there is no need
for an index by key. Would you put an index over a hashtable's
keys?
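The key-to-node mapping can be sketched roughly like this (an illustrative toy, not Cassandra internals; the helper names are made up): RandomPartitioner hashes the row key with MD5 onto a token ring, and each node owns the range up to its token.

```python
# Toy sketch of hash-based partitioning: the same key always hashes to the
# same node, so no separate index on row keys is needed.
import hashlib
from bisect import bisect_left

def token(key: bytes) -> int:
    return int(hashlib.md5(key).hexdigest(), 16)

def node_for_key(key: bytes, ring):
    """ring: sorted list of (token, node). The key belongs to the first node
    whose token is >= the key's token, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, token(key)) % len(ring)
    return ring[i][1]

ring = sorted((token(n.encode()), n) for n in ["nodeA", "nodeB", "nodeC"])
assert node_for_key(b"abc", ring) == node_for_key(b"abc", ring)
assert node_for_key(b"abc", ring) in ("nodeA", "nodeB", "nodeC")
```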

Sent from my iPhone

On 23/02/2011, at 19:50, mcasandra wrote:

>
> So far my understanding about indexes is that you can create indexes only on
> column values (username in the example below).
>
> Does it make sense to also have an index on the keys that a column family uses
> to store rows (row key "abc" in the example below)? I am thinking that in the
> event rows keep growing, would a search be faster with an index on row keys if
> you want to retrieve, e.g., only "def" out of tons of rows?
>
> UserProfile = { // this is a ColumnFamily
>abc: {   // this is the key to this Row inside the CF
>// now we have an infinite # of columns in this row
>username: "phatduckk",
>email: "phatdu...@example.com",
>phone: "(900) 976-"
>}, // end row
>def: {   // this is the key to another row in the CF
>// now we have another infinite # of columns in this row
>username: "ieure",
>email: "ie...@example.com",
>phone: "(888) 555-1212",
>age: "66",
>gender: "undecided"
>},
> }
>
>
> 2) Is the hash of the column key or of the row key used by RandomPartitioner
> to distribute data across the Cassandra nodes?
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6058238.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Anthony John
My 2 cents ..

1. Focus should be on the core problem Cassandra is solving, i.e.
availability, partitioning, and a form of consistency that works (in spite of
all the questions). All this with high performance is a huge step forward -
architecturally!
2. Any enhancement should shore up the core value proposition, not
detract from it. Specifically, packing every feature into the product might
create an easy-to-use kitchen sink, but also a less nimble behemoth
(no product names here ;))
3. The beauty of open source is the ability to combine different ideas to
solve a problem - with each piece (layer) providing an identified set of
guarantees implemented with the greatest efficiency possible.

Finally, it would be a mistake to try to drive Cassandra in the direction of an
ACID data store, watering down the core value proposition.

But I just talk!

-JA

On Thu, Feb 24, 2011 at 2:46 AM, tijoriwala.ritesh <
tijoriwala.rit...@gmail.com> wrote:

>
> If it cannot protect against lost updates, isn't that an issue? How is the
> client
> supposed to protect against concurrency? I see a lot of users mentioning the
> use of Cages (i.e. using ZooKeeper), but involving locks on every write at
> the
> application level is certainly not acceptable. And again, the application
> will end up using vector clocks anyway. IMHO, this support should be built
> into Cassandra, especially since it provides all the knobs for the client to
> choose the right consistency level. So if the client chooses R + W > N, then
> it should be possible for Cassandra to detect conflicts.
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/New-Chain-for-Does-Cassandra-use-vector-clocks-tp6058892p6059594.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-24 Thread Himanshi Sharma
Thanks Daniel.

But the SNAT command is not working, and when I try tcpdump it gives:

[root@ip-10-136-75-201 ~]# tcpdump -i 50.18.60.117 -n port 7000
tcpdump: Invalid adapter index

Not able to figure out what this means.

Thanks,
Himanshi




From: Daniel van Ham Colchete
To: user@cassandra.apache.org
Date: 02/24/2011 04:27 PM
Subject: Re: Cassandra nodes on EC2 in two different regions not communicating



Himanshi,

you could try adding your public IP address to an internal interface and 
DNAT-ing the packets to it. This shouldn't give you any problems with your 
normal traffic. Tell Cassandra to listen on the public IPs and it should 
work.

Linux commands would be:

# Create an internal interface using bridge-utils
brctl addbr cassth0

# add the ip
ip addr add dev cassth0 50.18.60.117/32

# DNAT incoming connections
iptables -t nat -A PREROUTING -p tcp --dport 7000 -d INTERNALIP -j DNAT 
--to-destination 50.18.60.117

# SNAT outgoing connections
iptables -t nat -A OUTPUT -p tcp --dport 7000 -d 175.41.143.192 -j SNAT 
--to-source INTERNALIP

This should work since Amazon re-SNATs your outgoing packets to your 
public IP again, so the other Cassandra instance will see your public IP 
as the source address.

I didn't test this setup here but it should work unless I forgot some 
small detail. If you need to troubleshoot use the command "tcpdump -i 
INTERFACE -n port 7000" where INTERFACE should be your public interface or 
your cassth0.

Please let me know if it worked.

Best regards,
Daniel Colchete


Re: Understand eventually consistent

2011-02-24 Thread Javier Canillas
First of all, in your example is W = CL?

If so, then the success of any read/write operation will be
determined by whether the required CL can be satisfied at that moment.

If you write with CL ONE on a CF with RF 3 while 1 of the
replicas is down, then the operation will succeed, and HintedHandoff
will manage to propagate the op to the failed node when it comes
back up.

When you execute the same op using CL QUORUM, which means
RF/2 + 1, it will try to write to the coordinator node and the replicas.
Considering only 1 replica is down, the op will succeed too.

Now consider the same op with CL ALL: it will fail, since it can't
ensure that the coordinator and both replicas are updated.

Hope this clarifies the relation between CL and RF.
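The CL-vs-RF rule can be sketched as follows (a simplified model, not driver code; the assumption is that an operation succeeds iff enough replicas are alive to meet the level):

```python
# Sketch: how many live replicas each consistency level needs, and whether
# an operation succeeds given the number of live replicas.
def required(cl: str, rf: int) -> int:
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

def op_succeeds(cl: str, rf: int, live: int) -> bool:
    return live >= required(cl, rf)

rf, live = 3, 2                         # RF=3 with one replica down
assert op_succeeds("ONE", rf, live)
assert op_succeeds("QUORUM", rf, live)  # quorum = 3 // 2 + 1 = 2
assert not op_succeeds("ALL", rf, live)
```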

Sent from my iPhone

On 23/02/2011, at 21:43, mcasandra wrote:

>
> I am reading this again http://wiki.apache.org/cassandra/HintedHandoff and
> got a little confused. This is my understanding of how HH should work, based
> on what I read in the Dynamo paper:
>
> 1) Say nodes A, B, C, D, E are in the cluster in a ring (in that order).
> 2) For a given key K, RF=3.
> 3) Node B holds the hash of that key K, which means when K is written it
> will be written to B (owner of the hash) + C + D, since RF = 3.
> 4) If node D goes down and there is a write again to key K, then this time
> key K's row will be written with W=1 to B (owner) + C + E (HH), since RF=3
> needs to be satisfied. Is this correct?
> 5) In the above scenario where node D is down, if we are reading at W=2 and
> R=2, would it fail even though the original nodes B + C are up? Here I am
> thinking W=2 and R=2 means that 2 nodes that hold the key K are up, so it
> satisfies the CL and thus writes and reads will not fail.
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6058576.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.


My responses to this mailing list interpreted as SPAM

2011-02-24 Thread Anthony John
To the list owners - the error text that gmail comes back with is below

Now I understand that much of what I write is spam quality, so the mail
filter might actually be smart ;).

New posts work, as this one hopefully will. It is on reply that I have a
problem. Any pointers to avoid this situation will be super useful.

Error Text

Delivery to the following recipient failed permanently:

user@cassandra.apache.org

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient
domain. We recommend contacting the other email provider for further
information about the cause of this error. The error that the other server
returned was: 552 552 spam score (5.3) exceeded threshold
(FREEMAIL_FROM,FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL,URI_HEX
(state 18).
--


New thread for : How does Cassandra handle failure during synchronous writes

2011-02-24 Thread Anthony John
>> c. Read with CL = QUORUM. If the read hits node1 and node2/node3, new data
>> that was written to node1 will be returned.

>> In this case - N1 will be identified as a discrepancy and the change will
>> be discarded via read repair

>> [Naren] How will Cassandra know this is a discrepancy?

Because at Q, only N1 will have the "new data" and the other nodes won't.
This lack of consistency on N1 will be detected and repaired. The value that
meets Q - the value from N2-3 - will be returned.

HTH


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Sylvain Lebresne
On Thu, Feb 24, 2011 at 3:22 AM, Anthony John  wrote:

> Apologies : For some reason my response on the original mail keeps bouncing
> back, thus this new one!
> > From the other hand, the same article says:
> > "For conditional writes to work, the condition must be evaluated at all
> update
> > sites before the write can be allowed to succeed."
> >
> > This means, that when doing such an update CL=ALL must be used
>
> Sorry, but I am confused by that entire thread!
>
> Questions:-
> 1. Does Cassandra implement any kind of data locking - at any granularity
> whether it be row/colF/Col ?
>

No locking, no.


> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
> Concurrent updates on exactly the same piece of data on different nodes can
> still mess each other up, right ?
>

Not sure why you are singling out CL.ALL. At any CL, updating the
same piece of data means the same column value. In that case, the resolution
rules are the following:
  - If the updates have different timestamps, keep the one with the higher
timestamp. That is, the more recent of two updates wins.
  - If the timestamps are the same, then it compares the values (byte
comparison) and keeps the highest value. This is just to break ties in a
consistent manner.

So if you do two truly concurrent updates (that is, from two places at the
same instant), then you'll end up with one of the updates. This is the column
level.
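The reconciliation rule above can be sketched in a few lines (a simplified model, not Cassandra's actual code; a column version is modelled as a (timestamp, value-bytes) pair):

```python
# Sketch: higher timestamp wins; timestamp ties are broken by byte
# comparison of the values, so every replica picks the same winner.
def reconcile(a, b):
    if a[0] != b[0]:
        return a if a[0] > b[0] else b   # more recent update wins
    return a if a[1] >= b[1] else b      # tie: higher value bytes win

assert reconcile((10, b"x"), (11, b"y")) == (11, b"y")  # newer wins
assert reconcile((10, b"x"), (10, b"y")) == (10, b"y")  # deterministic tie-break
```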

However, if that simple conflict detection/resolution mechanism is not good
enough for some of your use cases and you need to keep two concurrent
updates, it is easy enough: just make sure that the updates don't end up in
the same column. This is easily achieved by appending some unique identifier
to the column name, for instance. And when reading, do a slice and reconcile
whatever you get back with whatever logic makes sense. If you do that,
congrats, you've roughly emulated what vector clocks would do. Btw, no
locking or anything needed.
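The "unique column name" pattern can be sketched like this (hypothetical helper names; a row is modelled as a plain dict standing in for a Cassandra row, and any client API could do the equivalent):

```python
# Sketch: keep concurrent updates as separate columns, then slice and
# reconcile on read with application logic.
import uuid

def write_concurrent(row: dict, base: str, value) -> None:
    # Each writer appends a unique suffix, so concurrent updates never collide.
    row[base + ":" + uuid.uuid4().hex] = value

def read_slice(row: dict, base: str):
    # Fetch every version sharing the prefix; reconcile in the application.
    return sorted(v for k, v in row.items() if k.startswith(base + ":"))

row = {}
write_concurrent(row, "status", "a")
write_concurrent(row, "status", "b")
assert read_slice(row, "status") == ["a", "b"]  # both updates survive
```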

In my experience, for most things the timestamp resolution is enough. If the
same user updates their profile picture on your web site twice in the same
microsecond, it's usually fine to end up with one of the two pictures. In
the rare case where you need something more specific, using the Cassandra
data model usually solves the problem easily. The reason for not having
vector clocks in Cassandra is that so far, we haven't really found many
examples where that is not the case.

--
Sylvain


Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-24 Thread Daniel van Ham Colchete
Himanshi,

my bad, try this for iptables:

# SNAT outgoing connections
iptables -t nat -A POSTROUTING -p tcp --dport 7000 -d 175.41.143.192 -j SNAT
--to-source INTERNALIP

As for tcpdump, the argument to the -i option is the interface name (eth0,
cassth0, etc.), not the IP. So it should be
tcpdump -i cassth0 -n port 7000
or
tcpdump -i eth0 -n port 7000

I'm assuming your main network card is eth0, which should be the case.

Does it work?

Best,
Daniel

On Thu, Feb 24, 2011 at 9:27 AM, Himanshi Sharma wrote:

>
> Thanks Daniel.
>
> But the SNAT command is not working, and when I try tcpdump it gives:
>
> [root@ip-10-136-75-201 ~]# tcpdump -i 50.18.60.117 -n port 7000
> tcpdump: Invalid adapter index
>
> Not able to figure out what this means.
>
> Thanks,
> Himanshi
>

losing connection to Cassandra

2011-02-24 Thread Tomer B
Hi

I'm using a 3-node cluster of Cassandra 0.6.1 together with Hector as the
Java client API.

Every few days I get a situation where I cannot connect to Cassandra; in
addition, the data dir fills up the whole disk and synchronization stops at
these times. The exceptions I get are as follows:

Happened 3386 times in 24H: POOL EXHAUSTED: 02:00:30.225 [MyThread[5]]: Unable to
connect to cassandra node xxx.xx.xx.32:9160 will try the next node, Pool
exhausted

Happened 6848 times in 24H: CONNECTION REFUSED: 06:14:48.598 [MyThread[4]]: Unable
to connect to cassandra node xxx.xx.xx.30:9160 will try the next node,
Unable to open transport to xxx.xx.xx.30:9160 , java.net.ConnectException:
Connection refused: connect

Happened 84 times in 24H: NULL OUTPUTSTREAM: 06:14:48.504 [MyThread[2]]:
async execution fail, Cannot write to null outputStream

Happened 14 times in 24H: CONNECTION TIMED OUT: 06:15:08.019 [MyThread[0]]:
Unable to connect to cassandra node xxx.xx.xx.31:9160 will try the next
node, Unable to open transport to xxx.xx.xx.31:9160,
java.net.ConnectException: Connection timed out: connect

Can anyone assist or suggest what could be the problem? Note that the node is
otherwise functioning well and this happens once every few days.


Re: Fill disks more than 50%

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 4:08 AM, Thibaut Britz
 wrote:
> Hi,
>
> How would you use rsync instead of repair in case of a node failure?
>
> Rsync all files in the data directories from the adjacent nodes
> (which are part of the quorum group) and then run a compaction, which
> will then remove all the unneeded keys?
>
> Thanks,
> Thibaut
>
>
@Thibaut Britz
Caveat: this assumes SimpleStrategy.
This works because Cassandra scans its data directories at startup and then
serves what it finds. For a join, for example, you can rsync all the data from
the node below/to the right of where the new node is joining, then
join without bootstrap, then run cleanup on both nodes. (Also, you have to
shut down the first node so you do not have a lost-write scenario in
the time between the rsync and the new node's startup.)

It does not make as much sense for repair, because the data on a node
will triple before you compact/clean it up.
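On the "50%" point in this thread, the arithmetic can be sketched as follows (an assumption-laden back-of-the-envelope, under the stated model that a major compaction may temporarily need up to roughly the size of the single column family being compacted, and that CFs are compacted one at a time):

```python
# Rough headroom sketch: the free space needed is about the size of the
# largest single column family, not half the disk.
def headroom_needed_gb(cf_sizes_gb):
    return max(cf_sizes_gb)

cfs = [400, 150, 50]                    # three CFs on a 1000 GB disk
assert headroom_needed_gb(cfs) == 400   # free space for the largest CF
assert sum(cfs) + headroom_needed_gb(cfs) <= 1000  # >50% full is workable
```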

@Terje
I am suggesting that you probably want to rethink your schema design,
since partitioning by year is going to perform badly: the old servers
are going to be nothing more than expensive tape drives.


Re: My responses to this mailing list interpreted as SPAM

2011-02-24 Thread Sasha Dolgy
have you tried replying without copying in the entire conversation
thread to the message?

On Thu, Feb 24, 2011 at 1:40 PM, Anthony John  wrote:
> To the list owners - the error text that gmail comes back with is below
> Now I understand that much of what I write is spam quality, so the mail
> filter might actually be smart ;).
> New posts works, as this one hopefully will. If is on reply that I have a
> problem. Any pointers to avoid this situation will be super useful.


Re: My responses to this mailing list interpreted as SPAM

2011-02-24 Thread Anthony John
Do not copy the entire thread, only hit reply!

It seems that as the thread grows in responses, the spam word count somehow
kicks in.

Thx,

-JA



Non-latin implementation

2011-02-24 Thread A J
Hello,
Have there been Cassandra implementations in non-latin languages. In
particular: Mandarin (China) ,Devanagari (India), Korean (Korea)
I am interested in finding if there are storage, sorting or other
types of issues one should be aware of in these languages.

Thanks.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Anthony John
>
> >>Time stamps are not used for conflict resolution - unless is is part of
> the application logic!!!
>

>>What is you definition of conflict resolution ? Because if you update
twice the same column (which
>>I'll call a conflict), then the timestamps are used to decide which update
wins (which I'll call a resolution).

I understand what you are saying, and yes semantics is very important here.
And yes we are responding to the immediate questions without covering all
questions in the thread.

The point being made here is that the timestamp of the column is not used by
Cassandra to figure out what data to return.

E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3.
A Quorum write comes in and adds/updates the timestamp (TS2) of a particular
data element. It succeeds on N1 - fails on N2/3. So the write is returned as
failed - right?
Now a Quorum read comes in for exactly the same piece of data that the write
failed for.
So N1 has TS2 but both N2/3 have the old TS (say TS1).
And the read succeeds - will it return TS1 or TS2?

I submit it will return TS1 - the old TS.

Are we on the same page with this interpretation ?

Regards,

-JA

On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne wrote:

> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John wrote:
>
>> Sylvan,
>>
>> Time stamps are not used for conflict resolution - unless is is part of
>> the application logic!!!
>>
>
> What is you definition of conflict resolution ? Because if you update twice
> the same column (which
> I'll call a conflict), then the timestamps are used to decide which update
> wins (which I'll call a resolution).
>
>
>> You can have "lost updates" w/Cassandra. You need to to use 3rd products -
>> cages for e.g. - to get ACID type consistency.
>>
>
> Then again, you'll have to define what you are calling "lost updates".
> Provided you use a reasonable consistency level, Cassandra provides fairly
> strong durability guarantee, so for some definition you don't "lose
> updates".
>
> That being said, I never pretended that Cassandra provided any ACID
> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
> we're talking about the guarantees of transaction, then by all means,
> cassandra won't provide it. And yes you can use cages or the like to get
> transaction. But that was not the point of the thread, was it ? The thread
> is about vector clocks, and that has nothing to do with transaction (vector
> clocks certainly don't give you transactions).
>
> Sorry if I wasn't clear in my mail, but I was only responding to why so far
> I don't think vector clocks would really provide much for Cassandra.
>
> --
> Sylvain
>
>
>> -JA
>>
>>
>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne 
>> wrote:
>>
>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John wrote:
>>>
 Apologies : For some reason my response on the original mail keeps
 bouncing back, thus this new one!
 > From the other hand, the same article says:
 > "For conditional writes to work, the condition must be evaluated at
 all update
 > sites before the write can be allowed to succeed."
 >
 > This means, that when doing such an update CL=ALL must be used

 Sorry, but I am confused by that entire thread!

 Questions:-
 1. Does Cassandra implement any kind of data locking - at any
 granularity whether it be row/colF/Col ?

>>>
>>> No locking, no.
>>>
>>>
 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
 Concurrent updates on exactly the same piece of data on different nodes can
 still mess each other up, right ?

>>>
>>> Not sure why you are taking CL.ALL specifically. But in any CL, updating
>>> the same piece of data means the same column value. In that case, the
>>> resolution rules are the following:
>>>- If the updates have a different timestamp, keep the one with the
>>> higher timestamp. That is, the more recent of two updates win.
>>>   - It the timestamps are the same, then it compares the values (byte
>>> comparison) and keep the highest value. This is just to break ties in a
>>> consistent manner.
>>>
>>> So if you do two truly concurrent updates (that is from two place at the
>>> same instant), then you'll end with one of the update. This is the column
>>> level.
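Those two rules can be sketched in a few lines of Python (an illustration of the resolution order described above, not Cassandra's actual code; a column version is modeled as a (timestamp, value) pair):

```python
def reconcile(a, b):
    # a, b: (timestamp, value_bytes) versions of the same column.
    if a[0] != b[0]:
        return a if a[0] > b[0] else b   # rule 1: higher timestamp wins
    return a if a[1] >= b[1] else b      # rule 2: tie broken by byte value

# Same timestamp, so the tie-break picks the higher byte string:
winner = reconcile((100, b"alice"), (100, b"bob"))   # -> (100, b"bob")
```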
>>>
>>> However, if that simple conflict detection/resolution mechanism is not
>>> good enough for some of your use case and you need to keep two concurrent
>>> updates, it is easy enough. Just make sure that the update don't end up in
>>> the same column. This is easily achieved by appending some unique identifier
>>> to the column name for instance. And when reading, do a slice and reconcile
>>> whatever you get back with whatever logic make sense. If you do that,
>>> congrats, you've roughly emulated what vector clocks would do. Btw, no
>>> locking or anything needed.
>>>
>>> In my experience, for most things the timestamp resolution is enough. If
>>> the same user update twice it's profile picture on 
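The column-per-update emulation described above can be sketched like this (a plain dict stands in for a Cassandra row, and the names are made up for illustration):

```python
import uuid

def write_version(row, column, value):
    # Append a unique suffix so concurrent updates land in distinct
    # columns instead of resolving against each other.
    row["%s:%s" % (column, uuid.uuid4())] = value

def read_versions(row, column):
    # Slice out every surviving version; reconcile with app-specific logic.
    return sorted(v for k, v in row.items() if k.startswith(column + ":"))

row = {}
write_version(row, "profile_pic", "a.png")
write_version(row, "profile_pic", "b.png")
read_versions(row, "profile_pic")   # -> ['a.png', 'b.png']: both survive
```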

Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Sylvain Lebresne
On Thu, Feb 24, 2011 at 5:34 PM, Anthony John  wrote:

> >>Time stamps are not used for conflict resolution - unless is is part of
>> the application logic!!!
>>
>
> >>What is you definition of conflict resolution ? Because if you update
> twice the same column (which
> >>I'll call a conflict), then the timestamps are used to decide which
> update wins (which I'll call a resolution).
>
> I understand what you are saying, and yes semantics is very important here.
> And yes we are responding to the immediate questions without covering all
> questions in the thread.
>
> The point being made here is that the timestamp of the column is not used
> by Cassandra to figure out what data to return.
>

Not quite true.


> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
> A Quorum  Write comes and add/updates the time stamp (TS2) of a particular
> data element. It succeeds on N1 - fails on N2/3. So the write is returned as
> failed - right ?
> Now Quorum read comes in for exactly the same piece of data that the write
> failed for.
> So N1 has TS2 but both N2/3 have the old TS (say TS1)
> And the read succeeds - Will it return TS1 or TS2.
>
> I submit it will return TS1 - the old TS.
>

It all depends on which (first 2) nodes respond to the read (since RF=3,
that can be any two of N1/N2/N3). If N1 is part of the two that make the
quorum, then TS2 will be returned, because Cassandra will compare the
timestamps and decide what to return based on this. If N2/N3 respond,
however, both timestamps will be TS1 and so, after timestamp resolution, it
will still be TS1 that is returned.
So yes, timestamps are used for conflict resolution.

In your example, you could get TS1 back because a failed write can leave
your cluster in an inconsistent state. You'd have to retry the quorum write,
and only when it succeeds are you guaranteed that a quorum read will always
return TS2.

This is because when a write fails, Cassandra doesn't guarantee that the
write did not make it in (there is no revert).



Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Dave Revell
>Time stamps are not used for conflict resolution - unless is is part of the
application logic!!!

This is false. In fact, the main reason Cassandra keeps timestamps is to do
conflict resolution. If there is a conflict between two replicas, when doing
a read or a repair, then the highest timestamp always wins.

Example: say your replication factor is 5. If you read at CL ALL, you
will ask 5 replicas for their value. If the value from only one of these
replicas has a timestamp newer than all the rest, that is the value that
will be returned to the client. There is no "voting" scheme where the
most common value wins; conflict resolution is based ONLY on the most
recent timestamp.

(irrelevant aside: in the above example, read repair would occur at the end,
after the different values were detected by the coordinating server)

Clients are free to use the timestamps for their own purposes, but clients
must be careful to choose timestamps that make Cassandra do the right thing
during conflict resolution.
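The "newest timestamp wins, no voting" behavior is easy to illustrate (a sketch of the rule described above, not Cassandra's code; replica responses are modeled as (timestamp, value) pairs):

```python
def resolve_read(replica_responses):
    # The single newest version wins, regardless of how many replicas
    # hold an older value -- there is no majority vote.
    return max(replica_responses, key=lambda version: version[0])

# Four stale replicas and one fresh one at RF=5, read at CL ALL:
responses = [(1, "old")] * 4 + [(2, "new")]
resolve_read(responses)   # -> (2, 'new')
```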

Best,
Dave


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Anthony John
If you are correct - and you are probably closer to the code - then a CL of
Quorum does not guarantee consistency.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Sylvain Lebresne
On Thu, Feb 24, 2011 at 6:01 PM, Anthony John  wrote:

> If you are correct and you are probably closer to the code - then CL of
> Quorum does not guarantee a consistency.


If the operation succeeds, it does (for some definition of consistency,
which is: following reads at Quorum are guaranteed to see the new value of
an update at Quorum). If it fails, then no, it does not guarantee
consistency.

It is important to note that the word consistency has multiple meanings. In
particular, when we talk of consistency in Cassandra, we are not talking
about the same definition as the C in ACID (see:
http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Anthony John
Completely understand!

All that I am quibbling over is whether a CL of Quorum guarantees
consistency or not. That is what the documentation says - right? If, for a
CL of Quorum read, it depends on which node returns first to determine the
actual result - or on other more convoluted conditions - then a Quorum
read/write is not consistent, by any definition.

I can still use Cassandra, and will use it, luv it!!! But let us not make
this statement in the Wiki architecture section:-

More specifically: R = read replica count, W = write replica count,
N = replication factor, Q = QUORUM (Q = N / 2 + 1)

   - If W + R > N, you will have consistency
      - W=1, R=N
      - W=N, R=1
      - W=Q, R=Q where Q = N / 2 + 1

Cassandra provides consistency when R + W > N (read replica count + write
replica count > replication factor).
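The condition quoted above is a pigeonhole argument: if R + W > N, any set of R read replicas must overlap any set of W written replicas, so the read sees the latest write. A quick sanity check (illustrative only):

```python
def overlapping(r, w, n):
    # If r + w > n, an r-subset and a w-subset of n replicas must
    # share at least one node (pigeonhole principle).
    return r + w > n

n = 3
q = n // 2 + 1                 # QUORUM = 2 when N = 3
overlapping(q, q, n)           # -> True:  W=Q, R=Q gives overlap
overlapping(1, 1, n)           # -> False: ONE/ONE reads can miss writes
```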


Re: Understanding Indexes

2011-02-24 Thread mcasandra

Generally no. But yes, if retrieving the key through an index is faster
than going through the hash buckets.

Currently I am thinking there could be hundreds of millions or billions of
rows, and in that case, to retrieve a row, which will be faster: going
through the hash bucket or through the index? I am thinking in such a
scenario the index would be faster. Please help me understand where I am
going wrong. An example would be helpful.
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061197.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Sylvain Lebresne
On Thu, Feb 24, 2011 at 6:33 PM, Anthony John  wrote:

> Completely understand!
>
> All that I am quibbling over is whether a CL of quorum guarantees
> consistency or not. That is what the documentation says - right. IF for a CL
> of Q read - it depends on which node returns read first to determine the
> actual returned result or other more convoluted conditions , then a Quorum
> read/write is not consistent, by any definition.
>

But that's the point. The definition of consistency we are talking about has
no meaning if you consider only a quorum read. The definition (which is the
de facto definition of consistency in 'eventually consistent') makes sense if
we talk about a write followed by a read - that is, a
succeeding write followed by a succeeding read.
And that is the statement the wiki is making.

Honestly, we could debate forever on the definition of consistency and
whatnot. Cassandra guarantees that if you do a (succeeding) write on W
replicas and then a (succeeding) read on R replicas, and if R+W>N, then it is
guaranteed that the read will see the preceding write. And this is what is
called consistency in the context of eventual consistency (which is not the
context of ACID).

If this is not the definition of consistency you had in mind, then by all
means, Cassandra probably doesn't guarantee that definition. But given that the
paragraph preceding what you pasted states clearly that we are not talking about
ACID consistency, but eventual consistency, I don't think the wiki is making
any unfair statement.

That being said, the wiki may not be always as clear as it could. But it's
an editable wiki :)

--
Sylvain


>
> I can still use Cassandra, and will use it, luv it!!! But let us not make
> this statement on the Wiki architecture section:-
>
> -
>
> More specifically: R = read replica count, W = write replica count,
> N = replication factor, Q = *QUORUM* (Q = N / 2 + 1)
>
>    - If W + R > N, you will have consistency:
>       - W=1, R=N
>       - W=N, R=1
>       - W=Q, R=Q where Q = N / 2 + 1
>
> Cassandra provides consistency when R + W > N (read replica count + write
> replica count > replication factor).
>
> 
>
>
> .
>
>
> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne 
> wrote:
>
>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John wrote:
>>
>>> If you are correct - and you are probably closer to the code - then a CL of
>>> Quorum does not guarantee consistency.
>>
>>
>> If the operation succeeds, it does (for some definition of consistency,
>> which is: following reads at Quorum will be guaranteed to see the new value
>> of an update at quorum). If it fails, then no, it does not guarantee
>> consistency.
>>
>> It is important to note that the word consistency has multiple meanings. In
>> particular, when we are talking of consistency in Cassandra, we are not
>> talking of the same definition as the C in ACID (see:
>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>
>>>
>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne >> > wrote:
>>>
 On Thu, Feb 24, 2011 at 5:34 PM, Anthony John wrote:

>  >>Time stamps are not used for conflict resolution - unless it is
>> part of the application logic!!!
>>
>
> >>What is your definition of conflict resolution? Because if you update
> twice the same column (which
> >>I'll call a conflict), then the timestamps are used to decide which
> update wins (which I'll call a resolution).
>
> I understand what you are saying, and yes semantics is very important
> here. And yes we are responding to the immediate questions without 
> covering
> all questions in the thread.
>
> The point being made here is that the timestamp of the column is not
> used by Cassandra to figure out what data to return.
>

 Not quite true.


> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
> A Quorum  Write comes and add/updates the time stamp (TS2) of a
> particular data element. It succeeds on N1 - fails on N2/3. So the write 
> is
> returned as failed - right ?
> Now Quorum read comes in for exactly the same piece of data that the
> write failed for.
> So N1 has TS2 but both N2/3 have the old TS (say TS1)
> And the read succeeds - Will it return TS1 or TS2.
>
> I submit it will return TS1 - the old TS.
>

 It all depends on which (first 2) nodes respond to the read (since RF=3,
 that can be any two of N1/N2/N3). If N1 is part of the two that makes the
 quorum, then TS2 will be returned, because Cassandra will compare the
 timestamps and decide what to return based on this. If N2/N3 respond
 first, however, both timestamps will be TS1 and so, after timestamp resolution,
 it will still be TS1 that will be returned.
 So yes timestamp is used for conflict resolution.
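A minimal sketch of that read-time reconciliation (hypothetical tuples for illustration, not Cassandra's internal representation): each responding replica returns a (timestamp, value) pair and the coordinator keeps the one with the highest timestamp.

```python
def resolve(responses):
    """Pick the (timestamp, value) pair with the highest timestamp."""
    return max(responses, key=lambda tv: tv[0])

# Sylvain's example: N1 holds the new write (TS2); N2/N3 still hold TS1.
n1, n2, n3 = (2, "new"), (1, "old"), (1, "old")

assert resolve([n1, n2]) == (2, "new")  # N1 in the responding quorum: TS2 wins
assert resolve([n2, n3]) == (1, "old")  # only N2/N3 answer: stale TS1 returned
```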

 In your example, you could get TS1 back.

Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-24 Thread Dave Viner
Another possibility is this:

why not setup 2 nodes in 1 region in 1 az, and get that to work.
Then, open a third node in the same region, but different AZ, and get that
to work.
Then, once you have that working, open a fourth node in a different region
and get that to work.

Seems like taking a piece-meal approach would be beneficial here.

Dave Viner


On Thu, Feb 24, 2011 at 6:11 AM, Daniel van Ham Colchete <
daniel.colch...@gmail.com> wrote:

> Himanshi,
>
> my bad, try this for iptables:
>
> # SNAT outgoing connections
> iptables -t nat -A POSTROUTING -p tcp --dport 7000 -d 175.41.143.192 -j
> SNAT --to-source INTERNALIP
>
> As for tcpdump the argument for the -i option is the interface name (eth0,
> cassth0, etc...), and not the IP. So, it should be
> tcpdump -i cassth0 -n port 7000
> or
> tcpdump -i eth0 -n port 7000
>
> I`m assuming your main network card is eth0, but that should be the case.
>
> Does it work?
>
> Best,
> Daniel
>
>
> On Thu, Feb 24, 2011 at 9:27 AM, Himanshi Sharma 
> wrote:
>
>>
>> Thanks Daniel.
>>
>> But SNAT command is not working and when i try tcpdump it gives
>>
>> [root@ip-10-136-75-201 ~]# tcpdump -i 50.18.60.117 -n port 7000
>> tcpdump: Invalid adapter index
>>
>> Not able to figure out what this is??
>>
>> Thanks,
>> Himanshi
>>
>>
>>
>>  From: Daniel van Ham Colchete  To:
>> user@cassandra.apache.org Date: 02/24/2011 04:27 PM Subject: Re:
>> Cassandra nodes on EC2 in two different regions not communicating
>> --
>>
>>
>>
>> Himanshi,
>>
>> you could try adding your public IP address to an internal interface and
>> DNAT the packets to it. This shouldn't give you any problems with your
>> normal traffic. Tell Cassandra to listen on the public IPs and it should
>> work.
>>
>> Linux commands would be:
>>
>> # Create an internal interface using bridge-utils
>> brctl addbr cassth0
>>
>> # add the ip
>> ip addr add dev cassth0 *50.18.60.117/32* 
>>
>> # DNAT incoming connections
>> iptables -t nat -A PREROUTING -p tcp --dport 7000 -d INTERNALIP -j DNAT
>> --to-destination 50.18.60.117
>>
>> # SNAT outgoing connections
>> iptables -t nat -A OUTPUT -p tcp --dport 7000 -d 175.41.143.192 -j SNAT
>> --to-source INTERNALIP
>>
>> This should work since Amazon will re-SNAT your outgoing packets to your
>> public IP again, so the other Cassandra instance will see your public IP as
>> your source address.
>>
>> I didn't test this setup here but it should work unless I forgot some
>> small detail. If you need to troubleshoot use the command "tcpdump -i
>> INTERFACE -n port 7000" where INTERFACE should be your public interface or
>> your cassth0.
>>
>> Please let me know if it worked.
>>
>> Best regards,
>> Daniel Colchete
>>
>> On Thu, Feb 24, 2011 at 4:04 AM, Himanshi Sharma <*
>> himanshi.sha...@tcs.com* > wrote:
>> Giving the private IP to rpc_address gives the same exception,
>> and keeping it blank and providing the public IP to listen_address also fails. I
>> tried keeping both blank and did telnet on 7000, so I get the following o/p
>>
>> [root@ip-10-166-223-150 bin]# telnet 122.248.193.37 7000
>> Trying 122.248.193.37...
>> Connected to 122.248.193.37.
>> Escape character is '^]'.
>>
>> Similarly from another machine
>>
>> [root@ip-10-136-75-201 bin]# telnet 184.72.22.87 7000
>> Trying 184.72.22.87...
>> Connected to 184.72.22.87.
>> Escape character is '^]'.
>>
>>
>>
>> -Dave Viner wrote: -
>> To: *user@cassandra.apache.org* 
>> From: Dave Viner <*davevi...@gmail.com* >
>> Date: 02/24/2011 11:59AM
>> cc: Himanshi Sharma <*himanshi.sha...@tcs.com* >
>>
>> Subject: Re: Cassandra nodes on EC2 in two different regions not
>> communicating
>>
>> Try using the private ipv4 address in the rpc_address field, and the
>> public ipv4 (NOT the elastic ip) in the listen_address.
>>
>> If that fails, go back to rpc_address empty, and start up cassandra.
>>
>> Then from the other node, please telnet to port 7000 on the first node.
>>  And show the output of that session in your reply.
>>
>> I haven't actually constructed a cross-region cluster nor have I used
>> v0.7, but this really sounds like it should be easy.
>>
>> On Wed, Feb 23, 2011 at 10:22 PM, Himanshi Sharma < *himanshi.sha...@tcs.com
>> * > wrote:
>> Hi Dave,
>>
>> I tried with the public ips. If i mention the public ip in rpc address
>> field, Cassandra gives the same exception but if leave it blank then
>> Cassandra runs but again in the nodetool command with ring option it does'nt
>> show the node in another region.
>>
>> Thanks,
>> Himanshi
>>
>>
>> -Dave Viner wrote: -
>> To: *user@cassandra.apache.org * 
>> From: Dave Viner < *davevi...@gmail.com * >
>> Date: 02/24/2011 10:43AM
>>
>> Subject: Re: Cassandra nodes on EC2 in two different regions not
>> communicating
>>
>> That looks like it's not an issue of communicating between nodes.  It
>> appears that the node can not bind to the address on the localhost that
>> you're asking for.
>>
>> " java.net.BindException: Cannot assign requested address"

Re: Understanding Indexes

2011-02-24 Thread Ed Anuff
If you mean does it make sense to have a CF where each row contains a set of
keys to other rows in another CF, then yes, that's a common design pattern,
although usually it's because you're creating collections of those rows
(i.e. a Groups CF where each row consists of a set of keys to rows in the
Users CF).  Not sure if that's what you're getting at, though.

On Thu, Feb 24, 2011 at 9:34 AM, mcasandra  wrote:

>
> Generally no. But yes if retrieving the key through index is faster than
> going through the hash buckets.
>
> Currently I am thinking there could be 100s of million or billion of rows
> and in that case if we have to retrieve a row which one will be fast going
> through hash bucket or index? I am thinking in such scenario Index would be
> faster. Please help me understand where I am going wrong. Some example will
> be helpful.
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061197.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


RE: Multiple Seeds

2011-02-24 Thread Jeremy.Truelove
Gotcha I had forgotten about the gossip piece, that makes sense.

-Original Message-
From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: Wednesday, February 23, 2011 5:00 PM
To: Truelove, Jeremy: IT (NYK)
Cc: user@cassandra.apache.org
Subject: Re: Multiple Seeds

On Wed, Feb 23, 2011 at 3:28 PM,   wrote:
> So does Cassandra monitor the config file for changes? If it doesn't, how else
> would it know you had added a new seed unless you restart?
>
> -Original Message-
> From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> Sent: Wednesday, February 23, 2011 3:23 PM
> To: user@cassandra.apache.org
> Cc: Truelove, Jeremy: IT (NYK)
> Subject: Re: Multiple Seeds
>
> On Wed, Feb 23, 2011 at 2:59 PM,   wrote:
>> To add a host to the seeds list after it has had the data streamed to it I
>> need to
>>
>>
>>
>> 1.   stop it
>>
>> 2.   edit the yaml file to
>>
>> a.   include it in the seeds list
>>
>> b.  set auto boostrap to false
>>
>> 3.    restart it
>>
>>
>>
>> correct? Additionally you would need to add it to the other nodes seed lists
>> and restart them as well.
>>
>>
>>
>> From: Eric Gilmore [mailto:e...@datastax.com]
>> Sent: Wednesday, February 23, 2011 2:47 PM
>> To: user@cassandra.apache.org
>> Subject: Re: Multiple Seeds
>>
>>
>>
>> Well -- when you first bring a node into a ring, you will probably want to
>> stream data to it with auto_bootstrap: true.
>>
>> If you want that node to be a seed, then add it to the seeds list AFTER it
>> has joined the ring.
>>
>> I'd refer you to the "Seed List" and "Autoboostrapping" sections of the
>> Getting Started guide, which contain the following blurbs:
>>
>> There is no strict rule to determine which hosts need to be listed as seeds,
>> but all nodes in a cluster need the same seed list. For a production
>> deployment, DataStax recommends two seeds per data center.
>>
>> An autobootstrapping node cannot have itself in the list of seeds nor can it
>> contain an initial_token already claimed by another node. To add new seeds,
>> autobootstrap the nodes first, and then configure them as seeds.
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Feb 23, 2011 at 11:39 AM, 
>> wrote:
>>
>> So all seeds should always be set to 'auto_bootstrap: false' in their .yaml
>> file.
>>
>> -Original Message-
>> From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
>> Sent: Wednesday, February 23, 2011 2:36 PM
>> To: user@cassandra.apache.org
>>
>> Cc: Truelove, Jeremy: IT (NYK)
>> Subject: Re: Multiple Seeds
>>
>> On Wed, Feb 23, 2011 at 2:30 PM,  
>> wrote:
>>> Yeah, I set the tokens. I'm more asking: if I start the first seed node with
>>> autobootstrap set to false, should the second seed have it set to true, as
>>> well as all the slave nodes? I didn't see this in the docs but I may
>>> have just missed it.
>>>
>>>
>>>
>>> From: Eric Gilmore [mailto:e...@datastax.com]
>>> Sent: Wednesday, February 23, 2011 2:24 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Multiple Seeds
>>>
>>>
>>>
>>> The DataStax documentation offers some answers to those questions in the
>>> Getting Started section and the Clustering reference docs.
>>>
>>> Autobootstrap should be true, but with the important caveat that
>>> intial_token values should be specified.  Have a look at those docs, and
>>> please give feedback on how helpful they are/aren't.
>>>
>>> Regards,
>>>
>>> Eric Gilmore
>>>
>>> On Wed, Feb 23, 2011 at 11:15 AM, 
>>> wrote:
>>>
>>> What's the best way to bring multiple seeds up, should only one of them
>>> have
>>> auto bootstrap set to true or should neither of them? Should they list
>>> themselves and the other seed in their seed section in the yaml config?
>>>

Re: Understanding Indexes

2011-02-24 Thread mcasandra

What I am trying to ask is: what if there are billions of row keys (eg:
abc, def, xyz in the example below) and then a client does a lookup/query on a
row, say xyz (get all cols for row xyz)? Now, since there are billions of rows,
is a lookup using the hash mechanism going to be slow? What algorithm will be
used to retrieve row xyz, which could be anywhere in those billion rows on a
particular node?

Is it going to help if there is an index on row keys (eg: abc, xyz)?

> UserProfile = { // this is a ColumnFamily
>abc: {   // this is the key to this Row inside the CF
>// now we have an infinite # of columns in this row
>username: "phatduckk",
>email: "phatdu...@example.com",
>phone: "(900) 976-"
>}, // end row
>def: {   // this is the key to another row in the CF
>// now we have another infinite # of columns in this row
>username: "ieure",
>email: "ie...@example.com",
>phone: "(888) 555-1212"
>age: "66",
>gender: "undecided"
>},
> }
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061356.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-24 Thread Frank LoVecchio
Not sure if there is a particular reason for you using different regions,
but Amazon states that each zone is a different physical location completely
separate from the others, e.g. us-east-1a and us-east-1b.  Using the Amazon
internal IPs (10.x.*, etc.) reduces latency greatly by not going outbound
through DNS (though us-east-1c has twice the latency), you can have an RF of 4
(in different physical locations!), and most importantly sh*t
just works.

Maybe this won't help you, but it may be useful for others :)

On Thu, Feb 24, 2011 at 10:53 AM, Dave Viner  wrote:

> Another possibility is this:
>
> why not setup 2 nodes in 1 region in 1 az, and get that to work.
> Then, open a third node in the same region, but different AZ, and get that
> to work.
> Then, once you have that working, open a fourth node in a different region
> and get that to work.
>
> Seems like taking a piece-meal approach would be beneficial here.
>
> Dave Viner
>
>
> On Thu, Feb 24, 2011 at 6:11 AM, Daniel van Ham Colchete <
> daniel.colch...@gmail.com> wrote:
>
>> Himanshi,
>>
>> my bad, try this for iptables:
>>
>> # SNAT outgoing connections
>> iptables -t nat -A POSTROUTING -p tcp --dport 7000 -d 175.41.143.192 -j
>> SNAT --to-source INTERNALIP
>>
>> As for tcpdump the argument for the -i option is the interface name (eth0,
>> cassth0, etc...), and not the IP. So, it should be
>> tcpdump -i cassth0 -n port 7000
>> or
>> tcpdump -i eth0 -n port 7000
>>
>> I`m assuming your main network card is eth0, but that should be the case.
>>
>> Does it work?
>>
>> Best,
>> Daniel
>>
>>
>> On Thu, Feb 24, 2011 at 9:27 AM, Himanshi Sharma > > wrote:
>>
>>>
>>> Thanks Daniel.
>>>
>>> But SNAT command is not working and when i try tcpdump it gives
>>>
>>> [root@ip-10-136-75-201 ~]# tcpdump -i 50.18.60.117 -n port 7000
>>> tcpdump: Invalid adapter index
>>>
>>> Not able to figure out what this is??
>>>
>>> Thanks,
>>> Himanshi
>>>
>>>
>>>
>>>  From: Daniel van Ham Colchete  To:
>>> user@cassandra.apache.org Date: 02/24/2011 04:27 PM Subject: Re:
>>> Cassandra nodes on EC2 in two different regions not communicating
>>> --
>>>
>>>
>>>
>>> Himanshi,
>>>
>>> you could try adding your public IP address to an internal interface and
>>> DNAT the packets to it. This shouldn't give you any problems with your
>>> normal traffic. Tell Cassandra to listen on the public IPs and it should
>>> work.
>>>
>>> Linux commands would be:
>>>
>>> # Create an internal interface using bridge-utils
>>> brctl addbr cassth0
>>>
>>> # add the ip
>>> ip addr add dev cassth0 *50.18.60.117/32* 
>>>
>>> # DNAT incoming connections
>>> iptables -t nat -A PREROUTING -p tcp --dport 7000 -d INTERNALIP -j DNAT
>>> --to-destination 50.18.60.117
>>>
>>> # SNAT outgoing connections
>>> iptables -t nat -A OUTPUT -p tcp --dport 7000 -d 175.41.143.192 -j SNAT
>>> --to-source INTERNALIP
>>>
>>> This should work since Amazon will re-SNAT your outgoing packets to your
>>> public IP again, so the other Cassandra instance will see your public IP as
>>> your source address.
>>>
>>> I didn't test this setup here but it should work unless I forgot some
>>> small detail. If you need to troubleshoot use the command "tcpdump -i
>>> INTERFACE -n port 7000" where INTERFACE should be your public interface or
>>> your cassth0.
>>>
>>> Please let me know if it worked.
>>>
>>> Best regards,
>>> Daniel Colchete
>>>
>>> On Thu, Feb 24, 2011 at 4:04 AM, Himanshi Sharma <*
>>> himanshi.sha...@tcs.com* > wrote:
>>> Giving the private IP to rpc_address gives the same exception,
>>> and keeping it blank and providing the public IP to listen_address also fails. I
>>> tried keeping both blank and did telnet on 7000, so I get the following o/p
>>>
>>> [root@ip-10-166-223-150 bin]# telnet 122.248.193.37 7000
>>> Trying 122.248.193.37...
>>> Connected to 122.248.193.37.
>>> Escape character is '^]'.
>>>
>>> Similarly from another machine
>>>
>>> [root@ip-10-136-75-201 bin]# telnet 184.72.22.87 7000
>>> Trying 184.72.22.87...
>>> Connected to 184.72.22.87.
>>> Escape character is '^]'.
>>>
>>>
>>>
>>> -Dave Viner wrote: -
>>> To: *user@cassandra.apache.org* 
>>> From: Dave Viner <*davevi...@gmail.com* >
>>> Date: 02/24/2011 11:59AM
>>> cc: Himanshi Sharma <*himanshi.sha...@tcs.com* 
>>> >
>>>
>>> Subject: Re: Cassandra nodes on EC2 in two different regions not
>>> communicating
>>>
>>> Try using the private ipv4 address in the rpc_address field, and the
>>> public ipv4 (NOT the elastic ip) in the listen_address.
>>>
>>> If that fails, go back to rpc_address empty, and start up cassandra.
>>>
>>> Then from the other node, please telnet to port 7000 on the first node.
>>>  And show the output of that session in your reply.
>>>
>>> I haven't actually constructed a cross-region cluster nor have I used
>>> v0.7, but this really sounds like it should be easy.
>>>
>>> On Wed, Feb 23, 2011 at 10:22 PM, Himanshi Sharma < *himanshi.sha...@tcs.com
>>> * > wrote:
>>> Hi Dave,
>>>
>>> I tried with the 

Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Anthony John
I see the point - apologies for putting everyone through this!

It was just militating against my mental model.

In summary, here is my take away - simple stuff but - IMO - important to
conclude this thread (I hope):-
1. I was splitting hairs over a failed (partial) Q write. Such an event
should be immediately followed by the same write going over a connection to
another node (potentially using the connection caches of client
implementations) or a read at CL of ALL, because a write could have partially
gone through.
2. Timestamps are used in determining the latest version (correcting the
false impression I was propagating)

Finally, the "W + R > N for Q CL" statement holds, but could be broken in
the case of a failed write, as it is unsure whether the new value got written
on any server or not. Is that a fair characterization?

Bottom line - unlike a traditional DBMS, errors do not ensure automatic
cleanup and rollback; app code has to follow up if immediate - and not
eventual - consistency is desired. I made that leap in almost all cases - I
think - except the case of a failed write.
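The follow-up described in point 1 can be sketched as client-side pseudocode. The client API below is hypothetical (a stub, purely to illustrate the retry-or-read-at-ALL pattern), not a real Cassandra driver:

```python
class StubClient:
    """Hypothetical client, used only to exercise the pattern below."""
    def __init__(self):
        self.fail_next = True   # simulate one partial (failed) quorum write
        self.store = {}

    def write(self, key, value, cl):
        if self.fail_next:
            self.fail_next = False
            raise TimeoutError("quorum write timed out (may be partially applied)")
        self.store[key] = value

    def read(self, key, cl):
        return self.store.get(key)

def safe_quorum_write(client, key, value):
    try:
        client.write(key, value, cl="QUORUM")
    except TimeoutError:
        # The write may have landed on some replicas. Follow up: retry the
        # write (ideally via another coordinator), or read at CL=ALL so the
        # partially applied value is surfaced and resolved.
        try:
            client.write(key, value, cl="QUORUM")
        except TimeoutError:
            client.read(key, cl="ALL")

c = StubClient()
safe_quorum_write(c, "k", "v")
assert c.read("k", cl="ALL") == "v"   # the retry completed the write
```

The point is only that the application, not the database, is responsible for this follow-up after a failed quorum write.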

My bad and I can live with this!

Regards,

-JA

On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne wrote:

> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John wrote:
>
>> Completely understand!
>>
>> All that I am quibbling over is whether a CL of quorum guarantees
>> consistency or not. That is what the documentation says - right? If, for a CL
>> of Q read, it depends on which node returns the read first to determine the
>> actual returned result, or on other more convoluted conditions, then a Quorum
>> read/write is not consistent, by any definition.
>>
>
> But that's the point. The definition of consistency we are talking about
> has no meaning if you consider only a quorum read. The definition (which is
> the de facto definition of consistency in 'eventually consistent') makes
> sense if we talk about a write followed by a read - that is, a
> succeeding write followed by a succeeding read.
> And that is the statement the wiki is making.
>
> Honestly, we could debate forever on the definition of consistency and
> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
> replicas and then a (succeeding) read on R replicas, and if R+W>N, then it is
> guaranteed that the read will see the preceding write. And this is what is
> called consistency in the context of eventual consistency (which is not the
> context of ACID).
>
> If this is not the definition of consistency you had in mind, then by all
> means, Cassandra probably doesn't guarantee that definition. But given that the
> paragraph preceding what you pasted states clearly that we are not talking about
> ACID consistency, but eventual consistency, I don't think the wiki is making
> any unfair statement.
>
> That being said, the wiki may not be always as clear as it could. But it's
> an editable wiki :)
>
> --
> Sylvain
>
>
>>
>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>> this statement on the Wiki architecture section:-
>>
>> -
>>
>> More specifically: R = read replica count, W = write replica count,
>> N = replication factor, Q = *QUORUM* (Q = N / 2 + 1)
>>
>>    - If W + R > N, you will have consistency:
>>       - W=1, R=N
>>       - W=N, R=1
>>       - W=Q, R=Q where Q = N / 2 + 1
>>
>> Cassandra provides consistency when R + W > N (read replica count + write
>> replica count > replication factor).
>>
>> 
>>
>>
>> .
>>
>>
>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne 
>> wrote:
>>
>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John wrote:
>>>
 If you are correct - and you are probably closer to the code - then a CL of
 Quorum does not guarantee consistency.
>>>
>>>
>>> If the operation succeeds, it does (for some definition of consistency,
>>> which is: following reads at Quorum will be guaranteed to see the new value
>>> of an update at quorum). If it fails, then no, it does not guarantee
>>> consistency.
>>>
>>> It is important to note that the word consistency has multiple meanings.
>>> In particular, when we are talking of consistency in Cassandra, we are not
>>> talking of the same definition as the C in ACID (see:
>>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>

 On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <
 sylv...@datastax.com> wrote:

> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John 
> wrote:
>
>>  >>Time stamps are not used for conflict resolution - unless it is
>>> part of the application logic!!!
>>>
>>
>> >>What is your definition of conflict resolution? Because if you
>> update twice the same column (which
>> >>I'll call a conflict), then the timestamps are used to decide which
>> update wins (which I'll call a resolution).
>>
>> I understand what you are saying, and yes semantics is very important
>> here

Re: Understand eventually consistent

2011-02-24 Thread mcasandra


Javier Canillas wrote:
> 
> Instead, when you execute the same OP using CL QUORUM, then it means
> RF/2 + 1; it will try to write on the coordinator node and a replica.
> Considering only 1 replica is down, the OP will succeed too.
> 

I am assuming even a read will succeed with CL QUORUM when RF=3 and 1 node is
down.
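That matches the usual arithmetic. A hedged sketch of the ack counts (level names and function names here are illustrative, not a real API):

```python
# How many replica acks each consistency level needs, and whether an
# operation can succeed with some replicas down (simplified model).
def required_acks(rf, level):
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

def can_succeed(rf, level, live_replicas):
    return live_replicas >= required_acks(rf, level)

assert can_succeed(3, "QUORUM", live_replicas=2)      # RF=3, 1 node down: OK
assert not can_succeed(3, "ALL", live_replicas=2)     # ALL needs all 3 replicas
```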


Javier Canillas wrote:
> 
> Now consider same OP but with CL ALL; it will fail since it can't
> assure that the coordinator and both replicas are updated.
> 

Can you please explain this a little more? I thought CL ALL will fail because
it needs all the nodes to be up.
http://wiki.apache.org/cassandra/API

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061399.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understanding Indexes

2011-02-24 Thread Javier Canillas
I really don't see the point. Again, suppose a cluster with 3 nodes, where
there is a ColumnFamily that holds data whose key is basically
a word of 2 letters (pretty simple). That makes a total of 729 possible
keys.

RandomPartitioner will then tokenize each key and assign it to a node
within the cluster. Each node will then handle 243 keys (plus
replication, of course).

OK, now suppose that you need to look for data on key "AG". The node that
you ask will then use RandomPartitioner to tokenize the key, determine
which node is the coordinator for that key, and proceed to ask that node for
the data (and ask the replicas for an MD5 version of the data to compare). So
each node will only need to look over 1/3 of the stored keys.
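A rough sketch of that routing. This is a simplification for illustration: real Cassandra's RandomPartitioner derives an MD5 token and routes the key to the node owning that token's range on the ring, rather than taking a plain modulo:

```python
import hashlib

def token(key):
    """128-bit MD5 token for a key, as RandomPartitioner derives it."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

def node_for(key, num_nodes):
    # Simplification of token-range ownership for illustration.
    return token(key) % num_nodes

# Routing "AG" costs one hash plus one node's local lookup; the cost does
# not grow with the total number of keys stored in the cluster.
assert 0 <= node_for("AG", 3) < 3
assert node_for("AG", 3) == node_for("AG", 3)  # deterministic placement
```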

How do you think an index is implemented? As far as I know, a simple index
is basically a HashTable that has the index value as key, and the position
as value. How do you think a search within the index (hashcode) is
implemented?

I don't know, maybe there is some magic behind indexes (I know there are
some complex indexes that use a B-Tree, etc., like the ones used in SQL
solutions), but I think the whole thing will only add more complexity
over a more straightforward solution. How big should the CF be (in terms of
keys) before there is noticeable latency when searching over hashcodes? And
then think: if I need to add a new key, what's the cost to the whole process?
Now, let's assume you can build the whole B-Tree in the first place (even for
the keys that do not exist) - how much memory would that cost? There should be
some papers that discuss this problem somewhere.

I would definitely make some volume calculations and run some stress tests
over this, at least to be sure there is a problem before attempting any kind
of solution.

PS: I feel this is like the problem I presented about TTL values, saying,
basically, that a TTL value past the year 2050 would throw an exception. Who
will be alive after the 2012 doomsday? :)


On Thu, Feb 24, 2011 at 3:18 PM, mcasandra  wrote:

>
> What I am trying to ask is: what if there are billions of row keys (eg:
> abc, def, xyz in the example below) and then a client does a lookup/query on
> a row, say xyz (get all cols for row xyz)? Now, since there are billions of
> rows, is a lookup using the hash mechanism going to be slow? What algorithm
> will be used to retrieve row xyz, which could be anywhere in those billion
> rows on a particular node?
>
> Is it going to help if there is an index on row keys (eg: abc, xyz)?
>
> > UserProfile = { // this is a ColumnFamily
> >abc: {   // this is the key to this Row inside the CF
> >// now we have an infinite # of columns in this row
> >username: "phatduckk",
> >email: "phatdu...@example.com",
> >phone: "(900) 976-"
> >}, // end row
> >def: {   // this is the key to another row in the CF
> >// now we have another infinite # of columns in this row
> >username: "ieure",
> >email: "ie...@example.com",
> >phone: "(888) 555-1212"
> >age: "66",
> >gender: "undecided"
> >},
> > }
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061356.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Ritesh Tijoriwala
Thanks all for good detail and clarification. I just wanted to get things
clear and understand correctly what is the expected behavior when working
with Cassandra against various failure conditions so that application can be
designed accordingly and provide proper locking/synchronization if required.

Thanks,
Ritesh

On Thu, Feb 24, 2011 at 10:25 AM, Anthony John wrote:

> I see the point - apologies for putting everyone through this!
>
> It was just militating against my mental model.
>
> In summary, here is my take away - simple stuff but - IMO - important to
> conclude this thread (I hope):-
> 1. I was splitting hairs over a failed (partial) Q write. Such an event
> should be immediately followed by the same write going over a connection to
> another node (potentially using the connection caches of client implementations)
> or a read at CL of ALL, because a write could have partially gone through.
> 2. Timestamps are used in determining the latest version ( correcting the
> false impression I was propagating)
>
> Finally, the "W + R > N for Q CL" statement holds, but could be broken in
> the case of a failed write, as it is unsure whether the new value got written
> on any server or not. Is that a fair characterization?
>
> Bottom line - unlike a traditional DBMS, errors do not ensure automatic
> cleanup and revert; app code has to follow up if immediate - and not
> eventual - consistency is desired. I made that leap in almost all cases - I
> think - except the case of a failed write.
>
> My bad and I can live with this!
>
> Regards,
>
> -JA
>
>
> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne 
> wrote:
>
>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John wrote:
>>
>>> Completely understand!
>>>
>>> All that I am quibbling over is whether a CL of quorum guarantees
>>> consistency or not. That is what the documentation says - right? If for a CL
>>> of Q read it depends on which node returns the read first to determine the
>>> actual returned result, or on other more convoluted conditions, then a Quorum
>>> read/write is not consistent, by any definition.
>>>
>>
>> But that's the point. The definition of consistency we are talking about
>> has no meaning if you consider only a quorum read. The definition (which is
>> the de facto definition of consistency in 'eventually consistent') makes
>> sense if we talk about a write followed by a read. And it is
>> considering a succeeding write followed by a succeeding read.
>> And that is the statement the wiki is making.
>>
>> Honestly, we could debate forever on the definition of consistency and
>> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>> replicas and then a (succeeding) read on R replicas, and if R+W>N, then it is
>> guaranteed that the read will see the preceding write. And this is what is
>> called consistency in the context of eventual consistency (which is not the
>> context of ACID).
>>
>> If this is not the definition of consistency you had in mind then by all
>> means, Cassandra probably doesn't guarantee this definition. But given that the
>> paragraph preceding what you pasted states clearly that we are not talking about
>> ACID consistency, but eventual consistency, I don't think the wiki is making
>> any unfair statement.
>>
>> That being said, the wiki may not always be as clear as it could be. But it's
>> an editable wiki :)
>>
>> --
>> Sylvain
>>
>>
>>>
>>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>> this statement on the Wiki architecture section:-
>>>
>>> -
>>>
>>> More specifically: R=read replica count W=write replica count N=replication
>>> factor Q=*QUORUM* (Q = N / 2 + 1)
>>>
>>>-
>>>
>>>If W + R > N, you will have consistency
>>>- W=1, R=N
>>>- W=N, R=1
>>>- W=Q, R=Q where Q = N / 2 + 1
>>>
>>> Cassandra provides consistency when R + W > N (read replica count + write
>>> replica count > replication factor).
>>>
>>> 
>>>
>>>
>>> .
>>>
>>>
>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne >> > wrote:
>>>
 On Thu, Feb 24, 2011 at 6:01 PM, Anthony John wrote:

> If you are correct - and you are probably closer to the code - then CL of
> Quorum does not guarantee consistency.


 If the operation succeeds, it does (for some definition of consistency,
 which is: following reads at Quorum will be guaranteed to see the new value
 of an update at Quorum). If it fails, then no, it does not guarantee
 consistency.

 It is important to note that the word consistency has multiple meanings.
 In particular, when we are talking of consistency in Cassandra, we are not
 talking of the same definition as the C in ACID (see:
 http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)

>
> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <
> sylv...@datastax.com> wrote:
>
>> On Thu, Feb 24, 

Re: Understand eventually consistent

2011-02-24 Thread Javier Canillas
Well, it will need all nodes that are required for the operation to be up
and to respond in a timely fashion; even a timed-out RPC from one replica will
get you a failure response.

CL is calculated based on the RF configured for the ColumnFamily.

"The ConsistencyLevel is an enum that controls both read and write behavior
based on the ReplicationFactor in your storage-conf.xml."

QUORUM = RF / 2 +1;
ALL = RF
ONE = 1
ANY = 0

Then, for a column family configured with RF = 6, QUORUM means "be sure to
write to at least 4 nodes before responding", but for a column family
configured with RF = 3, QUORUM means "be sure to write to at least 2". In
cases where RF is 1 or 2, QUORUM is the same as ALL ("be sure to write to all
nodes involved").
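The mapping above can be written down as a small helper. This is only a sketch of the rules described in this message (the function name is illustrative, not a Cassandra API), and it mirrors the "ANY = 0" convention used here, where ANY can be satisfied by a hint alone:

```python
def required_replicas(consistency_level, replication_factor):
    """Number of replica acknowledgements needed for an operation to be
    reported as successful, per the CL mapping described above."""
    cl = consistency_level.upper()
    if cl == "ALL":
        return replication_factor
    if cl == "QUORUM":
        return replication_factor // 2 + 1  # integer division: RF / 2 + 1
    if cl == "ONE":
        return 1
    if cl == "ANY":
        # Satisfiable by a hinted handoff alone; no live replica needed.
        return 0
    raise ValueError("unknown consistency level: %s" % consistency_level)

# RF = 6: QUORUM needs 4 replicas; RF = 3: QUORUM needs 2;
# with RF = 1 or 2, QUORUM degenerates to ALL.
print(required_replicas("QUORUM", 6))  # 4
print(required_replicas("QUORUM", 3))  # 2
print(required_replicas("QUORUM", 2) == required_replicas("ALL", 2))  # True
```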


On Thu, Feb 24, 2011 at 3:29 PM, mcasandra  wrote:

>
>
> Javier Canillas wrote:
> >
> > Instead, when you execute the same OP using CL QUORUM, then it means
> > RF /2+1, it will try to write on the coordinator node and replica.
> > Considering only 1 replica is down, the OP will success too.
> >
>
> I am assuming even read will succeed when CL QUORUM and RF=3 and 1 node is
> down.
>
>
> Javier Canillas wrote:
> >
> > Now consider same OP but with CL ALL, it will fail since it cant
> > assure that coordinador and both replicas are updated.
> >
>
> Can you please explain this little more? I thought CL ALL will fail because
> it needs all the nodes to be up.
> http://wiki.apache.org/cassandra/API
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061399.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: New thread for : How does Cassandra handle failure during synchronous writes

2011-02-24 Thread Narendra Sharma
You are missing the point. The coordinator node that is handling the request
won't wait for all the nodes to return their copy/digest of the data. It just
waits for Q (RF/2+1) nodes to return. This is the reason I explained two
possible scenarios.

Further, on what basis would Cassandra know that the data on N1 is the result
of a failure? Think about it!

Also, take a look at http://wiki.apache.org/cassandra/API. The following is
from the Cassandra wiki:
"Because the repair replication process only requires a write to reach a
single node to propagate, a write which 'fails' to meet consistency
requirements will still appear eventually so long as it was written to at
least one node. With W and R both using QUORUM, the best consistency we can
achieve is the guarantee that we will receive the same value regardless of
which nodes we read from. However, we can still perform a W=QUORUM that
"fails" but reaches one server, perform a R=QUORUM that reads the old value,
and then sometime later perform a R=QUORUM that reads the new value."

Hope this makes things very clear!
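The wiki quote above can be traced through with a toy in-memory model. Plain Python dictionaries stand in for replicas here; this is an illustrative sketch, not Cassandra code:

```python
# Toy model of the wiki quote: a "failed" W=QUORUM write that reached one
# replica still propagates eventually. Each replica holds key "k".
N, Q = 3, 2  # RF = 3, quorum = 3 // 2 + 1 = 2
replicas = [{"k": ("old", 1)} for _ in range(N)]

def newest(indices):
    """Resolve a read across the given replicas by highest timestamp."""
    return max((replicas[i]["k"] for i in indices), key=lambda v: v[1])

# A W=QUORUM write of "new" reaches only replica 0, then errors out:
replicas[0]["k"] = ("new", 2)  # reached one node -> the write "failed"

print(newest([1, 2])[0])       # a quorum read missing replica 0 sees: old

# Read repair / hinted handoff later copies the newest version around:
winner = newest(range(N))
for i in range(N):
    replicas[i]["k"] = winner

print(newest([1, 2])[0])       # the same quorum read now sees: new
```

This is exactly the sequence in the quote: a quorum read of the old value, then sometime later a quorum read of the new one.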



On Thu, Feb 24, 2011 at 4:47 AM, Anthony John  wrote:

> >>c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
> that was written to node1 will be returned.
>
> >>In this case - N1 will be identified as a discrepancy and the change will
> be discarded via read repair
>
> >>[Naren] How will Cassandra know this is a discrepancy?
>
> Because at Q only N1 will have the "new data" and the other nodes
> won't. This lack of consistency on N1 will be detected and repaired. The
> value that meets Q - the value from N2/N3 - will be returned.
>
> HTH
>


Re: New thread for : How does Cassandra handle failure during synchronous writes

2011-02-24 Thread Ritesh Tijoriwala
Thanks, Narendra. I read the wiki quote you pasted below again and now it
does make sense. Cassandra's design behavior is to propagate a failed
write if it was ever written successfully to at least one server. I was
having a hard time trying to work around this, but I guess I am starting to
think the other way.

Question - what patterns do applications employ to deal with this type of
problem? Is there a way to know that even though the write failed, it might
have partially succeeded? It would help if there is a way to know this;
otherwise I cannot think of how to deal with this type of scenario. Please
help.

Thanks,
Ritesh

On Thu, Feb 24, 2011 at 10:55 AM, Narendra Sharma  wrote:

> You are missing the point. The coordinator node that is handling the
> request won't wait for all the nodes to return their copy/digest of the data.
> It just waits for Q (RF/2+1) nodes to return. This is the reason I explained
> two possible scenarios.
>
> Further, on what basis would Cassandra know that the data on N1 is the result
> of a failure? Think about it!
>
> Also, take a look at http://wiki.apache.org/cassandra/API. The following is
> from the Cassandra wiki:
> "Because the repair replication process only requires a write to reach a
> single node to propagate, a write which 'fails' to meet consistency
> requirements will still appear eventually so long as it was written to at
> least one node. With W and R both using QUORUM, the best consistency we can
> achieve is the guarantee that we will receive the same value regardless of
> which nodes we read from. However, we can still perform a W=QUORUM that
> "fails" but reaches one server, perform a R=QUORUM that reads the old value,
> and then sometime later perform a R=QUORUM that reads the new value."
>
> Hope this makes things very clear!
>
>
>
>
> On Thu, Feb 24, 2011 at 4:47 AM, Anthony John wrote:
>
>> >>c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data
>> that was written to node1 will be returned.
>>
>> >>In this case - N1 will be identified as a discrepancy and the change
>> will be discarded via read repair
>>
>> >>[Naren] How will Cassandra know this is a discrepancy?
>>
>> Because at Q only N1 will have the "new data" and the other nodes
>> won't. This lack of consistency on N1 will be detected and repaired. The
>> value that meets Q - the value from N2/N3 - will be returned.
>>
>> HTH
>>
>
>


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread A J
>>but could be broken in case of a failed write<<
You can think of a scenario where R + W > N still leads to
inconsistency even for successful writes. Say you keep W=1 and R=N.
Let's say the one node where a write happened successfully goes down
before the write makes it to the other N-1 nodes, and that it goes down
for good and is unrecoverable. The only option is to build a new node from
scratch from the other active nodes. This leads to a write that was
lost, and you will end up serving a stale copy of it.

It is better to talk in terms of use cases and whether Cassandra will be a
fit for them. Otherwise, unless you have W=R=N and an fsync before each
write commit, there will be scope for inconsistency.
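The W=1, R=N scenario above can be sketched with a toy model (plain Python dictionaries standing in for replicas; the `write`/`read` helpers are illustrative, not Cassandra code):

```python
# Toy model: each replica is a dict mapping key -> (value, timestamp).
N = 3
replicas = [dict() for _ in range(N)]

def write(key, value, ts, targets):
    """Write (value, ts) to the given replica indices; W = len(targets)."""
    for i in targets:
        replicas[i][key] = (value, ts)

def read(key, targets):
    """Read from the given replica indices (R = len(targets)) and
    resolve conflicting versions by highest timestamp."""
    versions = [replicas[i][key] for i in targets if key in replicas[i]]
    return max(versions, key=lambda v: v[1])[0] if versions else None

write("k", "old", ts=1, targets=[0, 1, 2])  # old value fully replicated
write("k", "new", ts=2, targets=[0])        # successful W=1 write

# Replica 0 dies unrecoverably before replicating; rebuild it from a peer.
replicas[0] = dict(replicas[1])

# Even an R=N read now serves the stale value: the W=1 write is lost.
print(read("k", targets=[0, 1, 2]))  # old
```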


On Thu, Feb 24, 2011 at 1:25 PM, Anthony John  wrote:
> I see the point - apologies for putting everyone through this!
> It was just militating against my mental model.
> In summary, here is my take away - simple stuff but - IMO - important to
> conclude this thread (I hope):-
> 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
> should be immediately followed by the same write going to a connection on to
> another node ( potentially using connection caches of client implementations
> ) or a Read at CL of All. Because a write could have partially gone through.
> 2. Timestamps are used in determining the latest version ( correcting the
> false impression I was propagating)
> Finally, wrt "W + R > N for Q CL statement" holds, but could be broken in
> case of a failed write as it is unsure whether the new value got written on
>  any server or not. Is that a fair characterization ?
> Bottom line - unlike traditional DBMS, errors do not ensure automatic
> cleanup and revert back, app code has to follow up if  immediate - and not
> eventual -  consistency is desired. I made that leap in almost all cases - I
> think - but the case of a failed write.
> My bad and I can live with this!
> Regards,
> -JA
>
> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne 
> wrote:
>>
>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John 
>> wrote:
>>>
>>> Completely understand!
>>> All that I am quibbling over is whether a CL of quorum guarantees
>>> consistency or not. That is what the documentation says - right. IF for a CL
>>> of Q read - it depends on which node returns read first to determine the
>>> actual returned result or other more convoluted conditions , then a Quorum
>>> read/write is not consistent, by any definition.
>>
>> But that's the point. The definition of consistency we are talking about
>> has no meaning if you consider only a quorum read. The definition (which is
>> the de facto definition of consistency in 'eventually consistent') make
>> sense if we talk about a write followed by a read. And it is
>> considering succeeding write followed by succeeding read.
>> And that is the statement the wiki is making.
>> Honestly, we could debate forever on the definition of consistency and
>> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
>> replica and then a (succeeding) read on R replica and if R+W>N, then it is
>> guaranteed that the read will see the preceding write. And this is what is
>> called consistency in the context of eventual consistency (which is not the
>> context of ACID).
>> If this is not the definition of consistency you had in mind then by all
>> mean, Cassandra probably don't guarantee this definition. But given that the
>> paragraph preceding what you pasted state clearly we are not talking about
>> ACID consistency, but eventual consistency, I don't think the wiki is making
>> any unfair statement.
>> That being said, the wiki may not be always as clear as it could. But it's
>> an editable wiki :)
>> --
>> Sylvain
>>
>>>
>>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>> this statement on the Wiki architecture section:-
>>> -
>>>
>>> More specifically: R=read replica count W=write replica
>>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>
>>> If W + R > N, you will have consistency
>>>
>>> W=1, R=N
>>> W=N, R=1
>>> W=Q, R=Q where Q = N / 2 + 1
>>>
>>> Cassandra provides consistency when R + W > N (read replica count + write
>>> replica count > replication factor).
>>>
>>> 
>>>
>>> .
>>>
>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne 
>>> wrote:

 On Thu, Feb 24, 2011 at 6:01 PM, Anthony John 
 wrote:
>
> If you are correct and you are probably closer to the code - then CL of
> Quorum does not guarantee a consistency.

 If the operation succeed, it does (for some definition of consistency
 which is, following reads at Quorum will be guaranteed to see the new value
 of a update at quorum). If it fails, then no, it does not guarantee
 consistency.
 It is important to note that the word consistency has multiple meaning.
 In particular, when we are talking of consistency

Re: Understanding Indexes

2011-02-24 Thread mcasandra

Thanks! I am thinking more of cases where you have millions of keys (rows),
e.g. a UUID as a row key, or where there could be millions of users.

So are we saying that we should NOT create column families with this many
keys? What are the other options in such cases?

UserProfile = { // this is a ColumnFamily
>1 {   // this is the key to this Row inside the CF
>// now we have an infinite # of columns in this row
>username: "phatduckk",
>email: "[hidden email]",
>phone: "(900) 976-"
>}, // end row
>2 {   // this is the key to another row in the CF
>// now we have another infinite # of columns in this row
>username: "ieure",
>email: "[hidden email]",
>phone: "(888) 555-1212",
>age: "66",
>gender: "undecided"
>},
> }

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061574.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understand eventually consistent

2011-02-24 Thread mcasandra

Does HH count towards QUORUM? Say RF=1 and a write CL of QUORUM, and the one
node that owns the key dies. Would subsequent write operations for that key
be successful? I am guessing they will not succeed.
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061593.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understand eventually consistent

2011-02-24 Thread Tyler Hobbs
On Thu, Feb 24, 2011 at 1:26 PM, mcasandra  wrote:

>
> Does HH count towards QUORUM? Say  RF=1 and CL of W=QUORUM and one node
> that
> owns the key dies. Would subsequent write operations for that key be
> successful? I am guessing it will not succeed.
>

No, it would not succeed. It would only succeed at CL.ANY.

-- 
Tyler Hobbs
Software Engineer, DataStax 
Maintainer of the pycassa  Cassandra
Python client library


Re: Understanding Indexes

2011-02-24 Thread Javier Canillas
I don't say you shouldn't. If you feel like there could be a problem, you may
think of splitting the column family into N. But I think you won't hit that
problem. You can read about the RowCacheSize and KeyCache support in Cassandra
0.7.x; if your rows are small, you can cache a lot of them and avoid a lot of
latency issues when reading and writing.

On Thu, Feb 24, 2011 at 4:18 PM, mcasandra  wrote:

>
> Thanks! I am thinking more in terms where you have millions of keys (rows).
> For eg: UUID as a row key. or there could millions of users.
>
> So are we saying that we should NOT create column families with these many
> keys? What are the other options in such cases?
>
> UserProfile = { // this is a ColumnFamily
> >1 {   // this is the key to this Row inside the CF
> >// now we have an infinite # of columns in this row
> >username: "phatduckk",
> >email: "[hidden email]",
> >phone: "(900) 976-"
> >}, // end row
> >2 {   // this is the key to another row in the CF
> >// now we have another infinite # of columns in this row
> >username: "ieure",
> >email: "[hidden email]",
> >phone: "(888) 555-1212",
> >age: "66",
> >gender: "undecided"
> >},
> > }
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061574.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: Understand eventually consistent

2011-02-24 Thread Javier Canillas
HH is a kind of write repair, so it has nothing to do with the CL, which is a
requirement of the operation; and it won't be used for reads.

In your example QUORUM is the same as ALL, since you only have RF = 1 (only
the data holder - the coordinator). If that node fails, all reads/writes will
fail.

Now, in another scenario, with RF = 3 and 1 node down:

CL = QUORUM. Will work, but the coordinator will record an HH for the write
and attempt to deliver it to the failed node for some time. Despite this, the
operation will succeed for the client.
CL = ALL. Will fail.
CL = ONE. Will work. An HH will be stored for the down replica.

*Consider CL the client's minimum requirement for an operation to succeed*.
If the cluster can assure that level, then the operation will succeed and be
returned to the client (even though some HH work may need to be done after);
if not, an error response will be returned.
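The scenarios above condense into a tiny predicate. The function name and the CL-to-required-replica mapping are a sketch of the rules described in this message, not Cassandra's actual API:

```python
def operation_succeeds(cl, rf, live_replicas):
    """Return True if an operation at consistency level `cl` can be
    acknowledged, given `live_replicas` out of `rf` are reachable.
    ANY is modeled as needing no live replica (a hint alone suffices)."""
    required = {"ANY": 0, "ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]
    return live_replicas >= required

# RF=1: QUORUM behaves like ALL, so a dead data holder fails everything
# except ANY, where the coordinator can store a hint.
print(operation_succeeds("QUORUM", 1, 0))  # False
print(operation_succeeds("ANY", 1, 0))     # True

# RF=3 with one node down, as in the scenarios above:
print(operation_succeeds("QUORUM", 3, 2))  # True  (hint queued for the dead node)
print(operation_succeeds("ALL", 3, 2))     # False
print(operation_succeeds("ONE", 3, 2))     # True
```

The RF=1 row makes the earlier point concrete: only ANY, which a stored hint can satisfy, keeps accepting writes when the sole data holder is down.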


On Thu, Feb 24, 2011 at 4:26 PM, mcasandra  wrote:

>
> Does HH count towards QUORUM? Say  RF=1 and CL of W=QUORUM and one node
> that
> owns the key dies. Would subsequent write operations for that key be
> successful? I am guessing it will not succeed.
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061593.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Anthony John
The leap of faith here is that an error does not mean a clean backing out to
the prior state - as we are used to with databases. It means that the operation
in error could have gone through partially.

Again, this is not absolutely unfamiliar territory and can be dealt with.

-JA

On Thu, Feb 24, 2011 at 1:16 PM, A J  wrote:

> >>but could be broken in case of a failed write<<
> You can think of a scenario where R + W >N still leads to
> inconsistency even for successful writes. Say you keep W=1 and R=N .
> Lets say the one node where a write happened with success goes down
> before it made to the other N-1 nodes. Lets say it goes down for good
> and is unrecoverable. The only option is to build a new node from
> scratch from other active nodes. This will lead to a write that was
> lost and you will end up serving stale copy of it.
>
> It is better to talk in terms of use cases and if cassandra will be a
> fit for it. Otherwise unless you have W=R=N and fsync before each
> write commit, there will be scope for inconsistency.
>
>
> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John 
> wrote:
> > I see the point - apologies for putting everyone through this!
> > It was just militating against my mental model.
> > In summary, here is my take away - simple stuff but - IMO - important to
> > conclude this thread (I hope):-
> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
> > should be immediately followed by the same write going to a connection on
> to
> > another node ( potentially using connection caches of client
> implementations
> > ) or a Read at CL of All. Because a write could have partially gone
> through.
> > 2. Timestamps are used in determining the latest version ( correcting the
> > false impression I was propagating)
> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken in
> > case of a failed write as it is unsure whether the new value got written
> on
> >  any server or not. Is that a fair characterization ?
> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
> > cleanup and revert back, app code has to follow up if  immediate - and
> not
> > eventual -  consistency is desired. I made that leap in almost all cases
> - I
> > think - but the case of a failed write.
> > My bad and I can live with this!
> > Regards,
> > -JA
> >
> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne  >
> > wrote:
> >>
> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John 
> >> wrote:
> >>>
> >>> Completely understand!
> >>> All that I am quibbling over is whether a CL of quorum guarantees
> >>> consistency or not. That is what the documentation says - right. IF for
> a CL
> >>> of Q read - it depends on which node returns read first to determine
> the
> >>> actual returned result or other more convoluted conditions , then a
> Quorum
> >>> read/write is not consistent, by any definition.
> >>
> >> But that's the point. The definition of consistency we are talking about
> >> has no meaning if you consider only a quorum read. The definition (which
> is
> >> the de facto definition of consistency in 'eventually consistent') make
> >> sense if we talk about a write followed by a read. And it is
> >> considering succeeding write followed by succeeding read.
> >> And that is the statement the wiki is making.
> >> Honestly, we could debate forever on the definition of consistency and
> >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
> >> replica and then a (succeeding) read on R replica and if R+W>N, then it
> is
> >> guaranteed that the read will see the preceding write. And this is what
> is
> >> called consistency in the context of eventual consistency (which is not
> the
> >> context of ACID).
> >> If this is not the definition of consistency you had in mind then by all
> >> mean, Cassandra probably don't guarantee this definition. But given that
> the
> >> paragraph preceding what you pasted state clearly we are not talking
> about
> >> ACID consistency, but eventual consistency, I don't think the wiki is
> making
> >> any unfair statement.
> >> That being said, the wiki may not be always as clear as it could. But
> it's
> >> an editable wiki :)
> >> --
> >> Sylvain
> >>
> >>>
> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
> make
> >>> this statement on the Wiki architecture section:-
> >>> -
> >>>
> >>> More specifically: R=read replica count W=write replica
> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
> >>>
> >>> If W + R > N, you will have consistency
> >>>
> >>> W=1, R=N
> >>> W=N, R=1
> >>> W=Q, R=Q where Q = N / 2 + 1
> >>>
> >>> Cassandra provides consistency when R + W > N (read replica count
> + write
> >>> replica count > replication factor).
> >>>
> >>> 
> >>>
> >>> .
> >>>
> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <
> sylv...@datastax.com>
> >>> wrote:
> >>>

Exception in thread "main" java.lang.NoClassDefFoundError

2011-02-24 Thread ko...@vivinavi.com
Hi everyone

I am new to Java and Cassandra.
I have just started installing Cassandra.
My machine is Debian 5.0.6.
I installed jdk1.6.0_24 to /usr/local.
java -version reports the following:
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Server VM (build 19.1-b02, mixed mode)
javac -J-version reports the following:
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)
I then installed apache-cassandra-0.6.12 to /usr/local.

I added the following to /etc/profile:
#for Java
export JAVA_HOME="/usr/local/java"
export CLASSPATH=".:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar;"
export PATH="$JAVA_HOME/bin:$PATH"

#for Java VM
export JVM_OPTS="-Xmx1G -Xms512M -Xss256K"

#for Cassandra
export CASSANDRA_HOME="/usr/local/cassandra/bin"
export CASSANDRA_CONF="/usr/local/cassandra/conf"
export
CASSANDRA_MAIN="/usr/local/cassandra/javadoc/org/apache/cassandra/thrift/CassandraDaemon.html"
export CASSANDRA_INCLUDE="/usr/local/cassandra/bin/cassandra.in.sh"
export PATH="$PATH:/usr/local/cassandra/bin"

I did source /etc/profile
and checked $JAVA_HOME, $CLASSPATH, $CASSANDRA_HOME, etc.

I then started /usr/local/cassandra/bin/cassandra -f.
However, I got the following error message:

Exception in thread "main" java.lang.NoClassDefFoundError:
/usr/local/cassandra/javadoc/org/apache/cassandra/thrift/CassandraDaemon
Caused by: java.lang.ClassNotFoundException:
.usr.local.cassandra.javadoc.org.apache.cassandra.thrift.CassandraDaemon
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class:
.usr.local.cassandra.javadoc.org.apache.cassandra.thrift.CassandraDaemon. 
Program
will exit.

I don't know what's wrong,
and I don't know what to do to solve this problem.
I searched for this error message and found results, but mostly for Windows,
not Linux.
Is my classpath wrong? I can only find many HTML files (incl.
CassandraDaemon.html)
at /usr/local/cassandra/javadoc/org/apache/cassandra/thrift/.
Is this OK?
If my classpath is wrong, what is the correct path? (I can't find
CassandraDaemon.java.)

Please advise me to solve this problem.
Thank you for your help in advance.

Best Regards
Mac Kondo

-- 
*
Mamoru Kondo
Vivid Navigation,Inc.
http://www.vivinavi.com
ko...@vivinavi.com
*



Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread A J
Yes, that is difficult to digest, and one has to be sure the use
case can afford it.

Some other NoSQL databases deal with it differently (though I don't
think any of them use an atomic 2-phase commit). MongoDB, for example, will
ask you to read from the node you wrote to first (the primary node) unless
you are OK with eventual consistency. If the write did not make it to a
majority of the other nodes, it will be rolled back from the original
primary when it comes up again as a secondary.
In some cases, you could still serve either the new value (that was
returned as failed) or the old one. But it is different from Cassandra
in the sense that Cassandra will never roll back.



On Thu, Feb 24, 2011 at 2:47 PM, Anthony John  wrote:
> The leap of faith here is that an error does not mean a clean backing out to
> prior state - as we are used to with databases. It means that the operation
> in error could have gone through partially
>
> Again, this is not an absolutely unfamiliar territory and can be dealt with.
> -JA
> On Thu, Feb 24, 2011 at 1:16 PM, A J  wrote:
>>
>> >>but could be broken in case of a failed write<<
>> You can think of a scenario where R + W >N still leads to
>> inconsistency even for successful writes. Say you keep W=1 and R=N .
>> Lets say the one node where a write happened with success goes down
>> before it made to the other N-1 nodes. Lets say it goes down for good
>> and is unrecoverable. The only option is to build a new node from
>> scratch from other active nodes. This will lead to a write that was
>> lost and you will end up serving stale copy of it.
>>
>> It is better to talk in terms of use cases and if cassandra will be a
>> fit for it. Otherwise unless you have W=R=N and fsync before each
>> write commit, there will be scope for inconsistency.
>>
>>
>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John 
>> wrote:
>> > I see the point - apologies for putting everyone through this!
>> > It was just militating against my mental model.
>> > In summary, here is my take away - simple stuff but - IMO - important to
>> > conclude this thread (I hope):-
>> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
>> > should be immediately followed by the same write going to a connection
>> > on to
>> > another node ( potentially using connection caches of client
>> > implementations
>> > ) or a Read at CL of All. Because a write could have partially gone
>> > through.
>> > 2. Timestamps are used in determining the latest version ( correcting
>> > the
>> > false impression I was propagating)
>> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
>> > in
>> > case of a failed write as it is unsure whether the new value got written
>> > on
>> >  any server or not. Is that a fair characterization ?
>> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
>> > cleanup and revert back, app code has to follow up if  immediate - and
>> > not
>> > eventual -  consistency is desired. I made that leap in almost all cases
>> > - I
>> > think - but the case of a failed write.
>> > My bad and I can live with this!
>> > Regards,
>> > -JA
>> >
>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>> > 
>> > wrote:
>> >>
>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John 
>> >> wrote:
>> >>>
>> >>> Completely understand!
>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>> >>> consistency or not. That is what the documentation says - right. IF
>> >>> for a CL
>> >>> of Q read - it depends on which node returns read first to determine
>> >>> the
>> >>> actual returned result or other more convoluted conditions , then a
>> >>> Quorum
>> >>> read/write is not consistent, by any definition.
>> >>
>> >> But that's the point. The definition of consistency we are talking
>> >> about
>> >> has no meaning if you consider only a quorum read. The definition
>> >> (which is
>> >> the de facto definition of consistency in 'eventually consistent') makes
>> >> sense if we talk about a write followed by a read. And it is
>> >> considering a succeeding write followed by a succeeding read.
>> >> And that is the statement the wiki is making.
>> >> Honestly, we could debate forever on the definition of consistency and
>> >> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>> >> replicas and then a (succeeding) read on R replicas, and if R+W>N, then
>> >> it is
>> >> guaranteed that the read will see the preceding write. And this is what
>> >> is
>> >> called consistency in the context of eventual consistency (which is not
>> >> the
>> >> context of ACID).
>> >> If this is not the definition of consistency you had in mind then by
>> >> all
>> >> means, Cassandra probably doesn't guarantee this definition. But given
>> >> that the
>> >> paragraph preceding what you pasted states clearly that we are not
>> >> talking about
>> >> ACID consistency, but eventual consistency, I don't think the wiki is
>> >> making
>> >> any unfair statement.
>> >> That being said, the 
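
The R + W > N guarantee discussed in this thread is a pigeonhole argument: any set of W replicas acknowledging a write must intersect any set of R replicas serving a read. A brute-force sketch (illustrative only, not Cassandra code):

```python
# Check the R + W > N overlap property by enumerating every possible write
# set and read set over N replicas: if they always share a replica, a
# succeeding read after a succeeding write must see the written value
# (the latest timestamp wins during read resolution).

from itertools import combinations

def always_overlaps(n, w, r):
    """True if every W-subset and every R-subset of N replicas intersect."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

assert always_overlaps(3, 2, 2)      # QUORUM write + QUORUM read, N=3: consistent
assert not always_overlaps(3, 1, 1)  # ONE/ONE: a read can miss the write
assert always_overlaps(3, 1, 3)      # W=1, R=ALL overlaps too -- but see A J's
                                     # caveat about the single written replica
                                     # dying before the value propagates
```

Note this covers only succeeding operations; as the thread stresses, a failed write may still have landed on some replicas.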

Re: Understand eventually consistent

2011-02-24 Thread mcasandra


Javier Canillas wrote:
> 
> HH is some kind of write repair, so it has nothing to do with CL that is a
> requirement of the operation; and it won't be used over reads.
> 
> In your example QUORUM is the same as ALL, since you only have 1 RF (only
> the data holder - coordinator). If that node fails, all reads/writes will
> fail.
> 
> Now, on another scenario, with RF = 3 and 1 node down:
> 
> CL = QUORUM. Will work, but the coordinator will mark an HH for the
> write
> and attempt to deliver it to the failed node for some time. Despite this, the
> operation will succeed for the client.
> CL = ALL. Will fail.
> CL = ONE. Will work. 2 HH will be sent to replicas to perform the update.
> 
> *Consider CL is the client minimum requirement over an operation to
> succeed*.
> If the cluster can assure that value, then the operation will succeed and
> returned to the client (despite some HH work needs to be done after), if
> not
> an error response will be returned.
> 
> 
> On Thu, Feb 24, 2011 at 4:26 PM, mcasandra  wrote:
> 
>>
>> Does HH count towards QUORUM? Say  RF=1 and CL of W=QUORUM and one node
>> that
>> owns the key dies. Would subsequent write operations for that key be
>> successful? I am guessing it will not succeed.
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061593.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
>> Nabble.com.
>>
> 
> 

Thanks! In the above scenario, what happens if 2 nodes die with RF=3 and a CL of
W=QUORUM? Would a write succeed, since one write can be made to the coordinator
node with HH and the other to the replica node that is up?

And similarly, in the above scenario, would a read succeed? Would HH be counted
towards the CL in this case?
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061772.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understanding Indexes

2011-02-24 Thread Ed Anuff
It all depends on what you're trying to do.  What you're proposing, by
definition, is creating a secondary index.  The primary index is your row
key.  Depending on the partitioner, it might or might not be a conveniently
iterable or sorted index.  If you need your keys sorted in a different order
than the partitioner provides, if you need your keys organized into groups
that can be quickly retrieved or have membership tested against them, or if
there is some other reason why the primary index doesn't suffice, then you
need a secondary index.  It all depends on whether you need to retrieve rows
based on different criteria than what the primary index provides.  If so,
then yes, you'll probably end up doing something that involves creating rows
that are full of row keys.  But if you're not storing a subset of your full
key set and you don't have specific needs for ordering and iterating, then
it would be redundant.


On Thu, Feb 24, 2011 at 11:18 AM, mcasandra  wrote:

>
> Thanks! I am thinking more in terms where you have millions of keys (rows).
> For eg: UUID as a row key. or there could millions of users.
>
> So are we saying that we should NOT create column families with these many
> keys? What are the other options in such cases?
>
> UserProfile = { // this is a ColumnFamily
> >1 {   // this is the key to this Row inside the CF
> >// now we have an infinite # of columns in this row
> >username: "phatduckk",
> >email: "[hidden email]",
> >phone: "(900) 976-"
> >}, // end row
> >2 {   // this is the key to another row in the CF
> >// now we have another infinite # of columns in this row
> >username: "ieure",
> >email: "[hidden email]",
> >phone: "(888) 555-1212"
> >age: "66",
> >gender: "undecided"
> >},
> > }
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061574.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>
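
A minimal sketch of the "rows full of row keys" pattern Ed describes, using plain dicts to stand in for column families (all names here are illustrative, not a real Cassandra API):

```python
# Primary index: the row key of the UserProfile column family.
user_profile = {
    "1": {"username": "phatduckk"},
    "2": {"username": "ieure"},
}

# Manual secondary index CF: one row per indexed value; its "columns" are the
# row keys of the matching UserProfile rows (values left empty).
users_by_username = {}
for row_key, columns in user_profile.items():
    users_by_username.setdefault(columns["username"], {})[row_key] = ""

# Retrieving rows by a criterion the primary index cannot serve:
hits = [user_profile[k] for k in users_by_username.get("ieure", {})]
assert hits == [{"username": "ieure"}]
```

The write path must keep both "column families" in step, which is exactly the bookkeeping a built-in secondary index would do for you.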


Re: Understand eventually consistent

2011-02-24 Thread Javier Canillas
No, since you are intentionally requiring that at least a quorum of the RF
replicas be written. In your scenario, only 1 node of 3 is up, and the QUORUM
value is 2. So that operation will fail, and no HH is made.

A read won't succeed either, since you are requiring that the data to be
returned be validated by at least 2 nodes.

HH only takes place on write operations, and only when the operation succeeded
because the CL could be satisfied while other replicas were down. The
coordinator then uses HH to perform the updates on the failed replicas (as
soon as they come back up).

On Thu, Feb 24, 2011 at 5:13 PM, mcasandra  wrote:

>
>
> Javier Canillas wrote:
> >
> > HH is some kind of write repair, so it has nothing to do with CL that is
> a
> > requirement of the operation; and it won't be used over reads.
> >
> > In your example QUORUM is the same as ALL, since you only have 1 RF (only
> > the data holder - coordinator). If that node fails, all read / writes
> will
> > fail.
> >
> > Now, on another scenario, with RF = 3 and 1 node down:
> >
> > CL = QUORUM. Will work, but the coordination will mark an HH over the
> > write
> > and attempt to do it for some time over the failed node. Despite this,
> the
> > operation will success for the client.
> > CL = ALL. Will fail.
> > CL = ONE. Will work. 2 HH will be sent to replicas to perform the update.
> >
> > *Consider CL is the client minimum requirement over an operation to
> > succeed*.
> > If the cluster can assure that value, then the operation will succeed and
> > returned to the client (despite some HH work needs to be done after), if
> > not
> > an error response will be returned.
> >
> >
> > On Thu, Feb 24, 2011 at 4:26 PM, mcasandra 
> wrote:
> >
> >>
> >> Does HH count towards QUORUM? Say  RF=1 and CL of W=QUORUM and one node
> >> that
> >> owns the key dies. Would subsequent write operations for that key be
> >> successful? I am guessing it will not succeed.
> >> --
> >> View this message in context:
> >>
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061593.html
> >> Sent from the cassandra-u...@incubator.apache.org mailing list archive
> at
> >> Nabble.com.
> >>
> >
> >
>
> Thanks! In above scenario what happens if 2 nodes die and RF=3, CL of
> W=QUORUM. Would a write succeed since one write can be made to coordinator
> node with HH and other to the replica node that is up.
>
> And similarly in above scenario would read succeed. Would HH be considered
> towards CL in this case?
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061772.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>
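
The arithmetic behind these answers can be made concrete with a short sketch (illustrative only; the function names are invented, and hinted handoffs deliberately do not count towards the CL, per Javier's explanation):

```python
# How many replica acknowledgements each consistency level needs, and whether
# an operation can succeed given the number of live replicas.

def required_acks(cl, rf):
    """Replica responses needed at consistency level `cl`, replication factor `rf`."""
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        return rf // 2 + 1          # a majority of the replication factor
    if cl == "ALL":
        return rf
    raise ValueError("unhandled consistency level: %s" % cl)

def can_succeed(cl, rf, live_replicas):
    return live_replicas >= required_acks(cl, rf)

# RF=3 with one replica down (2 live), as in the thread:
assert can_succeed("QUORUM", 3, 2)       # works; HH recorded for the dead node
assert not can_succeed("ALL", 3, 2)      # fails
# RF=1: QUORUM is effectively the same as ALL
assert required_acks("QUORUM", 1) == required_acks("ALL", 1) == 1
# RF=3 with two replicas down: QUORUM needs 2 acks, only 1 replica is live
assert not can_succeed("QUORUM", 3, 1)   # HH does not count towards the CL
```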


"null" vs "value not found"?

2011-02-24 Thread buddhasystem

I'm doing insertions with a pycassa client. It seems to work in most cases,
but sometimes, when I go to cassandra-cli and query with a key and column
that I inserted, I get "null" when I shouldn't. What could be the causes of
that?
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/null-vs-value-not-found-tp6061828p6061828.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understand eventually consistent

2011-02-24 Thread mcasandra

Thanks. This helps a lot!
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061838.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understand eventually consistent

2011-02-24 Thread Javier Canillas
You're welcome!

On Thu, Feb 24, 2011 at 5:30 PM, mcasandra  wrote:

>
> Thanks. This helps a lot!
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understand-eventually-consistent-tp6038330p6061838.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: "null" vs "value not found"?

2011-02-24 Thread Tyler Hobbs
On Thu, Feb 24, 2011 at 2:27 PM, buddhasystem  wrote:

>
> I'm doing insertion with a pycassa client. It seems to work in most cases,
> but sometimes, when I go to Cassandra-cli, and query with key and column
> that I inserted, I get "null" whereas I shouldn't. What could be causes for
> that?
>

Could you clarify what column name and value you are using as well as the
comparator and validator types?

-- 
Tyler Hobbs
Software Engineer, DataStax 
Maintainer of the pycassa  Cassandra
Python client library


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:03 PM, A J  wrote:
> yes, that is difficult to digest and one has to be sure if the use
> case can afford it.
>
> Some other NoSQL databases deal with it differently (though I don't
> think any of them use atomic 2-phase commit). MongoDB, for example, will
> ask you to read from the node you wrote to first (the primary node) unless
> you are ok with eventual consistency. If the write did not make it to a
> majority of the other nodes, it will be rolled back from the original
> primary when it comes up again as a secondary.
> In some cases, you could still be served either the new value (that was
> returned as failed) or the old one. But it is different from Cassandra
> in the sense that Cassandra will never roll back.
>
>
>
> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John  wrote:
>> The leap of faith here is that an error does not mean a clean backing out to
>> prior state - as we are used to with databases. It means that the operation
>> in error could have gone through partially
>>
>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>> -JA
>> On Thu, Feb 24, 2011 at 1:16 PM, A J  wrote:
>>>
>>> >>but could be broken in case of a failed write<<
>>> You can think of a scenario where R + W > N still leads to
>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>> Let's say the one node where a write happened with success goes down
>>> before it made it to the other N-1 nodes. Let's say it goes down for good
>>> and is unrecoverable. The only option is to build a new node from
>>> scratch from the other active nodes. This will lead to a write that was
>>> lost, and you will end up serving a stale copy of it.
>>>
>>> It is better to talk in terms of use cases and whether Cassandra will be
>>> a fit for them. Otherwise, unless you have W=R=N and fsync before each
>>> write commit, there will be scope for inconsistency.
>>>
>>>
>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John 
>>> wrote:
>>> > I see the point - apologies for putting everyone through this!
>>> > It was just militating against my mental model.
>>> > In summary, here is my take away - simple stuff but - IMO - important to
>>> > conclude this thread (I hope):-
>>> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
>>> > should be immediately followed by the same write going to a connection
>>> > on to
>>> > another node ( potentially using connection caches of client
>>> > implementations
>>> > ) or a Read at CL of All. Because a write could have partially gone
>>> > through.
>>> > 2. Timestamps are used in determining the latest version ( correcting
>>> > the
>>> > false impression I was propagating)
>>> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
>>> > in
>>> > case of a failed write as it is unsure whether the new value got written
>>> > on
>>> >  any server or not. Is that a fair characterization ?
>>> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
>>> > cleanup and revert back, app code has to follow up if  immediate - and
>>> > not
>>> > eventual -  consistency is desired. I made that leap in almost all cases
>>> > - I
>>> > think - but the case of a failed write.
>>> > My bad and I can live with this!
>>> > Regards,
>>> > -JA
>>> >
>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>> > 
>>> > wrote:
>>> >>
>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John 
>>> >> wrote:
>>> >>>
>>> >>> Completely understand!
>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>> >>> consistency or not. That is what the documentation says - right. IF
>>> >>> for a CL
>>> >>> of Q read - it depends on which node returns read first to determine
>>> >>> the
>>> >>> actual returned result or other more convoluted conditions , then a
>>> >>> Quorum
>>> >>> read/write is not consistent, by any definition.
>>> >>
>>> >> But that's the point. The definition of consistency we are talking
>>> >> about
>>> >> has no meaning if you consider only a quorum read. The definition
>>> >> (which is
>>> >> the de facto definition of consistency in 'eventually consistent') make
>>> >> sense if we talk about a write followed by a read. And it is
>>> >> considering succeeding write followed by succeeding read.
>>> >> And that is the statement the wiki is making.
>>> >> Honestly, we could debate forever on the definition of consistency and
>>> >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
>>> >> replica and then a (succeeding) read on R replica and if R+W>N, then it
>>> >> is
>>> >> guaranteed that the read will see the preceding write. And this is what
>>> >> is
>>> >> called consistency in the context of eventual consistency (which is not
>>> >> the
>>> >> context of ACID).
>>> >> If this is not the definition of consistency you had in mind then by
>>> >> all
>>> >> mean, Cassandra probably don't guarantee this definition. But given
>>> >> that the
>>> >> paragraph preceding what you pasted state clearly we are not

Re: Understanding Indexes

2011-02-24 Thread mcasandra

I wasn't aware that there is an index on the primary key (that is, the row
keys). So, from what I understand, there is by default an index on, for eg: ,
in the below example? Where can I read more about it?

UserProfile = { // this is a ColumnFamily
 {   // this is the key to this Row inside the CF
// now we have an infinite # of columns in this row
username: "phatduckk",
email: "[hidden email]",
phone: "(900) 976-"
}, // end row
 {   // this is the key to another row in the CF
// now we have another infinite # of columns in this row
username: "ieure",
email: "[hidden email]",
phone: "(888) 555-1212"
age: "66",
gender: "undecided"
},
 }


-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061857.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Does CL on "ALL" have special semantics like "ANY" does

2011-02-24 Thread Anthony John
All:

So "ANY" CL seems to mean that Write (and read) on any node, even if it is a
hinted handoff, and return success. Correct ?
Guessing this accommodates node failure - right ?


Does "ALL"  succeed even if there is a single surviving replica for the
given piece of data ?
Again, tolerates node failure. Does it really mean - from ALL surviving
nodes ?

-JA


Re: Does CL on "ALL" have special semantics like "ANY" does

2011-02-24 Thread Tyler Hobbs
On Thu, Feb 24, 2011 at 2:36 PM, Anthony John  wrote:

>
> Does "ALL"  succeed even if there is a single surviving replica for the
> given piece of data ?
> Again, tolerates node failure. Does it really mean - from ALL surviving
> nodes ?
>

All replicas (RF) for that row must respond before an operation at ALL is
considered a success.  That's all there is to it.

-- 
Tyler Hobbs
Software Engineer, DataStax 
Maintainer of the pycassa  Cassandra
Python client library


Re: "null" vs "value not found"?

2011-02-24 Thread buddhasystem

Thanks Tyler,

ColumnFamily: index1
  Columns sorted by: org.apache.cassandra.db.marshal.AsciiType
  Row cache size / save period: 0.0/0
  Key cache size / save period: 1.0/3600
  Memtable thresholds: 0.8765625/50/60
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Built indexes: []

I pretty much went with the default settings, and the column name is
'CATALOG'.

Maxim




Tyler Hobbs-2 wrote:
> 
> On Thu, Feb 24, 2011 at 2:27 PM, buddhasystem  wrote:
> 
>>
>> I'm doing insertion with a pycassa client. It seems to work in most
>> cases,
>> but sometimes, when I go to Cassandra-cli, and query with key and column
>> that I inserted, I get "null" whereas I shouldn't. What could be causes
>> for
>> that?
>>
> 
> Could you clarify what column name and value you are using as well as the
> comparator and validator types?
> 
> -- 
> Tyler Hobbs
> Software Engineer, DataStax 
> Maintainer of the pycassa  Cassandra
> Python client library
> 
> 

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/null-vs-value-not-found-tp6061828p6061900.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understanding Indexes

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:34 PM, mcasandra  wrote:
>
> I wasn't aware that there is an index on primary key (that is row keys). So
> from what I understand there is by default an index on for eg: , in
> below example? Where can I read more about it?
>
> UserProfile = { // this is a ColumnFamily
>     {   // this is the key to this Row inside the CF
>        // now we have an infinite # of columns in this row
>        username: "phatduckk",
>        email: "[hidden email]",
>        phone: "(900) 976-"
>    }, // end row
>     {   // this is the key to another row in the CF
>        // now we have another infinite # of columns in this row
>        username: "ieure",
>        email: "[hidden email]",
>        phone: "(888) 555-1212"
>        age: "66",
>        gender: "undecided"
>    },
>  }
>
>
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061857.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
>


Dude! You are running before you can walk: why are you worried about
secondary indexing before you know what the primary index is? :)

http://wiki.apache.org/cassandra/ArchitectureOverview
http://wiki.apache.org/cassandra/ArchitectureSSTable


Re: "null" vs "value not found"?

2011-02-24 Thread Dan Kuebrich
When I've gotten "null" as a result in cassandra-cli, it turned out to mean
that there were exceptions being thrown on the server side. Have you checked
your Cassandra logs?

On Thu, Feb 24, 2011 at 3:44 PM, buddhasystem  wrote:

>
> Thanks Tyler,
>
>ColumnFamily: index1
>  Columns sorted by: org.apache.cassandra.db.marshal.AsciiType
>  Row cache size / save period: 0.0/0
>  Key cache size / save period: 1.0/3600
>  Memtable thresholds: 0.8765625/50/60
>  GC grace seconds: 864000
>  Compaction min/max thresholds: 4/32
>  Read repair chance: 1.0
>  Built indexes: []
>
> I pretty much went with the default settings, and the column name is
> 'CATALOG'.
>
> Maxim
>
>
>
>
> Tyler Hobbs-2 wrote:
> >
> > On Thu, Feb 24, 2011 at 2:27 PM, buddhasystem  wrote:
> >
> >>
> >> I'm doing insertion with a pycassa client. It seems to work in most
> >> cases,
> >> but sometimes, when I go to Cassandra-cli, and query with key and column
> >> that I inserted, I get "null" whereas I shouldn't. What could be causes
> >> for
> >> that?
> >>
> >
> > Could you clarify what column name and value you are using as well as the
> > comparator and validator types?
> >
> > --
> > Tyler Hobbs
> > Software Engineer, DataStax 
> > Maintainer of the pycassa  Cassandra
> > Python client library
> >
> >
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/null-vs-value-not-found-tp6061828p6061900.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: Understanding Indexes

2011-02-24 Thread mcasandra

Either I am not explaining properly or I don't understand the data model just
yet. Please check again:

In below example this is what I understand:

1) UserProfile is a CF
2)  is a row key
3) username is a column. Each row (eg ) has username column

My understanding is that secondary indexes can be created only on column
value. Which means I can create secondary index only on username, email etc.
not on .  is the row key, but you keep saying that I need secondary
index, but I am actually asking about index on the row key.

Is my understanding incorrect about this?

> UserProfile = { // this is a ColumnFamily 
> {   // this is the key to this Row inside the CF 
>// now we have an infinite # of columns in this row 
>username: "phatduckk", 
>email: "[hidden email]", 
>phone: "(900) 976-" 
>}, // end row 
> {   // this is the key to another row in the CF 
>// now we have another infinite # of columns in this row 
>username: "ieure", 
>email: "[hidden email]", 
>phone: "(888) 555-1212" 
>age: "66", 
>gender: "undecided" 
>}, 
>  } 

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061959.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread A J
While we are at it, there's more to consider than just CAP in distributed systems :)
http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors

On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo  wrote:
> On Thu, Feb 24, 2011 at 3:03 PM, A J  wrote:
>> yes, that is difficult to digest and one has to be sure if the use
>> case can afford it.
>>
>> Some other NOSQL databases deals with it differently (though I don't
>> think any of them use atomic 2-phase commit). MongoDB for example will
>> ask you to read from the node you wrote first (primary node) unless
>> you are ok with eventual consistency. If the write did not make to
>> majority of other nodes, it will be rolled-back from the original
>> primary when it comes up again as a secondary.
>> In some cases, you still could server either new value (that was
>> returned as failed) or the old one. But it is different from Cassandra
>> in the sense that Cassandra will never rollback.
>>
>>
>>
>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John  wrote:
>>> The leap of faith here is that an error does not mean a clean backing out to
>>> prior state - as we are used to with databases. It means that the operation
>>> in error could have gone through partially
>>>
>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>> -JA
>>> On Thu, Feb 24, 2011 at 1:16 PM, A J  wrote:

 >>but could be broken in case of a failed write<<
 You can think of a scenario where R + W >N still leads to
 inconsistency even for successful writes. Say you keep W=1 and R=N .
 Lets say the one node where a write happened with success goes down
 before it made to the other N-1 nodes. Lets say it goes down for good
 and is unrecoverable. The only option is to build a new node from
 scratch from other active nodes. This will lead to a write that was
 lost and you will end up serving stale copy of it.

 It is better to talk in terms of use cases and if cassandra will be a
 fit for it. Otherwise unless you have W=R=N and fsync before each
 write commit, there will be scope for inconsistency.


 On Thu, Feb 24, 2011 at 1:25 PM, Anthony John 
 wrote:
 > I see the point - apologies for putting everyone through this!
 > It was just militating against my mental model.
 > In summary, here is my take away - simple stuff but - IMO - important to
 > conclude this thread (I hope):-
 > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
 > should be immediately followed by the same write going to a connection
 > on to
 > another node ( potentially using connection caches of client
 > implementations
 > ) or a Read at CL of All. Because a write could have partially gone
 > through.
 > 2. Timestamps are used in determining the latest version ( correcting
 > the
 > false impression I was propagating)
 > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
 > in
 > case of a failed write as it is unsure whether the new value got written
 > on
 >  any server or not. Is that a fair characterization ?
 > Bottom line - unlike traditional DBMS, errors do not ensure automatic
 > cleanup and revert back, app code has to follow up if  immediate - and
 > not
 > eventual -  consistency is desired. I made that leap in almost all cases
 > - I
 > think - but the case of a failed write.
 > My bad and I can live with this!
 > Regards,
 > -JA
 >
 > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
 > 
 > wrote:
 >>
 >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John 
 >> wrote:
 >>>
 >>> Completely understand!
 >>> All that I am quibbling over is whether a CL of quorum guarantees
 >>> consistency or not. That is what the documentation says - right. IF
 >>> for a CL
 >>> of Q read - it depends on which node returns read first to determine
 >>> the
 >>> actual returned result or other more convoluted conditions , then a
 >>> Quorum
 >>> read/write is not consistent, by any definition.
 >>
 >> But that's the point. The definition of consistency we are talking
 >> about
 >> has no meaning if you consider only a quorum read. The definition
 >> (which is
 >> the de facto definition of consistency in 'eventually consistent') make
 >> sense if we talk about a write followed by a read. And it is
 >> considering succeeding write followed by succeeding read.
 >> And that is the statement the wiki is making.
 >> Honestly, we could debate forever on the definition of consistency and
 >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
 >> replica and then a (succeeding) read on R replica and if R+W>N, then it
 >> is
 >> guaranteed that the read will see the preceding write. And this is what
 >> is
 >> called consistency in the context of even

Re: "null" vs "value not found"?

2011-02-24 Thread buddhasystem

Thanks! You are right. I see an exception but have no idea what went wrong.


ERROR [ReadStage:14] 2011-02-24 21:51:29,374 AbstractCassandraDaemon.java
(line 113) Fatal exception in thread Thread[ReadStage:14,5,main]
java.io.IOError: java.io.EOFException
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:75)
at
org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
at
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
at
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1316)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1205)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1134)
at org.apache.cassandra.db.Table.getRow(Table.java:386)
at
org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
at
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:69)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:70)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(Unknown Source)
at
org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:48)
at
org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30)
at
org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:108)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:106)
at
org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:71)
... 12 more

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/null-vs-value-not-found-tp6061828p6061983.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Understanding Indexes

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:55 PM, mcasandra  wrote:
>
> Either I am not explaning properly or I don't understand the data model just
> yet. Please check again:
>
> In below example this is what I understand:
>
> 1) UserProfile is a CF
> 2)  is a row key
> 3) username is a column. Each row (eg ) has username column
>
> My understanding is that secondary indexes can be created only on column
> value. Which means I can create secondary index only on username, email etc.
> not on .  is the row key, but you keep saying that I need secondary
> index, but I am actually asking about index on the row key.
>
> Is my understanding incorrect about this?
>
>> UserProfile = { // this is a ColumnFamily
>>     {   // this is the key to this Row inside the CF
>>        // now we have an infinite # of columns in this row
>>        username: "phatduckk",
>>        email: "[hidden email]",
>>        phone: "(900) 976-"
>>    }, // end row
>>     {   // this is the key to another row in the CF
>>        // now we have another infinite # of columns in this row
>>        username: "ieure",
>>        email: "[hidden email]",
>>        phone: "(888) 555-1212"
>>        age: "66",
>>        gender: "undecided"
>>    },
>>  }
>
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061959.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
>

You do not need secondary indexes to search on the RowKey. The Row Key
is used by the partitioner to locate your data across the cluster. The
Row Key is also used as the primary sort of the SSTables. Thus the row
key is naturally indexed.
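
A toy model of that "naturally indexed" row key (simplified: real SSTables sort by the partitioner's decorated key and use a separate index file plus a bloom filter):

```python
import bisect

# Rows kept sorted by key, as in an SSTable.
sstable = [
    ("key1", {"username": "phatduckk"}),
    ("key2", {"username": "ieure"}),
    ("key9", {"username": "zed"}),
]
keys = [k for k, _ in sstable]

def get_row(key):
    # Binary search instead of a full scan -- this is why no extra index is
    # needed to look up a row by its key.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sstable[i][1]
    return None  # definitely not in this SSTable

assert get_row("key2") == {"username": "ieure"}
assert get_row("key5") is None
```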


Re: Understanding Indexes

2011-02-24 Thread mcasandra

Thanks! I just started reading about Bloom filters. Is this something that is
built in by default, or does it need to be explicitly configured?
-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6062010.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: "null" vs "value not found"?

2011-02-24 Thread Dan Kuebrich
I should mention that it took me a while to figure this out too. Might this be
a candidate for an improvement in the CLI?

On Thu, Feb 24, 2011 at 4:01 PM, buddhasystem  wrote:

>
> Thanks! You are right. I see exception but have no idea what went wrong.
>
>
> ERROR [ReadStage:14] 2011-02-24 21:51:29,374 AbstractCassandraDaemon.java
> (line 113) Fatal exception in thread Thread[ReadStage:14,5,main]
> java.io.IOError: java.io.EOFException
>at
>
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.(SSTableNamesIterator.java:75)
>at
>
> org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59)
>at
>
> org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
>at
>
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1316)
>at
>
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1205)
>at
>
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1134)
>at org.apache.cassandra.db.Table.getRow(Table.java:386)
>at
>
> org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60)
>at
> org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:69)
>at
>
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:70)
>at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> Source)
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
>at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.EOFException
>at java.io.DataInputStream.readInt(Unknown Source)
>at
>
> org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:48)
>at
>
> org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30)
>at
>
> org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:108)
>at
>
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:106)
>at
>
> org.apache.cassandra.db.columniterator.SSTableNamesIterator.<init>(SSTableNamesIterator.java:71)
>... 12 more
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/null-vs-value-not-found-tp6061828p6061983.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: Understanding Indexes

2011-02-24 Thread Tyler Hobbs
On Thu, Feb 24, 2011 at 3:07 PM, mcasandra  wrote:

>
> Thanks! I just started reading about Bloom Filter. Is this something that
> is
> inbuilt by default or is it something that need to be explicitly
> configured?
>

It's built in, no configuration needed.

-- 
Tyler Hobbs
Software Engineer, DataStax 
Maintainer of the pycassa  Cassandra
Python client library


dropped mutations, UnavailableException, and long GC

2011-02-24 Thread Jeffrey Wang
Hey all,

Our setup is 5 machines running Cassandra 0.7.0 with 24GB of heap and 1.5TB 
disk each collocated in a DC. We're doing bulk imports from each of the nodes 
with RF = 2 and write consistency ANY (write perf is very important). The 
behavior we're seeing is this:


-  Nodes often see each other as dead even though none of the nodes 
actually go down. I suspect this may be due to long GCs. It seems like 
increasing the RPC timeout could help this, but I'm not convinced this is the 
root of the problem. Note that in this case writes return with the 
UnavailableException.

-  As mentioned, long GCs. We see the ParNew GC doing a lot of smaller 
collections (few hundred MB) which are very fast (few hundred ms), but every 
once in a while the ConcurrentMarkSweep will take a LONG time (up to 15 min!) 
to collect upwards of 15GB at once.

-  On some nodes, we see a lot of pending MutationStages build up (e.g. 
500K), which leads to the messages "Dropped X MUTATION messages in the last 
5000ms," presumably meaning that Cassandra has decided to not write one of the 
replicas of the data. This is not a HUGE deal, but is less than ideal.

-  The end result is that a bunch of writes end up failing due to the 
UnavailableExceptions, so not all of our data is getting into Cassandra.

So my question is: what is the best way to avoid this behavior? Our memtable 
thresholds are fairly low (256MB) so there should be plenty of heap space to 
work with. We may experiment with write consistency ONE or ALL to see if the 
perf hit is not too bad, but I wanted to get some opinions on why this might be 
happening. Thanks!

-Jeffrey



Re: New Chain for : Does Cassandra use vector clocks

2011-02-24 Thread Edward Capriolo
On Thu, Feb 24, 2011 at 3:56 PM, A J  wrote:
> While we are at it, there's more to consider than just CAP in distributed :)
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>
> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo  
> wrote:
>> On Thu, Feb 24, 2011 at 3:03 PM, A J  wrote:
>>> yes, that is difficult to digest and one has to be sure if the use
>>> case can afford it.
>>>
>>> Some other NOSQL databases deals with it differently (though I don't
>>> think any of them use atomic 2-phase commit). MongoDB for example will
>>> ask you to read from the node you wrote first (primary node) unless
>>> you are ok with eventual consistency. If the write did not make to
>>> majority of other nodes, it will be rolled-back from the original
>>> primary when it comes up again as a secondary.
>>> In some cases, you still could server either new value (that was
>>> returned as failed) or the old one. But it is different from Cassandra
>>> in the sense that Cassandra will never rollback.
>>>
>>>
>>>
>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John  wrote:
 The leap of faith here is that an error does not mean a clean backing out 
 to
 prior state - as we are used to with databases. It means that the operation
 in error could have gone through partially

 Again, this is not an absolutely unfamiliar territory and can be dealt 
 with.
 -JA
 On Thu, Feb 24, 2011 at 1:16 PM, A J  wrote:
>
> >>but could be broken in case of a failed write<<
> You can think of a scenario where R + W >N still leads to
> inconsistency even for successful writes. Say you keep W=1 and R=N .
> Lets say the one node where a write happened with success goes down
> before it made to the other N-1 nodes. Lets say it goes down for good
> and is unrecoverable. The only option is to build a new node from
> scratch from other active nodes. This will lead to a write that was
> lost and you will end up serving stale copy of it.
>
> It is better to talk in terms of use cases and if cassandra will be a
> fit for it. Otherwise unless you have W=R=N and fsync before each
> write commit, there will be scope for inconsistency.
>
>
> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John 
> wrote:
> > I see the point - apologies for putting everyone through this!
> > It was just militating against my mental model.
> > In summary, here is my take away - simple stuff but - IMO - important to
> > conclude this thread (I hope):-
> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
> > should be immediately followed by the same write going to a connection
> > on to
> > another node ( potentially using connection caches of client
> > implementations
> > ) or a Read at CL of All. Because a write could have partially gone
> > through.
> > 2. Timestamps are used in determining the latest version ( correcting
> > the
> > false impression I was propagating)
> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
> > in
> > case of a failed write as it is unsure whether the new value got written
> > on
> >  any server or not. Is that a fair characterization ?
> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
> > cleanup and revert back, app code has to follow up if  immediate - and
> > not
> > eventual -  consistency is desired. I made that leap in almost all cases
> > - I
> > think - but the case of a failed write.
> > My bad and I can live with this!
> > Regards,
> > -JA
> >
> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
> > 
> > wrote:
> >>
> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John 
> >> wrote:
> >>>
> >>> Completely understand!
> >>> All that I am quibbling over is whether a CL of quorum guarantees
> >>> consistency or not. That is what the documentation says - right. IF
> >>> for a CL
> >>> of Q read - it depends on which node returns read first to determine
> >>> the
> >>> actual returned result or other more convoluted conditions , then a
> >>> Quorum
> >>> read/write is not consistent, by any definition.
> >>
> >> But that's the point. The definition of consistency we are talking
> >> about
> >> has no meaning if you consider only a quorum read. The definition
> >> (which is
> >> the de facto definition of consistency in 'eventually consistent') make
> >> sense if we talk about a write followed by a read. And it is
> >> considering succeeding write followed by succeeding read.
> >> And that is the statement the wiki is making.
> >> Honestly, we could debate forever on the definition of consistency and
> >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
> >> replica and then a (succeeding) read on R replica and if R

Re: Understanding Indexes

2011-02-24 Thread Michal Augustýn
Retrieving data using the row key is the primary way to get data from
Cassandra, so it's highly optimized.
First, the node responsible for the row is located using the partitioner. You
can use RandomPartitioner (which distributes the md5 of keys) or
OrderPreservingPartitioner (the key must be a UTF8 string).
Then the row is found on the node using a bloom filter (
http://wiki.apache.org/cassandra/ArchitectureOverview).

So when you want to retrieve a row by its key, that is the fastest way you
can get the row.
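A toy version of the bloom filter step (sizes and hash scheme invented for illustration; Cassandra's real filter differs) shows why a negative answer lets the node skip the SSTable entirely — a "no" is definitive, only a "maybe" requires a disk read:

```python
import hashlib

class ToyBloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored as one big int

    def _positions(self, key: bytes):
        # derive several bit positions per key by salting the hash
        for i in range(self.num_hashes):
            h = hashlib.md5(bytes([i]) + key).digest()
            yield int.from_bytes(h, "big") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True means "maybe" (false positives possible)
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = ToyBloomFilter()
bf.add(b"row-key-1")
print(bf.might_contain(b"row-key-1"))  # True
```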

Augi

2011/2/24 mcasandra 

>
> Thanks! I just started reading about Bloom Filter. Is this something that
> is
> inbuilt by default or is it something that need to be explicitly
> configured?
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6062010.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Homebrew CF-indexing vs secondary indexing

2011-02-24 Thread Ron Siemens

I am doing some experimenting with indexing.  My data CF has about 25000 rows 
around 1KB each.  I set up a special column of boolean value to use as the 
secondary index.  I also created my own index in a separate CF where each index 
is one row and the column names are the data keys.

The implementation is in Hector 0.7.0-27, and run options are -Xms64m -Xmx256m

Below are two sample runs, the first using the secondary index with 
IndexedSlicesQuery.  The second using my homebrew CF index and createSliceQuery 
for the index followed by createMultigetSliceQuery for the data.  The timing 
output is from result.getExecutionTimeMicro(), but it looks like ms.  I'm not 
sure if its purpose is as I'm assuming and using here.  By the way, THS is just 
the name of the index, which selects a subset of 7293 of the roughly 25000 rows.

Anyway, it looks like the custom index does significantly better.  Is this 
expected?  Why?  I expected them to be about the same, having read the 
secondary index also uses a column family internally.  But more disconcerting, 
the secondary index implementation runs out of space, while the custom one runs 
along with only a few notable slow downs.  Both implementations are using the 
same column-processing/deserialization code so that doesn't seem to be to 
blame.  What gives?
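Ron's homebrew layout — one index row whose column names are the matching data keys, sliced first and then followed by a multiget — can be sketched with plain dicts standing in for column families (all names and data are invented):

```python
# data CF: row key -> columns (one boolean column marks index membership)
data_cf = {f"doc{i}": {"body": f"payload {i}", "ths": i % 2 == 0} for i in range(10)}

# homebrew index CF: one row per index; the column *names* are the data keys
index_cf = {"THS": {k: b"" for k, cols in data_cf.items() if cols["ths"]}}

# query = slice the index row (cheap), then multiget the data rows by key
matching_keys = list(index_cf["THS"])          # like createSliceQuery on the index row
rows = {k: data_cf[k] for k in matching_keys}  # like createMultigetSliceQuery

print(len(rows))  # 5
```

The slice over a single index row touches one wide row, which is consistent with the fast "CFIndex read" timings in the log; the bulk of the time is then the multiget of the data rows.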

Ron


Sample run: Secondary index.

DEBUG Retrieved THS / 7293 rows, in 2012 ms
DEBUG Retrieved THS / 7293 rows, in 1956 ms
DEBUG Retrieved THS / 7293 rows, in 1843 ms
DEBUG Retrieved THS / 7293 rows, in 2295 ms
DEBUG Retrieved THS / 7293 rows, in 1828 ms
DEBUG Retrieved THS / 7293 rows, in 1740 ms
DEBUG Retrieved THS / 7293 rows, in 1899 ms
DEBUG Retrieved THS / 7293 rows, in 2266 ms
DEBUG Retrieved THS / 7293 rows, in 2310 ms
DEBUG Retrieved THS / 7293 rows, in 2395 ms
DEBUG Retrieved THS / 7293 rows, in 2829 ms
DEBUG Retrieved THS / 7293 rows, in 2725 ms
DEBUG Retrieved THS / 7293 rows, in 3752 Exception in thread "main" 
java.lang.OutOfMemoryError: Java heap space
at java.nio.CharBuffer.wrap(CharBuffer.java:350)
at java.nio.CharBuffer.wrap(CharBuffer.java:373)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
at java.lang.StringCoding.decode(StringCoding.java:173)
at java.lang.String.<init>(String.java:443)
at 
me.prettyprint.cassandra.serializers.StringSerializer.fromByteBuffer(StringSerializer.java:40)
at 
me.prettyprint.cassandra.serializers.StringSerializer.fromByteBuffer(StringSerializer.java:13)
at 
me.prettyprint.cassandra.serializers.AbstractSerializer.fromBytes(AbstractSerializer.java:38)
at 
me.prettyprint.cassandra.model.HColumnImpl.<init>(HColumnImpl.java:48)
at 
me.prettyprint.cassandra.model.ColumnSliceImpl.<init>(ColumnSliceImpl.java:27)
at me.prettyprint.cassandra.model.RowImpl.<init>(RowImpl.java:32)
at me.prettyprint.cassandra.model.RowsImpl.<init>(RowsImpl.java:33)
at 
me.prettyprint.cassandra.model.OrderedRowsImpl.<init>(OrderedRowsImpl.java:30)
at 
me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:143)
at 
me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:131)
at 
me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
at 
me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
at 
me.prettyprint.cassandra.model.IndexedSlicesQuery.execute(IndexedSlicesQuery.java:130)



Sample run: Homebrew CF-indexing

DEBUG CFIndex THS / 7293 read in 262 ms
DEBUG Retrieved THS / 7293 rows, in 1579 ms
DEBUG CFIndex THS / 7293 read in 44 ms
DEBUG Retrieved THS / 7293 rows, in 1771 ms
DEBUG CFIndex THS / 7293 read in 38 ms
DEBUG Retrieved THS / 7293 rows, in 1275 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1364 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1590 ms
DEBUG CFIndex THS / 7293 read in 22 ms
DEBUG Retrieved THS / 7293 rows, in 1118 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1280 ms
DEBUG CFIndex THS / 7293 read in 21 ms
DEBUG Retrieved THS / 7293 rows, in 1466 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1589 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1772 ms
DEBUG CFIndex THS / 7293 read in 20 ms
DEBUG Retrieved THS / 7293 rows, in 1660 ms
DEBUG CFIndex THS / 7293 read in 20 ms
DEBUG Retrieved THS / 7293 rows, in 1931 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1626 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1750 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1557 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 9409 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1709 ms
DEBUG CFIndex T

Re: Homebrew CF-indexing vs secondary indexing

2011-02-24 Thread buddhasystem

FWIW, for me the advantage of homebrew indexes is that they can be a lot more
sophisticated than the standard -- I can hash combinations of column values
to whatever I want. I also put counters on column values in the index, so
there is lots of functionality. Of course, I can do it because my data
becomes read-only, I know it's a luxury.

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Homebrew-CF-indexing-vs-secondary-indexing-tp6062677p6062705.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Homebrew CF-indexing vs secondary indexing

2011-02-24 Thread Ron Siemens

I failed to mention: this is just doing repeated data retrievals using the 
index.

> ...
> 
> Sample run: Secondary index.
> 
> DEBUG Retrieved THS / 7293 rows, in 2012 ms
> DEBUG Retrieved THS / 7293 rows, in 1956 ms
> DEBUG Retrieved THS / 7293 rows, in 1843 ms
...



Re: map reduce job over indexed range of keys

2011-02-24 Thread Matt Kennedy
Right, so I'm interpreting silence as a confirmation on all points. I
opened:
https://issues.apache.org/jira/browse/CASSANDRA-2245
https://issues.apache.org/jira/browse/CASSANDRA-2246

to work on these.

On Wed, Feb 23, 2011 at 5:31 PM, Matt Kennedy  wrote:

> Let me start out by saying that I think I'm going to have to write a patch
> to get what I want, but I'm fine with that.  I just wanted to check here
> first to make sure that I'm not missing something obvious.
>
> I'd like to be able to run a MapReduce job that takes a value in an indexed
> column as a parameter, and use that to select the data that the MapReduce
> job operates on.  Right now, it looks like this isn't possible because
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader will only fetch data
> with get_range_slices, not get_indexed_slices.
>
> An example might be useful.  Let's say I want to run a map reduce job over
> all the data for a particular country.  Right now I can do this in Map
> Reduce by simply discarding all the data that is not from the country I want
> to process on. I suspect it will be faster if I can reduce the size of the
> Map Reduce job by only selecting the data I want by using secondary indexes
> in Cassandra.
>
> So, first question: Am I wrong?  Is there some clever way to enable the
> behavior I'm looking for (without modifying the cassandra codebase)?
>
> Second question: If I'm not wrong, should I open a JIRA issue for this and
> start coding up this feature?
>
> Finally, the real reason that I want to get this working is so that I can
> enhance the CassandraStorage pig loadfunc so that it can take query
> parameters on in the URL string that is used to specify the keyspace and
> column family.  So for example, you might load data into Pig with this
> syntax:
>
> rows = LOAD 'cassandra://mykeyspace/mycolumnfamily?country=UK' using
> CassandraStorage();
>
> I'd like to get some feedback on that syntax.
>
> Thanks,
> Matt Kennedy
>
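Until the record reader supports get_indexed_slices, the country filter Matt describes has to run client-side in the mapper; a schematic sketch of that discard-and-emit pattern (field names and row data invented, plain Python rather than the Hadoop API):

```python
def mapper(row_key, columns, wanted_country="UK"):
    # without server-side index support, every row reaches the mapper
    # and rows from other countries are simply discarded
    if columns.get("country") != wanted_country:
        return []  # emit nothing for non-matching rows
    return [(columns["country"], 1)]

rows = [("r1", {"country": "UK"}), ("r2", {"country": "FR"}), ("r3", {"country": "UK"})]
emitted = [kv for key, cols in rows for kv in mapper(key, cols)]
print(emitted)  # [('UK', 1), ('UK', 1)]
```

The cost is that all rows are still scanned and shipped to the job; pushing the predicate into get_indexed_slices would shrink the input before it leaves Cassandra, which is the point of the proposed patch.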


Re: dropped mutations, UnavailableException, and long GC

2011-02-24 Thread Narendra Sharma
1. Why 24GB of heap? Do you need this high heap? Bigger heap can lead to
longer GC cycles but 15min look too long.
2. Do you have ROW cache enabled?
3. How many column families do you have?
4. Enable GC logs and monitor what GC is doing to get idea of why it is
taking so long. You can add following to enable gc log.
# GC logging options -- uncomment to enable
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
# JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogram"
# JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
# JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

5. Move to Cassandra 0.7.2, if possible. It has following nice feature:
"added flush_largest_memtables_at and reduce_cache_sizes_at options to
cassandra.yaml as an escape value for memory pressure"

Thanks,
Naren


On Thu, Feb 24, 2011 at 2:21 PM, Jeffrey Wang  wrote:

> Hey all,
>
>
>
> Our setup is 5 machines running Cassandra 0.7.0 with 24GB of heap and 1.5TB
> disk each collocated in a DC. We’re doing bulk imports from each of the
> nodes with RF = 2 and write consistency ANY (write perf is very important).
> The behavior we’re seeing is this:
>
>
>
> -  Nodes often see each other as dead even though none of the
> nodes actually go down. I suspect this may be due to long GCs. It seems like
> increasing the RPC timeout could help this, but I’m not convinced this is
> the root of the problem. Note that in this case writes return with the
> UnavailableException.
>
> -  As mentioned, long GCs. We see the ParNew GC doing a lot of
> smaller collections (few hundred MB) which are very fast (few hundred ms),
> but every once in a while the ConcurrentMarkSweep will take a LONG time (up
> to 15 min!) to collect upwards of 15GB at once.
>
> -  On some nodes, we see a lot of pending MutationStages build up
> (e.g. 500K), which leads to the messages “Dropped X MUTATION messages in the
> last 5000ms,” presumably meaning that Cassandra has decided to not write one
> of the replicas of the data. This is not a HUGE deal, but is less than
> ideal.
>
> -  The end result is that a bunch of writes end up failing due to
> the UnavailableExceptions, so not all of our data is getting into Cassandra.
>
>
>
> So my question is: what is the best way to avoid this behavior? Our
> memtable thresholds are fairly low (256MB) so there should be plenty of heap
> space to work with. We may experiment with write consistency ONE or ALL to
> see if the perf hit is not too bad, but I wanted to get some opinions on why
> this might be happening. Thanks!
>
>
>
> -Jeffrey
>
>
>


unsubscribe

2011-02-24 Thread Jun Young Kim


--
Junyoung Kim (juneng...@gmail.com)



unsubscribe

2011-02-24 Thread Ardi Chen
2011/2/25 Jun Young Kim 

>
> --
> Junyoung Kim (juneng...@gmail.com)
>
>


Re: unsubscribe

2011-02-24 Thread Eric Evans

http://goo.gl/3sjE5

On Fri, 2011-02-25 at 10:33 +0800, Ardi Chen wrote:
> 2011/2/25 Jun Young Kim 
> 
> >
> > --
> > Junyoung Kim (juneng...@gmail.com)


-- 
Eric Evans
eev...@rackspace.com



Re: How does Cassandra handle failure during synchronous writes

2011-02-24 Thread Jonathan Ellis
This is where things starts getting subtle.

If Cassandra's failure detector knows ahead of time that not enough
writes are available, that is the only time we truly fail a write, and
nothing will be written anywhere.  But if a write starts during the
window where a node is failed but we don't know it yet, then it will
return TimedOutException.

This is commonly called a "failed write" but that is incorrect -- the
write is in progress, but we can't guarantee it's been replicated to
the desired number of replicas.

It's important to note that even in this situation, quorum reads +
writes provide strong consistency.  ("Strong consistency" is defined
as "after an update completes, any subsequent access will return the
updated value.") Quorum reads will be unable to complete as well until
enough machines come back to satisfy the quorum, which is the same
number as needed to finish the write.  So either the original writer
retrying, or the first reader will cause the write to be completed,
after which we're on familiar ground.

Consider the simplest non-trivial quorum, where we are replicating to
nodes X, Y, and Z.  For the case we are interested in, the original
quorum write attempt must time out, so 2 of the 3 replicas (Y and Z)
are temporarily unavailable. The write is applied to one replica (X),
and the client gets a TimedOutException. The write is not failed, it
is not succeeded, it is in progress (and the client should retry,
because it doesn't know for sure that it was applied anywhere at all).

While Y and Z stay down, quorum reads will be rejected.

When they come back up*, a read could achieve a quorum as {X, Y} or
{X, Z} or {Y, Z}.

{Y, Z} is the more interesting case because neither has the new write
yet.  The client will get the old version back, which is fine
according to our contract since the write is still in-progress.  Read
repair will see the new version on X and send it to Y and Z.  As soon
as it gets to one of those, the original write is complete, and all
subsequent reads will see the new version.

{X, Y} and {X, Z} are equivalent: one node with the write, and one
without. The read will recognize that X's version needs to be sent to
the node that lacks it, and the write will be complete.  This read and
all subsequent ones will see the write.  (The remaining replica will be
replicated to asynchronously via read repair.)

*If only one comes back up, then you of course only have the {X, Y} or
{X, Z} case.

The important guarantee this gives you is that once one quorum read
sees the new value, all others will too.  You can't see the newest
version, then see an older version on a subsequent read, which is the
characteristic of non-strong consistency (and which you can see in
Cassandra, temporarily, with lower ConsistencyLevels).
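The overlap argument above can be checked mechanically: with R + W > N, every possible read quorum intersects every possible write quorum, so at least one replica in any read set holds the latest write. A quick sketch for N = 3 (node names as in the example):

```python
from itertools import combinations

def quorums(nodes, size):
    # all possible replica sets of the given size
    return [set(c) for c in combinations(nodes, size)]

nodes = {"X", "Y", "Z"}
R = W = 2  # quorum for N = 3

# R + W > N implies every read set shares a node with every write set
assert R + W > len(nodes)
overlaps = [bool(r & w) for r in quorums(nodes, R) for w in quorums(nodes, W)]
print(all(overlaps))  # True
```

With R = W = 1 the same check fails (e.g. {X} and {Y} are disjoint), which is exactly the window where lower consistency levels can temporarily return stale values.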

On Tue, Feb 22, 2011 at 10:22 PM, tijoriwala.ritesh
 wrote:
>
> Hi,
> I wanted to get details on how does cassandra do synchronous writes to W
> replicas (out of N)? Does it do a 2PC? If not, how does it deal with
> failures of of nodes before it gets to write to W replicas? If the
> orchestrating node cannot write to W nodes successfully, I guess it will
> fail the write operation but what happens to the completed writes on X (W >
> X) nodes?
>
> Thanks,
> Ritesh
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Does CL on "ALL" have special semantics like "ANY" does

2011-02-24 Thread baskar.duraikannu...@gmail.com
Even though the client did not get a success message, it is possible 
that the write may have succeeded on one of the replicas.  Let us say the 
client did a retry and the write succeeded.


Let us also assume that I was trying to withdraw $100. Initially $100 
was withdrawn according to one of the replicas. Since not all replicas 
responded, the client retries, resulting in another $100 withdrawal.


During the hinted handoff/read repair, will the first $100 succeed on the 
other replicas? If so, is there a way to avoid this inconsistency?
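One common way out of the double-withdrawal problem described above is to make the retried write idempotent, e.g. by recording each withdrawal under a client-generated transaction id that is reused on retry, so replaying the write overwrites identical data instead of decrementing twice. A sketch (the column layout is invented, not a Cassandra API):

```python
import uuid

ledger = {}  # column name (txn id) -> amount, like columns in an "account" row

def withdraw(txn_id, amount):
    # writing the same txn id twice stores identical data: the retry is harmless
    ledger[txn_id] = -amount

txn = uuid.uuid4().hex   # generated once by the client, reused on retry
withdraw(txn, 100)
withdraw(txn, 100)       # timed-out first attempt retried with the SAME id

balance = 500 + sum(ledger.values())
print(balance)  # 400, not 300
```

Hinted handoff and read repair then converge all replicas on the same single ledger entry, regardless of how many times the client retried.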



On 2/24/11 3:43 PM, Tyler Hobbs wrote:
On Thu, Feb 24, 2011 at 2:36 PM, Anthony John wrote:



Does "ALL"  succeed even if there is a single surviving replica
for the given piece of data ?
Again, tolerates node failure. Does it really mean - from ALL
surviving nodes ?


All replicas (RF) for that row must respond before an operation at ALL 
is considered a success.  That's all there is to it.


--
Tyler Hobbs
Software Engineer, DataStax 
Maintainer of the pycassa  
Cassandra Python client library