Hadoop + Cassandra

2012-01-06 Thread Alain RODRIGUEZ
Hello. I have a 4-node cluster running Cassandra (without DataStax Brisk) in production. Now I want to add Hadoop (and maybe Pig / Hive?) to be able to perform some analytics. I don't know how to get started. Is there a tutorial explaining how to install, configure and use Hadoop and advantages

Re: Composite column docs

2012-01-06 Thread Shimi Kiviti
On Thu, Jan 5, 2012 at 9:13 PM, aaron morton wrote: > What client are you using ? > I am writing a client. > For example pycassa has some sweet documentation > http://pycassa.github.com/pycassa/assorted/composite_types.html > It is a sweet documentation but it doesn't help me. I a lower level do

Re: Hadoop + Cassandra

2012-01-06 Thread Jeremy Hanna
I would first look at http://wiki.apache.org/cassandra/HadoopSupport - you'll want to look in the section on cluster configuration. DataStax also has a product that makes it pretty simple to use Hadoop with Cassandra if you don't mind paying for it - http://www.datastax.com/products/enterprise

Re: Should I throttle deletes?

2012-01-06 Thread Vitalii Tymchyshyn
05.01.12 22:29, Philippe wrote: Then I do have a question, what do people generally use as the batch size? I used to do batches from 500 to 2000 like you do. After investigating issues such as the one you've encountered I've moved to batches of 20 for writes and 256 for reads. Ev
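The smaller batch sizes mentioned in this message can be applied on the client side by splitting a large set of mutations into fixed-size chunks before sending them. A minimal sketch in Python, assuming a hypothetical `send_batch` callable that stands in for whatever the client library provides; the sizes 20 and 256 are simply the figures quoted in the message, not recommended defaults:

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

WRITE_BATCH_SIZE = 20   # figure quoted in the thread for writes
READ_BATCH_SIZE = 256   # figure quoted in the thread for reads


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def send_in_batches(mutations: Iterable[T],
                    send_batch: Callable[[List[T]], None],
                    size: int = WRITE_BATCH_SIZE) -> int:
    """Send mutations in small batches and return the number of batches sent.

    `send_batch` is a hypothetical hook, not part of any real client API.
    """
    sent = 0
    for batch in chunked(mutations, size):
        send_batch(batch)
        sent += 1
    return sent
```

Smaller batches mean more round trips, so the right size depends on the cluster and the timeouts in use.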

Re: is it bad to have lots of column families?

2012-01-06 Thread Vitalii Tymchyshyn
Yes, as far as I know. Note that it's not a full index, but a "sampling" one. See the index_interval configuration parameter and its description. As for bloom filters, that is not configurable now, yet there is a ticket with a patch that should make it configurable. 05.01.12 22:45, Carlo Pires wrote
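As a rough illustration of what a "sampling" index means: only every Nth key of a sorted key list is kept in memory, and a lookup first finds the nearest sampled key, then scans forward from there. The sketch below is a toy model in Python, not Cassandra's actual implementation; the interval of 128 is an assumption used for illustration (check index_interval in your own configuration), and the function names are made up.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

INDEX_INTERVAL = 128  # assumed value, for illustration only


def build_sample(sorted_keys: List[str],
                 interval: int = INDEX_INTERVAL) -> List[Tuple[str, int]]:
    """Keep (key, position) for every `interval`-th key of a sorted list."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), interval)]


def lookup(sorted_keys: List[str],
           sample: List[Tuple[str, int]],
           key: str,
           interval: int = INDEX_INTERVAL) -> Optional[int]:
    """Find `key` by jumping to the nearest sampled entry, then scanning."""
    sampled_keys = [k for k, _ in sample]
    slot = bisect_right(sampled_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first sampled key
    start = sample[slot][1]
    # At most `interval` entries need to be scanned after the jump.
    for pos in range(start, min(start + interval, len(sorted_keys))):
        if sorted_keys[pos] == key:
            return pos
    return None
```

A larger interval keeps fewer entries in memory at the cost of a longer scan per lookup.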

Re: Should I throttle deletes?

2012-01-06 Thread Philippe
But you will then get timeouts. On 6 Jan 2012 15:17, "Vitalii Tymchyshyn" wrote: > 05.01.12 22:29, Philippe wrote: > > Then I do have a question, what do people generally use as the batch >> size? >> > I used to do batches from 500 to 2000 like you do. > After investigating issu

Re: Should I throttle deletes?

2012-01-06 Thread Vitalii Tymchyshyn
Do you mean on writes? Yes, your timeouts must be long enough that your write batch can complete before the timeout elapses. But this will lower write load, so reads should not time out. Best regards, Vitalii Tymchyshyn 06.01.12 17:37, Philippe wrote: But you will then get timeouts. On 6 Jan 201

Re: Cannot start cassandra node anymore

2012-01-06 Thread Peter Schuller
(too much to quote) Looks to me like this should be fixable by removing hints from the node in question (I don't remember whether this is a bug that's been identified and fixed or not). I may be wrong because I'm just basing this on a quick look at the stack trace but it seems to me there are hin

Dynamic columns in a column family?

2012-01-06 Thread Frank Yang
Hi everyone, I am wondering whether it is possible not to define the column metadata when creating a column family, but to specify the column when the client updates data, for example: CREATE COLUMN FAMILY products WITH default_validation_class=UTF8Type AND key_validation_class=UTF8Type AND compa

How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Drew Kutcharian
Hi Everyone, What's the best way to reliably have unique constraints like functionality with Cassandra? I have the following (which I think should be very common) use case. User CF Row Key: user email Columns: userId: UUID, etc... UserAttribute1 CF: Row Key: userId (which is the uuid that's map

Re: Dynamic columns in a column family?

2012-01-06 Thread Jonathan Ellis
Yes. (Please keep these to the user list rather than dev.) On Fri, Jan 6, 2012 at 11:59 AM, Frank Yang wrote: > Hi everyone, > > I am wondering whether it is possible to not to define the column > metadata when creating a column family, but to specify the column when > client updates data, for e

Pending on ReadStage

2012-01-06 Thread Daning Wang
Hi all, We have a 5-node cluster (0.8.6), but the performance of one node is way behind the others. I checked tpstats, and it always shows non-zero pending ReadStage; I don't see this problem on the other nodes. What caused the problem? I/O? Memory? CPU usage is still low. How can I fix this problem? ~/bin/node

Re: Pending on ReadStage

2012-01-06 Thread Mohit Anchlia
Are all your nodes equally balanced in terms of read requests? Are you using RandomPartitioner? Are you reading using indexes? First thing you can do is compare iostat -x output between the 2 nodes to rule out any io issues assuming your read requests are equally balanced. On Fri, Jan 6, 2012 at

Is this correct way to create a Composite Type

2012-01-06 Thread investtr
I am trying to understand the composite type. Is this a right way to create a Composite Data ? <>Follower_For_Users <>#{"userID",n}:"followerID" for simplicity I have replaced userID by followerID. regards, Ramesh

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Mohit Anchlia
On Fri, Jan 6, 2012 at 10:03 AM, Drew Kutcharian wrote: > Hi Everyone, > > What's the best way to reliably have unique constraints like functionality > with Cassandra? I have the following (which I think should be very common) > use case. > > User CF > Row Key: user email > Columns: userId: UUID

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Drew Kutcharian
Yes, my issue is with handling concurrent requests. I'm not sure how your logic will work with eventual consistency. I'm going to have the same issue in the "tracker" CF too, no? On Jan 6, 2012, at 10:38 AM, Mohit Anchlia wrote: > On Fri, Jan 6, 2012 at 10:03 AM, Drew Kutcharian wrote: >> Hi

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Mohit Anchlia
I don't think so, if you read and write with QUORUM. On Fri, Jan 6, 2012 at 11:01 AM, Drew Kutcharian wrote: > Yes, my issue is with handling concurrent requests. I'm not sure how your > logic will work with eventual consistency. I'm going to have the same issue > in the "tracker" CF too, no? > > >

Re: Pending on ReadStage

2012-01-06 Thread Daning Wang
Thanks for your reply. Nodes are equally balanced, and it is RandomPartitioner. I think that machine is slower. Are you saying it is an I/O issue? Daning On Fri, Jan 6, 2012 at 10:25 AM, Mohit Anchlia wrote: > Are all your nodes equally balanced in terms of read requests? Are you > using RandomPart

Beginner Question - Super Column Family Alternative

2012-01-06 Thread investtr
Please help me understand this. I am not sure if this is the right place to ask such a question. I read that it is not safe to use Super Column Family. And the alternative I found was to use Composite Column Names. Many managers will have many employees. *<>Manager_Employee <>#managerID <>#emplo

Re: Beginner Question - Super Column Family Alternative

2012-01-06 Thread investtr
On 01/06/2012 01:48 PM, investtr wrote: Please help me understand this. I am not sure if this is the right place to ask such question. I read that it is not safe to use Super Column Family. And the alternative, I found was to use Composite Column Names. Many managers will have many employees.

OutOfMemory Errors with Cassandra 1.0.5

2012-01-06 Thread Caleb Rackliffe
Hi Everybody, I have a 10-node cluster running 1.0.5. The hardware/configuration for each box looks like this: Hardware: 4 GB RAM, 400 GB SATAII HD for commitlog, 50 GB SATAIII SSD for data directory, 1 GB SSD swap partition OS: CentOS 6, vm.swappiness = 0 Cassandra: disk access mode = standard

Re: OutOfMemory Errors with Cassandra 1.0.5

2012-01-06 Thread Caleb Rackliffe
One other item… java -version java version "1.7.0_01" Java(TM) SE Runtime Environment (build 1.7.0_01-b08) Java HotSpot(TM) 64-Bit Server VM (build 21.1-b02, mixed mode) Caleb Rackliffe | Software Developer M 949.981.0159 | ca...@steelhouse.com From: Caleb Rackliffe mailto:ca...@steelhouse.com

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
On Fri, 6 Jan 2012 10:38:17 -0800 Mohit Anchlia wrote: > It could be as simple as reading before writing to make sure that > email doesn't exist. But I think you are looking at how to handle 2 > concurrent requests for same email? Only way I can think of is: > > 1) Create new CF say tracker > 2)

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
On Fri, 6 Jan 2012 10:03:38 -0800 Drew Kutcharian wrote: > I know that this can be done using a lock manager such as ZooKeeper > or HazelCast, but the issue with using either of them is that if > ZooKeeper or HazelCast is down, then you can't be sure about the > reliability of the lock. So this po

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Jeremiah Jordan
Correct, any kind of locking in Cassandra requires clocks that are in sync, and requires you to wait "possible clock out of sync time" before reading to check if you got the lock, to prevent the issue you describe below. There was a pretty detailed discussion of locking with only Cassandra a

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Jeremiah Jordan
Since a ZooKeeper cluster is a quorum-based system similar to Cassandra, it only goes down when n/2 nodes go down. And the same way you have to stop writing to Cassandra if n/2 nodes are down (if using QUORUM), your app will have to wait for the ZooKeeper cluster to come online again before it
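The arithmetic behind this is the usual majority rule: a quorum is more than half of the nodes, so a group of n nodes keeps working as long as a strict majority is still up. A small Python sketch of that rule (the function names are made up for illustration):

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return n - quorum_size(n)


def has_quorum(n: int, alive: int) -> bool:
    """True if `alive` out of `n` nodes are enough to form a quorum."""
    return alive >= quorum_size(n)
```

For a Cassandra QUORUM read or write, the relevant n is the replication factor of the keyspace rather than the total number of nodes in the cluster. Note that an even-sized group tolerates no more failures than the next smaller odd-sized one.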

java.lang.IllegalArgumentException occurred when creating a keyspace with replication factor

2012-01-06 Thread Sajith Kariyawasam
Hi all, I tried creating a keyspace with the replication factor 3, using cli interface ... in Cassandra 1.0.6 (earlier tried in 0.8.2 and failed too) But I'm getting an exception "java.lang.IllegalArgumentException: No enum const class org.apache.cassandra.cli.CliClient$AddKeyspaceArgument.REPL

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
This looks like it: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Implementing-locks-using-cassandra-only-tp5527076p5527076.html There's also some interesting JIRA tickets related to locking/CAS: https://issues.apache.org/jira/browse/CASSANDRA-2686 https://issues.apache.org/jira

How to find out when a nodetool operation has ended?

2012-01-06 Thread Maxim Potekhin
Suppose I start a repair on one or a few nodes in my cluster, from an interactive machine in the office, and leave for the day (which is a very realistic scenario imho). Is there a way to know, from a remote machine, when a particular action, such as compaction or repair, has been finished? I fi

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Mohit Anchlia
This looks like the right way to do it. But remember this still isn't a guarantee if your clocks drift way too much. It's a trade-off between having to manage one additional component and using something internal to C*. It would be good to see similar functionality implemented in C* so that clients don't h

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Drew Kutcharian
Bryce, I'm not sure about ZooKeeper, but I know if you have a partition between HazelCast nodes, then the nodes can acquire the same lock independently in each divided partition. How does ZooKeeper handle this situation? -- Drew On Jan 6, 2012, at 12:48 PM, Bryce Allen wrote: > On Fri, 6 Ja

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
I don't think it's just clock drift. There is also the period of time between when the client selects a timestamp, and when the data ends up committed to cassandra. That drift seems harder to control, when the nodes and/or clients are under load. I agree that it would be nice to have something lik

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Mohit Anchlia
On Fri, Jan 6, 2012 at 1:41 PM, Bryce Allen wrote: > I don't think it's just clock drift. There is also the period of time > between when the client selects a timestamp, and when the data ends up > committed to cassandra. That drift seems harder to control, when the > nodes and/or clients are unde

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Jeremiah Jordan
By using quorum. One of the partitions may be able to acquire locks, the other one won't... On 01/06/2012 03:36 PM, Drew Kutcharian wrote: Bryce, I'm not sure about ZooKeeper, but I know if you have a partition between HazelCast nodes, then the nodes can acquire the same lock independen

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
That's a good question, and I'm not sure - I'm fairly new to both ZK and Cassandra. I found this wiki page: http://wiki.apache.org/hadoop/ZooKeeper/FailureScenarios and I think the lock recipe still works, even if a stale read happens. Assuming that wiki page is correct. There is still subtlety to

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Drew Kutcharian
Thanks everyone for the replies. Seems like there is no easy way to handle this. It's very surprising that no one seems to have solved such a common use case. -- Drew On Jan 6, 2012, at 2:11 PM, Bryce Allen wrote: > That's a good question, and I'm not sure - I'm fairly new to both ZK > and Cas

Re: java.lang.IllegalArgumentException occurred when creating a keyspace with replication factor

2012-01-06 Thread R. Verlangen
Try this: create keyspace testkeyspace; update keyspace testkeyspace with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:3}; Good luck! 2012/1/6 Sajith Kariyawasam > Hi all, > > I tried creating a keyspace with the replication fact

Re: How to find out when a nodetool operation has ended?

2012-01-06 Thread R. Verlangen
You might consider: - installing DataStax OpsCenter ( http://www.datastax.com/products/opscenter ) - starting the repair in a Linux screen (so you can attach to the screen from another location) I prefer OpsCenter. 2012/1/6 Maxim Potekhin > Suppose I start a repair on one or a few nodes in

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Narendra Sharma
>>>It's very surprising that no one seems to have solved such a common use case. I would say people have solved it using RIGHT tools for the task. On Fri, Jan 6, 2012 at 2:35 PM, Drew Kutcharian wrote: > Thanks everyone for the replies. Seems like there is no easy way to handle > this. It's ve

Re: How to find out when a nodetool operation has ended?

2012-01-06 Thread Maxim Potekhin
Thanks, so I take it there is no solution outside of OpsCenter. I mean of course I can redirect the output, with additional timestamps if needed, to a log file -- which I can access remotely. I just thought there would be some "status" command by chance, to tell me what maintenance the node is
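The log-file approach mentioned here can be sketched as a small wrapper that records a timestamp when the command starts and another, with the exit code, when it ends. This is a sketch in Python using only the standard library; the function name and log path are placeholders, and it does not query Cassandra for status in any way.

```python
import subprocess
from datetime import datetime
from typing import List


def run_and_log(command: List[str], log_path: str) -> int:
    """Run `command`, appending timestamps and its output to `log_path`."""
    with open(log_path, "a") as log:
        log.write(f"{datetime.now().isoformat()} START {' '.join(command)}\n")
        log.flush()  # make sure START lands before the command's own output
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        log.write(f"{datetime.now().isoformat()} END exit={result.returncode}\n")
    return result.returncode
```

For example, calling `run_and_log(["nodetool", "-h", "localhost", "repair"], "/tmp/repair.log")` (host and path are placeholders) leaves an END line in the log once the repair returns, which can be checked from a remote machine by reading the file.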

Re: What is the future of supercolumns ?

2012-01-06 Thread Aklin_81
Any comments please ? On Thu, Jan 5, 2012 at 11:07 AM, Aklin_81 wrote: > I have seen supercolumns usage been discouraged most of the times. > However sometimes the supercolumns seem to fit the scenario most > appropriately not only in terms of how the data is stored but also in > terms of how is

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Drew Kutcharian
So what are the common RIGHT solutions/tools for this? On Jan 6, 2012, at 2:46 PM, Narendra Sharma wrote: > >>>It's very surprising that no one seems to have solved such a common use > >>>case. > I would say people have solved it using RIGHT tools for the task. > > > > On Fri, Jan 6, 2012 at

Re: OutOfMemory Errors with Cassandra 1.0.5

2012-01-06 Thread Caleb Rackliffe
I saw this article - http://comments.gmane.org/gmane.comp.db.cassandra.user/2225 I'm using the Hector client (for connection pooling), with ~3200 threads active according to JConsole. Caleb Rackliffe | Software Developer M 949.981.0159 | ca...@steelhouse.com From: Caleb Rackliffe mailto:ca...@s

Re: OutOfMemory Errors with Cassandra 1.0.5 (fixed)

2012-01-06 Thread Caleb Rackliffe
Okay, it looks like I was slightly underestimating the number of connections open on the cluster. This probably won't be a problem after I tighten up the Hector pool maximums. Sorry for the spam… Caleb Rackliffe | Software Developer M 949.981.0159 | ca...@steelhouse.com From: Caleb Rackliffe

How does Cassandra decide when to do a minor compaction?

2012-01-06 Thread Maxim Potekhin
The subject says it all -- pointers appreciated. Thanks Maxim

Re: What is the future of supercolumns ?

2012-01-06 Thread Terje Marthinussen
Please realize that I do not make any decisions here and I am not part of the core Cassandra developer team. What has been said before is that they will most likely go away and at least under the hood be replaced by composite columns. Jonathan has, however, stated that he would like the supercol

Re: What is the future of supercolumns ?

2012-01-06 Thread Aklin_81
I read entire columns inside the supercolumns at any time but as for writing them, I write the columns at different times. I don't have the need to update them except that they die after their TTL period of 60 days. But since they are going to be deprecated, I don't know if it would be really advisable

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Narendra Sharma
Instead of trying to solve the generic problem of uniqueness, I would focus on the specific problem. For example, let's consider your use case of user registration with email address as key. You can do the following: 1. Create a CF (Users) where the row key is a UUID and which has user-info-specific columns. 2. Whenever us

Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Drew Kutcharian
It makes great sense. You're a genius!! On Jan 6, 2012, at 10:43 PM, Narendra Sharma wrote: > Instead of trying to solve the generic problem of uniqueness, I would focus > on the specific problem. > > For eg lets consider your usecase of user registration with email address as > key. You can