Hello.
I have a 4-node cluster running Cassandra (without DataStax Brisk) in
production.
Now I want to add Hadoop (and maybe Pig / Hive?) to be able to perform
some analytics.
I don't know how to get started. Is there a tutorial explaining how to
install, configure and use Hadoop and advantages
On Thu, Jan 5, 2012 at 9:13 PM, aaron morton wrote:
> What client are you using ?
>
I am writing a client.
> For example pycassa has some sweet documentation
> http://pycassa.github.com/pycassa/assorted/composite_types.html
>
It is a sweet documentation but it doesn't help me. I a lower level
do
I would first look at http://wiki.apache.org/cassandra/HadoopSupport - you'll
want to look in the section on cluster configuration. DataStax also has a
product that makes it pretty simple to use Hadoop with Cassandra if you don't
mind paying for it - http://www.datastax.com/products/enterprise
05.01.12 22:29, Philippe wrote:
Then I do have a question, what do people generally use as the
batch size?
I used to do batches from 500 to 2000 like you do.
After investigating issues such as the one you've encountered I've
moved to batches of 20 for writes and 256 for reads. Ev
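The batch sizes mentioned above (20 for writes, 256 for reads) can be applied on the client side by splitting a larger workload into fixed-size chunks before sending it. The following is only a minimal sketch: the helper name chunked and the sample rows are hypothetical, and the actual send step depends on whichever client library is in use.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Batch sizes quoted in the message above.
WRITE_BATCH_SIZE = 20
READ_BATCH_SIZE = 256


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Made-up rows standing in for real mutations.
    rows = [("key%d" % i, {"col": "value"}) for i in range(45)]
    for batch in chunked(rows, WRITE_BATCH_SIZE):
        # A real client would send `batch` as one batch mutation here.
        print(len(batch))
```

Smaller batches mean more round trips, but each request is more likely to finish within the configured timeout.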
Yes, as far as I know. Note that it's not a full index, but a "sampling"
one. See the index_interval configuration parameter and its description.
As for bloom filters, they are not configurable now, but there is a ticket
with a patch that should make them configurable.
05.01.12 22:45, Carlo Pires wrote
But you will then get timeouts.
On Jan 6, 2012 15:17, "Vitalii Tymchyshyn" wrote:
> 05.01.12 22:29, Philippe wrote:
>
> Then I do have a question, what do people generally use as the batch
>> size?
>>
> I used to do batches from 500 to 2000 like you do.
> After investigating issu
Do you mean on writes? Yes, your timeouts must be long enough for your
write batch to complete before the timeout elapses. But this will lower
the write load, so reads should not time out.
Best regards, Vitalii Tymchyshym
06.01.12 17:37, Philippe wrote:
But you will then get timeouts.
On Jan 6, 201
(too much to quote)
Looks to me like this should be fixable by removing hints from the
node in question (I don't remember whether this is a bug that's been
identified and fixed or not).
I may be wrong because I'm just basing this on a quick look at the
stack trace but it seems to me there are hin
Hi everyone,
I am wondering whether it is possible not to define the column
metadata when creating a column family, but to specify the column when
client updates data, for example:
CREATE COLUMN FAMILY products WITH default_validation_class= UTF8Type
AND key_validation_class=UTF8Type AND compa
Hi Everyone,
What's the best way to reliably have unique-constraint-like functionality with
Cassandra? I have the following (which I think should be very common) use case.
User CF
Row Key: user email
Columns: userId: UUID, etc...
UserAttribute1 CF:
Row Key: userId (which is the uuid that's map
Yes.
(Please keep these to the user list rather than dev.)
On Fri, Jan 6, 2012 at 11:59 AM, Frank Yang wrote:
> Hi everyone,
>
> I am wondering whether it is possible to not to define the column
> metadata when creating a column family, but to specify the column when
> client updates data, for e
Hi all,
We have a 5-node cluster (0.8.6), but the performance of one node is way
behind the others. I checked tpstats, and it always shows a non-zero pending
ReadStage. I don't see this problem on the other nodes.
What caused the problem? I/O? Memory? CPU usage is still low. How can I fix
this problem?
~/bin/node
Are all your nodes equally balanced in terms of read requests? Are you
using RandomPartitioner? Are you reading using indexes?
The first thing you can do is compare iostat -x output between the two nodes
to rule out any I/O issues, assuming your read requests are equally
balanced.
On Fri, Jan 6, 2012 at
I am trying to understand the composite type.
Is this the right way to create composite data?
<>Follower_For_Users
<>#{"userID",n}:"followerID"
For simplicity I have replaced userID with followerID.
regards,
Ramesh
On Fri, Jan 6, 2012 at 10:03 AM, Drew Kutcharian wrote:
> Hi Everyone,
>
> What's the best way to reliably have unique constraints like functionality
> with Cassandra? I have the following (which I think should be very common)
> use case.
>
> User CF
> Row Key: user email
> Columns: userId: UUID
Yes, my issue is with handling concurrent requests. I'm not sure how your logic
will work with eventual consistency. I'm going to have the same issue in the
"tracker" CF too, no?
On Jan 6, 2012, at 10:38 AM, Mohit Anchlia wrote:
> On Fri, Jan 6, 2012 at 10:03 AM, Drew Kutcharian wrote:
>> Hi
I don't think so, if you read and write with QUORUM
On Fri, Jan 6, 2012 at 11:01 AM, Drew Kutcharian wrote:
> Yes, my issue is with handling concurrent requests. I'm not sure how your
> logic will work with eventual consistency. I'm going to have the same issue
> in the "tracker" CF too, no?
>
>
>
Thanks for your reply.
The nodes are equally balanced, and we use RandomPartitioner. I think that
machine is slower. Are you saying it is an I/O issue?
Daning
On Fri, Jan 6, 2012 at 10:25 AM, Mohit Anchlia wrote:
> Are all your nodes equally balanced in terms of read requests? Are you
> using RandomPart
Please help me understand this.
I am not sure if this is the right place to ask such a question.
I read that it is not safe to use a Super Column Family,
and the alternative I found was to use Composite Column Names.
Many managers will have many employees.
*<>Manager_Employee
<>#managerID
<>#emplo
On 01/06/2012 01:48 PM, investtr wrote:
Please help me understand this.
I am not sure if this is the right place to ask such a question.
I read that it is not safe to use a Super Column Family,
and the alternative I found was to use Composite Column Names.
Many managers will have many employees.
Hi Everybody,
I have a 10-node cluster running 1.0.5. The hardware/configuration for each
box looks like this:
Hardware: 4 GB RAM, 400 GB SATAII HD for commitlog, 50 GB SATAIII SSD for data
directory, 1 GB SSD swap partition
OS: CentOS 6, vm.swappiness = 0
Cassandra: disk access mode = standard
One other item…
java -version
java version "1.7.0_01"
Java(TM) SE Runtime Environment (build 1.7.0_01-b08)
Java HotSpot(TM) 64-Bit Server VM (build 21.1-b02, mixed mode)
Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
From: Caleb Rackliffe mailto:ca...@steelhouse.com
On Fri, 6 Jan 2012 10:38:17 -0800
Mohit Anchlia wrote:
> It could be as simple as reading before writing to make sure that
> email doesn't exist. But I think you are looking at how to handle 2
> concurrent requests for same email? Only way I can think of is:
>
> 1) Create new CF say tracker
> 2)
On Fri, 6 Jan 2012 10:03:38 -0800
Drew Kutcharian wrote:
> I know that this can be done using a lock manager such as ZooKeeper
> or HazelCast, but the issue with using either of them is that if
> ZooKeeper or HazelCast is down, then you can't be sure about the
> reliability of the lock. So this po
Correct, any kind of locking in Cassandra requires clocks that are in
sync, and requires you to wait "possible clock out of sync time" before
reading to check if you got the lock, to prevent the issue you describe
below.
There was a pretty detailed discussion of locking with only Cassandra a
Since a ZooKeeper cluster is a quorum-based system similar to Cassandra,
it only goes down when n/2 nodes go down. And the same way you have to
stop writing to Cassandra if N/2 nodes are down (if using QUORUM), your
app will have to wait for the ZooKeeper cluster to come online again
before it
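The "n/2" figure above can be made precise: a quorum is a strict majority, floor(n/2) + 1, so the number of nodes that can be down while a quorum is still reachable is n minus that. A minimal sketch of the arithmetic follows; the function names are illustrative and not part of any Cassandra or ZooKeeper API.

```python
def quorum_size(replicas: int) -> int:
    """Strict majority of `replicas`: floor(replicas / 2) + 1."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be down while a quorum is still reachable."""
    return replicas - quorum_size(replicas)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Note that an even node count buys no extra tolerance: 3 and 4 nodes both survive a single failure, which is why odd ensemble sizes are the usual choice.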
Hi all,
I tried creating a keyspace with the replication factor 3, using cli
interface ... in Cassandra 1.0.6 (earlier tried in 0.8.2 and failed too)
But I'm getting an exception
"java.lang.IllegalArgumentException: No enum const class
org.apache.cassandra.cli.CliClient$AddKeyspaceArgument.REPL
This looks like it:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Implementing-locks-using-cassandra-only-tp5527076p5527076.html
There are also some interesting JIRA tickets related to locking/CAS:
https://issues.apache.org/jira/browse/CASSANDRA-2686
https://issues.apache.org/jira
Suppose I start a repair on one or a few nodes in my cluster,
from an interactive machine in the office, and leave for the day
(which is a very realistic scenario imho).
Is there a way to know, from a remote machine, when a particular
action, such as compaction or repair, has been finished?
I fi
This looks like the right way to do it. But remember, this is still not a
guarantee if your clocks drift too much. It's a trade-off between
having to manage one additional component and using something internal to
C*. It would be good to see similar functionality implemented in C* so
that clients don't h
Bryce,
I'm not sure about ZooKeeper, but I know if you have a partition between
HazelCast nodes, then the nodes can acquire the same lock independently in each
divided partition. How does ZooKeeper handle this situation?
-- Drew
On Jan 6, 2012, at 12:48 PM, Bryce Allen wrote:
> On Fri, 6 Ja
I don't think it's just clock drift. There is also the period of time
between when the client selects a timestamp, and when the data ends up
committed to cassandra. That drift seems harder to control, when the
nodes and/or clients are under load.
I agree that it would be nice to have something lik
On Fri, Jan 6, 2012 at 1:41 PM, Bryce Allen wrote:
> I don't think it's just clock drift. There is also the period of time
> between when the client selects a timestamp, and when the data ends up
> committed to cassandra. That drift seems harder to control, when the
> nodes and/or clients are unde
By using quorum. One of the partitions may be able to acquire
locks, the other one won't...
On 01/06/2012 03:36 PM, Drew Kutcharian wrote:
Bryce,
I'm not sure about ZooKeeper, but I know if you have a partition between
HazelCast nodes, than the nodes can acquire the same lock independen
That's a good question, and I'm not sure - I'm fairly new to both ZK
and Cassandra. I found this wiki page:
http://wiki.apache.org/hadoop/ZooKeeper/FailureScenarios
and I think the lock recipe still works, even if a stale read happens.
Assuming that wiki page is correct.
There is still subtlety to
Thanks everyone for the replies. Seems like there is no easy way to handle
this. It's very surprising that no one seems to have solved such a common use
case.
-- Drew
On Jan 6, 2012, at 2:11 PM, Bryce Allen wrote:
> That's a good question, and I'm not sure - I'm fairly new to both ZK
> and Cas
Try this:
create keyspace testkeyspace;
update keyspace testkeyspace with placement_strategy =
'org.apache.cassandra.locator.SimpleStrategy' and strategy_options =
{replication_factor:3};
Good luck!
2012/1/6 Sajith Kariyawasam
> Hi all,
>
> I tried creating a keyspace with the replication fact
You might consider:
- installing DataStax OpsCenter ( http://www.datastax.com/products/opscenter
)
- starting the repair in a Linux screen session (so you can attach to the
screen from another location)
I prefer OpsCenter.
2012/1/6 Maxim Potekhin
> Suppose I start a repair on one or a few nodes in
>>>It's very surprising that no one seems to have solved such a common use
>>>case.
I would say people have solved it using the RIGHT tools for the task.
On Fri, Jan 6, 2012 at 2:35 PM, Drew Kutcharian wrote:
> Thanks everyone for the replies. Seems like there is no easy way to handle
> this. It's ve
Thanks, so I take it there is no solution outside of OpsCenter.
I mean, of course I can redirect the output, with additional timestamps
if needed, to a log file -- which I can access remotely. I just thought
there would be some "status" command by chance, to tell me what
maintenance the node is
Any comments please ?
On Thu, Jan 5, 2012 at 11:07 AM, Aklin_81 wrote:
> I have seen supercolumns usage been discouraged most of the times.
> However sometimes the supercolumns seem to fit the scenario most
> appropriately not only in terms of how the data is stored but also in
> terms of how is
So what are the common RIGHT solutions/tools for this?
On Jan 6, 2012, at 2:46 PM, Narendra Sharma wrote:
> >>>It's very surprising that no one seems to have solved such a common use
> >>>case.
> I would say people have solved it using RIGHT tools for the task.
>
>
>
> On Fri, Jan 6, 2012 at
I saw this article - http://comments.gmane.org/gmane.comp.db.cassandra.user/2225
I'm using the Hector client (for connection pooling), with ~3200 threads active
according to JConsole.
Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
From: Caleb Rackliffe mailto:ca...@s
Okay, it looks like I was slightly underestimating the number of connections
open on the cluster. This probably won't be a problem after I tighten up the
Hector pool maximums.
Sorry for the spam…
Caleb Rackliffe | Software Developer
M 949.981.0159 | ca...@steelhouse.com
From: Caleb Rackliffe
The subject says it all -- pointers appreciated.
Thanks
Maxim
Please realize that I do not make any decisions here and I am not part of the
core Cassandra developer team.
What has been said before is that they will most likely go away and at least
under the hood be replaced by composite columns.
Jonathan has, however, stated that he would like the supercol
I read entire columns inside the supercolumns at any time, but as for
writing them, I write the columns at different times. I don't need to
update them; they simply expire after their TTL period of 60 days.
But since they are going to be deprecated, I don't know if it would be
really advisable
Instead of trying to solve the generic problem of uniqueness, I would focus
on the specific problem.
For example, let's consider your use case of user registration with the email
address as key. You can do the following:
1. Create CF (Users) where row key is UUID and has user info specific
columns.
2. Whenever us
It makes great sense. You're a genius!!
On Jan 6, 2012, at 10:43 PM, Narendra Sharma wrote:
> Instead of trying to solve the generic problem of uniqueness, I would focus
> on the specific problem.
>
> For eg lets consider your usecase of user registration with email address as
> key. You can