Duplicate result of get_indexed_slices, depending on indexClause.count

2011-04-14 Thread sam_
Hi All, I have been using Cassandra 0.7.2 and 0.7.4 with Thrift API (using Java). I noticed that if I am querying a Column Family with indexed columns sometimes I get a duplicate result in get_indexed_slices depending on the number of rows in the CF and the count that I set in IndexClause.count.

Re: Cassandra Database Modeling

2011-04-14 Thread csharpplusproject
Aaron, Thank you so much. So, the way things appear, it is definitely possible that I could be making queries that would return all 10M particle pairs (at least, I should plan for it). What would be the best design in such a case? I read somewhere that the recommended maximum size of a row (meani

Re: What's the best modeling approach for ordering events by date?

2011-04-14 Thread Guillermo Winkler
Hi Ethan, I want to present the events ordered by time, always in pages of 20/40 events. If the events are tweets, you can have 1000 tweets from the same second or you can have 30 tweets in a 10 minute range. But I always wanna be able to page through the results in an orderly fashion. I think th

Wildcard character for CF in access.properties?

2011-04-14 Thread Mike Heffner
Is there a wildcard for the COLUMNFAMILY field in `access.properties`? I'd like to split read-write and read-only access between my backend and frontend users, respectively, however the full list of CFs is not known a priori. I'm using 0.7.4. Cheers, Mike -- Mike Heffner Librato, I

Re: RE: batch_mutate failed: out of sequence response

2011-04-14 Thread Dan Washusen
I've looked over the Pelops code again and I really can't see how it could be at fault here... -- Dan Washusen On Wednesday, 13 April 2011 at 3:20 AM, Stephen McKamey wrote: > [I wrote this Apr 10, 2011 at 12:09 but my message seems to have gotten lost > along the way.] > > I use Pelops (the

Re: Pyramid Organization of Data

2011-04-14 Thread Patrick Julien
On Thu, Apr 14, 2011 at 4:47 PM, Adrian Cockcroft wrote: > What you are asking for breaks the eventual consistency model, so you need to > create a separate cluster in NYC that collects the same updates but has a > much longer setting to timeout the data for deletion, or doesn't get the > delet

Re: Starting the Cassandra server from Java (without command line)

2011-04-14 Thread Jason Pell
You can make use of embedded Cassandra server. Both Hector and Pelops have classes that you can use in your own unit tests, the Pelops one is quite good. Sent from my iPhone On Apr 15, 2011, at 3:59, sam_ wrote: > Hello there, > > To start the Cassandra server we can use the following command

Re: Pyramid Organization of Data

2011-04-14 Thread Adrian Cockcroft
What you are asking for breaks the eventual consistency model, so you need to create a separate cluster in NYC that collects the same updates but has a much longer setting to timeout the data for deletion, or doesn't get the deletes. One way is to have a trigger on writes on your pyramid nodes

Re: Indexes on heterogeneous rows

2011-04-14 Thread aaron morton
> (This is a case where 1/3 of the rows are of type 2, but, say only a few > hundred rows of type 2 have e=5.) How many rows would have e=5 without worrying about their type value? Aaron On 14 Apr 2011, at 23:48, David Boxenhorn wrote: > Thanks. I'm aware that I can roll my own. I wanted to av

Re: which nodes contain the data? column family as indexes?

2011-04-14 Thread aaron morton
By default all data for a keyspace, and its CFs, are spread out over all nodes in the cluster. If you want to see which endpoints have data for a particular key there is an operation on the o.a.c.db.StorageService MBean you can use via JConsole. Some (out of date) docs here may help http://wi

Re: pycassa + celery

2011-04-14 Thread aaron morton
This is going to be a bug in your code, so it's a bit tricky to know but... How / when is the email added to the DB? What does the rawEmail function do ? Set a break point, what are the two strings you are feeding into the hash functions ? Aaron On 15 Apr 2011, at 03:50, pob wrote: > Hello, >

Re: What's the best modeling approach for ordering events by date?

2011-04-14 Thread Ethan Rowe
How do you plan to read the data? Entire histories, or in relatively confined slices of time? Do the events have any attributes by which you might segregate them, apart from time? If you can divide time into a fixed series of intervals, you can insert members of a given interval as columns (or s
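
A minimal sketch of the fixed-interval idea in the message above, using plain Python dictionaries in place of a column family. The one-hour bucket width and the names `bucket_key`, `insert_event` and `read_slice` are illustrative assumptions, not part of any Cassandra client API.

```python
from collections import defaultdict

INTERVAL_SECONDS = 3600  # hypothetical bucket width: one row per hour


def bucket_key(ts, interval=INTERVAL_SECONDS):
    """Return the start of the fixed interval that contains timestamp ts."""
    return ts - (ts % interval)


def insert_event(rows, ts, event_id, interval=INTERVAL_SECONDS):
    """Store the event as a column (ts, event_id) in the row for its interval."""
    rows[bucket_key(ts, interval)][(ts, event_id)] = ""


def read_slice(rows, start_ts, end_ts, interval=INTERVAL_SECONDS):
    """Read events with start_ts <= ts <= end_ts, walking only the needed rows."""
    result = []
    bucket = bucket_key(start_ts, interval)
    while bucket <= end_ts:
        # Columns are kept sorted by (ts, event_id) within each row.
        for ts, event_id in sorted(rows.get(bucket, {})):
            if start_ts <= ts <= end_ts:
                result.append((ts, event_id))
        bucket += interval
    return result


rows = defaultdict(dict)
insert_event(rows, 3599, "a")
insert_event(rows, 3600, "b")
insert_event(rows, 7300, "c")
```

Because the row keys can be computed from the time range, a reader never needs to scan row keys in order; it only asks for the rows whose intervals overlap the slice.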

All nodes down even though ring shows up

2011-04-14 Thread mcasandra
I ran a stress test to read 50K rows and since then I have been getting the error below, even though the ring shows all nodes are up: ERROR 12:40:29,999 Exception: me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client. at me.prettyprint.cassandra.

Problem with running test example with Cassandra 0.7.4

2011-04-14 Thread Anda P
Hello, I have set up my single node Cassandra 0.7.4 and started the service with bin/cassandra -f. Now I am trying to use the Hector API (v. 0.7.0) to manage the DB. The Cassandra CLI works fine and I can create keyspaces and so on. I tried to run the test example and create a single keyspace:

Re: Starting the Cassandra server from Java (without command line)

2011-04-14 Thread Narendra Sharma
The write-up is a year old but will still give you a fair idea of how to do it. http://prettyprint.me/2010/02/14/running-cassandra-as-an-embedded-service/ Thanks, Naren On Thu, Apr 14, 2011 at 10:59 AM, sam_ wrote: > Hello

Starting the Cassandra server from Java (without command line)

2011-04-14 Thread sam_
Hello there, To start the Cassandra server we can use the following command in command prompt: cassandra -f I am wondering if it is possible to directly start the server inside a Java program using thrift API or a lower level class inside Cassandra implementation. The purpose of this is to be ab

Re: Pyramid Organization of Data

2011-04-14 Thread Patrick Julien
Thanks for your input Adrian, we've pretty much settled on this too. What I'm trying to figure out is how we do deletes. We want to do deletes in the satellites because: a) we'll run out of disk space very quickly with the amount of data we have b) we don't need more than 3 days worth of history

Stress testing disk configurations. Your thoughts?

2011-04-14 Thread Nathan Milford
Ahoy, I'm building out a new 0.7.4 cluster to migrate our 0.6.6 cluster to. While I'm waiting for the dev-side to get time to work on their side of the project I have a 10 node cluster evenly split across two data centers (NY & LA) and was looking to do some testing while I could. My primary foc

Re: Indexes on heterogeneous rows

2011-04-14 Thread Jonathan Ellis
On Thu, Apr 14, 2011 at 6:48 AM, David Boxenhorn wrote: > The reason why I put "type" first is that queries on type will > always be an exact match, whereas the other clauses might be inequalities. Expression order doesn't matter, but as you imply, non-equalities can't be used in an index lookup

which nodes contain the data? column family as indexes?

2011-04-14 Thread tinhuty he
I just started 5 nodes in a cluster and set a replication factor of 3 on a keyspace ks. My question is how do I know which 3 nodes will contain the data of this keyspace (or a particular column family in this keyspace)? I often read that “use a separate column family for storing our indexes”, here is

What's the best modeling approach for ordering events by date?

2011-04-14 Thread Guillermo Winkler
I have a huge number of events I need to consume later, ordered by the date the event occurred. My first approach to this problem was to use seconds since epoch as row key, and event ids as column names (empty value), this way: EventsByDate : { SecondsSinceEpoch: { evid:"", evid:"", ev
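
The layout described in the message above (seconds since epoch as row key, event ids as column names with empty values) can be sketched in memory as follows. This is only an illustration with hypothetical helper names; in particular, the sketch sorts the row keys itself, which assumes the reader can obtain the keys in time order.

```python
from collections import defaultdict


def build_events_by_date(events):
    """Group (seconds_since_epoch, event_id) pairs into rows keyed by second."""
    rows = defaultdict(dict)
    for ts, evid in events:
        rows[ts][evid] = ""  # empty column value; only the column name matters
    return rows


def page_events(rows, page_size):
    """Yield pages of (ts, evid) pairs in time order, page_size events at a time."""
    page = []
    for ts in sorted(rows):
        for evid in sorted(rows[ts]):
            page.append((ts, evid))
            if len(page) == page_size:
                yield page
                page = []
    if page:
        yield page  # final, possibly short, page


events = [(100, "ev1"), (100, "ev2"), (101, "ev3")]
rows = build_events_by_date(events)
pages = list(page_events(rows, 2))
```

A page can span several rows (a quiet period) or cover only part of one row (a busy second), which is why the paging loop counts events rather than rows.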

Re: Indexes on heterogeneous rows

2011-04-14 Thread Jonathan Ellis
This should work reasonably well w/ 0.7 indexes. Cassandra tracks statistics on index selectivity, so it would plan that query as "index lookup on e=5, then iterate over those results and return only rows that also have type=2." On Thu, Apr 14, 2011 at 5:33 AM, David Boxenhorn wrote: > Thank you
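
A toy model of the plan described in the message above: pick the most selective indexed clause, look up its candidate rows, then filter those rows on the remaining clauses. It uses in-memory dictionaries and hypothetical function names; it is not Cassandra's implementation, only an illustration of the idea.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of `column` to the set of row keys that hold it."""
    index = defaultdict(set)
    for key, columns in rows.items():
        if column in columns:
            index[columns[column]].add(key)
    return index


def pick_driving_column(indexes, clauses):
    """Choose the clause whose index lookup returns the fewest row keys."""
    return min(clauses, key=lambda col: len(indexes[col].get(clauses[col], ())))


def query(rows, indexes, clauses):
    """Index lookup on the most selective clause, then filter on all clauses."""
    driving = pick_driving_column(indexes, clauses)
    candidates = indexes[driving].get(clauses[driving], set())
    return sorted(
        key for key in candidates
        if all(rows[key].get(col) == value for col, value in clauses.items())
    )


rows = {
    "r1": {"type": 2, "e": 5},
    "r2": {"type": 1, "e": 5},
    "r3": {"type": 2, "e": 7},
    "r4": {"type": 2},
}
indexes = {col: build_index(rows, col) for col in ("e", "type")}
result = query(rows, indexes, {"e": 5, "type": 2})
```

Here `e=5` matches two rows and `type=2` matches three, so the lookup starts from `e=5` and only two rows are examined for the `type` check.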

pycassa + celery

2011-04-14 Thread pob
Hello, I'm experiencing a really strange problem. I wrote data into a Cassandra cluster. I'm trying to check whether the data inserted and then fetched is equal to the source data (file). The code below is the Celery task that does the comparison with sha1(). The problem is that celery worker returning since time

Re: Pyramid Organization of Data

2011-04-14 Thread Adrian Cockcroft
We have similar requirements for wide area backup/archive at Netflix. I think what you want is a replica with RF of at least 3 in NY for all the satellites, then each satellite could have a lower RF, but if you want safe local quorum I would use 3 everywhere. Then NY is the sum of all the satel

Re: Update the Keyspace replication factor online

2011-04-14 Thread Yudong Gao
Thanks, Aaron! I will try the scenario at small scale first. I would appreciate it if anyone who has tried this before could share the experience with us. Thanks! Yudong On Thu, Apr 14, 2011 at 4:26 AM, aaron morton wrote: > It looks like you are dropping DC1, in that case perhaps you could just mo

Re: Pyramid Organization of Data

2011-04-14 Thread Patrick Julien
Thanks, I'm still working the problem so anything I find out I will post here. Yes, you're right, that is the question I am asking. No, adding more storage is not a solution since new york would have several hundred times more storage. On Apr 14, 2011 6:38 AM, "aaron morton" wrote: > I think yo

Re: Indexes on heterogeneous rows

2011-04-14 Thread David Boxenhorn
Thanks. I'm aware that I can roll my own. I wanted to avoid that, for ease of use, but especially for atomicity concerns. I thought that the secondary index would bring into memory all keys where type=2, and then iterate over them to find keys where e=5. (This is a case where 1/3 of the rows are of t

Re: Indexes on heterogeneous rows

2011-04-14 Thread aaron morton
You could make your own inverted index by using keys like "e=5-type=2" where the columns are either the keys for the object or the objects themselves. Then just grab the full row back. If you know you always want to run queries like that. This recent discussion and blog post from Ed is good b
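
A minimal sketch of the hand-rolled inverted index suggested in the message above, with index rows keyed like "e=5-type=2" and the object keys stored as column names. The helper names are hypothetical and plain dictionaries stand in for column families.

```python
from collections import defaultdict


def index_row_key(e, type_):
    """Build the composite index row key, e.g. 'e=5-type=2'."""
    return "e={}-type={}".format(e, type_)


def build_inverted_index(objects):
    """Create one index row per (e, type) pair; columns are the object keys."""
    index = defaultdict(dict)
    for object_key, obj in objects.items():
        if "e" in obj and "type" in obj:
            index[index_row_key(obj["e"], obj["type"])][object_key] = ""
    return index


def lookup(index, e, type_):
    """Grab the full index row back and return the object keys it lists."""
    return sorted(index.get(index_row_key(e, type_), {}))


objects = {
    "obj1": {"type": 2, "e": 5},
    "obj2": {"type": 1, "e": 5},
    "obj3": {"type": 2, "e": 5},
    "obj4": {"type": 2},
}
index = build_inverted_index(objects)
matches = lookup(index, 5, 2)
```

The index row and the object row are written separately, so the application has to keep them in step itself; that is the atomicity concern raised elsewhere in this thread.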

Re: choose which Hector's serializer for Cassandra performance?

2011-04-14 Thread aaron morton
1. Order for RP is, well, random'ish for some large value of ish see http://wiki.apache.org/cassandra/FAQ#range_rp 2. Not really. Aaron On 14 Apr 2011, at 14:45, 박용욱 wrote: > Thanks very much Aaron! > > 4. Not sure how Hector handles it, try > https://github.com/zznate/cassandra-tutorial or

Re: Pyramid Organization of Data

2011-04-14 Thread aaron morton
I think your question is "NY is the archive, after a certain amount of time we want to delete the row from the original DC but keep it in the archive in NY." Once you delete a row, it's deleted as far as the client is concerned. GCGraceSeconds is only concerned with when the tombstone marker can

Re: Indexes on heterogeneous rows

2011-04-14 Thread David Boxenhorn
Thank you for your answer, and sorry about the sloppy terminology. I'm thinking of the scenario where there are a small number of results in the result set, but there are billions of rows in the first of your secondary indexes. That is, I want to do something like (not sure of the CQL syntax): s

Re: forced index creation?

2011-04-14 Thread Sasha Dolgy
Aha. That would be it. Cheers On Apr 14, 2011 11:47 AM, "aaron morton" wrote: > Checked the code, build_indexes comes from the JMX services and is only shown if the client can connect to JMX. > > If it cannot connect it should print "WARNING: Could not connect to the JMX on %s:%d, information wo

Re: forced index creation?

2011-04-14 Thread aaron morton
Checked the code, build_indexes comes from the JMX services and is only shown if the client can connect to JMX. If it cannot connect it should print "WARNING: Could not connect to the JMX on %s:%d, information won't be shown.%n%n" If you are using a non-default JMX port use --jmxport when star

Re: Indexes on heterogeneous rows

2011-04-14 Thread aaron morton
Need to clear up some terminology here. Rows have a key and can be retrieved by key. This is *sort of* the primary index, but not primary in the normal RDBMS sense. Rows can have different columns and the column names are sorted and can be efficiently selected. There are "secondary indexes" in

Re: database design

2011-04-14 Thread aaron morton
In some cases you may be able to add a secondary index. Aaron On 14 Apr 2011, at 03:11, Edward Capriolo wrote: > On Wed, Apr 13, 2011 at 10:39 AM, Jean-Yves LEBLEU wrote: >> Hi all, >> >> Just some thoughts and question I have about cassandra data modeling. >> >> If I understand well, cassan

Re: Cassandra Database Modeling

2011-04-14 Thread aaron morton
WRT your query, it depends on how big a slice you want to get and how time critical it is. e.g. Could you be making queries that would return all 10M pairs? Or would the queries generally want to get some small fraction of the data set? Again, depends on how the sim runs. If your sim has stop the w

Re: Update the Keyspace replication factor online

2011-04-14 Thread aaron morton
It looks like you are dropping DC1, in that case perhaps you could just move the nodes from DC1 into DC 3. I *think* in your case if you made the RF change, ran repair on them, and worked at Quorum or ALL your clients would be ok. *BUT* I've not done this myself, please take care or ask for a

Re: flush_largest_memtables_at messages in 7.4

2011-04-14 Thread Peter Schuller
And btw I've been assuming your reads are running without those 250k column inserts going at the same time. It would be difficult to see what's going on if you have both of those traffic patterns at the same time. -- / Peter Schuller

Re: flush_largest_memtables_at messages in 7.4

2011-04-14 Thread Peter Schuller
And btw you can also directly read out the average wait time column and average latency column of iostat too, to confirm that individual i/o requests are taking longer than on a non-saturated drive. -- / Peter Schuller

Re: flush_largest_memtables_at messages in 7.4

2011-04-14 Thread Peter Schuller
> Actually when I run 2 stress clients in parallel I see Read Latency stay the > same. I wonder if cassandra is reporting accurate nos. Or you're just bottlenecking on something else. Are you running the extra stress client on different machines for example, so that the client isn't just saturatin

Re: raid 0 and ssd

2011-04-14 Thread Terje Marthinussen
Hm... You should note that unless you have TRIM, which I don't think any OS supports with any RAID functionality yet, then once you have written once to the whole SSD, it is always full! That is, when you delete a file, you don't "clear" the blocks on the SSD so as far as the SSD goes, the data