Re: Cassandra disk space utilization WAY higher than I would expect

2010-08-06 Thread Peter Schuller
> One thing to keep in mind is that SSTables are not actually removed > from disk until the garbage collector has identified the relevant > in-memory structures as garbage (there is a note on the wiki about However I forgot that the 'load' reported by nodetool ring does not, I think, represent on-

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Thomas Heller
Wild guess here, but are you using start_token/end_token here when you should be using start_key? Looks to me like you are trying end_token = ''. HTH, /thomas On Thursday, August 5, 2010, Adam Crain wrote: > Hi, > > I'm on 0.6.4. Previous tickets in the JIRA in searching the web indicated > tha

RE: error using get_range_slice with random partitioner

2010-08-06 Thread Adam Crain
Thomas, That was indeed the source of the problem. I naively assumed that the token range would help me avoid retrieving duplicate rows. If you iterate over the keys, how do you avoid retrieving duplicate keys? I tried this morning and I seem to get odd results. Maybe this is just a consequenc

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Dave Viner
Funny you should ask... I just went through the same exercise. You must use Cassandra 0.6.4. Otherwise you will get duplicate keys. However, here is a snippet of perl that you can use. our $WANTED_COLUMN_NAME = 'mycol'; get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM

RE: error using get_range_slice with random partitioner

2010-08-06 Thread Adam Crain
Thanks Dave. I'm using 0.6.4 since I saw this issue in the JIRA, but I just discovered that the client I'm using mutates the order of keys after retrieving the result with the thrift API... pretty much making key iteration impossible. So time to fork and see if they'll fix it :(. I'll review yo

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Jeremy Hanna
Sounds like what you're seeing is in the client, but there was another duplicate bug with get_range_slice that was recently fixed on cassandra-0.6 branch. It's slated for 0.6.5 which will probably be out sometime this month, based on previous minor releases. https://issues.apache.org/jira/brow

Re: set ReplicationFactor and Token at Column Family/SuperColumn level.

2010-08-06 Thread Zhong Li
If I create 3-4 keyspaces, will this impact performance and resources (esp. memory and disk I/O) too much? Thanks, Zhong On Aug 5, 2010, at 4:52 PM, Benjamin Black wrote: On Thu, Aug 5, 2010 at 12:59 PM, Zhong Li wrote: The big thing bother me is initial ring token. We have some Column

Cassandra 0.7 Ruby/Thrift Bindings

2010-08-06 Thread Mark
Has anyone had any success using Cassandra 0.7 w/ ruby? I'm attempting to use the fauna/cassandra gem (http://github.com/fauna/cassandra/) which has explicit support for 0.7 but I keep receiving the following error message when making a request. Thrift::TransportException: end of file reached

Re: Cassandra 0.7 Ruby/Thrift Bindings

2010-08-06 Thread Ryan King
Make sure the client and server are both using the same transport (framed vs. non) -ryan On Fri, Aug 6, 2010 at 9:47 AM, Mark wrote: > Has anyone had any success using Cassandra 0.7 w/ ruby? I'm attempting to > use the fauna/cassandra gem (http://github.com/fauna/cassandra/) which has > explicit

Re: Cassandra 0.7 Ruby/Thrift Bindings

2010-08-06 Thread Mark
Wow.. fast answer AND correct. In Cassandra.yml # Frame size for thrift (maximum field length). # 0 disables TFramedTransport in favor of TSocket. thrift_framed_transport_size_in_mb: 15 I just had to change that value to 0 and everything worked. Now for my follow up question :) What is the dif
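For readers wondering what the framed/unframed switch changes on the wire: a framed Thrift transport prefixes each message with a 4-byte big-endian length, while the unframed transport writes the message bytes directly, so a client and server that disagree misread each other's streams. The following is a minimal sketch in Python using only the standard library; `frame` and `unframe` are hypothetical helper names for illustration, not part of any Thrift or Cassandra API.

```python
import struct


def frame(payload: bytes) -> bytes:
    """Prefix payload with a 4-byte big-endian length, as a framed transport does."""
    return struct.pack("!i", len(payload)) + payload


def unframe(data: bytes) -> bytes:
    """Read one length-prefixed frame from the start of data and return its payload."""
    if len(data) < 4:
        raise EOFError("incomplete frame header")
    (length,) = struct.unpack("!i", data[:4])
    body = data[4:4 + length]
    if len(body) < length:
        # A reader expecting frames but fed unframed bytes typically ends up here:
        # the first four payload bytes are misread as a (huge) length.
        raise EOFError("incomplete frame body")
    return body


wire = frame(b"hello")
print(wire)           # b'\x00\x00\x00\x05hello'
print(unframe(wire))  # b'hello'
```

Feeding unframed bytes such as `b"hello"` to `unframe` raises `EOFError`, which is in the same spirit as the "end of file reached" error reported in this thread, though the real client's failure path may differ.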

Re: Cassandra 0.7 Ruby/Thrift Bindings

2010-08-06 Thread Ryan King
On Fri, Aug 6, 2010 at 9:57 AM, Mark wrote: > Wow.. fast answer AND correct. In Cassandra.yml > > # Frame size for thrift (maximum field length). > # 0 disables TFramedTransport in favor of TSocket. > thrift_framed_transport_size_in_mb: 15 > > I just had to change that value to 0 and everything wo

Re: Question on load balancing in a cluster

2010-08-06 Thread Bill Au
If nodetool loadbalance does not do what its name implies, should it be renamed or maybe even removed altogether since the recommendation is to _never_ use it in production? Bill On Thu, Aug 5, 2010 at 6:41 AM, aaron morton wrote: > This comment from Ben Black may help... > > "I recommend you _n

Re: Cassandra 0.7 Ruby/Thrift Bindings

2010-08-06 Thread Jonathan Ellis
We are defaulting to framed in 0.7 because it enables the fix to https://issues.apache.org/jira/browse/CASSANDRA-475 I am strongly in favor of removing the unframed option entirely in 0.8 On Fri, Aug 6, 2010 at 12:57 PM, Mark wrote: > Wow.. fast answer AND correct. In Cassandra.yml > > # Frame s

How to migrate any relational database to Cassandra

2010-08-06 Thread sonia gehlot
Hi All, A little background about myself: I am an ETL engineer who has worked only with relational databases. I have been reading about and trying Cassandra for 3-4 weeks. I more or less understand the Cassandra data model, its structure, nodes, etc. I also installed Cassandra and played around with it, like cassandra>

batch_mutate atomicity

2010-08-06 Thread B. Todd Burruss
if i am using batch_mutate to update/insert two columns in the same CF and same key, is this an atomic operation? i understand that an operation on a single key in a CF is atomic, but not sure if the above scenario boils down to two operations or considered one operation. thx

Re: batch_mutate atomicity

2010-08-06 Thread B. Todd Burruss
ok i just saw the FAQ (http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic) follow up question ... it states that "As a special case, mutations against a single key are atomic, but more generally no" ... i interpret that to also mean " .. mutations against a single key in the same CF ... "

Re: How to migrate any relational database to Cassandra

2010-08-06 Thread Michael Dürgner
In my opinion it's the wrong approach to ask how to migrate from MySQL to Cassandra from a database-level view. The lack of joins in NoSQL should lead you to think about what you want to get out of your persistent storage, and afterwards think about how to migrate and, most of the time, how to denor

Re: How to migrate any relational database to Cassandra

2010-08-06 Thread sonia gehlot
Thanks for the reply. I am sorry, it seems my question came out wrong. * My question is: what considerations should I keep in mind when migrating to Cassandra? * Like we do in ETL: to extract data from the source we write a query and then load it into our database after applying the desired transformation

one question about cassandra write

2010-08-06 Thread Maifi Khan
Hi I have a question about the internal of cassandra write. Say, I already have the following in the database - (row_x,col_y,val1) Now if I try to insert (row_x,col_y,val100), what will happen? Will it overwrite the old data? I mean, will it overwrite the data physically or will it keep both the o

RE: error using get_range_slice with random partitioner

2010-08-06 Thread Adam Crain
Hi Jeremy, So, I fixed my client so it preserves the ordering and I get results that may be related to the bug. If I insert 30 keys into the random partitioner with names [key1, key2, ... key30] and then start the iteration (with a batch size of 10) I get the following debug output during the

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Jeremy Hanna
If you're willing to try it out, the easiest way to check whether it is resolved by the patch for CASSANDRA-1145 is to check out the 0.6 branch: svn checkout http://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6/ cassandra-0.6 Then run `ant` to build the binaries. On Aug 6, 20

Re: Question on load balancing in a cluster

2010-08-06 Thread Benjamin Black
Yes, imo, it should be renamed. On Fri, Aug 6, 2010 at 10:10 AM, Bill Au wrote: > If nodetool loadbalance does not do what its name implies, should it be > renamed or maybe even removed altogether since the recommendation is to > _never_ use it in production? > > Bill > > On Thu, Aug 5, 2010 at 6

Re: one question about cassandra write

2010-08-06 Thread Benjamin Black
On Fri, Aug 6, 2010 at 12:51 PM, Maifi Khan wrote: > Hi > I have a question about the internal of cassandra write. > Say, I already have the following in the database - > (row_x,col_y,val1) > > Now if I try to insert > (row_x,col_y,val100), what will happen? > Will it overwrite the old data? > I m

Re: set ReplicationFactor and Token at Column Family/SuperColumn level.

2010-08-06 Thread Benjamin Black
"Additional keyspaces have very little overhead (unlike CFs)." On Fri, Aug 6, 2010 at 9:42 AM, Zhong Li wrote: > > If I create 3-4 keyspaces, will this impact performance and resources (esp. > memory and disk I/O) too much? > > Thanks, > > Zhong > > On Aug 5, 2010, at 4:52 PM, Benjamin Black wrot

Re: How to migrate any relational database to Cassandra

2010-08-06 Thread Benjamin Black
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/ http://www.slideshare.net/benjaminblack/cassandra-basics-indexing On Fri, Aug 6, 2010 at 11:42 AM, sonia gehlot wrote: > Thanks for the reply. > > I am sorry, it seems my question came out wrong. > > * My question is w

Re: Cassandra 0.7 Ruby/Thrift Bindings

2010-08-06 Thread Benjamin Black
Ryan, I believe my branch was merged into fauna some time ago by jmhodges. However, 0.7 support must be explicitly enabled by require 'cassandra/0.7' as it currently defaults to 0.6. b On Fri, Aug 6, 2010 at 10:02 AM, Ryan King wrote: > On Fri, Aug 6, 2010 at 9:57 AM, Mark wrote: >> Wow.. fas

Re: Cassandra disk space utilization WAY higher than I would expect

2010-08-06 Thread Rob Coli
On 8/5/10 11:51 AM, Peter Schuller wrote: Also, the variation in disk space in your most recent post looks entirely as expected to me and nothing really extreme. The temporary disk space occupied during the compact/cleanup would easily be as high as your original disk space usage to begin with, a

RE: one question about cassandra write

2010-08-06 Thread Jeremiah Jordan
If you want to be able to get the data over time, you need to store it in multiple columns. You can use TimeUUID columns if you need to be able to get ranges of times through queries. -Original Message- From: Maifi Khan [mailto:maifi.k...@gmail.com] Sent: Friday, August 06, 2010 2:51 PM

Re: non blocking Cassandra with Tornado

2010-08-06 Thread Erik Onnen
On Thu, Jul 29, 2010 at 9:57 PM, Ryan Daum wrote: > > Barring this we (place where I work, Chango) will probably eventually fork > Cassandra to have a RESTful interface and use the Jetty async HTTP client to > connect to it. It's just ridiculous for us to have threads and associated > resources t

RE: error using get_range_slice with random partitioner

2010-08-06 Thread Adam Crain
I ran against the 0.6 branch and I still see similarly odd results. My test cases prove that the set of keys has been successfully inserted, but usually I never see the first key again or I reach the first key before having seen all of the keys. -Adam -Original Message- From: Jeremy Hanna [m

Re: one question about cassandra write

2010-08-06 Thread Rob Coli
On 8/6/10 2:13 PM, Benjamin Black wrote: Assuming the old version is already on disk in an SSTable, the new version will not overwrite it, and both versions will be in the system. A compaction will remove the old version, however. To be clear, a compaction will only remove the old version if :
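The behaviour described here (old and new versions of a column coexist in separate SSTables until a compaction merges them) can be modeled in a few lines. This is a simplified sketch, assuming each column version carries a timestamp and the highest timestamp wins on read; `read` and `compact` are hypothetical names, and the model deliberately ignores tombstones, GCGraceSeconds, and the conditions under which a real compaction drops old data.

```python
# Each "sstable" maps (row, column) -> (value, timestamp).
older = {("row_x", "col_y"): ("val1", 1)}
newer = {("row_x", "col_y"): ("val100", 2)}
sstables = [older, newer]


def read(tables, row, column):
    """Return the value with the highest timestamp across all tables, or None."""
    versions = [t[(row, column)] for t in tables if (row, column) in t]
    if not versions:
        return None
    return max(versions, key=lambda v: v[1])[0]


def compact(tables):
    """Merge tables into one, keeping only the newest version of each cell."""
    merged = {}
    for table in tables:
        for cell, (value, ts) in table.items():
            if cell not in merged or ts > merged[cell][1]:
                merged[cell] = (value, ts)
    return [merged]


print(read(sstables, "row_x", "col_y"))       # val100
print(sum(len(t) for t in sstables))          # 2 versions on "disk" before compaction
print(sum(len(t) for t in compact(sstables))) # 1 version after compaction
```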

Re: Cassandra disk space utilization WAY higher than I would expect

2010-08-06 Thread Peter Schuller
> Your post refers to "obsolete" sstables, but the only thing that makes them > "obsolete" in this case is that they have been compacted? Yes. > As I understand Julie's case, she is : > > a) initializing her cluster > b) inserting some number of unique keys with CL.ALL > c) noticing that more dis

Re: Columns limit

2010-08-06 Thread Software Dev
I'm a little retarded. Can you explain this a little more in depth? What you mean by "index rows named...". Do you mean create a separate ColumnFamily? On Sat, Jul 31, 2010 at 9:32 PM, Benjamin Black wrote: > Have the TimeUUID as the key, and then index rows named for the time > intervals, each

Re: one question about cassandra write

2010-08-06 Thread Peter Schuller
> a) It is a major compaction [1] > b) The old version was deleted/overwritten more than GCGraceSeconds ago [2] c) and the memtable containing the delete/overwrite has been flushed. (I suppose that's kinda obvious in retrospect, but it took me a little bit to realize this was why a 'nodetool comp

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Thomas Heller
Hey, [junit] key24 [junit] Query w/ Range(key24,,10) result size: 10 [junit] key24 I think this is actually the expected result, whenever you are using range_slices with start_key/end_key you must increment the last key you received and then use that in the next slice start_key. I also tried to u
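The paging pattern described here (reuse the last key received as the next start_key, then drop the duplicate it produces because start_key is inclusive) can be sketched against an in-memory stand-in. The `range_slice` function below is a hypothetical simulation, not the Thrift `get_range_slice` call; it orders keys by MD5 hash to mimic, roughly, how the random partitioner orders rows by token rather than by key.

```python
import hashlib


def token(key: str) -> int:
    # Assumption: MD5 of the key stands in for the random partitioner's token.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def range_slice(store, start_key, count):
    """Hypothetical stand-in: up to `count` keys in token order, start_key inclusive."""
    ordered = sorted(store, key=token)
    start = 0 if start_key == "" else ordered.index(start_key)
    return ordered[start:start + count]


def iterate_all(store, batch_size):
    """Page through every key exactly once."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")
    seen = []
    start_key = ""
    while True:
        batch = range_slice(store, start_key, batch_size)
        if start_key != "":
            batch = batch[1:]  # first result repeats the previous page's last key
        if not batch:
            break
        seen.extend(batch)
        start_key = batch[-1]
    return seen


keys = {"key%d" % i for i in range(1, 31)}
result = iterate_all(keys, 10)
print(len(result), len(set(result)))  # prints: 30 30
```

Note that the keys come back in token order, not in the order key1, key2, ..., so a client that re-sorts results by key name before picking the next start_key will skip or repeat rows.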

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Peter Schuller
> I think this is actually the expected result, whenever you are using > range_slices with start_key/end_key you must increment the last key > you received and then use that in the next slice start_key. I also > tried to use token because of exactly that behaviour and the doc > talking about inclus

Re: Cassandra disk space utilization WAY higher than I would expect

2010-08-06 Thread Rob Coli
On 8/6/10 3:30 PM, Peter Schuller wrote: *However*, and this is what I meant with my follow-up, that still does not explain the data from her post unless 'nodetool ring' reports total sstable size rather than the total size of live sstables. Relatively limited time available to respond to this

Re: Cassandra disk space utilization WAY higher than I would expect

2010-08-06 Thread Peter Schuller
> sstables waiting for the GC to trigger actual file removal. *However*, > and this is what I meant with my follow-up, that still does not > explain the data from her post unless 'nodetool ring' reports total > sstable size rather than the total size of live sstables. As far as I can tell, the inf

RE: error using get_range_slice with random partitioner

2010-08-06 Thread Adam Crain
I took this approach... reject the first result of subsequent get_range_slice requests. If you look back at output I posted (below) you'll notice that not all of the 30 keys [key1...key30] get listed! The iteration dies and can't proceed past key2. 1) 1st batch gets 10 unique keys. 2) 2nd batch

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Thomas Heller
> > Another way to do it is to filter results to exclude columns received > twice due to being on iteration end points. Well, depends on the size of your rows, keeping lists of 1mil+ column names will eventually become really slow (at least in ruby). > > This is useful because it is not always

backport of pre cache load

2010-08-06 Thread Artie Copeland
would it be possible to backport the 0.7 feature, the ability to save and preload row caches after a restart. i think that is a very nice and important feature that would help users with very large caches, that take a long time to get the proper hot set. for example we can get pretty good cache r

Lost data which succeeded to insert right before server was down

2010-08-06 Thread Ken Matsumoto
Hi all, I'm now checking reliability when inserting. I set one cluster of 3 nodes with replication factor 2 and OrderPreservingPartitioner. I insert data (A,B,C in this order) with consistency level ONE to node No.3. I immediately shut off (turn off N/W I/F)node No.3 right after inserting dat

Re: error using get_range_slice with random partitioner

2010-08-06 Thread Thomas Heller
On Sat, Aug 7, 2010 at 1:05 AM, Adam Crain wrote: > I took this approach... reject the first result of subsequent get_range_slice > requests. If you look back at output I posted (below) you'll notice that not > all of the 30 keys [key1...key30] get listed! The iteration dies and can't > proceed

row cache during bootstrap

2010-08-06 Thread Artie Copeland
the way i understand how row caches work is that each node has an independent cache, in that they do not push their cache contents to other nodes. if that's the case, is it also true that when a new node is added to the cluster it has to build up its own cache? if that's the case i see that as a po

Re: Columns limit

2010-08-06 Thread Thomas Heller
. ColumnFamily Standard: LogRecords, CompareWith=TimeUUIDType Row Key "20100806": Column Name: TimeUUID.new Value: JSON({'remote_addr':..., 'user_agent':, 'url':) ..., more Columns In my case I chose to "partition" by day, if you are getting too man
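The layout described here (one row per day, one TimeUUID-named column per log record, a JSON value) can be sketched with plain dictionaries. This is a minimal in-memory model, not a Cassandra client: `log_records` and `insert_log` are hypothetical names, and Python's `uuid.uuid1()` stands in for `TimeUUID.new`.

```python
import json
import uuid
from datetime import datetime, timezone

# Column family modeled as {row_key: {column_name: value}}.
log_records = {}


def insert_log(record, when=None):
    """Store one log record under the row for its day; returns (row_key, column_name)."""
    when = when or datetime.now(timezone.utc)
    row_key = when.strftime("%Y%m%d")  # "partition" by day
    # Assumption: uuid1() uses the current clock, not `when`; a real client
    # would build the TimeUUID from the record's own timestamp.
    column_name = uuid.uuid1()
    log_records.setdefault(row_key, {})[column_name] = json.dumps(record)
    return row_key, column_name


row_key, col = insert_log(
    {"remote_addr": "127.0.0.1", "url": "/"},
    datetime(2010, 8, 6, tzinfo=timezone.utc),
)
print(row_key)  # 20100806
```

If a single day collects too many columns, the same idea works with a finer-grained `strftime` pattern (for example by hour), at the cost of reading more rows per query.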

Re: Columns limit

2010-08-06 Thread Software Dev
limit (it was also logging related). Solution > was pretty simple, log data is immutable, so no SuperColumn needed. > > ColumnFamily Standard: LogRecords, CompareWith=TimeUUIDType > > Row Key "20100806": > Column Name: TimeUUID.new Value: JSON({'remote_addr':..

Re: Columns limit

2010-08-06 Thread Thomas Heller
"LogByRemoteAddrAndDate" CompareWith: TimeUUID Row: "127.0.0.1:20100806" Column TimeUUID/JSON as usual. If you want to "link" to the actual log record (to avoid writing it multiple times) just insert the same timeuuid you inserted into the other CF and leave the value empty. So yo

Re: Cassandra Scaling Questions

2010-08-06 Thread Rob Coli
On 8/5/10 1:42 AM, Oleg Anastasjev wrote: 3.) When using the random partitioner how much difference should be expected (or has been observed) between nodes? 2%? 10%? This depends on data. It will distribute keys almost equally between nodes, but sizes of row data could be different for different

Re: batch_mutate atomicity

2010-08-06 Thread Jonathan Ellis
Everything in the same key of a batch_mutate is atomic. (But not isolated.) On Fri, Aug 6, 2010 at 2:15 PM, B. Todd Burruss wrote: > ok i just saw the FAQ > (http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic) > > follow up question ... > > it states that "As a special case, mutations agai

Re: non blocking Cassandra with Tornado

2010-08-06 Thread Jonathan Ellis
See comments to https://issues.apache.org/jira/browse/CASSANDRA-1256 On Fri, Jul 30, 2010 at 12:57 AM, Ryan Daum wrote: > An asynchronous thrift client in Java would be something that we could > really use; I'm trying to get a sense of whether this async client is usable > with Cassandra at this

Re: backport of pre cache load

2010-08-06 Thread Jonathan Ellis
are you caching 100% of the CF? if not this is not super useful. On Fri, Aug 6, 2010 at 7:10 PM, Artie Copeland wrote: > would it be possible to backport the 0.7 feature, the ability to save and > preload row caches after a restart. i think that is a very nice and > important feature that would

Re: Lost data which succeeded to insert right before server was down

2010-08-06 Thread Jonathan Ellis
did you check the log on 3 for exceptions? On Fri, Aug 6, 2010 at 7:10 PM, Ken Matsumoto wrote: > Hi all, > > I'm now checking reliability when inserting. > > I set one cluster of 3 nodes with replication factor 2 and > OrderPreservingPartitioner. > I insert data (A,B,C in this order) with consis

Re: Columns limit

2010-08-06 Thread Benjamin Black
rColumn needed. >> >> ColumnFamily Standard: LogRecords, CompareWith=TimeUUIDType >> >> Row Key "20100806": >>  Column Name: TimeUUID.new Value: JSON({'remote_addr':..., >> 'user_agent':, 'url':) >>  ..., more Column

Re: Columns limit

2010-08-06 Thread Mark
"LogByRemoteAddrAndDate" CompareWith: TimeUUID Row: "127.0.0.1:20100806" Column TimeUUID/JSON as usual. If you want to "link" to the actual log record (to avoid writing it multiple times) just insert the same timeuuid you inserted into the other CF and leave the value e

Re: How to migrate any relational database to Cassandra

2010-08-06 Thread Peter Harrison
On Sat, Aug 7, 2010 at 6:00 AM, sonia gehlot wrote: > Can you please help me how to move forward? How should I do all the setup > for this? My view is that Cassandra is fundamentally different from SQL databases. There may be artefacts which are superficially similar between the two systems, bu

Re: Columns limit

2010-08-06 Thread Benjamin Black
ll >> them out later in some map/reduce fashion. What you want is another >> column Family and a similar structure. >> >> ColumnFamily Standard "LogByRemoteAddrAndDate" CompareWith: TimeUUID >> >> Row: "127.0.0.1:20100806" Column TimeUUID/JSON as usual. If yo

Re: batch_mutate atomicity

2010-08-06 Thread james anderson
good morning; On 2010-08-07, at 02:45 , Jonathan Ellis wrote: Everything in the same key of a batch_mutate is atomic. (But not isolated.) what does the distinction mean in the context of cassandra? is it that the execution of an operation with the same key could see the effect of the 'f

Re: Columns limit

2010-08-06 Thread Mark
ColumnFamily Standard "LogByRemoteAddrAndDate" CompareWith: TimeUUID Row: "127.0.0.1:20100806" Column TimeUUID/JSON as usual. If you want to "link" to the actual log record (to avoid writing it multiple times) just insert the same timeuuid you inserted into the other CF and leave t

Re: Columns limit

2010-08-06 Thread Mark
ColumnFamily Standard "LogByRemoteAddrAndDate" CompareWith: TimeUUID Row: "127.0.0.1:20100806" Column TimeUUID/JSON as usual. If you want to "link" to the actual log record (to avoid writing it multiple times) just insert the same timeuuid you inserted into the other CF and leave t

Key level index

2010-08-06 Thread Mark
In the "CassandraLimitations" wiki it states: " Cassandra has two levels of indexes: key and column" I understand how the column and subcolumn indexes work but can someone explain to me how the key level index works?