Re: Basic Architecture Question

2010-04-30 Thread Patricio Echagüe
Roger, if you include the last read key as the start key for the next API call, will that retrieve the same key/row twice? The documentation says that both keys (start, finish) are included. Thanks On Thu, Apr 29, 2010 at 1:31 PM, Brandon Williams wrote: > On Thu, Apr 29, 2010 at 10:19 AM, Davi
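The inclusive-bounds behavior Patricio asks about is real: a pager that reuses the last key it read as the next start key will see that key again. A minimal sketch of the usual workaround (drop the first row of every page after the first); `fetch_page` here is a hypothetical stand-in for the real get_range_slices call, and the data is made up:

```python
# Because both ends of a get_range_slices key range are inclusive, reusing
# the last key seen as the next start key returns that row twice. The
# standard fix: skip the first row of every page after the first.
# Note: page_size must be >= 2, or progress stalls after the first page.

def paginate(fetch_page, page_size):
    """Yield every row exactly once despite inclusive range bounds."""
    start = ""          # empty start key means "beginning of the range"
    first_page = True
    while True:
        rows = fetch_page(start, page_size)
        if not first_page and rows:
            rows = rows[1:]   # drop the duplicate of the previous last key
        if not rows:
            return
        for row in rows:
            yield row
        start = row           # last key seen becomes the next start key
        first_page = False

# Stand-in for get_range_slices over a sorted keyspace (hypothetical data).
KEYS = ["a", "b", "c", "d", "e"]

def fake_fetch(start, count):
    eligible = [k for k in KEYS if k >= start]
    return eligible[:count]
```

With `page_size=2` this walks the whole keyspace and yields each key exactly once.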

Re: Single Split ColumnFamilyRecordReader returns duplicate rows

2010-04-30 Thread Jonathan Ellis
Can you create a ticket? On Fri, Apr 30, 2010 at 4:55 PM, Joost Ouwerkerk wrote: > There's a bug in ColumnFamilyRecordReader that appears when processing > a single split.  When the start and end tokens of the split are equal, > duplicate rows can be returned. > > Example with 5 rows: > token (st

Re: cassandra 0.5.1 java.lang.OutOfMemoryError: Java heap space issue

2010-04-30 Thread Eric Yu
Try specifying the InitialToken. In your cluster, set each node's token to i*(2**127/6), i = [1,6]. It will help. On Sat, May 1, 2010 at 8:03 AM, elsif wrote: > I upgraded to 0.6.1 and was able to bring up all the nodes and make > queries. > > After adding some new data, the java vm ran out of memory on t
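The token arithmetic Eric suggests, sketched out: with the RandomPartitioner the token space runs from 0 to 2**127, so a balanced 6-node ring uses tokens spaced 2**127/6 apart.

```python
# Evenly spaced InitialTokens for a 6-node RandomPartitioner ring.
# Each adjacent pair of tokens is exactly SPACING apart, so every node
# owns an equal share of the md5 token space.
NODES = 6
SPACING = 2 ** 127 // NODES

tokens = [i * SPACING for i in range(NODES)]
```

Each token goes into the corresponding node's InitialToken setting before first start.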

Re: cassandra 0.5.1 java.lang.OutOfMemoryError: Java heap space issue

2010-04-30 Thread elsif
I upgraded to 0.6.1 and was able to bring up all the nodes and make queries. After adding some new data, the java vm ran out of memory on three of the nodes. Cassandra continues to run for about 20 minutes before it exits completely: DEBUG [ROW-MUTATION-STAGE:2] 2010-04-30 16:02:27,298 RowMutati

Single Split ColumnFamilyRecordReader returns duplicate rows

2010-04-30 Thread Joost Ouwerkerk
There's a bug in ColumnFamilyRecordReader that appears when processing a single split. When the start and end tokens of the split are equal, duplicate rows can be returned. Example with 5 rows: token (start and end) = 53193025635115934196771903670925341736 Tokens returned by first get_range_slic
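A sketch of the wrap-around semantics behind this report: on the token ring, a range whose start and end tokens are equal covers the entire ring, so a reader issuing successive range calls inside such a split can walk past its own starting point and see rows again. One simple guard (stopping once a key repeats) is shown below; the paging data is invented for illustration:

```python
# Detect ring wrap-around while scanning pages of row keys: a key seen
# twice means the scan has come back around to its starting point.

def scan_ring(pages):
    """Iterate pages of keys, stopping if a key is seen twice (wrap)."""
    seen = set()
    out = []
    for page in pages:
        for key in page:
            if key in seen:
                return out        # wrapped around the ring: stop
            seen.add(key)
            out.append(key)
    return out
```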

Re: Cassandra data model for financial data

2010-04-30 Thread Rob Coli
On 4/30/10 6:36 AM, Jonathan Ellis wrote: each row has a [column] index and bloom filter of column names, and then there is the overhead of the java objects. In addition to the aforementioned row column index, there's also the row key index, which is an int and a key-length-(string now/byte[]

Re: nodetools, available commands

2010-04-30 Thread Rob Coli
On 4/30/10 4:47 AM, Douglas Santos wrote: Hi all, We are writing an article for a magazine and would like to write about monitoring, more precisely about nodetool, but did not find much material about the tool. I would like some help, or a brief explanation of the nodetool commands ... Available command

Re: why the sum of all the nodes' loads is much bigger than the size of the inserted data?

2010-04-30 Thread Rob Coli
On 4/30/10 5:21 AM, Bingbing Liu wrote: > hi, > thanks for your help. > I ran nodetool -h compact > but the load stays the same; can anyone tell me why? "compact" and "cleanup" are two different operations. "compact" does a major compaction. "cleanup" is a superset of "compact" w

Re: Re: Re: compaction slow while sstable>25GB, limitation of the sstable size?

2010-04-30 Thread Schubert Zhang
I once modified the code to set INDEX_INTERVAL = 512 to decrease the memory usage, and it seems to be working fine. Is that right? 2010/4/30 casablinca126.com > hi, >It seems changing the INDEX_INTERVAL will conflict with > AntiEntropyService, right? >I will reconstruct my sstables.

Re: Batch mutate doesn't work

2010-04-30 Thread Anthony Molinaro
On Fri, Apr 30, 2010 at 03:58:09PM +0200, Zubair Quraishi wrote: > % > % set second property ( fails! - why? ) > % > MutationMap = > { >Key, >{ > <<"KeyValue">>, > [ >#mutation{ > column_or_supercolumn = #column{ name = "property" , value = > "value" , times

inserting new rows with one key vs. inserting new columns in a row performance

2010-04-30 Thread Даниел Симеонов
Hi, I've checked two similar scenarios and one of them seems to be more performant. Timestamped data is being appended; the first use case is with an OPP and new rows being created, each with only one column (there are about 7-8 CFs). The second case is to have rows with more columns and Rand

Re: Not able to install cassandra in linux

2010-04-30 Thread David King
> [r...@calculus apache-cassandra-0.6.1]# bin/cassandra -f > Can't start up: not enough memory My guess is that you don't have enough memory

Re: Problem with JVM? concurrent mode failure

2010-04-30 Thread Jonathan Ellis
Great, thanks for testing that. On Fri, Apr 30, 2010 at 11:45 AM, Daniel Gimenez wrote: > > The code is working now without memory leaks using your patch in the 0.6.2. I > have done more than 100M without problems until now... > > Thanks! > Daniel Gimenez. > -- > View this message in context: >

Re: Batch mutate doesn't work

2010-04-30 Thread Jonathan Ellis
like I told you on the other list, erlang or the erlang thrift compiler is not exposing the error the cassandra server is sending you. "bad_return_value" is not it. Unless someone with actual erlang experience chimes in here, I would suggest trying with Python first, at least that will show you t
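Following Jonathan's suggestion to try Python first, here is a minimal sketch of the nested mutation map that batch_mutate expects: `{ row_key : { column_family : [ Mutation, ... ] } }`. The structure is modeled with plain dicts so the shape is easy to inspect without a running server; the `make_mutation` helper and all names/values are illustrative stand-ins for the Thrift-generated classes, not a real client:

```python
# Shape of a batch_mutate mutation map, modeled with plain dicts:
#   { row_key : { column_family : [ Mutation, ... ] } }
import time

def make_mutation(name, value):
    # Stand-in for Mutation(column_or_supercolumn=ColumnOrSuperColumn(
    #     column=Column(name, value, timestamp)))
    return {"column": {"name": name, "value": value,
                       "timestamp": int(time.time() * 1e6)}}

mutation_map = {
    "Key1": {
        "KeyValue": [
            make_mutation("property", "value"),
            make_mutation("property2", "value2"),
        ]
    }
}
```

If the same call shape fails through a real Thrift client, the server's error (e.g. an InvalidRequestException) should at least be visible from Python.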

RE: Problem with JVM? concurrent mode failure

2010-04-30 Thread Daniel Gimenez
The code is working now without memory leaks using your patch in the 0.6.2. I have done more than 100M without problems until now... Thanks! Daniel Gimenez. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Problem-with-JVM-concurrent-mode-failure

Re: ColumnFamilyInputFormat KeyRange scans on a CF

2010-04-30 Thread Utku Can Topçu
I meant in the first sentence "running the get_range_slices from a single point" On Fri, Apr 30, 2010 at 4:08 PM, Utku Can Topçu wrote: > Do you mean, running the get_range_slices from a single? Yes, it would be > reasonable for a relatively small key range, when it comes to analyze a > really b

Re: ColumnFamilyInputFormat KeyRange scans on a CF

2010-04-30 Thread Utku Can Topçu
Do you mean, running the get_range_slices from a single? Yes, it would be reasonable for a relatively small key range; when it comes to analyzing a really big range in a really big data collection (i.e. like the one we currently populate), having a way of distributing the reads among the cluster seems

Re: Cassandra reverting deletes?

2010-04-30 Thread Joost Ouwerkerk
Great, thank you. Do you have a hypothesis about where things might be going wrong? Let me know what I can do to help. On Fri, Apr 30, 2010 at 9:33 AM, Jonathan Ellis wrote: > https://issues.apache.org/jira/browse/CASSANDRA-1040 > > On Thu, Apr 29, 2010 at 6:55 PM, Joost Ouwerkerk wrote: >> Ok

Batch mutate doesn't work

2010-04-30 Thread Zubair Quraishi
I have the following code in Erlang to set a value and then add a property. The first set works but the mutate fails. Can anyone enlighten me? Thanks {ok, C} = thrift_client:start_link("127.0.0.1",9160, cassandra_thrift), Key = "Key1", % % set first property % thrift_client:call( C,

Re: Re: why the sum of all the nodes' loads is much bigger than the size of the inserted data?

2010-04-30 Thread Bingbing Liu
Thanks, according to your explanation the result sounds reasonable. Thanks again! 2010-04-30 Bingbing Liu From: Sylvain Lebresne Sent: 2010-04-30 20:54:04 To: user Cc: Subject: Re: why the sum of all the nodes' loads is much bigger than the size of the inserted data? I believe on

Not able to install cassandra in linux

2010-04-30 Thread sharanabasava raddi
[r...@calculus apache-cassandra-0.6.1]# bin/cassandra -f Can't start up: not enough memory I am a beginner with Cassandra. My installation is failing due to the above error on Linux. Could anyone suggest a solution? Thanks in advance.

Re: ColumnFamilyOutputFormat?

2010-04-30 Thread Jonathan Ellis
On Fri, Apr 30, 2010 at 7:14 AM, Utku Can Topçu wrote: > Hey All, > > I've been looking at the documentation and related articles about Cassandra > and Hadoop integration, I'm only seeing ColumnFamilyInputFormat for now. > What if I want to write directly to cassandra after a reduce? Then you jus

Re: Inserting files to Cassandra timeouts

2010-04-30 Thread Jonathan Ellis
compaction starts but never finishes. are you inserting all these files into the same row? don't do that. On Fri, Apr 30, 2010 at 3:04 AM, Spacejatsi wrote: > I ran the test again, inserting 64 files (15-25MB per file) with 2 threads > inserting one file at a time. > The first 30 files go rel

Re: Cassandra data model for financial data

2010-04-30 Thread Jonathan Ellis
each row has an index and bloom filter of column names, and then there is the overhead of the java objects. On Thu, Apr 29, 2010 at 11:05 PM, Andrew Nguyen wrote: > When making rough calculations regarding the potential size of a single row, > what sort of overhead is there to consider?  In other

Re: Key distribution

2010-04-30 Thread Jonathan Ellis
Nope. You could write one using bin/sstablekeys though. On Thu, Apr 29, 2010 at 8:58 PM, Carlos Sanchez wrote: > All, > > Does anyone know of a program (series of classes) that can capture the key > distribution of the rows in a ColumnFamily, sort of a [sub] string-histogram. > > Thanks, > > Ca
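Jonathan's suggestion sketched: pipe the output of bin/sstablekeys (one row key per line) into a small script that histograms keys by prefix. Everything here except the sstablekeys invocation is made up for illustration:

```python
# Histogram row keys by their leading characters, e.g. fed from
#   bin/sstablekeys <sstable> | python histogram.py
from collections import Counter

def key_histogram(lines, prefix_len=2):
    """Count keys by their first prefix_len characters."""
    return Counter(line.strip()[:prefix_len]
                   for line in lines if line.strip())

# e.g. print(key_histogram(sys.stdin).most_common(20)) in a real script
```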

Re: Cassandra reverting deletes?

2010-04-30 Thread Jonathan Ellis
https://issues.apache.org/jira/browse/CASSANDRA-1040 On Thu, Apr 29, 2010 at 6:55 PM, Joost Ouwerkerk wrote: > Ok, I reproduced without mapred.  Here is my recipe: > > On a single-node cassandra cluster with basic config (-Xmx:1G) > loop { >   * insert 5,000 records in a single columnfamily with

Re: ColumnFamilyInputFormat KeyRange scans on a CF

2010-04-30 Thread Jonathan Ellis
Sounds like doing this w/o m/r with get_range_slices is a reasonable way to go. On Thu, Apr 29, 2010 at 6:04 PM, Utku Can Topçu wrote: > I'm currently writing collected data continuously to Cassandra, having keys > starting with a timestamp and a unique identifier (like > 2009.01.01.00.00.00.RAND

Re: why the sum of all the nodes' loads is much bigger than the size of the inserted data?

2010-04-30 Thread Sylvain Lebresne
I believe one of the reasons is all the metadata. As far as I understand what you said, you have 500 million rows, each having only one column. The problem is that a row has a bunch of metadata: a bloom filter, a column index, plus a few other bytes to store the number of columns, if the row is
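A rough back-of-envelope consistent with Sylvain's explanation: the raw payload is key plus column value, but each stored column also carries an 8-byte timestamp, and each row carries per-row metadata (bloom filter, column index, counts). The per-row metadata figure below is an illustrative guess, not a measured number:

```python
# Back-of-envelope: raw payload vs payload plus per-row overhead,
# for 500M single-column rows at replication factor 3.
ROWS = 500_000_000
REPLICATION = 3

key_bytes = 20
value_bytes = 110
timestamp_bytes = 8
row_metadata_guess = 60   # bloom filter + column index + bookkeeping (assumed)

raw = ROWS * (key_bytes + value_bytes) * REPLICATION
with_overhead = ROWS * (key_bytes + value_bytes + timestamp_bytes
                        + row_metadata_guess) * REPLICATION

print(raw / 1e9, "GB raw vs", with_overhead / 1e9, "GB with per-row overhead")
```

With these assumed numbers, 195 GB of raw payload already grows toward 300 GB before counting column names, sstable duplication awaiting compaction, or commit logs.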

Re: Re: why the sum of all the nodes' loads is much bigger than the size of the inserted data?

2010-04-30 Thread Jordan Pittier
Don't forget to count the timestamps for each column. 2010/4/30 Bingbing Liu > hi, > > thanks for your help. > > I ran nodetool -h compact > > but the load stays the same; can anyone tell me why? > > > 2010-04-30 > -- > Bingbing Liu > --

Re: Re: why the sum of all the nodes' loads is much bigger than the size of the inserted data?

2010-04-30 Thread Bingbing Liu
hi, thanks for your help. I ran nodetool -h compact but the load stays the same; can anyone tell me why? 2010-04-30 Bingbing Liu From: casablinca126.com Sent: 2010-04-30 15:52:09 To: user@cassandra.apache.org Cc: Subject: Re: why the sum of all the nodes' loads is muc

ColumnFamilyOutputFormat?

2010-04-30 Thread Utku Can Topçu
Hey All, I've been looking at the documentation and related articles about Cassandra and Hadoop integration, I'm only seeing ColumnFamilyInputFormat for now. What if I want to write directly to cassandra after a reduce? What comes to my mind is, in the Reducer's setup I'd initialize a Cassandra c

nodetools, available commands

2010-04-30 Thread Douglas Santos
Hi all, We are writing an article for a magazine and would like to write about monitoring, more precisely about nodetool, but did not find much material about the tool. I would like some help, or a brief explanation of the nodetool commands ... Available commands: ring, info, cleanup, compact, cfstats, sna

Re: Re: Re: compaction slow while sstable>25GB, limitation of the sstable size?

2010-04-30 Thread casablinca126.com
hi, It seems changing the INDEX_INTERVAL will conflict with AntiEntropyService, right? I will reconstruct my sstables. Thank you, Jonathan! cheers, Cao Jiguang -- casablinca126.com 2010-04-30

Re: Inserting files to Cassandra timeouts

2010-04-30 Thread Spacejatsi
I ran the test again, inserting 64 files (15-25MB per file) with 2 threads inserting one file at a time. The first 30 files go in relatively fast, but then it jams, and finally it times out. This tpstats output was taken when the first timeout came. I also tested splitting the files to a max of 5 MB per file. T

Re: How does cassandra deal with collisions?

2010-04-30 Thread Sylvain Lebresne
Two rows are never compared by the MD5 of their keys. The md5 of a row key is just used to choose which nodes of the cluster are responsible for the row. On Fri, Apr 30, 2010 at 5:37 AM, Mark Jones wrote: > MD5 is not a perfect hash, it can produce collisions, how are these dealt > with? > > Is t
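A sketch of what Sylvain describes: the md5 of a key only picks which nodes hold the row; row identity is always the key itself, so two distinct keys that happened to share a token would still be stored as two rows, merely landing on the same nodes. The ring-walk below is a simplified illustration, not Cassandra's actual placement code:

```python
# Keys are placed on the ring by md5 token; identity stays with the key.
import hashlib

def token(key):
    """Map a row key onto a 0..2**127 token space (simplified)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 127)

def pick_node(key, node_tokens):
    """Choose the first node whose token is >= the key's token."""
    t = token(key)
    for nt in sorted(node_tokens):
        if t <= nt:
            return nt
    return min(node_tokens)   # wrap around the ring
```

Even if `token(a) == token(b)` for distinct keys a and b, both rows survive; they just share replica nodes.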

Re: why the sum of all the nodes' loads is much bigger than the size of the inserted data?

2010-04-30 Thread casablinca126.com
hi, Have you ever run anti-compaction (more than once, maybe), but never run cleanup on the anti-compaction node? cheers, Cao Jiguang 2010-04-30 casablinca126.com From: Bingbing Liu Sent: 2010-04-30 15:24:45 To: user Cc: Subject: why the sum of all the nodes' loads is much bigger

why the sum of all the nodes' loads is much bigger than the size of the inserted data?

2010-04-30 Thread Bingbing Liu
I insert 500,000,000 rows, each of which has a key of 20 bytes and a column of 110 bytes, and the ReplicationFactor is set to 3, so I expect the load of the cluster to be 0.5 billion * 130 * 3 = 195 G bytes. But in fact the load I get through "nodetool -h localhost ring" is about 443G.

Re: Detailed behavior of insert() operation?

2010-04-30 Thread Roland Hänel
Here is the ticket: https://issues.apache.org/jira/browse/CASSANDRA-1039 Thanks, Roland 2010/4/29 Jonathan Ellis > 2010/4/29 Roland Hänel : > > Imagine the following rule: if we are in doubt whether to repair a column > > with timestamp T (because two values X and Y are present within the > clu

Re: Correct data model for Cassandra

2010-04-30 Thread Oleg Ivanov
Thanks Ellis, so the common scenario is to store data in one CF and any index (inverted?) in another CF? 2010/4/30 Jonathan Ellis > the correct data model is one where you can pull the data you want out > as a slice of a row, or (sometimes) as a slice of sequential rows. > usually this involv