Re: Pelops - a new Java client library paradigm

2010-06-14 Thread Dominic Williams
Hi Ran, thanks for the compliment. It is true that we benefited enormously
from batch mutate. Without that the Mutator/Selector paradigm would not have
been possible in the same way. It will be interesting to see where Cassandra
takes us next. Best, Dominic

On 12 June 2010 20:05, Ran Tavory  wrote:

> Nice going, Dominic, having a clear API for cassandra is a big step forward
> :)
> Interestingly, at hector we came up with a similar approach, just didn't find
> the time to code it, as production systems keep me busy at nights as
> well... We started with the implementation of BatchMutation, but the rest of
> the API improvements are still TODO.
> Keep up the good work, competition keeps us healthy ;)
>
>
> On Fri, Jun 11, 2010 at 4:41 PM, Dominic Williams <
> thedwilli...@googlemail.com> wrote:
>
>> Pelops is a new high quality Java client library for Cassandra.
>>
>> It has a design that:
>> * reveals the full power of Cassandra through an elegant "Mutator and
>> Selector" paradigm
>>  * generates better, cleaner, less bug prone code
>> * reduces the learning curve for new users
>> * drives rapid application development
>> * encapsulates advanced pooling algorithms
>>
>> An article introducing Pelops can be found at
>>
>> http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/
>>
>> Thanks for reading.
>> Best, Dominic
>>
>
>


Re: Pelops - a new Java client library paradigm

2010-06-14 Thread Torsten Curdt
Also think this looks really promising.
The fact that there are so many API wrappers now (3?) doesn't reflect
well on the native API though :)

/me ducks and runs

On Mon, Jun 14, 2010 at 11:55, Dominic Williams
 wrote:
> Hi Ran, thanks for the compliment. It is true that we benefited enormously
> from batch mutate. Without that the Mutator/Selector paradigm would not have
> been possible in the same way. It will be interesting to see where Cassandra
> takes us next. Best, Dominic
>
> On 12 June 2010 20:05, Ran Tavory  wrote:
>>
>> Nice going, Dominic, having a clear API for cassandra is a big step
>> forward :)
>> Interestingly, at hector we came up with a similar approach, just didn't
>> find the time to code it, as production systems keep me busy at nights as
>> well... We started with the implementation of BatchMutation, but the rest of
>> the API improvements are still TODO.
>> Keep up the good work, competition keeps us healthy ;)
>>
>> On Fri, Jun 11, 2010 at 4:41 PM, Dominic Williams
>>  wrote:
>>>
>>> Pelops is a new high quality Java client library for Cassandra.
>>> It has a design that:
>>> * reveals the full power of Cassandra through an elegant "Mutator and
>>> Selector" paradigm
>>> * generates better, cleaner, less bug prone code
>>> * reduces the learning curve for new users
>>> * drives rapid application development
>>> * encapsulates advanced pooling algorithms
>>> An article introducing Pelops can be found at
>>>
>>> http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/
>>> Thanks for reading.
>>> Best, Dominic
>
>


batch_mutate atomic?

2010-06-14 Thread Per Olesen
Can I expect batch_mutate to work in what I would think of as an atomic 
operation?

That either all the mutations in the batch_mutate call are executed or none of 
them are? Or can some of them fail while some of them succeed?



Re: Beginner Assumptions

2010-06-14 Thread Torsten Curdt
>> 
>> TBH while we are using super columns, they somehow feel wrong to me. I
>> would be happier if we could move what we do with super columns into
>> the row key space. But in our case that does not seem to be so easy.
>> 
>>
>
> I'd be quite interested to learn what you are doing with super columns
> that cannot be replicated with composite keys and range queries.

We are storing events as they come in. The timestamp is the key:

 2010-10-01 14:35 event1 someattr1=someval
 2010-10-01 14:35 event1 someattr1=someval
 2010-10-01 14:36 event1 someattr1=someval

We need to access them in time buckets/groups. For example "all events
that happened at  2010-10-01 14:35". Now I see the following options:

1) Store the events in a normal column and use a range query on the row key

 2010-10-01 14:35/UUID: event1 someattr1=someval
 2010-10-01 14:35/UUID: event1 someattr1=someval
 2010-10-01 14:36/UUID: event1 someattr1=someval

 Access: range("2010-10-01 14:35".."2010-10-01 14:36")

 Problem: For the range query on the row key it needs to use the
OrderedPartitioner ...which leads to hot spots as this is timeline
data. The hot spot would just cycle through the ring.

2) Store the events in a normal column and update an index

 2010-10-01 14:35/UUID1: event1 someattr1=someval
 2010-10-01 14:35/UUID2: event1 someattr1=someval
 2010-10-01 14:36/UUID3: event1 someattr1=someval

 2010-10-01 14:35: [ UUID1, UUID2 ]
 2010-10-01 14:36: [ UUID3 ]

 Access: read the index and then read the rows for all the events in that bucket

Problem: The index needs to be maintained in an atomic fashion ...so a
JSON blob is probably not a great idea. It could probably be implemented
by using the UUIDs as column names instead. That could lead to way
more column names than one should use though. (1-10 column
names are not a great idea IIUC)

3) Store per event type and use the time as the column name

 event1: {
   2010-10-01 14:35/UUID1: event1 someattr1=someval
   2010-10-01 14:35/UUID2: event1 someattr1=someval
   2010-10-01 14:36/UUID3: event1 someattr1=someval
 }

 event2: {
 }

 Access: For each event type use a slice("2010-10-01 14:35".."2010-10-01 14:36")
 Problem: Storing per event type is not natural for our application.
Plus it requires a request per type. Also a lot of column names, and
Cassandra scales better on the row level.

4) use a super column

 2010-10-01 14:35: {
   UUID1: event1 someattr1=someval
   UUID2: event1 someattr1=someval
 }

2010-10-01 14:36: {
   UUID3: event1 someattr1=someval
}

 Access: Just a single get request for the bucket, or page through if
there are too many results (see the sketch below).
 Problem: Also has many super column names, but this is a native
Cassandra primitive so one assumes it is optimized ...or will become
more optimized. (I am wondering though: do super columns reside on a
single node? I hope not)
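
For what it's worth, option 4's single read looks roughly like this against
the 0.6 Thrift API (a hedged sketch: keyspace/CF names and the page size of
1000 are made up, not from our schema):

 import java.util.List;
 import org.apache.cassandra.thrift.*;

 public class BucketRead {
     // Fetch every event in one time bucket with a single get_slice: the
     // row key is the bucket timestamp, one super column per event UUID.
     // For more than 1000 events, page by passing the last seen super
     // column name as the new slice start.
     static List<ColumnOrSuperColumn> readBucket(Cassandra.Client client,
             String bucket) throws Exception {
         SlicePredicate predicate = new SlicePredicate();
         predicate.setSlice_range(
             new SliceRange(new byte[0], new byte[0], false, 1000));
         ColumnParent parent = new ColumnParent("Events"); // the super CF
         return client.get_slice("Keyspace1", bucket, parent, predicate,
                                 ConsistencyLevel.ONE);
     }
 }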

So what would you pick then?
cheers
--
Torsten


Re: batch_mutate atomic?

2010-06-14 Thread Ran Tavory
No, it's not atomic; it just shortens the round trip of many update requests.
Some may fail and some may succeed.
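
To make the failure mode concrete, a rough sketch against the 0.6 Thrift API
(keyspace, CF and key names here are illustrative):

 import java.util.*;
 import org.apache.cassandra.thrift.*;

 public class BatchMutateSketch {
     // Build an insertion Mutation for one standard column.
     static Mutation insertion(String name, String value, long ts) throws Exception {
         Column col = new Column(name.getBytes("UTF-8"), value.getBytes("UTF-8"), ts);
         ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
         cosc.setColumn(col);
         Mutation m = new Mutation();
         m.setColumn_or_supercolumn(cosc);
         return m;
     }

     static void write(Cassandra.Client client) throws Exception {
         long ts = System.currentTimeMillis() * 1000;
         // key -> (column family -> mutations)
         Map<String, Map<String, List<Mutation>>> mutationMap =
             Collections.singletonMap("row-key-1",
                 Collections.singletonMap("Standard1",
                     Arrays.asList(insertion("a", "1", ts), insertion("b", "2", ts))));
         // One round trip for all the mutations, but NOT atomic: if this
         // throws, some of the columns may have been written and some not.
         client.batch_mutate("Keyspace1", mutationMap, ConsistencyLevel.QUORUM);
     }
 }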

On Mon, Jun 14, 2010 at 2:40 PM, Per Olesen  wrote:

> Can I expect batch_mutate to work in what I would think of as an atomic
> operation?
>
> That either all the mutations in the batch_mutate call are executed or none
> of them are? Or can some of them fail while some of them succeed?
>
>


Re: Pelops - a new Java client library paradigm

2010-06-14 Thread Dominic Williams
Hi Riyad,

No problem. Because it is a new library, I cannot provide a large list of
production deployments. However, there are various reasons you should have
confidence in the library:

1/ Firstly, the background of Pelops is that it is being used as the basis of
a serious commercial project that makes very heavy use of Cassandra. The
project itself is best described as a social network/games venture aimed at
kids aged 6-13. I cannot go into commercial details because the information is
sensitive, but I can say that scalability is very important to this
venture, it has sufficient funds to ensure that whether or not it is
ultimately successful it will have to support complex and extensive data
processing in the context of large numbers of users, and the library has
been created and will continue to be developed on the basis that we will
suffer substantial commercial pain if it has bugs or deficiencies. I
personally wrote most of the library, and have 18 years of solid programming
experience. Every day large amounts of Cassandra code are being written here
using the library; if/where problems appear they will be immediately
reported to me and fixed with urgency. Once the venture is in production -
hopefully this is not double-digit weeks away now - this will provide the
best affirmation, but until then the above will have to suffice. (If anyone
else is using Pelops successfully, it would be great to hear.)

2/ Before going into some more technical detail, I just want to reiterate
that fundamentally Pelops is a wrapper to the Thrift API. Therefore, it does
not have particular bearing on the scalability of Cassandra systems per se.
However we do try to add value through our connection pooling and load
balancing strategy, and that is something I will explore a little more
below.

3/ Connection pooling and load balancing: As you know, one of the features
of Pelops is that it separates data processing from lower level details like
connection pooling. One benefit of this approach is that code becomes much
more readable and less bug prone, but a really big benefit is that Pelops is
able to "lend" connections to data processing code only for the moments that
calls to Thrift are in progress. This makes it possible to perform client
load balancing by counting how many "outstanding" Thrift API calls exist to
each node, and always choosing to perform operations against the node that
has the smallest number of Thrift calls running. This is the best
strategy available without actually knowing the CPU/memory etc. load on the
Cassandra nodes - which, anyway, has various pitfalls and would probably
offer only an enhancement, not an alternative system. Using this strategy
adds a little to the complexity of the connection pooling system which of
course increases the surface area for mistakes. It has been working for us,
but I do invite people to code review it and will be very happy to answer
questions and address any issues found.
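
Reduced to a sketch, the selection step amounts to the following (class and
field names here are illustrative, not the actual Pelops internals):

 import java.util.List;
 import java.util.concurrent.atomic.AtomicInteger;

 class NodeContext {
     final String host;
     final AtomicInteger outstanding = new AtomicInteger(0); // calls in flight
     NodeContext(String host) { this.host = host; }
 }

 class LeastLoadedSelector {
     // Choose the node with the fewest outstanding Thrift operations. The
     // caller increments "outstanding" for the duration of each borrowed call.
     static NodeContext pick(List<NodeContext> nodes) {
         NodeContext best = null;
         for (NodeContext n : nodes)
             if (best == null || n.outstanding.get() < best.outstanding.get())
                 best = n;
         return best;
     }
 }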
In terms of how the existing connection pooling system can be improved, I
think in general it is pretty much the best option available now, but
there is one area where I plan an improvement. At the moment, Pelops
maintains a "context" for each node it knows about in the Cassandra cluster.
Each context has a refiller thread, which creates and caches new connections
to the Cassandra node in question with the aim of ensuring a sufficient
number of free connections exist to be available for spikes in usage. You
can configure a target number of connections, a minimum number of free
connections, and a maximum number of connections through the Policy. The
area I see for improvement at the moment, is that each context only has a
single "pool refiller" thread responsible for creating new free connections
when the number falls below a low water mark. It would be better if this was
multi-threaded, since in extreme situations where the buffer was depleted
rapidly, it could be more rapidly restored (since in the synchronous model
presented by Thrift, creating new connections is a blocking operation). This
is quite a minor improvement, but I plan on addressing this shortly.
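
As a hedged sketch (names illustrative; the real logic lives in the Pelops
source), the refiller amounts to a loop like this:

 import java.util.concurrent.BlockingQueue;
 import java.util.concurrent.LinkedBlockingQueue;

 class PoolRefiller implements Runnable {
     private final BlockingQueue<Object> free = new LinkedBlockingQueue<Object>();
     private final int lowWaterMark, target;
     private volatile boolean running = true;

     PoolRefiller(int lowWaterMark, int target) {
         this.lowWaterMark = lowWaterMark;
         this.target = target;
     }

     // Stand-in for a blocking Thrift connect; this is the slow part, and
     // the reason a multi-threaded refiller would restore a depleted pool
     // more quickly.
     Object openConnection() { return new Object(); }

     public void run() {
         while (running) {
             // Top the pool back up whenever it falls below the low water mark.
             if (free.size() < lowWaterMark)
                 while (free.size() < target)
                     free.offer(openConnection()); // serial: one at a time
             try { Thread.sleep(2000); } catch (InterruptedException e) { return; }
         }
     }
 }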
Hope this helps.
Best, Dominic

On 11 June 2010 16:11, Riyad Kalla  wrote:

> Dominic,
>
> I like the API; reads clearly and fairly intuitive.
>
> I think Ian was asking about what large-scale production deployments Pelops
> has been deployed in that you could speak to -- he's trying to get a
> confidence index and I am interested as well ;)
>
> Best,
> Riyad
>
>
> On Fri, Jun 11, 2010 at 7:04 AM, Dominic Williams <
> thedwilli...@googlemail.com> wrote:
>
>> Hi good question.
>>
>> The scalability of Pelops is dependent on Cassandra, not the library
>> itself. The library aims to provide a more effective access layer on top of
>> the Thrift API.
>>
>> The library does perform connection pooling, and you can control the size
>> of the pool and other parameters using a policy object. But connection
>> pooling itself does not increase scalab

Re: batch_mutate atomic?

2010-06-14 Thread Gary Dusbabek
This question has been coming up quite regularly now.  I've added an
entry to the FAQ.  Please feel free to expand and clarify.
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

Gary.

On Mon, Jun 14, 2010 at 06:43, Ran Tavory  wrote:
> No, it's not atomic; it just shortens the round trip of many update requests.
> Some may fail and some may succeed.
>
> On Mon, Jun 14, 2010 at 2:40 PM, Per Olesen  wrote:
>>
>> Can I expect batch_mutate to work in what I would think of as an atomic
>> operation?
>>
>> That either all the mutations in the batch_mutate call are executed or
>> none of them are? Or can some of them fail while some of them succeed?
>>
>
>


Data modelling question

2010-06-14 Thread Per Olesen
Hi,
I have a question that relates to how best to model data. I have some pretty 
simple tabular data, which I am to show to a large number of users, and the 
users need to be able to search some of the columns.

Given this tabular data:

Company    | Amount | ...many more columns here
-----------+--------+--------------------------
Ajax A/S   |  12345 |
Dude A/S   |  54321 |
Ajax A/S   |   5436 |
...many more rows here...

If I need to store this in cassandra, but also be able to search quite fast on 
"Company" and on "Amount", how might I go about storing this? The current plan 
I have for modelling it in Cassandra is to use one CF "Dashboard" for the 
tabular data itself and one for each "index" I would like to be able to 
retrieve it on.

Like this:

Super-CF "Dashboard":

uuid-1 -> { company:"Ajax A/S", Amount:12345 }
uuid-2 -> { company:"Dude A/S", Amount:54321 }
uuid-3 -> { company:"Ajax A/S", Amount:5436 }

Where the SC value simply is a unique identifier.

Super-CF "DashboardCompanyIndex":

"Ajax A/S" -> { uuid-1:"", uuid-3:"" }
"Dude A/S" -> { uuid-2:"" }


Super-CF "DashboardAmountIndex":

"12345" -> { uuid-1:"" }
"54321" -> { uuid-2:"" }
"5436" -> { uuid-3:"" }

So, in my use case, when searching on e.g. company, I can then access the 
"DashboardCompanyIndex" with a slice on its SC and then grab all the uuids from 
the columns, and after this, make a lookup in the Dashboard CF for each uuid 
found in the index.
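
That lookup flow, as a hedged sketch against the 0.6 Thrift API (both CFs
treated as standard CFs for simplicity; names as in the model above):

 import java.util.*;
 import org.apache.cassandra.thrift.*;

 public class IndexLookup {
     // 1) Read the index row for a company; its column names are the uuids
     // of matching Dashboard rows. 2) Fetch all those rows in one round trip.
     static Map<String, List<ColumnOrSuperColumn>> byCompany(
             Cassandra.Client client, String company) throws Exception {
         SlicePredicate all = new SlicePredicate();
         all.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1000));

         List<ColumnOrSuperColumn> entries = client.get_slice("Keyspace1",
             company, new ColumnParent("DashboardCompanyIndex"), all,
             ConsistencyLevel.ONE);

         List<String> uuids = new ArrayList<String>();
         for (ColumnOrSuperColumn cosc : entries)
             uuids.add(new String(cosc.getColumn().getName(), "UTF-8"));

         return client.multiget_slice("Keyspace1", uuids,
             new ColumnParent("Dashboard"), all, ConsistencyLevel.ONE);
     }
 }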

Is this the preferred way of doing this in cassandra?
Or am I trying to apply relational algebra modelling on something that is 
supposed to be used differently?

/Per

RE: Running Cassandra as a Windows Service

2010-06-14 Thread Kochheiser,Todd W - TO-DITT1
I'll put something together and submit it.  Thanks for the help.

Todd

-Original Message-
From: Gary Dusbabek [mailto:gdusba...@gmail.com] 
Sent: Friday, June 11, 2010 4:49 AM
To: user@cassandra.apache.org
Subject: Re: Running Cassandra as a Windows Service

Sure.  Please create a jira ticket
(https://issues.apache.org/jira/browse/CASSANDRA) and attach the files
you wish to contribute.  One of the committers (probably myself) will
review them and decide how to integrate them into the project.

If it's not too much trouble, an ant build script would be excellent
and would help to automate the process of generating testable builds.

Gary.

On Thu, Jun 10, 2010 at 17:31, Kochheiser,Todd W - TO-DITT1
 wrote:
> I agree that bitrot might happen if all of the core Cassandra developers 
> are using Linux. Your suggestion of putting things in a contrib area where 
> curious (or desperate) parties suffering on the Windows platform could pick 
> it up seems like a reasonable place to start.  It might also be an 
> opportunity to increase the number of "application" developers using 
> Cassandra if Cassandra was slightly more approachable on the Windows platform.
>
> Any suggestions on next steps?
>
> Todd.
>
> -Original Message-
> From: Gary Dusbabek [mailto:gdusba...@gmail.com]
> Sent: Thursday, June 10, 2010 10:59 AM
> To: user@cassandra.apache.org
> Subject: Re: Running Cassandra as a Windows Service
>
> IMO this is one of those things that would bitrot fairly quickly if it
> were not maintained.  It may be useful in contrib, where curious
> parties could pick it up, get it back in shape, and send in the
> changes to be committed.
>
> Judging by the sparse interest so far, this probably wouldn't be a
> good fit in core since there don't seem to be many (any?) cassandra
> developers who run windows.
>
> Gary.
>
>
> On Thu, Jun 10, 2010 at 12:34, Kochheiser,Todd W - TO-DITT1
>  wrote:
>> For various reasons I am required to deploy systems on Windows.  As such, I
>> went looking for information on running Cassandra as a Windows service.
>> I've read some of the user threads regarding running Cassandra as a Windows
>> service, such as this one:
>>
>>     http://www.mail-archive.com/user@cassandra.apache.org/msg01656.html
>>
>> I also found the following JIRA issue:
>>
>>     https://issues.apache.org/jira/browse/CASSANDRA-292
>>
>> As it didn't look like anyone has contributed a formal solution and having
>> some experience using Apache's Procrun
>> (http://commons.apache.org/daemon/procrun.html), I decided to go ahead and
>> write a batch script and a simple "WindowsService" class to accomplish the
>> task.  The WindowsService class only makes calls to public methods in
>> CassandraDeamon and is fairly simple.  In combination with the batch script,
>> it is very easy to install and remove the service.  At this point, I've
>> installed Cassandra as a Windows service on XP (32 bit), Windows 7 (64 bit)
>> and Windows Server 2008 R1/R2 (64 bit).  It should work fine on other
>> versions of Windows (2K, 2K3).
>>
>> Questions:
>>
>>
>> Has anyone else already done this work?
>> If not, I wouldn't mind sharing the code/script or contributing it back to
>> the project.  Is there any interest in this from the Cassandra dev team or
>> the user community?
>>
>>
>> Ideally the WindowsService could be included in the distributed
>> source/binary distributions (perhaps in a contrib area) as well as the batch
>> script and associated procrun executables.  Or, perhaps it could be posted
>> to a Cassandra community site (is there one?).
>>
>> Todd
>>
>>
>>
>>
>>
>
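
The shape of such a wrapper, as a minimal sketch (the CassandraDaemon
lifecycle method names here are assumptions - check the public methods your
Cassandra version actually exposes):

 import org.apache.cassandra.thrift.CassandraDaemon;

 public class WindowsService {
     private static final CassandraDaemon daemon = new CassandraDaemon();

     // procrun --StartClass/--StartMethod entry point
     public static void start(String[] args) throws Exception {
         daemon.activate(); // assumed public start-up method
     }

     // procrun --StopClass/--StopMethod entry point
     public static void stop(String[] args) {
         daemon.deactivate(); // assumed public shut-down method
         System.exit(0);
     }
 }

The batch script would then register the service with something along the
lines of: prunsrv.exe //IS//Cassandra --StartMode jvm --StartClass
WindowsService --StartMethod start --StopMode jvm --StopClass WindowsService
--StopMethod stop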


Re: Pelops - a new Java client library paradigm

2010-06-14 Thread Jonathan Ellis
That's the tradeoff we made to get basic functionality for a dozen or
so languages for free; it's impossible to be idiomatic with Thrift.

The glass-half-full view is, having lots of API wrappers shows that
building on Thrift is far easier than throwing bytes around at the
socket layer the way a traditional binary API would require. :)

On Mon, Jun 14, 2010 at 4:22 AM, Torsten Curdt  wrote:
> Also think this looks really promising.
> The fact that there are so many API wrappers now (3?) doesn't reflect
> well on the native API though :)
>
> /me ducks and runs
>
> On Mon, Jun 14, 2010 at 11:55, Dominic Williams
>  wrote:
>> Hi Ran, thanks for the compliment. It is true that we benefited enormously
>> from batch mutate. Without that the Mutator/Selector paradigm would not have
>> been possible in the same way. It will be interesting to see where Cassandra
>> takes us next. Best, Dominic
>>
>> On 12 June 2010 20:05, Ran Tavory  wrote:
>>>
>>> Nice going, Dominic, having a clear API for cassandra is a big step
>>> forward :)
>>> Interestingly, at hector we came up with a similar approach, just didn't
>>> find the time to code it, as production systems keep me busy at nights as
>>> well... We started with the implementation of BatchMutation, but the rest of
>>> the API improvements are still TODO.
>>> Keep up the good work, competition keeps us healthy ;)
>>>
>>> On Fri, Jun 11, 2010 at 4:41 PM, Dominic Williams
>>>  wrote:

 Pelops is a new high quality Java client library for Cassandra.
 It has a design that:
 * reveals the full power of Cassandra through an elegant "Mutator and
 Selector" paradigm
 * generates better, cleaner, less bug prone code
 * reduces the learning curve for new users
 * drives rapid application development
 * encapsulates advanced pooling algorithms
 An article introducing Pelops can be found at

 http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/
 Thanks for reading.
 Best, Dominic
>>
>>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Data format stability

2010-06-14 Thread Matthew Conway

On Jun 13, 2010, at Sun Jun 13, 9:34 PM, Benjamin Black wrote:

> On Sun, Jun 13, 2010 at 5:58 PM, Matthew Conway  wrote:
>> The ability to dynamically add new column families.  Our app is currently 
>> under heavy development, and we will be adding new column families at least 
>> once a week after we have shipped the initial production app. From the 
>> existing docs, it seemed to me that the procedure for changing schema in 0.6 
>> is very manual in nature and thus error prone and likely to cause data 
>> corruption.  Feel free to correct me if I'm wrong :)
>> 
> 
> I do schema manipulations in 0.6 regularly.  The answer is automation.

I already have automation, what's missing are the details of the exact steps I 
need to automate to accomplish the schema modification on a live cluster.  Even 
the FAQ just points to the feature in 0.7 trunk.  :) If all I need to do is add 
new column families, is adding them to the storage-conf.xml and doing a rolling 
restart of all nodes sufficient?

> As for data corruption: what did you read that gave you that
> impression?

I forget where I saw that; it's been a while since I did the initial digging. 
Maybe it was an incorrect impression, or it only related to schema renames.

> If this is the only motivator and you are really only changing things
> once/week or so, I suggest sticking with 0.6 and figuring out some
> automation.  You should be using it, anyway.
> 

Even if I were to use 0.6, the same question stands - if the data format is 
going to change, I need to know if there will be an upgrade path between the 
formats.  I'm using trunk right now, but once I find a version that works for 
me (i.e. without the file descriptor leaks) I'm unlikely to upgrade again till 
0.7 is released.

Matt



RE: Pelops - a new Java client library paradigm

2010-06-14 Thread Kochheiser,Todd W - TO-DITT1
Great API that looks easy and intuitive to use.  Regarding your connection pool 
implementation, how does it handle failed/crashed nodes?  Will the pool 
auto-detect failed nodes via a "tester" thread or will a failed node, and hence 
its pooled connection(s), be removed only when they are used?  Conversely, how 
will the pool be repopulated once the failed/crashed node becomes available?

Todd


From: Dominic Williams [mailto:thedwilli...@googlemail.com]
Sent: Friday, June 11, 2010 7:05 AM
To: user@cassandra.apache.org
Subject: Re: Pelops - a new Java client library paradigm

Hi good question.

The scalability of Pelops is dependent on Cassandra, not the library itself. 
The library aims to provide a more effective access layer on top of the Thrift 
API.

The library does perform connection pooling, and you can control the size of 
the pool and other parameters using a policy object. But connection pooling 
itself does not increase scalability, only efficiency.

Hope this helps.
Best, Dominic

On 11 June 2010 14:47, Ian Soboroff <isobor...@gmail.com> wrote:
Sounds nice.  Can you say something about the scales at which you've used this 
library?  Both write and read load?  Size of clusters and size of data?

Ian

On Fri, Jun 11, 2010 at 9:41 AM, Dominic Williams <thedwilli...@googlemail.com> wrote:
Pelops is a new high quality Java client library for Cassandra.

It has a design that:
* reveals the full power of Cassandra through an elegant "Mutator and Selector" 
paradigm
* generates better, cleaner, less bug prone code
* reduces the learning curve for new users
* drives rapid application development
* encapsulates advanced pooling algorithms

An article introducing Pelops can be found at
http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/

Thanks for reading.
Best, Dominic




Re: Data format stability

2010-06-14 Thread Jonathan Ellis
On Mon, Jun 14, 2010 at 7:27 AM, Matthew Conway  wrote:
>
> On Jun 13, 2010, at Sun Jun 13, 9:34 PM, Benjamin Black wrote:
>
>> On Sun, Jun 13, 2010 at 5:58 PM, Matthew Conway  wrote:
>>> The ability to dynamically add new column families.  Our app is currently 
>>> under heavy development, and we will be adding new column families at least 
>>> once a week after we have shipped the initial production app. From the 
>>> existing docs, it seemed to me that the procedure for changing schema in 
>>> 0.6 is very manual in nature and thus error prone and likely to cause data 
>>> corruption.  Feel free to correct me if I'm wrong :)
>>>
>>
>> I do schema manipulations in 0.6 regularly.  The answer is automation.
>
> I already have automation, what's missing are the details of the exact steps I 
> need to automate to accomplish the schema modification on a live cluster.  
> Even the FAQ just points to the feature in 0.7 trunk.

Huh?  http://wiki.apache.org/cassandra/FAQ#modify_cf_config

> Even if I were to use 0.6, the same question stands - if the data format is 
> going to change, I need to know if there will be an upgrade path between the 
> formats.  I'm using trunk right now, but once I find a version that works for 
> me (i.e. without the file descriptor leaks) I'm unlikely to upgrade again 
> till 0.7 is released.

0.7 will be able to read 0.6 data files.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Data format stability

2010-06-14 Thread Matthew Conway

On Jun 14, 2010, at Mon Jun 14, 10:45 AM, Jonathan Ellis wrote:
>> 
>> I already have automation, what's missing are the details of the exact steps 
>> I need to automate to accomplish the schema modification on a live cluster.  
>> Even the FAQ just points to the feature in 0.7 trunk.
> 
> Huh?  http://wiki.apache.org/cassandra/FAQ#modify_cf_config
> 

Sorry, I was thinking about this: 
http://wiki.apache.org/cassandra/FAQ#no_keyspaces

Matt



RE: Pelops - a new Java client library paradigm

2010-06-14 Thread Dop Sun
Good to have new API wrappers.

I guess there are different APIs because people look at Cassandra from
different angles, with different backgrounds/skills. At this stage, it's
better that the different APIs find the good ideas in each other, and maybe
one day there is one which is widely accepted. That's good for all.

Thanks,
Regards,
Dop 

-Original Message-
From: Torsten Curdt [mailto:tcu...@vafer.org] 
Sent: Monday, June 14, 2010 7:22 PM
To: user@cassandra.apache.org
Subject: Re: Pelops - a new Java client library paradigm

Also think this looks really promising.
The fact that there are so many API wrappers now (3?) doesn't reflect
well on the native API though :)

/me ducks and runs

On Mon, Jun 14, 2010 at 11:55, Dominic Williams
 wrote:
> Hi Ran, thanks for the compliment. It is true that we benefited enormously
> from batch mutate. Without that the Mutator/Selector paradigm would not have
> been possible in the same way. It will be interesting to see where Cassandra
> takes us next. Best, Dominic
>
> On 12 June 2010 20:05, Ran Tavory  wrote:
>>
>> Nice going, Dominic, having a clear API for cassandra is a big step
>> forward :)
>> Interestingly, at hector we came up with a similar approach, just didn't
>> find the time to code it, as production systems keep me busy at nights as
>> well... We started with the implementation of BatchMutation, but the rest of
>> the API improvements are still TODO.
>> Keep up the good work, competition keeps us healthy ;)
>>
>> On Fri, Jun 11, 2010 at 4:41 PM, Dominic Williams
>>  wrote:
>>>
>>> Pelops is a new high quality Java client library for Cassandra.
>>> It has a design that:
>>> * reveals the full power of Cassandra through an elegant "Mutator and
>>> Selector" paradigm
>>> * generates better, cleaner, less bug prone code
>>> * reduces the learning curve for new users
>>> * drives rapid application development
>>> * encapsulates advanced pooling algorithms
>>> An article introducing Pelops can be found at
>>>
>>>
>>> http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/
>>> Thanks for reading.
>>> Best, Dominic
>
>




Re: Pelops - a new Java client library paradigm

2010-06-14 Thread Dominic Williams
Hi, re: pools and detecting node failure...

Pooling is handled by ThriftPool. This class maintains a separate
NodeContext object for each known node. This in turn maintains a pool of
connections to its node.

Each NodeContext has a single "poolRefiller" object/thread, which runs
either when signalled, or every ~2s, whichever is the sooner. Whenever it
runs, the first thing it does is check which of its existing pooled
connections are open. This is necessary for it to correctly calculate the
number of new connections to open (assuming it has to)

To check whether a connection is open, it calls TTransport.isOpen, which is
TSocket.isOpen, which is Socket.isConnected. If a connection is not open,
then it is binned.

Therefore, pretty quickly if a node has failed, the NodeContext will not be
holding any connections to it. This causes the NodeContext.isAvailable
method to return false. When this is the case, that node is not considered
by ThriftPool when it is seeking to return a connection to an operand
(Mutator, Selector, KeyDeletor etc object)

The pool refiller thread keeps on trying to create connections to a node,
even after all connections to it have failed. When/if it becomes available
again, then as soon as a connection is made NodeContext.isAvailable will
return true and it comes "back online" for the purposes of the operands.

NOTE: Some of my colleagues were working on Windows machines separated from
our local development servers by low-end NAT routers. After some period of
using Cassandra like this, even though TSocket.isOpen was returning true
inside Pelops, when an operand tried using connections it was getting a
timeout or other network exception. Calling setKeepAlive(true) on the
underlying socket does not prevent this (although this option is best set
because in general it should force timely detection of connection failure).
Hector also experienced similar problems and we adopt a similar response:
you'll see that Pelops sets Policy.getKillNodeConnsOnException() to true by
default. What this means is that if a network exception is thrown when an
operand interacts with a node, the NodeContext destroys all pooled
connections to that node, on the basis that the general failure of
connections to that node may not be detectable because of the network setup.
Of course, not many people will be running their Cassandra clients from
Windows behind NAT in production, but the option is set by default because
otherwise a segment of developers trying the library would experience
persistent problems due to this network (and/or Thrift) strangeness. In
production we ourselves will switch it off (although note the downside is
that the occasional network error to a node will cause the refreshing of
its pool).
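
For reference, enabling keep-alive on the socket underneath a Thrift
connection looks like this (a minimal sketch; error handling is left to the
caller):

 import org.apache.thrift.transport.TSocket;

 public class KeepAliveConnect {
     // Open a Thrift socket with TCP keep-alive set, so that dead peers are
     // detected in a timely way.
     static TSocket open(String host, int port) throws Exception {
         TSocket socket = new TSocket(host, port);
         socket.getSocket().setKeepAlive(true); // set before connecting
         socket.open();
         return socket;
     }
 }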

Hope this makes sense.
Best, Dominic

On 14 June 2010 15:32, Kochheiser,Todd W - TO-DITT1 wrote:

>  Great API that looks easy and intuitive to use.  Regarding your
> connection pool implementation, how does it handle failed/crashed nodes?
> Will the pool auto-detect failed nodes via a “tester” thread or will a
> failed node, and hence its pooled connection(s), be removed only when they
> are used?  Conversely, how will the pool be repopulated once the
> failed/crashed node becomes available?
>
>
>
> Todd
>
>
>  --
>
> *From:* Dominic Williams [mailto:thedwilli...@googlemail.com]
> *Sent:* Friday, June 11, 2010 7:05 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Pelops - a new Java client library paradigm
>
>
>
> Hi good question.
>
>
>
> The scalability of Pelops is dependent on Cassandra, not the library
> itself. The library aims to provide a more effective access layer on top of
> the Thrift API.
>
>
>
> The library does perform connection pooling, and you can control the size
> of the pool and other parameters using a policy object. But connection
> pooling itself does not increase scalability, only efficiency.
>
>
>
> Hope this helps.
>
> Best, Dominic
>
>
>
> On 11 June 2010 14:47, Ian Soboroff  wrote:
>
> Sounds nice.  Can you say something about the scales at which you've used
> this library?  Both write and read load?  Size of clusters and size of data?
>
> Ian
>
>
>
> On Fri, Jun 11, 2010 at 9:41 AM, Dominic Williams <
> thedwilli...@googlemail.com> wrote:
>
> Pelops is a new high quality Java client library for Cassandra.
>
>
>
> It has a design that:
>
> * reveals the full power of Cassandra through an elegant "Mutator and
> Selector" paradigm
>
> * generates better, cleaner, less bug prone code
>
> * reduces the learning curve for new users
>
> * drives rapid application development
>
> * encapsulates advanced pooling algorithms
>
>
>
> An article introducing Pelops can be found at
>
>
> http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/
>
>
>
> Thanks for reading.
>
> Best, Dominic
>
>
>
>
>


Re: File Descriptor leak

2010-06-14 Thread Matthew Conway
Done, https://issues.apache.org/jira/browse/CASSANDRA-1188
It was find_all_by_service_id that was the culprit, and it resolves down to a 
multiget_slice on a super column family.  The super CF is acting as an index 
back into a regular CF: I'm providing a key and supercolumn name, and getting 
back a set of columns which have date UUID names and whose values are keys 
back into the regular CF.  The leaked file descriptors seem to be for the 
super CF data file.

Matt

On Jun 13, 2010, at Sun Jun 13, 3:32 PM, Jonathan Ellis wrote:

> Can you open a new ticket, then?  Preferably with the thrift code
> involved, I'm not sure what find_by_natural_key or find_all_by_service
> is translating into.  (It looks like just one of those is responsible
> for the leak.)




Re: scans stopped returning values for some keys

2010-06-14 Thread Pawel Dabrowski
Hi,

thanks for your answer.
My comparators for this CF were:
CompareWith="LongType", CompareSubcolumnWith="UTF8Type" (which is not really 
correct, as I use Longs for both, but I guess it should not cause such an 
error).

I didn't test for empty start, and I already refactored the code not to do any 
deletes, so I can't reproduce the situation easily. 
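
For reference, the scan in question looked roughly like this against the 0.6
Thrift API (a hedged sketch; keyspace/CF names are illustrative, and it uses
the empty start Jonathan suggests below rather than Long.MAX_VALUE):

 import java.util.List;
 import org.apache.cassandra.thrift.*;

 public class LatestSuperColumn {
     // Newest super column for a product row: a reversed slice with count=1.
     static List<ColumnOrSuperColumn> latest(Cassandra.Client client,
             String productId) throws Exception {
         SlicePredicate predicate = new SlicePredicate();
         predicate.setSlice_range(
             new SliceRange(new byte[0], new byte[0], true /* reversed */, 1));
         return client.get_slice("Keyspace1", productId,
             new ColumnParent("Aggregates"), predicate, ConsistencyLevel.ONE);
     }
 }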

regards
Pawel

On 2010-06-10, at 23:26, Jonathan Ellis wrote:

> How is your CF defined?  (what comparator?)
> 
> did you try start=empty byte array instead of Long.MAX_VALUE?
> 
> On Wed, Jun 9, 2010 at 8:06 AM, Pawel Dabrowski  wrote:
>> Hi,
>> 
>> I'm using Cassandra to store some aggregated data in a structure like this:
>> 
>> KEY - product_id
>> SUPER COLUMN NAME - timestamp
>> and in the super column, I have a few columns with actual data.
>> 
>> I am using a scan operation to find the latest super column 
>> (start=Long.MAX_VALUE, reversed=true, count=1) for a key, which worked fine 
>> for quite some time.
>> But recently I needed to remove some of the columns within the super columns.
>> After that things got weird: for some keys, the scan for latest super column 
>> work normally, but for some of them they stopped returning any results. I 
>> checked the data using the CLI and the data is obviously there. I can get it 
>> if I specify the super column name, but scanning for latest does not work. 
>> If I scan for previous data (start=some other timestamp less than maximum 
>> timestamp in cassandra), it works fine.
>> I compared the data for keys that work, and those that don't, but there is 
>> no difference - the super column names are exactly the same and they contain 
>> the same amounts of columns.
>> 
>> But the really weird thing is that the scans did not stop working 
>> immediately after some columns were removed. I was able to scan for the data 
>> and verify that the columns were removed correctly and only after a couple 
>> of minutes some scans stopped returning data. When I looked in the log, I've 
>> seen that Cassandra has been doing some compacting, flushing and deleting of 
>> .db files more or less at the time that the scans stopped working.
>> I tried restarting Cassandra, but it did not help.
>> Anyone had a similar problem?
>> 
>> regards
>> Pawel Dabrowski
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com



Re: Data modelling question

2010-06-14 Thread Benjamin Black
On Mon, Jun 14, 2010 at 6:09 AM, Per Olesen  wrote:
>
> So, in my use case, when searching on e.g. company, I can then access the 
> "DashboardCompanyIndex" with a slice on its SC and then grab all the uuids 
> from the columns, and after this, make a lookup in the Dashboard CF for each 
> uuid found in the index.
>

That's the normal way to do it.


b


help for designing a cassandra

2010-06-14 Thread Johannes Weissensel
Hi everyone,
I am new to NoSQL databases and especially column-oriented databases
like Cassandra.
I am a student of information systems and I am evaluating a fitting NoSQL
database for a web analytics system. The use case is data like a
webserver logfile.
In an RDBMS it would be a row in the database for every hit, and then
endless grouping and counting on the data to get the metrics you
want.
Does anyone have experience with data like that in Hypertable, and
how should I design the database?
A single row for every hit, or maybe an aggregated version of the data
for every session, or a single aggregated version for every day and
every page?
Maybe someone has an idea of how to design the database? Just like a
typical non-normalized SQL database?
Hope you have some ideas :)
Johannes


Re: Data modelling question

2010-06-14 Thread Per Olesen

On Jun 14, 2010, at 6:29 PM, Benjamin Black wrote:

> On Mon, Jun 14, 2010 at 6:09 AM, Per Olesen  wrote:
>> 
>> So, in my use case, when searching on e.g. company, I can then access the 
>> "DashboardCompanyIndex" with a slice on its SC and then grab all the uuids 
>> from the columns, and after this, make a lookup in the Dashboard CF for each 
>> uuid found in the index.
>> 
> 
> That's the normal way to do it.

Okay. Thanks! Nice to know I am on the right path then :-)

I have a little follow-up question: As I asked here 
http://www.mail-archive.com/user@cassandra.apache.org/msg03498.html and was 
kindly answered (no, batch_mutate is not atomic), how do people then deal 
with updating two or more CFs (the Dashboard CF and the index CFs) when a 
failure occurs in between?

I thought about a model where I update the index CF first. The actual data 
insert in the Dashboard CF can then fail, so there can be entries in the index 
CF that point to no rows in the Dashboard CF. I could then have a periodic job 
that cleans up index entries that have no entry in the Dashboard CF.

Is that the way to work around no atomic updates? 
Or is there another (better) way to organize data for searching?
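
For concreteness, the ordering described above as a hedged sketch against the
0.6 Thrift API (CF layout simplified to standard column families; names
illustrative):

 import org.apache.cassandra.thrift.*;

 public class IndexFirstWrite {
     // Write the index entry first, then the data row. If the second insert
     // fails, the periodic cleanup job finds index entries that reference no
     // Dashboard row and deletes them.
     static void write(Cassandra.Client client, String company, String uuid)
             throws Exception {
         long ts = System.currentTimeMillis() * 1000;

         ColumnPath idx = new ColumnPath("DashboardCompanyIndex");
         idx.setColumn(uuid.getBytes("UTF-8"));
         client.insert("Keyspace1", company, idx, new byte[0], ts,
                       ConsistencyLevel.QUORUM);

         ColumnPath data = new ColumnPath("Dashboard");
         data.setColumn("company".getBytes("UTF-8"));
         client.insert("Keyspace1", uuid, data, company.getBytes("UTF-8"), ts,
                       ConsistencyLevel.QUORUM);
     }
 }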



RE: Pelops - a new Java client library paradigm

2010-06-14 Thread Kochheiser,Todd W - TO-DITT1
Thank you for the very clear and detailed explanation of how the pool works.   
I think I'll give Pelops a try.

Todd


From: Dominic Williams [mailto:thedwilli...@googlemail.com]
Sent: Monday, June 14, 2010 8:16 AM
To: user@cassandra.apache.org
Subject: Re: Pelops - a new Java client library paradigm

Hi, re: pools and detecting node failure...

Pooling is handled by ThriftPool. This class maintains a separate NodeContext 
object for each known node. This in turn maintains a pool of connections to its 
node.

Each NodeContext has a single "poolRefiller" object/thread, which runs either 
when signalled, or every ~2s, whichever is the sooner. Whenever it runs, the 
first thing it does is check which of its existing pooled connections are open. 
This is necessary for it to correctly calculate the number of new connections 
to open (assuming it has to)

To check whether a connection is open, it calls TTransport.isOpen, which is 
TSocket.isOpen, which is Socket.isConnected. If a connection is not open, then 
it is binned.

Therefore, pretty quickly if a node has failed, the NodeContext will not be 
holding any connections to it. This causes the NodeContext.isAvailable method 
to return false. When this is the case, that node is not considered by 
ThriftPool when it is seeking to return a connection to an operand (Mutator, 
Selector, KeyDeletor etc object)

The pool refiller thread keeps on trying to create connections to a node, even 
after all connections to it have failed. When/if it becomes available again, 
then as soon as a connection is made NodeContext.isAvailable will return true 
and it comes "back online" for the purposes of the operands.

NOTE: Some of my colleagues were working on Windows machines separated from our 
local development servers by low-end NAT routers. After some period of using 
Cassandra like this, even though TSocket.isOpen was returning true inside 
Pelops, when an operand tried using connections it was getting a timeout or 
other network exception. Calling setKeepAlive(true) on the underlying socket 
does not prevent this (although this option is best set because in general it 
should force timely detection of connection failure). Hector also experienced 
similar problems and we adopt a similar response: you'll see that Pelops sets 
Policy.getKillNodeConnsOnException() to true by default. What this means 
is that if a network exception is thrown when an operand interacts with a node, 
the NodeContext destroys all pooled connections to that node, on the basis 
that the general failure of connections to that node may not be detectable 
because of the network setup. Of course, not many people will be running their 
Cassandra clients from Windows behind NAT in production, but the option is set 
by default because otherwise a segment of developers trying the library would 
experience persistent problems due to this network (and/or Thrift) strangeness. 
In production we ourselves will switch it off (although note the downside is 
that the occasional network error to a node will cause the refreshing of its 
pool).

Hope this makes sense.
Best, Dominic

On 14 June 2010 15:32, Kochheiser,Todd W - TO-DITT1 <twkochhei...@bpa.gov> wrote:
Great API that looks easy and intuitive to use.  Regarding your connection pool 
implementation, how does it handle failed/crashed nodes?  Will the pool 
auto-detect failed nodes via a "tester" thread or will a failed node, and hence 
its pooled connection(s), be removed only when they are used?  Conversely, how 
will the pool be repopulated once the failed/crashed node becomes available?

Todd


From: Dominic Williams [mailto:thedwilli...@googlemail.com]
Sent: Friday, June 11, 2010 7:05 AM
To: user@cassandra.apache.org
Subject: Re: Pelops - a new Java client library paradigm

Hi good question.

The scalability of Pelops is dependent on Cassandra, not the library itself. 
The library aims to provide a more effective access layer on top of the Thrift 
API.

The library does perform connection pooling, and you can control the size of 
the pool and other parameters using a policy object. But connection pooling 
itself does not increase scalability, only efficiency.

Hope this helps.
Best, Dominic

On 11 June 2010 14:47, Ian Soboroff <isobor...@gmail.com> wrote:
Sounds nice.  Can you say something about the scales at which you've used this 
library?  Both write and read load?  Size of clusters and size of data?

Ian

On Fri, Jun 11, 2010 at 9:41 AM, Dominic Williams <thedwilli...@googlemail.com> wrote:
Pelops is a new high quality Java client library for Cassandra.

It has a design that:
* reveals the full power of Cassandra through an elegant "Mutator and Selector" 
paradigm
* generates better, cleaner, less bug prone code
* reduces the learning curve for new users
* drives rapid applica

RE: Pelops - a new Java client library paradigm

2010-06-14 Thread Kochheiser,Todd W - TO-DITT1
I checked out the source and noticed a few things:


 1.  You did not include an Ant build file.  Not a big deal, but if you happen 
to have one it would be nice to have.
 2.  It appears you built against Cassandra 0.6.0.  Have you built and/or run 
Pelops against 0.6.2 or trunk?

Todd


From: Dominic Williams [mailto:thedwilli...@googlemail.com]
Sent: Friday, June 11, 2010 6:42 AM
To: user@cassandra.apache.org
Subject: Pelops - a new Java client library paradigm

Pelops is a new high quality Java client library for Cassandra.

It has a design that:
* reveals the full power of Cassandra through an elegant "Mutator and Selector" 
paradigm
* generates better, cleaner, less bug prone code
* reduces the learning curve for new users
* drives rapid application development
* encapsulates advanced pooling algorithms

An article introducing Pelops can be found at
http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/

Thanks for reading.
Best, Dominic


Re: Pelops - a new Java client library paradigm

2010-06-14 Thread Dominic Williams
Hi Todd, we're on 0.6.0 or 0.6.1, but I would hope it should work with 0.6.2.

We will also be moving to 0.6.2 very shortly because we need the column
timeout feature that a 0.6.2 patch has been submitted for (I haven't checked,
maybe that feature is actually in the 0.6.2 trunk now - that would be great?)

In relation to adding support for new features like the column timeout, this
will be done using overloading to be non-breaking. I guess it would be nice
to pull the Cassandra version from the Thrift jar somehow and check that the
version is >= what's needed in the new functions. If anyone already knows how
to extract the Cassandra version from the Thrift jar, it would be great to hear.

Re: the Ant file, if someone can submit one I'll check it in. I'm currently
lifting Pelops directly out of our main project tree and copying it to Google
Code, but Ant would be better for distribution.

Best, Dominic

On 14 June 2010 20:33, Kochheiser,Todd W - TO-DITT1 wrote:

>  I checked out the source and noticed a few things:
>
>
>
>1. You did not include an Ant build file.  Not a big deal, but if you
>happen to have one it would be nice to have.
>2. It appears you built against Cassandra 0.6.0.  Have you built and/or
>run Pelops against 0.6.2 or trunk?
>
>
>
> Todd
>
>
>  --
>
> *From:* Dominic Williams [mailto:thedwilli...@googlemail.com]
> *Sent:* Friday, June 11, 2010 6:42 AM
>
> *To:* user@cassandra.apache.org
> *Subject:* Pelops - a new Java client library paradigm
>
>
>
> Pelops is a new high quality Java client library for Cassandra.
>
>
>
> It has a design that:
>
> * reveals the full power of Cassandra through an elegant "Mutator and
> Selector" paradigm
>
> * generates better, cleaner, less bug prone code
>
> * reduces the learning curve for new users
>
> * drives rapid application development
>
> * encapsulates advanced pooling algorithms
>
>
>
> An article introducing Pelops can be found at
>
>
> http://ria101.wordpress.com/2010/06/11/pelops-the-beautiful-cassandra-database-client-for-java/
>
>
>
> Thanks for reading.
>
> Best, Dominic
>


Re: Pelops - a new Java client library paradigm

2010-06-14 Thread Jonathan Ellis
On Mon, Jun 14, 2010 at 2:28 PM, Dominic Williams
 wrote:
> Hi Todd, we're on 0.6.0 or 0.6.1, but I would hope it should work with 0.6.2.
> We will also be moving to 0.6.2 very shortly because we need the column
> timeout feature that a 0.6.2 patch has been submitted for (I haven't checked,
> maybe that feature is actually in the 0.6.2 trunk now - that would be great?)

No, column ttl is too invasive to add to a stable release series.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


JVM Options for Production

2010-06-14 Thread Anthony Molinaro
Hi,

  I was updating to the newer 0.6.3 and happened to remember that back in
0.6.2 I noticed this change in CHANGES.txt:

 * improve default JVM GC options (CASSANDRA-1014)

Looking at that ticket, I don't actually see the options listed or a
reason for why they changed.  Also, I'm not certain which options are
now recommended for a production system versus what's in the distribution.

The distribution (well svn) for 0.6.x currently has

JVM_OPTS=" \
-ea \
-Xms256M \
-Xmx1G \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled \
-XX:SurvivorRatio=8 \
-XX:MaxTenuringThreshold=1 \
-XX:+HeapDumpOnOutOfMemoryError \
-Dcom.sun.management.jmxremote.port=8080 \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false"

Now I would assume that for 'production' you want to remove
   -ea
and
   -XX:+HeapDumpOnOutOfMemoryError

as well as adjust -Xms and -Xmx accordingly, but are there any others
which should be tweaked?  Is there actually a recommended production
set of values or does it vary greatly from installation to installation?

Thanks,

-Anthony

-- 

Anthony Molinaro   


java.lang.OutofMemoryerror: Java heap space

2010-06-14 Thread Caribbean410
Hi,

I wrote 200k records to the db, each record 5MB. I get this error when I use
3 threads (each thread tries to read 200k records in total, 100 records at a
time) to read data from the db. The write is OK; the error comes from the read.
Right now the Xmx of the JVM is 1GB. I changed it to 2GB, still not working. If
the record size is under 4K, I do not get this error. Any clues to avoid
this error?

Thx


Re: read operation is slow

2010-06-14 Thread Caribbean410
Now I read 100 records each time, and the total time to read 200k records
(1M each) reduces to 10s. Looks good. But I am still curious how to handle
the case where users read one record at a time.
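
For the one-record-at-a-time case, the batching Dop suggests below would look
something like this with the Jassandra criteria API used in this thread (a
hedged sketch; Jassandra imports are omitted, and the criteria handle comes
from cf.createCriteria() as in my test code further down):

 import java.util.List;
 import java.util.Map;

 public class BatchedRead {
     // Combine many keys into one select, so that e.g. 100 single-record
     // reads cost a single Thrift call instead of 100.
     static Map<String, List<IColumn>> readBatch(ICriteria criteria,
             List<String> keys, String nameFirst) {
         criteria.keyList(keys).columnRange(nameFirst, nameFirst, 10);
         return criteria.select();
     }
 }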

On Fri, Jun 11, 2010 at 6:05 PM, Dop Sun  wrote:

>  And also, you are only selecting *1* key and *10* columns?
>
>
>
> criteria.keyList(Lists.newArrayList(userName)).columnRange(nameFirst,
> nameFirst, 10);
>
>
>
> Then, if you have 200k keys, you have 200k Thrift calls.  If this is the
> case, you may need to optimize the way you do the query (to combine multiple
> keys into a single query), and to reduce the number of calls.
>
>
>
> *From:* Dop Sun [mailto:su...@dopsun.com]
> *Sent:* Saturday, June 12, 2010 8:57 AM
>
> *To:* user@cassandra.apache.org
> *Subject:* RE: read operation is slow
>
>
>
> You mean after you “I removed some unnecessary column families and changed
> the size of the rowcache and keycache; now the latency changes from 0.25ms to
> 0.09ms. In essence 0.09ms*200k=18s.”, it still takes 400 seconds to
> return?
>
>
>
> *From:* Caribbean410 [mailto:caribbean...@gmail.com]
> *Sent:* Saturday, June 12, 2010 8:48 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: read operation is slow
>
>
>
> Hi, do you mean this one should not introduce much extra delay? To read a
> record, I need a select here; not sure where the extra delay comes from.
>
> On Fri, Jun 11, 2010 at 5:29 PM, Dop Sun  wrote:
>
> Jassandra is used here:
>
>
>
> Map<String, List<IColumn>> map = criteria.select();
>
>
>
> The select here basically is a call to Thrift API: get_range_slices
>
>
>
>
>
> *From:* Caribbean410 [mailto:caribbean...@gmail.com]
> *Sent:* Saturday, June 12, 2010 8:00 AM
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: read operation is slow
>
>
>
> I removed some unnecessary column families and changed the size of the rowcache
> and keycache; now the latency changes from 0.25ms to 0.09ms. In essence
> 0.09ms*200k=18s. I don't know why it takes more than 400s in total. Here is the
> client code and cfstats. There are not many operations here; why is the
> extra time so large?
>
>
>
>   long start = System.currentTimeMillis();
>   for (int j = 0; j < 1; j++) {
>       for (int i = 0; i < numOfRecords; i++) {
>           int n = random.nextInt(numOfRecords);
>           ICriteria criteria = cf.createCriteria();
>           userName = keySet[n];
>           criteria.keyList(Lists.newArrayList(userName))
>                   .columnRange(nameFirst, nameFirst, 10);
>           Map<String, List<IColumn>> map = criteria.select();
>           List<IColumn> list = map.get(userName);
>           // ByteArray bloc = list.get(0).getValue();
>           // byte[] byteArrayloc = bloc.toByteArray();
>           // loc = new String(byteArrayloc);
>           // readBytes = readBytes + loc.length();
>           readBytes = readBytes + blobSize;
>       }
>   }
>
>   long finish = System.currentTimeMillis();
>
>   float totalTime = (finish - start) / 1000;
>
>
> Keyspace: Keyspace1
> Read Count: 60
> Read Latency: 0.090530067 ms.
> Write Count: 20
> Write Latency: 0.01504989 ms.
> Pending Tasks: 0
> Column Family: Standard2
> SSTable count: 3
> Space used (live): 265990358
> Space used (total): 265990358
> Memtable Columns Count: 2615
> Memtable Data Size: 2667300
> Memtable Switch Count: 3
> Read Count: 60
> Read Latency: 0.091 ms.
> Write Count: 20
> Write Latency: 0.015 ms.
> Pending Tasks: 0
> Key cache capacity: 1000
> Key cache size: 187465
> Key cache hit rate: 0.0
> Row cache capacity: 1000
> Row cache size: 189990
> Row cache hit rate: 0.68335
> Compacted row minimum size: 0
> Compacted row maximum size: 0
> Compacted row mean size: 0
>
> 
> Keyspace: system
> Read Count: 1
> Read Latency: 10.954 ms.
> Write Count: 4
> Write Latency: 0.28075 ms.
> Pending Tasks: 0
> Column Family: HintsColumnFamily
> SSTable count: 0
> Space used (live): 0
> Space used (total): 0
> Memtable Columns Count: 0
> Memtable Data Size: 0
> Memtable Switch Count: 0
> Read Count: 0
> Read Latency: NaN ms.
> Write Count: 0
> Write Latency: NaN ms.
> Pending Tasks: 0
> Key cache capacity: 1
> Key cache size: 0
> Key cache hit rate: NaN
> Row cache: disabled
> Compacted row minimum size: 0
> Compacted row maximum size: 0
> Compacted row mean size: 0
>
> Column Family: LocationInfo
> SSTable count: 2
> Space used (live): 3232
> Space used (total): 3232
> Memtable Columns Count: 2
> Memtable Data Size: 46
>  

Re: java.lang.OutofMemoryerror: Java heap space

2010-06-14 Thread Benjamin Black
My guess: you are outrunning your disk I/O.  Each of those 5MB rows
gets written to the commitlog, and the memtable is flushed when it
hits the configured limit, which you've probably left at 128MB.  So every
25 rows or so (128MB / 5MB) you are getting a memtable flushed to disk.
Until these things complete, they are in RAM.

If this is actually representative of your production use, you need a
dedicated commitlog disk, several drives in RAID0 or RAID10 for data,
a lot more RAM, and much larger memtable flush size.


b

On Mon, Jun 14, 2010 at 6:13 PM, Caribbean410  wrote:
> Hi,
>
> I wrote 200k records to the db, each record 5MB. I get this error when I use
> 3 threads (each thread tries to read 200k records in total, 100 records at a
> time) to read data from the db. The write is OK; the error comes from the read.
> Right now the Xmx of the JVM is 1GB. I changed it to 2GB, still not working. If
> the record size is under 4K, I do not get this error. Any clues to avoid
> this error?
>
> Thx
>


Re: JVM Options for Production

2010-06-14 Thread Benjamin Black
"...or does it very greatly from installation to installation?"

Yes.


CFP for Surge Scalability Conference 2010

2010-06-14 Thread Jason Dixon
We're excited to announce Surge, the Scalability and Performance
Conference, to be held in Baltimore on Sept 30 and Oct 1, 2010.  The
event focuses on case studies that demonstrate successes (and failures)
in Web applications and Internet architectures.

Our Keynote speakers include John Allspaw and Theo Schlossnagle.  We are
currently accepting submissions for the Call For Papers through July
9th.  You can find more information, including our current list of
speakers, online:

http://omniti.com/surge/2010

If you've been to Velocity, or wanted to but couldn't afford it, then
Surge is just what you've been waiting for.  For more information,
including CFP, sponsorship of the event, or participating as an
exhibitor, please contact us at su...@omniti.com.

Thanks,

-- 
Jason Dixon
OmniTI Computer Consulting, Inc.
jdi...@omniti.com
443.325.1357 x.241