Re: Best way to know the cluster status

2012-02-06 Thread R. Verlangen
You might consider writing some kind of PHP script that runs nodetool
"ring" and parses the output?

2012/2/6 Tamil selvan R.S 

> Hi,
>  What is the best way to know the cluster status via PHP?
>  Currently we are trying to connect to each individual Cassandra instance with
> a specified timeout, and if it fails we report the node to be down.
>  But this test remains faulty. What are the other ways to test the
> availability of nodes in a Cassandra cluster?
>  How does DataStax OpsCenter manage to do that?
>
> Regards,
> Tamil Selvan
>


Re: nodetool hangs and didn't print anything with firewall

2012-02-06 Thread R. Verlangen
Do you allow both outbound and inbound traffic? You might also try allowing
both TCP and UDP.

2012/2/6 Roshan 

> Yes, if the firewall is disabled it works.
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/nodetool-hangs-and-didn-t-print-anything-with-firewall-tp7257286p7257310.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.
>


Re: Best way to know the cluster status

2012-02-06 Thread Sasha Dolgy
Tamil, what is the underlying purpose you are trying to achieve?  To
have your webpages know and detect when a node is down?  To have a
monitoring tool detect when a node is down?  PHPCassa allows you to
define multiple nodes.  If one node is down, it should log information
to the webserver logs and continue to work as expected if an alternate
node is available.
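
For what it's worth, the same failover idea in Python with pycassa looks
roughly like this (the keyspace, column family and server addresses below are
made up):

import pycassa

# The pool spreads requests over the listed servers and retries on the
# remaining ones when a node is unreachable, logging the failure.
pool = pycassa.ConnectionPool(
    "my_keyspace",                                    # hypothetical keyspace
    server_list=["10.0.0.1:9160", "10.0.0.2:9160"],   # hypothetical nodes
    timeout=0.5,
)
cf = pycassa.ColumnFamily(pool, "my_cf")              # hypothetical CF
print(cf.get("some_row_key"))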

Parsing the output of "nodetool ring" is OK if you want the status at
that very moment.  Something more reliable should be considered,
perhaps using JMX and a proper monitoring tool, like Nagios or
Zenoss, etc.

On Mon, Feb 6, 2012 at 8:59 AM, R. Verlangen  wrote:
> You might consider writing some kind of PHP script that runs nodetool "ring"
> and parses the output?
>
> 2012/2/6 Tamil selvan R.S 
>>
>> Hi,
>>  What is the best way to know the cluster status via PHP?
>>  Currently we are trying to connect to each individual Cassandra instance with
>> a specified timeout, and if it fails we report the node to be down.
>>  But this test remains faulty. What are the other ways to test the
>> availability of nodes in a Cassandra cluster?
>>  How does DataStax OpsCenter manage to do that?
>>
>> Regards,
>> Tamil Selvan
>
>



-- 
Sasha Dolgy
sasha.do...@gmail.com


Need database to log and retrieve sensor data

2012-02-06 Thread Heiner Bunjes

I need a database to log and retrieve sensor data.

Is Cassandra the right solution for this task, and if so, how should I
set it up and which access methods should I use?
If not, which other DB system might be a better fit?


The details are as follows:

 

Glossary

- Node = A computer on which an instance of the database
  is running

- Blip = one data record sent by a sensor

- Blip page = The sorted list of all blips for a specific sensor
  and a specific time range.


The scale is as follows:

(01) 10E6 sensors deliver 1 blip every 100 seconds
 -> Insert rate = 10 kiloblip/s
 -> Insert rate ~ 315 gigablip/Year

(02) They have to be stored for ~3 years
 -> Size of database = 1 terablip

(03) Each blip has about 200 bytes
 -> Size of database = 200TB

(04) The system will start with just 10E4 sensors but will
 soon increase up to the described volume.


The main operations on the data are:

(05) Add the new blips to the database
 (written blips are never changed)!

(06) Return all blips for sensor X with a timestamp
 between timestamp_a and timestamp_b!
 In other words: Return a blip page.

(07) Return all the blips specified in (06) ordered
 by timestamp!

(08) Delete all blips older than Y!


Further the following is true:

(09) Each added blip is clearly (without ambiguity) identifiable by
 sensor_id+timestamp.

(10) 99.9% of the blips are inserted in
 chronological order, the rest are not.

(11) The database system MUST be free and open source.

(12) The DB SHOULD be easy to administrate.

(13) All data MUST still be writable and readable while fewer
 than the configurable number N of nodes are down (unexpectedly).

(14) The mechanisms to distribute the data to the available
 nodes SHOULD be handled by the database.
 This means that the database SHOULD automatically
 redistribute the data when nodes are added or removed.

(15) The project is mainly implemented in Erlang, so there must be
 a stable Erlang interface for database access.

 


Many thanks in advance
Heiner


Re: 1.0.6 - High CPU troubleshooting

2012-02-06 Thread Matthew Trinneer
Aaron,

Have reduced cache sizes and been monitoring for the past week.  It appears as
if this was the culprit - since making the changes we have not seen it resurface.

For those keeping score at home.

* Had sudden persistent spikes in CPU from the Cassandra java process
* Occurred every 24-48 hours and required a restart to resolve
* Reducing row cache sizes on some of the more active column families (which 
have wide rows) appears to eliminate the issue.


On 2012-01-25, at 7:49 PM, aaron morton wrote:

> You are running into GC issues. 
> 
>>> WARN [ScheduledTasks:1] 2012-01-22 12:53:42,804 GCInspector.java (line 146) 
>>> Heap is 0.7767292149986439 full.  You may need to reduce memtable and/or 
>>> cache sizes.  Cassandra will now flush up to the two largest memtables to 
>>> free up memory.  Adjust flush_largest_memtables_at threshold in 
>>> cassandra.yaml if you don't want Cassandra to do this automatically
> 
> Can you reduce the size of the caches?
> 
> As you are under low load, does it correlate with compaction or repair
> processes? Check nodetool compactionstats.
> 
> Do you have wide rows? Check the max row size with nodetool cfstats.
> 
> Also, if you have made any changes to the default memory and GC settings, try
> reverting them.
> 
> 
> Hope that helps. 
> 
> 
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 26/01/2012, at 5:24 AM, Vitalii Tymchyshyn wrote:
> 
>> According to the log, I don't see much time spent on GC. You can still
>> check it with jstat or uncomment GC logging in cassandra-env.sh. Are you
>> sure you've identified the thread correctly?
>> It's still possible that you have a memory spike where GCInspector simply has
>> no chance to run between Full GC rounds. Checking with jstat or adding GC
>> logging may help to diagnose.
>> 
>> On 25.01.12 at 17:24, Matthew Trinneer wrote:
>>> Here is a snippet of what I'm getting out of system.log for GC.  Anything 
>>> is there provide a clue?
>>> 
>>>  WARN [ScheduledTasks:1] 2012-01-22 12:53:42,804 GCInspector.java (line 
>>> 146) Heap is 0.7767292149986439 full.  You may need to reduce memtable 
>>> and/or cache sizes.  Cassandra will now flush up to the two largest 
>>> memtables to free up memory.  Adjust flush_largest_memtables_at threshold 
>>> in cassandra.yaml if you don't want Cassandra to do this automatically
>>>  INFO [ScheduledTasks:1] 2012-01-22 12:54:57,685 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 240 ms for 1 collections, 111478936 used; 
>>> max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-22 15:12:21,710 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 1141 ms for 1 collections, 167667688 used; 
>>> max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-23 14:20:32,862 GCInspector.java (line 
>>> 123) GC for ParNew: 205 ms for 1 collections, 2894546328 used; max is 
>>> 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-23 20:25:06,541 GCInspector.java (line 
>>> 123) GC for ParNew: 240 ms for 1 collections, 4602331064 used; max is 
>>> 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-24 13:24:57,473 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 27869 ms for 1 collections, 6376733632 
>>> used; max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-24 13:25:24,879 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 26306 ms for 1 collections, 6392079368 
>>> used; max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-24 13:27:12,991 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 238 ms for 1 collections, 131710776 used; 
>>> max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-24 13:55:48,326 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 609 ms for 1 collections, 50380160 used; 
>>> max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-24 14:34:41,392 GCInspector.java (line 
>>> 123) GC for ParNew: 325 ms for 1 collections, 1340375240 used; max is 
>>> 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-24 20:55:19,636 GCInspector.java (line 
>>> 123) GC for ParNew: 233 ms for 1 collections, 6387236992 used; max is 
>>> 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-25 14:43:28,921 GCInspector.java (line 
>>> 123) GC for ParNew: 337 ms for 1 collections, 7031219304 used; max is 
>>> 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-25 14:43:51,043 GCInspector.java (line 
>>> 123) GC for ParNew: 211 ms for 1 collections, 7025723712 used; max is 
>>> 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-25 14:50:00,012 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 51534 ms for 2 collections, 6844998736 
>>> used; max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-25 14:51:22,249 GCInspector.java (line 
>>> 123) GC for ConcurrentMarkSweep: 250 ms for 1 collections, 154848440 used; 
>>> max is 8547991552
>>>  INFO [ScheduledTasks:1] 2012-01-25 14:57:46,519 GCInspector.java (line 
>>> 123) GC for ParNew: 244 ms for 1 collections, 190838344 used; max is 

Re: Need database to log and retrieve sensor data

2012-02-06 Thread R. Verlangen
As far as I'm familiar with Cassandra, here is my opinion on every
requirement on your list:

1) 10k inserts / second should be no problem at all for Cassandra
2) Cassandra should scale to that
3) As the homepage of Cassandra states, that amount of data should be able
to fit (source:  http://cassandra.apache.org/ )
4) Not Cassandra related

5) Inserts are very fast in Cassandra
6) You could create row keys in Cassandra that hold the values as columns,
within a timespan (e.g. per second / minute). Please note that "the maximum
number of columns per row is 2 billion" (source:
http://wiki.apache.org/cassandra/CassandraLimitations )
7) The most common ordering for Cassandra is random. However, you could
create some kind of index ColumnFamily (CF) whose columns are the row keys
of your actual data CF. Columns are sorted by default.
8) Cassandra provides a time-to-live (TTL) mechanism: this suits your needs
perfectly (see the sketch below)

9) The column key could be something like "SENSORID~TIMESTAMP", e.g.
"US123~1328539905"
10) Cassandra will take care of the column sorting
11) Cassandra is released under the Apache 2.0 license, so it's open source
12) OpsCenter from DataStax is a really nice tool with a GUI; for
enterprise usage a subscription is required
13) The high availability that Cassandra provides will meet your
requirements
14) Your contact node will find out which nodes are responsible for your
write/read. Adding, removing or moving nodes is also possible.
15) I have no experience with that, but I'm pretty sure there's someone
around here who can help you.

Good luck with finding the best database for your problem.
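
To make points 6, 8 and 9 concrete, here is a rough pycassa sketch (Python;
the names, the one-row-per-sensor-per-day bucketing and the LongType column
comparator are my assumptions, not something from the original post):

import pycassa

pool = pycassa.ConnectionPool("sensors")        # hypothetical keyspace
blips = pycassa.ColumnFamily(pool, "Blips")     # comparator assumed LongType

THREE_YEARS = 3 * 365 * 24 * 3600

def day_bucket(ts):
    # Midnight of the day containing ts; one row per sensor per day
    # keeps rows bounded (~864 blips/day at one blip per 100 s).
    return ts - (ts % 86400)

def write_blip(sensor_id, ts, payload):
    # The TTL implements requirement (08) - old blips expire automatically.
    row_key = "%s~%d" % (sensor_id, day_bucket(ts))
    blips.insert(row_key, {ts: payload}, ttl=THREE_YEARS)

def read_page(sensor_id, ts_a, ts_b):
    # Requirements (06)/(07): columns come back sorted by timestamp.
    result = []
    for day in range(day_bucket(ts_a), day_bucket(ts_b) + 1, 86400):
        try:
            cols = blips.get("%s~%d" % (sensor_id, day),
                             column_start=ts_a, column_finish=ts_b,
                             column_count=10000)
            result.extend(cols.items())
        except pycassa.NotFoundException:
            pass
    return result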

2012/2/6 Heiner Bunjes 

> I need a database to log and retrieve sensor data.
>
> Is Cassandra the right solution for this task, and if so, how should I
> set it up and which access methods should I use?
> If not, which other DB system might be a better fit?
>
>
> The details are as follows:
>
>  
>
> Glossary
>
> - Node = A computer on which an instance of the database
>  is running
>
> - Blip = one data record sent by a sensor
>
> - Blip page = The sorted list of all blips for a specific sensor
>  and a specific time range.
>
>
> The scale is as follows:
>
> (01) 10E6 sensors deliver 1 blip every 100 seconds
> -> Insert rate = 10 kiloblip/s
> -> Insert rate ~ 315 gigablip/Year
>
> (02) They have to be stored for ~3 years
> -> Size of database = 1 terablip
>
> (03) Each blip has about 200 bytes
> -> Size of database = 200TB
>
> (04) The system will start with just 10E4 sensors but will
> soon increase up to the described volume.
>
>
> The main operations on the data are:
>
> (05) Add the new blips to the database
> (written blips are never changed)!
>
> (06) Return all blips for sensor X with a timestamp
> between timestamp_a and timestamp_b!
> In other words: Return a blip page.
>
> (07) Return all the blips specified in (06) ordered
> by timestamp!
>
> (08) Delete all blips older than Y!
>
>
> Further the following is true:
>
> (09) Each added blip is clearly (without ambiguity) identifiable by
> sensor_id+timestamp.
>
> (10) 99.9% of the blips are inserted in
> chronological order, the rest are not.
>
> (11) The database system MUST be free and open source.
>
> (12) The DB SHOULD be easy to administrate.
>
> (13) All data MUST still be writable and readable while fewer
> than the configurable number N of nodes are down (unexpectedly).
>
> (14) The mechanisms to distribute the data to the available
> nodes SHOULD be handled by the database.
> This means that the database SHOULD automatically
> redistribute the data when nodes are added or removed.
>
> (15) The project is mainly implemented in Erlang, so there must be
> a stable Erlang interface for database access.
>
>  
>
>
> Many thanks in advance
> Heiner
>


Re: yet a couple more questions on composite columns

2012-02-06 Thread Jim Ancona
On Sat, Feb 4, 2012 at 8:54 PM, Yiming Sun  wrote:
> Interesting idea, Jim.  Is there a reason you don't use
> "metadata:{accountId}" instead?  For performance reasons?

No, because the column comparator is defined as
CompositeType(DateType, AsciiType), and all column names must conform
to that.

Jim
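
For illustration, the "sentinel first component" trick in pycassa might look
like this (assuming pycassa's CompositeType support, available from roughly
1.4 on; the keyspace, CF and row names are made up):

from datetime import datetime
import pycassa
from pycassa.system_manager import SystemManager
from pycassa.types import CompositeType, DateType, AsciiType

sys_mgr = SystemManager("localhost:9160")
sys_mgr.create_column_family(
    "ks", "Events",
    comparator_type=CompositeType(DateType(), AsciiType()))

pool = pycassa.ConnectionPool("ks")
cf = pycassa.ColumnFamily(pool, "Events")

# The metadata column uses the epoch as a sentinel first component, so it
# still satisfies the (DateType, AsciiType) comparator and sorts before
# every real, dated column in the row.
cf.insert("row1", {(datetime(1970, 1, 1), "account-42"): "metadata-value"})
cf.insert("row1", {(datetime(2012, 2, 6), "event-1"): "real-value"})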

>
>
> On Sat, Feb 4, 2012 at 6:24 PM, Jim Ancona  wrote:
>>
>> I've used "special" values which still comply with the Composite
>> schema for the metadata columns, e.g. a column of
>> 1970-01-01:{accountId} for a metadata column where the Composite is
>> DateType:UTF8Type.
>>
>> Jim
>>
>> On Sat, Feb 4, 2012 at 2:13 PM, Yiming Sun  wrote:
>> > Thanks Andrey and Chris.  It sounds like we don't necessarily have to
>> > use
>> > composite columns.  From what I understand about dynamic CF, each row
>> > may
>> > have completely different data from other rows;  but in our case, the
>> > data
>> > in each row is similar to other rows; my concern was more about the
>> > homogeneity of the data between columns.
>> >
>> > In our original supercolumn-based schema, one special supercolumn is
>> > called
>> > "metadata" which contains a number of subcolumns to hold metadata
>> > describing
>> > each collection (e.g. number of documents, etc.), then the rest of the
>> > supercolumns in the same row are all IDs of documents belong to the
>> > collection, and for each document supercolumn, the subcolumns contain
>> > the
>> > document content as well as metadata on individual document (e.g.
>> > checksum
>> > of each document).
>> >
>> > To move away from the supercolumn schema, I could either create two CFs,
>> > one
>> > to hold metadata, the other document content; or I could create just one
>> > CF
>> > mixing metadata and doc content in the same row, and using composite
>> > column
>> > names to identify if the particular column is metadata or a document.  I
>> > am
>> > just wondering if you have any inputs on the pros and cons of each
>> > schema.
>> >
>> > -- Y.
>> >
>> >
>> > On Fri, Feb 3, 2012 at 10:27 PM, Chris Gerken
>> > 
>> > wrote:
>> >>
>> >>
>> >>
>> >>
>> >> On 4 February 2012 06:21, Yiming Sun  wrote:
>> >>>
>> >>> I cannot have one composite column name with 3 components while
>> >>> another
>> >>> with 4 components?
>> >>
>> >>  Just put 4 components and leave the last one empty (if it is the same type)?!
>> >>
>> >>> Another question I have is how flexible composite columns actually
>> >>> are.
>> >>>  If my data model has a CF containing US zip codes with the following
>> >>> composite columns:
>> >>>
>> >>> {OH:Spring Field} : 45503
>> >>> {OH:Columbus} : 43085
>> >>> {FL:Spring Field} : 32401
>> >>> {FL:Key West}  : 33040
>> >>>
>> >>> I know I can ask cassandra to "give me the zip codes of all cities in
>> >>> OH".  But can I ask it to "give me the zip codes of all cities named
>> >>> Spring
>> >>> Field" using this model?  Thanks.
>> >>
>> >> No. You have to set the first composite component first.
>> >>
>> >>
>> >> I'd use a dynamic CF:
>> >> row key = state abbreviation
>> >> column name = city name
>> >> column value = zip code (or a complex object, one of whose properties
>> >> is
>> >> zip code)
>> >>
>> >> you can iterate over the columns in a single row to get a state's city
>> >> names and their zip code and you can do a get_range_slices on all keys
>> >> for
>> >> the columns starting and ending on the city name to find out the zip
>> >> codes
>> >> for cities with the given name.
>> >>
>> >> I think
>> >>
>> >> - Chris
>> >
>> >
>
>


Re: nodetool hangs and didn't print anything with firewall

2012-02-06 Thread Nick Bailey
JMX is not very firewall friendly. The problem is that JMX is a
two-connection process. The first connection happens on port 7199 and the
second connection happens on some random port > 1024. Work on changing
this behavior was started in this ticket:

https://issues.apache.org/jira/browse/CASSANDRA-2967
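
In the meantime, the usual workaround is to pin both connections to known
ports and open only those in the firewall. A sketch of the relevant JVM
options for cassandra-env.sh (note: com.sun.management.jmxremote.rmi.port is
only honoured by newer JDKs; on older ones a custom JMX agent is needed):

JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=7199"
# Pin the second (RMI) connection to a fixed port instead of a random one:
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.rmi.port=7199"
# Advertise an address that is routable through the firewall:
JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=<public-ip>"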

On Mon, Feb 6, 2012 at 2:02 AM, R. Verlangen  wrote:
> Do you allow both outbound and inbound traffic? You might also try allowing
> both TCP and UDP.
>
>
> 2012/2/6 Roshan 
>>
>> Yes, if the firewall is disabled it works.
>>
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/nodetool-hangs-and-didn-t-print-anything-with-firewall-tp7257286p7257310.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
>> Nabble.com.
>
>


Re: yet a couple more questions on composite columns

2012-02-06 Thread Yiming Sun
Thanks for the clarification, Jim.  I didn't know the first comparator was
defined as DateType. Yeah, in that case, the beginning of the epoch is the
only choice.

-- Y.

On Mon, Feb 6, 2012 at 11:35 AM, Jim Ancona  wrote:

> On Sat, Feb 4, 2012 at 8:54 PM, Yiming Sun  wrote:
> > Interesting idea, Jim.  Is there a reason you don't use
> > "metadata:{accountId}" instead?  For performance reasons?
>
> No, because the column comparator is defined as
> CompositeType(DateType, AsciiType), and all column names must conform
> to that.
>
> Jim
>
> >
> >
> > On Sat, Feb 4, 2012 at 6:24 PM, Jim Ancona  wrote:
> >>
> >> I've used "special" values which still comply with the Composite
> >> schema for the metadata columns, e.g. a column of
> >> 1970-01-01:{accountId} for a metadata column where the Composite is
> >> DateType:UTF8Type.
> >>
> >> Jim
> >>
> >> On Sat, Feb 4, 2012 at 2:13 PM, Yiming Sun 
> wrote:
> >> > Thanks Andrey and Chris.  It sounds like we don't necessarily have to
> >> > use
> >> > composite columns.  From what I understand about dynamic CF, each row
> >> > may
> >> > have completely different data from other rows;  but in our case, the
> >> > data
> >> > in each row is similar to other rows; my concern was more about the
> >> > homogeneity of the data between columns.
> >> >
> >> > In our original supercolumn-based schema, one special supercolumn is
> >> > called
> >> > "metadata" which contains a number of subcolumns to hold metadata
> >> > describing
> >> > each collection (e.g. number of documents, etc.), then the rest of the
> >> > supercolumns in the same row are all IDs of documents belong to the
> >> > collection, and for each document supercolumn, the subcolumns contain
> >> > the
> >> > document content as well as metadata on individual document (e.g.
> >> > checksum
> >> > of each document).
> >> >
> >> > To move away from the supercolumn schema, I could either create two
> CFs,
> >> > one
> >> > to hold metadata, the other document content; or I could create just
> one
> >> > CF
> >> > mixing metadata and doc content in the same row, and using composite
> >> > column
> >> > names to identify if the particular column is metadata or a document.
>  I
> >> > am
> >> > just wondering if you have any inputs on the pros and cons of each
> >> > schema.
> >> >
> >> > -- Y.
> >> >
> >> >
> >> > On Fri, Feb 3, 2012 at 10:27 PM, Chris Gerken
> >> > 
> >> > wrote:
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 4 February 2012 06:21, Yiming Sun  wrote:
> >> >>>
> >> >>> I cannot have one composite column name with 3 components while
> >> >>> another
> >> >>> with 4 components?
> >> >>
> >> >>  Just put 4 components and leave the last one empty (if it is the same type)?!
> >> >>
> >> >>> Another question I have is how flexible composite columns actually
> >> >>> are.
> >> >>>  If my data model has a CF containing US zip codes with the
> following
> >> >>> composite columns:
> >> >>>
> >> >>> {OH:Spring Field} : 45503
> >> >>> {OH:Columbus} : 43085
> >> >>> {FL:Spring Field} : 32401
> >> >>> {FL:Key West}  : 33040
> >> >>>
> >> >>> I know I can ask cassandra to "give me the zip codes of all cities
> in
> >> >>> OH".  But can I ask it to "give me the zip codes of all cities named
> >> >>> Spring
> >> >>> Field" using this model?  Thanks.
> >> >>
> >> >> No. You have to set the first composite component first.
> >> >>
> >> >>
> >> >> I'd use a dynamic CF:
> >> >> row key = state abbreviation
> >> >> column name = city name
> >> >> column value = zip code (or a complex object, one of whose properties
> >> >> is
> >> >> zip code)
> >> >>
> >> >> you can iterate over the columns in a single row to get a state's
> city
> >> >> names and their zip code and you can do a get_range_slices on all
> keys
> >> >> for
> >> >> the columns starting and ending on the city name to find out the zip
> >> >> codes
> >> >> for cities with the given name.
> >> >>
> >> >> I think
> >> >>
> >> >> - Chris
> >> >
> >> >
> >
> >
>


Re: Internal error processing batch_mutate java.util.ConcurrentModificationException

2012-02-06 Thread aaron morton
That looks like a bug. Were you writing counters?


Can you please add it here: https://issues.apache.org/jira/browse/CASSANDRA ,
include some information on the request that caused it, and email the bug
report back to the list.

(note to self) I *think* the problem is that the counter WritePerformer
implementations are put into the REPLICATE_ON_WRITE TP and then update the
hints on the AbstractWriteResponseHandler asynchronously. This could happen
after the write thread has moved on to wait on the handlers, which involves
waiting on the hints futures.

thanks

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 4/02/2012, at 4:49 AM, Viktor Jevdokimov wrote:

> What may be the cause of the following exception in Cassandra 1.0.7:
>  
> ERROR [Thrift:134] 2012-02-03 15:51:02,800 Cassandra.java (line 3462) 
> Internal error processing batch_mutate
> java.util.ConcurrentModificationException
> at 
> java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
> at java.util.AbstractList$Itr.next(AbstractList.java:343)
> at 
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:532)
> at 
> org.apache.cassandra.service.AbstractWriteResponseHandler.waitForHints(AbstractWriteResponseHandler.java:89)
> at 
> org.apache.cassandra.service.AbstractWriteResponseHandler.get(AbstractWriteResponseHandler.java:58)
> at 
> org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:201)
> at 
> org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:639)
> at 
> org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:590)
> at 
> org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:598)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
> at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
>  
>  
>  
> Best regards/ Pagarbiai
>  
> Viktor Jevdokimov
> Senior Developer
>  
> Email:  viktor.jevdoki...@adform.com
> Phone: +370 5 212 3063. Fax: +370 5 261 0453
> J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
>  
>  



Re: sensible data model ?

2012-02-06 Thread aaron morton
Sounds like a good start. Super columns are not a great fit for modeling time
series data for a few reasons; here is one:
http://wiki.apache.org/cassandra/CassandraLimitations

It's also a good idea to partition time series data so that the rows do not 
grow too big. You can have 2 billion columns in a row, but big rows have 
operational down sides.

You could go with either:

rows: <entity_id : date>
column: <property_name>

Which would mean each time you query for a date range you need to query
multiple rows. But it is possible to get a range of columns / properties.

Or

rows: <entity_id : time_partition>
column: <date : property_name>

Where time_partition is something that makes sense in your problem domain, e.g.
a calendar month. If you often query for days in a month you can then get all
the columns for the days you are interested in (using a column range). If you
only want to get a subset of the entity properties you will need to get them
all and filter them client side; depending on the number and size of the
properties this may be more efficient than multiple calls.
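
For illustration, the second layout in pycassa might look like the sketch
below (the keyspace/CF names, the month partition and the "<date>:<property>"
string encoding are assumptions; a real schema might use a CompositeType
comparator instead):

import pycassa

pool = pycassa.ConnectionPool("ks")             # hypothetical keyspace
cf = pycassa.ColumnFamily(pool, "EntityProps")  # hypothetical CF, UTF8 comparator

# One row per entity per calendar month; "<date>:<property>" column names
# keep each day's properties adjacent and sorted.
row_key = "entity42:2012-02"
cf.insert(row_key, {"2012-02-06:key1": "v1", "2012-02-06:key2": "v2"})

# A column range pulls all properties for a span of days in that month
# ('~' sorts after the digits and ':', closing the range).
cols = cf.get(row_key, column_start="2012-02-01:",
              column_finish="2012-02-10:~", column_count=1000)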

One word of warning: avoid sending read requests for lots (i.e. 100's) of rows
at once, as it will reduce overall query throughput. Some clients, like
pycassa, take care of this for you.
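
(Continuing the sketch above: pycassa's multiget takes a buffer_size argument
and splits a large key list into batches of that size behind the scenes,
which is the behaviour referred to here.)

rows = cf.multiget(["entity1:2012-02", "entity2:2012-02"], buffer_size=64)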

Good luck. 
 
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/02/2012, at 12:12 AM, Franc Carter wrote:

> 
> Hi,
> 
> I'm pretty new to Cassandra and am currently doing a proof of concept, and 
> thought it would be a good idea to ask if my data model is sane . . . 
> 
> The data I have, and need to query, is reasonably simple. It consists of
> about 10 million entities, each of which has a set of key/value properties
> for each day for about 10 years. The number of keys is in the 50-100 range
> and there will be a lot of overlap for keys in 
> 
> The queries I need to make are for sets of key/value properties for an entity
> on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The number of
> entities and/or days in the query could be either very small or very large.
> 
> I've modeled this with a simple column family for the keys, with the row key
> being the concatenation of the entity and date. My first go used only the
> entity as the row key and then used a supercolumn for each date. I decided
> against this mostly because it seemed more complex for a gain I didn't really
> understand.
> 
> Does this seem sensible ?
> 
> thanks
> 
> -- 
> Franc Carter | Systems architect | Sirca Ltd
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118 
> Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
> 



Re: Cassandra OOM - 1.0.2

2012-02-06 Thread Ajeet Grewal
On Sat, Feb 4, 2012 at 7:03 AM, Jonathan Ellis  wrote:
> Sounds like you need to increase sysctl vm.max_map_count

This did not work. I increased vm.max_map_count from 65536 to 131072.
I am still getting the same error.

ERROR [SSTableBatchOpen:4] 2012-02-06 11:43:50,463
AbstractCassandraDaemon.java (line 133) Fatal exception in thread
Thread[SSTableBatchOpen:4,5,main]
java.io.IOError: java.io.IOException: Map failed
at 
org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:225)
at 
org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:202)
at 
org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:380)
at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:159)
at 
org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:197)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Map failed

--
Regards,
Ajeet


Re: Cassandra OOM - 1.0.2

2012-02-06 Thread Ajeet Grewal
On Mon, Feb 6, 2012 at 11:50 AM, Ajeet Grewal  wrote:
> On Sat, Feb 4, 2012 at 7:03 AM, Jonathan Ellis  wrote:
>> Sounds like you need to increase sysctl vm.max_map_count
>
> This did not work. I increased vm.max_map_count from 65536 to 131072.
> I am still getting the same error.

The number of files in the data directory is small (~300), so I don't
see why mmap should fail because of this.
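
A quick way to check whether the process is actually near the limit (a
Linux-only sketch; pass the Cassandra JVM's pid - and note that large SSTables
are mmapped in ~2 GB segments, as the strace in the follow-up shows, so ~300
files can still mean far more mappings than files):

import sys

def count_mappings(pid):
    # Each line of /proc/<pid>/maps is one memory mapping.
    with open("/proc/%s/maps" % pid) as f:
        return sum(1 for _ in f)

def map_limit():
    with open("/proc/sys/vm/max_map_count") as f:
        return int(f.read())

if __name__ == "__main__":
    pid = sys.argv[1]  # the Cassandra JVM's pid
    print("%d mappings, limit %d" % (count_mappings(pid), map_limit()))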

-- 
Regards,
Ajeet


Re: Cassandra OOM - 1.0.2

2012-02-06 Thread Ajeet Grewal
Here are the last few lines of strace output (from one of the threads). There
are a bunch of mmap system calls. Notice the last mmap call a couple
of lines before the trace ends. Could that last mmap call have failed?

== BEGIN STRACE ==
mmap(NULL, 2147487599, PROT_READ, MAP_SHARED, 37, 0xbb000) = 0x7709b54000
fstat(37, {st_mode=S_IFREG|0644, st_size=59568105422, ...}) = 0
mmap(NULL, 214743, PROT_READ, MAP_SHARED, 37, 0xc7fffb000) = 0x7789b55000
fstat(37, {st_mode=S_IFREG|0644, st_size=59568105422, ...}) = 0
mmap(NULL, 2147483522, PROT_READ, MAP_SHARED, 37, 0xc4000) = 0x7809b4f000
fstat(37, {st_mode=S_IFREG|0644, st_size=59568105422, ...}) = 0
mmap(NULL, 1586100174, PROT_READ, MAP_SHARED, 37, 0xd7fff3000) = 0x7889b4f000
dup2(40, 37)= 37
close(37)   = 0
open("/home/y/var/fresh_cassandra/data/fresh/counter_object-h-4240-Filter.db",
O_RDONLY) = 37
.
.
.
.
close(37)   = 0
futex(0x2ab5a39754, FUTEX_WAKE, 1)  = 1
futex(0x2ab5a39750, FUTEX_WAKE, 1)  = 1
futex(0x40116940, FUTEX_WAKE, 1)= 1
mmap(0x41a17000, 12288, PROT_NONE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x41a17000
rt_sigprocmask(SIG_SETMASK, [QUIT], NULL, 8) = 0
_exit(0)= ?
== END STRACE ==

-- 
Regards,
Ajeet


Kundera 2.0.5 Released

2012-02-06 Thread Amresh Singh
Hi All,

We are happy to announce the release of Kundera 2.0.5.

Kundera is a JPA 2.0-based Object-Datastore Mapping Library for NoSQL
datastores. The idea behind Kundera is to make working with NoSQL databases
drop-dead simple and fun. It currently supports Cassandra, HBase,
MongoDB and relational databases.


Major Changes in this release:
---
- Cassandra 1.x migration.
- Support for Many-to-Many relationships (via join table)
- Transitive persistence.
- Datastore-native secondary index support in addition to Lucene-based
indexing. An optional switch is provided to change between the two.
- Query support for >, <, >=, <=, !=, like, order by, logical operators and
between.
- Connection pooling settings provided for all datastores.
- Support for all data types as required by JPA.
- Range queries for Cassandra (via between clause in JPA-QL)
- Bug fixes related to self join.


To download, use or contribute to Kundera, visit:
http://github.com/impetus-opensource/Kundera

Sample codes and examples for using Kundera can be found here:
http://github.com/impetus-opensource/Kundera-Examples


NOSQL is as easy as SQL, if done through Kundera!
Happy working with NoSQL!!


Re: sensible data model ?

2012-02-06 Thread Ajeet Grewal
> It's also a good idea to partition time series data so that the rows do not
> grow too big. You can have 2 billion columns in a row, but big rows have
> operational down sides.

What are the down sides here? Unfortunately I have an existing system
which I modeled with large rows (because I use the sorted nature of
columns to get column ranges). After the amount of data grows, I get
"mmap failed" exceptions (See my other thread "Cassandra OOM"). I
wonder if there is a connection.

-- 
Regards,
Ajeet


Re: sensible data model ?

2012-02-06 Thread Franc Carter
On Tue, Feb 7, 2012 at 6:39 AM, aaron morton wrote:

> Sounds like a good start. Super columns are not a great fit for modeling
> time series data for a few reasons, here is one
> http://wiki.apache.org/cassandra/CassandraLimitations
>


None of those jump out at me as horrible for my case. If I modelled with
Super Columns I would have fewer than 10,000 Super Columns with an average
of 50 columns - big but not insane?


>
> It's also a good idea to partition time series data so that the rows do
> not grow too big. You can have 2 billion columns in a row, but big rows
> have operational down sides.
>
> You could go with either:
>
> rows: <entity_id : date>
> column: <property_name>
>
> Which would mean each time you query for a date range you need to query
> multiple rows. But it is possible to get a range of columns / properties.
>
> Or
>
> rows: <entity_id : time_partition>
> column: <date : property_name>
>

That's an interesting idea - I'll talk to the data experts to see if we
have a sensible range.


>
> Where time_partition is something that makes sense in your problem domain,
> e.g. a calendar month. If you often query for days in a month you  can then
> get all the columns for the days you are interested in (using a column
> range). If you only want to get a sub set of the entity properties you will
> need to get them all and filter them client side, depending on the number
> and size of the properties this may be more efficient than multiple calls.
>

I'm fine with doing work on the client side - I have a bias in that
direction as it tends to scale better.


>
> One word of warning, avoid sending read requests for lots (i.e. 100's) of
> rows at once it will reduce overall query throughput. Some clients like
> pycassa take care of this for you.
>

Because of request overhead? I'm currently using the batch interface of
pycassa to do bulk reads. Is the same problem going to bite me if I have
many clients reading (using bulk reads)? In production we will have ~50
clients.

thanks


> Good luck.
>
>   -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5/02/2012, at 12:12 AM, Franc Carter wrote:
>
>
> Hi,
>
> I'm pretty new to Cassandra and am currently doing a proof of concept, and
> thought it would be a good idea to ask if my data model is sane . . .
>
> The data I have, and need to query, is reasonably simple. It consists of
> about 10 million entities, each of which has a set of key/value properties
> for each day for about 10 years. The number of keys is in the 50-100 range
> and there will be a lot of overlap for keys in 
>
> The queries I need to make are for sets of key/value properties for an
> entity on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The number
> of entities and/or days in the query could be either very small or very
> large.
>
> I've modeled this with a simple column family for the keys with the row
> key being the concatenation of the entity and date. My first go used only
> the entity as the row key and then used a supercolumn for each date. I
> decided against this mostly because it seemed more complex for a gain I
> didn't really understand.
>
> Does this seem sensible ?
>
> thanks
>
> --
> *Franc Carter* | Systems architect | Sirca Ltd
>  
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118
>  Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>


-- 

*Franc Carter* | Systems architect | Sirca Ltd
 

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215