Combining Cassandra with some SQL language

2012-02-26 Thread R. Verlangen
Hi there,

I'm currently busy with the technical design of a new project. Of course it
will depend on your needs, but is it weird to combine Cassandra with a SQL
language like MySQL?

In my use case it would be nice because we have some tables/CF's with lots
and lots of data that does not really have to be 100% consistent, but also
have some data that should always be consistent.

What do you think of this?

With kind regards,
Robin Verlangen


Re: Combining Cassandra with some SQL language

2012-02-26 Thread Benjamin Hawkes-Lewis
On Sun, Feb 26, 2012 at 1:06 PM, R. Verlangen  wrote:
> I'm currently busy with the technical design of a new project. Of course it
> will depend on your needs, but is it weird to combine Cassandra with a SQL
> language like MySQL?
>
> In my use case it would be nice because we have some tables/CF's with lots
> and lots of data that does not really have to be 100% consistent, but also
> have some data that should always be consistent.
>
> What do you think of this?

It seems entirely reasonable to hybridise your stack to take advantage
of the qualities of different data stores. The tradeoff is your system
will have more moving parts, increasing its learning curve,
complicating provisioning, etc.

Where I work, we moved a lot of our domain out of MySQL into
Cassandra, and are now porting select parts of the domain that change
infrequently but require greater consistency back into MySQL. We are
also using other forms of storage (Redis and S3).

--
Benjamin Hawkes-Lewis


Re: Combining Cassandra with some SQL language

2012-02-26 Thread Adam Haney
I've been using a combination of MySQL and Cassandra for about a year now
on a project that now serves about 20k users. We use Cassandra for storing
large entities and MySQL to store metadata that allows us to do better ad
hoc querying. It's worked quite well for us. During this time we have also
been able to migrate some of our tables from MySQL to Cassandra when MySQL
performance / capacity became a problem. This may seem obvious, but if
you're planning on creating a data model that spans multiple databases, make
sure you encapsulate the logic to read/write/delete information in a good
data-model library and only use that library to access your data. This is
good practice anyway, but when you add the extra complication of multiple
databases that may reference one another it's an absolute must.
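Adam's advice about a single data-model library can be sketched like this (a minimal illustration only; the class and method names are invented, and plain dicts stand in for the MySQL and Cassandra clients):

```python
# Hypothetical repository that hides both stores behind one entry point.
class UserRepository:
    """All cross-database reads/writes/deletes go through here."""

    def __init__(self, sql_store, cassandra_store):
        self.sql = sql_store              # small, strongly consistent metadata
        self.cassandra = cassandra_store  # large, eventually consistent blobs

    def save_profile(self, user_id, metadata, blob):
        # Metadata that must stay consistent goes to the SQL side...
        self.sql[user_id] = metadata
        # ...while the large entity goes to Cassandra.
        self.cassandra[user_id] = blob

    def load_profile(self, user_id):
        return self.sql.get(user_id), self.cassandra.get(user_id)

    def delete_profile(self, user_id):
        # Deletes touch both stores in one place, so callers cannot
        # leave a dangling cross-database reference behind.
        self.sql.pop(user_id, None)
        self.cassandra.pop(user_id, None)


repo = UserRepository({}, {})
repo.save_profile("u1", {"email": "a@example.com"}, b"large-entity-bytes")
print(repo.load_profile("u1"))
repo.delete_profile("u1")
print(repo.load_profile("u1"))  # (None, None)
```

Because only the repository knows about the two backends, swapping a table from MySQL to Cassandra later only touches this one library.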

On Sun, Feb 26, 2012 at 8:06 AM, R. Verlangen  wrote:

> Hi there,
>
> I'm currently busy with the technical design of a new project. Of course
> it will depend on your needs, but is it weird to combine Cassandra with a
> SQL language like MySQL?
>
> In my use case it would be nice because we have some tables/CF's with lots
> and lots of data that does not really have to be 100% consistent, but also
> have some data that should always be consistent.
>
> What do you think of this?
>
> With kind regards,
> Robin Verlangen
>


Re: Frequency of Flushing in 1.0

2012-02-26 Thread Radim Kolar

> if a node goes down, it will take longer for commitlog replay.

Commit log replay time is insignificant. Most of the time during node startup 
is wasted on index sampling. Index sampling here runs for about 15 minutes.


Re: Frequency of Flushing in 1.0

2012-02-26 Thread Edward Capriolo
If you are doing planned maintenance you can flush first as well,
ensuring that the commit logs will not be as large.
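For a planned restart, that sequence might look like the following (a sketch only; the hostname is a placeholder, and flag syntax should be checked against your nodetool version):

```shell
# Stop serving client traffic, then flush memtables to sstables so the
# commit log is close to empty at shutdown and replay on restart is fast.
nodetool -h node1.example.com disablethrift
nodetool -h node1.example.com flush
# Alternatively, `nodetool drain` flushes and stops accepting writes in
# one step; once it completes, the node can be stopped safely.
```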

On Sun, Feb 26, 2012 at 10:09 AM, Radim Kolar  wrote:
>> if a node goes down, it will take longer for commitlog replay.
>
> commit log replay time is insignificant. most time during node startup is
> wasted on index sampling. Index sampling here runs for about 15 minutes.


Re: unidirectional communication/replication

2012-02-26 Thread aaron morton
All nodes in the cluster need two-way communication: nodes Gossip to each 
other so they know which nodes are alive. 

If you need to dump a lot of data consider the Hadoop integration. 
http://wiki.apache.org/cassandra/HadoopSupport It can run a bit faster than 
going through the thrift api.

Copying sstables may be another option depending on the data size. 

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/02/2012, at 3:21 AM, Alexandru Sicoe wrote:

> Hello everyone,
> 
> I'm battling with this constraint that I have: I need to regularly ship 
> timeseries data out of a Cassandra cluster that sits within an enclosed 
> network. 
> 
> I tried to select all the data within a certain time window, writing it to a 
> file, and then copying the file out, but this hits the I/O performance because 
> even for a small time window (say 5 mins) I am hitting more than a million 
> rows. 
> 
> It would really help if I used Cassandra to replicate the data automatically 
> outside. The problem is they will only allow me to have outbound traffic out 
> of the enclosed network (not inbound). Is there any way to configure the 
> cluster or have 2 data centers in such a way that the data center (node or 
> cluster) outside of the enclosed network only gets a replica of the data, 
> without ever needing to communicate anything back?
> 
> I appreciate the help,
> Alex



Re: How to delete a range of columns using first N components of CompositeType Column?

2012-02-26 Thread aaron morton
it has been discussed a few times :)

https://issues.apache.org/jira/browse/CASSANDRA-494

A

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/02/2012, at 8:06 AM, Praveen Baratam wrote:

> Thank you Aaron for the clarification. 
> 
> Maybe this could be a feature that the Cassandra team should consider 
> implementing. Instead of two network round trips, the logic could be 
> consolidated on the server side if a read before a range delete is unavoidable. 
> 
> On Fri, Feb 24, 2012 at 12:46 AM, aaron morton  
> wrote:
> Unfortunately you cannot use column ranges for delete operations. 
> 
> So while what you want to do is something like...
> 
> Delete 'Jack:*:*'...'Jack:*:*' from Test where KEY = "friends";
> 
> You cannot do it. 
> 
> You need to read and then delete by name.
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 23/02/2012, at 8:08 PM, Praveen Baratam wrote:
> 
>> More precisely,
>> 
>> Lets say we have a CF with the following spec.
>> 
>> create column family Test
>> with comparator = 'CompositeType(UTF8Type,UTF8Type,UTF8Type)'
>> and key_validation_class = 'UTF8Type'
>> and default_validation_class = 'UTF8Type';
>> 
>> And I have columns such as:
>> 
>> Jack:Name:First - Jackson
>> Jack:Name:Last -  Samuel
>> Jack:Age - 50
>> 
>> Now, to delete all columns related to Jack, as far as I can see I need 
>> to use:
>> 
>> Delete 'Jack:Name:First', 'Jack:Name:Last', 'Jack:Age' from Test where KEY = 
>> "friends";
>> 
>> The problem is we do not usually know what meta-data is associated with a 
>> user as it may include Timestamp based columns.
>> 
>> such as: Jack:1234567890:Location - Chicago
>> 
>> Can something like -
>> 
>> Delete 'Jack' from Test where KEY = "friends";
>> 
>> be done using the First N components of the CompositeType?
>> 
>> Or should we read first and then delete?
>> 
>> Thank You.
>> 
>> On Thu, Feb 23, 2012 at 4:47 AM, Praveen Baratam  
>> wrote:
>> I am using CompositeType columns and it's very convenient to query for a 
>> range of columns using the first N components, but how do I delete a range of 
>> columns using the first N components of the CompositeType column?
>> 
>> In order to specify the exact column names to delete, I would have to read 
>> first and then delete.
>> 
>> Is there a better way?
>> 
> 
> 
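The read-then-delete pattern discussed in this thread can be modelled in a few lines (an in-memory stand-in, not real driver code: a row is a dict keyed by composite-tuple column names, and `columns_with_prefix` plays the role of the slice read):

```python
# Cassandra (as of this era) cannot delete a slice of columns by range, so
# the pattern is: read the column names under the composite prefix with a
# slice query, then issue a delete naming each column.

def columns_with_prefix(row, prefix):
    """Stand-in for the slice read: names whose leading components match."""
    return [name for name in sorted(row) if name[:len(prefix)] == prefix]

def delete_by_prefix(row, prefix):
    # Step 1: read the names; step 2: delete each one by name.
    for name in columns_with_prefix(row, prefix):
        del row[name]

row = {
    ("Jack", "Name", "First"): "Jackson",
    ("Jack", "Name", "Last"): "Samuel",
    ("Jack", "Age"): "50",
    ("Jill", "Age"): "40",
}
delete_by_prefix(row, ("Jack",))
print(sorted(row))  # [('Jill', 'Age')] -- only Jill's column remains
```

This is exactly the two round trips the thread complains about; server-side range deletes (the linked ticket) would collapse them into one.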



Re: Server crashed due to "OutOfMemoryError: Java heap space"

2012-02-26 Thread aaron morton
> several compactions on few 200-300 GB SSTables
Sounds like some big files. Out of interest, how much data do you have per 
node? 
Also, do you have wide rows? You can check via nodetool cfstats. 

In cases where OOM / GC is related to compaction, these are the steps I take 
first. It's heavy-handed and will probably increase the IO load. Once you 
stabilise, you should see if you can increase these settings again.

in cassandra.yaml
* set concurrent_compactors to 2 - this will reduce the number of concurrent 
compactions. 
* if you have wide rows reduce in_memory_compaction_limit_in_mb to 32 or lower. 

(as you are on 0.8.X also check memtable_total_space_in_mb is enabled)
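As a cassandra.yaml fragment, those first-pass settings might look like this (values are the starting points described above, not recommendations; the memtable figure is an illustrative number to size against your heap):

```yaml
concurrent_compactors: 2              # fewer simultaneous compactions
in_memory_compaction_limit_in_mb: 32  # wide rows fall back to incremental compaction
memtable_total_space_in_mb: 2048      # 0.8.x: make sure this is not commented out
```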

Hope that helps. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/02/2012, at 10:14 AM, Feng Qu wrote:

> Hello, 
> 
> We have a 6-node ring running 0.8.6 on RHEL 6.1. The first node also runs 
> OpsCenter community. This node has crashed few time recently with 
> "OutOfMemoryError: Java heap space" while several compactions on few 200-300 
> GB SSTables were running. We are using 8GB Java heap on host with 96GB RAM. 
> 
> I would appreciate for help to figure out the root cause and solution.
>  
> Feng Qu
> 
> 
>  INFO [GossipTasks:1] 2012-02-22 13:15:59,135 Gossiper.java (line 697) 
> InetAddress /10.89.74.67 is now dead.
>  INFO [ScheduledTasks:1] 2012-02-22 13:16:12,114 StatusLogger.java (line 65) 
> ReadStage 0 0 0
> ERROR [CompactionExecutor:10538] 2012-02-22 13:16:12,115 
> AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
> Thread[CompactionExecutor:10538,1,
> main]
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:123)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:57)
> at 
> org.apache.cassandra.io.sstable.SSTableReader.getDirectScanner(SSTableReader.java:664)
> at 
> org.apache.cassandra.db.compaction.CompactionIterator.getCollatingIterator(CompactionIterator.java:92)
> at 
> org.apache.cassandra.db.compaction.CompactionIterator.<init>(CompactionIterator.java:68)
> at 
> org.apache.cassandra.db.compaction.CompactionManager.doCompactionWithoutSizeEstimation(CompactionManager.java:553)
> at 
> org.apache.cassandra.db.compaction.CompactionManager.doCompaction(CompactionManager.java:507)
> at 
> org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:142)
> at 
> org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:108)
> at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown 
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
>  INFO [GossipTasks:1] 2012-02-22 13:16:12,115 Gossiper.java (line 697) 
> InetAddress /10.2.128.55 is now dead.
> ERROR [Thread-734] 2012-02-22 13:16:48,189 AbstractCassandraDaemon.java (line 
> 139) Fatal exception in thread Thread[Thread-734,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut 
> down
> at 
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
> at 
> org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at 
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:136)
> ERROR [Thread-68450] 2012-02-22 13:16:48,189 AbstractCassandraDaemon.java 
> (line 139) Fatal exception in thread Thread[Thread-68450,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut 
> down
> at 
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
> at 
> java.util.concurrent.ThreadPoolExecutor.ensureQueuedTaskHandled(Unknown 
> Source)
> at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
> at 
> org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at 
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:136)
> ERROR [Thread-731] 2012-02-22 13:16:48,189 AbstractCassandraDaemon.java (line 
> 139) Fatal exception in thread Thread[Thread-731,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut 
> down
> at 
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThread

Re: Querying all keys in a column family

2012-02-26 Thread aaron morton
When you say query 1 million records, in my mind I'm saying "dump 1 million 
records to another system as a back-office job".
 
Hadoop will split the job over multiple nodes and will assign a task to read 
the range "owned" by each node. From memory it uses CL ONE (by default) for the 
read so the node it is connected to is the only one involved in the read.  Also 
the task can be run on the node rather than off node. 

This does not magic up some new IO capacity though. It spreads the workload, 
so to add IO capacity, add nodes.  

You could do something similar by reducing the CL level and querying through 
the thrift interface. Then only ask a node for data in the key range it "owns". 

If this does not help, the next step is to borrow from the ideas in DataStax 
Brisk (now DataStax Enterprise). Use the NetworkTopologyStrategy and two data 
centres (or a Virtual Data Centre 
http://wiki.apache.org/cassandra/HadoopSupport). 

One DC is for OLTP and the other for OLAP / Export. The OLTP side will be able 
to run without interruption from the OLAP side. 

Another option is to use something like Kafka and fork the data stream, sending 
it to cassandra and the external system at the same time. 
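The range-splitting idea above can be illustrated with a toy ring (simulated tokens and rows, not Cassandra code): each worker scans only the `(previous_token, token]` arc its node owns, so one huge scan becomes N disjoint range queries.

```python
# Rough model of what the Hadoop input format does over the token ring.

def ranges_for_ring(tokens):
    """Each node owns the (previous_token, its_token] arc of the ring."""
    tokens = sorted(tokens)
    prev = tokens[-1]
    out = []
    for t in tokens:
        out.append((prev, t))
        prev = t
    return out

def scan_range(data, rng):
    lo, hi = rng
    if lo < hi:
        return {k: v for k, v in data.items() if lo < k <= hi}
    # The arc belonging to the first token wraps past zero.
    return {k: v for k, v in data.items() if k > lo or k <= hi}

data = {i: "row-%d" % i for i in range(100)}  # 100 rows keyed by token
pieces = [scan_range(data, r) for r in ranges_for_ring([25, 50, 75, 99])]
total = sum(len(p) for p in pieces)
print(total)  # 100: every row is read exactly once across the disjoint scans
```

The scans are disjoint, so they parallelise cleanly; as the message says, the total IO is unchanged, it is just spread across nodes.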

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/02/2012, at 2:21 PM, Martin Arrowsmith wrote:

> Hi Alexandru,
> 
> Things got hectic and I put off the project until this weekend. I'm actually 
> learning about Hadoop right now and how to implement it. I can respond to 
> this thread when I have something running.
> 
> In the meantime, I'd like to bump this email up and see if there are others 
> who can provide some feedback. 1) Will Hadoop speed up the time to read all 
> the rows? 2) Are there other options?
> 
> My guess was that hadoop could split up your jobs, so each node could handle 
> a portion of the query. For instance, having 2 nodes would do the job twice 
> as fast. That is my naive guess though and could be far from the truth.
> 
> Best wishes,
> 
> Martin
> 
> On Fri, Feb 24, 2012 at 5:29 AM, Alexandru Sicoe  wrote:
> Hi Aaron and Martin,
> 
> Sorry about my previous reply, I thought you wanted to process only the 
> row keys in the CF.
> 
> I have a similar issue as Martin because I see myself being forced to hit 
> more than a million rows with a query (I only get a few columns from every 
> row). Aaron, we've talked about this in another thread, basically I am 
> constrained to ship out a window of data from my online cluster to an offline 
> cluster. For this I need to read for example 5 min window of all the data I 
> have. This simply accesses too many rows and I am hitting the I/O limit on 
> the nodes. As I understand it, every row read will do 2 random disk seeks (I 
> have no caches).
> 
> My question is, what can I do to improve the performance of shipping windows 
> of data entirely out?
> 
> Martin, did you use Hadoop as Aaron suggested? How did that work with 
> Cassandra? I don't understand how accessing 1 million rows through map 
> reduce jobs would be any faster.
> 
> Cheers,
> Alexandru
> 
>  
> 
> On Tue, Feb 14, 2012 at 10:00 AM, aaron morton  
> wrote:
> If you want to process 1 million rows use Hadoop with Hive or Pig. If you use 
> Hadoop you are not doing things in real time. 
> 
> You may need to rephrase the problem. 
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
> 
>> Hi Experts,
>> 
>> My program is such that it queries all keys on Cassandra. I want to do this 
>> as quick as possible, in order to get as close to real-time as possible.
>> 
>> One solution I heard was to use the sstables2json tool, and read the data in 
>> as JSON. I understand that reading from each line in Cassandra might take 
>> longer.
>> 
>> Are there any other ideas for doing this ? Or can you confirm that 
>> sstables2json is the way to go.
>> 
>> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to 
>> query a million rows, do some calculations on them, and spit out the result 
>> like it's real time.
>> 
>> Thanks for any help you can give,
>> 
>> Martin
> 
> 
> 



Re: Frequency of Flushing in 1.0

2012-02-26 Thread aaron morton
Nathan Milford has a post about taking a node down 

http://blog.milford.io/2011/11/rolling-upgrades-for-cassandra/

The only thing I would do differently would be to turn off thrift first.

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 27/02/2012, at 4:35 AM, Edward Capriolo wrote:

> If you are doing planned maintenance you can flush first as well,
> ensuring that the commit logs will not be as large.
> 
> On Sun, Feb 26, 2012 at 10:09 AM, Radim Kolar  wrote:
>>> if a node goes down, it will take longer for commitlog replay.
>> 
>> commit log replay time is insignificant. most time during node startup is
>> wasted on index sampling. Index sampling here runs for about 15 minutes.



how to cast traditional sql schema to nosql

2012-02-26 Thread Michael Cherkasov
Hi all,
I'm a newbie in NoSQL and can't work out how to design a NoSQL-style schema.
First, I will describe my problem:
 I need to store the results of tests. Each test consists of a list of
parameters (if two tests have the same list of parameters, they belong to
the same test case), a tag or tags, and a test result.
For example:
Test1 :
 params:
   -user_role:admin
   -miss_captcha:true
   -test_name:login_test
   -locales:en,es,fr   -- as you can see, a parameter can be a list.
testcase: testcase_1_id -- test case id formed as md5 of params.
tags:
   -aaa_site_test
   -smoke
 result:
   -passed
   -some_other_results_stuff( logs, errors' codes and so on )
 start_time: 1330287048  ( time stamp)
Test2 :
 params:
   -user_role:admin
   -miss_captcha:true
   -test_name:login_test
   -locales:en,es,fr   -- as you can see, a parameter can be a list.
 testcase: testcase_1_id -- test case id formed as md5 of params.
 tags:
   -aaa_site_test
   -function_tests
 result:
   -failed
   -some_other_results_stuff( logs, errors' codes and so on )
 start_time: 1330290648
Test3 :
 params:
   -user_role:user
   -miss_captcha:true
   -test_name:change_password
   -locales:en
 testcase: testcase_2_id -- test case id formed as md5 of params.
 tags:
   -bbb_site_test
   -function_tests
 result:
   -failed
   -some_other_results_stuff( logs, errors' codes and so on )
 start_time: 1330290648

So, above you can see 3 tests; the first two belong to the same test case,
but test 1 and test 2 are different test runs, and they also have different
tags. Test 3 is one more test case.
Usually I will need to execute the following queries:
1) Get the latest result for a specific tag/tags, for example:
Get the latest result for aaa_site. The result should be:
Test 2's result, because test 1 and test 2 are the same test case, but test 2
is newer.
2) Or get the latest result for locale == es; the result is test 2.
3) Get the latest results for each test case; the results are: test 2, test 3.
4) Get the history for test case 1; the results are: test 1 and test 2.

I create the following schema:
TestRuns:
*test run id(key) | test case id  | start_time | result id*
test_1_id| testcase_1_id | 1330287048 | result_1
test_2_id| testcase_1_id | 1330290648 | result_2
test_3_id| testcase_2_id | 1330290648  | result_3

Result:
*result id | result_value | other stuff...*
result_1  |  passed  | ...
result_2  |  failed  | ...
result_3  |  failed  | ...

ParamsAndTags: (for tags I put $tag as the tagParamName; the $ prefix avoids a
clash in case we have a parameter actually named 'tag')
*key (not used, but required by cassandra) | test run id | tagParamName | value*
some key | test_1_id | $tag         | aaa_site_test
some key | test_1_id | $tag         | smoke
some key | test_1_id | user_role    | admin
some key | test_1_id | miss_captcha | true
some key | test_1_id | test_name    | login_test
some key | test_1_id | locales      | en   -- list is split
some key | test_1_id | locales      | es   -- list is split
some key | test_1_id | locales      | fr   -- list is split
and so on...


But it looks very heavy to query.
To get the latest result for tag aaa_site_test with locale es I need to
perform the following steps:
Fetch all rows from ParamsAndTags with tag aaa_site_test, then fetch all
rows for param locale == es.
Then find the intersection of the first and second results, which gives me the
test run ids, but this is not the end.
After that I should fetch the test runs and, from the result, keep the latest
results only.
As you can see, for that simple query I have to perform 3 queries to the DB and
do a lot of work inside my application to merge results and filter the latest
results.
I'm afraid it will be too slow.
Can someone advise a more NoSQL-style schema for this task?
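The three-step merge described above is easy to model in memory (data taken from the examples in the message; this just shows the work the application has to do, not a recommended schema):

```python
# In-memory stand-ins for the two column families involved in the query.
test_runs = [  # (run_id, testcase_id, start_time, result)
    ("test_1_id", "testcase_1_id", 1330287048, "passed"),
    ("test_2_id", "testcase_1_id", 1330290648, "failed"),
    ("test_3_id", "testcase_2_id", 1330290648, "failed"),
]
params_and_tags = [  # (run_id, name, value); "$tag" marks a tag row
    ("test_1_id", "$tag", "aaa_site_test"), ("test_1_id", "$tag", "smoke"),
    ("test_2_id", "$tag", "aaa_site_test"), ("test_2_id", "$tag", "function_tests"),
    ("test_3_id", "$tag", "bbb_site_test"),
    ("test_1_id", "locales", "es"), ("test_2_id", "locales", "es"),
]

def runs_where(name, value):
    """Stand-in for one fetch against ParamsAndTags."""
    return {r for r, n, v in params_and_tags if n == name and v == value}

# Steps 1 and 2: two fetches, then intersect the run ids.
candidates = runs_where("$tag", "aaa_site_test") & runs_where("locales", "es")

# Step 3: fetch the runs and keep only the newest run per test case.
latest = {}
for run_id, case_id, ts, result in test_runs:
    if run_id in candidates and (case_id not in latest or ts > latest[case_id][0]):
        latest[case_id] = (ts, run_id, result)

print(latest)  # {'testcase_1_id': (1330290648, 'test_2_id', 'failed')}
```

All of this filtering happens client-side, which is exactly the cost being asked about; a schema that precomputes "latest per tag" rows would move that work to write time.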


Re: Frequency of Flushing in 1.0

2012-02-26 Thread Mohit Anchlia
On Sun, Feb 26, 2012 at 12:18 PM, aaron morton wrote:

> Nathan Milford has a post about taking a node down
>
> http://blog.milford.io/2011/11/rolling-upgrades-for-cassandra/
>
> The only thing I would do differently would be turn off thrift first.
>
> Cheers
>

Isn't decommission meant to do the same thing as disablethrift and gossip?

>
>
>-
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
>  On 27/02/2012, at 4:35 AM, Edward Capriolo wrote:
>
>  If you are doing planned maintenance you can flush first as well,
> ensuring that the commit logs will not be as large.
>
> On Sun, Feb 26, 2012 at 10:09 AM, Radim Kolar  wrote:
>
> if a node goes down, it will take longer for commitlog replay.
>
>
> commit log replay time is insignificant. most time during node startup is
>
> wasted on index sampling. Index sampling here runs for about 15 minutes.
>
>
>


CounterColumn java.lang.AssertionError: Wrong class type.

2012-02-26 Thread Gary Ogasawara
Using v1.0.7, we see many of the following errors.
Any thoughts on why this is occurring?
Thanks in advance.
 -gary

ERROR [ReadRepairStage:9] 2012-02-24 18:31:28,623
AbstractCassandraDaemon.java
(line 139) Fatal exception in thread Thread[ReadRepairStage:9,5,main]
java.lang.AssertionError: Wrong class type.
at
org.apache.cassandra.db.CounterColumn.diff(CounterColumn.java:112)
at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:230)
at org.apache.cassandra.db.ColumnFamily.diff(ColumnFamily.java:309)
at
org.apache.cassandra.service.RowRepairResolver.scheduleRepairs(RowRepairReso
lver.java:117)
at
org.apache.cassandra.service.RowRepairResolver.resolve(RowRepairResolver.jav
a:94)
at
org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCa
llback.java:54)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
10)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
03)
at java.lang.Thread.run(Thread.java:722)
ERROR [ReadRepairStage:9] 2012-02-24 18:31:28,625
AbstractCassandraDaemon.java
(line 139) Fatal exception in thread Thread[ReadRepairStage:9,5,main]
--

From cassandra-cli "show schema", I think the relevant CF is:

create column family QOSCounters
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'CounterColumnType'
  and key_validation_class = 'UTF8Type'
  and rows_cached = 0.0
  and row_cache_save_period = 0
  and row_cache_keys_to_save = 2147483647
  and keys_cached = 20.0
  and key_cache_save_period = 14400
  and read_repair_chance = 1.0
  and gc_grace = 604800
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and row_cache_provider = 'SerializingCacheProvider'
  and compaction_strategy =
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy';




Re: Frequency of Flushing in 1.0

2012-02-26 Thread Xaero S
The challenge that we face is that our commitlog disk capacity is much
less (under 10 GB in some cases) than the disk capacity for SSTables. So we
cannot really have the commit log data continuously growing. This is the
reason that we need to be able to tune the way we flush the memtables.
From this link -
http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management,
it looks like "commitlog_total_space_in_mb" is the parameter to control
the rate at which memtables get flushed. It also seems
"memtable_total_space_in_mb" is another setting to play with.
We are planning to do some load testing with changes to these two settings,
but can anyone confirm that I am headed in the right direction? Or any
other pointers on this?


On Sun, Feb 26, 2012 at 5:26 PM, Mohit Anchlia wrote:

>
>
> On Sun, Feb 26, 2012 at 12:18 PM, aaron morton wrote:
>
>> Nathan Milford has a post about taking a node down
>>
>> http://blog.milford.io/2011/11/rolling-upgrades-for-cassandra/
>>
>> The only thing I would do differently would be turn off thrift first.
>>
>> Cheers
>>
>
> Isn't decommission meant to do the same thing as disablethrift and gossip?
>
>>
>>
>>-
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>>  On 27/02/2012, at 4:35 AM, Edward Capriolo wrote:
>>
>> If you are doing planned maintenance you can flush first as well,
>> ensuring that the commit logs will not be as large.
>>
>> On Sun, Feb 26, 2012 at 10:09 AM, Radim Kolar  wrote:
>>
>> if a node goes down, it will take longer for commitlog replay.
>>
>>
>> commit log replay time is insignificant. most time during node startup is
>>
>> wasted on index sampling. Index sampling here runs for about 15 minutes.
>>
>>
>>
>


Cassandra 1.1 beta on Maven?

2012-02-26 Thread Praveen Sadhu
Hi,

I could not find Cassandra 1.1 jars in the Maven repo. Could a beta version be 
released?

Thanks,
Praveen


Re: Frequency of Flushing in 1.0

2012-02-26 Thread Peter Schuller
>> if a node goes down, it will take longer for commitlog replay.
>
> commit log replay time is insignificant. most time during node startup is
> wasted on index sampling. Index sampling here runs for about 15 minutes.

Depends entirely on your situation. If you have few keys and lots of
writes, index sampling will be insignificant.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


RE: Combining Cassandra with some SQL language

2012-02-26 Thread Sanjay Sharma
Kundera (https://github.com/impetus-opensource/Kundera), an open-source, 
Apache-licensed Java ORM, allows polyglot persistence between RDBMS and NoSQL 
databases such as Cassandra, MongoDB, HBase etc., transparently to the business 
logic developer.

A note of caution: this does not mean that Cassandra data modeling can be 
bypassed. NoSQL entities still need to be modeled in such a way as to best 
use Cassandra's capabilities.
Kundera can also take care of relationships between the entities in the RDBMS.  
Transaction management is still pending, however.


Regards,
Sanjay
From: Adam Haney [mailto:adam.ha...@retickr.com]
Sent: Sunday, February 26, 2012 7:51 PM
To: user@cassandra.apache.org
Subject: Re: Combining Cassandra with some SQL language

I've been using a combination of MySQL and Cassandra for about a year now on a 
project that now serves about 20k users. We use Cassandra for storing large 
entities and MySQL to store meta data that allows us to do better ad hoc 
querying. It's worked quite well for us. During this time we have also been 
able to migrate some of our tables in MySQL to Cassandra if MySQL performance / 
capacity became a problem. This may seem obvious but if you're planning on 
creating a data model that spans multiple databases make sure you encapsulate 
the logic to read/write/delete information in a good data model library and 
only use that library to access your data. This is good practice anyway but 
when you add the extra complication of multiple databases that may reference 
one another it's an absolute must.
On Sun, Feb 26, 2012 at 8:06 AM, R. Verlangen <ro...@us2.nl> wrote:
Hi there,

I'm currently busy with the technical design of a new project. Of course it 
will depend on your needs, but is it weird to combine Cassandra with a SQL 
language like MySQL?

In my use case it would be nice because we have some tables/CF's with lots and 
lots of data that does not really have to be 100% consistent, but also have 
some data that should always be consistent.

What do you think of this?
With kind regards,
Robin Verlangen




Impetus' Head of Innovation labs, Vineet Tyagi will be presenting on 'Big Data 
Big Costs?' at the Strata Conference, CA (Feb 28 - Mar 1) http://bit.ly/bSMWd7.

Listen to our webcast 'Hybrid Approach to Extend Web Apps to Tablets & 
Smartphones' available at http://bit.ly/yQC1oD.


NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


Re: newer Cassandra + Hadoop = TimedOutException()

2012-02-26 Thread Patrik Modesto
On Sun, Feb 26, 2012 at 04:25, Edward Capriolo  wrote:
> Did you see the notes here?

I'm not sure what you mean by "the notes".

I'm using the mapred.* settings suggested there:

 <property>
   <name>mapred.max.tracker.failures</name>
   <value>20</value>
 </property>
 <property>
   <name>mapred.map.max.attempts</name>
   <value>20</value>
 </property>
 <property>
   <name>mapred.reduce.max.attempts</name>
   <value>20</value>
 </property>

But I still see timeouts that I didn't see with cassandra-all 0.8.7.

P.

> http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting


Re: Combining Cassandra with some SQL language

2012-02-26 Thread R. Verlangen
Ok, thank you all for your opinions. Seems that I can continue without any
extra db-model headaches ;-)

2012/2/27 Sanjay Sharma 

>  Kundera (https://github.com/impetus-opensource/Kundera)- an open source
> APL Java ORM allows polyglot persistence between  RDBMS and NoSQL databases
> such as Cassandra, MongoDB, HBase etc. transparently to the business logic
> developer.
>
>
>
> A note of caution- this does not mean that Cassandra data modeling can be
> bypassed- NoSQL entities still need to be modeled in such a way so as to
> best use Cassandra capabilities.
>
> Kundera can also take care of relationship between the entities in RDBMS.
>  Transactions management is still pending however.
>
>
>
>
>
> Regards,
>
> Sanjay
>
> *From:* Adam Haney [mailto:adam.ha...@retickr.com]
> *Sent:* Sunday, February 26, 2012 7:51 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Combining Cassandra with some SQL language
>
>
>
> I've been using a combination of MySQL and Cassandra for about a year now
> on a project that now serves about 20k users. We use Cassandra for storing
> large entities and MySQL to store meta data that allows us to do better ad
> hoc querying. It's worked quite well for us. During this time we have also
> been able to migrate some of our tables in MySQL to Cassandra if MySQL
> performance / capacity became a problem. This may seem obvious but if
> you're planning on creating a data model that spans multiple databases make
> sure you encapsulate the logic to read/write/delete information in a good
> data model library and only use that library to access your data. This is
> good practice anyway but when you add the extra complication of multiple
> databases that may reference one another it's an absolute must.
>
> On Sun, Feb 26, 2012 at 8:06 AM, R. Verlangen  wrote:
>
> Hi there,
>
>
>
> I'm currently busy with the technical design of a new project. Of course
> it will depend on your needs, but is it weird to combine Cassandra with a
> SQL language like MySQL?
>
>
>
> In my use case it would be nice because we have some tables/CF's with lots
> and lots of data that does not really have to be 100% consistent, but also
> have some data that should always be consistent.
>
>
>
> What do you think of this?
>
> With kind regards,
>
> Robin Verlangen
>
>
>
>