Re: Experiences with Map&Reduce Stress Tests

2011-05-03 Thread Subscriber
Hi Jeremy, 

yes, the setup on the data-nodes is:
- Hadoop DataNode
- Hadoop TaskTracker
- CassandraDaemon
 
However - the map input is not read from Cassandra. I am running a write-only 
stress test - no reads (well, from time to time I check the produced items using 
cassandra-cli).
Is it possible to achieve data locality on writes? I think that this is not 
possible in practice (one could create some artificial data that correlates 
with the hashed row-key values, or so ... ;-)

Thanks for all your tips and hints! It's good to see that someone worries about 
my problems :-)
But - to be honest - my number one priority is not to get this test running but 
to answer the question whether the setup Cassandra+Hadoop with massively 
parallel writes (using map/reduce) meets the demands of our customer.

I found out that the following configuration helps a lot. 
 * disk_access_mode: standard 
 * MAX_HEAP_SIZE="4G"
 * HEAP_NEWSIZE="400M"
 * rpc_timeout_in_ms: 20000

Now the stress test runs through, but there are still timeouts (Hadoop 
reschedules the failing mapper tasks on another node and so the test runs 
through).
But what causes these timeouts? 20 seconds is a long time for a modern CPU (and 
an eternity for an Android ;-) 

It seems to me that it's not only the massive amount of data or too many 
parallel mappers, because Cassandra can handle this huge write rate for over an 
hour! 
I found in the system logs that the ConcurrentMarkSweep collections take quite 
long (up to 8 seconds). The heap size didn't grow much above 3GB, so there was 
still "enough air to breathe".

So the question remains: can I recommend this setup?

Thanks again and best regards
Udo


On 2 May 2011, at 20:21, Jeremy Hanna wrote:

> Udo,
> 
> One thing to get out of the way - you're running task trackers on all of your 
> cassandra nodes, right?  That is the first and foremost way to get good 
> performance.  Otherwise you don't have data locality, which is really the 
> point of map/reduce, co-locating your data and your processes operating over 
> that data.  You're probably already doing that, but I had forgotten to ask 
> that before.
> 
> Besides that...
> 
> You might try messing with those values a bit more as well as the input split 
> size - cassandra.input.split.size which defaults to ~65k.  So you might try 
> rpc timeout of 30s just to see if that helps and try reducing the input split 
> size significantly to see if that helps.
> 
> For your setup I don't see the range batch size as being meaningful at all 
> with your narrow rows, so don't worry about that.
> 
> Also, the capacity of your nodes and the number of mappers/reducers you're 
> trying to use will also have an effect on whether it has to timeout.  
> Essentially it's getting overwhelmed for some reason.  You might lower the 
> number of mappers and reducers you're hitting your cassandra cluster with to 
> see if that helps.
> 
> Jeremy
> 
> On May 2, 2011, at 6:25 AM, Subscriber wrote:
> 
>> Hi Jeremy, 
>> 
>> thanks for the link.
>> I doubled the rpc_timeout (20 seconds) and reduced the range-batch-size to 
>> 2048, but I still get timeouts...
>> 
>> Udo
>> 
On 29 April 2011, at 18:53, Jeremy Hanna wrote:
>> 
>>> It sounds like there might be some tuning you can do to your jobs - take a 
>>> look at the wiki's HadoopSupport page, specifically the Troubleshooting 
>>> section:
>>> http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting
>>> 
>>> On Apr 29, 2011, at 11:45 AM, Subscriber wrote:
>>> 
 Hi all, 
 
 We want to share our experiences we got during our Cassandra plus Hadoop 
 Map/Reduce evaluation.
 Our question was whether Cassandra is suitable for massive distributed 
 data writes using Hadoop's Map/Reduce feature.
 
 Our setup is described in the attached file 'cassandra_stress_setup.txt'.
 
 
 
 The stress test uses 800 map-tasks to generate data and store it into 
 cassandra.
 Each map task writes 500,000 items (i.e. rows), resulting in a total of 
 400,000,000 items. 
 There are at most 8 map tasks in parallel on each node. An item contains 
 (besides the key) two long and two double values, 
 so that items are a few hundred bytes in size. This leads to a total data size 
 of approximately 120GB.
 
 The map tasks use the Hector API. Hector is fed with all three data 
 nodes. The data is written in chunks of 1000 items.
 The ConsistencyLevel is set to ONE.
 
 We ran the stress tests in several runs with different configuration 
 settings (for example I started with cassandra's default configuration and 
 I used Pelops for another test).
 
 Our observations are as follows:
 
 1) Cassandra is really fast - we are really impressed by the huge write 
 throughput. A map task writing 500,000 items (approx. 200MB) usually 
 finishes in under 5 minutes.
 2) However - unfortunately, all tests failed in the end
 
 

Using snapshot for backup and restore

2011-05-03 Thread Arsene Lee
Hi,

We are trying to use snapshot for backup and restore. We found out that 
snapshot doesn't take secondary indexes.
We are wondering why that is. And is there any way we can rebuild the secondary 
index?

Regards,

Arsene


Re: Unable to add columns to empty row in Column family: Cassandra

2011-05-03 Thread aaron morton
If you are still having problems, can you say what version, how many nodes, 
what RF, what CL, and whether, after inserting and failing on the first get, it 
works on a subsequent get. 


Thanks
Aaron

On 3 May 2011, at 18:54, chovatia jaydeep wrote:

> One small correction to my mail below: 
> the second insertion's timestamp has to be greater than the delete timestamp 
> in order to retrieve the data.
> 
> Thank you,
> Jaydeep
> From: chovatia jaydeep 
> To: "user@cassandra.apache.org" 
> Sent: Monday, 2 May 2011 11:52 PM
> Subject: Re: Unable to add columns to empty row in Column family: Cassandra
> 
> Hi Anuya,
> 
> > However, columns are not being inserted.
> 
> Do you mean to say that after the insert operation you couldn't retrieve the 
> same data? If so, then please check the timestamp used when you reinserted 
> after the delete operation. Your second insertion's timestamp has to be 
> greater than that of the previous insertion.
> 
> Thank you,
> Jaydeep
> From: anuya joshi 
> To: user@cassandra.apache.org
> Sent: Monday, 2 May 2011 11:34 PM
> Subject: Re: Unable to add columns to empty row in Column family: Cassandra
> 
> Hello,
> 
> I am using Cassandra for my application. My Cassandra client uses the Thrift 
> API directly. The problem I am facing currently is as follows:
> 
> 1) I added a row and columns in it dynamically via the Thrift API client
> 2) Next, I used the command-line client to delete the row, which actually 
> deleted all the columns in it, leaving an empty row with the original row ID.
> 3) Now, I am trying to add columns dynamically using the client program into 
> this empty row with the same row key.
> However, the columns are not being inserted.
> But when tried from the command-line client, it worked correctly.
> 
> Any pointer on this would be of great use.
> 
> Thanks in advance,
> 
> Regards,
> Anuya
> 
> 
> 
> 
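
A minimal sketch of the timestamp rule described above, assuming pycassa with a hypothetical keyspace 'Keyspace1' and column family 'MyCF' (the thread itself uses raw Thrift): a re-insert only becomes visible if its timestamp is greater than the timestamp of the earlier delete.

import time
import pycassa

pool = pycassa.connect('Keyspace1', ['localhost:9160'])  # assumed keyspace and host
cf = pycassa.ColumnFamily(pool, 'MyCF')                  # assumed column family

def micros_now():
    # Cassandra convention: timestamps are microseconds since the epoch
    return int(time.time() * 1e6)

# If the row was deleted (e.g. via cassandra-cli) at timestamp T, a re-insert
# carrying a timestamp <= T stays hidden behind the tombstone. Supplying an
# explicit, current timestamp avoids surprises from differing client clocks:
cf.insert('row1', {'col1': 'value1'}, timestamp=micros_now())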



One cluster or many?

2011-05-03 Thread David Boxenhorn
If I have a database that partitions naturally into non-overlapping
datasets, in which there are no references between datasets, where each
dataset is quite large (i.e. large enough to merit its own cluster from the
point of view of quantity of data), should I set up one cluster per database
or one large cluster for everything together?

As I see it:

The primary advantage of separate clusters is total isolation: if I have a
problem with one dataset, my application will continue working normally for
all other datasets.

The primary advantage of one big cluster is usage pooling: when one server
goes down in a large cluster it's much less important than when one server
goes down in a small cluster. Also, different temporal usage patterns of the
different datasets (i.e. there will be different peak hours on different
datasets) can be combined to ease capacity requirements.

Any thoughts?


low performance inserting

2011-05-03 Thread charles THIBAULT
Hello everybody,

first: sorry for my english in advance!!

I'm getting started with Cassandra on a 5 node cluster, inserting data
with the pycassa API.

I've read everywhere on the internet that Cassandra's write performance is
better than MySQL's
because writes are append-only into commit log files.

When I try to insert 100 000 rows with 10 columns per row with a batch
insert, I get this result: 27 seconds.
But with MySQL (load data infile) this takes only 2 seconds (using indexes).

Here my configuration

cassandra version: 0.7.5
nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213,
192.168.1.214
seed: 192.168.1.210

My script
*
#!/usr/bin/env python

import pycassa
import time
import random
from cassandra import ttypes

pool = pycassa.connect('test', ['192.168.1.210:9160'])
cf = pycassa.ColumnFamily(pool, 'test')
b = cf.batch(queue_size=50,
write_consistency_level=ttypes.ConsistencyLevel.ANY)

tps1 = time.time()
for i in range(100000):
    columns = dict()
    for j in range(10):
        columns[str(j)] = str(random.randint(0,100))
    b.insert(str(i), columns)
b.send()
tps2 = time.time()


print("execution time: " + str(tps2 - tps1) + " seconds")
*

what am I doing wrong?


unsubscribe

2011-05-03 Thread Brendan Poole

 


Problems recovering a dead node

2011-05-03 Thread Héctor Izquierdo Seliva
Hi everyone. One of the nodes in my 6 node cluster died with disk
failures. I have replaced the disks, and it's clean. It has the same
configuration (same ip, same token).

When I try to restart the node it starts to throw mmap underflow
exceptions till it closes again.

I tried setting disk_access_mode to standard, but it still fails. It gives 
errors about two decorated keys being different, and an EOFException.

Here is an excerpt of the log

http://pastebin.com/ZXW1wY6T

I can provide more info if needed. I'm at a loss here so any help is
appreciated.

Thanks all for your time

Héctor Izquierdo



Re: Replica data distributing between racks

2011-05-03 Thread aaron morton
I've been digging into this and was able to reproduce something; not sure if 
it's a fault, and I can't work on it any more tonight. 


To reproduce:
- 2 node cluster on my mac book
- set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g. node 1 
with 85070591730234615865843651857942052864 and node 2 
127605887595351923798765477786913079296 
- set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2
- create a keyspace using NTS and strategy_options = [{DC1:1}]

Inserted 10 rows they were distributed as 
- node 1 - 9 rows 
- node 2 - 1 row

I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It often 
says the closest token to a key is node 1 because, in effect...

- node 1 is responsible for 0 to 85070591730234615865843651857942052864
- node 2 is responsible for 85070591730234615865843651857942052864 to 
127605887595351923798765477786913079296
- AND node 1 does the wrap around from 127605887595351923798765477786913079296 
to 0 as keys that would insert past the last token in the ring array wrap to 0 
because  insertMin is false. 

Thoughts ? 

Aaron


On 3 May 2011, at 10:29, Eric tamme wrote:

> On Mon, May 2, 2011 at 5:59 PM, aaron morton  wrote:
>> My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() 
>> work.
>> 
>> Eric, can you show the output from nodetool ring ?
>> 
>> 
> 
> Sorry if the previous paste was way to unformatted, here is a
> pastie.org link with nicer formatting of nodetool ring output than
> plain text email allows.
> 
> http://pastie.org/private/50khpakpffjhsmgf66oetg



Re: Using snapshot for backup and restore

2011-05-03 Thread aaron morton
Looking at the code for the snapshot it looks like it does not include 
secondary indexes. And I cannot see a way to manually trigger an index rebuild 
(via CFS.buildSecondaryIndexes())

Looking at this it's probably handy to snapshot them 
https://issues.apache.org/jira/browse/CASSANDRA-2470

I'm not sure if there is a reason for excluding them. Is this causing a problem 
right now ?

Aaron



On 3 May 2011, at 20:22, Arsene Lee wrote:

> Hi,
>  
> We are trying to use snapshot for backup and restore. We found out that 
> snapshot doesn’t take secondary indexes.
> We are wondering why is that? And is there any way we can rebuild the 
> secondary index?
>  
> Regards,
>  
> Arsene



Write performance help needed

2011-05-03 Thread Steve Smith
I am working for a client that needs to persist 100K-200K records per second
for later querying.  As a proof of concept, we are looking at several
options including nosql (Cassandra and MongoDB).

I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz,
Dual Core/4 logical cores) and have not been happy with the results.

The best I have been able to accomplish is 100K records in approximately 30
seconds.  Each record has 30 columns, mostly made up of integers.  I have
tried both the Hector and Pelops APIs, and have tried writing in batches
versus one at a time.  The times have not varied much.

I am using the out of the box configuration for Cassandra, and while I know
using 1 disk will have an impact on performance, I would expect to see
better write numbers than I am.

As a point of reference, the same test using MongoDB I was able to
accomplish 100K records in 3.5 seconds.

Any tips would be appreciated.

- Steve


Re: low performance inserting

2011-05-03 Thread Roland Gude
Hi,
Not sure this is the case for your bad performance, but you are measuring data 
creation and insertion together. Your data creation involves lots of class 
casts, which are probably quite slow.
Try timing only the b.send part and see how long that takes. 

Roland

On 3 May 2011, at 12:30, "charles THIBAULT"  wrote:

> Hello everybody, 
> 
> first: sorry for my english in advance!!
> 
> I'm getting started with Cassandra on a 5 nodes cluster inserting data
> with the pycassa API.
> 
> I've read everywere on internet that cassandra's performance are better than 
> MySQL
> because of the writes append's only into commit logs files.
> 
> When i'm trying to insert 100 000 rows with 10 columns per row with batch 
> insert, I'v this result: 27 seconds
> But with MySQL (load data infile) this take only 2 seconds (using indexes)
> 
> Here my configuration
> 
> cassandra version: 0.7.5
> nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213, 
> 192.168.1.214
> seed: 192.168.1.210
> 
> My script
> *
> #!/usr/bin/env python
> 
> import pycassa
> import time
> import random
> from cassandra import ttypes
> 
> pool = pycassa.connect('test', ['192.168.1.210:9160'])
> cf = pycassa.ColumnFamily(pool, 'test')
> b = cf.batch(queue_size=50, 
> write_consistency_level=ttypes.ConsistencyLevel.ANY)
> 
> tps1 = time.time()
> for i in range(100000):
> columns = dict()
> for j in range(10):
> columns[str(j)] = str(random.randint(0,100))
> b.insert(str(i), columns)
> b.send()
> tps2 = time.time()
> 
> 
> print("execution time: " + str(tps2 - tps1) + " seconds")
> *
> 
> what I'm doing rong ?


Re: Write performance help needed

2011-05-03 Thread Eric tamme
Use more nodes to increase your write throughput.  Testing on a single
machine is not really a viable benchmark for what you can achieve with
cassandra.


RE: Using snapshot for backup and restore

2011-05-03 Thread Arsene Lee
If snapshot doesn't include secondary indexes, then we can't use it for our 
backup and restore procedure.
This means we need to stop our service when we want to do backups, and that 
would cause longer system downtime.

If there is no particular reason, it is probably a good idea to also include 
secondary indexes when taking the snapshot.


Arsene


From: aaron morton [aa...@thelastpickle.com]
Sent: Tuesday, May 03, 2011 7:28 PM
To: user@cassandra.apache.org
Subject: Re: Using snapshot for backup and restore

Looking at the code for the snapshot it looks like it does not include 
secondary indexes. And I cannot see a way to manually trigger an index rebuild 
(via CFS.buildSecondaryIndexes())

Looking at this it's probably handy to snapshot them 
https://issues.apache.org/jira/browse/CASSANDRA-2470

I'm not sure if there is a reason for excluding them. Is this causing a problem 
right now ?

Aaron



On 3 May 2011, at 20:22, Arsene Lee 
wrote:

Hi,

We are trying to use snapshot for backup and restore. We found out that 
snapshot doesn’t take secondary indexes.
We are wondering why is that? And is there any way we can rebuild the secondary 
index?

Regards,

Arsene




Re: low performance inserting

2011-05-03 Thread Sylvain Lebresne
There is probably a fair number of things you'd have to make sure you do to
improve the write performance on the Cassandra side (starting by using multiple
threads to do the insertion), but the first thing is probably to start
comparing things
that are at least mildly comparable. If you do inserts in Cassandra,
you should try
to do inserts in MySQL too, not "load data infile" (which really is
just a bulk loading
utility). And as stated here
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html:
"When loading a table from a text file, use LOAD DATA INFILE. This is
usually 20 times
faster than using INSERT statements."

--
Sylvain
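
A minimal sketch of the first suggestion above - using multiple threads to do the insertion - built from the same pycassa connection and batch calls as the quoted script; the thread count and per-thread key ranges are only illustrative.

import threading
import random
import pycassa
from cassandra import ttypes

def insert_range(start, stop):
    # one connection pool and one batch per thread
    pool = pycassa.connect('test', ['192.168.1.210:9160'])
    cf = pycassa.ColumnFamily(pool, 'test')
    b = cf.batch(queue_size=50,
                 write_consistency_level=ttypes.ConsistencyLevel.ANY)
    for i in range(start, stop):
        columns = dict((str(j), str(random.randint(0, 100))) for j in range(10))
        b.insert(str(i), columns)
    b.send()

# 4 threads x 25 000 rows = 100 000 rows; key ranges do not overlap between threads
threads = [threading.Thread(target=insert_range, args=(t * 25000, (t + 1) * 25000))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each thread gets its own pool and batch, so both the client-side work and the server-side writes are parallelised.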

On Tue, May 3, 2011 at 12:30 PM, charles THIBAULT
 wrote:
> Hello everybody,
>
> first: sorry for my english in advance!!
>
> I'm getting started with Cassandra on a 5 nodes cluster inserting data
> with the pycassa API.
>
> I've read everywere on internet that cassandra's performance are better than
> MySQL
> because of the writes append's only into commit logs files.
>
> When i'm trying to insert 100 000 rows with 10 columns per row with batch
> insert, I'v this result: 27 seconds
> But with MySQL (load data infile) this take only 2 seconds (using indexes)
>
> Here my configuration
>
> cassandra version: 0.7.5
> nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213,
> 192.168.1.214
> seed: 192.168.1.210
>
> My script
> *
> #!/usr/bin/env python
>
> import pycassa
> import time
> import random
> from cassandra import ttypes
>
> pool = pycassa.connect('test', ['192.168.1.210:9160'])
> cf = pycassa.ColumnFamily(pool, 'test')
> b = cf.batch(queue_size=50,
> write_consistency_level=ttypes.ConsistencyLevel.ANY)
>
> tps1 = time.time()
> for i in range(100000):
>     columns = dict()
>     for j in range(10):
>         columns[str(j)] = str(random.randint(0,100))
>     b.insert(str(i), columns)
> b.send()
> tps2 = time.time()
>
>
> print("execution time: " + str(tps2 - tps1) + " seconds")
> *
>
> what I'm doing rong ?
>


Re: Replica data distributing between racks

2011-05-03 Thread Jonathan Ellis
Right, when you are computing balanced RP tokens for NTS you need to
compute the tokens for each DC independently.

On Tue, May 3, 2011 at 6:23 AM, aaron morton  wrote:
> I've been digging into this and worked was able to reproduce something, not 
> sure if it's a fault and I can't work on it any more tonight.
>
>
> To reproduce:
> - 2 node cluster on my mac book
> - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g. node 
> 1 with 85070591730234615865843651857942052864 and node 2 
> 127605887595351923798765477786913079296
> - set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2
> - create a keyspace using NTS and strategy_options = [{DC1:1}]
>
> Inserted 10 rows they were distributed as
> - node 1 - 9 rows
> - node 2 - 1 row
>
> I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It 
> often says the closest token to a key is the node 1 because in effect...
>
> - node 1 is responsible for 0 to 85070591730234615865843651857942052864
> - node 2 is responsible for 85070591730234615865843651857942052864 to 
> 127605887595351923798765477786913079296
> - AND node 1 does the wrap around from 
> 127605887595351923798765477786913079296 to 0 as keys that would insert past 
> the last token in the ring array wrap to 0 because  insertMin is false.
>
> Thoughts ?
>
> Aaron
>
>
> On 3 May 2011, at 10:29, Eric tamme wrote:
>
>> On Mon, May 2, 2011 at 5:59 PM, aaron morton  wrote:
>>> My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() 
>>> work.
>>>
>>> Eric, can you show the output from nodetool ring ?
>>>
>>>
>>
>> Sorry if the previous paste was way to unformatted, here is a
>> pastie.org link with nicer formatting of nodetool ring output than
>> plain text email allows.
>>
>> http://pastie.org/private/50khpakpffjhsmgf66oetg
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Write performance help needed

2011-05-03 Thread Jonathan Ellis
You don't give many details, but I would guess:

- your benchmark is not multithreaded
- mongodb is not configured for durable writes, so you're really only
measuring the time for it to buffer it in memory
- you haven't loaded enough data to hit "mongo's index doesn't fit in
memory anymore"

On Tue, May 3, 2011 at 8:24 AM, Steve Smith  wrote:
> I am working for client that needs to persist 100K-200K records per second
> for later querying.  As a proof of concept, we are looking at several
> options including nosql (Cassandra and MongoDB).
> I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz,
> Dual Core/4 logical cores) and have not been happy with the results.
> The best I have been able to accomplish is 100K records in approximately 30
> seconds.  Each record has 30 columns, mostly made up of integers.  I have
> tried both the Hector and Pelops APIs, and have tried writing in batches
> versus one at a time.  The times have not varied much.
> I am using the out of the box configuration for Cassandra, and while I know
> using 1 disk will have an impact on performance, I would expect to see
> better write numbers than I am.
> As a point of reference, the same test using MongoDB I was able to
> accomplish 100K records in 3.5 seconds.
> Any tips would be appreciated.
>
> - Steve
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


RE: Replica data distributing between racks

2011-05-03 Thread Jeremiah Jordan
So we are currently running a 10 node ring in one DC, and we are going to be 
adding 5 more nodes
in another DC.  To keep the rings in each DC balanced, should I really 
calculate the tokens independently
and just make sure none of them are the same? Something like:

DC1 (RF 5):
1:  0
2:  17014118346046923173168730371588410572
3:  34028236692093846346337460743176821144
4:  51042355038140769519506191114765231716
5:  68056473384187692692674921486353642288
6:  85070591730234615865843651857942052860
7:  102084710076281539039012382229530463432
8:  119098828422328462212181112601118874004
9:  136112946768375385385349842972707284576
10: 153127065114422308558518573344295695148

DC2 (RF 3):
1:  1 (one off from DC1 node 1)
2:  34028236692093846346337460743176821145 (one off from DC1 node 3)
3:  68056473384187692692674921486353642290 (two off from DC1 node 5)
4:  102084710076281539039012382229530463435 (three off from DC1 node 7)
5:  136112946768375385385349842972707284580 (four off from DC1 node 9)

Originally I was thinking I should spread the DC2 nodes evenly in between every 
other DC1 node.
Or does it not matter where they are with respect to the DC1 nodes, as long as 
they fall somewhere
after every other DC1 node? So it is DC1-1, DC2-1, DC1-2, DC1-3, DC2-2, DC1-4, 
DC1-5...
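
A minimal sketch of computing the tokens for each DC independently, assuming the RandomPartitioner's 0..2**127 token space; tokens that collide with another DC are bumped by one, which reproduces the DC1/DC2 lists above.

RING_SIZE = 2 ** 127

def balanced_tokens(node_count):
    # evenly spaced RandomPartitioner tokens for one data center
    return [i * RING_SIZE // node_count for i in range(node_count)]

dc1 = balanced_tokens(10)
dc2 = []
used = set(dc1)
for token in balanced_tokens(5):
    while token in used:      # bump any token that collides with another DC
        token += 1
    used.add(token)
    dc2.append(token)

# dc1[1] == 17014118346046923173168730371588410572
# dc2    == [1, 34028236692093846346337460743176821145, ...]  (matches the list above)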

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Tuesday, May 03, 2011 9:14 AM
To: user@cassandra.apache.org
Subject: Re: Replica data distributing between racks

Right, when you are computing balanced RP tokens for NTS you need to compute 
the tokens for each DC independently.

On Tue, May 3, 2011 at 6:23 AM, aaron morton  wrote:
> I've been digging into this and worked was able to reproduce something, not 
> sure if it's a fault and I can't work on it any more tonight.
>
>
> To reproduce:
> - 2 node cluster on my mac book
> - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, 
> e.g. node 1 with 85070591730234615865843651857942052864 and node 2 
> 127605887595351923798765477786913079296
> - set cassandra-topology.properties to put the nodes in DC1 on RAC1 
> and RAC2
> - create a keyspace using NTS and strategy_options = [{DC1:1}]
>
> Inserted 10 rows they were distributed as
> - node 1 - 9 rows
> - node 2 - 1 row
>
> I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It 
> often says the closest token to a key is the node 1 because in effect...
>
> - node 1 is responsible for 0 to 
> 85070591730234615865843651857942052864
> - node 2 is responsible for 85070591730234615865843651857942052864 to 
> 127605887595351923798765477786913079296
> - AND node 1 does the wrap around from 
> 127605887595351923798765477786913079296 to 0 as keys that would insert past 
> the last token in the ring array wrap to 0 because  insertMin is false.
>
> Thoughts ?
>
> Aaron
>
>
> On 3 May 2011, at 10:29, Eric tamme wrote:
>
>> On Mon, May 2, 2011 at 5:59 PM, aaron morton  wrote:
>>> My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex() 
>>> work.
>>>
>>> Eric, can you show the output from nodetool ring ?
>>>
>>>
>>
>> Sorry if the previous paste was way to unformatted, here is a 
>> pastie.org link with nicer formatting of nodetool ring output than 
>> plain text email allows.
>>
>> http://pastie.org/private/50khpakpffjhsmgf66oetg
>
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support 
http://www.datastax.com


IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
Running a 3 node cluster with cassandra-0.8.0-beta1 

I'm seeing the first node logging lines like the following many thousands of times:


Caused by: java.io.IOException: Unable to create hard link
from 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db
 to 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db
 (errno 17)


This seems to happen for all column families (including system).
It happens a lot during startup.

The hardlinks do exist. Stopping, deleting the hardlinks, and starting
again does not help.

But I haven't seen it once on the other nodes...

~mck


ps the stacktrace


java.io.IOError: java.io.IOException: Unable to create hard link from 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 to 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 (errno 17)
at 
org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1629)
at 
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1654)
at org.apache.cassandra.db.Table.snapshot(Table.java:198)
at 
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:504)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Unable to create hard link from 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 to 
/iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
 (errno 17)
at org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:155)
at 
org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:713)
at 
org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1622)
... 10 more





Re: Replica data distributing between racks

2011-05-03 Thread Eric tamme
On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis  wrote:
> Right, when you are computing balanced RP tokens for NTS you need to
> compute the tokens for each DC independently.

I am confused ... sorry.  Are you saying that ... I need to change how
my keys are calculated to fix this problem?  Or are you talking about
the implementation of how replication selects a token?

-Eric


Re: low performance inserting

2011-05-03 Thread charles THIBAULT
Hi Sylvain,

thanks for your answer.

I made a test with the stress utility, inserting 100 000 rows with 10
columns per row.
I used these options: -o insert -t 5 -n 100000 -c 10 -d
192.168.1.210,192.168.1.211,...
Result: 161 seconds.

With MySQL using inserts (after a dump): 1.79 seconds.

Charles

2011/5/3 Sylvain Lebresne 

> There is probably a fair number of things you'd have to make sure you do to
> improve the write performance on the Cassandra side (starting by using
> multiple
> threads to do the insertion), but the first thing is probably to start
> comparing things
> that are at least mildly comparable. If you do inserts in Cassandra,
> you should try
> to do inserts in MySQL too, not "load data infile" (which really is
> just a bulk loading
> utility). And as stated here
> http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html:
> "When loading a table from a text file, use LOAD DATA INFILE. This is
> usually 20 times
> faster than using INSERT statements."
>
> --
> Sylvain
>
> On Tue, May 3, 2011 at 12:30 PM, charles THIBAULT
>  wrote:
> > Hello everybody,
> >
> > first: sorry for my english in advance!!
> >
> > I'm getting started with Cassandra on a 5 nodes cluster inserting data
> > with the pycassa API.
> >
> > I've read everywere on internet that cassandra's performance are better
> than
> > MySQL
> > because of the writes append's only into commit logs files.
> >
> > When i'm trying to insert 100 000 rows with 10 columns per row with batch
> > insert, I'v this result: 27 seconds
> > But with MySQL (load data infile) this take only 2 seconds (using
> indexes)
> >
> > Here my configuration
> >
> > cassandra version: 0.7.5
> > nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213,
> > 192.168.1.214
> > seed: 192.168.1.210
> >
> > My script
> >
> *
> > #!/usr/bin/env python
> >
> > import pycassa
> > import time
> > import random
> > from cassandra import ttypes
> >
> > pool = pycassa.connect('test', ['192.168.1.210:9160'])
> > cf = pycassa.ColumnFamily(pool, 'test')
> > b = cf.batch(queue_size=50,
> > write_consistency_level=ttypes.ConsistencyLevel.ANY)
> >
> > tps1 = time.time()
> > for i in range(100000):
> > columns = dict()
> > for j in range(10):
> > columns[str(j)] = str(random.randint(0,100))
> > b.insert(str(i), columns)
> > b.send()
> > tps2 = time.time()
> >
> >
> > print("execution time: " + str(tps2 - tps1) + " seconds")
> >
> *
> >
> > what I'm doing rong ?
> >
>


Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

2011-05-03 Thread Henrik Schröder
Hey everyone,

We did some tests before upgrading our Cassandra cluster from 0.6 to 0.7,
just to make sure that the change in how keys are encoded wouldn't cause us
any dataloss. Unfortunately it seems that rows stored under a unicode key
couldn't be retrieved after the upgrade. We're running everything on
Windows, and we're using the generated thrift client in C# to access it.

I managed to make a minimal test to reproduce the error consistently:

First, I started up Cassandra 0.6.13 with an empty data directory, and a
really simple config with a single keyspace with a single bytestype
columnfamily.
I wrote two rows, each with a single column with a simple column name and a
1-byte value of "1". The first row had a key using only ascii chars ('foo'),
and the second row had a key using unicode chars ('ドメインウ').

Using multi_get, and both those keys, I got both columns back, as expected.
Using multi_get_slice and both those keys, I got both columns back, as
expected.
I also did a get_range_slices to get all rows in the columnfamily, and I got
both columns back, as expected.

So far so good. Then I drain and shut down Cassandra 0.6.13, and start up
Cassandra 0.7.5, pointing to the same data directory, with a config
containing the same keyspace, and I run the schematool import command.

I then start up my test program that uses the new thrift api, and run some
commands.

Using multi_get_slice, and those two keys encoded as UTF8 byte-arrays, I
only get back one column, the one under the key 'foo'. The other row I
simply can't retrieve.

However, when I use get_range_slices to get all rows, I get back two rows,
with the correct column values, and the byte-array keys are identical to my
encoded keys, and when I decode the byte-arrays as UTF-8 strings, I get back
my two original keys. This means that both my rows are still there, the keys
as output by Cassandra are identical to the original string keys I used when
I created the rows in 0.6.13, but it's just impossible to retrieve the
second row.

To continue the test, I inserted a row with the key 'ドメインウ' encoded as UTF-8
again, and gave it a similar column as the original, but with a 1-byte value
of "2".

Now, when I use multi_get_slice with my two encoded keys, I get back two
rows, the 'foo' row has the old value as expected, and the other row has the
new value as expected.

However, when I use get_range_slices to get all rows, I get back *three*
rows, two of which have the *exact same* byte-array key, one has the old
column, one has the new column.


How is this possible? How can there be two different rows with the exact
same key? I'm guessing that it's related to the encoding of string keys in
0.6, and that the internal representation is off somehow. I checked the
generated thrift client for 0.6, and it UTF8-encodes all keys before sending
them to the server, so it should be UTF8 all the way, but apparently it
isn't.

Has anyone else experienced the same problem? Is it a platform-specific
problem? Is there a way to avoid this and upgrade from 0.6 to 0.7 and not
lose any rows? I would also really like to know which byte-array I should
send in to get back that second row, there's gotta be some key that can be
used to get it, the row is still there after all.


/Henrik Schröder


Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

2011-05-03 Thread Henrik Schröder
The way we solved this problem is that it turned out we had only a few
hundred rows with unicode keys, so we simply extracted them, upgraded to
0.7, and wrote them back. However, this means that among the rows, there are
a few hundred weird duplicate rows with identical keys.

Is this going to be a problem in the future? Is there a chance that the good
duplicate is cleaned out in favour of the bad duplicate so that we suddenly
lose those rows again?


/Henrik Schröder
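
A rough pycassa sketch of the extract-and-rewrite pass described above, with an assumed keyspace and column family (the thread itself used the C# Thrift client): range-scan the rows, pick out the non-ASCII keys, and write them back through the 0.7 API so they are addressable again.

import pycassa

pool = pycassa.connect('Keyspace1', ['localhost:9160'])  # assumed keyspace and host
cf = pycassa.ColumnFamily(pool, 'MyCF')                  # assumed column family

for key, columns in cf.get_range():
    try:
        key.decode('ascii')
    except UnicodeDecodeError:
        # non-ASCII key: write the same columns back under the same UTF-8 bytes
        # so the row becomes addressable through the 0.7 API again
        cf.insert(key, columns)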


Re: One cluster or many?

2011-05-03 Thread Jonathan Ellis
I would add that running one cluster is operationally less work than
running multiple.

On Tue, May 3, 2011 at 4:15 AM, David Boxenhorn  wrote:
> If I have a database that partitions naturally into non-overlapping
> datasets, in which there are no references between datasets, where each
> dataset is quite large (i.e. large enough to merit its own cluster from the
> point of view of quantity of data), should I set up one cluster per database
> or one large cluster for everything together?
>
> As I see it:
>
> The primary advantage of separate clusters is total isolation: if I have a
> problem with one dataset, my application will continue working normally for
> all other datasets.
>
> The primary advantage of one big cluster is usage pooling: when one server
> goes down in a large cluster it's much less important than when one server
> goes down in a small cluster. Also, different temporal usage patterns of the
> different datasets (i.e. there will be different peak hours on different
> datasets) can be combined to ease capacity requirements.
>
> Any thoughts?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: low performance inserting

2011-05-03 Thread mcasandra
Did you do a bulk upload with mysql from the same machine or separate
insert/commit for each row? And did you run inserts from the same machine as
the mysqld server?



Re: Experiences with Map&Reduce Stress Tests

2011-05-03 Thread Jeremy Hanna
Writing to Cassandra from map/reduce jobs over HDFS shouldn't be a problem.  
We're doing it in our cluster and I know of others doing the same thing.  You 
might just make sure the number of reducers (or mappers) writing to cassandra 
doesn't overwhelm it.  There's no data locality for writes, though a 
Cassandra-specific partitioner might help with that in the future.  See 
CASSANDRA-1473 - https://issues.apache.org/jira/browse/CASSANDRA-1473.

I apologize that I misspoke about one of the settings.  The batch size is in 
fact the number of rows it gets each time.  The input splits just affects how 
many mappers it splits the data into.

As far as recommending this solution, it really depends on the problem.  The 
people I know doing what you're thinking of doing typically store raw data in 
HDFS, perform mapreduce jobs over that data and output the results into 
Cassandra for realtime queries.

We're using it where I work for storage and analytics both.  We store raw data 
into S3/HDFS, mapreduce over that data and output into cassandra, then perform 
realtime queries as well as analytics over that data.  If you want to run 
analytics over Cassandra data, you'll want to partition your cluster so that 
mapreduce jobs don't affect the realtime performance.

On May 3, 2011, at 3:19 AM, Subscriber wrote:

> Hi Jeremy, 
> 
> yes, the setup on the data-nodes is:
>   - Hadoop DataNode
>   - Hadoop TaskTracker
>   - CassandraDaemon
> 
> However - the map-input is not read from Cassandra. I am running a writing 
> stress test - no reads (well from time to time I check the produced items 
> using cassandra-cli).
> Is it possible to achieve data-locality on writes? Well I think that this is 
> (in practice) not possible (one could create some artificial data that 
> correlates with the hashed row-key values or so ... ;-)
> 
> Thanks for all your tips and hints! It's good see that someone worries about 
> my problems :-)
> But - to be honest - my number one priority is not to get this test running 
> but to answer the question whether the setup Cassandra+Hadoop with massive 
> parallel writes (using map/reduce) meets the demands of our customer.
> 
> I found out that the following configuration helps a lot. 
> * disk_access_mode: standard 
> * MAX_HEAP_SIZE="4G"
> * HEAP_NEWSIZE="400M"
> * rpc_timeout_in_ms: 20000
> 
> Now the stress test runs through, but there are still timeouts (Hadoop 
> reschedules the failing mapper tasks on another node and so the test runs 
> through).
> But what causes this timeouts? 20 seconds are a long time for a modern cpu 
> (and an eternity for an android ;-) 
> 
> It seems to me that it's not only the massive amount of data or to many 
> parallel mappers, because Cassandra can handle this huge write rate over one 
> hour! 
> I found in the system.logs that the ConcurrentMarkSweeps take quite long (up 
> to 8 seconds). The heap size didn't grow much about 3GB so there was still 
> "enough air to breath".
> 
> So the question remains: can I recommend this setup?
> 
> Thanks again and best regards
> Udo
> 
> 
> On 2 May 2011, at 20:21, Jeremy Hanna wrote:
> 
>> Udo,
>> 
>> One thing to get out of the way - you're running task trackers on all of 
>> your cassandra nodes, right?  That is the first and foremost way to get good 
>> performance.  Otherwise you don't have data locality, which is really the 
>> point of map/reduce, co-locating your data and your processes operating over 
>> that data.  You're probably already doing that, but I had forgotten to ask 
>> that before.
>> 
>> Besides that...
>> 
>> You might try messing with those values a bit more as well as the input 
>> split size - cassandra.input.split.size which defaults to ~65k.  So you 
>> might try rpc timeout of 30s just to see if that helps and try reducing the 
>> input split size significantly to see if that helps.
>> 
>> For your setup I don't see the range batch size as being meaningful at all 
>> with your narrow rows, so don't worry about that.
>> 
>> Also, the capacity of your nodes and the number of mappers/reducers you're 
>> trying to use will also have an effect on whether it has to timeout.  
>> Essentially it's getting overwhelmed for some reason.  You might lower the 
>> number of mappers and reducers you're hitting your cassandra cluster with to 
>> see if that helps.
>> 
>> Jeremy
>> 
>> On May 2, 2011, at 6:25 AM, Subscriber wrote:
>> 
>>> Hi Jeremy, 
>>> 
>>> thanks for the link.
>>> I doubled the rpc_timeout (20 seconds) and reduced the range-batch-size to 
>>> 2048, but I still get timeouts...
>>> 
>>> Udo
>>> 
>>> On 29 April 2011, at 18:53, Jeremy Hanna wrote:
>>> 
 It sounds like there might be some tuning you can do to your jobs - take a 
 look at the wiki's HadoopSupport page, specifically the Troubleshooting 
 section:
 http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting
 
 On Apr 29, 2011, at 11:45 AM, Subscriber wrote:
 
> Hi all, 
> 
>>

Range Slice Issue

2011-05-03 Thread Serediuk, Adam
We appear to have encountered an issue with cassandra 0.7.5 after upgrading 
from 0.7.2. While doing a batch read using a get_range_slice against the ranges 
an individual node is master for, we are able to reproduce consistently that the 
last two nodes in the ring, regardless of the ring size (we have a 60 node 
production cluster and a 12 node test cluster), perform this read over the 
network using replicas instead of executing locally. Every other node in the 
ring successfully reads locally.

To be sure there were no data consistency issues we performed a nodetool repair 
against both of these nodes and the issue persists. We also tried truncating 
the column family and repopulating, but the issue remains.

This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read 
data locally if it is available there. We use Cassandra.Client.describe_ring() 
to figure out which machine in the ring is master for which TokenRange. I then 
compare the master for each TokenRange against the localhost to find out which 
token ranges are owned by the local machine (remote reads are too slow for this 
type of batch processing). Once I know which TokenRanges are on each machine 
locally I get evenly sized splits using Cassandra.Client.describe_splits().


Adam
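
A rough sketch of the range selection described above, assuming a Thrift client generated from cassandra.thrift; the local-address detection is an assumption, and the resulting (start_token, end_token) pairs would then be passed to describe_splits().

import socket

def local_token_ranges(client, keyspace):
    # addresses that count as "this machine"; adjust for multi-homed hosts
    local_ips = set(['127.0.0.1', socket.gethostbyname(socket.gethostname())])
    ranges = []
    for tr in client.describe_ring(keyspace):
        # TokenRange.endpoints[0] is the primary replica ("master") for the range
        if tr.endpoints[0] in local_ips:
            ranges.append((tr.start_token, tr.end_token))
    return ranges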



Re: Range Slice Issue

2011-05-03 Thread Jonathan Ellis
Do you still see this behavior if you disable dynamic snitch?

On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
 wrote:
> We appear to have encountered an issue with cassandra 0.7.5 after upgrading
> from 0.7.2. While doing a batch read using a get_range_slice against the
> ranges an individual node is master for we are able to reproduce
> consistently that the last two nodes in the ring, regardless of the ring
> size (we have a 60 node production cluster and a 12 node test cluster)
> perform this read over the network using replicas of executing locally.
> Every other node in the ring successfully reads locally.
> To be sure there were no data consistency issues we performed a nodetool
> repair against both of these nodes and the issue persists. We also tried
> truncating the column family and repopulating, but the issue remains.
> This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
> data locally if it is available there. We
> use Cassandra.Client.describe_ring() to figure out which machine in the
> ring is master for which TokenRange. I then compare the master for
> each TokenRange against the localhost to find out which token ranges
> are owned by the local machine (remote reads are too slow for this type
> of batch processing). Once I know which TokenRanges are on
> each machine locally I get evenly sized splits using
> Cassandra.Client.describe_splits().
>
> Adam
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Range Slice Issue

2011-05-03 Thread Serediuk, Adam
I just ran a test and we do not see that behavior with dynamic snitch disabled. 
All nodes appear to be doing local reads as expected.


On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote:

> Do you still see this behavior if you disable dynamic snitch?
> 
> On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
>  wrote:
>> We appear to have encountered an issue with cassandra 0.7.5 after upgrading
>> from 0.7.2. While doing a batch read using a get_range_slice against the
>> ranges an individual node is master for we are able to reproduce
>> consistently that the last two nodes in the ring, regardless of the ring
>> size (we have a 60 node production cluster and a 12 node test cluster)
>> perform this read over the network using replicas of executing locally.
>> Every other node in the ring successfully reads locally.
>> To be sure there were no data consistency issues we performed a nodetool
>> repair against both of these nodes and the issue persists. We also tried
>> truncating the column family and repopulating, but the issue remains.
>> This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
>> data locally if it is available there. We
>> use Cassandra.Client.describe_ring() to figure out which machine in the
>> ring is master for which TokenRange. I then compare the master for
>> each TokenRange against the localhost to find out which token ranges
>> are owned by the local machine (remote reads are too slow for this type
>> of batch processing). Once I know which TokenRanges are on
>> each machine locally I get evenly sized splits using
>> Cassandra.Client.describe_splits().
>> 
>> Adam
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
> 




Re: Range Slice Issue

2011-05-03 Thread Jonathan Ellis
So either (a) dynamic snitch is wrong or (b) those nodes really are
more heavily loaded than the others, and are correctly pushing queries
to other replicas.

On Tue, May 3, 2011 at 12:47 PM, Serediuk, Adam
 wrote:
> I just ran a test and we do not see that behavior with dynamic snitch 
> disabled. All nodes appear to be doing local reads as expected.
>
>
> On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote:
>
>> Do you still see this behavior if you disable dynamic snitch?
>>
>> On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
>>  wrote:
>>> We appear to have encountered an issue with cassandra 0.7.5 after upgrading
>>> from 0.7.2. While doing a batch read using a get_range_slice against the
>>> ranges an individual node is master for we are able to reproduce
>>> consistently that the last two nodes in the ring, regardless of the ring
>>> size (we have a 60 node production cluster and a 12 node test cluster)
>>> perform this read over the network using replicas of executing locally.
>>> Every other node in the ring successfully reads locally.
>>> To be sure there were no data consistency issues we performed a nodetool
>>> repair against both of these nodes and the issue persists. We also tried
>>> truncating the column family and repopulating, but the issue remains.
>>> This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
>>> data locally if it is available there. We
>>> use Cassandra.Client.describe_ring() to figure out which machine in the
>>> ring is master for which TokenRange. I then compare the master for
>>> each TokenRange against the localhost to find out which token ranges
>>> are owned by the local machine (remote reads are too slow for this type
>>> of batch processing). Once I know which TokenRanges are on
>>> each machine locally I get evenly sized splits using
>>> Cassandra.Client.describe_splits().
>>>
>>> Adam
>>>
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Cassandra 0.8 beta trunk from about 1 week ago:

Pool Name                    Active   Pending      Completed
ReadStage                         0         0              5
RequestResponseStage              0         0          87129
MutationStage                     0         0         187298
ReadRepairStage                   0         0              0
ReplicateOnWriteStage             0         0              0
GossipStage                       0         0        1353524
AntiEntropyStage                  0         0              0
MigrationStage                    0         0             10
MemtablePostFlusher               1       190            108
StreamStage                       0         0              0
FlushWriter                       0         0            302
FILEUTILS-DELETE-POOL             0         0             26
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              0
HintedHandoff                     1         4              7


Anyone with nice theories about the pending value on the memtable post
flusher?

Regards,
Terje


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
Does it resolve down to 0 eventually if you stop doing writes?

On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
 wrote:
> Cassandra 0.8 beta trunk from about 1 week ago:
> Pool Name                    Active   Pending      Completed
> ReadStage                         0         0              5
> RequestResponseStage              0         0          87129
> MutationStage                     0         0         187298
> ReadRepairStage                   0         0              0
> ReplicateOnWriteStage             0         0              0
> GossipStage                       0         0        1353524
> AntiEntropyStage                  0         0              0
> MigrationStage                    0         0             10
> MemtablePostFlusher               1       190            108
> StreamStage                       0         0              0
> FlushWriter                       0         0            302
> FILEUTILS-DELETE-POOL             0         0             26
> MiscStage                         0         0              0
> FlushSorter                       0         0              0
> InternalResponseStage             0         0              0
> HintedHandoff                     1         4              7
>
> Anyone with nice theories about the pending value on the memtable post
> flusher?
> Regards,
> Terje



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
... and are there any exceptions in the log?

On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis  wrote:
> Does it resolve down to 0 eventually if you stop doing writes?
>
> On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
>  wrote:
>> Cassandra 0.8 beta trunk from about 1 week ago:
>> Pool Name                    Active   Pending      Completed
>> ReadStage                         0         0              5
>> RequestResponseStage              0         0          87129
>> MutationStage                     0         0         187298
>> ReadRepairStage                   0         0              0
>> ReplicateOnWriteStage             0         0              0
>> GossipStage                       0         0        1353524
>> AntiEntropyStage                  0         0              0
>> MigrationStage                    0         0             10
>> MemtablePostFlusher               1       190            108
>> StreamStage                       0         0              0
>> FlushWriter                       0         0            302
>> FILEUTILS-DELETE-POOL             0         0             26
>> MiscStage                         0         0              0
>> FlushSorter                       0         0              0
>> InternalResponseStage             0         0              0
>> HintedHandoff                     1         4              7
>>
>> Anyone with nice theories about the pending value on the memtable post
>> flusher?
>> Regards,
>> Terje
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Range Slice Issue

2011-05-03 Thread Serediuk, Adam
Both data and system load are equal across all nodes, and the smaller test 
cluster also exhibits the same issue. Tokens are balanced and total node size 
is equivalent.

On May 3, 2011, at 10:51 AM, Jonathan Ellis wrote:

> So either (a) dynamic snitch is wrong or (b) those nodes really are
> more heavily loaded than the others, and are correctly pushing queries
> to other replicas.
> 
> On Tue, May 3, 2011 at 12:47 PM, Serediuk, Adam
>  wrote:
>> I just ran a test and we do not see that behavior with dynamic snitch 
>> disabled. All nodes appear to be doing local reads as expected.
>> 
>> 
>> On May 3, 2011, at 10:37 AM, Jonathan Ellis wrote:
>> 
>>> Do you still see this behavior if you disable dynamic snitch?
>>> 
>>> On Tue, May 3, 2011 at 12:31 PM, Serediuk, Adam
>>>  wrote:
 We appear to have encountered an issue with cassandra 0.7.5 after upgrading
 from 0.7.2. While doing a batch read using a get_range_slice against the
 ranges an individual node is master for we are able to reproduce
 consistently that the last two nodes in the ring, regardless of the ring
 size (we have a 60 node production cluster and a 12 node test cluster)
 perform this read over the network using replicas of executing locally.
 Every other node in the ring successfully reads locally.
 To be sure there were no data consistency issues we performed a nodetool
 repair against both of these nodes and the issue persists. We also tried
 truncating the column family and repopulating, but the issue remains.
 This seems to be related to CASSANDRA-2286 in 0.7.4. We always want to read
 data locally if it is available there. We
 use Cassandra.Client.describe_ring() to figure out which machine in the
 ring is master for which TokenRange. I then compare the master for
 each TokenRange against the localhost to find out which token ranges
 are owned by the local machine (remote reads are too slow for this type
 of batch processing). Once I know which TokenRanges are on
 each machine locally I get evenly sized splits using
 Cassandra.Client.describe_splits().
 
 Adam
 
>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
> 




Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Just some very tiny amount of writes in the background here (some hints
spooled up on another node slowly coming in).
No new data.

I thought there were no exceptions, but I did not look far enough back in the
log at first.

Going back a bit further now, however, I see this from about 50 hours ago:
ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
AbstractCassandraDaemon.java (line 112) Fatal exception in thread
Thread[CompactionExecutor:387,1,main]
java.io.IOException: No space left on device
at java.io.RandomAccessFile.writeBytes(Native Method)
at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
at
org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
at
org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
at
org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
at
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[followed by a few more of those...]

and then a bunch of these:
ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java
(line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
space to flush 40009184 bytes
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Insufficient disk space to flush
40009184 bytes
at
org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
at
org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
at
org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more

Seems like compactions stopped after this (a bunch of tmp tables there still
from when those errors were generated), and I can only suspect the post
flusher may have stopped at the same time.

There is 890GB of disk for data; sstables are currently using 604GB (139GB of which is
old tmp tables from when it ran out of disk) and "ring" tells me the load on
the node is 313GB.

Terje



On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis  wrote:

> ... and are there any exceptions in the log?
>
> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis  wrote:
> > Does it resolve down to 0 eventually if you stop doing writes?
> >
> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
> >  wrote:
> >> Cassandra 0.8 beta trunk from about 1 week ago:
> >> Pool Name                    Active   Pending      Completed
> >> ReadStage                         0         0              5
> >> RequestResponseStage              0         0          87129
> >> MutationStage                     0         0         187298
> >> ReadRepairStage                   0         0              0
> >> ReplicateOnWriteStage             0         0              0
> >> GossipStage                       0         0        1353524
> >> AntiEntropyStage                  0         0              0
> >> MigrationStage                    0         0             10
> >> MemtablePostFlusher               1       190            108
> >> StreamStage                       0         0              0
> >> FlushWriter                       0         0            302
> >> FILEUTILS-DELETE-POOL             0         0             26
> >> MiscStage                         0         0              0
> >> FlushSorter                       0         0              0
> >> InternalResponseStage             0         0              0
> >> HintedHandoff                     1         4              7
> >>
> >> Anyo

Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
So yes, there is currently some 200GB empty disk.

On Wed, May 4, 2011 at 3:20 AM, Terje Marthinussen
wrote:

> Just some very tiny amount of writes in the background here (some hints
> spooled up on another node slowly coming in).
> No new data.
>
> I thought there was no exceptions, but I did not look far enough back in
> the log at first.
>
> Going back a bit further now however, I see that about 50 hours ago:
> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> Thread[CompactionExecutor:387,1,main]
> java.io.IOException: No space left on device
> at java.io.RandomAccessFile.writeBytes(Native Method)
> at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> at
> org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> at
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> [followed by a few more of those...]
>
> and then a bunch of these:
> ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> Thread[FlushWriter:123,5,main]
> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> space to flush 40009184 bytes
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> 40009184 bytes
> at
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> at
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> at
> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> at
> org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> ... 3 more
>
> Seems like compactions stopped after this (a bunch of tmp tables there
> still from when those errors where generated), and I can only suspect the
> post flusher may have stopped at the same time.
>
> There is 890GB of disk for data, sstables are currently using 604G (139GB
> is old tmp tables from when it ran out of disk) and "ring" tells me the load
> on the node is 313GB.
>
> Terje
>
>
>
> On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis  wrote:
>
>> ... and are there any exceptions in the log?
>>
>> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis  wrote:
>> > Does it resolve down to 0 eventually if you stop doing writes?
>> >
>> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
>> >  wrote:
>> >> Cassandra 0.8 beta trunk from about 1 week ago:
>> >> Pool Name                    Active   Pending      Completed
>> >> ReadStage                         0         0              5
>> >> RequestResponseStage              0         0          87129
>> >> MutationStage                     0         0         187298
>> >> ReadRepairStage                   0         0              0
>> >> ReplicateOnWriteStage             0         0              0
>> >> GossipStage                       0         0        1353524
>> >> AntiEntropyStage                  0         0              0
>> >> MigrationStage                    0         0             10
>> >> MemtablePostFlusher               1       190            108
>> >> StreamStage                       0         0              0
>> >> FlushWriter                       0         0            302
>> >> FILEUTILS-DELETE-POOL             0         0 

Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
On Tue, 2011-05-03 at 16:52 +0200, Mck wrote:
> Running a 3 node cluster with cassandra-0.8.0-beta1 
> 
> I'm seeing the first node logging many (thousands) times 

The only "special" thing about this first node is that it receives all the writes
from our sybase->cassandra import job.
This process migrates an existing 60 million rows into cassandra (before
the cluster is /turned on/ for normal operations). The import job runs
over ~20 minutes.

I wiped everything and started from scratch, this time running the
import job with cassandra configured instead with:

incremental_backups: false
snapshot_before_compaction: false

This time the problem appeared on another node instead.
So changing to these settings on all nodes and running the import again
fixed it: no more "Unable to create hard link ..."

After the import I could turn both incremental_backups and
snapshot_before_compaction back to true, without problems so far.

To me this says something is broken with incremental_backups and
snapshot_before_compaction under heavy write load?

~mck




Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
post flusher is responsible for updating commitlog header after a
flush; each task waits for a specific flush to complete, then does its
thing.

so when you had a flush catastrophically fail, its corresponding
post-flush task will be stuck.
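
The control flow is roughly like the sketch below. This is an illustration
only, not the actual Cassandra source, but it shows why a flush that dies
before signalling completion leaves its post-flush task pending forever.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PostFlushSketch {
    static final ExecutorService flushWriter = Executors.newFixedThreadPool(2);
    // single-threaded, so everything queued behind a stuck task just piles up
    static final ExecutorService postFlusher = Executors.newSingleThreadExecutor();

    static void flush(final Runnable writeSortedContents) {
        final CountDownLatch flushed = new CountDownLatch(1);
        flushWriter.execute(new Runnable() {
            public void run() {
                writeSortedContents.run(); // e.g. throws "Insufficient disk space to flush"
                flushed.countDown();       // never reached if the flush failed
            }
        });
        postFlusher.execute(new Runnable() {
            public void run() {
                try {
                    flushed.await();       // stuck here -> MemtablePostFlusher pending grows
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
                // ...only now is it safe to mark the commitlog segments obsolete...
            }
        });
    }
}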

On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
 wrote:
> Just some very tiny amount of writes in the background here (some hints
> spooled up on another node slowly coming in).
> No new data.
>
> I thought there was no exceptions, but I did not look far enough back in the
> log at first.
> Going back a bit further now however, I see that about 50 hours ago:
> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> Thread[CompactionExecutor:387,1,main]
> java.io.IOException: No space left on device
>         at java.io.RandomAccessFile.writeBytes(Native Method)
>         at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
>         at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
>         at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
>         at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
>         at
> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
>         at
> org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
>         at
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
>         at
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
>         at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
>         at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> [followed by a few more of those...]
> and then a bunch of these:
> ERROR [FlushWriter:123] 2011-05-02 01:21:12,690 AbstractCassandraDaemon.java
> (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> space to flush 40009184 bytes
>         at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> 40009184 bytes
>         at
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
>         at
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
>         at
> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
>         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
>         at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
>         at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>         ... 3 more
> Seems like compactions stopped after this (a bunch of tmp tables there still
> from when those errors where generated), and I can only suspect the post
> flusher may have stopped at the same time.
> There is 890GB of disk for data, sstables are currently using 604G (139GB is
> old tmp tables from when it ran out of disk) and "ring" tells me the load on
> the node is 313GB.
> Terje
>
>
> On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis  wrote:
>>
>> ... and are there any exceptions in the log?
>>
>> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis  wrote:
>> > Does it resolve down to 0 eventually if you stop doing writes?
>> >
>> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
>> >  wrote:
>> >> Cassandra 0.8 beta trunk from about 1 week ago:
>> >> Pool Name                    Active   Pending      Completed
>> >> ReadStage                         0         0              5
>> >> RequestResponseStage              0         0          87129
>> >> MutationStage                     0         0         187298
>> >> ReadRepairStage                   0         0              0
>> >> ReplicateOnWriteStage             0         0              0
>> >> GossipStage                       0         0        1353524
>> >> AntiEntropyStage                  0         0              0
>> >> MigrationStage                    0         0             10
>> >> MemtablePostFlusher               1       190            108
>> >> StreamSt

Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Jonathan Ellis
you should probably look to see what errno 17 means for the link
system call on your system.

On Tue, May 3, 2011 at 9:52 AM, Mck  wrote:
> Running a 3 node cluster with cassandra-0.8.0-beta1
>
> I'm seeing the first node logging many (thousands) times lines like
>
>
> Caused by: java.io.IOException: Unable to create hard link
> from 
> /iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db
>  to 
> /iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-5504-Data.db
>  (errno 17)
>
>
> This seems to happen for all column families (including system).
> It happens a lot during startup.
>
> The hardlinks do exist. Stopping, deleting the hardlinks, and starting
> again does not help.
>
> But i haven't seen it once on the other nodes...
>
> ~mck
>
>
> ps the stacktrace
>
>
> java.io.IOError: java.io.IOException: Unable to create hard link from 
> /iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
>  to 
> /iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
>  (errno 17)
>        at 
> org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1629)
>        at 
> org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1654)
>        at org.apache.cassandra.db.Table.snapshot(Table.java:198)
>        at 
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:504)
>        at 
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
>        at 
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Unable to create hard link from 
> /iad/finn/countstatistics/cassandra-data/countstatisticsCount/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
>  to 
> /iad/finn/countstatistics/cassandra-data/countstatisticsCount/snapshots/compact-thrift_no_finntech_countstats_count_Count_1299479381593068337/thrift_no_finntech_countstats_count_Count_1299479381593068337-f-3875-Data.db
>  (errno 17)
>        at 
> org.apache.cassandra.utils.CLibrary.createHardLink(CLibrary.java:155)
>        at 
> org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:713)
>        at 
> org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1622)
>        ... 10 more
>
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
On Tue, 2011-05-03 at 13:52 -0500, Jonathan Ellis wrote:
> you should probably look to see what errno 17 means for the link
> system call on your system. 

That the file already exists.
It seems cassandra is trying to make the same hard link in parallel
(under heavy write load)?

I see now I can also reproduce the problem with hadoop and
ColumnFamilyOutputFormat. 
Turning off snapshot_before_compaction seems to be enough to prevent
it. 
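
(Errno 17 on Linux is EEXIST, "File exists". A tiny sketch, Java 7 NIO with
placeholder paths, reproduces the same failure mode outside Cassandra:)

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HardLinkTwice {
    public static void main(String[] args) throws Exception {
        Path original = Paths.get("/tmp/original-Data.db");   // placeholder paths
        Path link = Paths.get("/tmp/snapshot-Data.db");
        if (!Files.exists(original))
            Files.createFile(original);
        Files.deleteIfExists(link);

        Files.createLink(link, original); // first "snapshot": succeeds
        Files.createLink(link, original); // second attempt: FileAlreadyExistsException (EEXIST)
    }
}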

~mck




Re: Using snapshot for backup and restore

2011-05-03 Thread Jonathan Ellis
You're right, this is an oversight.  Created
https://issues.apache.org/jira/browse/CASSANDRA-2596 to fix.

As for a workaround, you can drop the index + recreate. (Upgrade to
0.7.5 first, if you haven't yet.)
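
One way to script the drop + recreate is through the Thrift schema calls. The
sketch below is against the 0.7 API with placeholder keyspace/column
family/column names; it assumes set_keyspace() has already been called and that
you wait for schema agreement between the two updates. cassandra-cli's "update
column family" can do the same thing.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.ColumnDef;
import org.apache.cassandra.thrift.IndexType;

public class RebuildIndex {
    // Drops and recreates the KEYS index on one column, forcing a rebuild.
    static void rebuild(Cassandra.Client client, String ks, String cf, String column) throws Exception {
        ByteBuffer target = ByteBuffer.wrap(column.getBytes("UTF-8"));
        for (CfDef cfDef : client.describe_keyspace(ks).getCf_defs()) {
            if (!cfDef.getName().equals(cf))
                continue;
            List<ColumnDef> meta = new ArrayList<ColumnDef>(cfDef.getColumn_metadata());
            for (ColumnDef cd : meta) {
                if (target.equals(cd.name)) {
                    cd.setIndex_type(null);   // remove the index definition
                    cd.setIndex_name(null);
                }
            }
            cfDef.setColumn_metadata(meta);
            client.system_update_column_family(cfDef);  // drop
            // ...wait for schema agreement here (describe_schema_versions)...
            for (ColumnDef cd : meta)
                if (target.equals(cd.name))
                    cd.setIndex_type(IndexType.KEYS);   // re-add; the index is rebuilt
            client.system_update_column_family(cfDef);
        }
    }
}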

On Tue, May 3, 2011 at 3:22 AM, Arsene Lee
 wrote:
> Hi,
>
>
>
> We are trying to use snapshot for backup and restore. We found out that
> snapshot doesn’t take secondary indexes.
>
> We are wondering why is that? And is there any way we can rebuild the
> secondary index?
>
>
>
> Regards,
>
>
>
> Arsene



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Jonathan Ellis
Ah, that makes sense.  snapshot_before_compaction is trying to
snapshot, but incremental_backups already created one (for newly
flushed sstables).  You're probably the only one running with both
options on. :)

Can you create a ticket?

On Tue, May 3, 2011 at 2:05 PM, Mck  wrote:
> On Tue, 2011-05-03 at 13:52 -0500, Jonathan Ellis wrote:
>> you should probably look to see what errno 17 means for the link
>> system call on your system.
>
> That the file already exists.
> It seems cassandra is trying to make the same hard link in parallel
> (under heavy write load) ?
>
> I see now i can also reproduce the problem with hadoop and
> ColumnFamilyOutputFormat.
> Turning off snapshot_before_compaction seems to be enough to prevent
> it.
>
> ~mck
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Yes, I realize that.

I am a bit curious why it ran out of disk, or rather, why I have 200GB of empty
disk now, but unfortunately it seems like we may not have had monitoring
enabled on this node to tell me what happened in terms of disk usage.

I also thought that compaction was supposed to resume (try again with less
data) if it fails?

Terje

On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis  wrote:

> post flusher is responsible for updating commitlog header after a
> flush; each task waits for a specific flush to complete, then does its
> thing.
>
> so when you had a flush catastrophically fail, its corresponding
> post-flush task will be stuck.
>
> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
>  wrote:
> > Just some very tiny amount of writes in the background here (some hints
> > spooled up on another node slowly coming in).
> > No new data.
> >
> > I thought there was no exceptions, but I did not look far enough back in
> the
> > log at first.
> > Going back a bit further now however, I see that about 50 hours ago:
> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> > Thread[CompactionExecutor:387,1,main]
> > java.io.IOException: No space left on device
> > at java.io.RandomAccessFile.writeBytes(Native Method)
> > at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> > at
> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> > at
> > org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> > at
> >
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> > at
> >
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> > at
> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> > at
> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> > at
> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > at java.lang.Thread.run(Thread.java:662)
> > [followed by a few more of those...]
> > and then a bunch of these:
> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> AbstractCassandraDaemon.java
> > (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
> > java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> > space to flush 40009184 bytes
> > at
> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > at java.lang.Thread.run(Thread.java:662)
> > Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> > 40009184 bytes
> > at
> >
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> > at
> >
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> > at
> > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> > at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> > at
> org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> > at
> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> > ... 3 more
> > Seems like compactions stopped after this (a bunch of tmp tables there
> still
> > from when those errors where generated), and I can only suspect the post
> > flusher may have stopped at the same time.
> > There is 890GB of disk for data, sstables are currently using 604G (139GB
> is
> > old tmp tables from when it ran out of disk) and "ring" tells me the load
> on
> > the node is 313GB.
> > Terje
> >
> >
> > On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis 
> wrote:
> >>
> >> ... and are there any exceptions in the log?
> >>
> >> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis 
> wrote:
> >> > Does it resolve down to 0 eventually if you stop doing writes?
> >> >
> >> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
> >> >  wrote:
> >> >> Cassandra 0.8 beta trunk from about 1 week ago:
> >> >> Pool Name                    Active   Pending

Re: Replica data distributing between racks

2011-05-03 Thread aaron morton
Jonathan, 
I think you are saying each DC should have its own (logical) token 
ring, which makes sense as the only way to balance the load in each DC. I think 
most people (including me) assumed there was a single token ring for the entire 
cluster. 

But currently two endpoints cannot have the same token, regardless of 
the DC they are in. Or should people just bump the tokens in the extra DCs to 
avoid the collision?  

Cheers
Aaron

On 4 May 2011, at 03:03, Eric tamme wrote:

> On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis  wrote:
>> Right, when you are computing balanced RP tokens for NTS you need to
>> compute the tokens for each DC independently.
> 
> I am confused ... sorry.  Are you saying that ... I need to change how
> my keys are calculated to fix this problem?  Or are you talking about
> the implementation of how replication selects a token?
> 
> -Eric



Re: Problems recovering a dead node

2011-05-03 Thread aaron morton
When you say "it's clean", does that mean the node has no data files?

After you replaced the disk, what process did you use to recover?

Also, what version are you running, and what's the recent upgrade history?

Cheers
Aaron

On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:

> Hi everyone. One of the nodes in my 6 node cluster died with disk
> failures. I have replaced the disks, and it's clean. It has the same
> configuration (same ip, same token).
> 
> When I try to restart the node it starts to throw mmap underflow
> exceptions till it closes again.
> 
> I tried setting io to standard, but it still fails. It gives errors
> about two decorated keys being different, and the EOFException.
> 
> Here is an excerpt of the log
> 
> http://pastebin.com/ZXW1wY6T
> 
> I can provide more info if needed. I'm at a loss here so any help is
> appreciated.
> 
> Thanks all for your time
> 
> Héctor Izquierdo
> 



Re: Write performance help needed

2011-05-03 Thread aaron morton
To give an idea, last March (2010) I ran a much older Cassandra on 10 HP 
blades (dual socket, 4 core, 16GB, 2.5" laptop HDD) and was writing around 250K 
columns per second with 500 python processes loading the data from wikipedia 
running on another 10 HP blades. 

This was my first out-of-the-box, no-tuning test (other than using sensible batch 
updates). Since then Cassandra has gotten much faster.
  
Hope that helps
Aaron
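
For what it's worth, the shape of a parallel, batched load against the raw 0.7
Thrift API is roughly the sketch below. Thread count, batch size, and the
keyspace/column family names are placeholders to tune, not recommendations.

import java.nio.ByteBuffer;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class BatchWriter {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // one connection per thread
        for (int t = 0; t < 8; t++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        TFramedTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
                        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
                        tr.open();
                        client.set_keyspace("Stress");
                        for (int batch = 0; batch < 1000; batch++) {
                            Map<ByteBuffer, Map<String, List<Mutation>>> mutations =
                                    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
                            for (int i = 0; i < 50; i++) {       // 50 rows per batch_mutate call
                                ByteBuffer key = ByteBuffer.wrap(UUID.randomUUID().toString().getBytes());
                                Column c = new Column(ByteBuffer.wrap("col".getBytes()),
                                        ByteBuffer.wrap("val".getBytes()),
                                        System.currentTimeMillis() * 1000);
                                ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
                                cosc.setColumn(c);
                                Mutation m = new Mutation();
                                m.setColumn_or_supercolumn(cosc);
                                List<Mutation> cols = new ArrayList<Mutation>();
                                cols.add(m);
                                mutations.put(key, Collections.singletonMap("Records", cols));
                            }
                            client.batch_mutate(mutations, ConsistencyLevel.ONE);
                        }
                        tr.close();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
    }
}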

On 4 May 2011, at 02:22, Jonathan Ellis wrote:

> You don't give many details, but I would guess:
> 
> - your benchmark is not multithreaded
> - mongodb is not configured for durable writes, so you're really only
> measuring the time for it to buffer it in memory
> - you haven't loaded enough data to hit "mongo's index doesn't fit in
> memory anymore"
> 
> On Tue, May 3, 2011 at 8:24 AM, Steve Smith  wrote:
>> I am working for client that needs to persist 100K-200K records per second
>> for later querying.  As a proof of concept, we are looking at several
>> options including nosql (Cassandra and MongoDB).
>> I have been running some tests on my laptop (MacBook Pro, 4GB RAM, 2.66 GHz,
>> Dual Core/4 logical cores) and have not been happy with the results.
>> The best I have been able to accomplish is 100K records in approximately 30
>> seconds.  Each record has 30 columns, mostly made up of integers.  I have
>> tried both the Hector and Pelops APIs, and have tried writing in batches
>> versus one at a time.  The times have not varied much.
>> I am using the out of the box configuration for Cassandra, and while I know
>> using 1 disk will have an impact on performance, I would expect to see
>> better write numbers than I am.
>> As a point of reference, the same test using MongoDB I was able to
>> accomplish 100K records in 3.5 seconds.
>> Any tips would be appreciated.
>> 
>> - Steve
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com



Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Jonathan Ellis
Compaction does, but flush didn't until
https://issues.apache.org/jira/browse/CASSANDRA-2404

On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
 wrote:
> Yes, I realize that.
> I am bit curious why it ran out of disk, or rather, why I have 200GB empty
> disk now, but unfortunately it seems like we may not have had monitoring
> enabled on this node to tell me what happened in terms of disk usage.
> I also thought that compaction was supposed to resume (try again with less
> data) if it fails?
> Terje
>
> On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis  wrote:
>>
>> post flusher is responsible for updating commitlog header after a
>> flush; each task waits for a specific flush to complete, then does its
>> thing.
>>
>> so when you had a flush catastrophically fail, its corresponding
>> post-flush task will be stuck.
>>
>> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
>>  wrote:
>> > Just some very tiny amount of writes in the background here (some hints
>> > spooled up on another node slowly coming in).
>> > No new data.
>> >
>> > I thought there was no exceptions, but I did not look far enough back in
>> > the
>> > log at first.
>> > Going back a bit further now however, I see that about 50 hours ago:
>> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
>> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
>> > Thread[CompactionExecutor:387,1,main]
>> > java.io.IOException: No space left on device
>> >         at java.io.RandomAccessFile.writeBytes(Native Method)
>> >         at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
>> >         at
>> >
>> > org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
>> >         at
>> >
>> > org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
>> >         at
>> >
>> > org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
>> >         at
>> >
>> > org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
>> >         at
>> > org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
>> >         at
>> >
>> > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
>> >         at
>> >
>> > org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
>> >         at
>> >
>> > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
>> >         at
>> >
>> > org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
>> >         at
>> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >         at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >         at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >         at java.lang.Thread.run(Thread.java:662)
>> > [followed by a few more of those...]
>> > and then a bunch of these:
>> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
>> > AbstractCassandraDaemon.java
>> > (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
>> > java.lang.RuntimeException: java.lang.RuntimeException: Insufficient
>> > disk
>> > space to flush 40009184 bytes
>> >         at
>> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>> >         at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >         at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >         at java.lang.Thread.run(Thread.java:662)
>> > Caused by: java.lang.RuntimeException: Insufficient disk space to flush
>> > 40009184 bytes
>> >         at
>> >
>> > org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
>> >         at
>> >
>> > org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
>> >         at
>> > org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
>> >         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
>> >         at
>> > org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
>> >         at
>> > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>> >         ... 3 more
>> > Seems like compactions stopped after this (a bunch of tmp tables there
>> > still
>> > from when those errors where generated), and I can only suspect the post
>> > flusher may have stopped at the same time.
>> > There is 890GB of disk for data, sstables are currently using 604G
>> > (139GB is
>> > old tmp tables from when it ran out of disk) and "ring" tells me the
>> > load on
>> > the node is 313GB.
>> > Terje
>> >
>> >
>> > On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis 
>> > wrote:
>> >>
>> >> ... and are there any exceptions in the log?
>> >>
>>

Re: Replica data distributing between racks

2011-05-03 Thread Jonathan Ellis
On Tue, May 3, 2011 at 2:46 PM, aaron morton  wrote:
> Jonathan,
>        I think you are saying each DC should have it's own (logical) token 
> ring.

Right. (Only with NTS, although you'd usually end up with a similar
effect if you alternate DC locations for nodes in a ONTS cluster.)

>        But currently two endpoints cannot have the same token regardless of 
> the DC they are in.

Also right.

> Or should people just bump the tokens in extra DC's to avoid the collision?

Yes.
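
"Bumping" here just means offsetting: compute balanced RandomPartitioner tokens
independently for each DC, then add 1 (or any small distinct offset) to the
second DC's tokens so no two endpoints end up with exactly the same token. A
small sketch, assuming RandomPartitioner and the same number of nodes per DC:

import java.math.BigInteger;

public class MultiDcTokens {
    public static void main(String[] args) {
        int nodesPerDc = 4;                      // illustrative cluster size
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int dc = 0; dc < 2; dc++) {
            for (int i = 0; i < nodesPerDc; i++) {
                BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                           .divide(BigInteger.valueOf(nodesPerDc))
                                           .add(BigInteger.valueOf(dc)); // the "bump"
                System.out.println("DC" + (dc + 1) + " node " + i + ": " + token);
            }
        }
    }
}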

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Unicode key encoding problem when upgrading from 0.6.13 to 0.7.5

2011-05-03 Thread aaron morton
Can you provide some details of the data returned when you do the
get_range()? It will be interesting to see the raw bytes returned for
the keys. The likely culprit is a change in the encoding. Can you also
try to grab the bytes sent for the key when doing the single select that
fails.

You can grab these either on the client and/or by turning the logging up
to DEBUG in conf/log4j-server.properties
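
Something like the sketch below will dump the raw key bytes that come back from
a get_range call (0.7 Thrift API; the keyspace and column family names are
placeholders):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class DumpKeys {
    public static void main(String[] args) throws Exception {
        TFramedTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        tr.open();
        client.set_keyspace("MyKeyspace");

        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1));
        KeyRange range = new KeyRange();
        range.setCount(100);
        range.setStart_key(ByteBuffer.wrap(new byte[0]));
        range.setEnd_key(ByteBuffer.wrap(new byte[0]));

        List<KeySlice> slices = client.get_range_slices(
                new ColumnParent("MyColumnFamily"), predicate, range, ConsistencyLevel.ONE);
        for (KeySlice slice : slices) {
            ByteBuffer keyBuf = slice.key.duplicate();
            byte[] keyBytes = new byte[keyBuf.remaining()];
            keyBuf.get(keyBytes);
            StringBuilder hex = new StringBuilder();
            for (byte b : keyBytes)
                hex.append(String.format("%02x", b));
            System.out.println(hex + "  (" + new String(keyBytes, "UTF-8") + ")");
        }
        tr.close();
    }
}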

Thanks
Aaron

On 4 May 2011, at 03:19, Henrik Schröder wrote:

> The way we solved this problem is that it turned out we had only a few 
> hundred rows with unicode keys, so we simply extracted them, upgraded to 0.7, 
> and wrote them back. However, this means that among the rows, there are a few 
> hundred weird duplicate rows with identical keys.
> 
> Is this going to be a problem in the future? Is there a chance that the good 
> duplicate is cleaned out in favour of the bad duplicate so that we suddenly 
> lose those rows again?
> 
> 
> /Henrik Schröder



Re: Replica data distributing between racks

2011-05-03 Thread Eric tamme
On Tue, May 3, 2011 at 4:08 PM, Jonathan Ellis  wrote:
> On Tue, May 3, 2011 at 2:46 PM, aaron morton  wrote:
>> Jonathan,
>>        I think you are saying each DC should have it's own (logical) token 
>> ring.
>
> Right. (Only with NTS, although you'd usually end up with a similar
> effect if you alternate DC locations for nodes in a ONTS cluster.)
>
>>        But currently two endpoints cannot have the same token regardless of 
>> the DC they are in.
>
> Also right.
>
>> Or should people just bump the tokens in extra DC's to avoid the collision?
>
> Yes.
>


I am sorry, but I do not understand fully. I would appreciate it if
someone could explain with more verbosity for me.

I do not understand why data insertion is even, but replication is not.

I do not understand how to solve the problem.  What does "bumping"
tokens entail - Is that going to change my insertion distribution?  I
had no idea you can create different logical keyspaces ... and I am
not sure what that exactly means... or that I even want to do it.  Is
there a clear solution to "fixing" the problem I laid out, and getting
replication data evenly distributed between racks in each DC?

Sorry again for needing more verbosity - I am learning as I go with
this stuff. I appreciate everyone's help.

-Eric


Re: IOException: Unable to create hard link ... /snapshots/ ... (errno 17)

2011-05-03 Thread Mck
On Tue, 2011-05-03 at 14:22 -0500, Jonathan Ellis wrote:
> Can you create a ticket?

CASSANDRA-2598



Backup full cluster

2011-05-03 Thread A J
Snapshot runs on a local node. How do I ensure I have a 'point in
time' snapshot of the full cluster ? Do I have to stop the writes on
the full cluster and then snapshot all the nodes individually ?

Thanks.


Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Hm... peculiar.

Post flush is not involved in compactions, right?

May 2nd
01:06 - Out of disk
01:51 - Starts a mix of major and minor compactions on different column
families
It then starts a few extra minor compactions over the day, but given that
there are more than 1000 sstables and we are talking about only 3 minor
compactions started, that does not seem normal to me.
On May 3rd, 1 minor compaction started.

When I checked today, there was a bunch of tmp files on the disk with last
modified times from 01:something on May 2nd, and 200GB of empty disk...

Definitely no compaction going on.
Guess I will add some debug logging and see if I get lucky and run out of
disk again.

Terje

On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis  wrote:

> Compaction does, but flush didn't until
> https://issues.apache.org/jira/browse/CASSANDRA-2404
>
> On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
>  wrote:
> > Yes, I realize that.
> > I am bit curious why it ran out of disk, or rather, why I have 200GB
> empty
> > disk now, but unfortunately it seems like we may not have had monitoring
> > enabled on this node to tell me what happened in terms of disk usage.
> > I also thought that compaction was supposed to resume (try again with
> less
> > data) if it fails?
> > Terje
> >
> > On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis 
> wrote:
> >>
> >> post flusher is responsible for updating commitlog header after a
> >> flush; each task waits for a specific flush to complete, then does its
> >> thing.
> >>
> >> so when you had a flush catastrophically fail, its corresponding
> >> post-flush task will be stuck.
> >>
> >> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
> >>  wrote:
> >> > Just some very tiny amount of writes in the background here (some
> hints
> >> > spooled up on another node slowly coming in).
> >> > No new data.
> >> >
> >> > I thought there was no exceptions, but I did not look far enough back
> in
> >> > the
> >> > log at first.
> >> > Going back a bit further now however, I see that about 50 hours ago:
> >> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> > Thread[CompactionExecutor:387,1,main]
> >> > java.io.IOException: No space left on device
> >> > at java.io.RandomAccessFile.writeBytes(Native Method)
> >> > at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> >> > at
> >> >
> org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> >> > at
> >> >
> >> >
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> >> > at
> >> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >> > at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> > at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> > at java.lang.Thread.run(Thread.java:662)
> >> > [followed by a few more of those...]
> >> > and then a bunch of these:
> >> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> >> > AbstractCassandraDaemon.java
> >> > (line 112) Fatal exception in thread Thread[FlushWriter:123,5,main]
> >> > java.lang.RuntimeException: java.lang.RuntimeException: Insufficient
> >> > disk
> >> > space to flush 40009184 bytes
> >> > at
> >> >
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> >> > at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> > at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> > at java.lang.Thread.run(Thread.java:662)
> >> > Caused by: java.lang.RuntimeException: Insufficient disk space to
> flush
> >> > 40009184 bytes
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> >> > at
> >> >
> >> >
> org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(Co

Re: Performance tests using stress testing tool

2011-05-03 Thread Baskar Duraikannu
Thanks Peter. 

I believe I found the root cause: the switch that we used was bad. 
Now on a 4 node cluster (each node has 1 CPU - quad core - and 16 GB of RAM), 
I was able to get around 11,000 writes and 10,050 reads per second 
simultaneously (CPU usage is around 45% on all nodes; disk queue size is in the 
neighbourhood of 10).

Is this in line with what you usually see with Cassandra? 


- Original Message - 
From: Peter Schuller 
To: user@cassandra.apache.org 
Sent: Friday, April 29, 2011 12:21 PM
Subject: Re: Performance tests using stress testing tool


> Thanks Peter. I am using java version of the stress testing tool from the
> contrib folder. Is there any issue that should be aware of? Do you recommend
> using pystress?

I just saw Brandon file this:
https://issues.apache.org/jira/browse/CASSANDRA-2578

Maybe that's it.

-- 
/ Peter Schuller


Decommissioning node is causing broken pipe error

2011-05-03 Thread tamara.alexander
Hi all,

I ran decommission on a node in my 32 node cluster. After about an hour of 
streaming files to another node, I got this error on the node being 
decommissioned:
INFO [MiscStage:1] 2011-05-03 21:49:00,235 StreamReplyVerbHandler.java (line 
58) Need to re-stream file /raiddrive/MDR/MeterRecords-f-2283-Data.db to 
/10.206.63.208
ERROR [Streaming:1] 2011-05-03 21:49:01,580 DebuggableThreadPoolExecutor.java 
(line 103) Error in ThreadPoolExecutor
java.lang.RuntimeException: java.io.IOException: Broken pipe
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:415)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:516)
at 
org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:105)
at 
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:67)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more
ERROR [Streaming:1] 2011-05-03 21:49:01,581 AbstractCassandraDaemon.java (line 
112) Fatal exception in thread Thread[Streaming:1,1,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:415)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:516)
at 
org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:105)
at 
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:67)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more

And this message on the node that it was streaming to:
INFO [Thread-333] 2011-05-03 21:49:00,234 StreamInSession.java (line 121) 
Streaming of file 
/raiddrive/MDR/MeterRecords-f-2283-Data.db/(98605680685,197932763967)
 progress=49016107008/99327083282 - 49% from 
org.apache.cassandra.streaming.StreamInSession@33721219 failed: requesting a 
retry.

I tried running decommission again (and running scrub + decommission), but I 
keep getting this error on the same file.

I checked out the file and saw that it is a lot bigger than all the other 
sstables... 184GB instead of about 74MB. I haven't run a major compaction for a 
bit, so I'm trying to stream 658 sstables.

I'm using Cassandra 0.7.4, I have two data directories (I know that's not good 
practice...), and all my nodes are on Amazon EC2.

Any thoughts on what could be going on or how to prevent this?

Thanks!
Tamara






Re: Problems recovering a dead node

2011-05-03 Thread Héctor Izquierdo Seliva

Hi Aaron

It has no data files whatsoever. The upgrade path is 0.7.4 -> 0.7.5. It
turns out the initial problem was the sw raid failing silently because
of another faulty disk.

Now that the storage is working, I brought up the node again, same IP,
same token and tried doing nodetool repair. 

All adjacent nodes have finished the streaming session, and now the node
has a total of 248 GB of data. Is this normal when the load per node is
about 18GB? 

Also there are 1245 pending tasks. It's been compacting or rebuilding
sstables for the last 8 hours non-stop. There are 2057 sstables in the
data folder.

Should I have done things differently, or is this the normal behaviour?

Thanks!

El mié, 04-05-2011 a las 07:54 +1200, aaron morton escribió:
> When you say "it's clean" does that mean the node has no data files ?
> 
> After you replaced the disk what process did you use to recover  ?
> 
> Also what version are you running and what's the recent upgrade history ?
> 
> Cheers
> Aaron
> 
> On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:
> 
> > Hi everyone. One of the nodes in my 6 node cluster died with disk
> > failures. I have replaced the disks, and it's clean. It has the same
> > configuration (same ip, same token).
> > 
> > When I try to restart the node it starts to throw mmap underflow
> > exceptions till it closes again.
> > 
> > I tried setting io to standard, but it still fails. It gives errors
> > about two decorated keys being different, and the EOFException.
> > 
> > Here is an excerpt of the log
> > 
> > http://pastebin.com/ZXW1wY6T
> > 
> > I can provide more info if needed. I'm at a loss here so any help is
> > appreciated.
> > 
> > Thanks all for your time
> > 
> > Héctor Izquierdo
> > 
> 




Re: MemtablePostFlusher with high number of pending calls?

2011-05-03 Thread Terje Marthinussen
Well, I just did not look at these logs very well at all last night.
First out of disk message:
ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
AbstractCassandraDaemon.java (line 112) Fatal exception in thread
Thread[CompactionExecutor:387,1,main]
java.io.IOException: No space left on device

Then finally the last one
ERROR [FlushWriter:128] 2011-05-02 01:51:06,112 AbstractCassandraDaemon.java
(line 112) Fatal exception in thread Thread[FlushWriter:128,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
space to flush 554962 bytes
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Insufficient disk space to flush
554962 bytes
at
org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
at
org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
at
org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more
 INFO [CompactionExecutor:451] 2011-05-02 01:51:06,113 StorageService.java
(line 2066) requesting GC to free disk space
[lots of sstables deleted]

After this it starts running again (although not fine, it seems).

So the disk seems to have been full for 35 minutes due to un-deleted
sstables.

Terje

On Wed, May 4, 2011 at 6:34 AM, Terje Marthinussen
wrote:

> Hm... peculiar.
>
> Post flush is not involved in compactions, right?
>
> May 2nd
> 01:06 - Out of disk
> 01:51 - Starts a mix of major and minor compactions on different column
> families
> It then starts a few minor compactions extra over the day, but given that
> there are more than 1000 sstables, and we are talking 3 minor compactions
> started, it is not normal I think.
> May 3rd 1 minor compaction started.
>
> When I checked today, there was a bunch of tmp files on the disk with last
> modify time from 01:something on may 2nd and 200GB empty disk...
>
> Definitely no compaction going on.
> Guess I will add some debug logging and see if I get lucky and run out of
> disk again.
>
> Terje
>
> On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis  wrote:
>
>> Compaction does, but flush didn't until
>> https://issues.apache.org/jira/browse/CASSANDRA-2404
>>
>> On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
>>  wrote:
>> > Yes, I realize that.
>> > I am bit curious why it ran out of disk, or rather, why I have 200GB
>> empty
>> > disk now, but unfortunately it seems like we may not have had monitoring
>> > enabled on this node to tell me what happened in terms of disk usage.
>> > I also thought that compaction was supposed to resume (try again with
>> less
>> > data) if it fails?
>> > Terje
>> >
>> > On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis 
>> wrote:
>> >>
>> >> post flusher is responsible for updating commitlog header after a
>> >> flush; each task waits for a specific flush to complete, then does its
>> >> thing.
>> >>
>> >> so when you had a flush catastrophically fail, its corresponding
>> >> post-flush task will be stuck.
>> >>
>> >> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
>> >>  wrote:
>> >> > Just some very tiny amount of writes in the background here (some
>> hints
>> >> > spooled up on another node slowly coming in).
>> >> > No new data.
>> >> >
>> >> > I thought there was no exceptions, but I did not look far enough back
>> in
>> >> > the
>> >> > log at first.
>> >> > Going back a bit further now however, I see that about 50 hours ago:
>> >> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
>> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
>> >> > Thread[CompactionExecutor:387,1,main]
>> >> > java.io.IOException: No space left on device
>> >> > at java.io.RandomAccessFile.writeBytes(Native Method)
>> >> > at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
>> >> > at
>> >> >
>> >> >
>> org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
>> >> > at
>> >> >
>> >> >
>> org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
>> >> > at
>> >> >
>> >> >
>> org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
>> >> > at
>> >> >
>> >> >
>> org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
>> >> > at
>> >> >
>> org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
>> >> >