Re: nodetool connection refused

2012-09-14 Thread Manu Zhang
Should we update the wiki?

On Fri, Sep 14, 2012 at 1:18 PM, aaron morton wrote:

> Yes.
> If your IDE is starting cassandra the settings from cassandra-env.sh will
> not be used.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 13/09/2012, at 12:41 PM, Manu Zhang  wrote:
>
> I'm afraid we have to include all the $JVM_OPTS from cassandra-env.sh?
>
> On Thu, Sep 13, 2012 at 5:49 AM, aaron morton wrote:
>
>> Thanks for updating the Wiki :)
>>
>> Cheers
>>
>>   -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 12/09/2012, at 1:14 PM, Manu Zhang  wrote:
>>
>> Problem solved. I didn't add the jmx_host and jmx_port to the vm_arguments
>> in Eclipse. How come it is not covered in the wiki
>> http://wiki.apache.org/cassandra/RunningCassandraInEclipse ? Or is it
>> outdated?
>>
>> On Mon, Sep 10, 2012 at 10:11 AM, Manu Zhang wrote:
>>
>>> It's more like an Eclipse issue now since I find a "0.0.0.0:7199"
>>> listener when executing "bin/cassandra" in terminal but none when running
>>> Cassandra in Eclipse.
>>>
>>>
>>> On Sun, Sep 9, 2012 at 12:56 PM, Manu Zhang wrote:
>>>
 No, I don't find a listener on port 7199. Where do I set that up? I've
 been experimenting on my laptop, so both of them are local.


 On Sun, Sep 9, 2012 at 1:28 AM, Senthilvel Rangaswamy <
 senthil...@gmail.com> wrote:

> What is the address for the thrift listener? Did you put 0.0.0.0:7199?
>
> On Fri, Sep 7, 2012 at 11:53 PM, Manu Zhang 
> wrote:
>
>> When I run Cassandra-trunk in Eclipse, nodetool fails to connect with
>> the following error:
>> "Failed to connect to '127.0.0.1:7199': Connection refused"
>> But if I run it in a terminal, all is fine.
>>
>
>
>
> --
> ..Senthil
>
> "If there's anything more important than my ego around, I want it
>  caught and shot now."
> - Douglas Adams.
>
>

>>>
>>
>>
>
>

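For reference, the JMX-related options that cassandra-env.sh normally appends
to $JVM_OPTS, and which have to be copied into the Eclipse run configuration's
VM arguments, look roughly like this (a sketch based on the defaults of that
era; check your own cassandra-env.sh for the exact values):

-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcassandra-foreground=yes
-Dlog4j.configuration=log4j-server.properties
-ea -Xms1G -Xmx1G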

Re: Schema consistently not propagating to a node.

2012-09-14 Thread aaron morton
Out of interest, how out of sync were they?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 6:53 AM, Ben Frank  wrote:

> Hi Sergey,
>That was exactly it, thank you!
> 
> -Ben
> 
> On Thu, Sep 13, 2012 at 12:07 AM, Sergey Tryuber  wrote:
> Hi Ben,
> 
> We've seen similar behavior once before. The cause was unsynchronized time on 
> the nodes, so I recommend checking that the system time is the same across the 
> nodes (especially on that one "bad" server).
> 
> 
> On 13 September 2012 05:15, Ben Frank  wrote:
> Hey all,
>I'm setting up a new ring on 11 machines with cassandra 1.1.5. All seems 
> to install fine and startup ok (from tarball) but I'm having an issue when 
> updating the schema. There's one node that just doesn't want to receive the 
> schema change. I've tried blowing away my /var/lib/cassandra/* directories 
> and trying it a few times with the same results. I also tried the 
> instructions here: http://wiki.apache.org/cassandra/FAQ#schema_disagreement
> 
> If I update the schema with cassandra-cli I get the following error:
> 
> Waiting for schema agreement...
> The schema has not settled in 10 seconds; further migrations are ill-advised 
> until it does.
> Versions are 3b172ce8-d8e8-362f-b955-79fe6b8a35e4:[12.19.103.103, 
> 12.19.110.103, 12.19.102.137, 12.19.103.111, 12.19.102.131, 12.19.102.129, 
> 12.19.110.105, 172.19.102.135, 12.19.103.105, 12.19.102.133], 
> b83fe28a-2851-34cb-bd8d-7d2621c1d872:[12.19.110.109]
> 
> then if I do a 'describe cluster' I get:
> 
> Cluster Information:
>Snitch: org.apache.cassandra.locator.SimpleSnitch
>Partitioner: org.apache.cassandra.dht.RandomPartitioner
>Schema versions: 
>   3b172ce8-d8e8-362f-b955-79fe6b8a35e4: [12.19.103.103, 12.19.110.103, 
> 12.19.102.137, 12.19.103.111, 12.19.102.131, 12.19.102.129, 12.19.110.105, 
> 12.19.102.135, 12.19.103.105, 12.19.102.133]
> 
>   a0f80d21-ad98-31df-b6dc-871061b1bcf9: [12.19.110.109]
> 
> If I keep running 'describe cluster', both the schema UUIDs above keep 
> changing every few seconds. Looking at the logs shows the nodes are 
> continually flushing and compacting. 
> 
> If I start over by blowing away /var/lib/cassandra and then run the schema 
> update on the machine which is having trouble receiving it, everything works 
> OK and all nodes are in sync.
> 
> Anyone know what's going on here, or what I should look at to troubleshoot 
> this?
> 
> -Ben
> 
> 

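A quick way to confirm (or rule out) the clock-skew theory is to compare the
clocks and NTP state on every node, for example (assuming ssh access and ntp
installed; the hostnames are placeholders):

for h in node1 node2 node3; do
    ssh "$h" 'hostname; date -u; ntpq -p | head -3'
done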


Re: secondary indexes TTL - strange issues

2012-09-14 Thread aaron morton
> INFO [CompactionExecutor:181] 2012-09-13 12:58:37,443 CompactionTask.java 
> (line
> 221) Compacted to 
> [/var/lib/cassandra/data/Eventstore/EventsByItem/Eventstore-E
> ventsByItem.ebi_eventtypeIndex-he-10-Data.db,].  78,623,000 to 373,348 (~0% 
> of o
> riginal) bytes for 83 keys at 0.000280MB/s.  Time: 1,272,883ms.
There are a lot of weird things here. 
It could be levelled compaction compacting an older file for the first time. 
But that would be a guess. 

> Rebuilding the index gives us back the data for a couple of minutes - then it 
> vanishes again.
Are you able to do a test with SizeTieredCompaction? 

Are you able to replicate the problem with a fresh testing CF and some test 
Data?

If it's only a problem with imported data, can you provide a sample of the 
failing query? And maybe the CF definition? 

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 2:46 AM, Roland Gude  wrote:

> Hi,
>  
> we have been running a system on Cassandra 0.7 heavily relying on secondary 
> indexes for columns with TTL.
> This has been working like a charm, but we are trying hard to move forward 
> with Cassandra and are struggling at that point:
>  
> When we put our data into a new cluster (any 1.1.x version – currently 1.1.5), 
> rebuild the indexes and run our system, everything seems to work well – until 
> at some point in time index queries do not return any data at all anymore 
> (note that the TTL will not expire for several more months).
> Rebuilding the index gives us back the data for a couple of minutes - then it 
> vanishes again.
>  
> What seems strange is that compaction apparently is very aggressive:
>  
> INFO [CompactionExecutor:181] 2012-09-13 12:58:37,443 CompactionTask.java 
> (line
> 221) Compacted to 
> [/var/lib/cassandra/data/Eventstore/EventsByItem/Eventstore-E
> ventsByItem.ebi_eventtypeIndex-he-10-Data.db,].  78,623,000 to 373,348 (~0% 
> of o
> riginal) bytes for 83 keys at 0.000280MB/s.  Time: 1,272,883ms.
>  
>  
> Actually we have switched to LeveledCompaction. Could it be that leveled 
> compaction does not play nice with indexes?
>  
>  



Re: Composite Column Query Modeling

2012-09-14 Thread aaron morton
You _could_ use one wide row and do a multiget against the same row for 
different column slices. Would be less efficient than a single get against the 
row. But you could still do big contiguous column slices. 

You may get some benefit from the collections in CQL 3 
http://www.datastax.com/dev/blog/cql3_collections

Hope that helps. 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 8:31 AM, Adam Holmberg  wrote:

> I'm modeling a new application and considering the use of SuperColumn vs. 
> Composite Column paradigms. I understand that SuperColumns are discouraged in 
> new development, but I'm pondering a query where it seems like SuperColumns 
> might be better suited.
> 
> Consider a CF with SuperColumn layout as follows
> 
> t = {
>   k1: {
> s1: { c1:v1, c2:v2 },
> s2: { c1:v3, c2:v4 },
> s3: { c1:v5, c2:v6}
> ...
>   }
>   ...
> }
> 
> Which might be modeled in CQL3:
> 
> CREATE TABLE t (
>   k text,
>   s text,
>   c1 text,
>   c2 text,
>   PRIMARY KEY (k, s)
> );
> 
> I know that it is possible to do range slices with either approach. However, 
> with SuperColumns I can do sparse slice queries with a set (list) of column 
> names as the SlicePredicate. I understand that the composites API only 
> returns contiguous slices, but I keep finding myself wanting to do a query as 
> follows:
> 
> SELECT * FROM t WHERE k = 'foo' AND s IN (1,3);
> 
> The question: Is there a recommended technique for emulating sparse column 
> slices in composites?
> 
> One suggestion I've read is to get the entire range and filter client side. 
> This is pretty punishing if the range is large and the second keys being 
> queried are sparse. Additionally, there are enough keys being queried that 
> calling once per key is undesirable.
> 
> I also realize that I could manually composite k:s as the row key and use 
> multiget, but this gives away the benefit of having these records proximate 
> when range queries are used. 
> 
> Any input on modeling/query techniques would be appreciated.
> 
> Regards,
> Adam Holmberg
> 
> 
> P.S./Sidebar:
> 
> What this seems like to me is a desire for 'multiget' at the second key level 
> analogous to multiget at the row key level. Is this something that could be 
> implemented in the server using SlicePredicate.column_names? Is this just an 
> implementation gap, or is there something technical I'm overlooking? 

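One way to approximate a sparse slice from the client side is to ask for the
fully qualified composite column names, which works when the second-level
names (c1, c2 in the example) are known up front. A sketch with pycassa,
assuming a CompositeType(UTF8, UTF8) comparator; all names come from the
example above:

import pycassa

pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
t = pycassa.ColumnFamily(pool, 't')   # comparator: CompositeType(UTF8Type, UTF8Type)

# Request only the (s, c) pairs we care about, skipping s2 entirely.
wanted = [('s1', 'c1'), ('s1', 'c2'), ('s3', 'c1'), ('s3', 'c2')]
row = t.get('k1', columns=wanted)     # one round trip, no client-side filtering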


Re: Data Model

2012-09-14 Thread aaron morton
> Consider a course_students col family which gives a list of students for a 
> course

I would use two CF's:

Course CF:
* Each row is one course
* Columns are the properties and values of the course

CourseEnrolements CF
* Each row is one course
* Column name is the student ID. 
* Column value may be blank or some useful value. 

Hope that helps. 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 3:19 PM, Michael Morris  wrote:

> I'm fairly new to Cassandra myself, but had to solve a similar problem.  If 
> ordering of the student number values is not important to you, you can store 
> them as UTF8 values (Ascii would work too, may be a better choice?), and the 
> resulting columns would be sorted by the lexical ordering of the numeric 
> values (as opposed to numeric sorting), so it would be 1, 10, 11, 2, 3, 31, 
> 5...
> 
> In my situation, I had a document, composed of digitized images of the pages. 
>  My row key is the doc id number, and there were document level attributes, 
> as well as page level attributes I wanted to capture in the row.  So in my 
> model, I used composite columns of 3 UTF8 values.  The first is an attribute 
> descriptor, I used 'a' to indicate a document level attribute, and 'p' as a 
> page level attribute.  The second composite value depends on the 1st, for 'a' 
> types, the 2nd value is the actual attribute identifier (ex, form type, 
> scanner number, etc...).  For 'p' types, it refers to the page number.  
> Having Cassandra preserve the order of the page numbers is not a priority for 
> me, so I can deal with the page number being sorted in String order in the 
> database; I'll deal with the numerical sorting in the app logic (since the 
> largest documents we process are only about 100 pages long).  The 3rd 
> composite value is empty for all 'a' types, and for 'p' types, it refers to a 
> page level attribute (page landmark information, image skew angle, location 
> on disk, etc...).
> 
> As an example:
> 
> key1 => a:form=x, a:scanner=1425436, p:1:skew=0.0042142, p:1:file=page1.png, 
> p:2:skew=0.0042412, p:2:file=page2.png
> key2 => a:form=q, a:scanner=935625, p:1:skew=0.00032352, p:1:file=other1.png, 
> p:2:skew=:0.0002355, p:2:file=other2.png
> 
> It's been working well for me when using the Hector client.
> 
> Thanks,
> 
> Mike
> 
> On Thu, Sep 13, 2012 at 12:59 PM, Soumya Acharya  wrote:
> I just started learning Cassandra; any suggestion where to start?
> 
> 
> Thanks
> Soumya 
> 
> On Thu, Sep 13, 2012 at 10:54 AM, Roshni Rajagopal 
>  wrote:
> I want to learn how we can model a mix of static and dynamic columns in a 
> family.
> 
> Consider a course_students col family which gives a list of students for a 
> course
> with row key- Course Id
> Columns - Name, Teach_Nm, StudID1, StudID2, StudID3
> Values - Maths, Prof. Abc, 20,21,25 
> where 20,21,25 are IDs of students.
> 
> We have fixed columns like Course Name, Teacher Name, and a dynamic number of
> columns like 'StudID1', 'StudID2' etc, and my thoughts were that we could look
> for 'StudID' and get all the columns with the student IDs in Hector. But the
> question was how would we determine the number for the column: to add StudID3
> we need to read the row and identify that 2 students are there, and this is
> the third one.
> 
> So we can remove the number in the column name altogether and keep columns
> like Course Name, Teacher Name, Student:20, Student:21, Student:25, where the
> second part is the actual student ID. However, here we run into the second
> issue that we cannot have some columns of a composite format and some of
> another format when we use static column families - all columns would need to
> be in the format UTF8:integer. We may want to treat it as a composite column
> key and not use a delimiter - to get sorting, validate the types of the parts
> of the key, not have to search for the delimiter and separate the 2 components
> manually, etc.
> 
> A third option is to put only data in the column name for students, like
> Course Name, Teacher Name, 20, 21, 25 - it would be difficult to identify that
> columns with name 20, 21, 25 actually stand for student names - a bit
> unreadable.
> 
> I hope this is not confusing, and would like to hear your thoughts on this.
> The question is, when you de-normalize and want to have some static info like
> name and a dynamic list, what's the best way to model this.
> 
> 
> 
> Regards,
> 
> Roshni
> 
> 
> 
> 
> -- 
> Regards and Thanks 
> Soumya Kanti Acharya
> 

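A minimal sketch of that two-CF layout with pycassa (the keyspace and CF names
are only illustrative); the enrolment row stores one column per student ID
with an empty value, so listing a class is a single row read:

import pycassa

pool = pycassa.ConnectionPool('School', ['localhost:9160'])
course = pycassa.ColumnFamily(pool, 'Course')
enrolments = pycassa.ColumnFamily(pool, 'CourseEnrolements')

# Static course properties in one CF...
course.insert('MATH101', {'Name': 'Maths', 'Teach_Nm': 'Prof. Abc'})

# ...and the dynamic student list in the other, one column per student ID.
enrolments.insert('MATH101', {'20': '', '21': '', '25': ''})

# The class list is just the column names of one row.
student_ids = list(enrolments.get('MATH101').keys())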


Reading column names only

2012-09-14 Thread Robin Verlangen
Hi there,

Would it be possible to read only the column names, instead of the names
and values? I would like to store some in the value, but without the cost
of slowing down the reads of the column names (primary task).

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E ro...@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.


Re: Changing bloom filter false positive ratio

2012-09-14 Thread aaron morton
I have a hunch that the SSTable selection based on the Min and Max keys in 
ColumnFamilyStore.markReferenced() means that a higher false positive has less 
of an impact. 

It's just a hunch, I've not tested it. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 4:52 AM, Peter Schuller  wrote:

>> changing it on some of them.  Can I just change that value through the
>> cli and restart or are there any concerns I should have before trying
>> to tweak that parameter?
> 
> You can change it, you don't have to restart. It will take effect on
> new sstables - flushed and compacted, not on pre-existing sstables.
> 
> Keep in mind that you can get to about half the size of the default
> (iirc with about 10% fp chance). Beyond that you have extremely
> diminishing returns.
> 
> -- 
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

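For reference, the change itself is a one-liner in cassandra-cli (the CF name
and value here are only examples); as noted above it only affects sstables
written after the change:

update column family MyCF with bloom_filter_fp_chance = 0.1;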


Re: hadoop inserts blow out heap

2012-09-14 Thread aaron morton
Hi Brian did you see my follow up questions here 
http://www.mail-archive.com/user@cassandra.apache.org/msg24840.html

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 12/09/2012, at 11:52 PM, Brian Jeltema  
wrote:

> I'm a fairly novice Cassandra/Hadoop guy. I have written a Hadoop job (using 
> the Cassandra/Hadoop integration API)
> that performs a full table scan and attempts to populate a new table from the 
> results of the map/reduce. The read
> works fine and is fast, but the table insertion is failing with OOM errors 
> (in the Cassandra VM). The resulting heap dump from one node shows that
> 2.9G of the heap is consumed by a JMXConfigurableThreadPoolExecutor that 
> appears to be full of batch mutations.
> 
> I'm using a 6-node cluster, 32G per node, 8G heap, RF=3, if any of that 
> matters.
> 
> Any suggestions would be appreciated regarding configuration changes or 
> additional information I might
> capture to understand this problem.
> 
> Thanks
> 
> Brian J



Re: Reading column names only

2012-09-14 Thread aaron morton
It's not possible to read just the column names. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 9:05 PM, Robin Verlangen  wrote:

> Hi there,
> 
> Would it be possible to read only the column names, instead of the names and 
> values? I would like to store some in the value, but without the cost of 
> slowing down the reads of the column names (primary task).
> 
> Best regards, 
> 
> Robin Verlangen
> Software engineer
> 
> W http://www.robinverlangen.nl
> E ro...@us2.nl
> 
> 



Re: Reading column names only

2012-09-14 Thread Robin Verlangen
Hi Aaron,

Is this something that's worth becoming a feature in the future? Or should
I rework my data model? If so, do you have any suggestions?

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E ro...@us2.nl




2012/9/14 aaron morton 

> It's not possible to read just the column names.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/09/2012, at 9:05 PM, Robin Verlangen  wrote:
>
> Hi there,
>
> Would it be possible to read only the column names, instead of the names
> and values? I would like to store some in the value, but without the cost
> of slowing down the reads of the column names (primary task).
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
>
>
>

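One possible rework, sketched with pycassa (the CF names and keys are made
up): keep the hot, names-only listing in one CF with empty values and put the
payloads in a second CF under the same key, so listing never drags the values
across the wire:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
names = pycassa.ColumnFamily(pool, 'ItemIndex')    # values left empty
payloads = pycassa.ColumnFamily(pool, 'ItemData')  # same keys, real values

names.insert('row1', {'col-a': '', 'col-b': ''})
payloads.insert('row1', {'col-a': 'big value a', 'col-b': 'big value b'})

# Primary task: list the column names cheaply.
cols = list(names.get('row1').keys())

# Only when a value is actually needed:
value = payloads.get('row1', columns=['col-a'])['col-a']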

AW: secondary indexes TTL - strange issues

2012-09-14 Thread Roland Gude
I am not sure it is compacting an old file: the same thing happens every time I 
rebuild the index. New files appear, get compacted, and vanish.

We have set up a new, smaller cluster with fresh data. The same thing happens here 
as well. Data gets inserted and is accessible via index queries for some time. At 
some point in time the indexes are completely empty and start filling again (while 
new data enters the system).

I am currently testing with SizeTiered on both the fresh set and the imported 
set.

For the fresh set (which is significantly smaller), first results imply that the 
issue does not happen with SizeTieredCompaction - I have not yet tested 
everything that comes to mind and will update if something new comes up.

As for the failing query it is from the cli:
get EventsByItem where 0003--1000--=utf8('someValue');
0003--1000-- is a TUUID we use as a marker for a 
TimeSeries.
(and equivalent queries with astyanax and hector as well)

This is a cf with the issue:

create column family EventsByItem
  with column_type = 'Standard'
  and comparator = 'TimeUUIDType'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.5
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and caching = 'NONE'
  and column_metadata = [
{column_name : '--1000--',
validation_class : BytesType,
index_name : 'ebi_mandatorIndex',
index_type : 0},
{column_name : '0002--1000--',
validation_class : BytesType,
index_name : 'ebi_itemidIndex',
index_type : 0},
{column_name : '0003--1000--',
validation_class : BytesType,
index_name : 'ebi_eventtypeIndex',
index_type : 0}]
  and compression_options={sstable_compression:SnappyCompressor, 
chunk_length_kb:64};

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Friday, 14 September 2012 10:46
To: user@cassandra.apache.org
Subject: Re: secondary indexes TTL - strange issues

INFO [CompactionExecutor:181] 2012-09-13 12:58:37,443 CompactionTask.java (line
221) Compacted to [/var/lib/cassandra/data/Eventstore/EventsByItem/Eventstore-E
ventsByItem.ebi_eventtypeIndex-he-10-Data.db,].  78,623,000 to 373,348 (~0% of o
riginal) bytes for 83 keys at 0.000280MB/s.  Time: 1,272,883ms.
There are a lot of weird things here.
It could be levelled compaction compacting an older file for the first time. 
But that would be a guess.

Rebuilding the index gives us back the data for a couple of minutes - then it 
vanishes again.
Are you able to do a test with SizeTieredCompaction?

Are you able to replicate the problem with a fresh testing CF and some test 
Data?

If it's only a problem with imported data, can you provide a sample of the 
failing query? And maybe the CF definition?

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 2:46 AM, Roland Gude 
mailto:roland.g...@ez.no>> wrote:


Hi,

we have been running a system on Cassandra 0.7 heavily relying on secondary 
indexes for columns with TTL.
This has been working like a charm, but we are trying hard to move forward with 
Cassandra and are struggling at that point:

When we put our data into a new cluster (any 1.1.x version - currently 1.1.5) , 
rebuild indexes and run our system, everything seems to work good - until in 
some point of time index queries do not return any data at all anymore (note 
that the TTL has not yet expired for several months).
Rebuilding the index gives us back the data for a couple of minutes - then it 
vanishes again.

What seems strange is that compaction apparently is very aggressive:

INFO [CompactionExecutor:181] 2012-09-13 12:58:37,443 CompactionTask.java (line
221) Compacted to [/var/lib/cassandra/data/Eventstore/EventsByItem/Eventstore-E
ventsByItem.ebi_eventtypeIndex-he-10-Data.db,].  78,623,000 to 373,348 (~0% of o
riginal) bytes for 83 keys at 0.000280MB/s.  Time: 1,272,883ms.


Actually we have switched to LeveledCompaction. Could it be that leveled 
compaction does not play nice with indexes?

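For the SizeTiered comparison, the strategy can be switched back on a live CF
from cassandra-cli (a sketch; existing sstables are only rewritten as they are
compacted, so the effect is gradual):

update column family EventsByItem
  with compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy';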




Re: Many ParNew collections

2012-09-14 Thread Rene Kochen
Thanks Aaron,

At another production site the exact same problems occur (also after
~6 months). Here I have a very small cluster of three nodes with
replication factor = 3.
One of the three nodes begins to have many long ParNew collections and high CPU
load. I upgraded to Cassandra 1.0.11, but the GC problem still
continues on that node.

If I look at the CFStats of the three nodes, there is one CF which is different:

Column Family: Logs
SSTable count: 1
Space used (live): 47606705
Space used (total): 47606705
Number of Keys (estimate): 338176
Memtable Columns Count: 22297
Memtable Data Size: 51542275
Memtable Switch Count: 1
Read Count: 189441
Read Latency: 0,768 ms.
Write Count: 123411
Write Latency: 0,035 ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0,0
Bloom Filter Space Used: 721456
Key cache capacity: 20
Key cache size: 56685
Key cache hit rate: 0.9132482658217008
Row cache: disabled
Compacted row minimum size: 73
Compacted row maximum size: 263210
Compacted row mean size: 94

Column Family: Logs
SSTable count: 3
Space used (live): 233688199
Space used (total): 233688199
Number of Keys (estimate): 1191936
Memtable Columns Count: 20147
Memtable Data Size: 47067518
Memtable Switch Count: 1
Read Count: 188473
Read Latency: 4031,791 ms.
Write Count: 120412
Write Latency: 0,042 ms.
Pending Tasks: 0
Bloom Filter False Postives: 234
Bloom Filter False Ratio: 0,0
Bloom Filter Space Used: 2603808
Key cache capacity: 20
Key cache size: 5153
Key cache hit rate: 1.0
Row cache: disabled
Compacted row minimum size: 73
Compacted row maximum size: 25109160
Compacted row mean size: 156

Column Family: Logs
SSTable count: 1
Space used (live): 47714798
Space used (total): 47714798
Number of Keys (estimate): 338176
Memtable Columns Count: 29046
Memtable Data Size: 66585390
Memtable Switch Count: 1
Read Count: 196048
Read Latency: 1,466 ms.
Write Count: 127709
Write Latency: 0,034 ms.
Pending Tasks: 0
Bloom Filter False Postives: 8
Bloom Filter False Ratio: 0,00847
Bloom Filter Space Used: 720496
Key cache capacity: 20
Key cache size: 54166
Key cache hit rate: 0.9833443960960739
Row cache: disabled
Compacted row minimum size: 73
Compacted row maximum size: 263210
Compacted row mean size: 95

The second node (the one suffering from many GC) has a high read
latency compared to the others. Another thing is that the compacted
row maximum size is bigger than on the other nodes.

What puzzles me:

- Should the other nodes also have that wide row, because the
replication factor is three and I only have three nodes? I must say
that the wide row is probably the index row which has columns
added/removed continuously. Maybe the other nodes lost much data
because of compactions?
- Could repeatedly reading a wide row cause ParNew problems?

Thanks!

Rene

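If the wide index row does turn out to be the trigger, one client-side
mitigation (sketched here with pycassa; the CF and key names are made up, and
the client actually in use may differ) is to page through the row in bounded
slices instead of pulling it back in one large read:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
logs = pycassa.ColumnFamily(pool, 'Logs')

# xget streams the row buffer_size columns per Thrift call, so no single
# request has to materialise the whole wide row on either side.
for name, value in logs.xget('index-row-key', buffer_size=1000):
    handle(name, value)   # hypothetical handler
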
2012/8/17 aaron morton :
> - Cassandra 0.7.10
>
> You _really_ should look at getting up to 1.1 :) Memory management is much
> better and the JVM heap requirements are less.
>
> However, there is one node with high read latency and far too many
> ParNew collections (compared to other nodes).
>
> Check for long running compactions or repairs.
> Check for unexpected row or key cache settings.
> Look at the connections on the box and see if it is getting more client load
> than others.
>
> Restart the node and see if the situations returns. If it does try to
> correlate the start of the GC issues with the cassandra log.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 2:58 AM, Rene Kochen  wrote:
>
> Hi
>
> I have a cluster of 7 nodes:
>
> - Windows Server 2008
> - Cassandra 0.7.10
> - The nodes are identical (hardware, configuration and client request load)
> - Standard batch file with 8GB heap
> - I use disk_access_mode = standard
> - Random partitioner
> - TP stats shows no problems
> - Ring command shows no problems (data is balanced)
>
> However, there is one node with high read latency and far too many
> ParNew collections (compared to other nodes). It also suffers from a
> high CPU load (I guess due to the ParNew collections).
>
> What can be the source of so many ParNew collections? The other
> identical nodes do not have this behavior.
>
> Logging:
>
> 2012-08-16 15:58:46,436 906.072: [GC 906.072: [ParNew:
> 345022K->38336K(345024K), 0.2375976 secs]
> 3630599K->3516249K(8350272K), 0.2377296 secs] [Times: user=3.21
> sys=0.00, real=0.23 secs]
> 2012-08-16 15:58:46,888 906.517: [GC 906.517: [ParNew:
> 345024K->38336K(345024K), 0.2400594 secs]
> 3822937K->3743690K(8350272K), 0.2401802 secs] [Times: user=3.48
> sys=0.03, real=0.25 secs]
> 2012-08-16 15:58:46,888 INFO 15:58:46,888 GC for ParNew: 478 ms for 2
> collections, 3837792904 used; max is 8550678528
> 2012-08-16 15:58:47,372 907.003: [GC 907.003: [ParNew:
> 345024K->38336K(345024K), 0.2405363 secs]
> 4050378K->3971544K(8350272K), 0.2406553 secs] [Times: user=3.34
> sys=0.01, real=0.23 se

Query advice to prevent node overload

2012-09-14 Thread André Cruz
Hello.

I have a schema that represents a filesystem and one example of a Super CF is:

CF FilesPerDir: (DIRNAME -> (FILENAME -> (attribute1: value1, attribute2: 
value2)))

And in cases of directory moves, I have to fetch all files of that directory 
and subdirectories. This implies one cassandra query per dir, or a multiget for 
all needed dirs. The multiget() is more efficient, but I'm having trouble 
trying to limit the size of the data returned in order to not crash the 
cassandra node.

I'm using the pycassa client lib, and until now I have been doing per-directory 
get()s specifying a column_count. This effectively limits the size of the 
dataset, but I would like to perform a multiget() to fetch the contents of 
multiple dirs at a time. The problem is that it seems that the column_count is 
per-key, and not global per dataset. So if I set column_count = 1, as I 
have now, but fetch 1000 dirs (rows) and each one happens to have 1 files 
(columns) the dataset is 1000x1. Is there a better way to query for this 
data or does multiget deal with this through the "buffer_size"?

Thanks,
André
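
One way to keep the per-request size bounded with pycassa (a sketch; directory
names are placeholders and the super-column layer is ignored for simplicity):
multiget's buffer_size caps how many keys go into each underlying call, so the
worst case per round trip is roughly buffer_size x column_count columns, and a
directory known to be huge can be streamed with xget instead:

import pycassa

pool = pycassa.ConnectionPool('FS', ['localhost:9160'])
files = pycassa.ColumnFamily(pool, 'FilesPerDir')

dirs = ['/a', '/a/b', '/a/c']   # directories affected by the move

# column_count caps columns per row, buffer_size caps keys per Thrift call.
for dirname, cols in files.multiget(dirs, column_count=1000, buffer_size=64).items():
    handle_dir(dirname, cols)           # hypothetical handler

# A directory that is individually very wide can be paged column by column:
for fname, attrs in files.xget('/a/huge-dir', buffer_size=1000):
    handle_file(fname, attrs)           # hypothetical handler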

Re: Data Model

2012-09-14 Thread Hiller, Dean
playOrm uses EXACTLY that pattern where @OneToMany becomes 
student.rowkeyStudent1 student.rowkeyStudent2 and the other fields are fixed.  
It is a common pattern in noSQL.

Dean

From: aaron morton mailto:aa...@thelastpickle.com>>
Reply-To: "user@cassandra.apache.org" 
mailto:user@cassandra.apache.org>>
Date: Friday, September 14, 2012 3:00 AM
To: "user@cassandra.apache.org" 
mailto:user@cassandra.apache.org>>
Subject: Re: Data Model

Consider a course_students col family which gives a list of students for a 
course

I would use two CF's:

Course CF:
* Each row is one course
* Columns are the properties and values of the course

CourseEnrolements CF
* Each row is one course
* Column name is the student ID.
* Column value may be blank or some useful value.

Hope that helps.

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 3:19 PM, Michael Morris 
mailto:michael.m.mor...@gmail.com>> wrote:

I'm fairly new to Cassandra myself, but had to solve a similar problem.  If 
ordering of the student number values is not important to you, you can store 
them as UTF8 values (Ascii would work too, may be a better choice?), and the 
resulting columns would be sorted by the lexical ordering of the numeric values 
(as opposed to numeric sorting), so it would be 1, 10, 11, 2, 3, 31, 5...

In my situation, I had a document, composed of digitized images of the pages.  
My row key is the doc id number, and there were document level attributes, as 
well as page level attributes I wanted to capture in the row.  So in my model, 
I used composite columns of 3 UTF8 values.  The first is an attribute 
descriptor, I used 'a' to indicate a document level attribute, and 'p' as a 
page level attribute.  The second composite value depends on the 1st, for 'a' 
types, the 2nd value is the actual attribute identifier (ex, form type, scanner 
number, etc...).  For 'p' types, it refers to the page number.  Having 
Cassandra preserve the order of the page numbers is not a priority for me, so I 
can deal with the page number being sorted in String order in the database, 
I'll deal with the numerical sorting in the app logic (since the largest 
documents we process are only about 100 pages long).  The 3rd composite value 
is empty for all 'a' types, and for 'p' types, it refers to a page level 
attribute (page landmark information, image skew angle, location on disk, 
etc...).

As an example:

key1 => a:form=x, a:scanner=1425436, p:1:skew=0.0042142, p:1:file=page1.png, 
p:2:skew=0.0042412, p:2:file=page2.png
key2 => a:form=q, a:scanner=935625, p:1:skew=0.00032352, p:1:file=other1.png, 
p:2:skew=:0.0002355, p:2:file=other2.png

It's been working well for me when using the Hector client.

Thanks,

Mike

On Thu, Sep 13, 2012 at 12:59 PM, Soumya Acharya 
mailto:cse.sou...@gmail.com>> wrote:
I just started learning Cassandra any suggestion where to start with ??


Thanks
Soumya

On Thu, Sep 13, 2012 at 10:54 AM, Roshni Rajagopal 
mailto:roshni_rajago...@hotmail.com>> wrote:
I want to learn how we can model a mix of static and dynamic columns in a 
family.

Consider a course_students col family which gives a list of students for a 
course
with row key- Course Id
Columns - Name, Teach_Nm, StudID1, StudID2, StudID3
Values - Maths, Prof. Abc, 20,21,25
where 20,21,25 are IDs of students.

We have fixed columns like Course Name, Teacher Name, and a dynamic number of 
columns like 'StudID1', 'StudID2' etc, and my thoughts were that we could look 
for 'StudID' and get all the columns with the student IDs in Hector. But the 
question was how would we determine the number for the column: to add StudID3 
we need to read the row and identify that 2 students are there, and this is the 
third one.

So we can remove the number in the column name altogether and keep columns like 
Course Name, Teacher Name, Student:20, Student:21, Student:25, where the second 
part is the actual student ID. However, here we run into the second issue that 
we cannot have some columns of a composite format and some of another format 
when we use static column families - all columns would need to be in the format 
UTF8:integer. We may want to treat it as a composite column key and not use a 
delimiter - to get sorting, validate the types of the parts of the key, not 
have to search for the delimiter and separate the 2 components manually, etc.

A third option is to put only data in the column name for students, like Course 
Name, Teacher Name, 20, 21, 25 - it would be difficult to identify that columns 
with name 20, 21, 25 actually stand for student names - a bit unreadable.

I hope this is not confusing, and would like to hear your thoughts on this. The 
question is, when you de-normalize and want to have some static info like name 
and a dynamic list, what's the best way to model this.


Regards,

Roshni



--
Regards and Thanks
Soumya Kanti

Cassandra node going down

2012-09-14 Thread rohit reddy
Hi,

I'm facing a problem in a Cassandra cluster deployed on EC2 where a node is
going down under write load.

I have configured a cluster of 4 Large EC2 nodes with RF of 2.
All nodes are instance storage backed. DISK is RAID0 with 800GB

I'm pumping in write requests at about 4000 writes/sec. One of the nodes
went down under this load. The total data size on each node was not more
than 7GB.
I got the following WARN messages in the log file:

1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
2. Heap is 0.7515559786053904 full.  You may need to reduce memtable and/or
cache sizes.  Cassandra will now flush up to the two largest memtables to
free up memory.  Adjust flush_largest_memtables_at threshold in
cassandra.yaml if you don't want Cassandra to do
this automatically
3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024
CompactionTask.java (line 84) insufficient space to compact all requested
files

All cassandra settings are default settings.
Do i need to tune anything to support this write rate?

Thanks
Rohit


Re: Cassandra node going down

2012-09-14 Thread Robin Verlangen
Hi Robbit,

I think it's running out of disk space, please verify that (on Linux: df -h
).

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E ro...@us2.nl




2012/9/14 rohit reddy 

> Hi,
>
> I'm facing a problem in Cassandra cluster deployed on EC2 where the node
> is going down under write load.
>
> I have configured a cluster of 4 Large EC2 nodes with RF of 2.
> All nodes are instance storage backed. DISK is RAID0 with 800GB
>
> I'm pumping in write requests at about 4000 writes/sec. One of the node
> went down under this load. The total data size in each node was not more
> than 7GB
> Got the following WARN messages in the LOG file...
>
> 1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
> 2. Heap is 0.7515559786053904 full.  You may need to reduce memtable
> and/or cache sizes.  Cassandra will now flush up to the two largest
> memtables to free up memory.  Adjust flush_largest_memtables_at threshold
> in cassandra.yaml if you don't want Cassandra to do
> this automatically
> 3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024
> CompactionTask.java (line 84) insufficient space to compact all requested
> files
>
> All cassandra settings are default settings.
> Do i need to tune anything to support this write rate?
>
> Thanks
> Rohit
>
>


Re: Cassandra node going down

2012-09-14 Thread Robin Verlangen
Robbit = Rohit of course, excuse me.

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E ro...@us2.nl




2012/9/14 Robin Verlangen 

> Hi Robbit,
>
> I think it's running out of disk space, please verify that (on Linux: df
> -h ).
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
>
>
>
> 2012/9/14 rohit reddy 
>
>> Hi,
>>
>> I'm facing a problem in Cassandra cluster deployed on EC2 where the node
>> is going down under write load.
>>
>> I have configured a cluster of 4 Large EC2 nodes with RF of 2.
>> All nodes are instance storage backed. DISK is RAID0 with 800GB
>>
>> I'm pumping in write requests at about 4000 writes/sec. One of the node
>> went down under this load. The total data size in each node was not more
>> than 7GB
>> Got the following WARN messages in the LOG file...
>>
>> 1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
>> 2. Heap is 0.7515559786053904 full.  You may need to reduce memtable
>> and/or cache sizes.  Cassandra will now flush up to the two largest
>> memtables to free up memory.  Adjust flush_largest_memtables_at threshold
>> in cassandra.yaml if you don't want Cassandra to do
>> this automatically
>> 3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024
>> CompactionTask.java (line 84) insufficient space to compact all requested
>> files
>>
>> All cassandra settings are default settings.
>> Do i need to tune anything to support this write rate?
>>
>> Thanks
>> Rohit
>>
>>
>


Re: Cassandra node going down

2012-09-14 Thread rohit reddy
Hi Robin,

I had checked that. Our disk size is about 800GB, and the total data size
is not more than 40GB. Even if all the data is stored in one node, this
won't happen.

I'll try to see if the disk failed.

Is this anything to do with JVM memory? Because this log message suggests that:
Heap is 0.7515559786053904 full.  You may need to reduce memtable and/or
cache sizes.  Cassandra will now flush up to the two largest memtables to
free up memory.  Adjust flush_largest_memtables_at threshold in
cassandra.yaml if you don't want Cassandra to do this automatically

But I'm only testing writes; there are no reads on the cluster. Will the
writes require so much memory? A large instance has 7.5GB, so by default
Cassandra allocates about 3.75GB for the JVM.



On Fri, Sep 14, 2012 at 6:58 PM, Robin Verlangen  wrote:

> Hi Robbit,
>
> I think it's running out of disk space, please verify that (on Linux: df
> -h ).
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
>
>
>
> 2012/9/14 rohit reddy 
>
>> Hi,
>>
>> I'm facing a problem in Cassandra cluster deployed on EC2 where the node
>> is going down under write load.
>>
>> I have configured a cluster of 4 Large EC2 nodes with RF of 2.
>> All nodes are instance storage backed. DISK is RAID0 with 800GB
>>
>> I'm pumping in write requests at about 4000 writes/sec. One of the node
>> went down under this load. The total data size in each node was not more
>> than 7GB
>> Got the following WARN messages in the LOG file...
>>
>> 1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
>> 2. Heap is 0.7515559786053904 full.  You may need to reduce memtable
>> and/or cache sizes.  Cassandra will now flush up to the two largest
>> memtables to free up memory.  Adjust flush_largest_memtables_at threshold
>> in cassandra.yaml if you don't want Cassandra to do
>> this automatically
>> 3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024
>> CompactionTask.java (line 84) insufficient space to compact all requested
>> files
>>
>> All cassandra settings are default settings.
>> Do i need to tune anything to support this write rate?
>>
>> Thanks
>> Rohit
>>
>>
>


Is it possible to create a schema before a Cassandra node starts up ?

2012-09-14 Thread Xu, Zaili
Guys

I am pretty new to Cassandra. I have a script that needs to set up a schema 
first, before starting up the Cassandra node. Is this possible? Can I create 
the schema directly in Cassandra storage so that when the node starts up it 
will pick up the schema?

Zaili

From: rohit reddy [mailto:rohit.kommare...@gmail.com]
Sent: Friday, September 14, 2012 9:50 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra node going down

Hi Robin,

I had checked that. Our disk size is about 800GB, and the total data size is 
not more than 40GB. Even if all the data is stored in one node, this won't 
happen.

I'll try to see if the disk failed.

Is this anything to do with VM memory?.. cause this logs suggests that..
Heap is 0.7515559786053904 full.  You may need to reduce memtable and/or cache 
sizes.  Cassandra will now flush up to the two largest memtables to free up 
memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you 
don't want Cassandra to do this automatically

But, i'm only testing writes, there are no reads on the cluster. Will the 
writes require so much memory. A large instance has 7.5GB, so by default 
cassandra allocates about 3.75 GB for the VM.



On Fri, Sep 14, 2012 at 6:58 PM, Robin Verlangen 
mailto:ro...@us2.nl>> wrote:
Hi Robbit,

I think it's running out of disk space, please verify that (on Linux: df -h ).

Best regards,

Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E ro...@us2.nl



2012/9/14 rohit reddy 
mailto:rohit.kommare...@gmail.com>>
Hi,

I'm facing a problem in Cassandra cluster deployed on EC2 where the node is 
going down under write load.

I have configured a cluster of 4 Large EC2 nodes with RF of 2.
All nodes are instance storage backed. DISK is RAID0 with 800GB

I'm pumping in write requests at about 4000 writes/sec. One of the node went 
down under this load. The total data size in each node was not more than 7GB
Got the following WARN messages in the LOG file...

1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
2. Heap is 0.7515559786053904 full.  You may need to reduce memtable and/or 
cache sizes.  Cassandra will now flush up to the two largest memtables to free 
up memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if 
you don't want Cassandra to do
this automatically
3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024 CompactionTask.java 
(line 84) insufficient space to compact all requested files

All cassandra settings are default settings.
Do i need to tune anything to support this write rate?

Thanks
Rohit





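As far as I know there is no supported way to pre-load a schema straight into
the data directory; the schema lives in system column families that the node
manages itself. The usual pattern (a sketch; the file name and host are
examples) is to start the node, wait until it answers, and then feed a script
to cassandra-cli:

cassandra -p c.pid
# wait for the node to come up before loading the schema
until nodetool -h 127.0.0.1 info > /dev/null 2>&1; do sleep 2; done
cassandra-cli -h 127.0.0.1 -f schema.txt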

Re: Cassandra node going down

2012-09-14 Thread Robin Verlangen
Cassandra writes to memtables, which get flushed to disk when it's
time. That flush might be triggered by running low on memory (the log message you
just posted), by a shutdown, or at other times. That's why you're using
memory while writing.

You seem to be running on AWS; are you sure your data location is on the
right disk? The default is /var/lib/cassandra/data.

Best regards,

Robin Verlangen
*Software engineer*
*
*
W http://www.robinverlangen.nl
E ro...@us2.nl




2012/9/14 rohit reddy 

> Hi Robin,
>
> I had checked that. Our disk size is about 800GB, and the total data size
> is not more than 40GB. Even if all the data is stored in one node, this
> won't happen.
>
> I'll try to see if the disk failed.
>
> Is this anything to do with VM memory?.. cause this logs suggests that..
> Heap is 0.7515559786053904 full.  You may need to reduce memtable and/or
> cache sizes.  Cassandra will now flush up to the two largest memtables to
> free up memory.  Adjust flush_largest_memtables_at threshold in
> cassandra.yaml if you don't want Cassandra to do this automatically
>
> But, i'm only testing writes, there are no reads on the cluster. Will the
> writes require so much memory. A large instance has 7.5GB, so by default
> cassandra allocates about 3.75 GB for the VM.
>
>
>
> On Fri, Sep 14, 2012 at 6:58 PM, Robin Verlangen  wrote:
>
>> Hi Robbit,
>>
>> I think it's running out of disk space, please verify that (on Linux: df
>> -h ).
>>
>> Best regards,
>>
>> Robin Verlangen
>> *Software engineer*
>> *
>> *
>> W http://www.robinverlangen.nl
>> E ro...@us2.nl
>>
>>
>>
>>
>> 2012/9/14 rohit reddy 
>>
>>> Hi,
>>>
>>> I'm facing a problem in Cassandra cluster deployed on EC2 where the node
>>> is going down under write load.
>>>
>>> I have configured a cluster of 4 Large EC2 nodes with RF of 2.
>>> All nodes are instance storage backed. DISK is RAID0 with 800GB
>>>
>>> I'm pumping in write requests at about 4000 writes/sec. One of the node
>>> went down under this load. The total data size in each node was not more
>>> than 7GB
>>> Got the following WARN messages in the LOG file...
>>>
>>> 1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
>>> 2. Heap is 0.7515559786053904 full.  You may need to reduce memtable
>>> and/or cache sizes.  Cassandra will now flush up to the two largest
>>> memtables to free up memory.  Adjust flush_largest_memtables_at threshold
>>> in cassandra.yaml if you don't want Cassandra to do
>>> this automatically
>>> 3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024
>>> CompactionTask.java (line 84) insufficient space to compact all requested
>>> files
>>>
>>> All cassandra settings are default settings.
>>> Do i need to tune anything to support this write rate?
>>>
>>> Thanks
>>> Rohit
>>>
>>>
>>
>

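For reference, the knobs mentioned in that warning live in cassandra.yaml. A
sketch of the relevant settings with their approximate 1.x defaults (treat the
exact values as assumptions and check your own yaml):

# heap fraction at which the largest memtables are flushed early
flush_largest_memtables_at: 0.75
# total memory for all memtables; defaults to 1/3 of the heap when commented out
# memtable_total_space_in_mb: 2048
# emergency cache shrinking thresholds
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6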

cassandra does not close the stdout console on startup

2012-09-14 Thread Xu, Zaili
Hi,

Another newbie question. My script needs to start up a Cassandra node. However, 
cassandra doesn't close the stdout console and therefore never returns:

cassandra -p c.pid

Is there any way to have cassandra close stdout?

Zaili

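One workaround that usually lets a wrapper script carry on (a sketch; the log
path is arbitrary) is to detach the launcher's stdout/stderr explicitly when
starting the node, rather than letting the script inherit and wait on them:

cassandra -p c.pid > /var/log/cassandra/startup.log 2>&1 < /dev/null

Note that -f keeps the process in the foreground, which is the opposite of
what a fire-and-forget script wants here.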


Re: Composite Column Query Modeling

2012-09-14 Thread Adam Holmberg
I think what you're describing might give me what I'm after, but I don't
see how I can pass different column slices in a multiget call. I may be
missing something, but it looks like you pass multiple keys but only a
singular SlicePredicate. Please let me know if that's not what you meant.

I'm aware of CQL3 collections, but I don't think they quite suit my needs
in this case.

Thanks for the suggestions!

Adam

On Fri, Sep 14, 2012 at 1:56 AM, aaron morton wrote:

> You _could_ use one wide row and do a multiget against the same row for
> different column slices. Would be less efficient than a single get against
> the row. But you could still do big contiguous column slices.
>
> You may get some benefit from the collections in CQL 3
> http://www.datastax.com/dev/blog/cql3_collections
>
> Hope that helps.
>
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/09/2012, at 8:31 AM, Adam Holmberg 
> wrote:
>
> I'm modeling a new application and considering the use of SuperColumn vs.
> Composite Column paradigms. I understand that SuperColumns are discouraged
> in new development, but I'm pondering a query where it seems like
> SuperColumns might be better suited.
>
> Consider a CF with SuperColumn layout as follows
>
> t = {
>   k1: {
> s1: { c1:v1, c2:v2 },
> s2: { c1:v3, c2:v4 },
> s3: { c1:v5, c2:v6}
> ...
>   }
>   ...
> }
>
> Which might be modeled in CQL3:
>
> CREATE TABLE t (
>   k text,
>   s text,
>   c1 text,
>   c2 text,
>   PRIMARY KEY (k, s)
> );
>
> I know that it is possible to do range slices with either approach.
> However, with SuperColumns I can do sparse slice queries with a set (list)
> of column names as the SlicePredicate. I understand that the composites API
> only returns contiguous slices, but I keep finding myself wanting to do a
> query as follows:
>
> SELECT * FROM t WHERE k = 'foo' AND s IN (1,3);
>
> The question: Is there a recommended technique for emulating sparse column
> slices in composites?
>
> One suggestion I've read is to get the entire range and filter client
> side. This is pretty punishing if the range is large and the second keys
> being queried are sparse. Additionally, there are enough keys being queried
> that calling once per key is undesirable.
>
> I also realize that I could manually composite k:s as the row key and use
> multiget, but this gives away the benefit of having these records proximate
> when range queries *are* used.
>
> Any input on modeling/query techniques would be appreciated.
>
> Regards,
> Adam Holmberg
>
>
> P.S./Sidebar:
> 
> What this seems like to me is a desire for 'multiget' at the second key
> level analogous to multiget at the row key level. Is this something that
> could be implemented in the server using SlicePredicate.column_names? Is
> this just an implementation gap, or is there something technical I'm
> overlooking?
>
>
>


Re: Cassandra, AWS and EBS Optimized Instances/Provisioned IOPs

2012-09-14 Thread Chris Dodge

Michael Theroux  yahoo.com> writes:

> 
> Hello,
> A number of weeks ago, Amazon announced the availability of EBS Optimized
instances and Provisioned IOPs for Amazon EC2.  Historically, I've read EBS is
not recommended for Cassandra due to the network contention that can quickly
result (http://www.datastax.com/docs/1.0/cluster_architecture/cluster_planning).
> 
> Costs put aside, and assuming everything promoted by Amazon is accurate, with
the existence of provisioned IOPs, is EBS now a better option than before?
 Taking the points against EBS mentioned in the link above:
> 
> 
> EBS volumes contend directly for network throughput with standard packets.
This means that EBS throughput is likely to fail if you saturate a network link.
> According to Amazon, Provisioned IOPS guarantees to be within 10% of the
provisioned performance 99.9% of the time.  This would mean that throughput
should no longer fail.
> EBS volumes have unreliable performance. I/O performance can be exceptionally
slow, causing the system to backload reads and writes until the entire cluster
becomes unresponsive.
> Same point as above.
> Adding capacity by increasing the number of EBS volumes per host does not
scale. You can easily surpass the ability of the system to keep effective buffer
caches and concurrently serve requests for all of the data it is responsible for
managing.
> I believe this may be still true, although I'm not entirely sure why this is
more true for EBS volumes vs. ephemeral.
> 
> Any real world experience out there with these new EBS options?
> 
> -Mike
> 
> 
> 

Thanks for asking this as I'm wondering the same thing. I'd like to hear other
people's take on this as I'm trying to spec out a Cassandra ring on AWS.






Re: Composite Column Query Modeling

2012-09-14 Thread Hiller, Dean
There is another trick here.  On the playOrm open source project, we need to do
a sparse query for a join, so we send out 100 async requests, cache up the Java
"Future" objects, and return the first needed result without waiting for the
others (rough sketch below).  With the S-SQL in playOrm, we have the IN clause
coming soon as well, using the same technique: as you iterate over rows 1, 3,
29, 56, results may still be coming in, and it only blocks if you get to row 3
and the result for 3 has not come back yet.
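
A bare-bones sketch of that fan-out idea in plain Java (fetchRow() below is just
a placeholder for whatever single-row read your client does; it is not a real
API):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SparseRowFanOut {

    // Placeholder for a real single-row read against Cassandra (Thrift, Astyanax, ...).
    static List<String> fetchRow(String rowKey) {
        return new ArrayList<String>();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        List<String> rowKeys = Arrays.asList("1", "3", "29", "56");

        // Fire off all reads at once and keep the Futures in request order.
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (final String key : rowKeys) {
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() {
                    return fetchRow(key);
                }
            }));
        }

        // Iterate in order; get() only blocks if that particular row has not
        // arrived yet, while the remaining reads keep completing in the background.
        for (int i = 0; i < rowKeys.size(); i++) {
            List<String> row = futures.get(i).get();
            System.out.println("row " + rowKeys.get(i) + " has " + row.size() + " columns");
        }
        pool.shutdown();
    }
}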

Anyways, just an option.  (ps. This option helps us query a 1,000,000 row
partition in 60ms ;) and we still haven't added the lookahead cursors, which
should speed some systems up further since they fetch data while you are working
on the first batch of results)

Later,
Dean

From: Adam Holmberg <adam.holmberg.l...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Friday, September 14, 2012 9:08 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Composite Column Query Modeling

I think what you're describing might give me what I'm after, but I don't see 
how I can pass different column slices in a multiget call. I may be missing 
something, but it looks like you pass multiple keys but only a singular 
SlicePredicate. Please let me know if that's not what you meant.

I'm aware of CQL3 collections, but I don't think they quite suit my needs in
this case.

Thanks for the suggestions!

Adam

On Fri, Sep 14, 2012 at 1:56 AM, aaron morton <aa...@thelastpickle.com> wrote:
You _could_ use one wide row and do a multiget against the same row for 
different column slices. Would be less efficient than a single get against the 
row. But you could still do big contiguous column slices.

You may get some benefit from the collections in CQL 3 
http://www.datastax.com/dev/blog/cql3_collections

Hope that helps.


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/09/2012, at 8:31 AM, Adam Holmberg <adam.holmberg.l...@gmail.com> wrote:

I'm modeling a new application and considering the use of SuperColumn vs. 
Composite Column paradigms. I understand that SuperColumns are discouraged in 
new development, but I'm pondering a query where it seems like SuperColumns 
might be better suited.

Consider a CF with SuperColumn layout as follows

t = {
  k1: {
s1: { c1:v1, c2:v2 },
s2: { c1:v3, c2:v4 },
s3: { c1:v5, c2:v6}
...
  }
  ...
}

Which might be modeled in CQL3:

CREATE TABLE t (
  k text,
  s text,
  c1 text,
  c2 text,
  PRIMARY KEY (k, s)
);

I know that it is possible to do range slices with either approach. However, 
with SuperColumns I can do sparse slice queries with a set (list) of column 
names as the SlicePredicate. I understand that the composites API only returns 
contiguous slices, but I keep finding myself wanting to do a query as follows:

SELECT * FROM t WHERE k = 'foo' AND s IN (1,3);

The question: Is there a recommended technique for emulating sparse column 
slices in composites?

One suggestion I've read is to get the entire range and filter client side. 
This is pretty punishing if the range is large and the second keys being 
queried are sparse. Additionally, there are enough keys being queried that 
calling once per key is undesirable.

I also realize that I could manually composite k:s as the row key and use 
multiget, but this gives away the benefit of having these records proximate 
when range queries are used.

Any input on modeling/query techniques would be appreciated.

Regards,
Adam Holmberg


P.S./Sidebar:

What this seems like to me is a desire for 'multiget' at the second key level 
analogous to multiget at the row key level. Is this something that could be 
implemented in the server using SlicePredicate.column_names? Is this just an 
implementation gap, or is there something technical I'm overlooking?




Astyanax - build

2012-09-14 Thread A J
Hi,
I am new to java and trying to get the Astyanax client running for Cassandra.

Downloaded astyanax from https://github.com/Netflix/astyanax. How do I
compile the source code from there in a very simple fashion from the
Linux command line?

Thanks.


Re: Changing bloom filter false positive ratio

2012-09-14 Thread Peter Schuller
> I have a hunch that the SSTable selection based on the Min and Max keys in
> ColumnFamilyStore.markReferenced() means that a higher false positive ratio has
> less of an impact.
>
> it's just a hunch, i've not tested it.

For leveled compaction, yes. For non-leveled, I can't see how it would
since each sstable will effectively cover almost the entire range
(since you're effectively spraying random tokens at it, unless clients
are writing data in md5 order).

(Maybe it's different for ordered partitioning though.)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Astyanax - build

2012-09-14 Thread Philip O'Toole
On Fri, Sep 14, 2012 at 12:28:08PM -0400, A J wrote:
> Hi,
> I am new to java and trying to get the Astyanax client running for Cassandra.
> 
> Downloaded astyanax from https://github.com/Netflix/astyanax. How do I
> compile the source code from there in a very simple fashion from the
> Linux command line?

I am relatively new to Java and Astyanax too, and created a setup for
exactly this purpose. You can learn what I did here:

http://www.philipotoole.com/bootstrapping-cassandra

Not the most elegant, but I wanted to get some stuff up-and-running quickly.

> 
> Thanks.

-- 
Philip O'Toole

Senior Developer
Loggly, Inc.
San Francisco, Calif.
www.loggly.com



Re: Astyanax - build

2012-09-14 Thread Hiller, Dean
I didn't need to compile it.  It is up in the Maven repositories as well:

http://mvnrepository.com/artifact/com.netflix.astyanax/astyanax


Or are you trying to see how it works?  (We use the same client on the playORM
open source project; it works like a charm.)
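
If you just want something minimal to run against it, here is a hedged sketch
along the lines of the Astyanax getting-started examples (cluster, keyspace,
column family, and row key names are made up, and method names such as
getEntity() may differ between Astyanax versions):

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.OperationResult;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxHello {
    public static void main(String[] args) throws Exception {
        // Build a context pointed at a single local node (hypothetical names).
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")
                .forKeyspace("my_keyspace")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.NONE))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                        .setPort(9160)
                        .setMaxConnsPerHost(1)
                        .setSeeds("127.0.0.1:9160"))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());

        context.start();
        Keyspace keyspace = context.getEntity();   // newer releases expose getClient()

        // Read one row from a hypothetical column family with string keys/columns.
        ColumnFamily<String, String> cf = new ColumnFamily<String, String>(
                "my_cf", StringSerializer.get(), StringSerializer.get());
        OperationResult<ColumnList<String>> result =
                keyspace.prepareQuery(cf).getKey("some_row_key").execute();
        System.out.println("columns: " + result.getResult().size());

        context.shutdown();
    }
}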

Dean

On 9/14/12 10:28 AM, "A J"  wrote:

>Hi,
>I am new to java and trying to get the Astyanax client running for
>Cassandra.
>
>Downloaded astyanax from https://github.com/Netflix/astyanax. How do I
>compile the source code from there in a very simple fashion from the
>Linux command line?
>
>Thanks.



Re: Astyanax - build

2012-09-14 Thread Philip O'Toole
On Fri, Sep 14, 2012 at 10:49 AM, Hiller, Dean  wrote:
> I didn't need to compile it.  It is up in the maven repositories as we
>
> http://mvnrepository.com/artifact/com.netflix.astyanax/astyanax

Actually, yeah, that's what I ended up doing with my ghetto setup
too, but I did compile my example programs at the command line (as
opposed to using something like Eclipse). Pretty straightforward.

>
>
> Or are you trying to see how it works?  (We use the same client on the playORM
> open source project; it works like a charm.)
>
> Dean
>
> On 9/14/12 10:28 AM, "A J"  wrote:
>
>>Hi,
>>I am new to java and trying to get the Astyanax client running for
>>Cassandra.
>>
>>Downloaded astyanax from https://github.com/Netflix/astyanax. How do I
>>compile the source code from there in a very simple fashion from the
>>Linux command line?
>>
>>Thanks.
>


nodetool cfstats and compression

2012-09-14 Thread Jim Ancona
Do the row size stats reported by 'nodetool cfstats' include the
effect of compression?

Thanks,

Jim


minor compaction and delete expired column-tombstones

2012-09-14 Thread Rene Kochen
Hi all,

Does minor compaction delete expired column tombstones when the row is
also present in another SSTable which is not subject to the minor
compaction?

Example:

Say there are 5 SSTables:

- Customers_0 (10 MB)
- Customers_1 (10 MB)
- Customers_2 (10 MB)
- Customers_3 (10 MB)
- Customers_4 (30 MB)

A minor compaction is triggered which will compact the similarly sized
SSTables 0 to 3. In these SSTables is a customer record with key "C1" with
an expired column tombstone. Customer "C1" is also present in SSTable 4.
Will the minor compaction purge the tombstone, or will the tombstone still
be present in the newly created SSTable?

Thanks,

Rene


cassandra/hadoop BulkOutputFormat failures

2012-09-14 Thread Brian Jeltema
I'm trying to do a bulk load from a Cassandra/Hadoop job using the 
BulkOutputFormat class.
It appears that the reducers are generating the SSTables, but they are failing to
load them into the cluster:

12/09/14 14:08:13 INFO mapred.JobClient: Task Id : 
attempt_201208201337_0184_r_04_0, Status : FAILED
 java.io.IOException: Too many hosts failed: [/10.4.0.6, /10.4.0.5, /10.4.0.2, 
/10.4.0.1, /10.4.0.3, /10.4.0.4] 
at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:242)
at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:207)
at 
org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)   
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)  

A brief look at the BulkOutputFormat class shows that it depends on 
SSTableLoader. My Hadoop cluster
and my Cassandra cluster are co-located on the same set of machines. I haven't 
found any stated restrictions,
but does this technique only work if the Hadoop cluster is distinct from the 
Cassandra cluster? Any suggestions
on how to get past this problem?

Thanks in advance.

Brian

Disk configuration in new cluster node

2012-09-14 Thread Casey Deccio
I'm building a new "cluster" (to replace the broken setup I've written
about in previous posts) that will consist of only two nodes.  I understand
that I'll be sacrificing high availability of writes if one of the nodes
goes down, and I'm okay with that.  I'm more interested in maintaining high
consistency and high read availability.  So I've decided to use a
write-level consistency of ALL and read-level consistency of ONE.

My first question is about the drives in this setup.  If I initially set up
the system with, say, 4 drives for data and 1 drive for commitlog, and
later I decide to add more capacity to the node by adding more drives for
data (adding the new data directory entries in cassandra.yaml), will the
node balance out the load on the drives, or is it agnostic to the usage of
the drives underlying the data directories?

My second question has to do with RAID striping.  Would it be more useful
to stripe the disk with the commitlog or the disks with the data?  Of
course, with a single striped volume for data directories, it would be more
difficult to add capacity to the node later, as I've suggested above.

Casey


Re: Cassandra node going down

2012-09-14 Thread rohit reddy
Thanks for the inputs.
The disk on the EC2 node failed. This led to the problem. Now I have
created a new Cassandra node and added it to the cluster.

Do I need to do anything to delete the old node from the cluster, or will
the cluster balance itself?
Asking this since DataStax OpsCenter is still showing the old node.

Thanks
Rohit

On Fri, Sep 14, 2012 at 7:42 PM, Robin Verlangen  wrote:

> Cassandra writes to memtables, that will get flushed to disk when it's
> time. That might be because of running out of memory (the log message you
> just posted), on a shutdown, or at other times. That's why you're using
> memory while writing.
>
> You seem to be running on AWS, are you sure your data location is on the
> right disk? Default is /var/lib/cassandra/data
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
>
> 2012/9/14 rohit reddy 
>
>> Hi Robin,
>>
>> I had checked that. Our disk size is about 800GB, and the total data size
>> is not more than 40GB. Even if all the data is stored in one node, this
>> won't happen.
>>
>> I'll try to see if the disk failed.
>>
>> Does this have anything to do with JVM memory? This log message suggests it might:
>> Heap is 0.7515559786053904 full.  You may need to reduce memtable and/or
>> cache sizes.  Cassandra will now flush up to the two largest memtables to
>> free up memory.  Adjust flush_largest_memtables_at threshold in
>> cassandra.yaml if you don't want Cassandra to do this automatically
>>
>> But I'm only testing writes; there are no reads on the cluster. Will the
>> writes require so much memory? A large instance has 7.5GB, so by default
>> Cassandra allocates about 3.75 GB for the JVM.
>>
>>
>>
>> On Fri, Sep 14, 2012 at 6:58 PM, Robin Verlangen  wrote:
>>
>>> Hi Rohit,
>>>
>>> I think it's running out of disk space, please verify that (on Linux: df
>>> -h ).
>>>
>>> Best regards,
>>>
>>> Robin Verlangen
>>> *Software engineer*
>>> *
>>> *
>>> W http://www.robinverlangen.nl
>>> E ro...@us2.nl
>>>
>>> Disclaimer: The information contained in this message and attachments is
>>> intended solely for the attention and use of the named addressee and may be
>>> confidential. If you are not the intended recipient, you are reminded that
>>> the information remains the property of the sender. You must not use,
>>> disclose, distribute, copy, print or rely on this e-mail. If you have
>>> received this message in error, please contact the sender immediately and
>>> irrevocably delete this message and any copies.
>>>
>>>
>>>
>>> 2012/9/14 rohit reddy 
>>>
 Hi,

 I'm facing a problem in Cassandra cluster deployed on EC2 where the
 node is going down under write load.

 I have configured a cluster of 4 Large EC2 nodes with RF of 2.
 All nodes are instance storage backed. DISK is RAID0 with 800GB

 I'm pumping in write requests at about 4000 writes/sec. One of the nodes
 went down under this load. The total data size in each node was not more
 than 7GB.
 Got the following WARN messages in the LOG file...

 1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
 2. Heap is 0.7515559786053904 full.  You may need to reduce memtable
 and/or cache sizes.  Cassandra will now flush up to the two largest
 memtables to free up memory.  Adjust flush_largest_memtables_at threshold
 in cassandra.yaml if you don't want Cassandra to do
 this automatically
 3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024
 CompactionTask.java (line 84) insufficient space to compact all requested
 files

 All cassandra settings are default settings.
 Do i need to tune anything to support this write rate?

 Thanks
 Rohit


>>>
>>
>


Re: Cassandra node going down

2012-09-14 Thread Tyler Hobbs
You will need to run nodetool removetoken with the old node's token to
permanently remove it from the cluster.
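
For example (hypothetical placeholders, using the 1.x nodetool syntax):
nodetool -h <any live node> removetoken <token of the dead node>. The dead
node's token stays visible in the output of nodetool ring until it has been
removed.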

On Fri, Sep 14, 2012 at 3:06 PM, rohit reddy wrote:

> Thanks for the inputs.
> The disk on the EC2 node failed. This led to the problem. Now I have
> created a new Cassandra node and added it to the cluster.
>
> Do I need to do anything to delete the old node from the cluster, or will
> the cluster balance itself?
> Asking this since DataStax OpsCenter is still showing the old node.
>
> Thanks
> Rohit
>
>
> On Fri, Sep 14, 2012 at 7:42 PM, Robin Verlangen  wrote:
>
>> Cassandra writes to memtables, that will get flushed to disk when it's
>> time. That might be because of running out of memory (the log message you
>> just posted), on a shutdown, or at other times. That's why you're using
>> memory while writing.
>>
>> You seem to be running on AWS, are you sure your data location is on the
>> right disk? Default is /var/lib/cassandra/data
>>
>> Best regards,
>>
>> Robin Verlangen
>> *Software engineer*
>> *
>> *
>> W http://www.robinverlangen.nl
>> E ro...@us2.nl
>>
>> Disclaimer: The information contained in this message and attachments is
>> intended solely for the attention and use of the named addressee and may be
>> confidential. If you are not the intended recipient, you are reminded that
>> the information remains the property of the sender. You must not use,
>> disclose, distribute, copy, print or rely on this e-mail. If you have
>> received this message in error, please contact the sender immediately and
>> irrevocably delete this message and any copies.
>>
>>
>>
>> 2012/9/14 rohit reddy 
>>
>>> Hi Robin,
>>>
>>> I had checked that. Our disk size is about 800GB, and the total data
>>> size is not more than 40GB. Even if all the data is stored in one node,
>>> this won't happen.
>>>
>>> I'll try to see if the disk failed.
>>>
>>> Does this have anything to do with JVM memory? This log message suggests it might:
>>> Heap is 0.7515559786053904 full.  You may need to reduce memtable and/or
>>> cache sizes.  Cassandra will now flush up to the two largest memtables to
>>> free up memory.  Adjust flush_largest_memtables_at threshold in
>>> cassandra.yaml if you don't want Cassandra to do this automatically
>>>
>>> But I'm only testing writes; there are no reads on the cluster. Will
>>> the writes require so much memory? A large instance has 7.5GB, so by
>>> default Cassandra allocates about 3.75 GB for the JVM.
>>>
>>>
>>>
>>> On Fri, Sep 14, 2012 at 6:58 PM, Robin Verlangen  wrote:
>>>
 Hi Rohit,

 I think it's running out of disk space, please verify that (on Linux:
 df -h ).

 Best regards,

 Robin Verlangen
 *Software engineer*
 *
 *
 W http://www.robinverlangen.nl
 E ro...@us2.nl

 Disclaimer: The information contained in this message and attachments
 is intended solely for the attention and use of the named addressee and may
 be confidential. If you are not the intended recipient, you are reminded
 that the information remains the property of the sender. You must not use,
 disclose, distribute, copy, print or rely on this e-mail. If you have
 received this message in error, please contact the sender immediately and
 irrevocably delete this message and any copies.



 2012/9/14 rohit reddy 

> Hi,
>
> I'm facing a problem in Cassandra cluster deployed on EC2 where the
> node is going down under write load.
>
> I have configured a cluster of 4 Large EC2 nodes with RF of 2.
> All nodes are instance storage backed. DISK is RAID0 with 800GB
>
> I'm pumping in write requests at about 4000 writes/sec. One of the
> nodes went down under this load. The total data size in each node was not
> more than 7GB.
> Got the following WARN messages in the LOG file...
>
> 1. setting live ratio to minimum of 1.0 instead of 0.9003153296009601
> 2. Heap is 0.7515559786053904 full.  You may need to reduce memtable
> and/or cache sizes.  Cassandra will now flush up to the two largest
> memtables to free up memory.  Adjust flush_largest_memtables_at threshold
> in cassandra.yaml if you don't want Cassandra to do
> this automatically
> 3. WARN [CompactionExecutor:570] 2012-09-14 11:45:12,024
> CompactionTask.java (line 84) insufficient space to compact all requested
> files
>
> All cassandra settings are default settings.
> Do i need to tune anything to support this write rate?
>
> Thanks
> Rohit
>
>

>>>
>>
>


-- 
Tyler Hobbs
DataStax 


Differences in row iteration behavior

2012-09-14 Thread Todd Fast

Hi--

We are iterating rows in a column family in two different ways and are
seeing radically different row counts. We are using 1.0.8 and 
RandomPartitioner on a 3-node cluster.


In the first case, we have a trivial Hadoop job that counts 29M rows 
using the standard MR pattern for counting (mapper outputs a single key 
with a value of 1, reducer adds up all the values).


In the second case, we have a simple Quartz batch job which counts only
10M rows. We are iterating using chained calls to get_range_slices, as
described on the wiki (http://wiki.apache.org/cassandra/FAQ#iter_world).
We've also implemented the batch job using Pelops, with and without 
chaining. In all cases, the job counts just 10M rows, and it is not 
encountering any errors.


We are confident that we are doing everything right in both cases (no 
bugs), yet the results are baffling. Tests in smaller, single-node 
environments results in consistent counts between the two methods, but 
we don't have the same amount of data nor the same topology.


Is the right answer 29M or 10M? Any clues to what we're seeing?

Todd


Re: Differences in row iteration behavior

2012-09-14 Thread Jeremy Hanna
Are there any deletions in your data?  The Hadoop support doesn't filter out 
tombstones, though you may not be filtering them out in your code either.  I've 
used the hadoop support for doing a lot of data validation in the past and as 
long as you're sure that the code is sound, I'm pretty confident in it.
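
As an illustration only (not the original job), a counting mapper that skips
rows returned with no live columns might look roughly like this; the key/value
types are the ones the 1.0.x ColumnFamilyInputFormat hands to the mapper, and
the counting scheme is assumed:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RowCountMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

    private static final Text KEY = new Text("rows");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        // Deleted rows can still come back with an empty column map ("range
        // ghosts"); skip them so the Hadoop count matches a client-side
        // iteration that filters them out.
        if (columns == null || columns.isEmpty()) {
            return;
        }
        context.write(KEY, ONE);
    }
}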

On Sep 14, 2012, at 10:07 PM, Todd Fast  wrote:

> Hi--
> 
> We are iterating rows in a column family in two different ways and are seeing
> radically different row counts. We are using 1.0.8 and RandomPartitioner on a 
> 3-node cluster.
> 
> In the first case, we have a trivial Hadoop job that counts 29M rows using 
> the standard MR pattern for counting (mapper outputs a single key with a 
> value of 1, reducer adds up all the values).
> 
> In the second case, we have a simple Quartz batch job which counts only 10M
> rows. We are iterating using chained calls to get_range_slices, as described on
> the wiki (http://wiki.apache.org/cassandra/FAQ#iter_world). We've also
> implemented the batch job using Pelops, with and without chaining. In all 
> cases, the job counts just 10M rows, and it is not encountering any errors.
> 
> We are confident that we are doing everything right in both cases (no bugs), 
> yet the results are baffling. Tests in smaller, single-node environments 
> results in consistent counts between the two methods, but we don't have the 
> same amount of data nor the same topology.
> 
> Is the right answer 29M or 10M? Any clues to what we're seeing?
> 
> Todd



Re: cassandra/hadoop BulkOutputFormat failures

2012-09-14 Thread Jeremy Hanna
A couple of guesses:
- are you mixing versions of Cassandra?  Streaming differences between versions 
might throw this error.  That is, are you bulk loading with one version of 
Cassandra into a cluster that's a different version?
- (shot in the dark) is your cluster overwhelmed for some reason?

If the temp dir hasn't been cleaned up yet, you are able to retry, fwiw.
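
Unrelated to the streaming failure itself, but while you are checking things,
here is a rough sketch of the output-side settings a BulkOutputFormat job is
typically configured with, as far as I know (keyspace, column family, and seed
address below are made up):

import org.apache.cassandra.hadoop.BulkOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "bulk-load-sketch");

        // Reducers write SSTables locally and then stream them to the cluster.
        job.setOutputFormatClass(BulkOutputFormat.class);

        Configuration jobConf = job.getConfiguration();
        ConfigHelper.setOutputColumnFamily(jobConf, "my_keyspace", "my_cf");
        ConfigHelper.setOutputInitialAddress(jobConf, "10.0.0.1");   // a node the reducers can reach
        ConfigHelper.setOutputRpcPort(jobConf, "9160");
        ConfigHelper.setOutputPartitioner(jobConf, "org.apache.cassandra.dht.RandomPartitioner");

        // ... set mapper/reducer and input format here, then job.waitForCompletion(true);
    }
}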

Jeremy

On Sep 14, 2012, at 1:34 PM, Brian Jeltema  
wrote:

> I'm trying to do a bulk load from a Cassandra/Hadoop job using the 
> BulkOutputFormat class.
> It appears that the reducers are generating the SSTables, but they are failing to
> load them into the cluster:
> 
> 12/09/14 14:08:13 INFO mapred.JobClient: Task Id : 
> attempt_201208201337_0184_r_04_0, Status : FAILED
> java.io.IOException: Too many hosts failed: [/10.4.0.6, /10.4.0.5, /10.4.0.2, 
> /10.4.0.1, /10.4.0.3, /10.4.0.4] 
>at 
> org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:242)
>at 
> org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:207)
>at 
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
>at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
>at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)   
>at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>at org.apache.hadoop.mapred.Child.main(Child.java:249)  
> 
> A brief look at the BulkOutputFormat class shows that it depends on 
> SSTableLoader. My Hadoop cluster
> and my Cassandra cluster are co-located on the same set of machines. I 
> haven't found any stated restrictions,
> but does this technique only work if the Hadoop cluster is distinct from the 
> Cassandra cluster? Any suggestions
> on how to get past this problem?
> 
> Thanks in advance.
> 
> Brian