Re: Filter data on row key in Cassandra Hadoop's Random Partitioner

2012-12-13 Thread Ayush V.
Thanks Hiller and Shamim. 

Let me share more details. I want to use Cassandra MR to calculate some
KPIs on data that is stored in Cassandra continuously, so fetching the whole
dataset from Cassandra every time seems like unnecessary overhead to me.

The row key I'm using is of the form "(timestamp/6)_otherid"; this CF contains
references to the row keys of the actual data, which is stored in another CF.
So to calculate a KPI I work on one particular minute, fetch the referenced
data from the other CF, and process it.
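
For concreteness, a minimal sketch of building such a minute-bucket row key
(the exact divisor and the id format are assumptions, since the scheme above
is abbreviated):

// Hypothetical helper: bucket a millisecond timestamp into a one-minute
// row key of the form "<minuteBucket>_<otherId>", so an MR job for a given
// minute can read one known row instead of scanning the whole CF.
public final class MinuteBucketKey {
    private static final long MINUTE_MILLIS = 60000L;

    public static String of(long timestampMillis, String otherId) {
        return (timestampMillis / MINUTE_MILLIS) + "_" + otherId;
    }

    public static void main(String[] args) {
        // prints something like "22588000_device42"
        System.out.println(of(System.currentTimeMillis(), "device42"));
    }
}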





Re: Multiple Data Center shows very uneven load

2012-12-13 Thread Sergey Olefir
I'll try nodetool drain, thanks.

But more generally -- are you basically saying that I should not worry about
these things, that data will not keep accumulating indefinitely in production,
and that it won't affect performance negatively (despite the vast differences
in node load)?

Best regards,
Sergey


aaron morton wrote
> try nodetool drain. It will flush everything to disk and the commit log
> will be truncated.
> 
> HH can be ignored. If you really want them gone they can be purged using
> the JMX interface, or you can stop the node and delete the sstables. 
> 
> 
> Cheers
>  
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 13/12/2012, at 10:35 AM, Sergey Olefir <solf.lists@...> wrote:
> 
>> Nick Bailey-2 wrote
>>> Dropping a keyspace causes a snapshot to be taken of the keyspace before
>>> it
>>> is removed from the schema. So it won't actually delete any data. You
>>> can
>>> manually delete the data from /var/lib/cassandra/<keyspace>/<cf>/snapshots
>> 
>> Indeed, it looks like snapshot is on the file system. However it looks
>> like
>> it is not the only thing by a long shot, i.e.:
>> cassa1-1:/var/log/cassandra# du -k /spool1/cassandra/data/1.1/
>> 375372 
>> /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots/1355222054452-marquisColumnFamily
>> 375376 
>> /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily/snapshots
>> 375380 
>> /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace/marquisColumnFamily
>> 375384  /spool1/cassandra/data/1.1/rainmanLoadTestKeyspace
>> 4   /spool1/cassandra/data/1.1/system/Versions
>> 52  /spool1/cassandra/data/1.1/system/schema_columns
>> 4   /spool1/cassandra/data/1.1/system/Schema
>> 28  /spool1/cassandra/data/1.1/system/NodeIdInfo
>> 4   /spool1/cassandra/data/1.1/system/Migrations
>> 28  /spool1/cassandra/data/1.1/system/schema_keyspaces
>> 28  /spool1/cassandra/data/1.1/system/schema_columnfamilies
>> 786348  /spool1/cassandra/data/1.1/system/HintsColumnFamily
>> 52  /spool1/cassandra/data/1.1/system/LocationInfo
>> 4   /spool1/cassandra/data/1.1/system/IndexInfo
>> 786556  /spool1/cassandra/data/1.1/system
>> 1161944 /spool1/cassandra/data/1.1/
>> 
>> 
>> And also 700+MB in the commitlog. Neither of which seemed to 'go away' on
>> its own when idle or even after running nodetool repair/cleanup and even
>> dropping keyspace.
>> 
>> I suppose these hints and commitlog may be the reason behind huge
>> difference
>> in load on nodes -- but why does it happen and more importantly is it
>> harmful? Will it keep accumulating?


Does a scrub remove deleted/expired columns?

2012-12-13 Thread Mike Smith
I'm using 1.0.12 and I find that large sstables tend to get compacted
infrequently. I've got data that gets deleted or expired frequently. Is it
possible to use scrub to accelerate the clean up of expired/deleted data?

-- 
Mike Smith
Director Development, MailChannels


Best Java Driver for Cassandra?

2012-12-13 Thread Stephen.M.Thompson
There seem to be a number of good options listed ... FireBrand and Hector seem 
to have the most attractive sites, but that doesn't necessarily mean anything.  
:)  Can anybody make a case for one of the drivers over another, especially in 
terms of which ones seem to be most used in major implementations?

Thanks
Steve


Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Edward Capriolo
Until the secondary-index read-before-write fix is in a release and
stabilized, you should follow Ed Anuff's blog and do your indexing yourself
with composites.
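
For concreteness, a minimal sketch of that manual-index idea, assuming the
1.x Thrift API and hypothetical CF names (Ed Anuff's post, linked later in
this thread, covers the composite-column variant that also handles value
updates):

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ManualIndexSketch {
    public static void main(String[] args) throws Exception {
        TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
        tr.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        client.set_keyspace("Demo"); // hypothetical keyspace

        // Index row: key = the indexed value ("SF"), column name = a matching
        // data row key ("jim"), empty column value. A single get_slice on "SF"
        // then returns every user key with that location.
        Column entry = new Column(ByteBuffer.wrap("jim".getBytes("UTF-8")));
        entry.setValue(new byte[0]);
        entry.setTimestamp(System.currentTimeMillis() * 1000);
        client.insert(ByteBuffer.wrap("SF".getBytes("UTF-8")),
                new ColumnParent("UsersByLocation"), entry,
                ConsistencyLevel.QUORUM);
        tr.close();
    }
}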

On Thursday, December 13, 2012, aaron morton wrote:
> The IndexClause for the get_indexed_slices takes a start key. You can
page the results from your secondary index query by making multiple calls
with a sane count and including a start key.
> Cheers
> -
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> @aaronmorton
> http://www.thelastpickle.com
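
A hedged sketch of that paging pattern, assuming the 1.x Thrift API (the CF,
column, and value names are hypothetical):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;

// Pages through all rows matching I1 == foo, moving start_key forward one
// page at a time. The first row of each subsequent page repeats the last
// key of the previous page, so skip one duplicate while processing.
static void pageIndexedRows(Cassandra.Client client) throws Exception {
    IndexClause clause = new IndexClause();
    clause.addToExpressions(new IndexExpression(
            ByteBuffer.wrap("I1".getBytes("UTF-8")),
            IndexOperator.EQ,
            ByteBuffer.wrap("foo".getBytes("UTF-8"))));
    clause.setCount(1000); // a sane page size
    clause.setStart_key(ByteBuffer.wrap(new byte[0])); // empty = start at the beginning

    SlicePredicate cols = new SlicePredicate();
    cols.setSlice_range(new SliceRange(ByteBuffer.wrap(new byte[0]),
            ByteBuffer.wrap(new byte[0]), false, 100));

    List<KeySlice> page;
    do {
        page = client.get_indexed_slices(new ColumnParent("User"),
                clause, cols, ConsistencyLevel.ONE);
        for (KeySlice row : page) {
            // ... process row.key and row.columns here ...
        }
        if (!page.isEmpty()) {
            clause.setStart_key(page.get(page.size() - 1).key);
        }
    } while (page.size() >= clause.getCount());
}
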
> On 13/12/2012, at 6:34 PM, Chengying Fang  wrote:
>
> You are right, Dean. It's due to the heavy result set returned by the
query, not the index itself. According to my test, if the result is fewer
than 5000 rows, it's very quick. But how do I limit the result? A row limit
seems a good choice, but if I do that, some rows I want may be missed,
because the row order does not follow the query conditions.
> For example: CF User{I1,C1} with index I1, and query conditions I1=foo,
order by C1. If I1=foo returns 1 limit 100, I can't get the right result
ordered by C1. Also, we cannot always make the row range match the query
conditions when querying. Maybe I should redesign the CF model to fix it.
>
> -- Original --
> From:  "Hiller, Dean";
> Date:  Wed, Dec 12, 2012 10:51 PM
> To:  "user@cassandra.apache.org";
> Subject:  Re: Why Secondary indexes is so slowly by my test?
>
> You could always try PlayOrm's query capability on top of Cassandra ;) …
it works for us.
>
> Dean
>
> From: Chengying Fang <cyf...@ngnsoft.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, December 11, 2012 8:22 PM
> To: user <user@cassandra.apache.org>
> Subject: Re: Why Secondary indexes is so slowly by my test?
>
> Thanks, Low. We use composite columns as a substitute for it in queries
with a single inequality plus definite equalities. And we will give up
Cassandra because of its weak query ability and instability. Many times we
have found our data in confusion, without a definite cause, in our cluster.
For example, with only two rows in one CF,
row1-columnname1-columnvalue1, row2-columnname2-columnvalue2, it sometimes
becomes
row1-columnname1-columnvalue2, row2-columnname2-columnvalue1. Notice the
wrong column values.
>
>
> -- Original --
> From: "Richard Low" <r...@acunu.com>
> Date: Tue, Dec 11, 2012 07:44 PM
> To: "user" <user@cassandra.apache.org>
> Subject:  Re: Why Secondary indexes is so slowly by my test?
>
> Hi,
>
> Secondary index lookups are more complicated than normal queries, so they
will be slower. Items have to first be queried in the index, then retrieved
from their actual location. Also, inserting into indexed CFs will be slower
(but will get substantially faster in 1.2 due


Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Edward Capriolo
It should be good stuff. Brian eats this stuff for lunch.

On Wednesday, December 12, 2012, Brian O'Neill wrote:
> FWIW --
> I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series:
>
http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html
>
> I hope to make CQL part of the presentation and show how it integrates
> with the Java APIs.
> If you are interested, drop in.
>
> -brian
>
> --
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://brianoneill.blogspot.com/
> twitter: @boneill42
>


Re: Help on MMap of SSTables

2012-12-13 Thread Edward Capriolo
This issue has to be looked at from both a micro and a macro level. On the
micro level the "best" way is workload specific. On the macro level this
mostly boils down to data and memory size.

Compactions are going to churn the cache; this is unavoidable. IMHO solid
state makes the micro optimization meaningless in the big picture. Not that
we should not consider tweaking flags, it is just hard to believe anything
like that is a game changer.

On Monday, December 10, 2012, Rob Coli wrote:
> On Thu, Dec 6, 2012 at 7:36 PM, aaron morton wrote:
>> So for memory mapped files, compaction can do a madvise SEQUENTIAL instead
>> of the current DONTNEED flag after detecting appropriate OS versions. Will
>> this help?
>>
>>
>> AFAIK Compaction does use memory mapped file access.
>
> The history :
>
> https://issues.apache.org/jira/browse/CASSANDRA-1470
>
> =Rob
>
> --
> =Robert Coli
> AIM>ALK - rc...@palominodb.com
> YAHOO - rcoli.palominob
> SKYPE - rcoli_palominodb
>


Re: Best Java Driver for Cassandra?

2012-12-13 Thread Brian O'Neill

Well, we'll talk a bit about this in my webinar later today…
http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

I put together a quick decision matrix for all of the options based on
production-readiness, potential and momentum.  I think the slides will be
made available afterwards.

I also have a laundry list here: (written before I knew about Firebrand)
http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 • healthmarketscience.com

On 12/13/12 9:03 AM, "stephen.m.thomp...@wellsfargo.com" wrote:

>There seem to be a number of good options listed ... FireBrand and Hector
>seem to have the most attractive sites, but that doesn't necessarily mean
>anything.  :)  Can anybody make a case for one of the drivers over
>another, especially in terms of which ones seem to be most used in major
>implementations?
>
>Thanks
>Steve




Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Alain RODRIGUEZ
Hi Edward, can you share the link to this blog?

Alain

2012/12/13 Edward Capriolo 

> Ed Anuff's


Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Edward Capriolo
Here is a good start.

http://www.anuff.com/2011/02/indexing-in-cassandra.html

On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ wrote:

> Hi Edward, can you share the link to this blog?
>
> Alain
>
> 2012/12/13 Edward Capriolo 
>
>> Ed Anuff's
>
>
>


Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Tyler Hobbs
If anyone's interested in a little more background on the read-before-write
fix that Ed mentioned, see:
https://issues.apache.org/jira/browse/CASSANDRA-2897


On Thu, Dec 13, 2012 at 11:31 AM, Edward Capriolo wrote:

> Here is a good start.
>
> http://www.anuff.com/2011/02/indexing-in-cassandra.html
>
> On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ wrote:
>
>> Hi Edward, can you share the link to this blog?
>>
>> Alain
>>
>> 2012/12/13 Edward Capriolo 
>>
>>> Ed Anuff's
>>
>>
>>
>


-- 
Tyler Hobbs
DataStax 


Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Wei Zhu
I tried to register and got the following page, but haven't received the
email yet. I registered 10 minutes ago.

Thank you for registering to attend:

Is My App a Good Fit for Apache Cassandra?

Details about this webinar have also been sent to your email, including a link 
to the webinar's URL.


Webinar Description:

Join Eric Lubow, CTO of Simple Reach and DataStax MVP for Apache Cassandra 
as he examines the types of applications that are suited to be built on 
top of Cassandra. Eric will talk about the key considerations for 
designing and deploying your application on Apache Cassandra. 

How come it's showing "Is My App a Good Fit for Apache Cassandra?", which
was the previous webinar?

Thanks.
-Wei



 From: Edward Capriolo 
To: "user@cassandra.apache.org"  
Sent: Thursday, December 13, 2012 7:23 AM
Subject: Re: Datastax C*ollege Credit Webinar Series : Create your first Java 
App w/ Cassandra
 


Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Wei Zhu
Never mind, the email arrived after 15 minutes or so...




State of Cassandra and Java 7

2012-12-13 Thread Drew Kutcharian
Hey Guys,

With Java 6 being EOL-ed soon
(https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the
status of Cassandra's Java 7 support? Anyone using it in production? Any
outstanding *known* issues?

-- Drew



Re: State of Cassandra and Java 7

2012-12-13 Thread Michael Kjellman
Works just fine for us.

On 12/13/12 11:43 AM, "Drew Kutcharian"  wrote:

>Hey Guys,
>
>With Java 6 being EOL-ed soon
>(https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's
>the status of Cassandra's Java 7 support? Anyone using it in production?
>Any outstanding *known* issues?
>
>-- Drew
>



BulkOutputFormat error - org.apache.thrift.transport.TTransportException

2012-12-13 Thread ANAND_BALARAMAN
Hi

I am a newbie to Cassandra. I was trying out a sample (word count) program
using BulkOutputFormat and got stuck on an error.

What I am trying to do is migrate all Hive tables (from a Hadoop cluster) to
Cassandra column families.
My MR program is configured to run on a Hadoop cluster v0.20.2 (cdh3u3) by
pointing the job config params 'fs.default.name' and 'mapred.job.tracker'
appropriately.
The output is directed to my local Cassandra v1.1.7.
I have set the following params for writing to Cassandra:
conf.set("cassandra.output.keyspace", "Customer");
   conf.set("cassandra.output.columnfamily", "words");
   conf.set("cassandra.output.partitioner.class", 
"org.apache.cassandra.dht.RandomPartitioner");
   conf.set("cassandra.output.thrift.port","9160");// default
   conf.set("cassandra.output.thrift.address", "localhost");
   conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "10");

But the program fails with the error below:
12/12/13 15:32:55 INFO security.UserGroupInformation: JAAS Configuration 
already set up for Hadoop, not re-installing.
Cassandra thrift address   :  localhost
Cassandra thrift port  :  9160
12/12/13 15:32:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
the arguments. Applications should implement Tool for the same.
12/12/13 15:34:21 INFO input.FileInputFormat: Total input paths to process : 1
12/12/13 15:34:21 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
12/12/13 15:34:21 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/13 15:34:22 INFO mapred.JobClient: Running job: job_20121201_4622
12/12/13 15:34:23 INFO mapred.JobClient:  map 0% reduce 0%
12/12/13 15:34:28 INFO mapred.JobClient:  map 100% reduce 0%
12/12/13 15:34:37 INFO mapred.JobClient:  map 100% reduce 33%
12/12/13 15:34:39 INFO mapred.JobClient: Task Id : 
attempt_20121201_4622_r_00_0, Status : FAILED
java.lang.RuntimeException: Could not retrieve endpoint ranges:
   at 
org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:328)
   at 
org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:116)
   at 
org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:111)
   at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:223)
   at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:208)
   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:573)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
   at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectE

Please help me understand the problem.

Regards
Anand B





Re: Multiple Data Center shows very uneven load

2012-12-13 Thread aaron morton
There is a limit on the size of the commit log and on how long hints are
stored for.

I'm not sure why your load was different; I think it was left-over hints and
commit log. But it's not always easy to diagnose things via email.

Hopefully nodetool drain, or deleting the rest of the system data and
starting again, will get you moving forwards.
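
Since the earlier advice in this thread was that hints can be purged over
JMX, here is a hedged sketch of doing that, assuming the 1.1-era
HintedHandOffManager MBean (the object name, operation name, and endpoint
address are assumptions; verify them in jconsole first):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PurgeHints {
    public static void main(String[] args) throws Exception {
        // Cassandra's default JMX port is 7199.
        JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi"));
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
        // Ask the hint manager to drop stored hints for one endpoint.
        mbs.invoke(new ObjectName("org.apache.cassandra.db:type=HintedHandoffManager"),
                "deleteHintsForEndpoint",
                new Object[] { "10.0.0.2" },
                new String[] { "java.lang.String" });
        jmxc.close();
    }
}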

Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 12:50 AM, Sergey Olefir  wrote:

> I'll try nodetool drain, thanks.
> 



Re: Does a scrub remove deleted/expired columns?

2012-12-13 Thread aaron morton
>  Is it possible to use scrub to accelerate the clean up of expired/deleted 
> data?
No.
Scrub, and upgradesstables, are used to re-write each file on disk. Scrub
may remove some rows from a file because of corruption; upgradesstables will
not.

If you have long lived rows and a mixed work load of writes and deletes there 
are a couple of options. 

You can try levelled compaction 
http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

You can tune the default size-tiered compaction by increasing
min_compaction_threshold. This will increase the number of files that must
exist in each size tier before it will be compacted. As a result, the speed
at which rows move into the higher tiers will slow down.

Note that having lots of files may have a negative impact on read
performance. You can measure this by looking at the SSTables-per-read metric
in the cfhistograms.

Lastly, you can run a user defined or major compaction. User defined
compaction is available via JMX and allows you to compact any file you want.
Manual / major compaction is available via nodetool. We usually discourage
its use as it will create one big file that will not get compacted for a
while.
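
A hedged sketch of that JMX route, assuming the 1.0/1.1-era CompactionManager
MBean (the object name, operation name, and two-String-argument signature are
assumptions; verify them in jconsole first):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UserDefinedCompaction {
    public static void main(String[] args) throws Exception {
        JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi"));
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
        // Compact exactly the named sstable; the keyspace and Data.db file
        // name here are hypothetical placeholders for your own.
        mbs.invoke(new ObjectName("org.apache.cassandra.db:type=CompactionManager"),
                "forceUserDefinedCompaction",
                new Object[] { "MyKeyspace", "MyKeyspace-MyCF-hc-42-Data.db" },
                new String[] { "java.lang.String", "java.lang.String" });
        jmxc.close();
    }
}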


For background, the tombstones / expired columns for a row are only purged
from the database when all fragments of the row are in the files being
compacted. So if you have an old row that is spread out over many files, it
may not get purged.

Hope that helps. 



-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 3:01 AM, Mike Smith  wrote:

> I'm using 1.0.12 and I find that large sstables tend to get compacted 
> infrequently. I've got data that gets deleted or expired frequently. Is it 
> possible to use scrub to accelerate the clean up of expired/deleted data?
> 
> -- 
> Mike Smith
> Director Development, MailChannels
> 



Re: BulkOutputFormat error - org.apache.thrift.transport.TTransportException

2012-12-13 Thread aaron morton
It looks like it cannot connect to the server.

>conf.set("cassandra.output.thrift.address", "localhost");
Is this the same address as the rpc_address in the cassandra config ? 
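
As a quick check, a minimal sketch (assuming the 1.x Thrift API) you can run
from the Hadoop task nodes; note that "localhost" there is the task node
itself, not necessarily the machine running Cassandra, which is a common
cause of this error:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftPing {
    public static void main(String[] args) throws Exception {
        TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
        tr.open(); // throws TTransportException if the rpc port is unreachable
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        System.out.println("Connected, API version " + client.describe_version());
        tr.close();
    }
}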

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 9:57 AM, anand_balara...@homedepot.com wrote:




Re: State of Cassandra and Java 7

2012-12-13 Thread Rob Coli
On Thu, Dec 13, 2012 at 11:43 AM, Drew Kutcharian  wrote:
> With Java 6 being EOL-ed soon
> (https://blogs.oracle.com/java/entry/end_of_public_updates_for), what's the
> status of Cassandra's Java 7 support? Anyone using it in production? Any
> outstanding *known* issues?

I'd love to see an official statement from the project, due to the
sort of EOL issues you're referring to. Unfortunately previous
requests on this list for such a statement have gone unanswered.

The non-official response is that various people run in production
with Java 7 and it seems to work. :)

=Rob

-- 
=Robert Coli
AIM>ALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


Re: Why Secondary indexes is so slowly by my test?

2012-12-13 Thread Chengying Fang
I did miss this important article about indexing, which discusses exactly
the concerns in focus here. In fact, I have used composite columns to solve
my problem. In some contexts the data model can serve as an 'alternate
index', but it's complicated and can create new problems: data redundancy
and maintenance. Most importantly, the items I can query are decided at
design time; that is, No Design, No Query, even though all the data is
there. Thanks to all.
 
-- Original --
From:  "Edward Capriolo";
Date:  Fri, Dec 14, 2012 01:31 AM
To:  "user"; 

Subject:  Re: Why Secondary indexes is so slowly by my test?

 
Here is a good start.

http://www.anuff.com/2011/02/indexing-in-cassandra.html

On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ  wrote:
 Hi Edward, can you share the link to this blog?


 Alain

2012/12/13 Edward Capriolo 
 Ed Anuff's

ETL Tools to transfer data from Cassandra into other relational databases

2012-12-13 Thread cko2...@gmail.com
We will use Cassandra as logging storage in one of our web applications. The
application only inserts rows into Cassandra and never updates or deletes any
rows. The CF is expected to grow by about 0.5 million rows per day.
 
We need to transfer the data in Cassandra to another relational database daily. 
Due to the large size of the CF, instead of truncating the relational table and 
reloading all rows into it each time, we plan to run a job to select the 
"delta" rows since the last run and insert them into the relational database.
 
We know we can use Java, Pig or Hive to extract the delta rows to a flat file 
and load the data into the target relational table. We are particularly 
interested in a process that can extract delta rows without scanning the entire 
CF.
 
Has anyone used any other ETL tools to do this kind of delta extraction from 
Cassandra? We appreciate any comments and experience.
 
Thanks,
Chin
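
One hedged sketch of avoiding a full scan: if the application also appends
each new log row key to a per-day index row as it writes, the daily job only
has to read that one row. All CF and key names below are hypothetical, and a
very wide day row should be paged rather than read in one slice:

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;

// Reads the index row for one day from a hypothetical "LogByDay" CF, where
// each column name is assumed to be the row key of a log entry written on
// that day (the day row key is a "2012-12-13"-style date string).
static List<ColumnOrSuperColumn> rowKeysForDay(Cassandra.Client client, String day)
        throws Exception {
    SlicePredicate p = new SlicePredicate();
    p.setSlice_range(new SliceRange(ByteBuffer.wrap(new byte[0]),
            ByteBuffer.wrap(new byte[0]), false, 10000)); // page in practice
    return client.get_slice(ByteBuffer.wrap(day.getBytes("UTF-8")),
            new ColumnParent("LogByDay"), p, ConsistencyLevel.ONE);
}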


Re: Does a scrub remove deleted/expired columns?

2012-12-13 Thread Mike Smith
Thanks for the great explanation.

I'd just like some clarification on the last point. Is it the case that if
I constantly add new columns to a row, while periodically trimming the row
by deleting the oldest columns, the deleted columns won't get cleaned up
until all fragments of the row exist in a single sstable and that sstable
undergoes a compaction?

If my understanding is correct, do you know if 1.2 will enable cleanup of
columns in rows that have scattered fragments? Or, should I take a
different approach?



On Thu, Dec 13, 2012 at 5:52 PM, aaron morton wrote:



-- 
Mike Smith
Director Development, MailChannels


Re: ETL Tools to transfer data from Cassandra into other relational databases

2012-12-13 Thread Milind Parikh
Why would you use Cassandra as the primary store of logging information?
Have you considered Kafka?

You could, of course, fan out the logs to both Cassandra (on a near
real-time basis) and then, on a daily basis if you wish, extract the
"deltas" from Kafka into an RDBMS, with no Pig/Hive etc.


Regards
Milind





On Thu, Dec 13, 2012 at 7:19 PM, cko2...@gmail.com wrote:



RE: BulkOutputFormat error - org.apache.thrift.transport.TTransportException

2012-12-13 Thread ANAND_BALARAMAN
Aaron
Both the rpc_address in the cassandra.yaml file and the job configuration
are the same (localhost).
I will try connecting to a different Cassandra cluster and test it again.

-Original Message-
From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Thursday, December 13, 2012 9:03 PM
To: user@cassandra.apache.org
Subject: Re: BulkOutputFormat error - 
org.apache.thrift.transport.TTransportException

Looks like it cannot connect to the server

>conf.set("cassandra.output.thrift.address", "localhost");
Is this the same address as the rpc_address in the cassandra config ?

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/12/2012, at 9:57 AM, anand_balara...@homedepot.com wrote:
