Bulk Load Hadoop to Cassandra

2014-11-05 Thread Vijay Kadel
Hi,

I intend to bulk load data from HDFS to Cassandra using a map-only program 
which uses the BulkOutputFormat class. Please advise me which versions of 
Cassandra and Hadoop would support such a use case. I am using Hadoop 2.2.0 and 
Cassandra 2.0.6, and I am getting the following error:

Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but 
class was expected

Thanks,
Vijay



Re: Unsubscribe

2014-11-05 Thread Alain RODRIGUEZ
http://cassandra.apache.org/#lists

2014-11-04 21:59 GMT+01:00 James Carman :

> You should have received an email when you signed up which gives you
> instructions on how to unsubscribe.  Otherwise, send an email to
> user-h...@cassandra.apache.org
>
> On Mon, Nov 3, 2014 at 10:30 PM, Malay Nilabh <
> malay.nil...@lntinfotech.com> wrote:
>
>>  Hi
>>
>>
>>
>> It was great to be part of this group. Thanks for helping out. Please
>> unsubscribe me now.
>>
>>
>>
>> Regards,
>>
>> Malay Nilabh
>>
>> BIDW BU/ Big Data CoE
>>
>> L&T Infotech Ltd, Hinjewadi, Pune
>>
>>
>> Email: malay.nil...@lntinfotech.com
>>
>>
>>
>>
>>
>
>


Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Wim Deblauwe
Hi,

I am currently testing with Cassandra and Spring Data Cassandra. I would
now need to store files (images and avi files, normally up to 50 MB in size).

I did find the Chunked Object Store from Astyanax, which looks promising.
However, I have no idea how to combine Astyanax with Spring Data Cassandra.

Also, an answer on SO states that Netflix is no longer working on Astyanax,
so maybe this is not a good option to base my application on?

Are there any other options (where I can keep using Spring Data Cassandra)?

I also read
http://www.datastax.com/docs/datastax_enterprise3.0/solutions/hadoop_multiple_cfs
but it is unclear to me whether I would need to install Hadoop as well to use this.

regards,

Wim
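
One common approach, sketched here in plain CQL (the table and column names and the 1 MB chunk size are hypothetical, not taken from Astyanax): split each file into fixed-size chunks and store one chunk per clustering row, so no single cell has to hold the whole 50 MB blob.

CREATE TABLE file_chunks (
    file_id uuid,
    chunk_index int,
    data blob,                      -- one chunk of the file, e.g. at most 1 MB
    PRIMARY KEY ((file_id), chunk_index)
);

-- write the chunks in order, then reassemble with:
SELECT chunk_index, data
FROM file_chunks
WHERE file_id = 123e4567-e89b-12d3-a456-426655440000;

Whether this beats a dedicated file store depends on how much file traffic the application sees; it mainly buys you a single system to operate.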


different disk foot print of cassandra data folder on copying

2014-11-05 Thread KZ Win
I have Cassandra nodes with long uptime.  The disk footprint of the
Cassandra data folder is different when I copy it to a different folder.
Why is that?  I have used rsync and cp.  This can be very confusing
when trying to do certain maintenance tasks like a hardware upgrade on
EC2 or backing up a snapshot.

I am talking about as much as a 100% difference for 25-40GB of data.  On
copying, the folders grow to double that.  The server's folder is on EC2
magnetic instance store and I copied it to various EBS volumes.  I do not think
it's something weird about EC2; when I copied the EBS data back to the
magnetic instance store, the size remained the same.  So I am guessing there is
some kind of Cassandra magic that is fooling operating-system
tools like du and df.

There is a similar issue with the commitlog folder, but its total size
is not as big and the percentage difference is small.

Thanks for any insight you can share

k.z.


Re: Cassandra heap pre-1.1

2014-11-05 Thread Robert Coli
On Tue, Nov 4, 2014 at 8:51 PM, Raj N  wrote:

> Is there a good formula to calculate heap utilization in Cassandra
> pre-1.1, specifically 1.0.10? We are seeing GC pressure on our nodes, and I
> am trying to estimate what could be causing it. Using nodetool info, my
> steady-state heap is at about 10GB. Xmx is 12G.
>

Basically, no. If you really want to know, take a heap dump and load it
into Eclipse Memory Analyzer.


> I have 4.5 GB of bloom filters which I can derive looking at cfstats
>

This is a *very* large percentage of your total heap, and is probably the
lever you have most influence on pulling.


> I have negligible row caching.
>

Row caching is generally not advised in that era, especially with heap
pressure.


> I have key caching enabled on my cfs. I couldn't find an easy way to
> estimate how much this is using, but I tried to invalidate the key cache
> and I got 1.3 GB back.
>

Key caching is generally advisable, but 1.3GB is a lot of key cache.


> That still only adds up to 5.8 GB. I know there is index sampling going on
> as well. I have around 800 million rows. Is there a way to estimate how
> much space this would add up to?
>

Plenty. You should reduce your bloom filter size, or upgrade to a version
of Cassandra that moves stuff off the heap.
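
A rough back-of-envelope sketch, assuming the pre-1.2 default index_interval of
128 and on the order of 50-100 bytes of key data plus object overhead per
on-heap sample:

  800,000,000 rows / 128        ≈ 6.25 million index samples
  6.25 million * 50-100 bytes   ≈ roughly 0.3-0.6 GB of heap

Longer keys push that figure up proportionally, so index sampling plausibly
accounts for a good part of the remaining gap.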

=Rob
http://twitter.com/rcolidba


Re: different disk foot print of cassandra data folder on copying

2014-11-05 Thread Robert Coli
On Wed, Nov 5, 2014 at 12:08 PM, KZ Win  wrote:

> I have cassandra nodes with long uptime.  Disk foot print for
> cassandra data older is different when I copy to a different folder.
>


> I am talking about as much 100% different for 25-40GB of data.  On
> copying they grow to double that.


1) Cassandra automatically "snapshots" SSTables when one does certain
operations.
2) One can also manually create snapshots.
3) Snapshots are hard links to files.
4) Hard links to files generally become duplicate files when copied to
another partition, unless rsync or cp is configured to maintain the hard
link relationship.
5) snapshots are kept in a subdirectory of the data directory for the
columnfamily.
6) This all has the pathological seeming outcome that snapshots become
effectively larger as time passes (because the hard links they contain
become the only copy of the file when the "original" is deleted from the
data directory via compaction) and might grow significantly when copied.

tl;dr : modify your rsync to include --exclude=snapshots/

=Rob


Re: different disk foot print of cassandra data folder on copying

2014-11-05 Thread KZ Win
Duh.  I totally forgot about my snapshotting just before daily rsync backup.

k.z.

On Wed, Nov 5, 2014 at 3:13 PM, Robert Coli  wrote:
> On Wed, Nov 5, 2014 at 12:08 PM, KZ Win  wrote:
>>
>> I have cassandra nodes with long uptime.  Disk foot print for
>> cassandra data older is different when I copy to a different folder.
>
>
>>
>> I am talking about as much 100% different for 25-40GB of data.  On
>> copying they grow to double that.
>
>
> 1) Cassandra automatically "snapshots" SSTables when one does certain
> operations.
> 2) One can also manually create snapshots.
> 3) Snapshots are hard links to files.
> 4) Hard links to files generally become duplicate files when copied to
> another partition, unless rsync or cp is configured to maintain the hard
> link relationship.
> 5) snapshots are kept in a subdirectory of the data directory for the
> columnfamily.
> 6) This all has the pathological seeming outcome that snapshots become
> effectively larger as time passes (because the hard links they contain
> become the only copy of the file when the "original" is deleted from the
> data directory via compaction) and might grow significantly when copied.
>
> tl;dr : modify your rsync to include --exclude=snapshots/
>
> =Rob
>


use select with different attributes present in where clause Cassandra

2014-11-05 Thread Chamila Wijayarathna
Hello all,

I need to create a Cassandra column family with the following attributes.

id bigint,
content varchar,
year int,
frequency int,

I want to get the content with the highest frequency in a given year using this
column family. Also, when inserting data into the table, for a given content and
year, I need to check whether an id already exists or not. How can I achieve this
with Cassandra?

I tried creating CF using

CREATE TABLE sinmin.word_time_inv_frequency (
id bigint,
content varchar,
year int,
frequency int,
PRIMARY KEY((year), frequency)
);

and then retrieved data using

SELECT id FROM word_time_inv_frequency WHERE year = 2010 ORDER BY frequency ;

But when using this, I can't check whether an entry already exists for the
(content, year) pair in the CF.

Thank You!

-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.
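
A hedged sketch of one common way to model this in CQL (both table names below
are hypothetical): denormalize into two tables and write to both, one keyed by
(content, year) so the existence check is a single-partition read, and one keyed
by year and clustered by frequency so the highest-frequency content for a year
is simply the first row.

CREATE TABLE sinmin.word_frequency_by_content (
    content varchar,
    year int,
    id bigint,
    frequency int,
    PRIMARY KEY ((content, year))
);

CREATE TABLE sinmin.word_frequency_by_year (
    year int,
    frequency int,
    content varchar,
    id bigint,
    PRIMARY KEY ((year), frequency, content)
) WITH CLUSTERING ORDER BY (frequency DESC, content ASC);

-- existence check for a (content, year) pair before inserting:
SELECT id FROM sinmin.word_frequency_by_content WHERE content = 'example' AND year = 2010;

-- highest-frequency content in a given year:
SELECT content, frequency FROM sinmin.word_frequency_by_year WHERE year = 2010 LIMIT 1;

The second table keeps content in the clustering key so that two words with the
same frequency in the same year do not overwrite each other.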


Re: Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Redmumba
Astyanax isn't deprecated; that user is wrong and is downvoted--and has a
comment mentioning the same.

What you're describing doesn't sound like you need a data store at all; it
/sounds/ like you need a file store.  Why not use S3 or similar to store
your images?  What benefits are you expecting to receive from Cassandra?
It sounds like you're incurring an awful lot of overhead for what amounts
to a file lookup.

On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe  wrote:

> Hi,
>
> I am currently testing with Cassandra and Spring Data Cassandra. I would
> now need to store files (images and avi files, normally up to 50 Mb big).
>
> I did find the Chuncked Object store
>  from
> Astyanax  which looks promising. However, I have no idea on how to combine
> Astyanax with Spring Data Cassandra ?
>
> Also this answer on SO  states
> that Netflix is no longer working on Astyanax, so maybe this is not a good
> option to base my application?
>
> Are there any other options (where I can keep using Spring Data Cassandra)?
>
> I also read
> http://www.datastax.com/docs/datastax_enterprise3.0/solutions/hadoop_multiple_cfs
> but it is unclear to me if I would need to install Hadoop as well if I want
> to use this?
>
> regards,
>
> Wim
>


Re: Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Robert Coli
On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe  wrote:

> I am currently testing with Cassandra and Spring Data Cassandra. I would
> now need to store files (images and avi files, normally up to 50 Mb big).
>

https://github.com/mogilefs/

A+ for distributed/replicated file storage; would use again in a heartbeat.

Yes, it uses MySQL as the datastore; fortunately, most people know how to
make MySQL available enough to be the metadata store for a filesystem.

=Rob
http://twitter.com/rcolidba


Re: Cassandra heap pre-1.1

2014-11-05 Thread Raj N
We are planning to upgrade soon. But in the meantime, I wanted to see if we
can tweak certain things.

-Rajesh

On Wed, Nov 5, 2014 at 3:10 PM, Robert Coli  wrote:

> On Tue, Nov 4, 2014 at 8:51 PM, Raj N  wrote:
>
>> Is there a good formula to calculate heap utilization in Cassandra
>> pre-1.1, specifically 1.0.10. We are seeing gc pressure on our nodes. And I
>> am trying to estimate what could be causing this? Using node tool info my
>> steady state heap is at about 10GB. XMX is 12G.
>>
>
> Basically, no. If you really want to know, take a heap dump and load it
> into Eclipse Memory Analyzer.
>
>
>> I have 4.5 GB of bloom filters which I can derive looking at cfstats
>>
>
> This is a *very* large percentage of your total heap, and is probably the
> lever you have most influence on pulling.
>
>
>> I have negligible row caching.
>>
>
> Row caching is generally not advised in that era, especially with heap
> pressure.
>
>
>> I have key caching enabled on my cfs. I couldn't find an easy way to
>> estimate how much this is using, but I tried to invalidate the key cache
>> and I got 1.3 GB back.
>>
>
> Key caching is generally advisable, but 1.3GB is a lot of key cache..
>
>
>> That still only adds up to 5.8 GB. I know there is index sampling going
>> on as well. I have around 800 million rows. Is there a way to estimate how
>> much space this would add up to?
>>
>
> Plenty. You should reduce your bloom filter size, or upgrade to a version
> of Cassandra that moves stuff off the heap.
>
> =Rob
> http://twitter.com/rcolidba
>
>
>


Why is one query 10 times slower than the other?

2014-11-05 Thread Jacob Rhoden
Hi Guys,

I have two cassandra 2.0.5 nodes, RF=2. When I do a:

select * from table1 where clustercolumn='something'

The trace indicates that it only needs to talk to one node, which I would have 
expected. However when I do a:

select * from table2

Which is a small table with only 20 rows in it, should be fully replicated, 
and should be a much quicker query, the trace indicates that Cassandra is talking 
to both nodes. This adds 200ms to the query time, and is not necessary for 
my application (this table might have an amendment once per year, if that); 
there's no real need to check both nodes for consistency.

At this point I’ve not altered anything to do with consistency level. Does this 
mean that Cassandra attempts to guess/infer what consistency level you need 
depending on whether your query includes a filter on a particular partition key or 
clustering key?

Thanks,
Jacob


CREATE KEYSPACE mykeyspace WITH replication = { 'class': 'SimpleStrategy', 
'replication_factor': '2' };

CREATE TABLE organisation (uuid uuid, name text, url text, PRIMARY KEY (uuid));

CREATE TABLE lookup_code (type text, code text, name text, PRIMARY KEY ((type), 
code));


select * from lookup_code where type='mylist':

 activity                                                                   | timestamp    | source       | source_elapsed
-----------------------------------------------------------------------------+--------------+--------------+----------------
 execute_cql3_query                                                          | 04:20:15,319 | 74.50.54.123 |              0
 Parsing select * from lookup_code where type='research_area' LIMIT 1;       | 04:20:15,319 | 74.50.54.123 |             64
 Preparing statement                                                         | 04:20:15,320 | 74.50.54.123 |            204
 Executing single-partition query on lookup_code                             | 04:20:15,320 | 74.50.54.123 |            849
 Acquiring sstable references                                                | 04:20:15,320 | 74.50.54.123 |            870
 Merging memtable tombstones                                                 | 04:20:15,320 | 74.50.54.123 |            894
 Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones   | 04:20:15,320 | 74.50.54.123 |            958
 Merging data from memtables and 0 sstables                                  | 04:20:15,320 | 74.50.54.123 |            976
 Read 168 live and 0 tombstoned cells                                        | 04:20:15,321 | 74.50.54.123 |           1412
 Request complete                                                            | 04:20:15,321 | 74.50.54.123 |           2043


select * from organisation:

 activity                                                                                         | timestamp    | source       | source_elapsed
---------------------------------------------------------------------------------------------------+--------------+--------------+----------------
 execute_cql3_query                                                                               | 04:21:03,641 | 74.50.54.123 |              0
 Parsing select * from organisation LIMIT 1;                                                      | 04:21:03,641 | 74.50.54.123 |             68
 Preparing statement                                                                              | 04:21:03,641 | 74.50.54.123 |            174
 Determining replicas to query                                                                    | 04:21:03,642 | 74.50.54.123 |            307
 Enqueuing request to /72.249.82.85                                                               | 04:21:03,642 | 74.50.54.123 |           1034
 Sending message to /72.249.82.85                                                                 | 04:21:03,643 | 74.50.54.123 |           1402
 Message received from /74.50.54.123                                                              | 04:21:03,644 | 72.249.82.85 |             47
 Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)]  | 04:21:03,644 | 72.249.82.85 |            461
 Read 1 live and 0 tombstoned cells                                                               | 04:21:03,644 | 72.249.82.85 |            560
 Read 1 live and 0 tombstoned cells                                                               | 04:21:03,644 | 72.249.82.85 |            611

 ... etc ...



Re: Why is one query 10 times slower than the other?

2014-11-05 Thread graham sanderson
In your “lookup_code” example, “type” is not a clustering column; it is the partition 
key, and hence the first query only hits one partition.
The second query is a range slice across all possible keys, so the sub-ranges 
are farmed out to the nodes that hold the data.
You are likely at CL_ONE, so it only needs a response from one node for each 
sub-range… I guess it has decided (based on the snitch) that it is not 
unreasonable to spread the query across the two nodes.
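
If the extra hop really matters for a tiny, rarely-changing lookup table, one hedged
workaround (a sketch only; the bucket column is a synthetic addition, not part of the
original schema) is to give the table a single constant partition key, so reading the
whole table becomes a single-partition query served from one replica:

CREATE TABLE organisation (
    bucket int,        -- always written as 0; synthetic partition key so all rows share one partition
    uuid uuid,
    name text,
    url text,
    PRIMARY KEY ((bucket), uuid)
);

SELECT * FROM organisation WHERE bucket = 0;

This only makes sense for very small tables, since every row then lives in the same
partition on the same replica set.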

> On Nov 5, 2014, at 10:41 PM, Jacob Rhoden  wrote:
> 
> Hi Guys,
> 
> I have two cassandra 2.0.5 nodes, RF=2. When I do a:
> 
> select * from table1 where clustercolumn=‘something'
> 
> The trace indicates that it only needs to talk to one node, which I would 
> have expected. However when I do a:
> 
> select * from table2
> 
> Which is a small table with only has 20 rows in it, should be fully 
> replicated, and should be a much quicker query, trace indicates that 
> cassandra is talking to both nodes. This adds a 200ms to the query results, 
> and is not necessary for my application (this table might have an amendment 
> once per year if that), theres no real need to check both nodes for 
> consistency.
> 
> At this point I’ve not altered anything to do with consistency level. Does 
> this mean that cassandra attempts to guess/infer what consistency level you 
> need depending on if your query includes a filter on a particular key or 
> clustering key?
> 
> Thanks,
> Jacob
> 
> 
> CREATE KEYSPACE mykeyspace WITH replication = { 'class': 'SimpleStrategy', 
> 'replication_factor': ‘2' };
> 
> CREATE TABLE organisation (uuid uuid, name text, url text, PRIMARY KEY (uuid))
> 
> CREATE TABLE lookup_code (type text, code text, name text, PRIMARY KEY 
> ((type), code)) 
> 
> 
> select * from lookup_code where type=‘mylist':
> 
>  activity  | 
> timestamp| source   | source_elapsed
> ---+--+--+
> execute_cql3_query | 
> 04:20:15,319 | 74.50.54.123 |  0
>  Parsing select * from lookup_code where type='research_area' LIMIT 1; | 
> 04:20:15,319 | 74.50.54.123 | 64
>Preparing statement | 
> 04:20:15,320 | 74.50.54.123 |204
>Executing single-partition query on lookup_code | 
> 04:20:15,320 | 74.50.54.123 |849
>   Acquiring sstable references | 
> 04:20:15,320 | 74.50.54.123 |870
>Merging memtable tombstones | 
> 04:20:15,320 | 74.50.54.123 |894
>  Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones | 
> 04:20:15,320 | 74.50.54.123 |958
> Merging data from memtables and 0 sstables | 
> 04:20:15,320 | 74.50.54.123 |976
>   Read 168 live and 0 tombstoned cells | 
> 04:20:15,321 | 74.50.54.123 |   1412
>   Request complete | 
> 04:20:15,321 | 74.50.54.123 |   2043
> 
> 
> select * from organisation:
> 
>  activity 
>| timestamp| source   | source_elapsed
> -+--+--+
>   
> execute_cql3_query | 04:21:03,641 | 74.50.54.123 |  0
>  Parsing select * from 
> organisation LIMIT 1; | 04:21:03,641 | 74.50.54.123 | 68
>  
> Preparing statement | 04:21:03,641 | 74.50.54.123 |174
>
> Determining replicas to query | 04:21:03,642 | 74.50.54.123 |307
>   Enqueuing 
> request to /72.249.82.85 | 04:21:03,642 | 74.50.54.123 |   1034
> Sending 
> message to /72.249.82.85 | 04:21:03,643 | 74.50.54.123 |   1402
>  Message received 
> from /74.50.54.123 | 04:21:03,644 | 72.249.82.85 | 47
>  Executing seq scan across 0 sstables for [min(-9223372036854775808), 
> min(-9223372036854775808)] | 04:21:03,644 | 72.249.82.85 |461
>   Read 1 live and 
> 0 tombstoned cells | 04:21:03,644 | 72.249.82.85 |

Re: tuning concurrent_reads param

2014-11-05 Thread Jimmy Lin
Sorry, I have a late follow-up question.

In the cassandra.yaml file, the concurrent_reads section has the following
comment (quoted below):

What does it mean by "allow the operations to enqueue low enough in the stack
that the OS and drives can reorder them"? How does that help keep the
system healthy?
What really happens if we increase it to too high a value? (Maybe it affects
other read or write operations as it eats up all the disk I/O resources?)


thanks


# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. "concurrent_reads" should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them.

On Wed, Oct 29, 2014 at 8:47 PM, Chris Lohfink 
wrote:

> There's a bit to it; sometimes it can use tweaking, though.  It's a good
> default for most systems, so I wouldn't increase it right off the bat. When
> using SSDs or something with a lot of horsepower it could be higher though
> (i.e. i2.xlarge+ on EC2).  If you monitor the number of active threads in the
> read thread pool (nodetool tpstats) you can see whether they are actually all
> busy or not.  If it's near 32 (or whatever you set it at) all the time, it
> may be a bottleneck.
>
> ---
> Chris Lohfink
>
> On Wed, Oct 29, 2014 at 10:41 PM, Jimmy Lin  wrote:
>
>> Hi,
>> looking at the docs, the default value for concurrent_reads is 32, which
>> seems a bit small to me (compared to, say, an HTTP server), because if my node is
>> receiving even slight traffic, any more than 32 concurrent read queries will have
>> to wait(?)
>>
>> The recommended rule is 16 * number of drives. Would that be different if I
>> have SSDs?
>>
>> I am attempting to increase it because I have a few tables with wide rows
>> that the app will fetch; the sheer size of that data may already be eating up the
>> thread time, which can cause other read threads to wait and essentially
>> slow down.
>>
>> thanks
>>
>>
>>
>>
>


Re: Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Wim Deblauwe
Hi,

We are building an application that we install on-premises; usually
there is no internet connection there at all. As I am using Cassandra for
storing everything else in the application, it would be very convenient to
also use Cassandra for those files, so I don't have to set up two distributed
systems for each installation we do.

Is there documentation somewhere on how to integrate/get started with
Astyanax and Spring Data Cassandra?

regards,

Wim

2014-11-05 23:40 GMT+01:00 Redmumba :

> Astyanax isn't deprecated; that user is wrong and is downvoted--and has a
> comment mentioning the same.
>
> What you're describing doesn't sound like you need a data store at all; it
> /sounds/ like you need a file store.  Why not use S3 or similar to store
> your images?  What benefits are you expecting to receive from Cassandra?
> It sounds like you're incurring an awful lot of overhead for what amounts
> to a file lookup.
>
> On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe 
> wrote:
>
>> Hi,
>>
>> I am currently testing with Cassandra and Spring Data Cassandra. I would
>> now need to store files (images and avi files, normally up to 50 Mb big).
>>
>> I did find the Chuncked Object store
>>  from
>> Astyanax  which looks promising. However, I have no idea on how to combine
>> Astyanax with Spring Data Cassandra ?
>>
>> Also this answer on SO 
>> states that Netflix is no longer working on Astyanax, so maybe this is not
>> a good option to base my application?
>>
>> Are there any other options (where I can keep using Spring Data
>> Cassandra)?
>>
>> I also read
>> http://www.datastax.com/docs/datastax_enterprise3.0/solutions/hadoop_multiple_cfs
>> but it is unclear to me if I would need to install Hadoop as well if I want
>> to use this?
>>
>> regards,
>>
>> Wim
>>
>
>


Counter column impossible to delete and re-insert

2014-11-05 Thread Clément Fumey
Hi,

I have a table with a counter column. When I insert (update) a row, delete
it, and try to re-insert it, the re-insert fails. Here are the commands
I use:

CREATE TABLE test(
testId int,
year int,
testCounter counter,
PRIMARY KEY (testId, year)
) WITH CLUSTERING ORDER BY (year DESC);

UPDATE test SET testcounter = testcounter +5 WHERE testid = 2 AND year =
2014;
DELETE FROM test WHERE testid = 2 AND year = 2014;
UPDATE test SET testcounter = testcounter +5 WHERE testid = 2 AND year =
2014;

The last command fails; there is no error message, but the table is empty
after it.
Is that normal? Am I doing something wrong?

Regards

Clément
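
For what it's worth, this matches the documented limitation that deleting a counter
and then incrementing it again has undefined results. A hedged workaround sketch in
CQL is to avoid DELETE on counter rows altogether and instead read the current value
and subtract it to reset the counter in place:

-- instead of: DELETE FROM test WHERE testid = 2 AND year = 2014;
SELECT testcounter FROM test WHERE testid = 2 AND year = 2014;   -- suppose this returns 5
UPDATE test SET testcounter = testcounter - 5 WHERE testid = 2 AND year = 2014;

-- later increments on the same row then behave normally:
UPDATE test SET testcounter = testcounter + 5 WHERE testid = 2 AND year = 2014;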