Re: Flume and Cassandra

2012-02-10 Thread aaron morton
> How to do it? Do I need to build a custom plugin/sink, or can I configure an 
> existing sink to write data in a custom way?
This is a good starting point https://github.com/thobbs/flume-cassandra-plugin

> 2 - My business process also uses my Cassandra DB (without flume, directly via 
> thrift); how do I ensure that log writing won't overload my database and 
> introduce latency in my business process?
Anytime you have a data stream you don't control, it's a good idea to put some 
sort of buffer between the outside world and the database. Flume has a 
buffered sink; I think you can subclass it and aggregate the counters for a 
minute or two 
http://archive.cloudera.com/cdh/3/flume/UserGuide/#_buffered_sink_and_decorator_semantics
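
A rough sketch of the aggregation side with Hector (not the sink plumbing 
itself; the counter CF name "Stats" and the key/column layout are my 
assumptions):

import java.util.HashMap;
import java.util.Map;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Buffer counter increments in memory; flush them in one batch every minute or two.
public class CounterBuffer {
    private final Map<String, Long> pending = new HashMap<String, Long>();
    private final Keyspace keyspace;

    public CounterBuffer(Keyspace keyspace) { this.keyspace = keyspace; }

    // Called once per log event; no Cassandra I/O here.
    public synchronized void count(String columnName) {
        Long current = pending.get(columnName);
        pending.put(columnName, current == null ? 1L : current + 1L);
    }

    // Called from a timer; one mutation per distinct counter instead of one per event.
    public synchronized void flush(String rowKey) {
        Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
        for (Map.Entry<String, Long> e : pending.entrySet()) {
            m.addCounter(rowKey, "Stats",
                    HFactory.createCounterColumn(e.getKey(), e.getValue()));
        }
        m.execute();
        pending.clear();
    }
}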

Hope that helps. 
A
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 10/02/2012, at 4:27 AM, Alain RODRIGUEZ wrote:

> Hi,
> 
> 1 - I would like to generate some statistics and store some raw events from 
> log files tailed with flume. I saw some plugins providing Cassandra sinks, but I 
> would like to store data in a custom way, storing raw data but also 
> incrementing counters to get near real-time statistics. How to do it? Do I 
> need to build a custom plugin/sink, or can I configure an existing sink to 
> write data in a custom way?
> 
> 2 - My business process also uses my Cassandra DB (without flume, directly via 
> thrift); how do I ensure that log writing won't overload my database and 
> introduce latency in my business process? I mean, is there a way to 
> manage the throughput sent by the flume tails and slow them down when my 
> Cassandra cluster is overloaded? I would like to avoid building 2 separate 
> clusters.
> 
> Thank you,
> 
> Alain
> 



Re: Tips for using OrderedPartitioner

2012-02-10 Thread aaron morton
> Also, if there's a hot spot, is there any way out of it, other than restarting 
> from scratch…
A cluster with a changed partitioner is like a mule with a spinning wheel. No 
one knows how it changed, and danged if it knows how to return your data. 
(You cannot change the partitioner.)

By uniform I meant evenly distributed across the range of values. That is what 
the RandomPartitioner does by using the MD5 transform (which also means we know 
that the tokens have a finite range).
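
The token for a key works out to something like this (a sketch of what the 
partitioner does internally, not its actual code):

import java.math.BigInteger;
import java.security.MessageDigest;

public class TokenSketch {
    // RandomPartitioner-style token: MD5 the key and treat the digest as a
    // non-negative 128-bit integer, so tokens land uniformly in [0, 2^127]
    // no matter how skewed the keys themselves are.
    static BigInteger token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        return new BigInteger(digest).abs();
    }
}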

Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 10/02/2012, at 8:31 AM, Tharindu Mathew wrote:

> That sounds like writing a DB... indexing the index row :)
> 
> By making the keys uniform, do you mean keeping the initial X characters 
> the same or the last Y the same... Could you elaborate, please?
> 
> Also, if there's a hot spot, is there any way out of it, other than restarting 
> from scratch...
> 
> On Tue, Jan 24, 2012 at 3:50 PM, R. Verlangen  wrote:
> If you would like to index your rows in an "index-row", you could also choose 
> to index the "index-rows". This will scale up for any needs and create a 
> tree structure.
> 
> 
> 2012/1/24 aaron morton 
> Nothing I can think of other than making the keys uniform.
> 
> Having a single index row with the RP can be a pain. Is there a way to 
> partition it ?
> 
> Cheers
> 
> 
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 23/01/2012, at 11:42 PM, Tharindu Mathew wrote:
> 
>> Hi,
>> 
>> We use Cassandra in a way where we always want to run range slice queries. 
>> Because of the tendency to create hotspots with the OrderedPartitioner, we 
>> decided to use the RandomPartitioner. Then we use a row as an index row, 
>> holding the row keys of the other rows in the CF.
>> 
>> I feel this has become a burden and would like to move to the 
>> OrderedPartitioner to avoid this workaround. The index row has become 
>> cumbersome when we query the data store.
>> 
>> Are there any tips we can follow to reduce the number of hot spots?
>> 
>> -- 
>> Regards,
>> 
>> Tharindu
>> 
>> blog: http://mackiemathew.com/
>> 
> 
> 
> 
> 
> 
> -- 
> Regards,
> 
> Tharindu
> 
> blog: http://mackiemathew.com/
> 



problem with sliceQuery with composite column

2012-02-10 Thread Deno Vichas

All,

Could somebody clue me in to why the code below doesn't work? My schema is:

create column family StockHistory
with comparator = 'CompositeType(LongType,UTF8Type)'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type';


The time part works, but I'm getting other columns whose second half doesn't 
equal the value set. It's like it's ignoring the string part of 
the composite.


Composite start = new Composite();
Composite end = new Composite();
start.addComponent(0,
        startDate.toDateTimeAtStartOfDay().toDate().getTime(),
        Composite.ComponentEquality.EQUAL);
end.addComponent(0,
        endDate.toDateMidnight().toDate().getTime(),
        Composite.ComponentEquality.EQUAL);

start.addComponent(1, "ticks", Composite.ComponentEquality.EQUAL);
end.addComponent(1, "ticks",
        Composite.ComponentEquality.GREATER_THAN_EQUAL);

SliceQuery<String, Composite, String> sliceQuery =
        HFactory.createSliceQuery(_keyspace, _stringSerializer,
                new CompositeSerializer(), _stringSerializer);

sliceQuery.setColumnFamily(CF_STOCK_HISTORY).setKey(symbol);
sliceQuery.setRange(start, end, false, 10);

QueryResult<ColumnSlice<Composite, String>> result =
        sliceQuery.execute();

ColumnSlice<Composite, String> cs = result.get();
SortedSet<String> historyJSON = new TreeSet<String>();
for (HColumn<Composite, String> col : cs.getColumns()) {
    System.out.println(col.getName().get(0, _longSerializer)
            + "|" + col.getName().get(1, StringSerializer.get()));
}


This outputs the following:

132703560|ticks
132704640|quote
132729480|ticks
132730560|quote
132738120|ticks
132739200|quote
132746760|ticks
132747840|quote
132755400|ticks
132756480|quote
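
If I understand composites right, they compare on the long component first and 
only fall through to the string on ties, so the slice legitimately includes 
every column whose timestamp falls strictly between start and end, whatever its 
string part; the "ticks" bound only bites at the exact boundary timestamps. 
Assuming that's the case, a client-side filter works around it (a sketch 
reusing the slice above):

for (HColumn<Composite, String> col : cs.getColumns()) {
    // keep only the "ticks" columns; the "quote" ones sort inside the range anyway
    if ("ticks".equals(col.getName().get(1, StringSerializer.get()))) {
        System.out.println(col.getName().get(0, _longSerializer) + "|ticks");
    }
}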

thanks,
deno


Re: internode communication using multiple network interfaces

2012-02-10 Thread Chris Hart
Thanks.  Setting the broadcast address to the external IP address and setting 
the listen_address to 0.0.0.0 seems to have fixed it.  Does that mean that all 
other nodes, even those on the same local network, will communicate with that 
node using its external IP address?  It would be much better if nodes on the 
local network could use the internal IP address and only nodes not on the same 
network would use the external one.
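
For reference, the combination that seems to be working for us looks like this 
(the external IP below is a placeholder):

# cassandra.yaml on dw01
listen_address: 0.0.0.0            # bind both interfaces
broadcast_address: 203.0.113.10    # the address gossip advertises to the other nodes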

- Original Message -
From: "aaron morton" 
To: user@cassandra.apache.org
Sent: Thursday, February 9, 2012 12:42:54 AM
Subject: Re: internode communication using multiple network interfaces



I have 3 Cassandra nodes in one data center all on the same local network, 
which needs to replicate from an off site data center. Only 1 of the 3 nodes, 
called dw01, is externally accessible. 


If you want to run a multi data centre cluster, all the nodes in both data 
centers need to be able to connect to each other. 


When it comes to exposing nodes behind a firewall, broadcast_address can help; 
see the help in cassandra.yaml and 
https://issues.apache.org/jira/browse/CASSANDRA-2491 


Hope that helps. 







- 
Aaron Morton 
Freelance Developer 
@aaronmorton 
http://www.thelastpickle.com 


On 9/02/2012, at 9:56 AM, Chris Hart wrote: 



Hi, 

I have 3 Cassandra nodes in one data center all on the same local network, 
which needs to replicate from an off site data center. Only 1 of the 3 nodes, 
called dw01, is externally accessible. dw01 has 2 network interfaces, one 
externally accessible and one internal. All 3 nodes talk to each other fine 
when I set dw01's listen_address to the internal IP address. As soon as I set 
the listen_address to the external IP address, there is no communication 
between dw01 and other 2 nodes. The other nodes should be able to send to 
dw01's external IP address (I can telnet from them to dw01 on port 7000 and 
7001 just fine), but dw01 obviously would need to use its internal network 
interface to send anything to the other 2 nodes. Is this a setup that is 
possible with Cassandra? If not, any recommendations on how I could implement 
this? 

Thanks, 
Chris 



Keyspace missing on restart

2012-02-10 Thread Todd Fast
My single-node cluster was working fine yesterday. I ctrl+c'd it last 
night, as I typically do, and restarted it this morning.


Now, inexplicably, it doesn't know anything about my keyspace. The SSTable 
files are in the same directory as always and seem to be the 
expected size. I can't seem to do anything with nodetool, since the 
keyspace isn't known.


1. How can I recover the node?
2. What the heck happened that caused this?

Here is the console log (I'm on Windows 7):

Starting Cassandra Server
 INFO 09:33:41,803 Logging initialized
 INFO 09:33:41,809 JVM vendor/version: Java HotSpot(TM) Client VM/1.6.0_17
 INFO 09:33:41,809 Heap size: 1065484288/1065484288
 INFO 09:33:41,810 Classpath: ...
 INFO 09:33:41,815 JNA not found. Native methods will be disabled.
 INFO 09:33:41,826 Loading settings from 
file:/D:/Java/apache-cassandra-1.0.7/conf/cassandra.yaml
 INFO 09:33:41,930 DiskAccessMode 'auto' determined to be standard, 
indexAccessMode is standard

 INFO 09:33:41,939 Global memtable threshold is enabled at 338MB
 INFO 09:33:42,232 Opening 
\Data\cassandra\node1\system\LocationInfo-hc-1 (234 bytes)
 INFO 09:33:42,232 Opening 
\Data\cassandra\node1\system\LocationInfo-hc-2 (163 bytes)

 INFO 09:33:42,287 Couldn't detect any schema definitions in local storage.
 INFO 09:33:42,288 Found table data in data directories. Consider using 
the CLI to define your schema.
 INFO 09:33:42,307 Creating new commitlog segment 
/Data/cassandra/node1/commitlog\CommitLog-1328895222306.log
 INFO 09:33:42,316 Replaying 
\Data\cassandra\node1\commitlog\CommitLog-1328894913967.log
 INFO 09:33:42,356 Finished reading 
\Data\cassandra\node1\commitlog\CommitLog-1328894913967.log
 INFO 09:33:42,362 Enqueuing flush of Memtable-Versions@22744620(83/103 
serialized/live bytes, 3 ops)
 INFO 09:33:42,364 Writing Memtable-Versions@22744620(83/103 
serialized/live bytes, 3 ops)
 INFO 09:33:42,399 Completed flushing 
\Data\cassandra\node1\system\Versions-hc-1-Data.db (247 bytes)

 INFO 09:33:42,410 Log replay complete, 3 replayed mutations
 INFO 09:33:42,415 Cassandra version: 1.0.7
 INFO 09:33:42,416 Thrift API version: 19.20.0
 INFO 09:33:42,416 Loading persisted ring state
 INFO 09:33:42,420 Starting up server gossip
 INFO 09:33:42,429 Enqueuing flush of 
Memtable-LocationInfo@18721294(29/36 serialized/live bytes, 1 ops)
 INFO 09:33:42,430 Writing Memtable-LocationInfo@18721294(29/36 
serialized/live bytes, 1 ops)
 INFO 09:33:42,450 Completed flushing 
\Data\cassandra\node1\system\LocationInfo-hc-3-Data.db (80 bytes)

 INFO 09:33:42,459 Starting Messaging Service on port 7000
 INFO 09:33:42,469 Using saved token 
133677729504783243750441433892785690257
 INFO 09:33:42,471 Enqueuing flush of 
Memtable-LocationInfo@15427560(53/66 serialized/live bytes, 2 ops)
 INFO 09:33:42,471 Writing Memtable-LocationInfo@15427560(53/66 
serialized/live bytes, 2 ops)
 INFO 09:33:42,490 Completed flushing 
\Data\cassandra\node1\system\LocationInfo-hc-4-Data.db (163 bytes)

 INFO 09:33:42,494 Node localhost/127.0.0.1 state jump to normal
 INFO 09:33:42,511 Bootstrap/Replace/Move completed! Now serving reads.
 INFO 09:33:42,512 Will not load MX4J, mx4j-tools.jar is not in the 
classpath
 INFO 09:33:42,523 Compacting 
[SSTableReader(path='\Data\cassandra\node1\system\LocationInfo-hc-4-Data.db'), 
SSTableReader(path='\Data\cassandra\node1\system\LocationInfo-hc-2-Data.db'), 
SSTableReader(path='\Data\cassandra\node1\system\Loca
tionInfo-hc-3-Data.db'), 
SSTableReader(path='\Data\cassandra\node1\system\LocationInfo-hc-1-Data.db')]

 INFO 09:33:42,567 Binding thrift service to localhost/127.0.0.1:9160
 INFO 09:33:42,571 Using TFastFramedTransport with a max frame size of 
15728640 bytes.
 INFO 09:33:42,576 Using synchronous/threadpool thrift server on 
localhost/127.0.0.1 : 9160

 INFO 09:33:42,577 Listening for thrift clients...

Todd


hadoop distcp from brisk cluster to hadoop cluster

2012-02-10 Thread rk vishu
Could anyone tell me how we can copy data from a Cassandra-Brisk cluster to a
Hadoop-HDFS cluster?

1) Is there a way to do hadoop distcp between the clusters?
2) If a Hive table is created on the Brisk cluster, will it be in a format
similar to an HDFS file? Can we run MapReduce on the other cluster to transform
the Hive data (on Brisk)?

Thanks and Regards
RK


Deleting a column vs setting its value to empty

2012-02-10 Thread Drew Kutcharian
Hi Everyone,

Let's say I have the following object which I would like to save in Cassandra:

class User {
  UUID id; //row key
  String name; //columnKey: "name", columnValue: the name of the user
  String description; //columnKey: "description", columnValue: the description 
of the user
}

Description can be nullable. What's the best approach when a user updates her 
description and sets it to null? Should I delete the description column or set 
it to an empty string?

In addition, if I go with the delete column strategy, since I don't know what 
was the previous value of description (the column could not even exist), what 
would happen when I delete a non existent column?

Thanks,

Drew



Implications of length of column names

2012-02-10 Thread Drew Kutcharian
What are the implications of using short vs long column names? Is it better to 
use short column names or longer ones?

I know that for MongoDB you are better off using short field names 
http://www.mongodb.org/display/DOCS/Optimizing+Storage+of+Small+Objects   Does 
this apply to Cassandra column names?


-- Drew

Re: Keyspace missing on restart

2012-02-10 Thread Todd Fast
I found the problem; it was my fault. I made an accidental change to my 
cassandra.yaml file sometime between restarts and ended up pointing the 
node's data directory at a different disk. Check your paths!
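
For anyone else hitting this, these are the cassandra.yaml settings worth 
double-checking (the paths below are just examples):

data_file_directories:
    - /Data/cassandra/node1/data
commitlog_directory: /Data/cassandra/node1/commitlog
saved_caches_directory: /Data/cassandra/node1/saved_caches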


Todd


On 2/10/2012 10:22 AM, Todd Fast wrote:
My single-node cluster was working fine yesterday. I ctrl+c'd it last 
night, as I typically do, and restarted it this morning.


Now, inexplicably, it doesn't know anything about my keyspace. The SSTable 
files are in the same directory as always and seem to be the 
expected size. I can't seem to do anything with nodetool, since the 
keyspace isn't known.


1. How can I recover the node?
2. What the heck happened that caused this?

Here is the console log (I'm on Windows 7):

[console log snipped; identical to the original message above]

Todd


Re: Deleting a column vs setting its value to empty

2012-02-10 Thread Jeremiah Jordan
Either one works fine.  Setting it to "" may cause you fewer headaches, as 
you won't have to deal with tombstones.  Deleting a non-existent column 
is fine.
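
Something like this with Hector, assuming a CF named "User" keyed by the 
user id (an untested sketch):

Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
// Option 1: delete the column -- writes a tombstone; harmless if the column is absent
m.addDeletion(userId, "User", "description", StringSerializer.get());
// Option 2: overwrite with an empty value -- no tombstone to carry around
// m.addInsertion(userId, "User", HFactory.createStringColumn("description", ""));
m.execute();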


-Jeremiah

On 02/10/2012 02:15 PM, Drew Kutcharian wrote:

Hi Everyone,

Let's say I have the following object which I would like to save in Cassandra:

class User {
   UUID id; //row key
   String name; //columnKey: "name", columnValue: the name of the user
   String description; //columnKey: "description", columnValue: the description 
of the user
}

Description can be nullable. What's the best approach when a user updates her 
description and sets it to null? Should I delete the description column or set 
it to an empty string?

In addition, if I go with the delete column strategy, since I don't know what 
was the previous value of description (the column could not even exist), what 
would happen when I delete a non existent column?

Thanks,

Drew



Re: Implications of length of column names

2012-02-10 Thread Narendra Sharma
It is good to have short column names. They save space all the way from
network transfer to in-memory usage to storage. It is also a good idea to
club immutable columns that are read together and store them as a single
column. We gained significant overall performance benefits from this.
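
For example (a sketch; the packed column name and the delimiter are arbitrary
choices here):

// Pack immutable fields that are always read together into one short-named column.
Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
m.addInsertion(userId, "User", HFactory.createStringColumn("p",
        firstName + "|" + lastName + "|" + signupDate));
m.execute();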

-Naren

On Fri, Feb 10, 2012 at 12:20 PM, Drew Kutcharian  wrote:

> What are the implications of using short vs long column names? Is it
> better to use short column names or longer ones?
>
> I know for MongoDB you are better of using short field names
> http://www.mongodb.org/display/DOCS/Optimizing+Storage+of+Small+Objects
>  Does this apply to Cassandra column names?
>
>
> -- Drew
>



-- 
Narendra Sharma
Software Engineer
http://www.aeris.com
http://narendrasharma.blogspot.com/


Re: Deleting a column vs setting its value to empty

2012-02-10 Thread Narendra Sharma
IMO deleting is always better. It is better not to store the column if
there is no value associated with it.

-Naren

On Fri, Feb 10, 2012 at 12:15 PM, Drew Kutcharian  wrote:

> Hi Everyone,
>
> Let's say I have the following object which I would like to save in
> Cassandra:
>
> class User {
>  UUID id; //row key
>  String name; //columnKey: "name", columnValue: the name of the user
>  String description; //columnKey: "description", columnValue: the
> description of the user
> }
>
> Description can be nullable. What's the best approach when a user updates
> her description and sets it to null? Should I delete the description column
> or set it to an empty string?
>
> In addition, if I go with the delete column strategy, since I don't know
> what was the previous value of description (the column could not even
> exist), what would happen when I delete a non existent column?
>
> Thanks,
>
> Drew
>
>


-- 
Narendra Sharma
Software Engineer
http://www.aeris.com
http://narendrasharma.blogspot.com/