Read TPS in 0.8.6

2011-12-23 Thread Jeesoo Shin
Hi, I'm doing stress test with tools/stress (java version)
I used 3 EC2 XLarge instances, each with 4 instance-store disks in RAID, for the cluster.

I get write TPS min 4000 & up to 1
but I only get about 50 TPS for reads.
Is this right? What am I doing wrong?
These are the options:
java -jar stress.jar --nodes=ip1,ip2,ip3 --consistency-level=QUORUM
--progress-interval=10 --operation=INSERT --num-keys=1000
--threads=50 --family-type=Standard --columns=1 --column-size=400
java -jar stress.jar --nodes=ip1,ip2,ip3 --consistency-level=QUORUM
--progress-interval=10 --operation=READ --num-keys=1000
--threads=50 --family-type=Standard --columns=1 --column-size=400



I also ran the stress test against 1.0.6 and got similar write and read TPS.

Any insight?


Re: Counters and Top 10

2011-12-23 Thread aaron morton
Counters only update the value of a column; they cannot be used as column 
names. So you cannot have a dynamically updating top ten list using counters.

You have a couple of options. The first is to use something like Redis, if that fits your 
use case. Redis could either be the database of record for the counts, or just 
an aggregation layer: write the data to Cassandra and to sorted sets in Redis, then 
read the top ten from Redis, and use Cassandra to rebuild Redis if needed.
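For illustration, a rough sketch of that aggregation-layer idea using the Jedis
client (the sorted-set key, user id and score increment below are made-up
placeholders, and the Cassandra write of record is left out):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;

public class TopTenSketch {
    // Hypothetical sorted-set key; use whatever naming fits your app.
    private static final String LEADERBOARD = "credits:leaderboard";

    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);

        // Write path: after persisting the credit event to Cassandra (not shown),
        // bump the user's score in the Redis sorted set.
        jedis.zincrby(LEADERBOARD, 5.0, "user-1234");

        // Read path: the ten highest scores, best first.
        for (Tuple t : jedis.zrevrangeWithScores(LEADERBOARD, 0, 9)) {
            System.out.println(t.getElement() + " -> " + t.getScore());
        }
        jedis.disconnect();
    }
}

If Redis is only an aggregation layer, the sorted set can be rebuilt at any
time by replaying the counts stored in Cassandra.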

The other is to periodically pivot the counts into a top ten row where you use 
regular integers for the column names. With only 10K users you could do this 
with a process that periodically reads all the user rows (or wherever the 
counters are) and updates the aggregate row. Depending on data size you could use 
Hive/Pig or whatever regular programming language you are happy with.
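For illustration, a rough sketch of such a pivot pass in plain Java (where the
counts come from and how the aggregate row is written back are left out, since
that depends on your client; all names here are made up):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopTenPivot {
    // Given user -> credit count (read from wherever the counters live),
    // return the ten most active users, best first.
    static List<Map.Entry<String, Long>> topTen(Map<String, Long> countsByUser) {
        List<Map.Entry<String, Long>> entries =
                new ArrayList<Map.Entry<String, Long>>(countsByUser.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
            public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
                return b.getValue().compareTo(a.getValue()); // descending by count
            }
        });
        return entries.subList(0, Math.min(10, entries.size()));
    }
    // The result would then be written into the aggregate "top ten" row,
    // e.g. with column names 0..9 and user ids as values.
}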

I guess you could also use Redis to keep the top ten sorted and then 
periodically dump that back to Cassandra and serve the read traffic from there.

Hope that helps 


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 3:46 AM, R. Verlangen wrote:

> I would suggest you create a CF with a single row (or multiple rows for 
> historical data) with a date as key (utf8, e.g. 2011-12-22) and multiple 
> columns for every user's score. The column name (utf8) would then be the score + 
> something unique for the user (e.g. the hex representation of the TimeUUID). The 
> value would be the TimeUUID of the user.
> 
> By default columns will be sorted and you can perform a slice to get the top 
> 10.
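One detail worth noting with this scheme: if the score goes into a utf8 column
name, it needs fixed-width zero padding so that the lexicographic column order
matches the numeric order. A tiny sketch, with an arbitrary padding width and
separator:

import java.util.UUID;

public class ScoreColumnName {
    // Build a column name like "000000001234:<uuid>" so that
    // UTF8-sorted columns come out in score order.
    static String columnName(long score, UUID userUuid) {
        return String.format("%012d", score) + ":" + userUuid.toString();
    }

    public static void main(String[] args) {
        // Random UUID here is just a stand-in for the user's TimeUUID.
        System.out.println(columnName(1234, UUID.randomUUID()));
        // A reversed slice of the last 10 columns of the day's row
        // then gives the top 10 scores.
    }
}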
> 
> 2011/12/14 cbert...@libero.it 
> Hi all,
> I'm using Cassandra in production for a small social network (~10,000 people).
> Now I have to assign some "credits" to each user operation (login, write post,
> and so on) and then be able to provide at any moment the top 10 most active
> users. I'm on Cassandra 0.7.6 and I'd like to migrate to a newer version in
> order to use counters for the user points, but ... what about the top 10?
> I was thinking about a specific row that always keeps the 10 most active users
> ... but I think it would be heavy (to write and to handle in a thread-safe way)
> ... can counters provide something like a "value-ordered list"?
> 
> Thanks for any help.
> Best regards,
> 
> Carlo
> 
> 
> 



Re: Routine nodetool repair

2011-12-23 Thread aaron morton
Next time I will finish my morning coffee first :)

A
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 5:08 AM, Peter Schuller wrote:

>> One other thing to consider is are you creating a few very large rows ? You
>> can check the min, max and average row size using nodetool cfstats.
> 
> Normally I agree, but assuming the two-node cluster has RF 2 it would
> actually not matter ;)
> 
> -- 
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)



Re: Choosing a Partitioner Type for Random java.util.UUID Row Keys

2011-12-23 Thread aaron morton
No problems. 

IMHO you should develop a sizable bruise banging your head against using 
standard CFs and the Random Partitioner before using something else. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 6:29 AM, Bryce Allen wrote:

> Thanks, that definitely has advantages over using a super column. We
> ran into thrift timeouts when the super column got large, and with the
> super column range query there is no way (AFAIK) to batch the request at
> the subcolumn level.
> 
> -Bryce
> 
> On Thu, 22 Dec 2011 10:06:58 +1300
> aaron morton  wrote:
>> AFAIK there are no plans to kill the BOP, but I would still try to make
>> your life easier by using the RP.
>> 
>> My understanding of the problem is that at certain times you snapshot the
>> files in a dir, and the main query you want to handle is "At what
>> points between time t0 and time t1 did files x, y and z exist?".
>> 
>> You could consider:
>> 
>> 1) Partition the time series data across rows, making the row key the
>> timestamp of the start of the partition. If you have rollup partitions,
>> consider making the row key <start_timestamp."partition_size">, e.g.
>> <123456789."1d"> for a 1 day partition that starts at 123456789.
>> 2) In each row, use composite column names of the form <timestamp : file_name>,
>> where the timestamp is the time of the snapshot.
>> 
>> To query between two times (t0 and t1):
>> 
>> 1) Determine which partitions the time span covers; this gives you a list of
>> rows. 2) Execute a multi-get slice for all those rows using <t0 : *> and
>> <t1 : *> as the start and finish of the slice. (I'm using * here as a null;
>> check with your client to see how to use composite columns.)
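For illustration, a rough sketch of step 1 above - working out which partition
rows a query span covers - assuming Unix timestamps in seconds and 1 day
partitions with row keys of the form <start_timestamp."1d"> as in the example;
everything else here is made up:

import java.util.ArrayList;
import java.util.List;

public class PartitionKeys {
    private static final long PARTITION_SECONDS = 86400; // 1 day partitions

    // Row keys of every 1-day partition that overlaps [t0, t1].
    static List<String> partitionRowKeys(long t0, long t1) {
        List<String> keys = new ArrayList<String>();
        long start = (t0 / PARTITION_SECONDS) * PARTITION_SECONDS;
        for (long ts = start; ts <= t1; ts += PARTITION_SECONDS) {
            keys.add(ts + ".\"1d\"");  // key format from the example: <start_timestamp."1d">
        }
        return keys;
    }

    public static void main(String[] args) {
        // All partition rows touched by a 60-hour span.
        System.out.println(partitionRowKeys(123456789L, 123456789L + 216000L));
    }
}

The resulting key list is what you would feed to the multi-get slice in step 2.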
>> 
>> Hope that helps. 
>> Aaron
>> 
>> 
>> -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 21/12/2011, at 9:03 AM, Bryce Allen wrote:
>> 
>>> I wasn't aware of CompositeColumns, thanks for the tip. However I
>>> think it still doesn't allow me to do the query I need - basically
>>> I need to do a timestamp range query, limiting only to certain file
>>> names at each timestamp. With BOP and a separate row for each
>>> timestamp, prefixed by a random UUID, and file names as column
>>> names, I can do this query. With CompositeColumns, I can only query
>>> one contiguous range, so I'd have to know the timestamps beforehand
>>> to limit the file names. I can resolve this using indexes, but
>>> on paper it looks like this would be significantly slower (it would
>>> take me 5 round trips instead of 3 to complete each query, and the
>>> query is made multiple times on every single client request).
>>> 
>>> The two down sides I've seen listed for BOP are balancing issues and
>>> hotspots. I can understand why RP is recommended, from the balancing
>>> issues alone. However these aren't problems for my application. Is
>>> there anything else I am missing? Does the Cassandra team plan on
>>> continuing to support BOP? I haven't completely ruled out RP, but I
>>> like having BOP as an option, it opens up interesting modeling
>>> alternatives that I think have real advantages for some
>>> (if uncommon) applications.
>>> 
>>> Thanks,
>>> Bryce
>>> 
>>> On Wed, 21 Dec 2011 08:08:16 +1300
>>> aaron morton  wrote:
Bryce,
Have you considered using CompositeColumns and a standard
CF? Row key is the UUID, column name is (timestamp : dir_entry); you
can then slice all columns with a particular timestamp.

Even if you have a random key, I would use the RP unless
you have an extreme use case.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 21/12/2011, at 3:06 AM, Bryce Allen wrote:
 
> I think it comes down to how much you benefit from row range
> scans, and how confident you are that going forward all data will
> continue to use random row keys.
> 
> I'm considering using BOP as a way of working around the non-indexed
> super column limitation. In my current schema, row keys
> are random UUIDs, super column names are timestamps, and columns
> contain a snapshot in time of directory contents, and could be
> quite large. If instead I use row keys that are
> (uuid)-(timestamp), and use a standard column family, I can do a
> row range query and select only specific columns. I'm still
> evaluating if I can do this with BOP - ideally the token would
> just use the first 128 bits of the key, and I haven't found any
> documentation on how it compares keys of different length.
> 
> Another trick with BOP is to use MD5(rowkey)-rowkey for data that
> has non-uniform row keys. I think it's reasonable to use if most
> data is uniform and benefits from range scans, but a few things
> are added that aren't/don't. This trick does make the keys larger,
> which increases storage cost and IO load, so it's

RE: Garbage collection freezes cassandra node

2011-12-23 Thread Rene Kochen
Thanks for your quick response!

I am currently running the performance tests with extended gc logging. I will 
post the gc logging if clients time out at the same moment that the full 
garbage collect runs.

Thanks

Rene

-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Tuesday, 20 December 2011 6:36
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

>  During the garbage collections, Cassandra freezes for about ten seconds. I 
> observe the following log entries:
>
>
>
> “GC for ConcurrentMarkSweep: 11597 ms for 1 collections, 1887933144 used; max 
> is 8550678528”

Ok, first off: Are you certain that it is actually pausing, or are you
assuming that due to the log entry above? Because the log entry in no
way indicates a 10 second pause; it only indicates that CMS took 10
seconds - which is entirely expected, and most of CMS is concurrent
and implies only short pauses. A full pause can happen, but that log
entry is expected and is not in and of itself indicative of a
stop-the-world 10 second pause.

It is fully expected using the CMS collector that you'll have a
sawtooth pattern as young gen is being collected, and then a sudden
drop as CMS does its job concurrently without pausing the application
for a long period of time.

I will second the recommendation to run with -XX:+DisableExplicitGC
(or -XX:+ExplicitGCInvokesConcurrent) to eliminate that as a source.

I would also run with -XX:+PrintGC -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and report back the
results (i.e., the GC log around the time of the pause).

Your graph is looking very unusual for CMS. It's possible that
everything is otherwise as it should be and CMS is just kicking in too late,
but I am kind of skeptical of that given the extremely smooth look
of your graph.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Cassandra stress test and max vs. average read/write latency.

2011-12-23 Thread Peter Fales
Peter,

Thanks for your response. I'm looking into some of the ideas in your
other recent mail, but I had another followup question on this one...

Is there any way to control the CPU load when using the "stress" benchmark?
I have some control over that with our home-grown benchmark, but I
thought it made sense to use the official benchmark tool as people might
more readily believe those results and/or be able to reproduce them.  But
offhand, I don't see any way to throttle back the load created by the 
stress test.

On Mon, Dec 19, 2011 at 09:47:32PM -0800, Peter Schuller wrote:
> > I'm trying to understand if this is expected or not, and if there is
> 
> Without careful tuning, outliers around a couple of hundred ms are
> definitely expected in general (not *necessarily*, depending on
> workload) as a result of garbage collection pauses. The impact will be
> worsened a bit if you are running under high CPU load (or even maxing
> it out with stress) because post-pause, if you are close to max CPU
> usage you will take considerably longer to "catch up".
> 
> Personally, I would just log each response time and feed it to gnuplot
> or something. It should be pretty obvious whether or not the latencies
> are due to periodic pauses.
> 
> If you are concerned with eliminating or reducing outliers, I would:
> 
> (1) Make sure that when you're benchmarking, that you're putting
> Cassandra under a reasonable amount of load. Latency benchmarks are
> usually useless if you're benchmarking against a saturated system. At
> least, start by achieving your latency goals at 25% or less CPU usage,
> and then go from there if you want to up it.
> 
> (2) One can affect GC pauses, but it's non-trivial to eliminate the
> problem completely. For example, the length of frequent young-gen
> pauses can typically be decreased by decreasing the size of the young
> generation, leading to more frequent shorter GC pauses. But that
> instead causes more promotion into the old generation, which will
> result in more frequent very long pauses (relative to normal; they
> would still be infrequent relative to young gen pauses) - IF your
> workload is such that you are suffering from fragmentation and
> eventually seeing Cassandra fall back to full compacting GC:s
> (stop-the-world) for the old generation.
> 
> I would start by adjusting young gen so that your frequent pauses are
> at an acceptable level, and then see whether or not you can sustain
> that in terms of old-gen.
> 
> Start with this in any case: Run Cassandra with -XX:+PrintGC
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
> 
> -- 
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: peter.fa...@alcatel-lucent.com
Phone: 630 979 8031


cassandra data to hadoop.

2011-12-23 Thread ravikumar visweswara
Hello All,

I have a situation to dump cassandra data to hadoop cluster for further
analytics. Lot of other relevant data which is not present in cassandra is
already available in hdfs for analysis. Both are independent clusters right
now.
Is there a suggested way to get the data periodically or continuously to
HDFS from cassandra? Any ideas or references will be very helpful for me.

Thanks and Regards
R


Re: cassandra data to hadoop.

2011-12-23 Thread Brian O'Neill
I'm not sure this is much help, but we actually run Hadoop jobs to load and
extract data to and from HDFS.  You can use ColumnFamilyInputFormat to range
over the data in Cassandra and output it to a file.  That doesn't solve the
continuous problem, but it should give you a batch mechanism to refresh the
data in HDFS.  I presume it's even speedier if you are running the enterprise
version, because the Hadoop process is co-located with Cassandra.
-brian
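For reference, a rough sketch of wiring ColumnFamilyInputFormat into a Hadoop
job, loosely based on the word_count example that ships with Cassandra (the
keyspace, column family and host below are placeholders, and the exact
ConfigHelper setter names vary a little between Cassandra versions, so treat
this as an outline):

import java.nio.ByteBuffer;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraToHdfsJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cassandra-export");

        // Read rows straight out of Cassandra.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "cassandra-host");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyColumnFamily");

        // Ask for every column of every row via an open-ended slice predicate.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0),
                        false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        // Mapper/reducer that write the rows out to HDFS go here;
        // see the word_count example for a complete pipeline.

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}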

On Fri, Dec 23, 2011 at 10:12 AM, ravikumar visweswara <
talk2had...@gmail.com> wrote:

> Hello All,
>
> I have a situation to dump cassandra data to hadoop cluster for further
> analytics. Lot of other relevant data which is not present in cassandra is
> already available in hdfs for analysis. Both are independent clusters right
> now.
> Is there a suggested way to get the data periodically or continuously to
> HDFS from cassandra? Any ideas or references will be very helpful for me.
>
> Thanks and Regards
> R
>



-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/


Re: Read TPS in 0.8.6

2011-12-23 Thread Radim Kolar
Cassandra read performance depends on your disk cache (free memory on the 
node not used by Cassandra) and disk IOPS performance. In the ideal case (no 
need to merge sstables) Cassandra needs 2 IOPS per data read if the 
Cassandra key/row caches are not used.


A standard hard drive delivers about 150 IOPS. If you have one hard drive per 
machine, 50 reads per second on ~10 GB CFs with random key access is about 
right.
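As a rough sanity check using those figures: 150 IOPS per drive / 2 IOPS per 
uncached read ≈ 75 reads per second per node, which is the same order of 
magnitude as the ~50 reads/s reported above.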


Don't trust the stress tool too much; it is very different from real-world 
workloads. In the real world you will have hotspots, and those will get cached.


Re: cassandra data to hadoop.

2011-12-23 Thread Jeremy Hanna
We do this all the time.  Take a look at 
http://wiki.apache.org/cassandra/HadoopSupport for some details - you can use 
mapreduce or pig to get data out of cassandra.  If it's going to a separate 
hadoop cluster, I don't think you'd need to co-locate task trackers or data 
nodes on your cassandra nodes - it would just need to copy over the network 
though.  We also use oozie for job scheduling, fwiw.

On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:

> Hello All,
> 
> I have a situation to dump cassandra data to hadoop cluster for further 
> analytics. Lot of other relevant data which is not present in cassandra is 
> already available in hdfs for analysis. Both are independent clusters right 
> now.
> Is there a suggested way to get the data periodically or continuously to HDFS 
> from cassandra? Any ideas or references will be very helpful for me.
> 
> Thanks and Regards
> R



Re: cassandra data to hadoop.

2011-12-23 Thread Praveen Sadhu
Have you tried Brisk?



On Dec 23, 2011, at 9:30 AM, "Jeremy Hanna"  wrote:

> We do this all the time.  Take a look at 
> http://wiki.apache.org/cassandra/HadoopSupport for some details - you can use 
> mapreduce or pig to get data out of cassandra.  If it's going to a separate 
> hadoop cluster, I don't think you'd need to co-locate task trackers or data 
> nodes on your cassandra nodes - it would just need to copy over the network 
> though.  We also use oozie for job scheduling, fwiw.
> 
> On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:
> 
>> Hello All,
>> 
>> I have a situation to dump cassandra data to hadoop cluster for further 
>> analytics. Lot of other relevant data which is not present in cassandra is 
>> already available in hdfs for analysis. Both are independent clusters right 
>> now.
>> Is there a suggested way to get the data periodically or continuously to 
>> HDFS from cassandra? Any ideas or references will be very helpful for me.
>> 
>> Thanks and Regards
>> R
> 
> 


Re: cassandra data to hadoop.

2011-12-23 Thread Jeremy Hanna
We currently have Cassandra nodes co-located with Hadoop nodes and do a lot of 
data analytics with it.  We've looked at Brisk - Brisk is still open source and 
available, but DataStax is putting its resources into a closed version of Brisk as 
part of DataStax Enterprise.  We'll likely be moving to that over the next 
month.  That said, there's nothing wrong with just using vanilla Hadoop with 
Cassandra either - it's worked for us for almost a year now.  Brisk/DataStax 
Enterprise is simpler though, especially for doing a lot with the two together.

On Dec 23, 2011, at 11:33 AM, Praveen Sadhu wrote:

> Have you tried Brisk?
> 
> 
> 
> On Dec 23, 2011, at 9:30 AM, "Jeremy Hanna"  
> wrote:
> 
>> We do this all the time.  Take a look at 
>> http://wiki.apache.org/cassandra/HadoopSupport for some details - you can 
>> use mapreduce or pig to get data out of cassandra.  If it's going to a 
>> separate hadoop cluster, I don't think you'd need to co-locate task trackers 
>> or data nodes on your cassandra nodes - it would just need to copy over the 
>> network though.  We also use oozie for job scheduling, fwiw.
>> 
>> On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:
>> 
>>> Hello All,
>>> 
>>> I have a situation to dump cassandra data to hadoop cluster for further 
>>> analytics. Lot of other relevant data which is not present in cassandra is 
>>> already available in hdfs for analysis. Both are independent clusters right 
>>> now.
>>> Is there a suggested way to get the data periodically or continuously to 
>>> HDFS from cassandra? Any ideas or references will be very helpful for me.
>>> 
>>> Thanks and Regards
>>> R
>> 
>> 



Best way to determine how a Cassandra cluster is doing

2011-12-23 Thread Pierre-Luc Brunet
I just imported a lot of data in a 9 node Cassandra cluster and before I create 
a new ColumnFamily with even more data, I'd like to be able to determine how 
full my cluster currently is (in terms of memory usage). I'm not too sure what 
I need to look at. I don't want to import another 20-30GB of data and realize I 
should have added 5-6 more nodes.

In short, I have no idea if I have too few/many nodes right now for what's in 
the cluster.

Any help would be greatly appreciated :)

$ nodetool -h 192.168.1.87 ring
Address DC  RackStatus State   LoadOwns
Token   
   
151236607520417094872610936636341427313 
192.168.1.87datacenter1 rack1   Up Normal  7.19 GB 11.11%  
0   
192.168.1.86datacenter1 rack1   Up Normal  7.18 GB 11.11%  
18904575940052136859076367079542678414  
192.168.1.88datacenter1 rack1   Up Normal  7.23 GB 11.11%  
37809151880104273718152734159085356828  
192.168.1.84datacenter1 rack1   Up Normal  4.2 GB  11.11%  
56713727820156410577229101238628035242  
192.168.1.85datacenter1 rack1   Up Normal  4.25 GB 11.11%  
75618303760208547436305468318170713656  
192.168.1.82datacenter1 rack1   Up Normal  4.1 GB  11.11%  
94522879700260684295381835397713392071  
192.168.1.89datacenter1 rack1   Up Normal  4.83 GB 11.11%  
113427455640312821154458202477256070485 
192.168.1.51datacenter1 rack1   Up Normal  2.24 GB 11.11%  
132332031580364958013534569556798748899 
192.168.1.25datacenter1 rack1   Up Normal  3.06 GB 11.11%  
151236607520417094872610936636341427313
-

# nodetool -h 192.168.1.87 cfstats
  Keyspace: stats
  Read Count: 232
  Read Latency: 39.191931034482764 ms.
  Write Count: 160678758
  Write Latency: 0.0492021849459404 ms.
  Pending Tasks: 0
Column Family: DailyStats
SSTable count: 5267
Space used (live): 7710048931
Space used (total): 7710048931
Number of Keys (estimate): 10701952
Memtable Columns Count: 4401
Memtable Data Size: 23384563
Memtable Switch Count: 14368
Read Count: 232
Read Latency: 29.047 ms.
Write Count: 160678813
Write Latency: 0.053 ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0.0
Bloom Filter Space Used: 115533264
Key cache capacity: 20
Key cache size: 1894
Key cache hit rate: 0.627906976744186
Row cache: disabled
Compacted row minimum size: 216
Compacted row maximum size: 42510
Compacted row mean size: 3453
-

[default@stats] describe;
Keyspace: stats:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
Options: [replication_factor:3]
  Column Families:
ColumnFamily: DailyStats (Super)
  Key Validation Class: org.apache.cassandra.db.marshal.BytesType
  Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
  Columns sorted by: 
org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period in seconds / keys to save : 0.0/0/all
  Row Cache Provider: 
org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
  Key cache size / save period in seconds: 20.0/14400
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Replicate on write: true
  Built indexes: []
  Column Metadata:
   (removed)
  Compaction Strategy: 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy
  Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor




Re: Best way to determine how a Cassandra cluster is doing

2011-12-23 Thread Jeremy Hanna
One way to get a good bird's eye view of the cluster would be to install 
DataStax Opscenter - the community edition is free.  You can do a lot of checks 
from a web interface that are based on the jmx hooks that are in Cassandra.  We 
use it and it's helped us a lot.  Hope it helps for what you're trying to 
determine: http://www.datastax.com/products/opscenter
 
On Dec 23, 2011, at 5:34 PM, Pierre-Luc Brunet wrote:

> I just imported a lot of data in a 9 node Cassandra cluster and before I 
> create a new ColumnFamily with even more data, I'd like to be able to 
> determine how full my cluster currently is (in terms of memory usage). I'm 
> not too sure what I need to look at. I don't want to import another 20-30GB 
> of data and realize I should have added 5-6 more nodes.
> 
> In short, I have no idea if I have too few/many nodes right now for what's in 
> the cluster.
> 
> Any help would be greatly appreciated :)
> 
> $ nodetool -h 192.168.1.87 ring
> Address DC  RackStatus State   LoadOwns   
>  Token   
>   
> 151236607520417094872610936636341427313 
> 192.168.1.87datacenter1 rack1   Up Normal  7.19 GB 11.11% 
>  0   
> 192.168.1.86datacenter1 rack1   Up Normal  7.18 GB 11.11% 
>  18904575940052136859076367079542678414  
> 192.168.1.88datacenter1 rack1   Up Normal  7.23 GB 11.11% 
>  37809151880104273718152734159085356828  
> 192.168.1.84datacenter1 rack1   Up Normal  4.2 GB  11.11% 
>  56713727820156410577229101238628035242  
> 192.168.1.85datacenter1 rack1   Up Normal  4.25 GB 11.11% 
>  75618303760208547436305468318170713656  
> 192.168.1.82datacenter1 rack1   Up Normal  4.1 GB  11.11% 
>  94522879700260684295381835397713392071  
> 192.168.1.89datacenter1 rack1   Up Normal  4.83 GB 11.11% 
>  113427455640312821154458202477256070485 
> 192.168.1.51datacenter1 rack1   Up Normal  2.24 GB 11.11% 
>  132332031580364958013534569556798748899 
> 192.168.1.25datacenter1 rack1   Up Normal  3.06 GB 11.11% 
>  151236607520417094872610936636341427313
> -
> 
> # nodetool -h 192.168.1.87 cfstats
>  Keyspace: stats
>  Read Count: 232
>  Read Latency: 39.191931034482764 ms.
>  Write Count: 160678758
>  Write Latency: 0.0492021849459404 ms.
>  Pending Tasks: 0
>Column Family: DailyStats
>SSTable count: 5267
>Space used (live): 7710048931
>Space used (total): 7710048931
>Number of Keys (estimate): 10701952
>Memtable Columns Count: 4401
>Memtable Data Size: 23384563
>Memtable Switch Count: 14368
>Read Count: 232
>Read Latency: 29.047 ms.
>Write Count: 160678813
>Write Latency: 0.053 ms.
>Pending Tasks: 0
>Bloom Filter False Postives: 0
>Bloom Filter False Ratio: 0.0
>Bloom Filter Space Used: 115533264
>Key cache capacity: 20
>Key cache size: 1894
>Key cache hit rate: 0.627906976744186
>Row cache: disabled
>Compacted row minimum size: 216
>Compacted row maximum size: 42510
>Compacted row mean size: 3453
> -
> 
> [default@stats] describe;
> Keyspace: stats:
>  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
>  Durable Writes: true
>Options: [replication_factor:3]
>  Column Families:
>ColumnFamily: DailyStats (Super)
>  Key Validation Class: org.apache.cassandra.db.marshal.BytesType
>  Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>  Columns sorted by: 
> org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
>  Row cache size / save period in seconds / keys to save : 0.0/0/all
>  Row Cache Provider: 
> org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
>  Key cache size / save period in seconds: 20.0/14400
>  GC grace seconds: 864000
>  Compaction min/max thresholds: 4/32
>  Read repair chance: 1.0
>  Replicate on write: true
>  Built indexes: []
>  Column Metadata:
>   (removed)
>  Compaction Strategy: 
> org.apache.cassandra.db.compaction.LeveledCompactionStrategy
>  Compression Options:
>sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
> 
> 



Re: Best way to determine how a Cassandra cluster is doing

2011-12-23 Thread Pierre-Luc Brunet
I was under the impression that Opscenter was only compatible with the DataStax 
version of Cassandra. 

I'll give that a shot :)

Thank you.

On 2011-12-23, at 6:49 PM, Jeremy Hanna wrote:

> One way to get a good bird's eye view of the cluster would be to install 
> DataStax Opscenter - the community edition is free.  You can do a lot of 
> checks from a web interface that are based on the jmx hooks that are in 
> Cassandra.  We use it and it's helped us a lot.  Hope it helps for what 
> you're trying to determine: http://www.datastax.com/products/opscenter
> 




Re: cassandra data to hadoop.

2011-12-23 Thread ravikumar visweswara
Jeremy,

We use the Cloudera distribution for our Hadoop cluster, and it may not be possible
to migrate to Brisk quickly because of Flume/Hue dependencies. Did you
successfully pull data from an independent Cassandra cluster and dump it into
a completely disconnected Hadoop cluster? It would be really helpful if you
could elaborate on how to achieve this.

-R

On Fri, Dec 23, 2011 at 9:28 AM, Jeremy Hanna wrote:

> We do this all the time.  Take a look at
> http://wiki.apache.org/cassandra/HadoopSupport for some details - you can
> use mapreduce or pig to get data out of cassandra.  If it's going to a
> separate hadoop cluster, I don't think you'd need to co-locate task
> trackers or data nodes on your cassandra nodes - it would just need to copy
> over the network though.  We also use oozie for job scheduling, fwiw.
>
> On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:
>
> > Hello All,
> >
> > I have a situation to dump cassandra data to hadoop cluster for further
> analytics. Lot of other relevant data which is not present in cassandra is
> already available in hdfs for analysis. Both are independent clusters right
> now.
> > Is there a suggested way to get the data periodically or continuously to
> HDFS from cassandra? Any ideas or references will be very helpful for me.
> >
> > Thanks and Regards
> > R
>
>