Read TPS in 0.8.6
Hi, I'm doing a stress test with tools/stress (the Java version). I used three EC2 XLarge instances, each with 4 instance-store volumes in RAID, for the cluster. I get a write TPS of min 4000 & up to 1, but only about 50 TPS for reads. Is this right? What am I doing wrong? These are the options:

java -jar stress.jar --nodes=ip1,ip2,ip3 --consistency-level=QUORUM --progress-interval=10 --operation=INSERT --num-keys=1000 --threads=50 --family-type=Standard --columns=1 --column-size=400

java -jar stress.jar --nodes=ip1,ip2,ip3 --consistency-level=QUORUM --progress-interval=10 --operation=READ --num-keys=1000 --threads=50 --family-type=Standard --columns=1 --column-size=400

I also ran the same stress test against 1.0.6 and got similar write and read TPS. Any insight?
Re: Counters and Top 10
Counters only update the value of a column; they cannot be used as column names. So you cannot have a dynamically updating top-ten list using counters alone.

You have a couple of options. First, use something like Redis if that fits your use case. Redis could either be the database of record for the counts, or just an aggregation layer: write the data to Cassandra and to sorted sets in Redis, read the top ten from Redis, and use Cassandra to rebuild Redis if needed.

The other option is to periodically pivot the counts into a top-ten row where you use regular integers for the column names. With only 10K users you could do this with a process that periodically reads all the user rows (or wherever the counters are) and updates the aggregate row. Depending on data size you could use Hive/Pig or whatever regular programming language you are happy with.

I guess you could also use Redis to keep the top ten sorted, periodically dump that back to Cassandra, and serve the read traffic from there.

Hope that helps.

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 3:46 AM, R. Verlangen wrote:

> I would suggest you to create a CF with a single row (or multiple for historical data) with a date as key (utf8, e.g. 2011-12-22) and multiple columns for every user's score. The column (utf8) would then be the score + something unique to the user (e.g. the hex representation of the TimeUUID). The value would be the TimeUUID of the user.
>
> By default columns will be sorted and you can perform a slice to get the top 10.
>
> 2011/12/14 cbert...@libero.it
> Hi all,
> I'm using Cassandra in production for a small social network (~10,000 people). Now I have to assign some "credits" to each user operation (login, write post and so on) and then be capable of providing at any moment the top 10 most active users. I'm on Cassandra 0.7.6 and I'd like to migrate to a new version in order to use Counters for the user points, but ... what about the top 10?
> I was thinking about a specific row that always keeps the 10 most active users ... but I think it would be heavy (to write and to handle in a thread-safe way) ... can counters provide something like a "value-ordered list"?
>
> Thanks for any help.
> Best regards,
>
> Carlo
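For what it's worth, the Redis "aggregation layer" idea described above might look roughly like the sketch below. This is only an illustration, not anything from the thread: the key name "user_scores" is made up, and it assumes the redis-py client (3.x argument order for zincrby). Cassandra stays the system of record and is used to rebuild the sorted set if Redis is ever lost.

    # Rough sketch of the Redis aggregation-layer approach (assumes redis-py >= 3.0).
    # Cassandra remains the database of record; Redis only keeps a sorted set for the top N.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def record_action(user_id, points):
        # 1) increment the counter in Cassandra (client-specific, not shown)
        # 2) mirror the increment into a Redis sorted set
        r.zincrby("user_scores", points, user_id)   # key name is illustrative

    def top_ten():
        # Highest scores first; returns [(member, score), ...]
        return r.zrevrange("user_scores", 0, 9, withscores=True)

    # If Redis is lost, rebuild "user_scores" by re-reading the counters from Cassandra.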
Re: Routine nodetool repair
Next time I will finish my morning coffee first :)

A

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 5:08 AM, Peter Schuller wrote:

>> One other thing to consider is: are you creating a few very large rows? You can check the min, max and average row size using nodetool cfstats.
>
> Normally I agree, but assuming the two-node cluster has RF 2 it would actually not matter ;)
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Choosing a Partitioner Type for Random java.util.UUID Row Keys
No problems. IMHO you should develop a sizable bruise banging your head against using Standard CFs and the RandomPartitioner before using something else.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/12/2011, at 6:29 AM, Bryce Allen wrote:

> Thanks, that definitely has advantages over using a super column. We ran into Thrift timeouts when the super column got large, and with the super column range query there is no way (AFAIK) to batch the request at the subcolumn level.
>
> -Bryce
>
> On Thu, 22 Dec 2011 10:06:58 +1300 aaron morton wrote:
>> AFAIK there are no plans to kill the BOP, but I would still try to make your life easier by using the RP.
>>
>> My understanding of the problem is that at certain times you snapshot the files in a dir, and the main query you want to handle is "At what points between time t0 and time t1 did files x, y and z exist?".
>>
>> You could consider:
>>
>> 1) Partitioning the time series data across rows, making the row key the timestamp for the start of the partition. If you have rollup partitions, consider making the row key <timestamp, partition_size>, e.g. <123456789."1d"> for a 1 day partition that starts at 123456789.
>> 2) In each row, use column names of the form <timestamp : file_name>, where the timestamp is the time of the snapshot.
>>
>> To query between two times (t0 and t1):
>>
>> 1) Determine which partitions the time span covers; this will give you a list of rows.
>> 2) Execute a multi-get slice for all the rows using <t0 : *> and <t1 : *>. (I'm using * here as a null; check with your client to see how to use composite columns.)
>>
>> Hope that helps.
>> Aaron
>>
>> -
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 21/12/2011, at 9:03 AM, Bryce Allen wrote:
>>
>>> I wasn't aware of CompositeColumns, thanks for the tip. However, I think it still doesn't allow me to do the query I need - basically I need to do a timestamp range query, limiting only to certain file names at each timestamp. With BOP and a separate row for each timestamp, prefixed by a random UUID, and file names as column names, I can do this query. With CompositeColumns, I can only query one contiguous range, so I'd have to know the timestamps beforehand to limit the file names. I can resolve this using indexes, but on paper it looks like this would be significantly slower (it would take me 5 round trips instead of 3 to complete each query, and the query is made multiple times on every single client request).
>>>
>>> The two downsides I've seen listed for BOP are balancing issues and hotspots. I can understand why RP is recommended, from the balancing issues alone. However these aren't problems for my application. Is there anything else I am missing? Does the Cassandra team plan on continuing to support BOP? I haven't completely ruled out RP, but I like having BOP as an option; it opens up interesting modeling alternatives that I think have real advantages for some (if uncommon) applications.
>>>
>>> Thanks,
>>> Bryce
>>>
>>> On Wed, 21 Dec 2011 08:08:16 +1300 aaron morton wrote:
>>>> Bryce,
>>>> Have you considered using CompositeColumns and a standard CF? Row key is the UUID, column name is (timestamp : dir_entry); you can then slice all columns with a particular timestamp.
>>>> Even if you have a random key, I would use the RP unless you have an extreme use case.
Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 21/12/2011, at 3:06 AM, Bryce Allen wrote:

> I think it comes down to how much you benefit from row range scans, and how confident you are that going forward all data will continue to use random row keys.
>
> I'm considering using BOP as a way of working around the non-indexed super column limitation. In my current schema, row keys are random UUIDs, super column names are timestamps, and columns contain a snapshot in time of directory contents, and could be quite large. If instead I use row keys that are (uuid)-(timestamp) and use a standard column family, I can do a row range query and select only specific columns. I'm still evaluating whether I can do this with BOP - ideally the token would just use the first 128 bits of the key, and I haven't found any documentation on how it compares keys of different length.
>
> Another trick with BOP is to use MD5(rowkey)-rowkey for data that has non-uniform row keys. I think it's reasonable to use if most data is uniform and benefits from range scans, but a few things are added that aren't/don't. This trick does make the keys larger, which increases storage cost and IO load, so it's
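To make the time-partitioned layout from the reply above concrete, here is a small illustrative sketch (my own, not from the thread) of the first query step: working out which partition row keys a time range touches, assuming 1-day partitions keyed as "<start_timestamp>.1d". The resulting keys would then feed a multiget slice with composite column start/finish of (t0, *) and (t1, *) in whatever client you use.

    # Sketch: compute the row keys covered by a time range, assuming 1-day partitions
    # whose row key is "<partition_start_epoch>.1d" (the key format is illustrative only).
    DAY = 86400

    def partition_keys(t0, t1, partition_size=DAY, suffix="1d"):
        start = (t0 // partition_size) * partition_size
        keys = []
        while start <= t1:
            keys.append("%d.%s" % (start, suffix))
            start += partition_size
        return keys

    # Example: a query spanning ~2.5 days touches 3 partition rows.
    print(partition_keys(1324512000, 1324728000))
    # ['1324512000.1d', '1324598400.1d', '1324684800.1d']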
RE: Garbage collection freezes cassandra node
Thanks for your quick response! I am currently running the performance tests with extended GC logging. I will post the GC log if clients time out at the same moment that the full garbage collection runs.

Thanks,
Rene

-----Original Message-----
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Tuesday, 20 December 2011 6:36
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

> During the garbage collections, Cassandra freezes for about ten seconds. I observe the following log entries:
>
> “GC for ConcurrentMarkSweep: 11597 ms for 1 collections, 1887933144 used; max is 8550678528”

Ok, first off: are you certain that it is actually pausing, or are you assuming that due to the log entry above? Because the log entry in no way indicates a 10 second pause; it only indicates that CMS took 10 seconds - which is entirely expected, and most of CMS is concurrent and implies only short pauses. A full pause can happen, but that log entry is expected and is not in and of itself indicative of a stop-the-world 10 second pause.

It is fully expected using the CMS collector that you'll have a sawtooth pattern as the young gen is being collected, and then a sudden drop as CMS does its job concurrently without pausing the application for a long period of time.

I will second the recommendation to run with -XX:+DisableExplicitGC (or -XX:+ExplicitGCInvokesConcurrent) to eliminate that as a source. I would also run with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and report back the results (i.e., the GC log around the time of the pause).

Your graph looks very unusual for CMS. It's possible that everything is otherwise as it should be and CMS is kicking in too late, but I am kind of skeptical towards that given the extremely smooth look of your graph.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
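Once the suggested GC log is being written, a throwaway script along these lines (my own sketch, not part of Cassandra or its tooling) can pull out the collections whose wall-clock time exceeds some threshold. Note that concurrent CMS phases also report long "real" times without pausing the application, so the matching lines still need to be read in context.

    # Sketch: scan a HotSpot GC log (written with -XX:+PrintGCDetails -XX:+PrintGCTimeStamps)
    # for collection events whose wall-clock ("real") time exceeds a threshold.
    import re
    import sys

    PAUSE_RE = re.compile(r"real=(\d+\.\d+) secs")

    def long_events(path, threshold_secs=1.0):
        with open(path) as f:
            for line in f:
                m = PAUSE_RE.search(line)
                if m and float(m.group(1)) >= threshold_secs:
                    yield line.rstrip()

    if __name__ == "__main__":
        for event in long_events(sys.argv[1], threshold_secs=1.0):
            print(event)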
Re: Cassandra stress test and max vs. average read/write latency.
Peter,

Thanks for your response. I'm looking into some of the ideas in your other recent mail, but I had another follow-up question on this one... Is there any way to control the CPU load when using the "stress" benchmark? I have some control over that with our home-grown benchmark, but I thought it made sense to use the official benchmark tool, as people might more readily believe those results and/or be able to reproduce them. But offhand, I don't see any way to throttle back the load created by the stress test.

On Mon, Dec 19, 2011 at 09:47:32PM -0800, Peter Schuller wrote:
>> I'm trying to understand if this is expected or not, and if there is
>
> Without careful tuning, outliers around a couple of hundred ms are definitely expected in general (not *necessarily*, depending on workload) as a result of garbage collection pauses. The impact will be worsened a bit if you are running under high CPU load (or even maxing it out with stress) because post-pause, if you are close to max CPU usage, you will take considerably longer to "catch up".
>
> Personally, I would just log each response time and feed it to gnuplot or something. It should be pretty obvious whether or not the latencies are due to periodic pauses.
>
> If you are concerned with eliminating or reducing outliers, I would:
>
> (1) Make sure that when you're benchmarking, you're putting Cassandra under a reasonable amount of load. Latency benchmarks are usually useless if you're benchmarking against a saturated system. At least, start by achieving your latency goals at 25% or less CPU usage, and then go from there if you want to up it.
>
> (2) One can affect GC pauses, but it's non-trivial to eliminate the problem completely. For example, the length of frequent young-gen pauses can typically be decreased by decreasing the size of the young generation, leading to more frequent, shorter GC pauses. But that instead causes more promotion into the old generation, which will result in more frequent very long pauses (relative to normal; they would still be infrequent relative to young-gen pauses) - IF your workload is such that you are suffering from fragmentation and eventually seeing Cassandra fall back to full compacting GCs (stop-the-world) for the old generation.
>
> I would start by adjusting the young gen so that your frequent pauses are at an acceptable level, and then see whether or not you can sustain that in terms of the old gen.
>
> Start with this in any case: run Cassandra with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

--
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane, Room: 9H-505
Naperville, IL 60566-7033
Email: peter.fa...@alcatel-lucent.com
Phone: 630 979 8031
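Peter's suggestion to log each response time and plot it needs only a few lines around whichever client call the benchmark issues. The sketch below is entirely illustrative - do_read() is a placeholder for your own request - and writes one "index latency_ms" sample per line so the file can be fed straight to gnuplot (e.g. plot 'latencies.dat' using 1:2), then prints a rough percentile summary.

    # Sketch: record per-request latencies to a file for gnuplot and print rough percentiles.
    # do_read() is a placeholder for whatever request your benchmark issues.
    import time

    def run(n, do_read, out_path="latencies.dat"):
        samples = []
        with open(out_path, "w") as out:
            for i in range(n):
                start = time.time()
                do_read()
                elapsed_ms = (time.time() - start) * 1000.0
                samples.append(elapsed_ms)
                out.write("%d %.3f\n" % (i, elapsed_ms))   # "request_index latency_ms"
        samples.sort()
        for p in (50, 95, 99):
            print("p%d: %.1f ms" % (p, samples[int(len(samples) * p / 100.0) - 1]))
        print("max: %.1f ms" % samples[-1])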
cassandra data to hadoop.
Hello All,

I have a situation where I need to dump Cassandra data to a Hadoop cluster for further analytics. A lot of other relevant data which is not present in Cassandra is already available in HDFS for analysis. Both are independent clusters right now.

Is there a suggested way to get the data periodically or continuously to HDFS from Cassandra? Any ideas or references will be very helpful for me.

Thanks and Regards
R
Re: cassandra data to hadoop.
I'm not sure this is much help, but we actually run Hadoop jobs to load and extract data to and from HDFS. You can use ColumnFamilyInputFormat to race over the data in Cassandra and output it to a file. That doesn't solve the continuous problem, but it should give you a batch mechanism to refresh the data in HDFS. I presume it's even speedier if you are running Enterprise, because the Hadoop process is collocated with Cassandra.

-brian

On Fri, Dec 23, 2011 at 10:12 AM, ravikumar visweswara <talk2had...@gmail.com> wrote:

> Hello All,
>
> I have a situation to dump cassandra data to hadoop cluster for further analytics. Lot of other relevant data which is not present in cassandra is already available in hdfs for analysis. Both are independent clusters right now.
> Is there a suggested way to get the data periodically or continuously to HDFS from cassandra? Any ideas or references will be very helpful for me.
>
> Thanks and Regards
> R

--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
Re: Read TPS in 0.8.6
Cassandra read performance depends on your disk cache (free memory on the node not used by Cassandra) and disk IOPS performance. In the ideal case (no need to merge sstables), Cassandra needs 2 IOPS per data read if the Cassandra key/row caches are not used. A standard hard drive has about 150 IOPS. If you have one hard drive per machine, 50 reads per second on ~10 GB CFs with random key access is about right.

Don't trust the stress tool too much; it is way different from real-world workloads. In the real world you will have hotspots, and those will get cached.
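As a back-of-envelope illustration of the arithmetic above (the numbers are just the rule-of-thumb figures from this reply, not measurements):

    # ~150 IOPS per commodity drive, ~2 seeks per uncached read
    # (no key/row cache hit, no sstable merge needed).
    iops_per_drive = 150.0
    seeks_per_read = 2
    print(iops_per_drive / seeks_per_read)   # ~75 uncached reads/sec per drive

That is in the same ballpark as the ~50 reads per second reported in the original post.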
Re: cassandra data to hadoop.
We do this all the time. Take a look at http://wiki.apache.org/cassandra/HadoopSupport for some details - you can use mapreduce or pig to get data out of cassandra. If it's going to a separate hadoop cluster, I don't think you'd need to co-locate task trackers or data nodes on your cassandra nodes - it would just need to copy over the network though. We also use oozie for job scheduling, fwiw.

On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:

> Hello All,
>
> I have a situation to dump cassandra data to hadoop cluster for further analytics. Lot of other relevant data which is not present in cassandra is already available in hdfs for analysis. Both are independent clusters right now.
> Is there a suggested way to get the data periodically or continuously to HDFS from cassandra? Any ideas or references will be very helpful for me.
>
> Thanks and Regards
> R
Re: cassandra data to hadoop.
Have you tried Brisk?

On Dec 23, 2011, at 9:30 AM, "Jeremy Hanna" wrote:

> We do this all the time. Take a look at http://wiki.apache.org/cassandra/HadoopSupport for some details - you can use mapreduce or pig to get data out of cassandra. If it's going to a separate hadoop cluster, I don't think you'd need to co-locate task trackers or data nodes on your cassandra nodes - it would just need to copy over the network though. We also use oozie for job scheduling, fwiw.
>
> On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:
>
>> Hello All,
>>
>> I have a situation to dump cassandra data to hadoop cluster for further analytics. Lot of other relevant data which is not present in cassandra is already available in hdfs for analysis. Both are independent clusters right now.
>> Is there a suggested way to get the data periodically or continuously to HDFS from cassandra? Any ideas or references will be very helpful for me.
>>
>> Thanks and Regards
>> R
Re: cassandra data to hadoop.
We currently have Cassandra nodes co-located with Hadoop nodes and do a lot of data analytics with it. We've looked at Brisk - Brisk is still open source and available, but DataStax is putting its resources into a closed version of Brisk as part of DataStax Enterprise. We'll likely be moving to that over the next month. That said, there's nothing wrong with just using vanilla Hadoop with Cassandra either - it's worked for us for almost a year now. Brisk/DataStax Enterprise is simpler though, especially for doing a lot with the two together.

On Dec 23, 2011, at 11:33 AM, Praveen Sadhu wrote:

> Have you tried Brisk?
>
> On Dec 23, 2011, at 9:30 AM, "Jeremy Hanna" wrote:
>
>> We do this all the time. Take a look at http://wiki.apache.org/cassandra/HadoopSupport for some details - you can use mapreduce or pig to get data out of cassandra. If it's going to a separate hadoop cluster, I don't think you'd need to co-locate task trackers or data nodes on your cassandra nodes - it would just need to copy over the network though. We also use oozie for job scheduling, fwiw.
>>
>> On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:
>>
>>> Hello All,
>>>
>>> I have a situation to dump cassandra data to hadoop cluster for further analytics. Lot of other relevant data which is not present in cassandra is already available in hdfs for analysis. Both are independent clusters right now.
>>> Is there a suggested way to get the data periodically or continuously to HDFS from cassandra? Any ideas or references will be very helpful for me.
>>>
>>> Thanks and Regards
>>> R
Best way to determine how a Cassandra cluster is doing
I just imported a lot of data into a 9-node Cassandra cluster and before I create a new ColumnFamily with even more data, I'd like to be able to determine how full my cluster currently is (in terms of memory usage). I'm not too sure what I need to look at. I don't want to import another 20-30GB of data and realize I should have added 5-6 more nodes.

In short, I have no idea if I have too few/many nodes right now for what's in the cluster.

Any help would be greatly appreciated :)

$ nodetool -h 192.168.1.87 ring
Address         DC           Rack    Status  State    Load      Owns     Token
                                                                         151236607520417094872610936636341427313
192.168.1.87    datacenter1  rack1   Up      Normal   7.19 GB   11.11%   0
192.168.1.86    datacenter1  rack1   Up      Normal   7.18 GB   11.11%   18904575940052136859076367079542678414
192.168.1.88    datacenter1  rack1   Up      Normal   7.23 GB   11.11%   37809151880104273718152734159085356828
192.168.1.84    datacenter1  rack1   Up      Normal   4.2 GB    11.11%   56713727820156410577229101238628035242
192.168.1.85    datacenter1  rack1   Up      Normal   4.25 GB   11.11%   75618303760208547436305468318170713656
192.168.1.82    datacenter1  rack1   Up      Normal   4.1 GB    11.11%   94522879700260684295381835397713392071
192.168.1.89    datacenter1  rack1   Up      Normal   4.83 GB   11.11%   113427455640312821154458202477256070485
192.168.1.51    datacenter1  rack1   Up      Normal   2.24 GB   11.11%   132332031580364958013534569556798748899
192.168.1.25    datacenter1  rack1   Up      Normal   3.06 GB   11.11%   151236607520417094872610936636341427313
-

# nodetool -h 192.168.1.87 cfstats
Keyspace: stats
  Read Count: 232
  Read Latency: 39.191931034482764 ms.
  Write Count: 160678758
  Write Latency: 0.0492021849459404 ms.
  Pending Tasks: 0
    Column Family: DailyStats
    SSTable count: 5267
    Space used (live): 7710048931
    Space used (total): 7710048931
    Number of Keys (estimate): 10701952
    Memtable Columns Count: 4401
    Memtable Data Size: 23384563
    Memtable Switch Count: 14368
    Read Count: 232
    Read Latency: 29.047 ms.
    Write Count: 160678813
    Write Latency: 0.053 ms.
    Pending Tasks: 0
    Bloom Filter False Postives: 0
    Bloom Filter False Ratio: 0.0
    Bloom Filter Space Used: 115533264
    Key cache capacity: 20
    Key cache size: 1894
    Key cache hit rate: 0.627906976744186
    Row cache: disabled
    Compacted row minimum size: 216
    Compacted row maximum size: 42510
    Compacted row mean size: 3453
-

[default@stats] describe;
Keyspace: stats:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
    Options: [replication_factor:3]
  Column Families:
    ColumnFamily: DailyStats (Super)
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds / keys to save : 0.0/0/all
      Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
      Key cache size / save period in seconds: 20.0/14400
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Built indexes: []
      Column Metadata:
        (removed)
      Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
      Compression Options:
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
Re: Best way to determine how a Cassandra cluster is doing
One way to get a good bird's eye view of the cluster would be to install DataStax Opscenter - the community edition is free. You can do a lot of checks from a web interface that are based on the jmx hooks that are in Cassandra. We use it and it's helped us a lot. Hope it helps for what you're trying to determine: http://www.datastax.com/products/opscenter

On Dec 23, 2011, at 5:34 PM, Pierre-Luc Brunet wrote:

> I just imported a lot of data in a 9 node Cassandra cluster and before I create a new ColumnFamily with even more data, I'd like to be able to determine how full my cluster currently is (in terms of memory usage). I'm not too sure what I need to look at. I don't want to import another 20-30GB of data and realize I should have added 5-6 more nodes.
>
> In short, I have no idea if I have too few/many nodes right now for what's in the cluster.
>
> Any help would be greatly appreciated :)
>
> $ nodetool -h 192.168.1.87 ring
> Address         DC           Rack    Status  State    Load      Owns     Token
>                                                                          151236607520417094872610936636341427313
> 192.168.1.87    datacenter1  rack1   Up      Normal   7.19 GB   11.11%   0
> 192.168.1.86    datacenter1  rack1   Up      Normal   7.18 GB   11.11%   18904575940052136859076367079542678414
> 192.168.1.88    datacenter1  rack1   Up      Normal   7.23 GB   11.11%   37809151880104273718152734159085356828
> 192.168.1.84    datacenter1  rack1   Up      Normal   4.2 GB    11.11%   56713727820156410577229101238628035242
> 192.168.1.85    datacenter1  rack1   Up      Normal   4.25 GB   11.11%   75618303760208547436305468318170713656
> 192.168.1.82    datacenter1  rack1   Up      Normal   4.1 GB    11.11%   94522879700260684295381835397713392071
> 192.168.1.89    datacenter1  rack1   Up      Normal   4.83 GB   11.11%   113427455640312821154458202477256070485
> 192.168.1.51    datacenter1  rack1   Up      Normal   2.24 GB   11.11%   132332031580364958013534569556798748899
> 192.168.1.25    datacenter1  rack1   Up      Normal   3.06 GB   11.11%   151236607520417094872610936636341427313
> -
>
> # nodetool -h 192.168.1.87 cfstats
> Keyspace: stats
>   Read Count: 232
>   Read Latency: 39.191931034482764 ms.
>   Write Count: 160678758
>   Write Latency: 0.0492021849459404 ms.
>   Pending Tasks: 0
>     Column Family: DailyStats
>     SSTable count: 5267
>     Space used (live): 7710048931
>     Space used (total): 7710048931
>     Number of Keys (estimate): 10701952
>     Memtable Columns Count: 4401
>     Memtable Data Size: 23384563
>     Memtable Switch Count: 14368
>     Read Count: 232
>     Read Latency: 29.047 ms.
>     Write Count: 160678813
>     Write Latency: 0.053 ms.
>     Pending Tasks: 0
>     Bloom Filter False Postives: 0
>     Bloom Filter False Ratio: 0.0
>     Bloom Filter Space Used: 115533264
>     Key cache capacity: 20
>     Key cache size: 1894
>     Key cache hit rate: 0.627906976744186
>     Row cache: disabled
>     Compacted row minimum size: 216
>     Compacted row maximum size: 42510
>     Compacted row mean size: 3453
> -
>
> [default@stats] describe;
> Keyspace: stats:
>   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
>   Durable Writes: true
>     Options: [replication_factor:3]
>   Column Families:
>     ColumnFamily: DailyStats (Super)
>       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
>       Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
>       Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
>       Row cache size / save period in seconds / keys to save : 0.0/0/all
>       Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
>       Key cache size / save period in seconds: 20.0/14400
>       GC grace seconds: 864000
>       Compaction min/max thresholds: 4/32
>       Read repair chance: 1.0
>       Replicate on write: true
>       Built indexes: []
>       Column Metadata:
>         (removed)
>       Compaction Strategy: org.apache.cassandra.db.compaction.LeveledCompactionStrategy
>       Compression Options:
>         sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
Re: Best way to determine how a Cassandra cluster is doing
I was under the impression that Opscenter was only compatible with the DataStax version of Cassandra. I'll give that a shot :)

Thank you.

On 2011-12-23, at 6:49 PM, Jeremy Hanna wrote:

> One way to get a good bird's eye view of the cluster would be to install DataStax Opscenter - the community edition is free. You can do a lot of checks from a web interface that are based on the jmx hooks that are in Cassandra. We use it and it's helped us a lot. Hope it helps for what you're trying to determine: http://www.datastax.com/products/opscenter
Re: cassandra data to hadoop.
Jeremy,

We use the Cloudera distribution for our Hadoop cluster and may not be able to migrate to Brisk quickly because of Flume/Hue dependencies. Did you successfully pull the data from an independent Cassandra cluster and dump it into a completely disconnected Hadoop cluster? It would be really helpful if you could elaborate on how to achieve this.

-R

On Fri, Dec 23, 2011 at 9:28 AM, Jeremy Hanna wrote:

> We do this all the time. Take a look at http://wiki.apache.org/cassandra/HadoopSupport for some details - you can use mapreduce or pig to get data out of cassandra. If it's going to a separate hadoop cluster, I don't think you'd need to co-locate task trackers or data nodes on your cassandra nodes - it would just need to copy over the network though. We also use oozie for job scheduling, fwiw.
>
> On Dec 23, 2011, at 9:12 AM, ravikumar visweswara wrote:
>
>> Hello All,
>>
>> I have a situation to dump cassandra data to hadoop cluster for further analytics. Lot of other relevant data which is not present in cassandra is already available in hdfs for analysis. Both are independent clusters right now.
>> Is there a suggested way to get the data periodically or continuously to HDFS from cassandra? Any ideas or references will be very helpful for me.
>>
>> Thanks and Regards
>> R