Re: key is sorted?
So you can do any kind of range slice, e.g. keys beginning with "abc"? But the results will not be ordered? Please answer with one of the following:

True True
True False
False False

Explain? Thanks!

On Sun, May 9, 2010 at 8:27 PM, Vijay wrote: > True, The Range slice support was enabled in Random Partitioner for the > hadoop support. > > Random partitioner actually hashes the key, and those hashes are sorted, so we > cannot have the actual keys in order (Hope this doesn't confuse you)... > > Regards, > > > > > On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn wrote: > >> This is something that I'm not sure that I understand. Can somebody >> confirm/deny that I understand it? Thanks. >> >> If you use random partitioning, you can loop through all keys with a range >> query, but they will not be sorted. >> >> True or False? >> >> On Sat, May 8, 2010 at 3:45 AM, AJ Chen wrote: >> >>> thanks, that works. -aj >>> >>> >>> On Fri, May 7, 2010 at 1:17 PM, Stu Hood wrote: >>> Your IPartitioner implementation decides how the row keys are sorted: see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You need to be using one of the OrderPreservingPartitioners if you'd like a reasonable order for the keys. -----Original Message----- From: "AJ Chen" Sent: Friday, May 7, 2010 3:10pm To: user@cassandra.apache.org Subject: key is sorted? I have a super column family for "topic", key being the name of the topic. >>> <ColumnFamily ... CompareSubcolumnsWith="BytesType" /> When I retrieve the rows, the rows are not sorted by the key. Is the row key sorted in cassandra by default? -aj -- AJ Chen, PhD Chair, Semantic Web SIG, sdforum.org http://web2express.org twitter @web2express Palo Alto, CA, USA >>> >>> >>> -- >>> AJ Chen, PhD >>> Chair, Semantic Web SIG, sdforum.org >>> http://web2express.org >>> twitter @web2express >>> Palo Alto, CA, USA >>> >> >> >
Re: Human readable Cassandra limitations
On the scalability and performance side, I found Yahoo's paper about the YCSB project interesting (benchmarking some NoSQL solutions with MySQL). See research.yahoo.com/files/ycsb.pdf. My concern with the denormalization approach is that it shouldn't be managed by the client side, because this has a big impact on your throughput. Is map-reduce any better in that respect? Wouldn't it be nice to support a kind of PL-(No)SQL server-side scripting that allows you to create and maintain materialized views? You might still give it as an option to maintain the view synchronously (an extension of the current row-level atomicity) or asynchronously. Not sure how complicated this support would be... - David On Mon, May 10, 2010 at 10:38 PM, Paul Prescod wrote: > On Mon, May 10, 2010 at 1:23 PM, Peter Hsu wrote: > > Thanks for the response, Paul. > > ... > > > > * Cassandra and its siblings are weak at ad hoc queries on tables > > that you did not think to index in advance > > > > What is the normal way of dealing with this in Cassandra? Would you just > > create a new "index" and bring a big honking machine to the table to process > > all the existing data in the database and store the new "index"? > > The latest version of Cassandra introduces a "map/reduce" paradigm > which is the main tool you'd use for batch processing of data. You > could either use that to DO your ad hoc query or to process the data > into an index for more efficient ad hoc queries in the future. > > * http://en.wikipedia.org/wiki/MapReduce > > * http://en.wikipedia.org/wiki/Hadoop > > * http://architects.dzone.com/news/cassandra-adds-hadoop > > You can read criticisms of MapReduce in the first link there. > > > On May 10, 2010, at 11:22 AM, Paul Prescod wrote: > > > > This is a very, very big topic. For the most part, the issues are > > covered in the various SQL versus NoSQL debates all over the Internet. > > For example: > > > > * Cassandra and its NoSQL siblings have no concept of an in-database "join" > > > > * Cassandra and its NoSQL siblings do not allow you to update > > multiple "tables" in a single transaction > > > > * Cassandra's API is specific to it, and not portable to any other data > > store > > > > * Cassandra currently has simplistic facilities to deal with various > > kinds of conflicting write. > > > > * Cassandra is strongly optimized for multiple machine distributions, > > whereas relational databases tend to be optimized for a single > > powerful machine. > > > > * Cassandra and its siblings are weak at ad hoc queries on tables > > that you did not think to index in advance > > > > On Mon, May 10, 2010 at 11:06 AM, Peter Hsu > wrote: > > > > I've seen a lot of threads and posts about why Cassandra is great. I'm > > fairly sold on the features, and the few big deployments on Cassandra give > > it a lot of credibility. > > > > However, I don't believe in magic bullets, so I really want to understand > > the potential downsides of Cassandra. Right now, I don't really have a clue > > as to what Cassandra is bad at. I took a look at > > http://wiki.apache.org/cassandra/CassandraLimitations which is helpful, but > > doesn't characterize its weaknesses in ways that I can really comprehend > > until I've actually used Cassandra and understand some of the internals. It > > seems that the community would benefit from being able to answer some of > > these questions in terms of real world use cases. 
> > > > My main questions: > > > > * Are there designs in which a SQL database out-performs or out-scales > > Cassandra? > > > > * Is there a pros vs cons page of Cassandra against an open source SQL > > database (MySQL or Postgres)? > > > > I do plan on attending the training session next Friday in Palo Alto, but > > it'd be great if I had some more food for thought before I attend. > > > > > > > > >
Re: replication impact on write throughput
About this linear scaling of throughput (with keys perfectly distributed and requests balanced over all nodes): I would assume that this is not the case for a small number of nodes, because from 2 nodes onwards part of the requests has to be handled by a proxy node in addition to the actual node responsible for the key. Is there any other overhead that comes into play when increasing the number of nodes? Until now we have not been able to reproduce this linear scaling of throughput. We are still figuring out why, but it would be interesting to know what the target should be... -David On Wed, May 12, 2010 at 4:13 AM, Paul Prescod wrote: > On Tue, May 11, 2010 at 5:56 PM, Mark Greene wrote: > > I was under the impression from what I've seen talked about on this list > > (perhaps I'm wrong here) that given the write throughput of one node in a > > cluster (again assuming each node has a given throughput and the same > > config) that you would simply multiply that throughput by the number of > > nodes you had, giving you the total throughput for the entire ring (like > you > > said this is somewhat artificial). The main benefit being that adding > > capacity was as simple as adding more nodes to the ring with no > degradation. > > So by your math, 100 nodes with each node getting 5k wps, I would assume > the > > total capacity is 500k wps. But perhaps I've misunderstood some key > > concepts. Still a novice myself ;-) > > If the replication factor is 2, then everything is written twice. So > your throughput is cut in half. > > Paul Prescod >
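(A back-of-envelope illustration of the proxy-hop effect mentioned above. The model is an assumption, not something stated in this thread: it supposes clients pick a coordinator node uniformly at random, so the coordinator already owns a replica with probability RF/N and roughly 1 - RF/N of requests need an extra network hop.)

    // Rough estimate of how many requests involve a coordinator-to-replica hop,
    // assuming clients spread requests uniformly over all N nodes (an assumption).
    public class ProxyHopSketch {
        public static void main(String[] args) {
            int rf = 1; // replication factor assumed for the example
            for (int n : new int[] {1, 2, 4, 10, 50, 100}) {
                double proxied = (n <= rf) ? 0.0 : 1.0 - (double) rf / n;
                System.out.printf("nodes=%-3d -> ~%.0f%% of requests involve a proxy hop%n",
                        n, proxied * 100);
            }
        }
    }

Under that assumption the extra-hop fraction grows quickly over the first few nodes and then levels off, so linear scaling would only become a good approximation once the ring is reasonably large.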
Re: timeout while running simple hadoop job
A follow-up for anyone who may end up on this conversation again:

I kept trying, and neither changing the number of concurrent map tasks nor the slice size helped. Finally, I found a screw-up in our logging system, which had prevented us from noticing a couple of recurring errors in the logs:

ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328 DebuggableThreadPoolExecutor.java (line 101) Error in ThreadPoolExecutor
java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: corrupt sstable
        at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73)
        at org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907)
        at org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000)
        at org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
        ... 4 more
Caused by: java.io.FileNotFoundException: /path/to/data/Keyspace/CF-123-Index.db (Too many open files)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
        at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:143)
        at org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:138)
        at org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414)
        at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62)
        ... 7 more

and the related

WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190) Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
        at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
        at org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
        at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
Caused by: java.net.SocketException: Too many open files
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
        at java.net.ServerSocket.accept(ServerSocket.java:421)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
        ... 5 more

The client was reporting timeouts in this case. The max fd limit on the process was in fact quite low (1024), and raising it seems to have solved the problem.

Anyway, it still seems that there may be two issues:

- since we had never seen this error before with normal client connections (as in: non-hadoop), is it possible that the Cassandra/hadoop layer is not closing sockets properly between one connection and the other, or not reusing connections efficiently? E.g. TSocket seems to have a close() method, but I don't see it used in ColumnFamilyInputFormat (getSubSplits, getRangeMap); it may well be inside CassandraClient. Anyway, judging by lsof's output I can only see about a hundred TCP connections, and those from the hadoop jobs seem to always be below 60, so this may just be my wrong impression.

- is it possible that such errors show up on the client side as timeout errors when they could be reported better? This would probably help other people in diagnosing/reporting internal errors in the future.

Thanks again to everyone for the help with this; I promise I'll put the discussion on the wiki for future reference :)
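(A side note on the socket-closing point: with a raw Thrift client the transport has to be closed explicitly when you are done with it, otherwise each short-lived client leaks a file descriptor. This is just generic Thrift usage with placeholder host/port, not the actual ColumnFamilyInputFormat fix, which was handled separately in Jira:)

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class ClosingClientSketch {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TSocket("localhost", 9160); // placeholder host/port
            try {
                transport.open();
                Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
                // ... issue get_slice / get_range_slices calls with 'client' here ...
            } finally {
                transport.close(); // release the socket (and its fd) even if a call above throws
            }
        }
    }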
Re: what/how do you guys monitor "slow" nodes?
There is a per-CF read and write latency JMX metric. On May 12, 2010 12:55 AM, "Jordan Pittier - Rezel" wrote: For sure you have to pay particular attention to memory allocation on each node; especially, be sure your servers don't swap. Then you can monitor how load is balanced among your nodes (nodetool -h XX ring). On Tue, May 11, 2010 at 11:46 PM, S Ahmed wrote: > > If you have 3-4 nodes,...
Terminology, getting slices by time
Hi! I am having trouble understanding the "column" terminology Cassandra uses. I am developing in Ruby. I need to store data for vehicles, which will come in at different times, and retrieve data for a specific vehicle for specific slices of time. So each record could look like:

vehicle_id, { time => { 'sensor1' => 1000, 'sensor2' => 2000 } }

The actual insert:

@cas.insert(:ReportData, vehicle_id.to_s, { UUID.new.to_s => { 'sensor1' => 1000, 'sensor' => 2000 }})

The model:

<ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
<ReplicationFactor>1</ReplicationFactor>
<EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>

I am trying to store this with a supercolumn. Can someone tell me:

1. Are "sensor1" and "sensor2" columns?
2. Is "time" a supercolumn?
3. Where is the "subcolumn"?
4. Which item would "CompareWith" refer to?
5. Which item would "CompareSubcolumnsWith" refer to?

Also, about time:

1. To store "time", must I use UUID.new.to_s as above?
2. Isn't a time key redundant? Can Cassandra return slices based on the time it keeps for each record?
3. How do I get a time slice for a vehicle? I get various generic "application" errors when trying to use get_range to get time ranges.
4. Any advice on this data model?
Is it possible to delete records based upon where condition
Hi All, In Cassandra, is it possible to remove records based upon a where condition? We are planning to move the session and cache tables from MySQL to Cassandra and were doing the feasibility study. Everything seems to be OK other than garbage collection of the session table. I was not able to remove super columns from the key based upon certain conditions in the column values. The schema for the session table follows:

FESession = {
    sessions : {
        1001 : {
            ses_name   : "user",
            ses_userid : 258,
            ses_tstamp : "1273504587"
        },
        1002 : {
            ses_name   : "user",
            ses_userid : 259,
            ses_tstamp : "1273504597"
        },
    },
}

I want to remove records based upon the value of the column ses_tstamp, i.e. (delete from sessions where ses_tstamp between XXX and YYY, OR delete from sessions where ses_tstamp < XXX). Is it possible to achieve this in Cassandra? If so, please let me know how. To my knowledge there is no kind of where condition in Cassandra; if that is the case, please also let me know other ways, like restructuring the schema or whatever... Regards, Moses
Re: replication impact on write throughput
>>If the replication factor is 2, then everything is written twice. So >>your throughput is cut in half.

Throughput of new inserts is cut in half, right? I think I was thinking about capacity in more general terms, from the node's perspective. The node has the ability to write so many operations per second, whether that is for a replicated key or not, and you could potentially take that number and multiply it by the number of nodes in the ring to give you your write capacity per second. On Tue, May 11, 2010 at 10:13 PM, Paul Prescod wrote: > On Tue, May 11, 2010 at 5:56 PM, Mark Greene wrote: > > I was under the impression from what I've seen talked about on this list > > (perhaps I'm wrong here) that given the write throughput of one node in a > > cluster (again assuming each node has a given throughput and the same > > config) that you would simply multiply that throughput by the number of > > nodes you had, giving you the total throughput for the entire ring (like > you > > said this is somewhat artificial). The main benefit being that adding > > capacity was as simple as adding more nodes to the ring with no > degradation. > > So by your math, 100 nodes with each node getting 5k wps, I would assume > the > > total capacity is 500k wps. But perhaps I've misunderstood some key > > concepts. Still a novice myself ;-) > > If the replication factor is 2, then everything is written twice. So > your throughput is cut in half. > > Paul Prescod >
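(A quick sketch of the arithmetic in this thread, using the same example figures of 100 nodes at 5k writes/sec each with a replication factor of 2:)

    // Per-node write rate times node count gives the ring's raw operation capacity;
    // dividing by the replication factor gives the client-visible insert throughput.
    public class WriteCapacitySketch {
        public static void main(String[] args) {
            int nodes = 100;                 // nodes in the ring
            int writesPerNodePerSec = 5000;  // measured single-node write throughput
            int replicationFactor = 2;       // every insert is written RF times

            long rawCapacity = (long) nodes * writesPerNodePerSec;  // 500,000 internal writes/s
            long insertRate = rawCapacity / replicationFactor;      // ~250,000 client inserts/s

            System.out.println("Raw ring write capacity: " + rawCapacity + " ops/s");
            System.out.println("Client-visible inserts at RF=" + replicationFactor
                    + ": ~" + insertRate + " inserts/s");
        }
    }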
RE: Extremely slow inserts on LAN
I don't yet understand all the problems you guys are facing. Just wanted to let you know that I'm getting my feet wet with Cassandra. In a few days/weeks I'll be re-reading all your notes again :) I'm bound to provide a production Cassandra environment based on Win 2008 (I know, I know... it's better to go with Cassandra on Windows than no Cassandra!) Great news group.

-----Original Message----- From: Arie Keren [mailto:a...@doubleverify.com] Sent: Monday, May 10, 2010 3:09 AM To: user@cassandra.apache.org Subject: RE: Extremely slow inserts on LAN

That solved the problem! It also helps to increase the buffer size from the default value of 1024: var transport = new TBufferedTransport(new TSocket("192.168.0.123", 9160), 64*1024); Thanks a lot!

-----Original Message----- From: Viktor Jevdokimov [mailto:viktor.jevdoki...@adform.com] Sent: May 10, 2010 8:57 AM To: user@cassandra.apache.org Subject: RE: Extremely slow inserts on LAN

We had similar experience. Problem was with TSocket as transport alone:

var transport = new TSocket("192.168.0.123", 9160);
var protocol = new TBinaryProtocol(transport);
var client = new Cassandra.Client(protocol);

Using TBufferedTransport helped a lot:

var transport = new TBufferedTransport(new TSocket("192.168.0.123", 9160));
var protocol = new TBinaryProtocol(transport);
var client = new Cassandra.Client(protocol);

Viktor

-----Original Message----- From: Arie Keren [mailto:a...@doubleverify.com] Sent: Monday, May 10, 2010 8:51 AM To: user@cassandra.apache.org Subject: RE: Extremely slow inserts on LAN

No - just Windows. So I'm going to do some experiments to isolate the cause: - use java client on windows - use linux server - use java client on linux Thanx

-----Original Message----- From: David Strauss [mailto:da...@fourkitchens.com] Sent: May 09, 2010 5:48 PM To: user@cassandra.apache.org Subject: Re: Extremely slow inserts on LAN

From a naive (not caring about Cassandra internals) basis, the first step is to isolate whether the problem is on the client or server side. Have you tried a Linux-based server or a Linux-based client?

On 2010-05-09 14:06, Arie Keren wrote: > While making our first steps with Cassandra, we experience slow > inserts working on LAN. > > Inserting 7000 keys with 1 column family takes about 10 seconds when > Cassandra server running on the same host with the client. > > But when server runs on a different host on LAN, the same inserts take > more than 10 (!) minutes. > > In both cases Cassandra server contains a single node. > > We use Cassandra version 0.6.0 running on Windows server 2008. > > The client is .NET c# application.

-- David Strauss | da...@fourkitchens.com Four Kitchens | http://fourkitchens.com | +1 512 454 6659 [office] | +1 512 870 8453 [direct]
what is DCQUORUM
Hi, I have read about QUORUM but lately came across DCQUORUM. What is it, and what's the difference between the two?
Re: what is DCQUORUM
QUORUM is a high consistency level. It refers to the number of nodes that have to acknowledge read or write operations in order to be assured that Cassandra is in a consistent state. It uses (replication_factor / 2) + 1 nodes; for example, with a replication factor of 3, a quorum is 2 nodes.

DCQUORUM means "Data Center Quorum", and balances consistency with performance. It puts multiple replicas in each Data Center so operations can prefer replicas in the same DC for lower latency.

See https://issues.apache.org/jira/browse/CASSANDRA-492 for a little discussion.

Also see http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/locator/DatacenterShardStrategy.java and http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/service/DatacenterWriteResponseHandler.java

Eben

On Wed, May 12, 2010 at 6:15 AM, vd wrote: > Hi > > I have read about QUORUM but lately came across DCQUORUM. What is it and > what's the difference between the two? > > -- "In science there are no 'depths'; there is surface everywhere." --Rudolph Carnap
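(The quorum formula above in code form; the replication factors are just example values:)

    // Quorum size for a given replication factor: (RF / 2) + 1, using integer division.
    public class QuorumSketch {
        static int quorum(int replicationFactor) {
            return replicationFactor / 2 + 1;
        }

        public static void main(String[] args) {
            for (int rf : new int[] {1, 2, 3, 5}) {
                System.out.println("RF=" + rf + " -> quorum=" + quorum(rf)); // 1, 2, 2, 3
            }
        }
    }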
Re: what is DCQUORUM
Thanks Eben On Wed, May 12, 2010 at 7:33 PM, Eben Hewitt wrote: > QUORUM is a high consistency level. It refers to the number of nodes that > have to acknowledge read or write operations in order to be assured that > Cassandra is in a consistent state. It uses (replication_factor / 2) + 1. > > DCQUORUM means "Data Center Quorum", and balances consistency with > performance. It puts multiple replicas in each Data Center so operations can > prefer replicas in the same DC for lower latency. > > See https://issues.apache.org/jira/browse/CASSANDRA-492 for a little > discussion. > > Also see > http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/locator/DatacenterShardStrategy.java > and > > http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/service/DatacenterWriteResponseHandler.java > > Eben > > > > On Wed, May 12, 2010 at 6:15 AM, vd wrote: > >> Hi >> >> I have read about QUORUM but lately came across DCQUORUM. What is it and >> whats the difference between the two ? >> >> > > > -- > "In science there are no 'depths'; there is surface everywhere." > --Rudolph Carnap >
Re: Real-time Web Analysis tool using Cassandra. Doubts...
On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati wrote: > - First of all, my first thoughts is to have two CF one for raw client > request (~10 millions++ per day) and other for aggregated metrics in some > defined inteval time like 1min, 5min, 15min... Is this a good approach ? Sure. > - It is a good idea to use a OrderPreservingPartitioner ? To maintain the > order of my requests in the raw data CF ? Or the overhead is too big. The problem with OPP isn't overhead (it is lower-overhead than RP) but the tendency to have hotspots in sequentially-written data. > - Initially the cluster will contain only three nodes, is it a problem (to > few maybe) ? You'll have to do some load testing to see. > - I think the best way to do the aggregation job is through a hadoop > MapReduce job. Right ? Is there any other way to consider ? Map/Reduce is usually better than rolling your own because it parallelizes for you. > - Is really Cassandra suitable for it ? Maybe HBase is better in this case? Nothing here makes me think "Cassandra is a poor choice." -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: (Binary)Memtable flushing
https://issues.apache.org/jira/browse/CASSANDRA-856 On Tue, May 11, 2010 at 3:44 PM, Tobias Jungen wrote: > Yet another BMT question, thought this may apply for regular memtables as > well... > > After doing a batch insert, I accidentally submitted the flush command > twice. To my surprise, the target node's log indicates that it wrote a new > *-Data.db file, and the disk usage went up accordingly. I tested and issued > the flush command a few more times, and after a few more data files I > eventually triggered a compaction, bringing the disk usage back down. The > data appears to continue to stick around in memory, however, as further > flush commands continue to result in new data files. > > Shouldn't flushing a memtable remove it from memory, or is expected behavior > that it sticks around until the node needs to reclaim the memory? Should I > worry about getting out-of-memory errors if I'm doing lots of inserts in > this manner? > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: nodetool drain disables writes?
On Tue, May 11, 2010 at 4:18 PM, Anthony Molinaro wrote: > Hi, > > I thought that 'nodetool drain' was supposed to flush the commit logs > through the system, which it appears to do (verified by running ls in > the commit log directory and seeing no files). > > However, it also appears to disable writes completely (ie, scripts attempting > to write data were frozen, ls of commit log directory showed no files), is > that expected behavior? Yes. The intent is that after the drain your commitlog is empty (e.g. for upgrading to 0.7). If it were to continue to accept writes that would not be the case. If you just want a flush, run `nodetool flush`. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Timed out reads still in queue
This is a slightly different way of describing https://issues.apache.org/jira/browse/CASSANDRA-685 On Tue, May 11, 2010 at 9:01 PM, Jeremy Dunck wrote: > Reddit posted a blog entry about some recent downtime, partially due > to issues with Cassandra. > http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html > > This part surprised me: > " > First, Cassandra has an internal queue of work to do. When it times > out a client (10s by default), it still leaves the operation in the > queue of work to complete (even though the person that asked for the > read is no longer even holding the socket), which given a constant > stream of requests makes the amount of pending work snowball > effectively infinitely (specifically, ROW-READ-STAGE's PENDING > operations grow unbounded). > " > > I've searched Jira for an issue related to this -- it seems like a bug > to have reads in queue when the result is useless (because the reader > is gone). Obviously a 10-second read is not a normal run condition, > but removing stale reads could remove a cause of cascading failure. > > Should I open a ticket, or have I misunderstood something? > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Unexpected TType error on web traffic increase
Sounds like the sort of error you'd see if you were using thread-unsafe Thrift clients on multiple threads. On Tue, May 11, 2010 at 11:23 PM, Waqas Badar wrote: > Dear all, > > We are using Cassandra on website. Whenever website traffic increases, we > got the following error (Python): > > File "/usr/local/lib/python2.6/dist-packages/pycassa/columnfamily.py", line > 199, in multiget > self._rcl(read_consistency_level)) > > File "/usr/local/lib/python2.6/dist-packages/pycassa/connection.py", line > 114, in client_call > return getattr(self._client, attr)(*args, **kwargs) > > File "/usr/local/lib/python2.6/dist-packages/cassandra/Cassandra.py", line > 346, in multiget_slice > return self.recv_multiget_slice() > > File "/usr/local/lib/python2.6/dist-packages/cassandra/Cassandra.py", line > 368, in recv_multiget_slice > result.read(self._iprot) > > File "/usr/local/lib/python2.6/dist-packages/cassandra/Cassandra.py", line > 1767, in read > fastbinary.decode_binary(self, iprot.trans, (self.__class__, > self.thrift_spec)) > > TypeError: Unexpected TType > > Any body also has got this issue ever? I Google on it but no help found. We > are using pycassa for Cassandra connectivity. Cassandra version is 0.5.1 > > Can anybody also guide how should we use Cassandra for website e.g. as on > each page access there is a new process start so how can we implement > connection pooling on website for Cassandra? > > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: timeout while running simple hadoop job
On Wed, May 12, 2010 at 5:11 AM, gabriele renzi wrote: > - is it possible that such errors show up on the client side as > timeoutErrors when they could be reported better? No, if the node the client is talking to doesn't get a reply from the data node, there is no way for it to magically find out what happened since ipso facto it got no reply. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: timeout while running simple hadoop job
On Wed, May 12, 2010 at 4:43 PM, Jonathan Ellis wrote: > On Wed, May 12, 2010 at 5:11 AM, gabriele renzi wrote: >> - is it possible that such errors show up on the client side as >> timeoutErrors when they could be reported better? > > No, if the node the client is talking to doesn't get a reply from the > data node, there is no way for it to magically find out what happened > since ipso facto it got no reply. Sorry I was not clear: I meant the first error (where we get a RuntimeException in reading the file, not in the socket.accept()). There we have a reasonable error message (either "too many open files" or "corrupt sstable") that does not appear client side. -- blog en: http://www.riffraff.info blog it: http://riffraff.blogsome.com
Cache capacities set by nodetool
When I first brought this cluster online, the storage-conf.xml file had a few cache capacities set. Since then, we've completely changed how we use cassandra's caching, and no longer use any of the caches I setup in the original configuration. I'm finding that cassandra doesn't want to keep my new cache settings. I can disable those caches with setcachecapacity, check cfstats and see that they are disabled, but within a few minutes, cfstats is reporting that the caches are back. Anything I can do?
Re: timeout while running simple hadoop job
Have you checked your open file descriptor limit? You can do that by using "ulimit" in the shell. If it's too low, you will encounter the "too many open files" error. You can also see how many open file handles an application has with "lsof". Héctor Izquierdo On 12/05/10 17:00, gabriele renzi wrote: On Wed, May 12, 2010 at 4:43 PM, Jonathan Ellis wrote: On Wed, May 12, 2010 at 5:11 AM, gabriele renzi wrote: - is it possible that such errors show up on the client side as timeoutErrors when they could be reported better? No, if the node the client is talking to doesn't get a reply from the data node, there is no way for it to magically find out what happened since ipso facto it got no reply. Sorry I was not clear: I meant the first error (where we get a RuntimeException in reading the file, not in the socket.accept()). There we have a reasonable error message (either "too many open files" or "corrupt sstable") that does not appear client side.
Re: Cache capacities set by nodetool
It's a bug: https://issues.apache.org/jira/browse/CASSANDRA-1079 -ryan On Wed, May 12, 2010 at 8:16 AM, James Golick wrote: > When I first brought this cluster online, the storage-conf.xml file had a > few cache capacities set. Since then, we've completely changed how we use > cassandra's caching, and no longer use any of the caches I setup in the > original configuration. > I'm finding that cassandra doesn't want to keep my new cache settings. I can > disable those caches with setcachecapacity, check cfstats and see that they > are disabled, but within a few minutes, cfstats is reporting that the caches > are back. > Anything I can do?
Re: timeout while running simple hadoop job
Looking over the code this is in fact an issue in 0.6. It's fixed in trunk/0.7. Connections will be reused and closed properly, see https://issues.apache.org/jira/browse/CASSANDRA-1017 for more details. We can either backport that patch or make at least close the connections properly in 0.6. Can you open an ticket for this bug? /Johan On 12 maj 2010, at 12.11, gabriele renzi wrote: > a follow up for anyone that may end up on this conversation again: > > I kept trying and neither changing the number of concurrent map tasks, > nor the slice size helped. > Finally, I found out a screw up in our logging system, which had > forbidden us from noticing a couple of recurring errors in the logs : > > ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328 > DebuggableThreadPoolExecutor.java (line 101) Error in > ThreadPoolExecutor > java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable >at > org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53) >at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40) >at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >at java.lang.Thread.run(Thread.java:619) > Caused by: java.lang.RuntimeException: corrupt sstable >at > org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73) >at > org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907) >at > org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000) >at > org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41) >... 4 more > Caused by: java.io.FileNotFoundException: > /path/to/data/Keyspace/CF-123-Index.db (Too many open files) >at java.io.RandomAccessFile.open(Native Method) >at java.io.RandomAccessFile.(RandomAccessFile.java:212) >at java.io.RandomAccessFile.(RandomAccessFile.java:98) >at > org.apache.cassandra.io.util.BufferedRandomAccessFile.(BufferedRandomAccessFile.java:143) >at > org.apache.cassandra.io.util.BufferedRandomAccessFile.(BufferedRandomAccessFile.java:138) >at > org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414) >at > org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62) >... 7 more > > and the related > > WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190) > Transport error occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: > java.net.SocketException: Too many open files >at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124) >at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) >at > org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31) >at > org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184) >at > org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149) >at > org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190) > Caused by: java.net.SocketException: Too many open files >at java.net.PlainSocketImpl.socketAccept(Native Method) >at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) >at java.net.ServerSocket.implAccept(ServerSocket.java:453) >at java.net.ServerSocket.accept(ServerSocket.java:421) >at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119) >... 5 more > > The client was reporting timeouts in this case. 
> > > The max fd limit on the process was in fact not exceedingly high > (1024) and raising it seems to have solved the problem. > > Anyway It still seems that there may be two issues: > > - since we had never seen this error before with normal client > connections (as in: non hadoop), is it possible that the > Cassandra/hadoop layer is not closing sockets properly between one > connection and the other, or not reusing connections efficiently? > E.g. TSocket seems to have a close() method but I don't see it used in > ColumnFamilyInputFormat.(getSubSplits, getRangeMap) but it may well be > inside CassandraClient. > > Anyway, judging by lsof's output I can only see about a hundred TCP > connections, but those from the hadoop jobs seem to always be below 60 > so this may just be my wrong impression. > > - is it possible that such errors show up on the client side as > timeoutErrors when they could be reported better? this would probably > help other people in diagnosing/reporting internal errors in the > future. > > > Thanks again to everyone with this, I promise I'll put the discussion > on the wiki for future reference :)
Re: Real-time Web Analysis tool using Cassandra. Doubts...
What makes cassandra a poor choice is the fact that, you can't use a keyrange as input for the map phase for Hadoop. On Wed, May 12, 2010 at 4:37 PM, Jonathan Ellis wrote: > On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati > wrote: > > - First of all, my first thoughts is to have two CF one for raw client > > request (~10 millions++ per day) and other for aggregated metrics in some > > defined inteval time like 1min, 5min, 15min... Is this a good approach ? > > Sure. > > > - It is a good idea to use a OrderPreservingPartitioner ? To maintain the > > order of my requests in the raw data CF ? Or the overhead is too big. > > The problem with OPP isn't overhead (it is lower-overhead than RP) but > the tendency to have hotspots in sequentially-written data. > > > - Initially the cluster will contain only three nodes, is it a problem > (to > > few maybe) ? > > You'll have to do some load testing to see. > > > - I think the best way to do the aggregation job is through a hadoop > > MapReduce job. Right ? Is there any other way to consider ? > > Map/Reduce is usually better than rolling your own because it > parallelizes for you. > > > - Is really Cassandra suitable for it ? Maybe HBase is better in this > case? > > Nothing here makes me think "Cassandra is a poor choice." > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of Riptano, the source for professional Cassandra support > http://riptano.com >
Re: (Binary)Memtable flushing
D'oh, forgot to search the JIRA on this one. Thanks Jonathan! On Wed, May 12, 2010 at 9:37 AM, Jonathan Ellis wrote: > https://issues.apache.org/jira/browse/CASSANDRA-856 > > On Tue, May 11, 2010 at 3:44 PM, Tobias Jungen > wrote: > > Yet another BMT question, thought this may apply for regular memtables as > > well... > > > > After doing a batch insert, I accidentally submitted the flush command > > twice. To my surprise, the target node's log indicates that it wrote a > new > > *-Data.db file, and the disk usage went up accordingly. I tested and > issued > > the flush command a few more times, and after a few more data files I > > eventually triggered a compaction, bringing the disk usage back down. The > > data appears to continue to stick around in memory, however, as further > > flush commands continue to result in new data files. > > > > Shouldn't flushing a memtable remove it from memory, or is expected > behavior > > that it sticks around until the node needs to reclaim the memory? Should > I > > worry about getting out-of-memory errors if I'm doing lots of inserts in > > this manner? > > > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of Riptano, the source for professional Cassandra support > http://riptano.com >
Re: timeout while running simple hadoop job
On Wed, May 12, 2010 at 5:46 PM, Johan Oskarsson wrote: > Looking over the code this is in fact an issue in 0.6. > It's fixed in trunk/0.7. Connections will be reused and closed properly, see > https://issues.apache.org/jira/browse/CASSANDRA-1017 for more details. > > We can either backport that patch or make at least close the connections > properly in 0.6. Can you open an ticket for this bug? done as https://issues.apache.org/jira/browse/CASSANDRA-1081
Re: Is it possible to delete records based upon where condition
The functionality of a WHERE clause usually means maintaining an inverted index, usually another CF, on the information of interest (ses_tstamp in your example). You then retrieve index rows from that CF to find the data rows. b On Wed, May 12, 2010 at 5:34 AM, Moses Dinakaran wrote: > Hi All, > > In Cassandra it possible to remove records based upon where condition. > > We are planning to move the session and cache table from MySql to Cassandra > and where doing the fesability study. Everything seems to be Ok other than > garbage collection of session table. > > Was not able to remove super columns from the Key based upon certain > conditions in the column value. > > Follows the schema for Session tables > > > FESession = { > sessions : { > 1001 : { > ses_name : "user", > ses_userid : 258, > ses_tstamp : “1273504587” > }, > 1002 : { > ses_name : "user", > ses_userid : 259, > ses_tstamp : “1273504597” > }, > }, > } > > I wanted to remove the records based upon the value of the column ses_tstamp > ie (delete from sessions where ses_tstamp between XXX & YYY OR delete from > session where ses_tstamp < XXX ) > > Is it possible to achieve this in Cassandra If so please let me know how. > > To my knowledge I dont see any kind of where condition in Casandra, > If that is the case also please let me know the other ways like > restructuring the Schema or whatever... > > Regards > Moses > > > > > > >
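(A rough sketch of the inverted-index pattern described above, written against the 0.6-era Thrift API. The keyspace and column family names, the single index row, the one-row-per-session layout, and the zero-padded column naming are all made up for illustration: on every session write, also insert an index column named padded-tstamp:session-key; to expire sessions, slice that index row up to the cutoff and remove the referenced rows.)

    import java.util.List;
    import org.apache.cassandra.thrift.*;

    public class SessionExpirySketch {
        private static final String KEYSPACE = "FESession";   // hypothetical keyspace
        private static final String INDEX_ROW = "all";        // one wide index row (could be bucketed)

        // Call alongside every session write: index column name = padded tstamp + ":" + session key.
        static void indexSession(Cassandra.Client client, String sessionKey, long tstamp) throws Exception {
            String indexCol = String.format("%019d:%s", tstamp, sessionKey);
            ColumnPath path = new ColumnPath("SessionsByTstamp");
            path.setColumn(indexCol.getBytes("UTF-8"));
            client.insert(KEYSPACE, INDEX_ROW, path, sessionKey.getBytes("UTF-8"),
                    System.currentTimeMillis(), ConsistencyLevel.QUORUM);
        }

        // "delete from sessions where ses_tstamp < cutoff": slice the index row, remove each session.
        static void expireBefore(Cassandra.Client client, long cutoff) throws Exception {
            SliceRange range = new SliceRange("".getBytes("UTF-8"),
                    String.format("%019d:", cutoff).getBytes("UTF-8"), false, 1000); // first 1000; loop for more
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(range);
            List<ColumnOrSuperColumn> hits = client.get_slice(KEYSPACE, INDEX_ROW,
                    new ColumnParent("SessionsByTstamp"), predicate, ConsistencyLevel.QUORUM);
            long now = System.currentTimeMillis();
            for (ColumnOrSuperColumn cosc : hits) {
                String sessionKey = new String(cosc.column.value, "UTF-8");
                // drop the whole session row, then the index entry pointing at it
                client.remove(KEYSPACE, sessionKey, new ColumnPath("Sessions"), now, ConsistencyLevel.QUORUM);
                ColumnPath indexPath = new ColumnPath("SessionsByTstamp");
                indexPath.setColumn(cosc.column.name);
                client.remove(KEYSPACE, INDEX_ROW, indexPath, now, ConsistencyLevel.QUORUM);
            }
        }
    }

Zero-padding the timestamps keeps the index columns in chronological order under a bytes/UTF8 comparator, which is what makes the range slice equivalent to the "ses_tstamp < XXX" predicate.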
Load Balancing Mapper Tasks
I've been trying to improve the time it takes to map 30 million rows using a hadoop / cassandra cluster with 30 nodes. I discovered that since CassandraInputFormat returns an ordered list of splits, when there are many splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced. e.g. if I have 30 tasks processing 600 splits, then the first 30 splits are all located on the same one or two nodes. I added Collections.shuffle(splits) before returning the splits in getSplits(). As a result, the load is much better distributed, throughput increased (about 3x in my case), and TimedOutExceptions were all but eliminated. Joost.
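(For reference, a minimal sketch of that change as a subclass of the stock input format, assuming the 0.6-era org.apache.cassandra.hadoop.ColumnFamilyInputFormat with a mapreduce-API getSplits(JobContext) declared to throw IOException; adjust the throws clause if your version differs.)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    // Returns the same splits as ColumnFamilyInputFormat, but in random order,
    // so concurrent map tasks don't all hit the same one or two nodes at once.
    public class ShuffledColumnFamilyInputFormat extends ColumnFamilyInputFormat {
        @Override
        public List<InputSplit> getSplits(JobContext context) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>(super.getSplits(context));
            Collections.shuffle(splits); // break the token-order grouping of splits per node
            return splits;
        }
    }

If you are patching the Cassandra source directly instead of subclassing, the only change is the Collections.shuffle call before the return.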
Re: Cache capacities set by nodetool
Hmmm that's definitely what we're seeing. Although, we aren't seeing cache settings being picked up properly on a restart either. On Wed, May 12, 2010 at 8:42 AM, Ryan King wrote: > It's a bug: > > https://issues.apache.org/jira/browse/CASSANDRA-1079 > > -ryan > > On Wed, May 12, 2010 at 8:16 AM, James Golick > wrote: > > When I first brought this cluster online, the storage-conf.xml file had a > > few cache capacities set. Since then, we've completely changed how we use > > cassandra's caching, and no longer use any of the caches I setup in the > > original configuration. > > I'm finding that cassandra doesn't want to keep my new cache settings. I > can > > disable those caches with setcachecapacity, check cfstats and see that > they > > are disabled, but within a few minutes, cfstats is reporting that the > caches > > are back. > > Anything I can do? >
Re: Cache capacities set by nodetool
Picked up out of config, I mean. On Wed, May 12, 2010 at 11:10 AM, James Golick wrote: > Hmmm that's definitely what we're seeing. Although, we aren't seeing > cache settings being picked up properly on a restart either. > > > On Wed, May 12, 2010 at 8:42 AM, Ryan King wrote: > >> It's a bug: >> >> https://issues.apache.org/jira/browse/CASSANDRA-1079 >> >> -ryan >> >> On Wed, May 12, 2010 at 8:16 AM, James Golick >> wrote: >> > When I first brought this cluster online, the storage-conf.xml file had >> a >> > few cache capacities set. Since then, we've completely changed how we >> use >> > cassandra's caching, and no longer use any of the caches I setup in the >> > original configuration. >> > I'm finding that cassandra doesn't want to keep my new cache settings. I >> can >> > disable those caches with setcachecapacity, check cfstats and see that >> they >> > are disabled, but within a few minutes, cfstats is reporting that the >> caches >> > are back. >> > Anything I can do? >> > >
RE: Extremely slow inserts on LAN
Hi, I read your post and noticed you are running Cassandra on win 2008. Do you run it in a production environment? I'm contacting you because there aren't that many windows installations. I need to provide a live Cassandra environment on win 2008 and was stumbling into some problems with node tool and storage.xml setup. Best, Stephan From: Arie Keren [mailto:a...@doubleverify.com] Sent: Sunday, May 09, 2010 10:07 AM To: user@cassandra.apache.org Subject: Extremly slow inserts on LAN While making our first steps with Cassandra, we experience slow inserts working on LAN. Inserting 7000 keys with 1 column family takes about 10 seconds when Cassandra server running on the same host with the client. But when server runs on a different host on LAN, the same inserts take more than 10 (!) minutes. In both cases Cassandra server contains a single node. We use Cassandra version 0.6.0 running on Windows server 2008. The client is .NET c# application.
Re: Human readable Cassandra limitations
On Wed, May 12, 2010 at 2:02 AM, David Vanderfeesten wrote: >... > > My concern with the denormalization approach is that it shouldn't be managed > by the client side because this has big impact on your throughput. Is the > map-reduce in that respect any better? > Wouldn't it be nice to support a kind of PL-(No)SQL server side scripting > that allows you to create and maintain materialized views? You might still > give it as an option to maintain the view synchronously (extension of > current row-level-atomicity) or asynchronously. I think that this is what you're talking about: https://issues.apache.org/jira/browse/CASSANDRA-1016 Paul Prescod
Re: Is it possible to delete records based upon where condition
On Thu, May 13, 2010 at 12:34 AM, Moses Dinakaran wrote: > I wanted to remove the records based upon the value of the column ses_tstamp > ie (delete from sessions where ses_tstamp between XXX & YYY OR delete from > session where ses_tstamp < XXX ) > > Is it possible to achieve this in Cassandra If so please let me know how. From my limited understanding of Cassandra (haven't actually deployed it and have mostly just been learning from reading for now), the way you'd want to do this is by either rearranging your schema or manually creating an index, so that the timestamps are used as the column names within a CF that is sorted by the timestamp. Joel Pitt, PhD | http://ferrouswheel.me OpenCog Developer | http://opencog.org Board member, Humanity+ | http://humanityplus.org
how does cassandra compare with mongodb?
I tried searching mail-archive, but the search feature is a bit wacky (or more probably I don't know how to use it). What are the key differences between Cassandra and Mongodb? Is there a particular use case where each solution shines?
paging row keys
Can someone point me to a thrift sample (preferable java) to list all the rows in a ColumnFamily for my Cassandra server. I noticed some examples using SlicePredicate and SliceRange to perform a similar query against the columns with paging, but I was looking for something similar for rows with paging. I'm also new to Cassandra, so I might have misunderstood the existing samples. -Corey
Re: key is sorted?
If you use Random partitioner, You will *NOT* get RowKey's sorted. (Columns are sorted always). Answer: If used Random partitioner True True Regards, On Wed, May 12, 2010 at 1:25 AM, David Boxenhorn wrote: > You do any kind of range slice, e.g. keys beginning with "abc"? But the > results will not be ordered? > > Please answer one of the following: > > True True > True False > False False > > Explain? > > Thanks! > > > On Sun, May 9, 2010 at 8:27 PM, Vijay wrote: > >> True, The Range slice support was enabled in Random Partitioner for the >> hadoop support. >> >> Random partitioner actually hash the Key and those keys are sorted so we >> cannot have the actual key in order (Hope this doesnt confuse you)... >> >> Regards, >> >> >> >> >> On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn wrote: >> >>> This is something that I'm not sure that I understand. Can somebody >>> confirm/deny that I understand it? Thanks. >>> >>> If you use random partitioning, you can loop through all keys with a >>> range query, but they will not be sorted. >>> >>> True or False? >>> >>> On Sat, May 8, 2010 at 3:45 AM, AJ Chen wrote: >>> thanks, that works. -aj On Fri, May 7, 2010 at 1:17 PM, Stu Hood wrote: > Your IPartitioner implementation decides how the row keys are sorted: > see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner. > You need to be using one of the OrderPreservingPartitioners if you'd like > a reasonable order for the keys. > > -Original Message- > From: "AJ Chen" > Sent: Friday, May 7, 2010 3:10pm > To: user@cassandra.apache.org > Subject: key is sorted? > > I have a super column family for "topic", key being the name of the > topic. > CompareSubcolumnsWith="BytesType" /> > When I retrieve the rows, the rows are not sorted by the key. Is the > row key > sorted in cassandra by default? > > -aj > -- > AJ Chen, PhD > Chair, Semantic Web SIG, sdforum.org > http://web2express.org > twitter @web2express > Palo Alto, CA, USA > > > -- AJ Chen, PhD Chair, Semantic Web SIG, sdforum.org http://web2express.org twitter @web2express Palo Alto, CA, USA >>> >>> >> >
Re: paging row keys
Here is a basic example using get_range_slices to retrieve 500 rows via hector: http://github.com/bosyak/hector/blob/master/src/main/java/me/prettyprint/cassandra/examples/ExampleReadAllKeys.java To page, use the last key you got back as the start key. -Nate On Wed, May 12, 2010 at 3:37 PM, Corey Hulen wrote: > > Can someone point me to a thrift sample (preferable java) to list all the > rows in a ColumnFamily for my Cassandra server. I noticed some examples > using SlicePredicate and SliceRange to perform a similar query against the > columns with paging, but I was looking for something similar for rows with > paging. I'm also new to Cassandra, so I might have misunderstood the > existing samples. > -Corey > > >
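(Since the original question asked for a raw Thrift sample, here is a sketch of the same paging loop against the 0.6 Thrift API. Keyspace and column family names are placeholders; note that under the RandomPartitioner the keys come back in token order rather than key order, and each page after the first begins with the last key already seen, so it is skipped.)

    import java.util.List;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    // Page through every row key of a column family, 500 keys at a time.
    public class KeyPagingSketch {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TSocket("localhost", 9160); // placeholder host/port
            transport.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

            ColumnParent parent = new ColumnParent("MyColumnFamily"); // placeholder CF
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 0)); // keys only

            final int pageSize = 500;
            String start = "";
            while (true) {
                KeyRange range = new KeyRange();
                range.setStart_key(start);
                range.setEnd_key("");
                range.setCount(pageSize);
                List<KeySlice> page = client.get_range_slices("MyKeyspace", parent, predicate,
                        range, ConsistencyLevel.ONE); // placeholder keyspace
                for (KeySlice slice : page) {
                    if (!slice.key.equals(start)) {   // skip the overlapping first key of each page
                        System.out.println(slice.key);
                    }
                }
                if (page.size() < pageSize) {
                    break;                            // last page
                }
                start = page.get(page.size() - 1).key; // resume from the last key seen
            }
            transport.close();
        }
    }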
Re: paging row keys
Oh, thanks to Andrey Panov for providing that example, btw. We are always looking for good usage examples to post on the Hector wiki If anyone else has them. -Nate On Wed, May 12, 2010 at 5:01 PM, Nathan McCall wrote: > Here is a basic example using get_range_slices to retrieve 500 rows via > hector: > http://github.com/bosyak/hector/blob/master/src/main/java/me/prettyprint/cassandra/examples/ExampleReadAllKeys.java > > To page, use the last key you got back as the start key. > > -Nate > > On Wed, May 12, 2010 at 3:37 PM, Corey Hulen wrote: >> >> Can someone point me to a thrift sample (preferable java) to list all the >> rows in a ColumnFamily for my Cassandra server. I noticed some examples >> using SlicePredicate and SliceRange to perform a similar query against the >> columns with paging, but I was looking for something similar for rows with >> paging. I'm also new to Cassandra, so I might have misunderstood the >> existing samples. >> -Corey >> >> >> >
Re: key is sorted?
Although, if replication factor spans all nodes, then the disparity in row allocation should be a non-issue when using OrderPreservingPartitioner. On Wed, May 12, 2010 at 6:42 PM, Vijay wrote: > If you use Random partitioner, You will NOT get RowKey's sorted. (Columns > are sorted always). > Answer: If used Random partitioner > True True > > Regards, > > > > > On Wed, May 12, 2010 at 1:25 AM, David Boxenhorn wrote: >> >> You do any kind of range slice, e.g. keys beginning with "abc"? But the >> results will not be ordered? >> >> Please answer one of the following: >> >> True True >> True False >> False False >> >> Explain? >> >> Thanks! >> >> On Sun, May 9, 2010 at 8:27 PM, Vijay wrote: >>> >>> True, The Range slice support was enabled in Random Partitioner for the >>> hadoop support. >>> Random partitioner actually hash the Key and those keys are sorted so we >>> cannot have the actual key in order (Hope this doesnt confuse you)... >>> Regards, >>> >>> >>> >>> >>> On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn >>> wrote: This is something that I'm not sure that I understand. Can somebody confirm/deny that I understand it? Thanks. If you use random partitioning, you can loop through all keys with a range query, but they will not be sorted. True or False? On Sat, May 8, 2010 at 3:45 AM, AJ Chen wrote: > > thanks, that works. -aj > > On Fri, May 7, 2010 at 1:17 PM, Stu Hood > wrote: >> >> Your IPartitioner implementation decides how the row keys are sorted: >> see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . >> You >> need to be using one of the OrderPreservingPartitioners if you'd like a >> reasonable order for the keys. >> >> -Original Message- >> From: "AJ Chen" >> Sent: Friday, May 7, 2010 3:10pm >> To: user@cassandra.apache.org >> Subject: key is sorted? >> >> I have a super column family for "topic", key being the name of the >> topic. >> > CompareSubcolumnsWith="BytesType" /> >> When I retrieve the rows, the rows are not sorted by the key. Is the >> row key >> sorted in cassandra by default? >> >> -aj >> -- >> AJ Chen, PhD >> Chair, Semantic Web SIG, sdforum.org >> http://web2express.org >> twitter @web2express >> Palo Alto, CA, USA >> >> > > > > -- > AJ Chen, PhD > Chair, Semantic Web SIG, sdforum.org > http://web2express.org > twitter @web2express > Palo Alto, CA, USA >>> >> > >
Re: how does cassandra compare with mongodb?
Hi, From my understanding, Cassandra entities are indexed on only one key, so this can be a problem if you are searching by two values: for example, if you are storing an entity with an x,y and then wish to search for entities in a box, i.e. x>5 and x<10 and y>5 and y<10. MongoDB can do this; Cassandra cannot, due to only indexing on one key. Cassandra can scale automatically just by adding nodes, giving almost infinite storage easily; MongoDB requires database administration to add nodes, setting up replication or sharding, but it is not too complex. MongoDB requires you to create shard keys if you want to scale horizontally; Cassandra just scales horizontally automatically. Cassandra requires the schema to be defined before the database starts; MongoDB can have any schema at run-time, just like a normal database. In the end I chose MongoDB, as I require more indexes than Cassandra provides, although I really like Cassandra's ability to store an almost infinite amount of data just by adding nodes. Thanks, Phil On Thu, May 13, 2010 at 5:57 AM, S Ahmed wrote: > I tried searching mail-archive, but the search feature is a bit wacky (or > more probably I don't know how to use it). > What are the key differences between Cassandra and Mongodb? > Is there a particular use case where each solution shines? >
Re: Apache Cassandra and Net::Cassandra::Easy
Well, ain't that a kick in the tush. :-/ I grabbed the svn trunk, but build failed, probably because Fedora 11 is too old for this bleeding edge stuff. Server is upgrading itself now, but I wondered: is anyone using an rpm-based distro for Net::Cassandra::Easy and the svn Cassandra? Thanks. :) "Ted Zlatanov" wrote: >SD> This is using Net-Cassandra-Easy-0.14-7Jcufu and Cassandra 0.6.1. > >I'll stabilize N::C::Easy at 0.7 and support both 0.7 and trunk from >that point on. Sorry but 0.6.x is not supported. I made this a >stronger statement in the N::C::Easy docs as of next release; sorry for >the inconvenience. > >Ted -- Sent from my Android phone with K-9. Please excuse my brevity.
Re: how does cassandra compare with mongodb?
You can choose to have keys ordered by using an OrderPreservingPartitioner, with the trade-off that key ranges can get denser on certain nodes than others. On Wed, May 12, 2010 at 7:48 PM, philip andrew wrote: > > Hi, > From my understanding, Cassandra entities are indexed on only one key, so > this can be a problem if you are searching for example by two values such as > if you are storing an entity with a x,y then wish to search for entities in > a box ie x>5 and x<10 and y>5 and y<10. MongoDB can do this, Cassandra > cannot due to only indexing on one key. > Cassandra can scale automatically just by adding nodes, almost infinite > storage easily, MongoDB requires database administration to add nodes, > setting up replication or allowing sharding, but not too complex. > MongoDB requires you to create sharded keys if you want to scale > horizontally, Cassandra just works automatically for scale horizontally. > Cassandra requires the schema to be defined before the database starts, > MongoDB can have any schema at run-time just like a normal database. > In the end I choose MongoDB as I require more indexes than Cassandra > provides, although I really like Cassandras ability to store almost infinite > amount of data just by adding nodes. > Thanks, Phil > > On Thu, May 13, 2010 at 5:57 AM, S Ahmed wrote: >> >> I tried searching mail-archive, but the search feature is a bit wacky (or >> more probably I don't know how to use it). >> What are the key differences between Cassandra and Mongodb? >> Is there a particular use case where each solution shines? >
Re: key is sorted?
Thank you. That is very good news. I can sort the results myself - what is important is that I get them! On Thu, May 13, 2010 at 2:42 AM, Vijay wrote: > If you use Random partitioner, You will *NOT* get RowKey's sorted. > (Columns are sorted always). > > Answer: If used Random partitioner > True True > > Regards, > > > > > > On Wed, May 12, 2010 at 1:25 AM, David Boxenhorn wrote: > >> You do any kind of range slice, e.g. keys beginning with "abc"? But the >> results will not be ordered? >> >> Please answer one of the following: >> >> True True >> True False >> False False >> >> Explain? >> >> Thanks! >> >> >> On Sun, May 9, 2010 at 8:27 PM, Vijay wrote: >> >>> True, The Range slice support was enabled in Random Partitioner for the >>> hadoop support. >>> >>> Random partitioner actually hash the Key and those keys are sorted so we >>> cannot have the actual key in order (Hope this doesnt confuse you)... >>> >>> Regards, >>> >>> >>> >>> >>> On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn wrote: >>> This is something that I'm not sure that I understand. Can somebody confirm/deny that I understand it? Thanks. If you use random partitioning, you can loop through all keys with a range query, but they will not be sorted. True or False? On Sat, May 8, 2010 at 3:45 AM, AJ Chen wrote: > thanks, that works. -aj > > > On Fri, May 7, 2010 at 1:17 PM, Stu Hood wrote: > >> Your IPartitioner implementation decides how the row keys are sorted: >> see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner. >> You need to be using one of the OrderPreservingPartitioners if you'd like >> a reasonable order for the keys. >> >> -Original Message- >> From: "AJ Chen" >> Sent: Friday, May 7, 2010 3:10pm >> To: user@cassandra.apache.org >> Subject: key is sorted? >> >> I have a super column family for "topic", key being the name of the >> topic. >> > CompareSubcolumnsWith="BytesType" /> >> When I retrieve the rows, the rows are not sorted by the key. Is the >> row key >> sorted in cassandra by default? >> >> -aj >> -- >> AJ Chen, PhD >> Chair, Semantic Web SIG, sdforum.org >> http://web2express.org >> twitter @web2express >> Palo Alto, CA, USA >> >> >> > > > -- > AJ Chen, PhD > Chair, Semantic Web SIG, sdforum.org > http://web2express.org > twitter @web2express > Palo Alto, CA, USA > >>> >> >