Re: key is sorted?

2010-05-12 Thread David Boxenhorn
Can you do any kind of range slice, e.g. for keys beginning with "abc"? But the
results will not be ordered?

Please answer one of the following:

True True
True False
False False

Explain?

Thanks!

On Sun, May 9, 2010 at 8:27 PM, Vijay  wrote:

> True. Range slice support was enabled in the Random Partitioner for the
> Hadoop support.
>
> The Random Partitioner actually hashes the key, and it is the hashes that
> are sorted, so we cannot have the actual keys in order (hope this doesn't
> confuse you)...
>
> Regards,
> 
>
>
>
> On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn wrote:
>
>> This is something that I'm not sure that I understand. Can somebody
>> confirm/deny that I understand it? Thanks.
>>
>> If you use random partitioning, you can loop through all keys with a range
>> query, but they will not be sorted.
>>
>> True or False?
>>
>> On Sat, May 8, 2010 at 3:45 AM, AJ Chen  wrote:
>>
>>> thanks, that works. -aj
>>>
>>>
>>> On Fri, May 7, 2010 at 1:17 PM, Stu Hood  wrote:
>>>
 Your IPartitioner implementation decides how the row keys are sorted:
 see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner .
 You need to be using one of the OrderPreservingPartitioners if you'd like a
 reasonable order for the keys.

 -Original Message-
 From: "AJ Chen" 
 Sent: Friday, May 7, 2010 3:10pm
 To: user@cassandra.apache.org
 Subject: key is sorted?

 I have a super column family for "topic", key being the name of the
 topic.
 <ColumnFamily Name="topic" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
 When I retrieve the rows, the rows are not sorted by the key. Is the row
 key
 sorted in cassandra by default?

 -aj
 --
 AJ Chen, PhD
 Chair, Semantic Web SIG, sdforum.org
 http://web2express.org
 twitter @web2express
 Palo Alto, CA, USA



>>>
>>>
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>>>
>>
>>
>


Re: Human readable Cassandra limitations

2010-05-12 Thread David Vanderfeesten
On the scalability and performance side, I found Yahoo's paper about the
YCSB project interesting (benchmarking some NoSQL solutions against MySQL).
See research.yahoo.com/files/ycsb.pdf.

My concern with the denormalization approach is that it shouldn't be managed
by the client side, because this has a big impact on your throughput.
Is the map-reduce in that respect any better?
Wouldn't it be nice to support a kind of PL-(No)SQL server-side scripting
that allows you to create and maintain materialized views? You might still
give it as an option to maintain the view synchronously (an extension of
the current row-level atomicity) or asynchronously.

Not sure how complicated this support would be...

- David

On Mon, May 10, 2010 at 10:38 PM, Paul Prescod  wrote:

> On Mon, May 10, 2010 at 1:23 PM, Peter Hsu  wrote:
> > Thanks for the response, Paul.
> > ...
> >
> > * Cassandra and its siblings are weak at ad hoc queries on tables
> > that you did not think to index in advance
> >
> > What is the normal way of dealing with this in Cassandra?  Would you just
> > create a new "index" and bring a big honking machine to the table to process
> > all the existing data in the database and store the new "index"?
>
> The latest version of Cassandra introduces a "map/reduce" paradigm
> which is the main tool you'd use for batch processing of data. You
> could either use that to DO your ad hoc query or to process the data
> into an index for more efficient ad hoc queries in the future.
>
>  * http://en.wikipedia.org/wiki/MapReduce
>
>  * http://en.wikipedia.org/wiki/Hadoop
>
>  * http://architects.dzone.com/news/cassandra-adds-hadoop
>
> You can read criticisms of MapReduce in the first link there.
>
> > On May 10, 2010, at 11:22 AM, Paul Prescod wrote:
> >
> > This is a very, very big topic. For the most part, the issues are
> > covered in the various SQL versus NoSQL debates all over the Internet.
> > For example:
> >
> > * Cassandra and its NoSQL siblings have no concept of an in-database
> "join"
> >
> > * Cassandra and its NoSQL siblings do not allow you to update
> > multiple "tables" in a single transaction
> >
> > * Cassandra's API is specific to it, and not portable to any other data
> > store
> >
> > * Cassandra currently has simplistic facilities to deal with various
> > kinds of conflicting writes.
> >
> > * Cassandra is strongly optimized for distribution across multiple
> > machines, whereas relational databases tend to be optimized for a single
> > powerful machine.
> >
> > * Cassandra and its siblings are weak at ad hoc queries on tables
> > that you did not think to index in advance
> >
> > On Mon, May 10, 2010 at 11:06 AM, Peter Hsu 
> wrote:
> >
> > I've seen a lot of threads and posts about why Cassandra is great.  I'm
> > fairly sold on the features, and the few big deployments on Cassandra give
> > it a lot of credibility.
> >
> > However, I don't believe in magic bullets, so I really want to understand
> > the potential downsides of Cassandra.  Right now, I don't really have a clue
> > as to what Cassandra is bad at.  I took a look at
> > http://wiki.apache.org/cassandra/CassandraLimitations which is helpful, but
> > doesn't characterize its weaknesses in ways that I can really comprehend
> > until I've actually used Cassandra and understand some of the internals.  It
> > seems that the community would benefit from being able to answer some of
> > these questions in terms of real-world use cases.
> >
> > My main questions:
> >
> >  * Are there designs in which a SQL database out-performs or out-scales
> > Cassandra?
> >
> >  * Is there a pros vs cons page of Cassandra against an open source SQL
> > database (MySQL or Postgres)?
> >
> > I do plan on attending the training session next Friday in Palo Alto, but
> > it'd be great if I had some more food for thought before I attend.
> >
> >
> >
> >
>


Re: replication impact on write throughput

2010-05-12 Thread David Vanderfeesten
About this linear scaling of throughput (with keys perfectly distributed and
requests balanced over all nodes):
I would assume that this does not hold for a small number of nodes, because
from 2 nodes onwards part of the requests have to be handled by both a proxy
node and the actual node responsible for the key.
Is there no other overhead that starts to play a role when increasing the
number of nodes?
Until now we have not been able to reproduce this linear scaling of
throughput. We are still figuring out why, but it would be interesting to
know what the target should be...

-David

On Wed, May 12, 2010 at 4:13 AM, Paul Prescod  wrote:

> On Tue, May 11, 2010 at 5:56 PM, Mark Greene  wrote:
> > I was under the impression from what I've seen talked about on this list
> > (perhaps I'm wrong here) that given the write throughput of one node in a
> > cluster (again assuming each node has a given throughput and the same
> > config) that you would simply multiply that throughput by the number of
> > nodes you had, giving you the total throughput for the entire ring (like you
> > said this is somewhat artificial). The main benefit being that adding
> > capacity was as simple as adding more nodes to the ring with no degradation.
> > So by your math, 100 nodes with each node getting 5k wps, I would assume the
> > total capacity is 500k wps. But perhaps I've misunderstood some key
> > concepts. Still a novice myself ;-)
>
> If the replication factor is 2, then everything is written twice. So
> your throughput is cut in half.
>
>  Paul Prescod
>


Re: timeout while running simple hadoop job

2010-05-12 Thread gabriele renzi
A follow-up for anyone who may end up on this conversation again:

I kept trying, and neither changing the number of concurrent map tasks
nor the slice size helped.
Finally, I found a screw-up in our logging system, which had prevented us
from noticing a couple of recurring errors in the logs:

ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328
DebuggableThreadPoolExecutor.java (line 101) Error in
ThreadPoolExecutor
java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
at 
org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: corrupt sstable
at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73)
at 
org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907)
at 
org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000)
at 
org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
... 4 more
Caused by: java.io.FileNotFoundException:
/path/to/data/Keyspace/CF-123-Index.db (Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
at 
org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:143)
at 
org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:138)
at 
org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414)
at org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62)
... 7 more

and the related

 WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190)
Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Too many open files
at 
org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
at 
org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
at 
org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
at 
org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
at 
org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
at 
org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
Caused by: java.net.SocketException: Too many open files
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
at java.net.ServerSocket.implAccept(ServerSocket.java:453)
at java.net.ServerSocket.accept(ServerSocket.java:421)
at 
org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
... 5 more

The client was reporting timeouts in this case.


The max fd limit on the process was in fact not exceedingly high
(1024) and raising it seems to have solved the problem.

Anyway, it still seems that there may be two issues:

- since we had never seen this error before with normal client
connections (as in: non-Hadoop), is it possible that the
Cassandra/Hadoop layer is not closing sockets properly between one
connection and the next, or not reusing connections efficiently?
E.g. TSocket seems to have a close() method but I don't see it used in
ColumnFamilyInputFormat.(getSubSplits, getRangeMap) but it may well be
inside CassandraClient.

Anyway, judging by lsof's output I can only see about a hundred TCP
connections, but those from the hadoop jobs seem to always be below 60
so this may just be my wrong impression.

- is it possible that such errors show up on the client side as
timeoutErrors when they could be reported better? this would probably
help other people in diagnosing/reporting internal errors in the
future.


Thanks again to everyone for the help with this; I promise I'll put the
discussion on the wiki for future reference :)


Re: what/how do you guys monitor "slow" nodes?

2010-05-12 Thread Ran Tavory
There is a per cf read and write latency jmx.

On May 12, 2010 12:55 AM, "Jordan Pittier - Rezel"  wrote:

For sure you have to pay particular attention to memory allocation on each
node; especially be sure your servers don't swap. Then you can monitor how
load is balanced among your nodes (nodetool -h XX ring).



On Tue, May 11, 2010 at 11:46 PM, S Ahmed  wrote:
>
> If you have 3-4 nodes,...


Terminology, getting slices by time

2010-05-12 Thread Leslie Viljoen
Hi!

I am having trouble understanding the "column" terminology Cassandra
uses. I am developing in Ruby. I need to store data for vehicles which
will come in at different times and retrieve data for a specific
vehicle for specific slices of time. So each record could look like:

vehicle_id, { time => { 'sensor1' => 1000, 'sensor2' => 2000 } }

The actual insert:

@cas.insert(:ReportData, vehicle_id.to_s, { UUID.new.to_s => {
'sensor1' => 1000, 'sensor2' => 2000 }})

The model:


 
 
<Keyspace Name="...">
  <ColumnFamily Name="ReportData" ColumnType="Super"
                CompareWith="..." CompareSubcolumnsWith="..." />
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>




I am trying to store this with a supercolumn. Can someone tell me:

1. Are "sensor1" and "sensor2" columns?
2. Is "time" a supercolumn?
3. Where is the "subcolumn"?
4. Which item would "CompareWith" refer to?
5. Which item would "CompareSubcolumnsWith" refer to?


Also, about time:
1. To store "time", must I use UUID.new.to_s as above?
2. Isn't a time key redundant? Can Cassandra return slices based on
the time it keeps for each record?
3. How do I get a time slice for a vehicle? I get various generic
"application" errors when trying to use get_range to get time ranges.
4. Any advice on this data model?


Is it possible to delete records based upon where condition

2010-05-12 Thread Moses Dinakaran
Hi All,

In Cassandra, is it possible to remove records based upon a where condition?

We are planning to move the session and cache tables from MySQL to Cassandra
and we're doing the feasibility study. Everything seems to be OK other than
garbage collection of the session table.

We were not able to remove super columns from a key based upon certain
conditions in the column values.

The schema for the session table follows:


FESession = {
    sessions : {
        1001 : {
            ses_name   : "user",
            ses_userid :  258,
            ses_tstamp : "1273504587"
        },
        1002 : {
            ses_name   : "user",
            ses_userid :  259,
            ses_tstamp : "1273504597"
        },
    },
}

I wanted to remove the records based upon the value of the column ses_tstamp
ie (delete from sessions where ses_tstamp between XXX & YYY OR delete from
session where ses_tstamp < XXX )

Is it possible to achieve this in Cassandra? If so, please let me know how.

To my knowledge there is no kind of where condition in Cassandra.
If that is the case, please also let me know about other approaches, like
restructuring the schema or whatever...

Regards
Moses


Re: replication impact on write throughput

2010-05-12 Thread Mark Greene
>>If the replication factor is 2, then everything is written twice. So
>>your throughput is cut in half.

Throughput of new inserts is cut in half, right? I think I was thinking about
capacity in more general terms, from the node's perspective: the node has the
ability to write so many operations per second, whether that be a replicated
key or not, and you could potentially take that number and multiply it by the
number of nodes in the ring to give you your raw write capacity per second.
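
A back-of-the-envelope sketch of that arithmetic, in Java purely for
illustration (the numbers are the ones from this thread):

// Raw ring capacity vs. logical insert capacity: each logical insert
// costs replicationFactor raw writes somewhere on the ring.
static double uniqueInsertsPerSecond(int nodes, double perNodeWps, int replicationFactor) {
    return nodes * perNodeWps / replicationFactor;
}
// uniqueInsertsPerSecond(100, 5000, 2) == 250000, i.e. 250k unique wps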

On Tue, May 11, 2010 at 10:13 PM, Paul Prescod  wrote:

> On Tue, May 11, 2010 at 5:56 PM, Mark Greene  wrote:
> > I was under the impression from what I've seen talked about on this list
> > (perhaps I'm wrong here) that given the write throughput of one node in a
> > cluster (again assuming each node has a given throughput and the same
> > config) that you would simply multiply that throughput by the number of
> > nodes you had, giving you the total throughput for the entire ring (like you
> > said this is somewhat artificial). The main benefit being that adding
> > capacity was as simple as adding more nodes to the ring with no degradation.
> > So by your math, 100 nodes with each node getting 5k wps, I would assume the
> > total capacity is 500k wps. But perhaps I've misunderstood some key
> > concepts. Still a novice myself ;-)
>
> If the replication factor is 2, then everything is written twice. So
> your throughput is cut in half.
>
>  Paul Prescod
>


RE: Extremly slow inserts on LAN

2010-05-12 Thread Stephan Pfammatter
I don't yet understand all the problems you guys are facing. Just wanted to let
you know that I'm getting my feet wet with Cassandra. In a few days/weeks I'll
be re-reading all your notes again :)

I'm bound to provide a production Cassandra environment based on win 2008 (I 
know, I know...It's better to go with Cassandra on Win than no Cassandra!)

Great news group.

-Original Message-
From: Arie Keren [mailto:a...@doubleverify.com] 
Sent: Monday, May 10, 2010 3:09 AM
To: user@cassandra.apache.org
Subject: RE: Extremly slow inserts on LAN

That solved the problem!
It also helps to increase the buffer size from the default value of 1024:
var transport = new TBufferedTransport(new TSocket("192.168.0.123", 9160), 
64*1024);

Thanks a lot!

-Original Message-
From: Viktor Jevdokimov [mailto:viktor.jevdoki...@adform.com]
Sent: May 10, 2010 8:57 AM
To: user@cassandra.apache.org
Subject: RE: Extremly slow inserts on LAN

We had similar experience.

Problem was with TSocket as transport alone:

var transport = new TSocket("192.168.0.123", 9160);
var protocol = new TBinaryProtocol(transport);
var client = new Cassandra.Client(protocol);

Using TBufferedTransport helped a lot:

var transport = new TBufferedTransport(new TSocket("192.168.0.123", 
9160));
var protocol = new TBinaryProtocol(transport);
var client = new Cassandra.Client(protocol);

Viktor


-Original Message-
From: Arie Keren [mailto:a...@doubleverify.com]
Sent: Monday, May 10, 2010 8:51 AM
To: user@cassandra.apache.org
Subject: RE: Extremly slow inserts on LAN

No - just Windows.
So I'm going to do some experiments to isolate the cause:
- use java client on windows
- use linux server
- use java client on linux

Thanx


-Original Message-
From: David Strauss [mailto:da...@fourkitchens.com]
Sent: May 09, 2010 5:48 PM
To: user@cassandra.apache.org
Subject: Re: Extremly slow inserts on LAN

From a naive (not caring about Cassandra internals) basis, the first step is
to isolate whether the problem is on the client or server side.
Have you tried a Linux-based server or a Linux-based client?

On 2010-05-09 14:06, Arie Keren wrote:
> While making our first steps with Cassandra, we experience slow
> inserts working on LAN.
>
> Inserting 7000 keys with 1 column family takes about 10 seconds when
> Cassandra server running on the same host with the client.
>
> But when server runs on a different host on LAN, the same inserts take
> more than 10 (!) minutes.
>
> In both cases Cassandra server contains a single node.
>
>
>
> We use Cassandra version 0.6.0 running on Windows server 2008.
>
> The client is .NET c# application.

--
David Strauss
   | da...@fourkitchens.com
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]



what is DCQUORUM

2010-05-12 Thread vd
Hi

I have read about QUORUM but lately came across DCQUORUM. What is it, and
what's the difference between the two?


Re: what is DCQUORUM

2010-05-12 Thread Eben Hewitt
QUORUM is a high consistency level. It refers to the number of nodes that
have to acknowledge read or write operations in order to be assured that
Cassandra is in a consistent state. It uses (replication factor / 2) + 1.

DCQUORUM means "Data Center Quorum", and balances consistency with
performance. It puts multiple replicas in each Data Center so operations can
prefer replicas in the same DC for lower latency.
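
As a small illustration of that formula (integer division; this helper is
mine, not part of the Cassandra API):

// Number of replicas that must acknowledge an operation at QUORUM.
static int quorum(int replicationFactor) {
    return replicationFactor / 2 + 1;
}
// RF=3 -> 2, RF=5 -> 3. Note RF=2 -> 2: QUORUM with two replicas
// tolerates no downed replica.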

See https://issues.apache.org/jira/browse/CASSANDRA-492 for a little
discussion.

Also see
http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/locator/DatacenterShardStrategy.java
 and
http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/service/DatacenterWriteResponseHandler.java

Eben



On Wed, May 12, 2010 at 6:15 AM, vd  wrote:

> Hi
>
> I have read about QUORUM but lately came across DCQUORUM. What is it, and
> what's the difference between the two?
>
>


-- 
"In science there are no 'depths'; there is surface everywhere."
--Rudolf Carnap


Re: what is DCQUORUM

2010-05-12 Thread vd
Thanks Eben


On Wed, May 12, 2010 at 7:33 PM, Eben Hewitt  wrote:

> QUORUM is a high consistency level. It refers to the number of nodes that
> have to acknowledge read or write operations in order to be assured that
> Cassandra is in a consistent state. It uses (replication factor / 2) + 1.
>
> DCQUORUM means "Data Center Quorum", and balances consistency with
> performance. It puts multiple replicas in each Data Center so operations can
> prefer replicas in the same DC for lower latency.
>
> See https://issues.apache.org/jira/browse/CASSANDRA-492 for a little
> discussion.
>
> Also see
> http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/locator/DatacenterShardStrategy.java
>  and
>
> http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/service/DatacenterWriteResponseHandler.java
>
> Eben
>
>
>
> On Wed, May 12, 2010 at 6:15 AM, vd  wrote:
>
>> Hi
>>
>> I have read about QUORUM but lately came across DCQUORUM. What is it, and
>> what's the difference between the two?
>>
>>
>
>
> --
> "In science there are no 'depths'; there is surface everywhere."
> --Rudolf Carnap
>


Re: Real-time Web Analysis tool using Cassandra. Doubts...

2010-05-12 Thread Jonathan Ellis
On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati
 wrote:
> - First of all, my first thought is to have two CFs, one for raw client
> requests (~10 million++ per day) and another for aggregated metrics at some
> defined interval like 1min, 5min, 15min... Is this a good approach ?

Sure.

> - Is it a good idea to use an OrderPreservingPartitioner ? To maintain the
> order of my requests in the raw data CF ? Or is the overhead too big ?

The problem with OPP isn't overhead (it is lower-overhead than RP) but
the tendency to have hotspots in sequentially-written data.

> - Initially the cluster will contain only three nodes, is it a problem (too
> few maybe) ?

You'll have to do some load testing to see.

> - I think the best way to do the aggregation job is through a hadoop
> MapReduce job. Right ? Is there any other way to consider ?

Map/Reduce is usually better than rolling your own because it
parallelizes for you.

> - Is Cassandra really suitable for it ? Maybe HBase is better in this case?

Nothing here makes me think "Cassandra is a poor choice."

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: (Binary)Memtable flushing

2010-05-12 Thread Jonathan Ellis
https://issues.apache.org/jira/browse/CASSANDRA-856

On Tue, May 11, 2010 at 3:44 PM, Tobias Jungen  wrote:
> Yet another BMT question, though this may apply to regular memtables as
> well...
>
> After doing a batch insert, I accidentally submitted the flush command
> twice. To my surprise, the target node's log indicates that it wrote a new
> *-Data.db file, and the disk usage went up accordingly. I tested and issued
> the flush command a few more times, and after a few more data files I
> eventually triggered a compaction, bringing the disk usage back down. The
> data appears to continue to stick around in memory, however, as further
> flush commands continue to result in new data files.
>
> Shouldn't flushing a memtable remove it from memory, or is it expected behavior
> that it sticks around until the node needs to reclaim the memory? Should I
> worry about getting out-of-memory errors if I'm doing lots of inserts in
> this manner?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: nodetool drain disables writes?

2010-05-12 Thread Jonathan Ellis
On Tue, May 11, 2010 at 4:18 PM, Anthony Molinaro
 wrote:
> Hi,
>
>  I thought that 'nodetool drain' was supposed to flush the commit logs
> through the system, which it appears to do (verified by running ls in
> the commit log directory and seeing no files).
>
> However, it also appears to disable writes completely (ie, scripts attempting
> to write data were frozen, ls of commit log directory showed no files), is
> that expected behavior?

Yes.  The intent is that after the drain your commitlog is empty (e.g.
for upgrading to 0.7).  If it were to continue to accept writes that
would not be the case.

If you just want a flush, run `nodetool flush`.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Timed out reads still in queue

2010-05-12 Thread Jonathan Ellis
This is a slightly different way of describing
https://issues.apache.org/jira/browse/CASSANDRA-685

On Tue, May 11, 2010 at 9:01 PM, Jeremy Dunck  wrote:
> Reddit posted a blog entry about some recent downtime, partially due
> to issues with Cassandra.
> http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html
>
> This part surprised me:
> "
> First, Cassandra has an internal queue of work to do. When it times
> out a client (10s by default), it still leaves the operation in the
> queue of work to complete (even though the person that asked for the
> read is no longer even holding the socket), which given a constant
> stream of requests makes the amount of pending work snowball
> effectively infinitely (specifically, ROW-READ-STAGE's PENDING
> operations grow unbounded).
> "
>
> I've searched Jira for an issue related to this -- it seems like a bug
> to have reads in queue when the result is useless (because the reader
> is gone).  Obviously a 10-second read is not a normal run condition,
> but removing stale reads could remove a cause of cascading failure.
>
> Should I open a ticket, or have I misunderstood something?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Unexpected TType error on web traffic increase

2010-05-12 Thread Jonathan Ellis
Sounds like the sort of error you'd see if you were using
thread-unsafe Thrift clients on multiple threads.
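
A minimal sketch of the usual fix, in Java for illustration (host and port
are made up): give each thread its own connection and client rather than
sharing one across threads.

// One Thrift connection per thread; generated Thrift clients are not thread-safe.
private static final ThreadLocal<Cassandra.Client> CLIENT =
    new ThreadLocal<Cassandra.Client>() {
        @Override protected Cassandra.Client initialValue() {
            try {
                TTransport transport = new TSocket("192.168.0.123", 9160);
                transport.open();
                return new Cassandra.Client(new TBinaryProtocol(transport));
            } catch (TTransportException e) {
                throw new RuntimeException(e);
            }
        }
    };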

On Tue, May 11, 2010 at 11:23 PM, Waqas Badar
 wrote:
> Dear all,
>
> We are using Cassandra on a website. Whenever website traffic increases, we
> get the following error (Python):
>
> File "/usr/local/lib/python2.6/dist-packages/pycassa/columnfamily.py", line
> 199, in multiget
>    self._rcl(read_consistency_level))
>
> File "/usr/local/lib/python2.6/dist-packages/pycassa/connection.py", line
> 114, in client_call
>    return getattr(self._client, attr)(*args, **kwargs)
>
> File "/usr/local/lib/python2.6/dist-packages/cassandra/Cassandra.py", line
> 346, in multiget_slice
>    return self.recv_multiget_slice()
>
> File "/usr/local/lib/python2.6/dist-packages/cassandra/Cassandra.py", line
> 368, in recv_multiget_slice
>    result.read(self._iprot)
>
> File "/usr/local/lib/python2.6/dist-packages/cassandra/Cassandra.py", line
> 1767, in read
>    fastbinary.decode_binary(self, iprot.trans, (self.__class__,
> self.thrift_spec))
>
> TypeError: Unexpected TType
>
> Has anybody else ever hit this issue? I Googled it but found no help. We
> are using pycassa for Cassandra connectivity. Cassandra version is 0.5.1.
>
> Can anybody also advise how we should use Cassandra for a website? E.g. since
> each page access starts a new process, how can we implement
> connection pooling for Cassandra on a website?
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: timeout while running simple hadoop job

2010-05-12 Thread Jonathan Ellis
On Wed, May 12, 2010 at 5:11 AM, gabriele renzi  wrote:
> - is it possible that such errors show up on the client side as
> timeoutErrors when they could be reported better?

No, if the node the client is talking to doesn't get a reply from the
data node, there is no way for it to magically find out what happened
since ipso facto it got no reply.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: timeout while running simple hadoop job

2010-05-12 Thread gabriele renzi
On Wed, May 12, 2010 at 4:43 PM, Jonathan Ellis  wrote:
> On Wed, May 12, 2010 at 5:11 AM, gabriele renzi  wrote:
>> - is it possible that such errors show up on the client side as
>> timeoutErrors when they could be reported better?
>
> No, if the node the client is talking to doesn't get a reply from the
> data node, there is no way for it to magically find out what happened
> since ipso facto it got no reply.

Sorry I was not clear: I meant the first error (where we get a
RuntimeException in reading the file, not in the socket.accept()).
There we have a reasonable error message (either "too many open files"
or "corrupt sstable") that does not appear client side.



-- 
blog en: http://www.riffraff.info
blog it: http://riffraff.blogsome.com


Cache capacities set by nodetool

2010-05-12 Thread James Golick
When I first brought this cluster online, the storage-conf.xml file had a
few cache capacities set. Since then, we've completely changed how we use
cassandra's caching, and no longer use any of the caches I setup in the
original configuration.

I'm finding that cassandra doesn't want to keep my new cache settings. I can
disable those caches with setcachecapacity, check cfstats and see that they
are disabled, but within a few minutes, cfstats is reporting that the caches
are back.

Anything I can do?


Re: timeout while running simple hadoop job

2010-05-12 Thread Héctor Izquierdo
Have you checked your open file descriptor limit? You can do that by using
"ulimit" in the shell. If it's too low, you will encounter the "too many
open files" error. You can also see how many open handles an
application has with "lsof".


Héctor Izquierdo

On 12/05/10 17:00, gabriele renzi wrote:
> On Wed, May 12, 2010 at 4:43 PM, Jonathan Ellis  wrote:
>> On Wed, May 12, 2010 at 5:11 AM, gabriele renzi  wrote:
>>> - is it possible that such errors show up on the client side as
>>> timeoutErrors when they could be reported better?
>>
>> No, if the node the client is talking to doesn't get a reply from the
>> data node, there is no way for it to magically find out what happened
>> since ipso facto it got no reply.
>
> Sorry I was not clear: I meant the first error (where we get a
> RuntimeException in reading the file, not in the socket.accept()).
> There we have a reasonable error message (either "too many open files"
> or "corrupt sstable") that does not appear client side.


Re: Cache capacities set by nodetool

2010-05-12 Thread Ryan King
It's a bug:

https://issues.apache.org/jira/browse/CASSANDRA-1079

-ryan

On Wed, May 12, 2010 at 8:16 AM, James Golick  wrote:
> When I first brought this cluster online, the storage-conf.xml file had a
> few cache capacities set. Since then, we've completely changed how we use
> cassandra's caching, and no longer use any of the caches I setup in the
> original configuration.
> I'm finding that cassandra doesn't want to keep my new cache settings. I can
> disable those caches with setcachecapacity, check cfstats and see that they
> are disabled, but within a few minutes, cfstats is reporting that the caches
> are back.
> Anything I can do?


Re: timeout while running simple hadoop job

2010-05-12 Thread Johan Oskarsson
Looking over the code, this is in fact an issue in 0.6.
It's fixed in trunk/0.7: connections will be reused and closed properly. See
https://issues.apache.org/jira/browse/CASSANDRA-1017 for more details.

We can either backport that patch or at least make the connections close
properly in 0.6. Can you open a ticket for this bug?
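
For anyone stuck on 0.6 in the meantime, the shape of the fix is just to
close each per-split transport in a finally block (a hedged sketch, names
illustrative):

TTransport transport = new TSocket(host, port);
try {
    transport.open();
    Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
    // ... describe_splits / get_range_slices calls for this split ...
} finally {
    if (transport.isOpen())
        transport.close();   // give the fd back even if the calls throw
}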

/Johan

On 12 maj 2010, at 12.11, gabriele renzi wrote:

> A follow-up for anyone who may end up on this conversation again:
> 
> I kept trying, and neither changing the number of concurrent map tasks
> nor the slice size helped.
> Finally, I found a screw-up in our logging system, which had prevented us
> from noticing a couple of recurring errors in the logs:
> 
> ERROR [ROW-READ-STAGE:1] 2010-05-11 16:43:32,328
> DebuggableThreadPoolExecutor.java (line 101) Error in
> ThreadPoolExecutor
> java.lang.RuntimeException: java.lang.RuntimeException: corrupt sstable
>at 
> org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:53)
>at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.RuntimeException: corrupt sstable
>at 
> org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:73)
>at 
> org.apache.cassandra.db.ColumnFamilyStore.getKeyRange(ColumnFamilyStore.java:907)
>at 
> org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1000)
>at 
> org.apache.cassandra.service.RangeSliceVerbHandler.doVerb(RangeSliceVerbHandler.java:41)
>... 4 more
> Caused by: java.io.FileNotFoundException:
> /path/to/data/Keyspace/CF-123-Index.db (Too many open files)
>at java.io.RandomAccessFile.open(Native Method)
>at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
>at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
>at 
> org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:143)
>at 
> org.apache.cassandra.io.util.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:138)
>at 
> org.apache.cassandra.io.SSTableReader.getNearestPosition(SSTableReader.java:414)
>at 
> org.apache.cassandra.io.SSTableScanner.seekTo(SSTableScanner.java:62)
>... 7 more
> 
> and the related
> 
> WARN [main] 2010-05-11 16:43:38,076 TThreadPoolServer.java (line 190)
> Transport error occurred during acceptance of message.
> org.apache.thrift.transport.TTransportException:
> java.net.SocketException: Too many open files
>at 
> org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
>at 
> org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
>at 
> org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>at 
> org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:184)
>at 
> org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:149)
>at 
> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:190)
> Caused by: java.net.SocketException: Too many open files
>at java.net.PlainSocketImpl.socketAccept(Native Method)
>at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
>at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>at java.net.ServerSocket.accept(ServerSocket.java:421)
>at 
> org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
>... 5 more
> 
> The client was reporting timeouts in this case.
> 
> 
> The max fd limit on the process was in fact not exceedingly high
> (1024) and raising it seems to have solved the problem.
> 
> Anyway, it still seems that there may be two issues:
> 
> - since we had never seen this error before with normal client
> connections (as in: non-Hadoop), is it possible that the
> Cassandra/Hadoop layer is not closing sockets properly between one
> connection and the next, or not reusing connections efficiently?
> E.g. TSocket seems to have a close() method but I don't see it used in
> ColumnFamilyInputFormat.(getSubSplits, getRangeMap) but it may well be
> inside CassandraClient.
> 
> Anyway, judging by lsof's output I can only see about a hundred TCP
> connections, but those from the hadoop jobs seem to always be below 60
> so this may just be my wrong impression.
> 
> - is it possible that such errors show up on the client side as
> timeoutErrors when they could be reported better? this would probably
> help other people in diagnosing/reporting internal errors in the
> future.
> 
> 
> Thanks again to everyone for the help with this; I promise I'll put the
> discussion on the wiki for future reference :)



Re: Real-time Web Analysis tool using Cassandra. Doubts...

2010-05-12 Thread Utku Can Topçu
What makes cassandra a poor choice is the fact that, you can't use a
keyrange as input for the map phase for Hadoop.


On Wed, May 12, 2010 at 4:37 PM, Jonathan Ellis  wrote:

> On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati
>  wrote:
> > - First of all, my first thought is to have two CFs, one for raw client
> > requests (~10 million++ per day) and another for aggregated metrics at some
> > defined interval like 1min, 5min, 15min... Is this a good approach ?
>
> Sure.
>
> > - Is it a good idea to use an OrderPreservingPartitioner ? To maintain the
> > order of my requests in the raw data CF ? Or is the overhead too big ?
>
> The problem with OPP isn't overhead (it is lower-overhead than RP) but
> the tendency to have hotspots in sequentially-written data.
>
> > - Initially the cluster will contain only three nodes, is it a problem (too
> > few maybe) ?
>
> You'll have to do some load testing to see.
>
> > - I think the best way to do the aggregation job is through a hadoop
> > MapReduce job. Right ? Is there any other way to consider ?
>
> Map/Reduce is usually better than rolling your own because it
> parallelizes for you.
>
> > - Is Cassandra really suitable for it ? Maybe HBase is better in this
> > case?
>
> Nothing here makes me think "Cassandra is a poor choice."
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: (Binary)Memtable flushing

2010-05-12 Thread Tobias Jungen
D'oh, forgot to search the JIRA on this one. Thanks Jonathan!

On Wed, May 12, 2010 at 9:37 AM, Jonathan Ellis  wrote:

> https://issues.apache.org/jira/browse/CASSANDRA-856
>
> On Tue, May 11, 2010 at 3:44 PM, Tobias Jungen 
> wrote:
> > Yet another BMT question, though this may apply to regular memtables as
> > well...
> >
> > After doing a batch insert, I accidentally submitted the flush command
> > twice. To my surprise, the target node's log indicates that it wrote a new
> > *-Data.db file, and the disk usage went up accordingly. I tested and issued
> > the flush command a few more times, and after a few more data files I
> > eventually triggered a compaction, bringing the disk usage back down. The
> > data appears to continue to stick around in memory, however, as further
> > flush commands continue to result in new data files.
> >
> > Shouldn't flushing a memtable remove it from memory, or is it expected
> > behavior that it sticks around until the node needs to reclaim the memory?
> > Should I worry about getting out-of-memory errors if I'm doing lots of
> > inserts in this manner?
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: timeout while running simple hadoop job

2010-05-12 Thread gabriele renzi
On Wed, May 12, 2010 at 5:46 PM, Johan Oskarsson  wrote:
> Looking over the code this is in fact an issue in 0.6.
> It's fixed in trunk/0.7. Connections will be reused and closed properly, see 
> https://issues.apache.org/jira/browse/CASSANDRA-1017 for more details.
>
> We can either backport that patch or make at least close the connections 
> properly in 0.6. Can you open an ticket for this bug?

done as https://issues.apache.org/jira/browse/CASSANDRA-1081


Re: Is it possible to delete records based upon where condition

2010-05-12 Thread Benjamin Black
The functionality of a WHERE clause usually means maintaining an
inverted index, usually another CF, on the information of interest
(ses_tstamp in your example).  You then retrieve index rows from that
CF to find the data rows.


b

On Wed, May 12, 2010 at 5:34 AM, Moses Dinakaran
 wrote:
> Hi All,
>
> In Cassandra it possible to remove records based upon where condition.
>
> We are planning to move the session and cache table from MySql to Cassandra
> and where doing the fesability study. Everything seems to be Ok other than
> garbage collection of session table.
>
> Was not able to  remove super columns from the Key based upon certain
> conditions in the column value.
>
> Follows the schema for Session tables
>
>
> FESession  = {
>     sessions : {
>     1001 : {
>     ses_name       : "user",
>     ses_userid      :  258,
>     ses_tstamp    : “1273504587”
>     },
>     1002 : {
>     ses_name    : "user",
>     ses_userid    :  259,
>     ses_tstamp    : “1273504597”
>     },
>     },
> }
>
> I wanted to remove the records based upon the value of the column ses_tstamp
> ie (delete from sessions where ses_tstamp between XXX & YYY OR delete from
> session where ses_tstamp < XXX )
>
> Is it possible to achieve this in Cassandra If so  please let me know how.
>
> To my knowledge I dont see any kind of where condition in Casandra,
> If that is the case also please let me know the other ways like
> restructuring the Schema or whatever...
>
> Regards
> Moses
>
>
>
>
>
>
>


Load Balancing Mapper Tasks

2010-05-12 Thread Joost Ouwerkerk
I've been trying to improve the time it takes to map 30 million rows using a
hadoop / cassandra cluster with 30 nodes.  I discovered that since
ColumnFamilyInputFormat returns an ordered list of splits, when there are many
splits (e.g. hundreds or more) the load on cassandra is horribly unbalanced.
E.g. if I have 30 tasks processing 600 splits, then the first 30 splits are
all located on the same one or two nodes.

I added Collections.shuffle(splits) before returning the splits in
getSplits(), as sketched below.  As a result, the load is much better
distributed, throughput was increased (about 3X in my case) and
TimedOutExceptions were all but eliminated.
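
For reference, a sketch of the change (against the 0.6 Hadoop integration;
subclassing is just the least invasive way to apply it, and the throws
clause may need adjusting to your version):

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class ShuffledColumnFamilyInputFormat extends ColumnFamilyInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = super.getSplits(context);
        Collections.shuffle(splits); // spread the first N splits over all nodes
        return splits;
    }
}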

Joost.


Re: Cache capacities set by nodetool

2010-05-12 Thread James Golick
Hmmm that's definitely what we're seeing. Although, we aren't seeing
cache settings being picked up properly on a restart either.

On Wed, May 12, 2010 at 8:42 AM, Ryan King  wrote:

> It's a bug:
>
> https://issues.apache.org/jira/browse/CASSANDRA-1079
>
> -ryan
>
> On Wed, May 12, 2010 at 8:16 AM, James Golick 
> wrote:
> > When I first brought this cluster online, the storage-conf.xml file had a
> > few cache capacities set. Since then, we've completely changed how we use
> > cassandra's caching, and no longer use any of the caches I setup in the
> > original configuration.
> > I'm finding that cassandra doesn't want to keep my new cache settings. I can
> > disable those caches with setcachecapacity, check cfstats and see that they
> > are disabled, but within a few minutes, cfstats is reporting that the caches
> > are back.
> > Anything I can do?
>


Re: Cache capacities set by nodetool

2010-05-12 Thread James Golick
Picked up out of config, I mean.

On Wed, May 12, 2010 at 11:10 AM, James Golick wrote:

> Hmmm that's definitely what we're seeing. Although, we aren't seeing
> cache settings being picked up properly on a restart either.
>
>
> On Wed, May 12, 2010 at 8:42 AM, Ryan King  wrote:
>
>> It's a bug:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-1079
>>
>> -ryan
>>
>> On Wed, May 12, 2010 at 8:16 AM, James Golick 
>> wrote:
>> > When I first brought this cluster online, the storage-conf.xml file had a
>> > few cache capacities set. Since then, we've completely changed how we use
>> > cassandra's caching, and no longer use any of the caches I setup in the
>> > original configuration.
>> > I'm finding that cassandra doesn't want to keep my new cache settings. I can
>> > disable those caches with setcachecapacity, check cfstats and see that they
>> > are disabled, but within a few minutes, cfstats is reporting that the caches
>> > are back.
>> > Anything I can do?
>>
>
>


RE: Extremly slow inserts on LAN

2010-05-12 Thread Stephan Pfammatter
Hi, I read your post and noticed you are running Cassandra on Win 2008.
Do you run it in a production environment?

I'm contacting you because there aren't that many Windows installations. I need
to provide a live Cassandra environment on Win 2008 and stumbled into some
problems with nodetool and the storage-conf.xml setup.

Best, Stephan

From: Arie Keren [mailto:a...@doubleverify.com]
Sent: Sunday, May 09, 2010 10:07 AM
To: user@cassandra.apache.org
Subject: Extremly slow inserts on LAN

While making our first steps with Cassandra, we experience slow inserts working 
on LAN.
Inserting 7000 keys with 1 column family takes about 10 seconds when Cassandra 
server running on the same host with the client.
But when server runs on a different host on LAN, the same inserts take more 
than 10 (!) minutes.
In both cases Cassandra server contains a single node.

We use Cassandra version 0.6.0 running on Windows server 2008.
The client is .NET c# application.










Re: Human readable Cassandra limitations

2010-05-12 Thread Paul Prescod
On Wed, May 12, 2010 at 2:02 AM, David Vanderfeesten  wrote:
>...
>
> My concern with the denormalization approach is that it shouldn't be managed
> by the client side, because this has a big impact on your throughput.  Is the
> map-reduce in that respect any better?
> Wouldn't it be nice to support a kind of PL-(No)SQL server side scripting
> that allows you to create and maintain materialized views? You might still
> give it as an option to maintain the view synchronously (extension of
> current row-level-atomicity)  or asynchronously.

I think that this is what you're talking about:

https://issues.apache.org/jira/browse/CASSANDRA-1016

 Paul Prescod


Re: Is it possible to delete records based upon where condition

2010-05-12 Thread Joel Pitt
On Thu, May 13, 2010 at 12:34 AM, Moses Dinakaran
 wrote:
> I wanted to remove the records based upon the value of the column ses_tstamp
> ie (delete from sessions where ses_tstamp between XXX & YYY OR delete from
> session where ses_tstamp < XXX )
>
> Is it possible to achieve this in Cassandra If so  please let me know how.

From my limited understanding of Cassandra (I haven't actually deployed
it and have mostly just been learning from reading for now), the way
you'd want to do this is by either rearranging your schema (or
manually creating an index) so that the timestamps are used as the
column names within a CF that is sorted by the timestamp.

Joel Pitt, PhD | http://ferrouswheel.me
OpenCog Developer | http://opencog.org
Board member, Humanity+ | http://humanityplus.org


how does cassandra compare with mongodb?

2010-05-12 Thread S Ahmed
I tried searching mail-archive, but the search feature is a bit wacky (or
more probably I don't know how to use it).

What are the key differences between Cassandra and Mongodb?

Is there a particular use case where each solution shines?


paging row keys

2010-05-12 Thread Corey Hulen
Can someone point me to a Thrift sample (preferably Java) to list all the
rows in a ColumnFamily for my Cassandra server.  I noticed some examples
using SlicePredicate and SliceRange to perform a similar query against the
columns with paging, but I was looking for something similar for rows with
paging.  I'm also new to Cassandra, so I might have misunderstood the
existing samples.

-Corey


Re: key is sorted?

2010-05-12 Thread Vijay
If you use the Random partitioner, you will *NOT* get row keys sorted.
(Columns are always sorted.)

Answer, if using the Random partitioner:
True True
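
As a standalone illustration (not project code) of why hashing breaks key
order:

import java.math.BigInteger;
import java.security.MessageDigest;

public class TokenOrder {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String key : new String[] { "abc", "abd", "abe" }) {
            // The Random Partitioner stores rows ordered by an MD5 token like this.
            BigInteger token = new BigInteger(1, md5.digest(key.getBytes("UTF-8")));
            System.out.println(key + " -> " + token.toString(16));
        }
        // The tokens do not come out in key order, so a range scan still
        // visits every row, but in token order rather than key order.
    }
}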

Regards,




On Wed, May 12, 2010 at 1:25 AM, David Boxenhorn  wrote:

> Can you do any kind of range slice, e.g. for keys beginning with "abc"? But the
> results will not be ordered?
>
> Please answer one of the following:
>
> True True
> True False
> False False
>
> Explain?
>
> Thanks!
>
>
> On Sun, May 9, 2010 at 8:27 PM, Vijay  wrote:
>
>> True. Range slice support was enabled in the Random Partitioner for the
>> Hadoop support.
>>
>> The Random Partitioner actually hashes the key, and it is the hashes that
>> are sorted, so we cannot have the actual keys in order (hope this doesn't
>> confuse you)...
>>
>> Regards,
>> 
>>
>>
>>
>> On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn wrote:
>>
>>> This is something that I'm not sure that I understand. Can somebody
>>> confirm/deny that I understand it? Thanks.
>>>
>>> If you use random partitioning, you can loop through all keys with a
>>> range query, but they will not be sorted.
>>>
>>> True or False?
>>>
>>> On Sat, May 8, 2010 at 3:45 AM, AJ Chen  wrote:
>>>
 thanks, that works. -aj


 On Fri, May 7, 2010 at 1:17 PM, Stu Hood wrote:

> Your IPartitioner implementation decides how the row keys are sorted:
> see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner. 
> You need to be using one of the OrderPreservingPartitioners if you'd like
> a reasonable order for the keys.
>
> -Original Message-
> From: "AJ Chen" 
> Sent: Friday, May 7, 2010 3:10pm
> To: user@cassandra.apache.org
> Subject: key is sorted?
>
> I have a super column family for "topic", key being the name of the
> topic.
> <ColumnFamily Name="topic" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
> When I retrieve the rows, the rows are not sorted by the key. Is the
> row key
> sorted in cassandra by default?
>
> -aj
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>
>
>


 --
 AJ Chen, PhD
 Chair, Semantic Web SIG, sdforum.org
 http://web2express.org
 twitter @web2express
 Palo Alto, CA, USA

>>>
>>>
>>
>


Re: paging row keys

2010-05-12 Thread Nathan McCall
Here is a basic example using get_range_slices to retrieve 500 rows via hector:
http://github.com/bosyak/hector/blob/master/src/main/java/me/prettyprint/cassandra/examples/ExampleReadAllKeys.java

To page, use the last key you got back as the start key.
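
For anyone working against the raw 0.6 Thrift API instead, a hedged sketch
of the same loop (keyspace and CF names invented), paging 100 rows at a time:

SlicePredicate predicate = new SlicePredicate();
predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 10));

String startKey = "";
boolean firstPage = true;
while (true) {
    KeyRange range = new KeyRange();
    range.setStart_key(startKey);
    range.setEnd_key("");
    range.setCount(101);                       // 100 rows plus 1 overlap row
    List<KeySlice> page = client.get_range_slices("Keyspace1",
            new ColumnParent("Standard1"), predicate, range, ConsistencyLevel.ONE);
    // Every page after the first starts with the previous page's last key; skip it.
    for (int i = firstPage ? 0 : 1; i < page.size(); i++) {
        System.out.println(page.get(i).getKey());
    }
    if (page.size() < 101)                     // a short page means no more rows
        break;
    startKey = page.get(page.size() - 1).getKey();
    firstPage = false;
}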

-Nate

On Wed, May 12, 2010 at 3:37 PM, Corey Hulen  wrote:
>
> Can someone point me to a Thrift sample (preferably Java) to list all the
> rows in a ColumnFamily for my Cassandra server.  I noticed some examples
> using SlicePredicate and SliceRange to perform a similar query against the
> columns with paging, but I was looking for something similar for rows with
> paging.  I'm also new to Cassandra, so I might have misunderstood the
> existing samples.
> -Corey
>
>
>


Re: paging row keys

2010-05-12 Thread Nathan McCall
Oh, thanks to Andrey Panov for providing that example, btw. We are
always looking for good usage examples to post on the Hector wiki if
anyone else has them.
-Nate

On Wed, May 12, 2010 at 5:01 PM, Nathan McCall  wrote:
> Here is a basic example using get_range_slices to retrieve 500 rows via 
> hector:
> http://github.com/bosyak/hector/blob/master/src/main/java/me/prettyprint/cassandra/examples/ExampleReadAllKeys.java
>
> To page, use the last key you got back as the start key.
>
> -Nate
>
> On Wed, May 12, 2010 at 3:37 PM, Corey Hulen  wrote:
>>
>> Can someone point me to a Thrift sample (preferably Java) to list all the
>> rows in a ColumnFamily for my Cassandra server.  I noticed some examples
>> using SlicePredicate and SliceRange to perform a similar query against the
>> columns with paging, but I was looking for something similar for rows with
>> paging.  I'm also new to Cassandra, so I might have misunderstood the
>> existing samples.
>> -Corey
>>
>>
>>
>


Re: key is sorted?

2010-05-12 Thread Jonathan Shook
Although, if the replication factor spans all nodes, then the disparity in
row allocation should be a non-issue when using
OrderPreservingPartitioner.

On Wed, May 12, 2010 at 6:42 PM, Vijay  wrote:
> If you use the Random partitioner, you will NOT get row keys sorted. (Columns
> are always sorted.)
> Answer, if using the Random partitioner:
> True True
>
> Regards,
> 
>
>
>
> On Wed, May 12, 2010 at 1:25 AM, David Boxenhorn  wrote:
>>
>> Can you do any kind of range slice, e.g. for keys beginning with "abc"? But the
>> results will not be ordered?
>>
>> Please answer one of the following:
>>
>> True True
>> True False
>> False False
>>
>> Explain?
>>
>> Thanks!
>>
>> On Sun, May 9, 2010 at 8:27 PM, Vijay  wrote:
>>>
>>> True. Range slice support was enabled in the Random Partitioner for the
>>> Hadoop support.
>>> The Random Partitioner actually hashes the key, and it is the hashes that
>>> are sorted, so we cannot have the actual keys in order (hope this doesn't
>>> confuse you)...
>>> Regards,
>>> 
>>>
>>>
>>>
>>> On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn 
>>> wrote:

 This is something that I'm not sure that I understand. Can somebody
 confirm/deny that I understand it? Thanks.

 If you use random partitioning, you can loop through all keys with a
 range query, but they will not be sorted.

 True or False?

 On Sat, May 8, 2010 at 3:45 AM, AJ Chen  wrote:
>
> thanks, that works. -aj
>
> On Fri, May 7, 2010 at 1:17 PM, Stu Hood 
> wrote:
>>
>> Your IPartitioner implementation decides how the row keys are sorted:
>> see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . 
>> You
>> need to be using one of the OrderPreservingPartitioners if you'd like a
>> reasonable order for the keys.
>>
>> -Original Message-
>> From: "AJ Chen" 
>> Sent: Friday, May 7, 2010 3:10pm
>> To: user@cassandra.apache.org
>> Subject: key is sorted?
>>
>> I have a super column family for "topic", key being the name of the
>> topic.
>> <ColumnFamily Name="topic" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
>> When I retrieve the rows, the rows are not sorted by the key. Is the
>> row key
>> sorted in cassandra by default?
>>
>> -aj
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>>
>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA

>>>
>>
>
>


Re: how does cassandra compare with mongodb?

2010-05-12 Thread philip andrew
Hi,

From my understanding, Cassandra entities are indexed on only one key, so
this can be a problem if you are searching by two values, for example
if you are storing an entity with an x,y and then wish to search for entities
in a box, i.e. x>5 and x<10 and y>5 and y<10. MongoDB can do this; Cassandra
cannot, due to only indexing on one key.

Cassandra can scale automatically just by adding nodes, giving almost infinite
storage easily; MongoDB requires database administration to add nodes,
setting up replication or allowing sharding, but it is not too complex.

MongoDB requires you to create shard keys if you want to scale
horizontally; Cassandra just scales horizontally automatically.

Cassandra requires the schema to be defined before the database starts;
MongoDB can have any schema at run-time, just like a normal database.

In the end I chose MongoDB as I require more indexes than Cassandra
provides, although I really like Cassandra's ability to store an almost
infinite amount of data just by adding nodes.

Thanks, Phil

On Thu, May 13, 2010 at 5:57 AM, S Ahmed  wrote:

> I tried searching mail-archive, but the search feature is a bit wacky (or
> more probably I don't know how to use it).
>
> What are the key differences between Cassandra and Mongodb?
>
> Is there a particular use case where each solution shines?
>


Re: Apache Cassandra and Net::Cassandra::Easy

2010-05-12 Thread Scott Doty
Well, ain't that a kick in the tush. :-/

I grabbed the svn trunk, but the build failed, probably because Fedora 11 is
too old for this bleeding-edge stuff. The server is upgrading itself now, but I
wondered: is anyone using an rpm-based distro for Net::Cassandra::Easy and the
svn Cassandra?

Thanks. :)


"Ted Zlatanov"  wrote:
>SD> This is using Net-Cassandra-Easy-0.14-7Jcufu and Cassandra 0.6.1.
>
>I'll stabilize N::C::Easy at 0.7 and support both 0.7 and trunk from
>that point on.  Sorry but 0.6.x is not supported.  I made this a
>stronger statement in the N::C::Easy docs as of next release; sorry for
>the inconvenience.
>
>Ted

--
Sent from my Android phone with K-9. Please excuse my brevity.


Re: how does cassandra compare with mongodb?

2010-05-12 Thread Jonathan Shook
You can choose to have keys ordered by using an
OrderPreservingPartitioner, with the trade-off that key ranges can get
denser on certain nodes than others.

On Wed, May 12, 2010 at 7:48 PM, philip andrew  wrote:
>
> Hi,
> From my understanding, Cassandra entities are indexed on only one key, so
> this can be a problem if you are searching for example by two values such as
> if you are storing an entity with a x,y then wish to search for entities in
> a box ie x>5 and x<10 and y>5 and y<10. MongoDB can do this, Cassandra
> cannot due to only indexing on one key.
> Cassandra can scale automatically just by adding nodes, almost infinite
> storage easily, MongoDB requires database administration to add nodes,
> setting up replication or allowing sharding, but not too complex.
> MongoDB requires you to create sharded keys if you want to scale
> horizontally, Cassandra just works automatically for scale horizontally.
> Cassandra requires the schema to be defined before the database starts,
> MongoDB can have any schema at run-time just like a normal database.
> In the end I choose MongoDB as I require more indexes than Cassandra
> provides, although I really like Cassandras ability to store almost infinite
> amount of data just by adding nodes.
> Thanks, Phil
>
> On Thu, May 13, 2010 at 5:57 AM, S Ahmed  wrote:
>>
>> I tried searching mail-archive, but the search feature is a bit wacky (or
>> more probably I don't know how to use it).
>> What are the key differences between Cassandra and Mongodb?
>> Is there a particular use case where each solution shines?
>


Re: key is sorted?

2010-05-12 Thread David Boxenhorn
Thank you. That is very good news. I can sort the results myself - what is
important is that I get them!

On Thu, May 13, 2010 at 2:42 AM, Vijay  wrote:

> If you use the Random partitioner, you will *NOT* get row keys sorted.
> (Columns are always sorted.)
>
> Answer, if using the Random partitioner:
> True True
>
> Regards,
> 
>
>
>
>
> On Wed, May 12, 2010 at 1:25 AM, David Boxenhorn wrote:
>
>> Can you do any kind of range slice, e.g. for keys beginning with "abc"? But the
>> results will not be ordered?
>>
>> Please answer one of the following:
>>
>> True True
>> True False
>> False False
>>
>> Explain?
>>
>> Thanks!
>>
>>
>> On Sun, May 9, 2010 at 8:27 PM, Vijay  wrote:
>>
>>> True. Range slice support was enabled in the Random Partitioner for the
>>> Hadoop support.
>>>
>>> The Random Partitioner actually hashes the key, and it is the hashes that
>>> are sorted, so we cannot have the actual keys in order (hope this doesn't
>>> confuse you)...
>>>
>>> Regards,
>>> 
>>>
>>>
>>>
>>> On Sun, May 9, 2010 at 12:00 AM, David Boxenhorn wrote:
>>>
 This is something that I'm not sure that I understand. Can somebody
 confirm/deny that I understand it? Thanks.

 If you use random partitioning, you can loop through all keys with a
 range query, but they will not be sorted.

 True or False?

 On Sat, May 8, 2010 at 3:45 AM, AJ Chen  wrote:

> thanks, that works. -aj
>
>
> On Fri, May 7, 2010 at 1:17 PM, Stu Hood wrote:
>
>> Your IPartitioner implementation decides how the row keys are sorted:
>> see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner. 
>> You need to be using one of the OrderPreservingPartitioners if you'd like
>> a reasonable order for the keys.
>>
>> -Original Message-
>> From: "AJ Chen" 
>> Sent: Friday, May 7, 2010 3:10pm
>> To: user@cassandra.apache.org
>> Subject: key is sorted?
>>
>> I have a super column family for "topic", key being the name of the
>> topic.
>> <ColumnFamily Name="topic" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
>> When I retrieve the rows, the rows are not sorted by the key. Is the
>> row key
>> sorted in cassandra by default?
>>
>> -aj
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>


>>>
>>
>