iteration does not yield all data with consistency ONE

2010-11-10 Thread Eric van Orsouw
Hello,

We have a cluster of 4 nodes (0.6.6) and use the random partitioner and a 
replication of 2.
When I insert a number of rows I can always retrieve them by their explicit id 
(get_range_slices("","", 1)).
Playing with consistency levels and temporarily shutting down a Cassandra node 
all yields the expected result.

However when I use get_range_slices("","", n) to iterate over all rows, I 
sometimes don't get anything (depending on the node).

I then reduced the problem to inserting just a single row.
Specifically, the 'iteration' only seems to succeed when I issue the request to 
the node that contains the first copy.
I discovered that when I iterate using a consistency level of Quorum/All, the 
iteration always succeeds and I properly get the one row.

So a solution would be to always use consistency level Quorum/All, but that has a 
performance penalty.

Can anyone explain why iterating using get_range_slices("","",n) does not 
always function with consistency level One on all nodes?

Thanks,
Eric

P.S. To rule out any discussion on whether or not to use iteration in the first 
place, we only plan to use it for backup and periodic cleanup cycles.
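
For context, a minimal sketch of the get_range_slices paging loop being discussed, assuming the 0.6 Thrift API; the keyspace and column family names are placeholders. With ConsistencyLevel.ONE the same loop is what intermittently returned nothing in this thread.

import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class RowIterator {
    private static final int PAGE_SIZE = 100;

    public static void iterateAllRows(Cassandra.Client client) throws Exception {
        // fetch up to 1000 columns per row; adjust as needed
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 1000));
        ColumnParent parent = new ColumnParent("Standard1");   // placeholder column family

        String startKey = "";
        while (true) {
            KeyRange range = new KeyRange(PAGE_SIZE);
            range.setStart_key(startKey);
            range.setEnd_key("");
            // QUORUM here because the thread observes that ONE can come back empty
            List<KeySlice> page = client.get_range_slices(
                    "Keyspace1", parent, predicate, range, ConsistencyLevel.QUORUM);
            if (page.isEmpty()) {
                break;
            }
            for (KeySlice slice : page) {
                // process slice.getKey() and slice.getColumns() here
            }
            if (page.size() < PAGE_SIZE) {
                break;   // last page
            }
            // next page starts at the last key seen; skip the repeated row when processing
            startKey = page.get(page.size() - 1).getKey();
        }
    }
}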


Re: WordCount example problem

2010-11-10 Thread Patrik Modesto
Hi,

I'm trying the WordCount example and getting this error:

[12:33]$ ./bin/word_count
10/11/10 12:34:35 INFO WordCount: output reducer type: filesystem
10/11/10 12:34:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
10/11/10 12:34:36 INFO WordCount: XXX:text0
10/11/10 12:34:36 INFO mapred.JobClient: Running job: job_local_0001
10/11/10 12:34:36 INFO mapred.MapTask: io.sort.mb = 100
10/11/10 12:34:36 INFO mapred.MapTask: data buffer = 79691776/99614720
10/11/10 12:34:36 INFO mapred.MapTask: record buffer = 262144/327680
10/11/10 12:34:36 WARN mapred.LocalJobRunner: job_local_0001
java.lang.ClassCastException: java.nio.HeapByteBuffer cannot be cast to [B
       at WordCount$TokenizerMapper.map(WordCount.java:73)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
       at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
10/11/10 12:34:37 INFO mapred.JobClient:  map 0% reduce 0%
10/11/10 12:34:37 INFO mapred.JobClient: Job complete: job_local_0001
10/11/10 12:34:37 INFO mapred.JobClient: Counters: 0

I'm using cassandra 0.7.0beta3 (from latest trunk) on just one
machine. Is the example working for anybody?

Thanks,
P.


Re: Data management on a ring

2010-11-10 Thread aaron morton
If I understand you correctly, you just want to add 8 nodes to a ring that 
already has 2? 

You could add the nodes and manually assign them tokens following the 
guidelines here http://wiki.apache.org/cassandra/Operations

I'm not sure how to ensure the minimum amount of data transfer though. Adding 
all 8 at once is probably a bad idea. 

How about you make a new cluster of 8 nodes, manually assign tokens and then 
copy the data from the 2-node ring to the 8-node one. Then move the 2 original 
nodes into the new cluster?

Hope that helps.
Aaron
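
As a rough illustration of the manual token assignment described on that wiki page, evenly spaced RandomPartitioner tokens are simply i * 2^127 / N; a minimal sketch (the node count is just an example):

import java.math.BigInteger;

public class InitialTokens {
    public static void main(String[] args) {
        int nodeCount = 10;                                   // example ring size
        BigInteger ringSize = BigInteger.ONE.shiftLeft(127);  // 2^127
        for (int i = 0; i < nodeCount; i++) {
            // evenly spaced tokens: i * 2^127 / N
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodeCount));
            System.out.println("node " + i + " initial token: " + token);
        }
    }
}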

On 10 Nov 2010, at 20:56, Jean-Yves LEBLEU wrote:

> Hello all,
> 
> We have an installation of 10 nodes, and we chose to deploy 5 rings of 2 
> nodes.
> 
> We would like to change to a ring of 10 nodes.
> 
> Some data has to be replicated on the 10 nodes, some should stay on 2 nodes. 
> Do you have any idea or documentation pointer in order to have a ring of 10 
> nodes with such a data distribution?
> 
> Thanks for any answer.
> 
> Jean-Yves



Re: Data management on a ring

2010-11-10 Thread Jean-Yves LEBLEU
Thanks for the answer.

That was not exactly my point. I would like to know if, in a 10-node ring, it
is possible to restrict replication of some data to only 2 nodes, and
other data to all nodes?
Regards.
Jean-Yves

On Wed, Nov 10, 2010 at 11:17 AM, aaron morton wrote:

> If I understand your correctly, you just want to add 8 nodes to a ring that
> already has 2 ?
>
> You could add the nodes and manually assign them tokens following the
> guidelines here http://wiki.apache.org/cassandra/Operations
>
> I'm not sure how to ensure the minimum amount of data transfer though.
> Adding all 8 at once is probably a bad idea.
>
> How about you make a new cluster of 8 nodes, manually assign tokens and
> then copy the data from the 2 node ring to the 8 node. Then move the 2
> original nodes into the new cluster?
>
> Hope that helps.
> Aaron
>
> On 10 Nov 2010, at 20:56, Jean-Yves LEBLEU wrote:
>
> > Hello all,
> >
> > We have an installation of 10 nodes, and we choose to deploy 5 rings of 2
> nodes.
> >
> > We would like to change to a ring of 10 nodes.
> >
> > Some data have to be replicated on the 10 nodes, some should stay on 2
> nodes. Do you have any idea or documentation pointer in order to have a ring
> of 10 nodes with such data repartition ?
> >
> > Thanks for any answer.
> >
> > Jean-Yves
>
>


about key sorting and token partitioning

2010-11-10 Thread zangds
Hi,
I am using Cassandra to store a message stream, and want to use timestamps (like 
mmddhhMIss or something similar) as the keys.
So if I use RandomPartitioner, I will lose the order when using 
get_range_slices().
If I use OrderPreservingPartitioner, how should I configure Cassandra to balance 
the load between the nodes?

Thanks!

2010-11-10 
zangds 


Re: about key sorting and token partitioning

2010-11-10 Thread Peter Schuller
> I am using cassandra to store a message steam, and want to use timestamps
> (like mmddhhMIss or something alike) as the keys.
> So if I use RandomPartitioner, I will loose the order when using
> get_range_slices().
> If I use OrderPreservingPartitioner, how should I configure cassandra to
> make load balance between the nodes?

AFAIK there's no silver bullet to making the order preserving
partitioner easy to use w.r.t. node balancing in the situation you're
describing.

One thing to consider is to use the random partitioner (for its
simplicity in managing the cluster) and use a granular subset of the
timestamp as the row key. For example, you could have the row key be
mmddhh to get an entire hour per row.

A reasonable granularity would depend on your use-case; but the idea
is to be able to take advantage of the simplicity of using the random
partitioner, while having reasonable efficiency on range slices by
making each row contain a pretty large range such that any additional
overhead in jumping across nodes is negligible in comparison to the
other work done.

-- 
/ Peter Schuller
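
A minimal sketch of the hour-per-row layout suggested above; the exact row-key and column-name formats are illustrative only, not taken from the thread:

import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeBuckets {
    // row key groups an hour of messages, e.g. "2010111015"
    public static String rowKeyFor(Date messageTime) {
        return new SimpleDateFormat("yyyyMMddHH").format(messageTime);
    }

    // column name keeps the full timestamp, so messages stay ordered within the row
    public static String columnNameFor(Date messageTime) {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(messageTime);
    }
}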


Re: Data management on a ring

2010-11-10 Thread Jonathan Ellis
Yes, on a per-keyspace basis with NetworkTopologyStrategy (in 0.7).
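
A hedged sketch of what that per-keyspace placement could look like through the 0.7 Thrift schema API; the keyspace names, data center names and replica counts below are placeholders and would have to match the cluster's snitch configuration:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class PerKeyspaceReplication {
    private static final String NTS = "org.apache.cassandra.locator.NetworkTopologyStrategy";

    public static void createKeyspaces(Cassandra.Client client) throws Exception {
        // keyspace for data that should stay on only 2 nodes, all in one data center
        KsDef narrow = new KsDef("TwoReplicaData", NTS, 2, new ArrayList<CfDef>());
        narrow.setStrategy_options(Collections.singletonMap("DC1", "2"));
        client.system_add_keyspace(narrow);

        // keyspace for data that should be replicated widely, split across data centers
        Map<String, String> wideOptions = new HashMap<String, String>();
        wideOptions.put("DC1", "5");
        wideOptions.put("DC2", "5");
        KsDef wide = new KsDef("WideData", NTS, 10, new ArrayList<CfDef>());
        wide.setStrategy_options(wideOptions);
        client.system_add_keyspace(wide);
    }
}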

On Wed, Nov 10, 2010 at 4:40 AM, Jean-Yves LEBLEU  wrote:
> Thanks for the anwser.
>
> It was not exactly my point, I would like to know if in a 10 nodes rings if
> it is possible to restrict replication of some data to only 2 nodes, and
> other data to all nodes ?
> Regards.
> Jean-Yves
>
> On Wed, Nov 10, 2010 at 11:17 AM, aaron morton 
> wrote:
>>
>> If I understand your correctly, you just want to add 8 nodes to a ring
>> that already has 2 ?
>>
>> You could add the nodes and manually assign them tokens following the
>> guidelines here http://wiki.apache.org/cassandra/Operations
>>
>> I'm not sure how to ensure the minimum amount of data transfer though.
>> Adding all 8 at once is probably a bad idea.
>>
>> How about you make a new cluster of 8 nodes, manually assign tokens and
>> then copy the data from the 2 node ring to the 8 node. Then move the 2
>> original nodes into the new cluster?
>>
>> Hope that helps.
>> Aaron
>>
>> On 10 Nov 2010, at 20:56, Jean-Yves LEBLEU wrote:
>>
>> > Hello all,
>> >
>> > We have an installation of 10 nodes, and we choose to deploy 5 rings of
>> > 2 nodes.
>> >
>> > We would like to change to a ring of 10 nodes.
>> >
>> > Some data have to be replicated on the 10 nodes, some should stay on 2
>> > nodes. Do you have any idea or documentation pointer in order to have a 
>> > ring
>> > of 10 nodes with such data repartition ?
>> >
>> > Thanks for any answer.
>> >
>> > Jean-Yves
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: WordCount example problem

2010-11-10 Thread Jonathan Ellis
http://www.mail-archive.com/user@cassandra.apache.org/msg07093.html

On Wed, Nov 10, 2010 at 5:47 AM, Patrik Modesto
 wrote:
> Hi,
>
> I'm trying the WordCount example and getting this error:
>
> [12:33]$ ./bin/word_count
> 10/11/10 12:34:35 INFO WordCount: output reducer type: filesystem
> 10/11/10 12:34:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 10/11/10 12:34:36 INFO WordCount: XXX:text0
> 10/11/10 12:34:36 INFO mapred.JobClient: Running job: job_local_0001
> 10/11/10 12:34:36 INFO mapred.MapTask: io.sort.mb = 100
> 10/11/10 12:34:36 INFO mapred.MapTask: data buffer = 79691776/99614720
> 10/11/10 12:34:36 INFO mapred.MapTask: record buffer = 262144/327680
> 10/11/10 12:34:36 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.ClassCastException: java.nio.HeapByteBuffer cannot be cast to [B
>        at WordCount$TokenizerMapper.map(WordCount.java:73)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>        at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 10/11/10 12:34:37 INFO mapred.JobClient:  map 0% reduce 0%
> 10/11/10 12:34:37 INFO mapred.JobClient: Job complete: job_local_0001
> 10/11/10 12:34:37 INFO mapred.JobClient: Counters: 0
>
> I'm using cassandra 0.7.0beta3 (from latest trunk) on just one
> machine. Is the example working for anybody?
>
> Thanks,
> P.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: iteration does not yield all data with consistency ONE

2010-11-10 Thread Jonathan Ellis
Was the node that should have the other replica of this row down when
it was inserted?

On Wed, Nov 10, 2010 at 6:08 AM, Eric van Orsouw
 wrote:
>
> Hello,
>
>
>
> We have a cluster of 4 nodes (0.6.6) and use the random partitioner and a 
> replication of 2.
>
> When I insert a number of rows I can always retrieve them by their explicit 
> id (get_range_slices(“”,””, 1).
>
> Playing with consistency levels and temporarily shutting down a Cassandra 
> node all yields the expected result.
>
>
>
> However when I use get_range_slices(“”,””, n) to iterate over all rows, I 
> sometimes don’t get anything (depending on the node).
>
>
>
> I then reduced the problem to inserting just a single row.
>
> Specifically, the ‘iteration’ only seems to succeed when I issue the request 
> to the node that contains the first copy.
>
> I Discovered that when I iterate using a consistency level of Quorum/All the 
> iteration always succeeds and I properly get the one row.
>
>
>
> So a solution would be to always use consistency level One/All but that has a 
> performance penalty.
>
>
>
> Can anyone explain why iterating using get_range_slices(“”,””,n) does not 
> always function with consistency level One on all nodes?
>
>
>
> Thanks,
>
> Eric
>
>
>
> P.S. To rule out any discussion on whether or not to use iteration in the 
> first place, we only plan to use it for backup and periodic cleanup cycles.


--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Question about consistency level & data propagation & eventually consistent

2010-11-10 Thread Thibaut Britz
Hi,

Assuming I'm reading and writing with consistency level 1 (one) and read repair
turned off, I have a few questions about data propagation.
Data is stored with a replication factor of 3.

I'm not interested in the deletes. I can live with older data (or data that
has been deleted and will reappear), but I need to know how long it will
take until the data will be available at the other nodes, since I have
turned read repair off.

1) If all nodes are up:
 - Will all writes eventually reach all nodes (of the 3 nodes)?
 - What will be the maximal time until the last write reaches the last node
(of the 3 nodes)? (e.g. assume one of the nodes is doing compaction at that
time)

2) If one or two nodes are down
- As I understood it, one node will buffer the writes for the remaining
nodes.
- If the nodes go up again: when will these writes be propagated? At
compaction? What will be the maximal time until the writes reach the 2
nodes? Will these writes be propagated at all?

In case of 2:

The best way would then be to run nodetool repair after the two nodes are
available again. Is there a way to make the node not accept any
connections during that time, until it has finished repairing (e.g. throw an
UnavailableException)?


Thanks,
Thibaut


RE: iteration does not yield all data with consistency ONE

2010-11-10 Thread Eric van Orsouw
No, all nodes were up and running while the single key was inserted.
The insert, however, was with consistency ONE; I assume the replicas are still 
written in this case.
It is, by the way, very reproducible.

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: woensdag 10 november 2010 15:44
To: user
Subject: Re: iteration does not yield all data with consistency ONE

Was the node that should have the other replica of this row down when
it was inserted?

On Wed, Nov 10, 2010 at 6:08 AM, Eric van Orsouw
 wrote:
>
> Hello,
>
>
>
> We have a cluster of 4 nodes (0.6.6) and use the random partitioner and a 
> replication of 2.
>
> When I insert a number of rows I can always retrieve them by their explicit 
> id (get_range_slices("","", 1).
>
> Playing with consistency levels and temporarily shutting down a Cassandra 
> node all yields the expected result.
>
>
>
> However when I use get_range_slices("","", n) to iterate over all rows, I 
> sometimes don't get anything (depending on the node).
>
>
>
> I then reduced the problem to inserting just a single row.
>
> Specifically, the 'iteration' only seems to succeed when I issue the request 
> to the node that contains the first copy.
>
> I Discovered that when I iterate using a consistency level of Quorum/All the 
> iteration always succeeds and I properly get the one row.
>
>
>
> So a solution would be to always use consistency level One/All but that has a 
> performance penalty.
>
>
>
> Can anyone explain why iterating using get_range_slices("","",n) does not 
> always function with consistency level One on all nodes?
>
>
>
> Thanks,
>
> Eric
>
>
>
> P.S. To rule out any discussion on whether or not to use iteration in the 
> first place, we only plan to use it for backup and periodic cleanup cycles.


--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Range queries using token instead of key

2010-11-10 Thread Anand Somani
Hi,

I am trying to iterate over the entire dataset to calculate some
information. Now the way I am trying to do this is by going directly to the
node that has a data range, so here is the route I am following

   - get TokenRange using - describe_ring
   - then for each tokenRange pick a node and get all data from that node
   (so talk directly to that node for local data) - using get_range_slices ()
   and using KeyRange with start and end token. I want to get about N tokens at
   a time.
   - I want to use a paging approach for this, but I cannot seem to find a way
   to get the token for my last key slice. The only thing I can find is the key;
   is there a way to get the token given a key? Per some suggestions I could take
   the MD5 of the last key and use that as the starting token for the next query;
   would that work?

Also, is there a better way of doing this? The data per row is very small.
This looks like a Hadoop kind of job, but I am trying to avoid Hadoop since I
have no other use for it and this operation will be infrequent.

I am using 0.6.6, RandomPartitioner.

Thanks
Anand


Re: Question about consistency level & data propagation & eventually consistent

2010-11-10 Thread Peter Schuller
> 1) If all nodes are up:
>  - Will all writes eventually reach all nodes (of the 3 nodes)?

I believe that if read repair is completely off, then data that was written but
did *not* get saved by hinted hand-off would not propagate until anti-entropy
runs as part of a 'nodetool repair', or perhaps as part of node movement in the
ring (as a side effect).

Also see http://wiki.apache.org/cassandra/Operations under "Consistency".

>  - What will be the maximal time until the last write reaches the last node
> (of the 3 nodes)? (e.g. Assume one of the node is doing compactation at that
> time)

There is no particular time guarantee, unless you yourself take steps
that would imply such a guarantee (such as by running repair with a
certain frequency).

> 2) If one or two nodes are down
> - As I understood it, one node will buffer the writes for the remaining
> nodes.

AFAIK not all. I.e., only when a node is marked as down will hinted
hand-off start eating writes for the node (right, anyone?).

Hinted hand-off is not supposed to be a guarantee that all data will
become up-to-date; it's rather a way to lessen the impact of nodes
going down by decreasing the amount of data that remains out of synch.

> - If the nodes go up again: When will these writes be propagated, at
> compactation?, what will be the maximal time until the writes reach the 2
> nodes? Will these writes be propagated at all?

Again there's no time guarantee as such. As for the writes, I believe
hinted hand-off sends those along independently of compaction (but I'm
not sure).

-- 
/ Peter Schuller


RE: WordCount example problem

2010-11-10 Thread Aditya Muralidharan
Also, your Mapper class needs to look like this:
MyMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, SumWritable> ... with all the necessary fixes to the map method.
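
A minimal sketch of that mapper shape, with IntWritable standing in for the SumWritable type mentioned above (its definition is not shown in this thread); the point is to copy bytes out of the ByteBuffer instead of casting to byte[], which is what triggers the ClassCastException:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.SortedMap;
import org.apache.cassandra.db.IColumn;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        for (IColumn column : columns.values()) {
            // copy the value out of the ByteBuffer instead of casting it to byte[]
            ByteBuffer value = column.value().duplicate();
            byte[] bytes = new byte[value.remaining()];
            value.get(bytes);
            String text = new String(bytes, Charset.forName("UTF-8"));
            for (String word : text.split("\\s+")) {
                context.write(new Text(word), ONE);
            }
        }
    }
}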

AD

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Wednesday, November 10, 2010 8:40 AM
To: user
Subject: Re: WordCount example problem

http://www.mail-archive.com/user@cassandra.apache.org/msg07093.html

On Wed, Nov 10, 2010 at 5:47 AM, Patrik Modesto
 wrote:
> Hi,
>
> I'm trying the WordCount example and getting this error:
>
> [12:33]$ ./bin/word_count
> 10/11/10 12:34:35 INFO WordCount: output reducer type: filesystem
> 10/11/10 12:34:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 10/11/10 12:34:36 INFO WordCount: XXX:text0
> 10/11/10 12:34:36 INFO mapred.JobClient: Running job: job_local_0001
> 10/11/10 12:34:36 INFO mapred.MapTask: io.sort.mb = 100
> 10/11/10 12:34:36 INFO mapred.MapTask: data buffer = 79691776/99614720
> 10/11/10 12:34:36 INFO mapred.MapTask: record buffer = 262144/327680
> 10/11/10 12:34:36 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.ClassCastException: java.nio.HeapByteBuffer cannot be cast to [B
>        at WordCount$TokenizerMapper.map(WordCount.java:73)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>        at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 10/11/10 12:34:37 INFO mapred.JobClient:  map 0% reduce 0%
> 10/11/10 12:34:37 INFO mapred.JobClient: Job complete: job_local_0001
> 10/11/10 12:34:37 INFO mapred.JobClient: Counters: 0
>
> I'm using cassandra 0.7.0beta3 (from latest trunk) on just one
> machine. Is the example working for anybody?
>
> Thanks,
> P.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: iteration does not yield all data with consistency ONE

2010-11-10 Thread Jonathan Ellis
Interesting.  Does it simplify further to RF=1 and 2 nodes?

On Wed, Nov 10, 2010 at 8:58 AM, Eric van Orsouw
 wrote:
> No, all nodes were up and running while the single key was inserted.
> The insert however was with consistency One. I assume however that the 
> replicas are still written in this case.
> It is btw also very reproducible.
>
> -Original Message-
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: woensdag 10 november 2010 15:44
> To: user
> Subject: Re: iteration does not yield all data with consistency ONE
>
> Was the node that should have the other replica of this row down when
> it was inserted?
>
> On Wed, Nov 10, 2010 at 6:08 AM, Eric van Orsouw
>  wrote:
>>
>> Hello,
>>
>>
>>
>> We have a cluster of 4 nodes (0.6.6) and use the random partitioner and a 
>> replication of 2.
>>
>> When I insert a number of rows I can always retrieve them by their explicit 
>> id (get_range_slices("","", 1).
>>
>> Playing with consistency levels and temporarily shutting down a Cassandra 
>> node all yields the expected result.
>>
>>
>>
>> However when I use get_range_slices("","", n) to iterate over all rows, I 
>> sometimes don't get anything (depending on the node).
>>
>>
>>
>> I then reduced the problem to inserting just a single row.
>>
>> Specifically, the 'iteration' only seems to succeed when I issue the request 
>> to the node that contains the first copy.
>>
>> I Discovered that when I iterate using a consistency level of Quorum/All the 
>> iteration always succeeds and I properly get the one row.
>>
>>
>>
>> So a solution would be to always use consistency level One/All but that has 
>> a performance penalty.
>>
>>
>>
>> Can anyone explain why iterating using get_range_slices("","",n) does not 
>> always function with consistency level One on all nodes?
>>
>>
>>
>> Thanks,
>>
>> Eric
>>
>>
>>
>> P.S. To rule out any discussion on whether or not to use iteration in the 
>> first place, we only plan to use it for backup and periodic cleanup cycles.
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Question about consistency level & data propagation & eventually consistent

2010-11-10 Thread Jonathan Ellis
On Wed, Nov 10, 2010 at 8:54 AM, Thibaut Britz
 wrote:
> Assuming I'm reading and writing with consitency level 1 (one), read repair
> turned off, I have a few questions about data propagation.
> Data is being stored at consistency level 3.
> 1) If all nodes are up:
>  - Will all writes eventually reach all nodes (of the 3 nodes)?

Yes.

>  - What will be the maximal time until the last write reaches the last node

Situation-dependent.  The important thing is that if you are writing
at CL.ALL, it will be before the write is acked to the client.

> 2) If one or two nodes are down
> - As I understood it, one node will buffer the writes for the remaining
> nodes.

Yes: _after_ the failure detector recognizes them as down. This will
take several seconds.

> - If the nodes go up again: When will these writes be propagated

When FD recognizes them as back up.

> The best way would then be to run nodetool repair after the two nodes will
> be available again. Is there a way to make the node not accept any
> connections during that time until it is finished repairing? (eg throw the
> Unavailableexception)

No.  The way to prevent stale reads is to use an appropriate
consistencylevel, not error-prone heuristics.  (For instance: what if
the replica with the most recent data were itself down when the first
node recovered and initiated repair?)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: MapReduce/Hadoop in cassandra 0.7 beta3

2010-11-10 Thread Jeremy Hanna
Aditya,

Can you reproduce the problem locally with "pig -x local myscript.pig"?

Also, moving this message back to the cassandra user list.

On Nov 10, 2010, at 10:47 AM, Aditya Muralidharan wrote:

> Hi,
> 
> I'm still getting the error associated with 
> https://issues.apache.org/jira/browse/CASSANDRA-1700
> I have 7 suse nodes running Cassandra0.7 branch (latest as of the morning of 
> Nov 9). I've loaded 10 rows with one column family(replication factor=4) and 
> 100 super columns. Using the ColumnFamilyInputFormat with mapreduce 
> (LocalJobRunner) to retrieve all the rows gives me the following exception:
> 
> 10/11/10 10:33:15 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.RuntimeException: org.apache.thrift.TApplicationException: Internal 
> error processing get_range_slices
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:277)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:292)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:189)
>at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
>at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:148)
>at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>at 
> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Caused by: org.apache.thrift.TApplicationException: Internal error processing 
> get_range_slices
>at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
>at 
> org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:724)
>at 
> org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:704)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:255)
>... 11 more
> 
> The server has the following exception:
> ERROR [pool-1-thread-11] 2010-11-10 10:35:58,839 Cassandra.java (line 2876) 
> Internal error processing get_range_slices
> java.lang.AssertionError: 
> (150596448267070854052355226693835429313,18886431880788352792108545029372560769]
>at 
> org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1200)
>at 
> org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:429)
>at 
> org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:513)
>at 
> org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:2868)
>at 
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>at java.lang.Thread.run(Thread.java:619)
> 
> Any help would be appreciated.
> 
> Thanks.
> 
> AD



Re: Cassandra 0.7 bootstrap exception on windows

2010-11-10 Thread Jeremy Hanna
moving this to the cassandra user list.

On Nov 10, 2010, at 11:05 AM, Aditya Muralidharan wrote:

> Hi,
> 
> I'm building (on windows) a release tar from the HEAD of the Cassandra 0.7 
> branch. Running a new single node instance of Cassandra gives me the 
> following bootstrap exception:
> INFO 10:54:14,030 Enqueuing flush of memtable-locationi...@613975815(227 
> bytes, 4 operations)
> INFO 10:54:14,036 Writing memtable-locationi...@613975815(227 bytes, 4 
> operations)
> ERROR 10:54:14,278 Fatal exception in thread Thread[FlushWriter:1,5,main]
> java.io.IOError: java.io.IOException: rename failed of 
> \var\lib\cassandra\data\system\LocationInfo-e-1-Data.db
>at 
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:238)
>at 
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:208)
>at 
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:191)
>at 
> org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:161)
>at org.apache.cassandra.db.Memtable.access$000(Memtable.java:49)
>at org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:174)
>at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: rename failed of 
> \var\lib\cassandra\data\system\LocationInfo-e-1-Data.db
>at 
> org.apache.cassandra.utils.FBUtilities.renameWithConfirm(FBUtilities.java:359)
>at 
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:234)
>... 12 more
> 
> 
> This is not a problem on linux. Any thoughts? Anyone else seeing this 
> behavior?
> 
> Thanks.
> 
> AD



Re: MapReduce/Hadoop in cassandra 0.7 beta3

2010-11-10 Thread Stu Hood
Hey Aditya,

Would you mind attaching the last hundred or so lines from before the exception 
in the server log to this ticket: 
https://issues.apache.org/jira/browse/CASSANDRA-1724 ?

Thanks,
Stu

-Original Message-
From: "Jeremy Hanna" 
Sent: Wednesday, November 10, 2010 11:40am
To: user@cassandra.apache.org
Subject: Re: MapReduce/Hadoop in cassandra 0.7 beta3

Aditya,

Can you reproduce the problem locally with "pig -x local myscript.pig"?

Also, moving this message back to the cassandra user list.

On Nov 10, 2010, at 10:47 AM, Aditya Muralidharan wrote:

> Hi,
> 
> I'm still getting the error associated with 
> https://issues.apache.org/jira/browse/CASSANDRA-1700
> I have 7 suse nodes running Cassandra0.7 branch (latest as of the morning of 
> Nov 9). I've loaded 10 rows with one column family(replication factor=4) and 
> 100 super columns. Using the ColumnFamilyInputFormat with mapreduce 
> (LocalJobRunner) to retrieve all the rows gives me the following exception:
> 
> 10/11/10 10:33:15 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.RuntimeException: org.apache.thrift.TApplicationException: Internal 
> error processing get_range_slices
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:277)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:292)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:189)
>at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
>at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:148)
>at 
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>at 
> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Caused by: org.apache.thrift.TApplicationException: Internal error processing 
> get_range_slices
>at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
>at 
> org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:724)
>at 
> org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:704)
>at 
> org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:255)
>... 11 more
> 
> The server has the following exception:
> ERROR [pool-1-thread-11] 2010-11-10 10:35:58,839 Cassandra.java (line 2876) 
> Internal error processing get_range_slices
> java.lang.AssertionError: 
> (150596448267070854052355226693835429313,18886431880788352792108545029372560769]
>at 
> org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1200)
>at 
> org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:429)
>at 
> org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:513)
>at 
> org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:2868)
>at 
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
>at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>at java.lang.Thread.run(Thread.java:619)
> 
> Any help would be appreciated.
> 
> Thanks.
> 
> AD





encoding of values in cassandra

2010-11-10 Thread Koert Kuipers
Cassandra keys and values are just bytes. My values range from simple doubles 
to complex objects so I need to serialize them with something like avro, thrift 
or protobuf.

Since I am working in a test environment and Cassandra is moving to Avro, I 
decided to use the Avro protocol to communicate with Cassandra (from Python 
and Java). So naturally I would also like to encode my values with Avro (why 
have 2 serialization frameworks around?). However, Avro needs to save the schema 
with the serialized values. This is considerable overhead (even if I just save 
pointers to schemas or something like that with the serialized values). It 
also seems complicated compared to Thrift or protobuf, where one can just store 
values.

Has anyone found a neat solution to this? Or should I just use Avro for 
communication and something like protobuf for value serialization?

Best, Koert



RE: MapReduce/Hadoop in cassandra 0.7 beta3

2010-11-10 Thread Aditya Muralidharan
My bad. Moved to Cassandra user list.

-Original Message-
From: Aditya Muralidharan [mailto:aditya.muralidha...@nisc.coop] 
Sent: Wednesday, November 10, 2010 10:48 AM
To: u...@pig.apache.org
Subject: RE: MapReduce/Hadoop in cassandra 0.7 beta3

Hi,

I'm still getting the error associated with 
https://issues.apache.org/jira/browse/CASSANDRA-1700
I have 7 suse nodes running Cassandra0.7 branch (latest as of the morning of 
Nov 9). I've loaded 10 rows with one column family(replication factor=4) and 
100 super columns. Using the ColumnFamilyInputFormat with mapreduce 
(LocalJobRunner) to retrieve all the rows gives me the following exception:

10/11/10 10:33:15 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: org.apache.thrift.TApplicationException: Internal 
error processing get_range_slices
at 
org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:277)
at 
org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:292)
at 
org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:189)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
at 
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:148)
at 
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
at 
org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.thrift.TApplicationException: Internal error processing 
get_range_slices
at 
org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
at 
org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:724)
at 
org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:704)
at 
org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:255)
... 11 more

The server has the following exception:
ERROR [pool-1-thread-11] 2010-11-10 10:35:58,839 Cassandra.java (line 2876) 
Internal error processing get_range_slices
java.lang.AssertionError: 
(150596448267070854052355226693835429313,18886431880788352792108545029372560769]
at 
org.apache.cassandra.db.ColumnFamilyStore.getRangeSlice(ColumnFamilyStore.java:1200)
at 
org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:429)
at 
org.apache.cassandra.thrift.CassandraServer.get_range_slices(CassandraServer.java:513)
at 
org.apache.cassandra.thrift.Cassandra$Processor$get_range_slices.process(Cassandra.java:2868)
at 
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:167)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

Any help would be appreciated.

Thanks.

AD


Re: encoding of values in cassandra

2010-11-10 Thread Jonathan Ellis
We are moving towards treating Thrift more as a driver than as a
format itself, and using libraries like Hector, pycassa, and phpcassa
from the client.

On Wed, Nov 10, 2010 at 1:03 PM, Koert Kuipers
 wrote:
> Cassandra keys and values are just bytes. My values range from simple
> doubles to complex objects so I need to serialize them with something like
> avro, thrift or protobuf.
>
>
>
> Since I am working in a test environment and casssandra is moving to avro I
> decided to use the avro protocol  to communicate with cassandra (from python
> and java). So naturally I would also like to encode my values with avro (why
> have 2 serialization frameworks around?). However avro needs to safe the
> schema with the serialized values. This is considerable overhead (even if I
> just safe pointers to schemas  or something like that with the serialized
> values). It also seems complicated compared to thrift or protobuf where one
> can just store values.
>
>
>
> Did anyone find a neat solution to this? Or should I just use avro for
> communication and something like protobuf for value serialization?
>
>
>
> Best, Koert
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


multiple datacenter with low replication factor - idea for greater flexibility

2010-11-10 Thread Wayne Lewis
Hello,

We've had Cassandra running in a single production data center now for several 
months and have started detailed plans to add data center fault tolerance.

Our requirements do not appear to be solved out-of-the-box with Cassandra. I'd 
like to share a solution we're planning and find others considering similar 
problems.

We require the following:

1. Two data centers
One is primary, the other a hot standby to be used when the primary fails. Of course 
Cassandra has no such bias, but as will be seen below this becomes important 
when considering application latency.

2. No more than 3 copies of data total
We are storing blob-like objects. Cost per unit of usable storage is closely 
scrutinized vs other solutions. Hence we want to keep replication factor low.
Two copies will be held in the primary DC, 1 in the secondary DC - with the 
corresponding ratio of machines in each DC.

3. Immediate consistency

4. No waiting on remote data center
The application front-end runs in the primary data center and expects that 
operations using a local coordinator node will not suffer a response time 
determined by the WAN. Hence we cannot require a response from the node in the 
secondary data center to achieve quorum.

5. Ability to operate with a single working node per key, if necessary
We wish to temporarily operate with even a single working node per token in 
desperate situations involving data center failures or combinations of node and 
data center failure.


Existing Cassandra solutions offer combinations of the above, but it is not at 
all clear how to achieve all the above without custom work. 
Normal quorum with N=3 can only work with a single down node regardless of 
topology. Furthermore if one node in the primary DC fails, quorum requires 
synchronous operations over the WAN.
NetworkTopologyStrategy is nice, but requiring quorum in the primary DC with 2 
nodes means no tolerance to a single node failure there.
If we're overlooking something I'd love to know.


Hence the following proposal for a new replication strategy we're calling 
SubQuorum.

In short, SubQuorum allows administratively marking some nodes as exempt 
from participating in quorum. Since all nodes agree on the exemption status, 
consistency is still guaranteed: quorum is still achieved amongst the 
remaining nodes. We gain tremendous flexibility to deal with node and DC 
failures. Exempt nodes, if up, still receive mutation messages as usual.

For example: if a primary DC node fails, we can mark its remote counterpart 
exempt from quorum, allowing continued operation without a synchronous 
call over the WAN.

Or another example: if the primary DC fails, we mark all primary DC nodes 
exempt and move the entire application to the secondary DC, where it runs as 
usual but with just the one copy.



The implementation is trivial and consists of two pieces:

1. Exempt node management. The list of exempt nodes is broadcast out of band. 
In our case we're leveraging Puppet and an admin server.

2. We've written an implementation of AbstractReplicationStrategy that returns 
custom QuorumResponseHandler and IWriteResponseHandler. These simply wait for 
quorum amongst non-exempt nodes.
This requires a small change to the AbstractReplicationStrategy interface to 
pass the endpoints to getQuorumResponseHandler and getWriteResponseHandler, but 
otherwise changes are contained in the plugin.
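
To make point 2 concrete, a minimal sketch of the counting logic only; this is not the real Cassandra handler interface, just the idea of reaching quorum over the non-exempt replicas:

import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SubQuorumHandler {
    private final Set<String> exemptEndpoints;
    private final CountDownLatch latch;

    public SubQuorumHandler(Set<String> replicaEndpoints, Set<String> exemptEndpoints) {
        this.exemptEndpoints = exemptEndpoints;
        int nonExempt = 0;
        for (String endpoint : replicaEndpoints) {
            if (!exemptEndpoints.contains(endpoint)) {
                nonExempt++;
            }
        }
        // quorum over the non-exempt replicas only; if everything is exempt, do not block
        int required = nonExempt == 0 ? 0 : nonExempt / 2 + 1;
        this.latch = new CountDownLatch(required);
    }

    // called once per replica acknowledgement; exempt replicas do not count toward quorum
    public void response(String fromEndpoint) {
        if (!exemptEndpoints.contains(fromEndpoint)) {
            latch.countDown();
        }
    }

    // a real handler would throw a timeout exception instead of returning false
    public boolean await(long timeoutMillis) throws InterruptedException {
        return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}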


There is more analysis I can share if anyone is interested. But at this point 
I'd like to get feedback.

Thanks,
Wayne Lewis



[RELEASE] 0.6.7

2010-11-10 Thread Eric Evans

It's been about a month since our last stable update and we've
accumulated a few changes[1] worth having, so I'm pleased to announce
the release of 0.6.7.

If you're coming from a version older than 0.6.6 then please be sure to
read the release notes[2]; upgrades from 0.6.6 should be completely
seamless.

As usual, links to binary and source archives are available from the
Downloads page[3], and packages for Debian-based systems are available
from our repo[4].

Thanks, and enjoy!

[1]: http://goo.gl/pGEx5 [CHANGES.txt]
[2]: http://goo.gl/IQ3rR [NEWS.txt]
[3]: http://cassandra.apache.org/download
[4]: http://wiki.apache.org/cassandra/DebianPackaging

-- 
Eric Evans
eev...@rackspace.com




Re: Range queries using token instead of key

2010-11-10 Thread Edward Capriolo
On Wed, Nov 10, 2010 at 10:05 AM, Anand Somani  wrote:
> Hi,
>
> I am trying to iterate over the entire dataset to calculate some
> information. Now the way I am trying to do this is by going directly to the
> node that has a data range, so here is the route I am following
>
> get TokenRange using - describe_ring
> then for each tokenRange pick a node and get all data from that node (so
> talk directly to that node for local data) - using get_range_slices () and
> using KeyRange with start and end token. I want to get about N tokens at a
> time.
> I want to use paging approach for this, but I cannot seem to find a way to
> get the token for my last keyslice? The only thing I can find is key, now is
> there way to get token given a key? As per some suggestions I can do the md5
> on the last key and use that as the starting token for the next query, would
> that work?
>
> Also is there a better way of doing this? The data per row is very small.
> This looks like a hadoop kind of a job, but am trying to avoid hadoop since
> have no other use for it and this operation will be infrequent.
>
> I am using 0.6.6, RandomPartitioner.
>
> Thanks
> Anand
>

You should take the last key from your key slice and pass it into
FBUtilities.hash(key) to get its token.

Edward
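
A minimal sketch of that step and how it feeds the next page; the MD5-and-abs computation mirrors what RandomPartitioner (and FBUtilities.hash) does in 0.6, and the first row of the new page repeats the last row of the previous one, so it should be skipped:

import java.math.BigInteger;
import java.security.MessageDigest;
import org.apache.cassandra.thrift.KeyRange;

public class TokenPaging {
    // RandomPartitioner-style token: abs(MD5(key)) as a BigInteger
    public static BigInteger tokenFor(String key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return new BigInteger(md5.digest(key.getBytes("UTF-8"))).abs();
    }

    // build the KeyRange for the next page within one TokenRange from describe_ring
    public static KeyRange nextPage(String lastKeyOfPreviousPage, String endToken, int pageSize)
            throws Exception {
        KeyRange range = new KeyRange(pageSize);
        range.setStart_token(tokenFor(lastKeyOfPreviousPage).toString());
        range.setEnd_token(endToken);
        return range;   // remember to skip the repeated first row of this page
    }
}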


CF Stats in 0.7beta3

2010-11-10 Thread Rock, Paul
Afternoon all - I'm playing with 0.7beta3 on some boxes I have here at the 
office, and while checking out the stats from one of my tests I'm seeing Write 
Latency being reported as "0.009 ms". I haven't done any timing yet in my 
client, but is this really microsecond latency, or is there a mismatch between 
the number and the label? Granted, I'm not loading the cluster up at all, just 
writing with a single thread to play with pycassa, so the cluster doesn't have 
anything to do but handle my writes, but I'd like to make sure before I run off 
trying to talk my manager into something :-)

Column Family: NameServer2Domain
SSTable count: 0
Space used (live): 0
Space used (total): 0
Memtable Columns Count: 39718
Memtable Data Size: 2531109
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 39718
Write Latency: 0.009 ms.
Pending Tasks: 0
Key cache capacity: 20
Key cache size: 0
Key cache hit rate: NaN
Row cache: disabled
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0



Re: CF Stats in 0.7beta3

2010-11-10 Thread Ryan King
Yeah, that's really microsecond latency. Note, though, that this isn't
the full request timing; it's just from the storage proxy down, so it
doesn't account for any latency added by Thrift or the network.

-ryan

On Wed, Nov 10, 2010 at 1:43 PM, Rock, Paul  wrote:
> Afternoon all - I'm playing with 0.7beta3 on some boxes I have here at the 
> office and while checking out the stats from one of my tests I'm seeing Write 
> Latency being reported as "0.009 ms". I haven't done any timing yet in my 
> client, but is this really microsecond latency, or is there a mismatch 
> between the numeric and the label? Granted, I'm not loading the complex up at 
> all just writing with a single thread to play with pycassa so the cluster 
> doesn't have anything to do but handle my write, but I'd like to make sure 
> before I run off trying to talk my manager into something :-)
>
>                Column Family: NameServer2Domain
>                SSTable count: 0
>                Space used (live): 0
>                Space used (total): 0
>                Memtable Columns Count: 39718
>                Memtable Data Size: 2531109
>                Memtable Switch Count: 0
>                Read Count: 0
>                Read Latency: NaN ms.
>                Write Count: 39718
>                Write Latency: 0.009 ms.
>                Pending Tasks: 0
>                Key cache capacity: 20
>                Key cache size: 0
>                Key cache hit rate: NaN
>                Row cache: disabled
>                Compacted row minimum size: 0
>                Compacted row maximum size: 0
>                Compacted row mean size: 0
>
>


Non-Unique Indexes, How ?

2010-11-10 Thread J T
Hi,

I'm trying to work out a way to support a non-unique index.

For example, let's say I have a contact list where it's possible to have
names that are the same but belong to different people, so they should have
different contact entries, but I'd want to be able to search on their full
name and get a list of potential matches.

In Cassandra, as far as I know, column names and row keys need to be unique,
so unless I somehow construct a unique form of the full name to use as a
column name or key value, I'm left with using the column value (as opposed to
the name) and the indexing facility in 0.7. But it's not clear to me whether
the 0.7 index facility would support non-unique column values this way.

e.g.

CF: Contacts (with an index on 'fullname')

key : id1 { fullname : "John Brown", address : "London" }
key : id2 { fullname : "John Brown", address : "Paris"}

Would the 0.7 index on fullname allow me to lookup the 2 entries if I
searched on "John" or "John Brown" ?

Regards

Jason


rename column family with cassandra-cli in 0.7.0-beta3

2010-11-10 Thread gbanks


Re: Non-Unique Indexes, How ?

2010-11-10 Thread Jonathan Ellis
On Wed, Nov 10, 2010 at 5:55 PM, J T  wrote:
> CF: Contacts (with an index on 'fullname')
> key : id1 { fullname : "John Brown", address : "London" }
> key : id2 { fullname : "John Brown", address : "Paris"    }
> Would the 0.7 index on fullname allow me to lookup the 2 entries if I
> searched on "John" or "John Brown" ?

Yes, the latter.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
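
A minimal sketch of that equality lookup against the 0.7 Thrift API, assuming a KEYS index on 'fullname' and that set_keyspace has already been called; the column family name is a placeholder:

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.IndexClause;
import org.apache.cassandra.thrift.IndexExpression;
import org.apache.cassandra.thrift.IndexOperator;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class FullNameLookup {
    // returns every row whose 'fullname' column equals the searched value,
    // so both id1 and id2 come back for "John Brown"
    public static List<KeySlice> findByFullName(Cassandra.Client client, String fullName)
            throws Exception {
        IndexExpression expression = new IndexExpression(
                ByteBuffer.wrap("fullname".getBytes("UTF-8")),
                IndexOperator.EQ,
                ByteBuffer.wrap(fullName.getBytes("UTF-8")));
        IndexClause clause = new IndexClause(
                Arrays.asList(expression), ByteBuffer.wrap(new byte[0]), 100);

        // return all columns of each matching row (up to 100)
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
                ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 100));

        return client.get_indexed_slices(
                new ColumnParent("Contacts"), clause, predicate, ConsistencyLevel.ONE);
    }
}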


Re: Non-Unique Indexes, How ?

2010-11-10 Thread J T
Ok, so non-unique indexes are supported, but only full equality matches on
the values are supported right now.

Will it in the future allow for partial/range matches ?

e.g. Find all contacts with a J as the first letter ?

Jason

On Thu, Nov 11, 2010 at 12:13 AM, Jonathan Ellis  wrote:

> On Wed, Nov 10, 2010 at 5:55 PM, J T  wrote:
> > CF: Contacts (with an index on 'fullname')
> > key : id1 { fullname : "John Brown", address : "London" }
> > key : id2 { fullname : "John Brown", address : "Paris"}
> > Would the 0.7 index on fullname allow me to lookup the 2 entries if I
> > searched on "John" or "John Brown" ?
>
> Yes, the latter.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: rename column family with cassandra-cli in 0.7.0-beta3

2010-11-10 Thread Jonathan Ellis
https://issues.apache.org/jira/browse/CASSANDRA-1630

On Wed, Nov 10, 2010 at 6:09 PM, gbanks  wrote:
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Non-Unique Indexes, How ?

2010-11-10 Thread Jonathan Ellis
Yes.

On Wed, Nov 10, 2010 at 6:39 PM, J T  wrote:
> Ok, so non-unique indexes are supported, but only full equality matches on
> the values are supported right now.
> Will it in the future allow for partial/range matches ?
>
> e.g. Find all contacts with a J as the first letter ?
> Jason
> On Thu, Nov 11, 2010 at 12:13 AM, Jonathan Ellis  wrote:
>>
>> On Wed, Nov 10, 2010 at 5:55 PM, J T  wrote:
>> > CF: Contacts (with an index on 'fullname')
>> > key : id1 { fullname : "John Brown", address : "London" }
>> > key : id2 { fullname : "John Brown", address : "Paris"    }
>> > Would the 0.7 index on fullname allow me to lookup the 2 entries if I
>> > searched on "John" or "John Brown" ?
>>
>> Yes, the latter.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Unsubscribe

2010-11-10 Thread Vibhaw P Rajan

Warm regards,
Vibhaw Rajan
Application Developer-Mainframes
IBM India Pvt. Ltd.  DLF IT Park, Chennai, India
Office +91 44 22723552  Mobile +91 996 253 3029
Email   vibra...@in.ibm.com
"Success is not final, failure is not fatal: it is the courage to continue
that counts"



Re: WordCount example problem

2010-11-10 Thread Patrik Modesto
Thanks, I'll do that.

P.

On Wed, Nov 10, 2010 at 16:28, Aditya Muralidharan
 wrote:
> Also, your Mapper class needs to look like this:
> MyMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, SumWritable> ... with all the necessary fixes to the map method.
>
> AD
>
> -Original Message-
> From: Jonathan Ellis [mailto:jbel...@gmail.com]
> Sent: Wednesday, November 10, 2010 8:40 AM
> To: user
> Subject: Re: WordCount example problem
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg07093.html
>
> On Wed, Nov 10, 2010 at 5:47 AM, Patrik Modesto
>  wrote:
>> Hi,
>>
>> I'm trying the WordCount example and getting this error:
>>
>> [12:33]$ ./bin/word_count
>> 10/11/10 12:34:35 INFO WordCount: output reducer type: filesystem
>> 10/11/10 12:34:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with
>> processName=JobTracker, sessionId=
>> 10/11/10 12:34:36 INFO WordCount: XXX:text0
>> 10/11/10 12:34:36 INFO mapred.JobClient: Running job: job_local_0001
>> 10/11/10 12:34:36 INFO mapred.MapTask: io.sort.mb = 100
>> 10/11/10 12:34:36 INFO mapred.MapTask: data buffer = 79691776/99614720
>> 10/11/10 12:34:36 INFO mapred.MapTask: record buffer = 262144/327680
>> 10/11/10 12:34:36 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.ClassCastException: java.nio.HeapByteBuffer cannot be cast to [B
>>        at WordCount$TokenizerMapper.map(WordCount.java:73)
>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>        at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>> 10/11/10 12:34:37 INFO mapred.JobClient:  map 0% reduce 0%
>> 10/11/10 12:34:37 INFO mapred.JobClient: Job complete: job_local_0001
>> 10/11/10 12:34:37 INFO mapred.JobClient: Counters: 0
>>
>> I'm using cassandra 0.7.0beta3 (from latest trunk) on just one
>> machine. Is the example working for anybody?
>>
>> Thanks,
>> P.
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: WordCount example problem

2010-11-10 Thread Patrik Modesto
That's exactly what's happening to me. I wonder why Google didn't find it.

Thanks!

P.

On Wed, Nov 10, 2010 at 15:39, Jonathan Ellis  wrote:
> http://www.mail-archive.com/user@cassandra.apache.org/msg07093.html
>
> On Wed, Nov 10, 2010 at 5:47 AM, Patrik Modesto
>  wrote:
>> Hi,
>>
>> I'm trying the WordCount example and getting this error:
>>
>> [12:33]$ ./bin/word_count
>> 10/11/10 12:34:35 INFO WordCount: output reducer type: filesystem
>> 10/11/10 12:34:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with
>> processName=JobTracker, sessionId=
>> 10/11/10 12:34:36 INFO WordCount: XXX:text0
>> 10/11/10 12:34:36 INFO mapred.JobClient: Running job: job_local_0001
>> 10/11/10 12:34:36 INFO mapred.MapTask: io.sort.mb = 100
>> 10/11/10 12:34:36 INFO mapred.MapTask: data buffer = 79691776/99614720
>> 10/11/10 12:34:36 INFO mapred.MapTask: record buffer = 262144/327680
>> 10/11/10 12:34:36 WARN mapred.LocalJobRunner: job_local_0001
>> java.lang.ClassCastException: java.nio.HeapByteBuffer cannot be cast to [B
>>        at WordCount$TokenizerMapper.map(WordCount.java:73)
>>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>        at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>> 10/11/10 12:34:37 INFO mapred.JobClient:  map 0% reduce 0%
>> 10/11/10 12:34:37 INFO mapred.JobClient: Job complete: job_local_0001
>> 10/11/10 12:34:37 INFO mapred.JobClient: Counters: 0
>>
>> I'm using cassandra 0.7.0beta3 (from latest trunk) on just one
>> machine. Is the example working for anybody?
>>
>> Thanks,
>> P.
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>