Re: Previously deleted rows resurrected by repair?

2011-12-28 Thread Dominic Williams
Hmm, interesting. It could be some variation on 3510 (which caught me out).

Personally I really don't like having to rely on repair to stop deletes
being undone. If you agree, follow this proposal for an alternative
https://issues.apache.org/jira/browse/CASSANDRA-3620 which also stops
tombstone build-up.

2011/12/27 Jonas Borgström 

> Hi,
>
> I have a 3-node cluster running Cassandra 1.0.3 and using replication
> factor=3.
>
> Recently I've noticed that some previously deleted rows have started to
> reappear for some reason. And now I wonder if this is a known issue with
> 1.0.3?
>
> Repairs have been running every weekend (gc_grace is 10 days) and always
> completed successfully. But while looking at the logs I noticed that a fair
> number of ranges (around 10% of the total number of keys) have been
> streamed between these nodes during the repair sessions. This seems a bit
> high to me given that everything is written using quorum and all nodes have
> been up all the time.
>
> For me this looks suspiciously like some already deleted keys are streamed
> to other nodes during repair.
>
>
> Some more details about the data:
> All keys are written to only once and most of them are deleted a couple of
> days/weeks later. Some keys are large enough to require incremental
> compaction.
>
> Could this bug cause this?
>
> https://issues.apache.org/jira/browse/CASSANDRA-3510
>
> Regards,
> Jonas
>


optimizing index sampling for better memory usage

2011-12-28 Thread Radim Kolar
Currently o.a.c.io.sstable.IndexSummary is implemented as an ArrayList of
KeyPosition (RowPosition key, long offset).

I propose to change it to

RowPosition keys[]
long offsets[]

This will lower the number of Java objects used per entry from 2
(KeyPosition + RowPosition) to 1.


To build these arrays, the convenient ArrayList class can be used,
followed by a call to .toArray() on it.
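
A minimal sketch of the proposed layout, not the actual Cassandra code. The class name ParallelIndexSummary and its Builder are hypothetical, and String stands in for RowPosition. One detail worth noting: .toArray() works directly for the key array, but for the offsets it would give Long[] rather than long[], so the sketch unboxes them in a loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: String stands in for RowPosition.
final class ParallelIndexSummary {
    private final String[] keys;   // sampled keys, in sorted order
    private final long[] offsets;  // offsets[i] is the position recorded for keys[i]

    private ParallelIndexSummary(String[] keys, long[] offsets) {
        this.keys = keys;
        this.offsets = offsets;
    }

    // Accumulates samples in lists, then freezes them into two compact arrays.
    // Samples are assumed to be added in ascending key order.
    static final class Builder {
        private final List<String> keyList = new ArrayList<>();
        private final List<Long> offsetList = new ArrayList<>();

        Builder add(String key, long offset) {
            keyList.add(key);
            offsetList.add(offset);
            return this;
        }

        ParallelIndexSummary build() {
            String[] k = keyList.toArray(new String[0]);
            long[] o = new long[offsetList.size()];
            for (int i = 0; i < o.length; i++) {
                o[i] = offsetList.get(i); // unbox once, at build time
            }
            return new ParallelIndexSummary(k, o);
        }
    }

    // Returns the offset of the last sampled key <= target,
    // or -1 if target sorts before every sampled key.
    long floorOffset(String target) {
        int idx = Arrays.binarySearch(keys, target);
        if (idx < 0) {
            idx = -idx - 2; // (insertion point) - 1
        }
        return idx < 0 ? -1L : offsets[idx];
    }

    int size() {
        return keys.length;
    }
}
```

With this layout each entry costs one key object plus a primitive slot in offsets[], and the lookup stays a plain binary search over keys[].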


Re: Previously deleted rows resurrected by repair?

2011-12-28 Thread Jonas Borgström

On 2011-12-28 12:52 , Dominic Williams wrote:

> Hmm interesting could be some variation on 3510 (which caught me out).


Actually, after some further reading of the changelog, 2786 looks
like a likely culprit.
If I'm reading the jira correctly, all versions < 0.8.8 and < 1.0.4 are
at risk of getting deleted rows resurrected.


Hopefully an upgrade to 1.0.6 will stop this problem from increasing but 
I still need to manually re-delete a bunch of rows.


https://issues.apache.org/jira/browse/CASSANDRA-2786


> Personally I really don't like having to rely on repair to stop deletes
> being undone. If you agree follow this proposal for an alternative
> https://issues.apache.org/jira/browse/CASSANDRA-3620 which also stops
> tombstone build up.


Thanks, this definitely looks interesting. I'll have a look.

/ Jonas


Re: improving cassandra-vs-mongodb-vs-couchdb-vs-redis

2011-12-28 Thread Filipe Gonçalves
There really is no generic way of comparing these systems; NoSQL
databases are highly heterogeneous.
The only credible and accurate way of doing a comparison is for a
specific, well-defined use case. Other than that you are always going
to be comparing apples to oranges, and so end up with a crappy (and in
this case, even inaccurate) comparison to work with.
Some engineers (at Facebook, Twitter and Netflix among others, if I'm not
mistaken) have written some interesting articles describing where and why
their companies use each database. Google those for a minimally
accurate perspective of the NoSQL (and SQL in some cases) database
world.

2011/12/28 CharSyam :
> Don't trust NoSQL benchmarks. They are not lies, but NoSQL databases
> perform differently in different environments.
>
> Benchmark in your real environment, and choose based on that.
>
> Thank you.
>
>
> 2011/12/28 Igor Lino 
>>
>> You are totally right. I'm far from being an expert on the subject, but
>> the comparison felt inconsistent and incomplete. (I could not express that
>> in my 1st email, not to bias the opinion)
>>
>> Do you know of any similar comparison, which is not biased towards some
>> particular technology or solution?   (so not coming from
>> http://cassandra.apache.org/)
>> I want to understand how superior is Cassandra in its latest release
>> against closer competitors, ideally with the opinion from expert guys.
>>
>>
>> On Wed, Dec 28, 2011 at 12:14 AM, Edward Capriolo 
>> wrote:
>>
>>    This is not really a comparison of anything because each NoSQL has its
>> own bullet points like:
>>    Boats
>>      great for traveling on water
>>    Cars
>>      great for traveling on land
>>    So the conclusion I should gather is?
>>    Also as for the Cassandra bullet points, they are really thin (and
>> wrong). Such as:
>>    Cassandra:
>>    Best used: When you write more than you read (logging). If every
>> component of the system must be in Java. ("No one gets fired for choosing
>> Apache's stuff.")
>>    I view that as:
>>    Nonsensical, inaccurate, and anecdotal.
>>    Also I notice on the other side (and not trying to pick on hbase, but)
>>    hbase:
>>    No single point of failure
>>    Random access performance is like MySQL
>>    Hbase has several SPOFs, and its random access performance is definitely
>> NOT 'like mysql'.
>>    Cassandra ACTUALLY has no SPOF but, as the author mentions, he/she does
>> not like Cassandra so that fact was left out.
>>    From what I can see of the writeup, it is obviously inaccurate in
>> numerous places (without even reading the entire thing).
>>    Also when comparing these technologies very subtle differences in
>> design have profound effects on operation and performance. Thus someone
>> trying to paper over 6 technologies and compare them with a few bullet
>> points is really doing the world an injustice.
>>    On Tue, Dec 27, 2011 at 5:01 PM, Igor Lino  wrote:
>>
>>        Hi!
>>
>>        I was trying to get an understanding of the real strengths of
>> Cassandra against other competitors. It's actually not that simple and
>> depends a lot on the details of the actual requirements.
>>
>>        Reading the following comparison:
>>        http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
>>
>>        It felt like the description of Cassandra painted a limiting
>> picture of its capabilities. Is there any Cassandra expert that could
>> improve that summary? is there any important thing missing? or is there a
>> more fitting common use case for Cassandra than what Mr. Kovacs has given?
>>        (I believe/think that a Cassandra expert can improve that generic
>> description)
>>
>>        Thanks,
>>        Igor
>>
>>
>>
>



-- 
Filipe Gonçalves


Re: Previously deleted rows resurrected by repair?

2011-12-28 Thread Edward Capriolo
I wanted to throw out a "solution" of mine. With hinted handoff being
guaranteed, losing a delete is almost impossible, so solutions like this are
no longer required, but...

My "solution" is to always write deletes at CL.ALL. If a delete fails at ALL,
queue it up: put it on a message queue, write it to a log, etc. Then later on
have some process consume the log and replay the delete at ALL again.

This sounds like a pain but it is really not that bad. If you are using C*
you likely already have a message queue or Hadoop system. Generally lost
deletes are an edge case, so you would not expect this backup system to be
very busy.
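
A rough sketch of that delete-then-replay loop, under stated assumptions: DeleteClient and ReplayingDeleter are hypothetical names, not part of any Cassandra client, and the in-memory queue only stands in for the durable message queue or log mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical client abstraction; a real client library would sit behind it.
interface DeleteClient {
    // Attempts the delete at consistency level ALL.
    // Returns false (or throws) if not every replica acknowledged it.
    boolean deleteAtAll(String rowKey);
}

final class ReplayingDeleter {
    private final DeleteClient client;
    // Stand-in for a durable message queue or log of failed deletes.
    private final Queue<String> failed = new ArrayDeque<>();

    ReplayingDeleter(DeleteClient client) {
        this.client = client;
    }

    // Issues the delete at ALL; on failure, remembers the key for later replay.
    void delete(String rowKey) {
        if (!tryDelete(rowKey)) {
            failed.add(rowKey);
        }
    }

    // Replays each queued delete once; returns how many are still pending.
    int replay() {
        int n = failed.size();
        for (int i = 0; i < n; i++) {
            String key = failed.remove();
            if (!tryDelete(key)) {
                failed.add(key); // keep it for the next replay pass
            }
        }
        return failed.size();
    }

    int pending() {
        return failed.size();
    }

    private boolean tryDelete(String rowKey) {
        try {
            return client.deleteAtAll(rowKey);
        } catch (RuntimeException e) {
            return false; // treat any client error as "not deleted everywhere"
        }
    }
}
```

The replay pass would be run periodically by whatever consumes the queue. Since Cassandra deletes carry timestamps, a real version would probably also want to store the original delete timestamp with each queued key, so that a late replay cannot shadow data written afterwards.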




Re: Merging 3 rows that are mostly read together from CF into single rows with composite col names ?

2011-12-28 Thread Edward Capriolo
On Monday, December 26, 2011, Asil Klin  wrote:
> If 3 rows in a column family need to be read together 'always', is it
> preferable to just merge them in 1 row using composite col names (instead of
> keeping in 3 rows)? Does this improve read performance, anyway?

You almost definitely want to merge these rows. On spinning-disk systems like
SCSI or SATA, disk seeks are at a premium, so having 3 independent
seeks for data that is always read together is not good.


Re: Retrieve all composite columns from a row, whose composite name's first component matches from a list of Integers

2011-12-28 Thread Edward Capriolo
You need to execute one get slice operation for each item id or, if the row
is not large, you can try one large get slice on the entire row and deal
with the results client side.

If you try method 1: when doing slices on composites, you can mark the start
and finish values as inclusive or exclusive to get only the columns you want
and not some extra columns up to the slice range size.
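
To illustrate method 1 without tying it to a particular client, here is an in-memory sketch. It does not use the Hector API: a TreeMap stands in for the sorted columns of the row, and ItemUser and CompositeSliceDemo are hypothetical names. Each item id becomes one slice whose start and finish share the first component and span every possible second component.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for a composite column name (itemId, userId).
final class ItemUser implements Comparable<ItemUser> {
    final int itemId;
    final int userId;

    ItemUser(int itemId, int userId) {
        this.itemId = itemId;
        this.userId = userId;
    }

    // Columns sort by first component, then by second, like a composite comparator.
    @Override
    public int compareTo(ItemUser o) {
        int c = Integer.compare(itemId, o.itemId);
        return c != 0 ? c : Integer.compare(userId, o.userId);
    }
}

final class CompositeSliceDemo {
    // One "get slice" per item id: every column whose first component equals itemId.
    static List<String> sliceForItem(NavigableMap<ItemUser, String> row, int itemId) {
        ItemUser start = new ItemUser(itemId, Integer.MIN_VALUE);
        ItemUser finish = new ItemUser(itemId, Integer.MAX_VALUE);
        // Both bounds inclusive, so nothing from neighbouring item ids leaks in.
        return new ArrayList<>(row.subMap(start, true, finish, true).values());
    }

    // Runs one slice per requested item id and concatenates the results.
    static List<String> activitiesFor(NavigableMap<ItemUser, String> row, List<Integer> itemIds) {
        List<String> out = new ArrayList<>();
        for (int id : itemIds) {
            out.addAll(sliceForItem(row, id));
        }
        return out;
    }
}
```

The other option from the first paragraph, one large slice over the whole row, would correspond to iterating the entire map and filtering on itemId client side.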

On Tuesday, December 27, 2011, Aditya  wrote:
> I need to store data of all activities by user's followies in single row.
> I am trying to do that making use of composite column names in a single
> user specific row named 'rowX'.
> On any activity by a user's followie on an item, a column is stored in
> 'rowX'. The column has a composite type column name made up of
> itemId+userId (which makes it unique col. name) in rowX. (& column value
> contains the activity data related to that item by that followie)
>
> Now I want to retrieve activity by all users on a list of items. So I
> need to retrieve all composite columns with composite's first component
> matching the itemId. Is it possible to do such a query to Cassandra ? I am
> using Hector.


Re: Retrieve all composite columns from a row, whose composite name's first component matches from a list of Integers

2011-12-28 Thread Aditya
Since I have around 20 items to query, I guess making 20 queries to
retrieve activities by all followies on all of those 20 columns would be too
inefficient. So, to take advantage of more efficient queries, are
supercolumns recommended for this case? Anyway, in case I use
supercolumns, I need to retrieve the entire supercolumn at any point in
time, and I am writing subcolumn(s) to the supercolumn at different times, not
all at once.



Re: Retrieve all composite columns from a row, whose composite name's first component matches from a list of Integers

2011-12-28 Thread Edward Capriolo
Super columns have the same fundamental problem and perform worse in
general. So switching from composites to super columns is NEVER a good idea.




Re: Retrieve all composite columns from a row, whose composite name's first component matches from a list of Integers

2011-12-28 Thread Martin Arrowsmith
I believe this calls for Cassandra Cookbook 2nd edition :)



Bootstrap without initial token

2011-12-28 Thread Gabriel Ki
Hi,

I was getting a runtime exception "Chose token 0 which is already in use by
..." when bootstrapping without initial_token.  So looking at
StorageService getBootstrapToken(), if the ring has more than 2 nodes and
all nodes are up, do I always have to specify initial token when adding new
nodes?  Did I miss anything?

Thanks,
-gabe


cassandra 1.0.6 rpm

2011-12-28 Thread Shu Zhang
Hi, it looks like cassandra 1.0.6 was released a while ago, but I still don't 
see the rpm here: http://rpm.datastax.com/EL/5/i386/

Any idea when that will be out?

Thanks,
Shu

Re: Consistency Level

2011-12-28 Thread Peter Schuller
> exception "May not be enough replicas present to handle consistency level"

Check for mistakes in using getendpoints. Cassandra says Unavailable
when there are not enough live replicas *IN THE REPLICA SET FOR THE ROW KEY*
to satisfy the consistency level.

> I tried to read data using cassandra-cli but I am getting "null".

This is just cassandra-cli quirkiness IIRC; I think you get "null" on
exceptions.

> With consistency level ONE, I would assume that with just one node up and
> running (of course the one that has the data) I should get my data back. But
> this is not happening.

At least 1 node among the ones in the replica set of your row has to be up.
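
A small sketch of that rule, assuming a hypothetical AvailabilityCheck helper (this is not Cassandra's internal code). The consistency level names match Cassandra's, and the quorum arithmetic is the usual RF / 2 + 1. The point it illustrates is that only nodes in the row's replica set are counted; other live nodes in the cluster do not help.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class AvailabilityCheck {
    // Number of replicas that must respond for the given consistency level.
    static int required(String level, int replicationFactor) {
        switch (level) {
            case "ONE":
                return 1;
            case "QUORUM":
                return replicationFactor / 2 + 1;
            case "ALL":
                return replicationFactor;
            default:
                throw new IllegalArgumentException("unknown level: " + level);
        }
    }

    // True if enough replicas OF THIS ROW KEY are alive.
    // Live nodes outside the replica set are ignored.
    static boolean isAvailable(List<String> replicasForKey, Set<String> liveNodes, String level) {
        int live = 0;
        for (String replica : replicasForKey) {
            if (liveNodes.contains(replica)) {
                live++;
            }
        }
        return live >= required(level, replicasForKey.size());
    }
}
```

So with a replica set of {n1, n2} for a key and only n3 up, a read at ONE is Unavailable even though one node in the cluster is running.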

> Will the read repair happen automatically even if I read and write using the
> consistency level ONE?

Yes, assuming it's turned on.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


column family names

2011-12-28 Thread Scott Lewis
I've noticed when creating column families that the name of the column
family apparently has some restrictions, e.g. the presence of a '.'
character in the column family name seems to throw an exception.  Are
the restrictions on column family names (and keyspace names, if there
are any such restrictions) documented anywhere?  If so, where?


Thanks in advance,

Scott