Re: Storage question

2013-03-04 Thread Rustam Aliyev
Each storage system has its own purpose. While Cassandra would be good 
for metadata, depending on the size of the objects it may not be the best 
fit for the files themselves. You need something more like Amazon S3 for 
blob storage. Try Ceph RADOS or OpenStack Object Store, both of which 
provide an S3-compatible API.
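To illustrate that split, here is a minimal Java sketch (not production code): 
the blob is PUT to a hypothetical S3-compatible endpoint over plain HTTP, and 
only the returned key/URI is kept to be stored alongside the metadata row in 
Cassandra. Endpoint, bucket and key names are made up, and authentication 
(request signing) is omitted.

// Minimal sketch: PUT the blob to an S3-compatible endpoint and keep only a
// small reference to store alongside the metadata row in Cassandra. Endpoint,
// bucket and key naming are hypothetical; auth headers are omitted for brevity.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class BlobUpload {
    /** Uploads the payload and returns the blob URI to be stored as metadata. */
    static String putBlob(String endpoint, String bucket, String key, byte[] payload)
            throws Exception {
        URL url = new URL(endpoint + "/" + bucket + "/" + key);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload);
        }
        if (conn.getResponseCode() / 100 != 2) {
            throw new IllegalStateException("Upload failed: " + conn.getResponseCode());
        }
        // The returned URI (or just the key) is what goes into the Cassandra
        // metadata row; the object itself never touches Cassandra.
        return url.toString();
    }
}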


On 04/03/2013 19:34, Kanwar Sangha wrote:


Hi – Can someone suggest the optimal way to store files / images? We 
are planning to use Cassandra for the metadata for these files. HDFS is 
not good for small file sizes .. can we look at something else?

Thanks,

Kanwar





Re: CassandraFS in 1.0?

2011-07-12 Thread Rustam Aliyev

Hi David,

This is an interesting topic and it would be good to hear from 
someone who is using it in prod.


In particular - how does your fs implementation behave for medium/large 
files, e.g. > 1MB?


If you store large files, how large is your store per node and how does 
it handle compactions (any performance issues while compacting large data)?


It would also be interesting to hear some benchmarks and performance stats 
for reads/writes.


Regards,
Rustam.


On 12/07/2011 04:51, David Strauss wrote:

It's not, currently, but I'm happy to answer questions about its architecture.

On Thu, Jul 7, 2011 at 10:35, Norman Maurer wrote:

May I ask if it's open source by any chance?

bye
norman

On Thursday, 7 July 2011, David Strauss wrote:

I'm not sure HDFS has the right properties for a media-storage file
system. We have, however, built a WebDAV server on top of Cassandra
that avoids any pretension of being a general-purpose, POSIX-compliant
file system. We mount it on our servers using davfs2, which is also
nice for a few reasons:

* We can use standard HTTP load-balancing and dead host avoidance
strategies with WebDAV.
* Encrypting access and authenticating clients with PKI/HTTPS works seamlessly.
* WebDAV + davfs2 is etag-header aware, allowing clients to
efficiently validate cached items.
* HTTP is browser and CDN/reverse proxy cache friendly for
distributing content to people who don't need to mount the file
system.
* We could extend the server's support to allow connections from a
broad variety of interactive desktop clients.

On Wed, Jul 6, 2011 at 13:11, Joseph Stein  wrote:

Hey folks, I am going to start prototyping our media tier using Cassandra as
a file system (meaning upload video/audio/images to the web server, save them
in Cassandra and then stream them out).
Has anyone done this before?
I was thinking Brisk's CassandraFS might be a fantastic implementation for
this, but then I feel that I need to run another/different Cassandra cluster
outside of what our ops folks do with Apache Cassandra 0.8.X.
Am I best to just compress files uploaded to the web server and then start
chunking and saving chunks in rows and columns so the mem issue does not
smack me in the face? And use our existing cluster and build it out
accordingly?
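A minimal sketch of that chunking idea, assuming a fixed 1 MB chunk size and 
hypothetical "chunk-NNNNN" column names; a real implementation would write each 
entry as one column in the file's row via a Cassandra client rather than 
collecting all chunks in memory first.

// Minimal sketch (assumptions: 1 MB chunks, hypothetical "chunk-NNNNN" column
// names). Each map entry would be written as one column in the file's row; in
// practice you would stream chunk by chunk instead of collecting them all,
// but the map keeps the sketch short.
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class FileChunker {
    static final int CHUNK_SIZE = 1024 * 1024; // 1 MB per column

    static Map<String, byte[]> chunk(InputStream in) throws IOException {
        Map<String, byte[]> chunks = new LinkedHashMap<>();
        byte[] buf = new byte[CHUNK_SIZE];
        int n, index = 0;
        while ((n = in.read(buf)) != -1) {
            if (n > 0) {
                // Zero-padded column names keep the chunks ordered lexicographically.
                chunks.put(String.format("chunk-%05d", index++), Arrays.copyOf(buf, n));
            }
        }
        return chunks;
    }
}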
I am sure our ops people would like the command line aspect of CassandraFS,
but I am looking for something that makes the most sense all around.
It seems to me there is a REALLY great thing in CassandraFS and I would love
to see it as part of 1.0 =8^) or at a minimum some streamlined
implementation to do the same thing.
If comparing to HDFS, that is part of the Hadoop project even though Cloudera
has a distribution of Hadoop :) maybe that can work here too _fingers_crossed_
(or mongodb->gridfs).
Happy to help as I am moving down this road in general.
Thanks!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop
*/




--
David Strauss
| da...@davidstrauss.net
| +1 512 577 5827 [mobile]






Re: building a new email-like inbox service with cassandra

2011-11-17 Thread Rustam Aliyev

Hi Dotan,

We have already built something similar and were planning to open source 
it. It will be available under http://www.elasticinbox.com/.


We haven't followed IBM's paper exactly; we believe our Cassandra model 
design is more robust. It's written in Java and provides LMTP and REST 
interfaces. ElasticInbox also stores original messages outside of 
Cassandra, in a blob store.


Let me know if you are interested; I will need some time to do a cleanup.

Regards,
Rustam.

On 17/11/2011 14:17, Dotan N. wrote:

Hi all,
New to Cassandra, I'm about to embark on building a scalable user 
inbox service on top of Cassandra.
I've done the preliminary googling and got some more info on 
BlueRunner (IBM's project on the subject),
and now I'm looking for more information on this specific topic.

If anyone can point me to research/articles that would nudge me in 
the right direction I'd be tremendously thankful!


Thanks!

--
Dotan, @jondot 



Re: building a new email-like inbox service with cassandra

2011-11-18 Thread Rustam Aliyev
It's pleasing to see interest out there. We'll try to do some cleanups 
and push it to github this weekend.


You can follow us on twitter: @elasticinbox

Regards,
Rustam.

On Fri Nov 18 01:42:01 2011, Andrey V. Panov wrote:
I'm also interested in your project and will be glad to follow you on 
twitter if I can.


On 18 November 2011 00:37, Rustam Aliyev <rus...@code.az> wrote:


Hi Dotan,

We have already built something similar and were planning to open
source it. It will be available under http://www.elasticinbox.com/.

We haven't followed exactly IBM's paper, we believe our Cassandra
model design is more robust. It's written in Java and provides
LMTP and REST interfaces. ElasticInbox also stores original
messages outside of the Cassandra, in the blob store.

Let me know if you are interested, I will need some time to do
cleanup.

Regards,
Rustam.

On 17/11/2011 14:17, Dotan N. wrote:

Hi all,
New to Cassandra, I'm about to embark on building a scalable user
inbox service on top of Cassandra.
I've done the preliminary googling and got some more info on
bluerunner (IBM's project on the subject),
and now looking for more information in this specific topic.

If anyone can point me to researches/articles that would nudge me
in the right direction i'd be tremendously thankful!

Thanks!

-- 
Dotan, @jondot <http://twitter.com/jondot>






Re: building a new email-like inbox service with cassandra

2011-12-06 Thread Rustam Aliyev

Hi,

Just updating this thread:

We've pushed the initial version to GitHub today. You can find the sources, 
a binary package and some information here: 
https://github.com/elasticinbox/elasticinbox/wiki

Your feedback is most welcome. We can discuss it further on the 
elasticin...@googlegroups.com mailing list.


Regards,
Rustam.

On 18/11/2011 13:08, Dotan N. wrote:

Thanks!!
--
Dotan, @jondot <http://twitter.com/jondot>



On Fri, Nov 18, 2011 at 2:48 PM, Rustam Aliyev <rus...@code.az> wrote:


It's pleasing to see interest out there. We'll try to do some
cleanups and push it to github this weekend.

You can follow us on twitter: @elasticinbox

Regards,
Rustam.


On Fri Nov 18 01:42:01 2011, Andrey V. Panov wrote:

I'm also interested in your project and will be glad to
follow you on twitter if I can.

On 18 November 2011 00:37, Rustam Aliyev <rus...@code.az> wrote:

    Hi Dotan,

    We have already built something similar and were planning to open
    source it. It will be available under http://www.elasticinbox.com/.

    We haven't followed IBM's paper exactly; we believe our Cassandra
    model design is more robust. It's written in Java and provides
    LMTP and REST interfaces. ElasticInbox also stores original
    messages outside of Cassandra, in a blob store.

    Let me know if you are interested, I will need some time to do
    cleanup.

    Regards,
    Rustam.

    On 17/11/2011 14:17, Dotan N. wrote:

    Hi all,
    New to Cassandra, I'm about to embark on building a scalable user
    inbox service on top of Cassandra.
    I've done the preliminary googling and got some more info on
    BlueRunner (IBM's project on the subject),
    and now looking for more information on this specific topic.

    If anyone can point me to research/articles that would nudge me
    in the right direction I'd be tremendously thankful!

    Thanks!

    -- Dotan, @jondot <http://twitter.com/jondot>





Re: cassandra as an email store ...

2011-12-16 Thread Rustam Aliyev

Hi Sasha,

Replying to the old thread just for reference. We've released the code 
which we use to store emails in Cassandra as an open source project: 
http://elasticinbox.com/


Hope you find it helpful.

Regards,
Rustam.

On Fri Apr 29 15:20:07 2011, Sasha Dolgy wrote:

Great read.  thanks.

On Apr 29, 2011 4:07 PM, "sridhar basam" wrote:

> Have you already looked at some research out of IBM about this usecase?
> Paper is at
>
> http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
>
> Sridhar


Re: cassandra as an email store ...

2011-12-16 Thread Rustam Aliyev

Hi Sasha,

There's been a lot of FUD regarding SuperColumns, but actually in 
our case we found them quite useful.


The main argument for using SCs in this case is that message metadata is 
immutable and in most cases read and written all together (i.e. 
you fetch all message headers together). There are a few exceptions when 
we need to update only one column, e.g. when updating labels/markers. 
But the number of such requests shouldn't affect performance 
dramatically.
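To make that access pattern concrete, here is a small sketch with plain 
in-memory maps standing in for the super column family; the names are 
illustrative only, not the actual ElasticInbox schema.

// Sketch of the access pattern described above: metadata written once as a
// group, read as a group, with the rare single-subcolumn update for markers.
import java.util.HashMap;
import java.util.Map;

public class MessageMetadataModel {
    // row key (mailbox) -> super column name (message id) -> subcolumns (headers, flags)
    final Map<String, Map<String, Map<String, String>>> mailbox = new HashMap<>();

    /** Typical write: all metadata for a message stored together, once, immutably. */
    void putMessage(String mailboxKey, String messageId, Map<String, String> headers) {
        mailbox.computeIfAbsent(mailboxKey, k -> new HashMap<>()).put(messageId, headers);
    }

    /** Typical read: fetch the whole group of subcolumns (all headers) at once. */
    Map<String, String> getMessage(String mailboxKey, String messageId) {
        Map<String, Map<String, String>> messages = mailbox.get(mailboxKey);
        return messages == null ? null : messages.get(messageId);
    }

    /** The rarer case: update a single subcolumn, e.g. a label/marker change. */
    void setMarker(String mailboxKey, String messageId, String marker) {
        Map<String, String> headers = getMessage(mailboxKey, messageId);
        if (headers != null) headers.put("marker", marker);
    }
}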


Currently we have this model running in prod with 150K subs and 20M 
messages and we are quite happy with the performance.

It would be interesting to see other alternatives though.

Regards,
Rustam.

On Fri Dec 16 11:51:24 2011, Sasha Dolgy wrote:

Hi Rustam,

Thanks for posting that.

Interesting to see that you opted to use Super Columns:
https://github.com/elasticinbox/elasticinbox/wiki/Data-Model ..
wondering, for the sake of argument/discussion .. if anyone can come
up with an alternative data model that doesn't use SCs.

-sd

On Fri, Dec 16, 2011 at 11:10 AM, Rustam Aliyev  wrote:

Hi Sasha,

Replying to the old thread just for reference. We've released a code which
we use to store emails in Cassandra as an open source project:
http://elasticinbox.com/

Hope you find it helpful.

Regards,
Rustam.


On Fri Apr 29 15:20:07 2011, Sasha Dolgy wrote:


Great read.  thanks.


On Apr 29, 2011 4:07 PM, "sridhar basam" <s...@basam.org> wrote:

Have you already looked at some research out of IBM about this usecase?
Paper is at

http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

Sridhar






Re: What is the future of supercolumns ?

2012-01-07 Thread Rustam Aliyev
My suggestion is simple: don't use any deprecated stuff out there. In 
practically any case there is a good reason why it's deprecated.


SuperColumns are not deprecated.

On Sat Jan  7 19:51:55 2012, R. Verlangen wrote:
My suggestion is simple: don't use any deprecated stuff out there. In 
practically any case there is a good reason why it's deprecated.


I've seen a couple of composite-column vs supercolumn discussions in 
the past weeks here: I think a little bit of searching will get you 
around.


Cheers

2012/1/7 Aklin_81 <asdk...@gmail.com> wrote:

I read the entire columns inside the supercolumns at any time, but as for
writing them, I write the columns at different times. I don't have the
need to update them, except that they die after their TTL period of 60 days.
But since they are going to be deprecated, I don't know if it would be
really advisable to use them right now.

I believe if it was possible to do wildcard querying for a list of
column names then the supercolumn use cases could easily be replaced by
normal columns. Could that be practically possible, in the future?

On Sat, Jan 7, 2012 at 8:05 AM, Terje Marthinussen
<tmarthinus...@gmail.com> wrote:
> Please realize that I do not make any decisions here and I am not part
> of the core Cassandra developer team.
>
> What has been said before is that they will most likely go away and at
> least under the hood be replaced by composite columns.
>
> Jonathan has however stated that he would like the supercolumn
> API/abstraction to remain, at least for backwards compatibility.
>
> Please understand that under the hood, supercolumns are merely groups
> of columns serialized as a single block of data.
>
> The fact that there is a specialized and hardcoded way to serialize
> these column groups into supercolumns is a problem however, and they
> should probably go away to make space for a more generic implementation
> allowing more flexible data structures and less code specific to one
> special data structure.
>
> Today there is a ton of extra code to deal with the slight difference
> in serialization and features of supercolumns vs columns, and hopefully
> most of that could go away if things got structured a bit differently.
>
> I also hope that we keep APIs to allow simple access to groups of
> key/value pairs to simplify application logic, as working with just
> columns can add a lot of application code which should not be needed.
>
> If you almost always need all or mostly all of the columns in a
> supercolumn, and you normally update all of them at the same time, they
> will most likely be faster than normal columns.
>
> Processing wise, you will actually do a bit more work on
> serialization/deserialization of SCs, but the I/O part will usually be
> better grouped/require fewer operations.
>
> I think we did some benchmarks on some heavy use cases with ~30 small
> columns per SC some time back, and I think we ended up with SCs being
> 10-20% faster.
>
>
> Terje
>
> On Jan 5, 2012, at 2:37 PM, Aklin_81 wrote:
>
>> I have seen supercolumn usage being discouraged most of the time.
>> However, sometimes supercolumns seem to fit the scenario most
>> appropriately, not only in terms of how the data is stored but also in
>> terms of how it is retrieved. Some of the queries supported by SCs are
>> uniquely capable of doing the task which no other alternative schema
>> could do. (Like recently I asked about getting the equivalent of
>> retrieving a list of (full) supercolumns by name through the use of
>> composite columns; unfortunately there was no way to do this without
>> reading lots of extra columns.)
>>
>> So I am really confused whether:
>>
>> 1. Should I really not use supercolumns for any case at all, however
>> appropriate, or do I just need to be careful while making sure that
>> supercolumns fit my use case appropriately, or what!?
>>
>> 2. Are there any performance concerns with supercolumns even in the
>> cases where they are used most appropriately? Like when you need to
>> retrieve the entire supercolumn every time & the max number of
>> subcolumns varies between 0-10.
>> (I don't write all the subcolumns inside the supercolumn at once
>> though! Does this also matter?)
>>
>> 3. What is their future? Are they going to be deprecated or maybe
>> enhanced later?
>




Re: Deploying Cassandra 1.0.7 on EC2 in minutes

2012-01-18 Thread Rustam Aliyev

Hi Andrei,

As you know, we are using Whirr for ElasticInbox 
(https://github.com/elasticinbox/whirr-elasticinbox). While testing we 
encountered a few minor problems which I think could be improved. Note 
that we were using 0.6 (there was some strange bug in 0.7, maybe fixed 
already).


Although initial_token is pre-calculated to form a balanced cluster, our 
test cluster (4 nodes) was always unbalanced. There was no 
initial_token specified (just the default).


The second note is AWS specific - for performance reasons it's better to 
store data files on the ephemeral drive. Currently data is stored under the 
default location (/var/...).


Thanks for the great work!

--
Rustam.

On 18/01/2012 13:00, Andrei Savu wrote:

Hi guys,

I just want to let you know that Apache Whirr trunk (the upcoming 
0.7.1 release) can deploy Cassandra 1.0.7 on AWS EC2 & Rackspace Cloud.


You can give it a try by running the following commands:
https://gist.github.com/1632893

And one last thing - we would appreciate any suggestions on improving 
the deployment scripts or on improving Whirr.


Thanks,

-- Andrei Savu / andreisavu.ro 



Re: Deploying Cassandra 1.0.7 on EC2 in minutes

2012-01-19 Thread Rustam Aliyev

Great, will try 0.7.1 when it's ready.

(Bug I mentioned was already reported)

On 19/01/2012 13:15, Andrei Savu wrote:


On Wed, Jan 18, 2012 at 7:58 PM, Rustam Aliyev <rus...@code.az> wrote:


Hi Andrei,

As you know, we are using Whirr for ElasticInbox
(https://github.com/elasticinbox/whirr-elasticinbox). While
testing we encountered a few minor problems which I think could be
improved. Note that we were using 0.6 (there were some strange bug
in 0.7, maybe fixed already).


Please report any bugs you've found in 0.7. We are preparing a 0.7.1 
release to address them.



Although initial_token is pre-calculated to form balanced cluster,
our tests cluster (4 nodes) was always unbalanced. There were no
initial_token specified (just default).


We are now computing the value for initial_token - the cluster should 
be balanced with the trunk version.



Second note is AWS specific - for the performance reasons it's
better to store data files on ephemeral drive. Currently data
stored under default location (/var/...)


I think you can work around this by selecting an instance-store AMI.


Thanks for the great work!

--
Rustam.


On 18/01/2012 13:00, Andrei Savu wrote:

Hi guys,

I just want to the let you know that  Apache Whirr trunk (the
upcoming 0.7.1 release) can deploy Cassandra 1.0.7 on AWS EC2 &
Rackspace Cloud.

You can give it a try by running the following commands:
https://gist.github.com/1632893

And the last thing we would appreciate any suggestions on
improving the deployment scripts or on improving Whirr.

Thanks,

-- Andrei Savu / andreisavu.ro <http://andreisavu.ro>





Re: [RELEASE] Apache Cassandra 0.8.9 released

2012-01-26 Thread Rustam Aliyev

Hi,

I was just about to upgrade to the latest 0.8.x, but noticed that 
there's no RPM package for 0.8.9 on the DataStax repo. The latest is 0.8.8.


Any plans to publish 0.8.9 rpm?

--
Rustam

On 14/12/2011 19:59, Sylvain Lebresne wrote:

The Cassandra team is pleased to announce the release of Apache Cassandra
version 0.8.9.

Cassandra is a highly scalable second-generation distributed database,
bringing together Dynamo's fully distributed design and Bigtable's
ColumnFamily-based data model. You can read more here:

  http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

  http://cassandra.apache.org/download/

This version is a maintenance/bug fix release[1]. Please pay attention to the
release notes[2] before upgrading and let us know[3] if you were to encounter
any problem.

Have fun!


[1]: http://goo.gl/Kx7d0 (CHANGES.txt)
[2]: http://goo.gl/Tv2NW (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA



Re: [RELEASE] Apache Cassandra 0.8.10 released

2012-02-14 Thread Rustam Aliyev

No more RPMs from DataStax?

http://rpm.datastax.com/community/x86_64/

On Mon Feb 13 10:40:13 2012, Sylvain Lebresne wrote:

The Cassandra team is pleased to announce the release of Apache Cassandra
version 0.8.10.

Cassandra is a highly scalable second-generation distributed database,
bringing together Dynamo's fully distributed design and Bigtable's
ColumnFamily-based data model. You can read more here:

  http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

  http://cassandra.apache.org/download/

This version is a maintenance/bug fix release[1] for the 0.8 branch. Please
pay attention to the release notes[2] before upgrading and let us know[3] if
you were to encounter any problem.

Have fun!


[1]: http://goo.gl/V1M1q (CHANGES.txt)
[2]: http://goo.gl/AojHc (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA



Re: Please advise -- 750MB object possible?

2012-02-22 Thread Rustam Aliyev

Hi Maxim,

If you need to store blobs, then BlobStores such as OpenStack Object 
Store (aka Swift) should be a better choice.

As far as I know, MogileFS (which is also a sort of BlobStore) has a 
scalability bottleneck - MySQL.

There are a few reasons why BlobStores are a better choice. In the 
following presentation, I summarised why we chose to store blobs for 
ElasticInbox on BlobStores, not Cassandra: 
http://www.elasticinbox.com/blog/slides-and-video-from-london-meetup/

The main downside of BlobStores in comparison to Cassandra is write speed. 
Cassandra writes to memtables, BlobStores to disk.


-
Rustam.


On Wed Feb 22 22:19:26 2012, Maxim Potekhin wrote:

Thank you so much, looks nice, I'll be looking into it.


On 2/22/2012 3:08 PM, Rob Coli wrote:



On Wed, Feb 22, 2012 at 10:37 AM, Maxim Potekhin wrote:


The idea was to provide redundancy, resilience, automatic load
balancing
and automatic repairs. Going the way of the file system does not
achieve any of that.


(Apologies for continuing a slightly OT thread, but if people google 
and find this thread, I'd like it to contain the below relevant 
suggestion.. :D)


With the caveat that you would have to ensure that your client code 
streams instead of buffering the entire object, you probably want 
something like MogileFS :


http://danga.com/mogilefs/

I have operated a sizable MogileFS cluster for Digg, and it was one 
of the simplest, most comprehensible and least error prone parts of 
our infrastructure. A++ would run again.


--
=Robert Coli
rc...@palominodb.com 




Re: Adding node to Cassandra

2012-03-12 Thread Rustam Aliyev

Hi,

If you use SizeTieredCompactionStrategy, you should have 2x disk space 
to be on the safe side. So if you want to store 2TB of data, you need a 
partition size of at least 4TB. LeveledCompactionStrategy is available 
in 1.x and is supposed to require less free disk space (but comes at the 
price of I/O).


--
Rustam.

On 12/03/2012 09:23, Vanger wrote:
*We have a 4-node Cassandra cluster* with RF = 3 (nodes named from 'A' 
to 'D', initial tokens:

*A (25%)*: 20543402371996174596346065790779111550,
*B (25%)*: 63454860067234500516210522518260948578,
*C (25%)*: 106715317233367107622067286720208938865,
*D (25%)*: 150141183460469231731687303715884105728),
*and we want to add a 5th node* ('E') with initial token = 
164163260474281062972548100673162157075, then we want to rebalance the A, 
D, E nodes in such a way that they'll own an equal percentage of data. All 
nodes have ~400 GB of data and around ~300GB of free disk space.

What we did:
1. 'Join' the new Cassandra instance (node 'E') to the cluster and wait 'till 
it loads data for its token range.


2. Move node 'D' initial token down from 150... to 130...
Here we ran into a problem. When the "move" started, disk usage for node C 
grew from 400 to 750GB. We saw running compactions on node 'D', but 
some compactions failed with /"WARN [CompactionExecutor:580] 
2012-03-11 16:57:56,036 CompactionTask.java (line 87) insufficient 
space to compact all requested files SSTableReader"/. After that we 
killed the "move" process to avoid an "out of disk space" error (when 5GB of 
free space was left). After restart it freed 100GB of space and now we 
have a total of 105GB of free disk space on node 'D'. Also we noticed 
increased disk usage by ~150GB at node 'B', but it stopped growing before 
we stopped the "move token".



So now we have 5 nodes in the cluster, in a status like this:
Node, Owns%, Load, Init. token
A: 16%   400GB   020...
B: 25%   520GB   063...
C: 25%   400GB   106...
D: 25%   640GB   150...
E:  9%   300GB   164...

We'll add disk space for all nodes and run some cleanups, but there are 
still some questions left:

What is the best next step for us from this point?
What is the correct procedure after all, and what should we expect when 
adding a node to a Cassandra cluster?
We expected a decrease of used disk space on node 'D' because we shrank 
the token range for this node, but saw the opposite. Why did that happen, 
and is it normal behavior?
What if we have 2TB of data on a 2.5TB disk and we want to add 
another node and move tokens?
Is it possible to automate node addition to the cluster and be sure we 
won't run out of space?

Thanks.
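For reference, a minimal sketch of how evenly spaced initial_token values are 
typically computed for the RandomPartitioner (token i = i * 2^127 / N); the 
tokens in the cluster above start from an offset rather than zero, but the 
spacing between consecutive tokens follows the same idea.

// Sketch: evenly spaced initial_token values for RandomPartitioner
// (token range 0 .. 2^127). Spacing between consecutive tokens is 2^127 / N.
import java.math.BigInteger;

public class TokenCalc {
    public static void main(String[] args) {
        int nodeCount = 5; // e.g. after adding node 'E'
        BigInteger range = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodeCount; i++) {
            BigInteger token = range.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(nodeCount));
            System.out.println("node " + i + ": initial_token=" + token);
        }
    }
}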


Re: Adding node to Cassandra

2012-03-12 Thread Rustam Aliyev

What version of Cassandra do you have?

On 12/03/2012 11:38, Vanger wrote:
We were aware of the compaction overhead, but we still don't understand why 
that should happen: node 'D' was in a stable condition, had been working for 
at least a month, had all the data for its token range and was comfortable 
with that disk space.
Why does the node suddenly need 2x more space for data it already has? Why 
does decreasing the token range not lead to decreasing disk usage?


On 12.03.2012 15:14, Rustam Aliyev wrote:

Hi,

If you use SizeTieredCompactionStrategy, you should have x2 disk 
space to be on the safe side. So if you want to store 2TB data, you 
need partition size of 4TB at least.  LeveledCompactionStrategy is 
available in 1.x and supposed to require less free disk space (but 
comes at price of I/O).


--
Rustam.

On 12/03/2012 09:23, Vanger wrote:
*We have cassandra 4 nodes cluster* with RF = 3 (nodes named from 
'A' to 'D', initial tokens:

*A (25%)*: 20543402371996174596346065790779111550, *
B (25%)*: 63454860067234500516210522518260948578,
*C (25%)*: 106715317233367107622067286720208938865,
*D (25%)*: 150141183460469231731687303715884105728),
*and want to add 5th node* ('E') with initial token = 
164163260474281062972548100673162157075,  then we want to rebalance 
A, D, E nodes such way they'll own equal percentage of data. All 
nodes have ~400 GB of data and around ~300GB disk free space.

What we did:
1. 'Join' new cassandra instance (node 'E') to cluster and wait 
'till it loads data for it tokens range.


2. Move node 'D' initial token down from 150... to 130...
Here we ran into a problem. When "move" started disk usage for node 
C grows from 400 to 750GB, we saw running compactions on node 'D' 
but some compactions failed with /"WARN [CompactionExecutor:580] 
2012-03-11 16:57:56,036 CompactionTask.java (line 87) insufficient 
space to compact all requested files SSTableReader"/ after that we 
killed "move" process to avoid "out of disk space" error (when 5GB 
of free space left). After restart it frees 100GB of space and now 
we have total of 105GB free disk space on node 'D'. Also we noticed 
increased disk usage by ~150GB at node 'B' but it stops growing 
before we stopped "move token".



So now we have 5 nodes in cluster in status like this:
Node, Owns%, Load, Init. token
A: 16%   400GB   020...
B: 25%   520GB   063...
C: 25%   400GB   106...
D: 25%   640GB   150...
E:  9%   300GB   164...

We'll add disk space for all nodes and run some cleanups, but 
there's still left some questions:


What is the best next step  for us from this point?
What is correct procedure after all and what should we expect when 
adding node to cassandra cluster?
We expected decrease of used disk space on node 'D' 'cause we shrink 
token range for this node, but saw the opposite, why it happened and 
is it normal behavior?
What if we'll have 2TB of data on 2.5TB disk and we wanted to add 
another node and move tokens?
Is it possible to automate node addition to cluster and be sure we 
won't run out of space?


Thank.




Re: Adding node to Cassandra

2012-03-12 Thread Rustam Aliyev
It's hard to answer this question because there is a whole bunch of 
operations which may cause disk usage growth - repair, compaction, move, 
etc. Any combination of these operations will only make things worse. 
But let's assume that in your case the only operation increasing disk 
usage was "move".

Simply speaking, "move" does not move data from one node to another, it 
just copies data. Once the data is copied, you need to clean up the data 
the node is no longer responsible for, using the "cleanup" command.

If you can't increase storage, maybe you can try moving nodes slowly. 
I.e. instead of moving node D from 150... to 130..., try going first to 
140..., cleanup, and then from 140... to 130... However, I never tried 
this and can't guarantee that it will use less disk space.

In the past, someone reported a x2.5 increase when they went from 4 nodes 
to 5.


--
Rustam.

On 12/03/2012 12:46, Vanger wrote:

Cassandra v1.0.8
once again: 4-nodes cluster, RF = 3.


On 12.03.2012 16:18, Rustam Aliyev wrote:

What version of Cassandra do you have?

On 12/03/2012 11:38, Vanger wrote:
We were aware of compaction overhead, but still don't understand why 
that shall happened: node 'D' was in stable condition, works for at 
least month, had all data for its token range and was comfortable 
with such disk space.
Why suddenly node needs 2x more space for data it already have? Why 
decreasing token range not lead to decreasing disk usage?


On 12.03.2012 15:14, Rustam Aliyev wrote:

Hi,

If you use SizeTieredCompactionStrategy, you should have x2 disk 
space to be on the safe side. So if you want to store 2TB data, you 
need partition size of 4TB at least.  LeveledCompactionStrategy is 
available in 1.x and supposed to require less free disk space (but 
comes at price of I/O).


--
Rustam.

On 12/03/2012 09:23, Vanger wrote:
*We have cassandra 4 nodes cluster* with RF = 3 (nodes named from 
'A' to 'D', initial tokens:

*A (25%)*: 20543402371996174596346065790779111550, *
B (25%)*: 63454860067234500516210522518260948578,
*C (25%)*: 106715317233367107622067286720208938865,
*D (25%)*: 150141183460469231731687303715884105728),
*and want to add 5th node* ('E') with initial token = 
164163260474281062972548100673162157075,  then we want to 
rebalance A, D, E nodes such way they'll own equal percentage of 
data. All nodes have ~400 GB of data and around ~300GB disk free 
space.

What we did:
1. 'Join' new cassandra instance (node 'E') to cluster and wait 
'till it loads data for it tokens range.


2. Move node 'D' initial token down from 150... to 130...
Here we ran into a problem. When "move" started disk usage for 
node C grows from 400 to 750GB, we saw running compactions on node 
'D' but some compactions failed with /"WARN 
[CompactionExecutor:580] 2012-03-11 16:57:56,036 
CompactionTask.java (line 87) insufficient space to compact all 
requested files SSTableReader"/ after that we killed "move" 
process to avoid "out of disk space" error (when 5GB of free space 
left). After restart it frees 100GB of space and now we have total 
of 105GB free disk space on node 'D'. Also we noticed increased 
disk usage by ~150GB at node 'B' but it stops growing before we 
stopped "move token".



So now we have 5 nodes in cluster in status like this:
Node, Owns%, Load, Init. token
A: 16%   400GB   020...
B: 25%   520GB   063...
C: 25%   400GB   106...
D: 25%   640GB   150...
E:  9%   300GB   164...

We'll add disk space for all nodes and run some cleanups, but 
there's still left some questions:


What is the best next step  for us from this point?
What is correct procedure after all and what should we expect when 
adding node to cassandra cluster?
We expected decrease of used disk space on node 'D' 'cause we 
shrink token range for this node, but saw the opposite, why it 
happened and is it normal behavior?
What if we'll have 2TB of data on 2.5TB disk and we wanted to add 
another node and move tokens?
Is it possible to automate node addition to cluster and be sure we 
won't run out of space?


Thank.






Re: counter columns question

2012-03-15 Thread Rustam Aliyev

  
  
No, it's not possible.

On 15/03/2012 10:53, Tamar Fraenkel wrote:

Watched the video, really good!

One question:
I wonder if it is possible to mix counter columns in Cassandra 1.0.7 with 
regular columns in the same CF.

Even if it is possible, I am not sure I should mix them, as I use Hector 
and this means I will have to use both Hector and CQL.
Am I right?
Thanks,

Tamar Fraenkel
Senior Software Engineer, TOK Media

ta...@tok-media.com
Tel:  +972 2 6409736
Mob:  +972 54 8356490
Fax:  +972 2 5612956


On Wed, Mar 14, 2012 at 1:52 PM, Alain RODRIGUEZ wrote:

Here you have a good explanation of how the counters work.

Video: http://blip.tv/datastax/counters-in-cassandra-5497678
Slides: http://www.datastax.com/wp-content/uploads/2011/07/cassandra_sf_counters.pdf

I think it will answer your question and help you understand interesting 
things about counters.

If you need a short answer: the concurrency will be handled nicely and 
you'll have the expected result.

Alain

2012/3/14 Tamar Fraenkel <ta...@tok-media.com> wrote:

Hi!

My apologies for the naive question, but I am new to Cassandra and even 
more to counter columns.

If I have a counter column, what happens if two threads are trying to 
increment its value at the same time for the same key?

Thanks,

Tamar Fraenkel
Senior Software Engineer, TOK Media

ta...@tok-media.com
Tel:  +972 2 6409736
Mob:  +972 54 8356490
Fax:  +972 2 5612956


Deleting from SCF wide row makes node unresponsive

2012-06-02 Thread Rustam Aliyev
 Hi all,

I have an SCF with ~250K rows. One of these rows is relatively large - it's a
wide row (according to compaction logs) containing ~100,000 super columns
with an overall size of 1GB. Each super column has an average size of 10K and
~10 sub columns.

When I'm trying to delete ~90% of the columns in this particular row, the
Cassandra nodes which own this wide row (3 of 5, RF=3) quickly run out of
heap space. See logs from one of the hosts here:

http://pastebin.com/raw.php?i=kwn7b3rP

After that, all 3 nodes start flapping up/down and GC messages (like the
one at the bottom of the pastebin above) appear in the logs. Cassandra
never recovers from this mode and the only way out is to "kill -9" and start
again. On IRC it was suggested that it enters a GC death spiral.

I tried to throttle delete requests on the client side - sending a batch of
100 delete requests every 500ms, so no more than 200 deletes/sec. But it
didn't help. I can reduce it further to 100/sec, but I don't think it will
help much.
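A minimal sketch of that kind of client-side throttling is below; the
deleteBatch() interface is hypothetical and stands in for the real client call
(e.g. a batch mutation), the point is just the batch size and the fixed pause.

// Minimal sketch of client-side throttling: batches of 100 deletes with a
// 500 ms pause between batches (~200 deletes/sec).
import java.util.List;

public class ThrottledDeleter {
    interface BatchDeleter {
        void deleteBatch(List<String> columnNames) throws Exception;
    }

    static void deleteThrottled(List<String> allColumns, BatchDeleter deleter,
                                int batchSize, long pauseMillis) throws Exception {
        for (int i = 0; i < allColumns.size(); i += batchSize) {
            List<String> batch =
                allColumns.subList(i, Math.min(i + batchSize, allColumns.size()));
            deleter.deleteBatch(batch);   // one batch mutation per call
            Thread.sleep(pauseMillis);    // cap the delete rate
        }
    }

    // Usage (hypothetical): deleteThrottled(columns, myDeleter, 100, 500);
}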

I delete millions of columns from another row in this SCF at the same rate
and have never hit this problem. It only happens when I try to delete from
this particular wide row.

So right now I don't know how I can delete these columns. Any ideas?


Many thanks,
Rustam.


Re: Can't delete from SCF wide row

2012-06-10 Thread Rustam Aliyev

Hi Aaron,

Thanks for the reply. I did some more tests and it looks like the problem is 
not in the deletes/writes, it's rather in the reads (I do a read before deleting).


It turns out that the problem was in another CF which had a wide row of 1.2GB 
and the row cache enabled. Cassandra tries to read this row into the cache and 
becomes unresponsive. Disabling the row cache on this CF helped to read through 
this row and perform the cleanup. It seems that Cassandra reads all columns 
into the cache, even those which were deleted (w/ tombstones) but not yet GCed.

It seems that CASSANDRA-2864 
<https://issues.apache.org/jira/browse/CASSANDRA-2864> and 
CASSANDRA-1956 <https://issues.apache.org/jira/browse/CASSANDRA-1956> 
were opened to address this problem.


Best,
Rustam.


On 04/06/2012 19:41, aaron morton wrote:

Delete is a no look write operation, like normal writes. So it should not be 
directly causing a lot of memory allocation.

It may be causing a lot of compaction activity, which due to the wide row may 
be throwing up lots of GC.

Try the following to get through the deletions:

* disable compaction by setting min_compaction_level and max_compaction_level 
to 0 (via nodetool on current versions)

Once you have finished compaction
* lower the in_memory_compaction_limit in the yaml.
* set concurrent_compactions to 2 in the yaml
* enable compaction again

Once everything has settled down restore the in_memory_compaction_limit and 
concurrent_compactions

Hope that helps.

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 2/06/2012, at 7:53 AM, Rustam Aliyev wrote:


Hi all,

I have SCF with ~250K rows. One of these rows is relatively large - it's a wide 
row (according to compaction logs) containing ~100.000 super columns and 
overall size of 1GB. Each super column has average size of 10K and ~10 sub 
columns.

When I'm trying to delete ~90% of the columns in this particular row, Cassandra 
nodes which own this wide row (3 of 5, RF=3) quickly run out of the heap space. 
See logs from one of the hosts here:

http://pastebin.com/raw.php?i=kwn7b3rP

After that, all 3 nodes start flapping up/down and GC messages (like the one in the 
bottom of the pastebin above) appearing in the logs. Cassandra never repairs from this 
mode and the only way out if to "kill -9" and start again. On IRC it was 
suggested that it enters GC death spiral.

I tried to throttle delete requests on the client side - sending batch of 100 
delete requests each 500ms. So no more than 200 deletes/sec. But it didn't 
help. I can reduce it further to 100/sec, but I don't think it will help much.

I delete millions of columns from other row in this SCF at the same rate and 
never have hit this problem. It only happens when I try to delete from this 
particular wide row.

So right now I don't know how can I delete these columns. Any ideas?


Many thanks,
Rustam.




Re: Ball is rolling on High Performance Cassandra Cookbook second edition

2012-06-27 Thread Rustam Aliyev

Hi Edward,

That's a great news!

One thing I'd like to see in the new edition is Counters: known issues 
and how to avoid them, for example:
 - avoid double counting (don't retry on failure, use write consistency 
level ONE, use a dedicated Hector connector?) - see the sketch below
 - delete counters (tricky, reset to zero?)
 - other tips and tricks

I personally had (and still have to some extent) problems with 
maintaining counter accuracy.
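To illustrate the first point in the list above, here is a minimal sketch of a 
single-attempt counter increment; the CounterClient interface is hypothetical 
and stands in for the actual client call (e.g. a Hector counter mutation).

// Sketch of the "don't retry on failure" point: a counter increment that is
// attempted once, and on failure is recorded for later reconciliation instead
// of being retried (a retry could double-count if the first write landed).
public class SafeCounterIncrement {
    interface CounterClient {
        void increment(String key, String column, long by) throws Exception;
    }

    static void incrementOnce(CounterClient client, String key, String column, long by) {
        try {
            client.increment(key, column, by);   // single attempt
        } catch (Exception timeoutOrFailure) {
            // Do NOT retry blindly: the write may have been applied despite the error.
            // Instead, log the suspect increment so it can be reconciled out of band.
            System.err.println("suspect counter increment: " + key + "/" + column + " +" + by);
        }
    }
}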


Best,
Rustam.


On 26/06/2012 22:25, Edward Capriolo wrote:

Hello all,

It has not been very long since the first book was published but
several things have been added to Cassandra and a few things have
changed. I am putting together a list of changed content, for example
features like the old per Column family memtable flush settings versus
the new system with the global variable.

My editors have given me the green light to grow the second edition
from ~200 pages currently up to 300 pages! This gives us the ability
to add more items/sections to the text.

Some things were missing from the first edition such as Hector
support. Nate has offered to help me in this area. Please feel free to contact
me with any ideas and suggestions of recipes you would like to see in
the book. Also get in touch if you want to write a recipe. Several
people added content to the first edition and it would be great to see
that type of participation again.

Thank you,
Edward





Cassandra 1.0.10 to 1.2.3 upgrade "post-mortem"

2013-04-02 Thread Rustam Aliyev

Hi,

I just wanted to share our experience of upgrading 1.0.10 to 1.2.3. It 
happened that we first upgraded both of our two seeds to 1.2.3, and 
basically after that the old nodes couldn't communicate with the new ones 
anymore. The cluster was down until we upgraded all nodes to 1.2.3. We don't 
have many nodes and that process didn't take long. Yet it caused an outage 
for ~10 mins.


Here are some logs:

On the new, freshly upgraded seed node (v1.2.3):

ERROR [OptionalTasks:1] 2013-03-31 08:48:19,370 CassandraDaemon.java 
(line 164) Exception in thread Thread[OptionalTasks:1,5,main]

java.lang.NullPointerException
at 
org.apache.cassandra.service.MigrationManager$1.run(MigrationManager.java:137)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
 WARN [MutationStage:20] 2013-03-31 08:48:23,613 StorageProxy.java 
(line 577) Unable to store hint for host with missing ID, /10.0.1.8 (old 
node?)




ERROR [MutationStage:33] 2013-03-31 09:00:02,899 CassandraDaemon.java 
(line 164) Exception in thread Thread[MutationStage:33,5,main]

java.lang.AssertionError: Missing host ID for 10.0.1.8
at 
org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:580)
at 
org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:555)
at 
org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1643)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)



At the same time, old nodes (v1.0.10) were blinded:


ERROR [RequestResponseStage:441] 2013-03-31 09:04:07,955 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[RequestResponseStage:441,5,main]

java.io.IOError: java.io.EOFException
at 
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)

at org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:132)
at 
org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at 
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
at 
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
at 
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)

... 6 more

.

 INFO [GossipStage:3] 2013-03-31 09:06:08,885 Gossiper.java (line 804) 
InetAddress /10.0.1.8 is now UP
ERROR [GossipStage:3] 2013-03-31 09:06:08,885 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[GossipStage:3,5,main]

java.lang.UnsupportedOperationException: Not a time-based UUID
at java.util.UUID.timestamp(UUID.java:308)
at 
org.apache.cassandra.service.MigrationManager.updateHighestKnown(MigrationManager.java:121)
at 
org.apache.cassandra.service.MigrationManager.rectify(MigrationManager.java:99)
at 
org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:83)

at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:806)
at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:923)
at 
org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
ERROR [GossipStage:3] 2013-03-31 09:06:08,886 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[GossipStage:3,5,main]

java.lang.UnsupportedOperationException: Not a time-based UUID
at java.util.UUID.ti

Re: Cassandra 1.0.10 to 1.2.3 upgrade "post-mortem"

2013-04-04 Thread Rustam Aliyev

On 04/04/2013 02:24, aaron morton wrote:

I just wanted to share our experience of upgrading 1.0.10 to 1.2.3

In general it's dangerous to skip a major release when upgrading.


True. But in that case it was supposed to be fine.
ERROR [MutationStage:33] 2013-03-31 09:00:02,899 CassandraDaemon.java 
(line 164) Exception in thread Thread[MutationStage:33,5,main]

java.lang.AssertionError: Missing host ID for 10.0.1.8

Had 10.0.1.8 been updated?
IIRC not at this stage. 10.0.1.8 was the second seed server (at that moment 
1.0.10) and this particular error appeared on the first seed server 
after the upgrade to 1.2.3.


Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 3/04/2013, at 4:09 AM, Rustam Aliyev <rustam.li...@code.az> wrote:



Hi,

I just wanted to share our experience of upgrading 1.0.10 to 1.2.3. 
It happened that first we upgraded both of our two seeds to 1.2.3. 
And basically after that old nodes couldn't communicate with new ones 
anymore. Cluster was down until we upgraded all nodes to 1.2.3. We 
don't have many nodes and that process didn't took long. Yet it 
caused outage for ~10 mins.


Here are some logs:

On the new, freshly upgraded seed node (v1.2.3):

ERROR [OptionalTasks:1] 2013-03-31 08:48:19,370 CassandraDaemon.java 
(line 164) Exception in thread Thread[OptionalTasks:1,5,main]

java.lang.NullPointerException
at 
org.apache.cassandra.service.MigrationManager$1.run(MigrationManager.java:137)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
WARN [MutationStage:20] 2013-03-31 08:48:23,613 StorageProxy.java 
(line 577) Unable to store hint for host with missing ID, /10.0.1.8 
(old node?)




ERROR [MutationStage:33] 2013-03-31 09:00:02,899 CassandraDaemon.java 
(line 164) Exception in thread Thread[MutationStage:33,5,main]

java.lang.AssertionError: Missing host ID for 10.0.1.8
at 
org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:580)
at 
org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:555)
at 
org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1643)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)



At the same time, old nodes (v1.0.10) were blinded:


ERROR [RequestResponseStage:441] 2013-03-31 09:04:07,955 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[RequestResponseStage:441,5,main]

java.io.IOError: java.io.EOFException
at 
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
at 
org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:132)
at 
org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:45)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at 
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:100)
at 
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:81)
at 
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)

... 6 more

.

INFO [GossipStage:3] 2013-03-31 09:06:08,885 Gossiper.java (line 804) 
InetAddress /10.0.1.8 is now UP
ERROR [GossipStage:3] 2013-03-31 09:06:08,885 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[GossipStage:3,5,main]

java.lang.UnsupportedOperationException: Not a time-based UUID
at java.util.UUID.timestamp(UUID.java:308)
at 
org.apache.cassandra.service.MigrationManager.updateHighestKnown(MigrationManager.java:121)
at 
org.apache.cassandra.service.MigrationManager.rectify

Problems with shuffle

2013-04-07 Thread Rustam Aliyev

Hi,

After upgrading to vnodes I created and enabled the shuffle operation as 
suggested. After running it for a couple of hours I had to disable it 
because the nodes were not catching up with compactions. I repeated this 
process 3 times (enable/disable).

I have 5 nodes and each of them had ~35GB. After the shuffle operations 
described above some nodes are now reaching ~170GB. In the log files I 
can see the same files transferred 2-4 times to the same host within the 
same shuffle session. Worst of all, after all of this I had only 20 
vnodes transferred out of 1280. So if it continues at the same speed 
it will take about a month or two to complete the shuffle.


I had a few questions to better understand shuffle:

1. Does disabling and re-enabling shuffle start the shuffle process from
   scratch or does it resume from the last point?

2. Will vnode reallocations speed up as the shuffle proceeds or will the
   rate remain the same?

3. Why do I see multiple transfers of the same file to the same host? e.g.:

   INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038
   StreamReplyVerbHandler.java (line 44) Successfully sent
   /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
   to /10.0.1.8
   INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427
   StreamReplyVerbHandler.java (line 44) Successfully sent
   /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
   to /10.0.1.8

4. When I enable/disable shuffle I receive warning messages such as the
   ones below. Do I need to worry about them?

   cassandra-shuffle -h localhost disable
   Failed to enable shuffling on 10.0.1.1!
   Failed to enable shuffling on 10.0.1.3!

I couldn't find many docs on shuffle; I only read through JIRA and the 
original proposal by Eric.


BR,
Rustam.



Re: Problems with shuffle

2013-04-08 Thread Rustam Aliyev
After 2 days of endless compactions and streaming I had to stop this and 
cancel the shuffle. One of the nodes even complained that there was no free 
disk space (it grew from 30GB to 400GB). After all these problems the number 
of moved tokens was less than 40 (out of 1280!).


Now, when the nodes start they report duplicate ranges. I wonder how bad 
that is and how I can get rid of it?


 INFO [GossipStage:1] 2013-04-09 02:16:37,920 StorageService.java (line 
1386) Nodes /10.0.1.2 and /10.0.1.1 have the same token 
99027485685976232531333625990885670910.  Ignoring /10.0.1.2
 INFO [GossipStage:1] 2013-04-09 02:16:37,921 StorageService.java (line 
1386) Nodes /10.0.1.2 and /10.0.1.4 have the same token 
4319990986300976586937372945998718.  Ignoring /10.0.1.2


Overall, I'm not sure how bad it is to leave the data unshuffled (I read the 
DataStax blog post, it's not clear). When adding a new node, wouldn't it be 
assigned ranges randomly from all nodes?


Some other notes inline below:

On 08/04/2013 15:00, Eric Evans wrote:

[ Rustam Aliyev ]

Hi,

After upgrading to the vnodes I created and enabled shuffle
operation as suggested. After running for a couple of hours I had to
disable it because nodes were not catching up with compactions. I
repeated this process 3 times (enable/disable).

I have 5 nodes and each of them had ~35GB. After shuffle operations
described above some nodes are now reaching ~170GB. In the log files
I can see same files transferred 2-4 times to the same host within
the same shuffle session. Worst of all, after all of these I had
only 20 vnodes transferred out of 1280. So if it will continue at
the same speed it will take about a month or two to complete
shuffle.

As Edward says, you'll need to issue a cleanup post-shuffle if you expect
to see disk usage match your expectations.


I had few question to better understand shuffle:

1. Does disabling and re-enabling shuffle starts shuffle process from
scratch or it resumes from the last point?

It resumes.


2. Will vnode reallocations speedup as shuffle proceeds or it will
remain the same?

The shuffle proceeds synchronously, 1 range at a time; It's not going to
speed up as it progresses.


3. Why I see multiple transfers of the same file to the same host? e.g.:

INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038
StreamReplyVerbHandler.java (line 44) Successfully sent
/u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
to /10.0.1.8
INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427
StreamReplyVerbHandler.java (line 44) Successfully sent
/u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
to /10.0.1.8

I'm not sure, but perhaps that file contained data for two different
ranges?
Does it mean that if I have a huge file (e.g. 20GB) which contains a lot of 
ranges (let's say 100) it will be transferred once per range (20GB*100)?



4. When I enable/disable shuffle I receive warning message such as
below. Do I need to worry about it?

cassandra-shuffle -h localhost disable
Failed to enable shuffling on 10.0.1.1!
Failed to enable shuffling on 10.0.1.3!

Is that the verbatim output?  Did it report failing to enable when you
tried to disable?
Yes, this is the verbatim output. It reports a failure for enable as well as 
disable. Nodes .1.1 and .1.3 were not RELOCATING unless I ran the 
cassandra-shuffle enable command on them locally.


As a rule of thumb though, you don't want an disable/enable to result in
only a subset of nodes shuffling.  Are there no other errors?  What do
the logs say?

No errors in logs. Only INFO about streams and WARN about relocation.



I couldn't find many docs on shuffle, only read through JIRA and
original proposal by Eric.




Re: Problems with shuffle

2013-04-13 Thread Rustam Aliyev
Just a followup on this issue. Due to the cost of shuffle, we decided 
not to do it. Recently, we added a new node and ended up with a not very 
well balanced cluster:


Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
UN  10.0.1.11  55.21 GB  256     9.4%   7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1


It looks like the new node took an equal number of vnodes from each other 
node - which is good. However, it's not clear why it ended up owning about 
half as much as the other nodes.


How exactly does Cassandra with vnodes decide how many vnodes to move?

Btw, during JOINING the nodetool status command does not show any 
information about the joining node. It appears only when the join has 
finished (on v1.2.3).


-- Rustam


On 08/04/2013 22:33, Rustam Aliyev wrote:
After 2 days of endless compactions and streaming I had to stop this 
and cancel shuffle. One of the nodes even complained that there's no 
free disk space (grew from 30GB to 400GB). After all these problems 
number of the moved tokens were less than 40 (out of 1280!).


Now, when nodes start they report duplicate ranges. I wonder how bad 
is that and how do I get rid of that?


 INFO [GossipStage:1] 2013-04-09 02:16:37,920 StorageService.java 
(line 1386) Nodes /10.0.1.2 and /10.0.1.1 have the same token 
99027485685976232531333625990885670910.  Ignoring /10.0.1.2
 INFO [GossipStage:1] 2013-04-09 02:16:37,921 StorageService.java 
(line 1386) Nodes /10.0.1.2 and /10.0.1.4 have the same token 
4319990986300976586937372945998718.  Ignoring /10.0.1.2


Overall, I'm not sure how bad it is to leave data unshuffled (I read 
DataStax blog post, not clear). When adding new node wouldn't it be 
assigned ranges randomly from all nodes?


Some other notes inline below:

On 08/04/2013 15:00, Eric Evans wrote:

[ Rustam Aliyev ]

Hi,

After upgrading to the vnodes I created and enabled shuffle
operation as suggested. After running for a couple of hours I had to
disable it because nodes were not catching up with compactions. I
repeated this process 3 times (enable/disable).

I have 5 nodes and each of them had ~35GB. After the shuffle operations
described above some nodes are now reaching ~170GB. In the log files
I can see the same files transferred 2-4 times to the same host within
the same shuffle session. Worst of all, after all of this only
20 vnodes out of 1280 were transferred. So if it continues at
the same speed it will take about a month or two to complete the
shuffle.
As Edward says, you'll need to issue a cleanup post-shuffle if you expect
to see disk usage match your expectations.
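
For reference, the post-shuffle cleanup mentioned here is the standard
nodetool operation, run against each node once its ranges have moved;
the keyspace argument below is optional and only illustrative:

    nodetool -h 10.0.1.1 cleanup            # all keyspaces
    nodetool -h 10.0.1.1 cleanup Keyspace   # or a single keyspace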


I had a few questions to better understand shuffle:

1. Does disabling and re-enabling shuffle start the shuffle process from
scratch, or does it resume from the last point?

It resumes.


2. Will vnode reallocations speed up as the shuffle proceeds, or will
the rate remain the same?

The shuffle proceeds synchronously, 1 range at a time; It's not going to
speed up as it progresses.

3. Why do I see multiple transfers of the same file to the same host? 
e.g.:


INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038
StreamReplyVerbHandler.java (line 44) Successfully sent
/u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
to /10.0.1.8
INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427
StreamReplyVerbHandler.java (line 44) Successfully sent
/u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
to /10.0.1.8

I'm not sure, but perhaps that file contained data for two different
ranges?
Does that mean that if I have a huge file (e.g. 20GB) which contains a lot 
of ranges (let's say 100), it will be transferred once per range (20GB*100)?



4. When I enable/disable shuffle I receive warning messages such as the
ones below. Do I need to worry about them?

cassandra-shuffle -h localhost disable
Failed to enable shuffling on 10.0.1.1!
Failed to enable shuffling on 10.0.1.3!

Is that the verbatim output?  Did it report failing to enable when you
tried to disable?
Yes, this is verbatim output. It reports failure for enable as well as 
disable. Nodes .1.1 and .1.3 were not RELOCATING unless I ran the 
cassandra-shuffle enable command on them locally.


As a rule of thumb though, you don't want a disable/enable to result in
only a subset of nodes shuffling.  Are there no other errors?  What do
the logs say?

No errors in logs. Only INFO about streams and WARN about relocation.

Re: Problems with shuffle

2013-04-14 Thread Rustam Aliyev

How exactly does Cassandra with vnodes decide how many vnodes to move?

The num_tokens setting in the yaml file. What did you set this to?

256, same as on all other nodes.
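
For completeness, that setting lives in cassandra.yaml and must be in
place before a node bootstraps; a minimal excerpt with the value
discussed in this thread (initial_token is left unset when vnodes are
in use):

    # cassandra.yaml
    num_tokens: 256
    # initial_token: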



Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/04/2013, at 11:56 AM, Rustam Aliyev  wrote:


Just a follow-up on this issue. Due to the cost of shuffle, we decided not to do 
it. Recently, we added a new node and ended up with a poorly balanced cluster:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  10.0.1.8   52.28 GB  260     18.3%  d28df6a6-c888-4658-9be1-f9e286368dce  rack1
UN  10.0.1.11  55.21 GB  256     9.4%   7b0cf3c8-0c42-4443-9b0c-68f794299443  rack1
UN  10.0.1.2   49.03 GB  259     17.9%  2d308bc3-1fd7-4fa4-b33f-cbbbdc557b2f  rack1
UN  10.0.1.4   48.51 GB  255     18.4%  c253dcdf-3e93-495c-baf1-e4d2a033bce3  rack1
UN  10.0.1.1   67.14 GB  253     17.9%  4f77fd70-b134-486b-9c25-cfea96b6d412  rack1
UN  10.0.1.3   47.65 GB  253     18.0%  4d03690d-5363-42c1-85c2-5084596e09fc  rack1

It looks like the new node took an equal number of vnodes from each of the other 
nodes - which is good. However, it's not clear why it ended up owning only about 
half as much as the other nodes.

How exactly does Cassandra with vnodes decide how many vnodes to move?

Btw, during JOINING the nodetool status command does not show any information 
about the joining node. It appears only once the join has finished (on v1.2.3).

-- Rustam


On 08/04/2013 22:33, Rustam Aliyev wrote:

After 2 days of endless compactions and streaming I had to stop this and cancel 
the shuffle. One of the nodes even complained that there was no free disk space 
(it grew from 30GB to 400GB). After all these problems the number of moved tokens 
was less than 40 (out of 1280!).

Now, when nodes start they report duplicate ranges. I wonder how bad that is, 
and how do I get rid of it?

  INFO [GossipStage:1] 2013-04-09 02:16:37,920 StorageService.java (line 1386) 
Nodes /10.0.1.2 and /10.0.1.1 have the same token 
99027485685976232531333625990885670910.  Ignoring /10.0.1.2
  INFO [GossipStage:1] 2013-04-09 02:16:37,921 StorageService.java (line 1386) 
Nodes /10.0.1.2 and /10.0.1.4 have the same token 
4319990986300976586937372945998718.  Ignoring /10.0.1.2

Overall, I'm not sure how bad it is to leave the data unshuffled (I read the 
DataStax blog post, but it's not clear). When adding a new node, wouldn't it be 
assigned ranges randomly from all nodes?

Some other notes inline below:

On 08/04/2013 15:00, Eric Evans wrote:

[ Rustam Aliyev ]

Hi,

After upgrading to vnodes I created and enabled a shuffle
operation as suggested. After it ran for a couple of hours I had to
disable it because the nodes were not catching up with compactions. I
repeated this process 3 times (enable/disable).

I have 5 nodes and each of them had ~35GB. After the shuffle operations
described above some nodes are now reaching ~170GB. In the log files
I can see the same files transferred 2-4 times to the same host within
the same shuffle session. Worst of all, after all of this only
20 vnodes out of 1280 were transferred. So if it continues at
the same speed it will take about a month or two to complete the
shuffle.

As Edward says, you'll need to issue a cleanup post-shuffle if you expect
to see disk usage match your expectations.


I had a few questions to better understand shuffle:

1. Does disabling and re-enabling shuffle start the shuffle process from
 scratch, or does it resume from the last point?

It resumes.


2. Will vnode reallocations speed up as the shuffle proceeds, or will
 the rate remain the same?

The shuffle proceeds synchronously, 1 range at a time; It's not going to
speed up as it progresses.


3. Why do I see multiple transfers of the same file to the same host? e.g.:

 INFO [Streaming to /10.0.1.8:6] 2013-04-07 14:27:10,038
 StreamReplyVerbHandler.java (line 44) Successfully sent
 /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
 to /10.0.1.8
 INFO [Streaming to /10.0.1.8:7] 2013-04-07 16:27:07,427
 StreamReplyVerbHandler.java (line 44) Successfully sent
 /u01/cassandra/data/Keyspace/Metadata/Keyspace-Metadata-ib-111-Data.db
 to /10.0.1.8

I'm not sure, but perhaps that file contained data for two different
ranges?

Does that mean that if I have a huge file (e.g. 20GB) which contains a lot of 
ranges (let's say 100), it will be transferred once per range (20GB*100)?

4. When I enable/disable shuffle I receive warning messages such as the
 ones below. Do I need to worry about them?

 cassandra-shuffle -h localhost disable
 Failed to enable shuffling on 10.0.1.1!
 Failed to enable shuffling on 10.0.1.3!

Is that the verbatim output?  Did it report failing to enable when you
tried to disable?

Yes, this is verbatim output. It reports failure for enable as well as 
disable. Nodes .1.1 and .1.3 were not RELOCATING unless I ran the 
cassandra-shuffle enable command on them locally.

Re: Cassandra and disk space

2010-12-09 Thread Rustam Aliyev

Are there any plans to improve this in the future?

For big data clusters this could be very expensive. Based on your 
comment, I will need 200TB of storage for 100TB of data to keep 
Cassandra running.


--
Rustam.

On 09/12/2010 17:56, Tyler Hobbs wrote:
If you are on 0.6, repair is particularly dangerous with respect to 
disk space usage.  If your replica is sufficiently out of sync, you 
can triple your disk usage pretty easily.  This has been improved in 
0.7, so repairs should use about half as much disk space, on average.


In general, yes, keep your nodes under 50% disk usage at all times.  
Any of: compaction, cleanup, snapshotting, repair, or bootstrapping 
(the latter two are improved in 0.7) can double your disk usage 
temporarily.


You should plan to add more disk space or add nodes when you get close 
to this limit.  Once you go over 50%, it's more difficult to add 
nodes, at least in 0.6.


- Tyler

On Thu, Dec 9, 2010 at 11:19 AM, Mark <static.void@gmail.com> wrote:


I recently ran into a problem during a repair operation where my
nodes completely ran out of space and my whole cluster was...
well, clusterfucked.

I want to make sure how to prevent this problem in the future.

Should I make sure that at all times every node is under 50% of
its disk space? Are there any normal day-to-day operations that
would cause any one node to double in size that I should be
aware of? If one or more nodes surpass the 50% mark, what should
I plan to do?

Thanks for any advice
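
As a practical footnote to the 50% guideline above (not from the thread;
the path assumes the default data directory location), per-node headroom
can be checked with something like:

    df -h /var/lib/cassandra/data
    nodetool -h localhost info | grep -i load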




Re: Cassandra and disk space

2010-12-09 Thread Rustam Aliyev


That depends on your scenario.  In the worst case of one big CF, 
there's not much that can be easily done for the disk usage of 
compaction and cleanup (which is essentially compaction).


If, instead, you have several column families and no single CF makes 
up the majority of your data, you can push your disk usage a bit higher.




Is there any formula to calculate this? Let's say I have 500GB in a single 
CF, so I need at least 500GB of free space for compaction. If I 
partition this CF and split it into 10 proportional CFs of 50GB each, does 
that mean I will need only 50GB of free space?


Also, is there a recommended maximum data size per node?

Thanks.

A fundamental idea behind Cassandra's architecture is that disk space 
is cheap (which, indeed, it is).  If you are particularly sensitive to 
this, Cassandra might not be the best solution to your problem.  Also 
keep in mind that Cassandra performs well with average disks, so you 
don't need to spend a lot there.  Additionally, most people find that 
the replication protects their data enough to allow them to use RAID 0 
instead of 1, 10, 5, or 6.


- Tyler

On Thu, Dec 9, 2010 at 12:20 PM, Rustam Aliyev <rus...@code.az> wrote:


Are there any plans to improve this in the future?

For big data clusters this could be very expensive. Based on your
comment, I will need 200TB of storage for 100TB of data to keep
Cassandra running.

--
Rustam.

On 09/12/2010 17:56, Tyler Hobbs wrote:

If you are on 0.6, repair is particularly dangerous with respect
to disk space usage.  If your replica is sufficiently out of
sync, you can triple your disk usage pretty easily.  This has
been improved in 0.7, so repairs should use about half as much
disk space, on average.

In general, yes, keep your nodes under 50% disk usage at all
times.  Any of: compaction, cleanup, snapshotting, repair, or
bootstrapping (the latter two are improved in 0.7) can double
your disk usage temporarily.

You should plan to add more disk space or add nodes when you get
close to this limit.  Once you go over 50%, it's more difficult
to add nodes, at least in 0.6.

- Tyler

On Thu, Dec 9, 2010 at 11:19 AM, Mark <static.void@gmail.com> wrote:

I recently ran into a problem during a repair operation where
my nodes completely ran out of space and my whole cluster
was... well, clusterfucked.

I want to make sure how to prevent this problem in the future.

Should I make sure that at all times every node is under 50%
of its disk space? Are there any normal day-to-day operations
that would cause any one node to double in size that I
should be aware of? If one or more nodes surpass the 50%
mark, what should I plan to do?

Thanks for any advice






Re: Cassandra and disk space

2010-12-09 Thread Rustam Aliyev

Thanks Tyler, this is really useful.

Also, I noticed that you can specify multiple data file directories 
located on different disks. Let's say I have a machine with 4 x 500GB 
drives; what would be the difference between the following 2 setups:


  1. each drive mounted separately, each with its own data file dir
 (so 4 data file dirs)
  2. disks in RAID0, mounted as one drive with a single data folder on it

In other words, does splitting the data folder into smaller ones bring any 
performance or stability advantages?
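
For reference, setup 1 would look roughly like this in a 0.7-style
cassandra.yaml (the mount points are purely illustrative); on 0.6 the
same list lives under DataFileDirectories in storage-conf.xml:

    data_file_directories:
        - /mnt/disk1/cassandra/data
        - /mnt/disk2/cassandra/data
        - /mnt/disk3/cassandra/data
        - /mnt/disk4/cassandra/data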



On 10/12/2010 00:03, Tyler Hobbs wrote:
Yes, that's correct, but I wouldn't push it too far.  You'll become 
much more sensitive to disk usage changes; in particular, rebalancing 
your cluster will be particularly difficult, and repair will also become 
dangerous.  Disk performance also tends to drop when a disk nears 
capacity.


There's no recommended maximum size -- it all depends on your access 
rates.  Anywhere from 10 GB to 1TB is typical.


- Tyler

On Thu, Dec 9, 2010 at 5:52 PM, Rustam Aliyev <rus...@code.az> wrote:




That depends on your scenario.  In the worst case of one big CF,
there's not much that can be easily done for the disk usage of
compaction and cleanup (which is essentially compaction).

If, instead, you have several column families and no single CF
makes up the majority of your data, you can push your disk usage
a bit higher.



Is there any formula to calculate this? Let's say I have 500GB in a
single CF, so I need at least 500GB of free space for compaction.
If I partition this CF and split it into 10 proportional CFs of
50GB each, does that mean I will need only 50GB of free space?

Also, is there a recommended maximum data size per node?

Thanks.



A fundamental idea behind Cassandra's architecture is that disk
space is cheap (which, indeed, it is).  If you are particularly
sensitive to this, Cassandra might not be the best solution to
your problem.  Also keep in mind that Cassandra performs well
with average disks, so you don't need to spend a lot there. 
Additionally, most people find that the replication protects
their data enough to allow them to use RAID 0 instead of 1, 10,
5, or 6.

    - Tyler

On Thu, Dec 9, 2010 at 12:20 PM, Rustam Aliyev <rus...@code.az> wrote:

Is there any plans to improve this in future?

For big data clusters this could be very expensive. Based on
your comment, I will need 200TB of storage for 100TB of data
to keep Cassandra running.

--
Rustam.

On 09/12/2010 17:56, Tyler Hobbs wrote:

If you are on 0.6, repair is particularly dangerous with
respect to disk space usage.  If your replica is
sufficiently out of sync, you can triple your disk usage
pretty easily.  This has been improved in 0.7, so repairs
should use about half as much disk space, on average.

In general, yes, keep your nodes under 50% disk usage at all
times.  Any of: compaction, cleanup, snapshotting, repair,
or bootstrapping (the latter two are improved in 0.7) can
double your disk usage temporarily.

You should plan to add more disk space or add nodes when you
get close to this limit.  Once you go over 50%, it's more
difficult to add nodes, at least in 0.6.

- Tyler

On Thu, Dec 9, 2010 at 11:19 AM, Mark <static.void@gmail.com> wrote:

I recently ran into a problem during a repair operation
where my nodes completely ran out of space and my whole
cluster was... well, clusterfucked.

I want to make sure how to prevent this problem in the
future.

Should I make sure that at all times every node is under
50% of its disk space? Are there any normal day-to-day
operations that would cause any one node to double
in size that I should be aware of? If one or more nodes
surpass the 50% mark, what should I plan to do?

Thanks for any advice








Distributed counters

2011-01-19 Thread Rustam Aliyev

Hi,

Does anyone use the CASSANDRA-1072 counters patch with the 0.7 stable 
branch? I need this functionality but can't wait until 0.8.


Also, does the Hector trunk version have any support for these counters? 
(This question is probably for the hector-users group, but most of us 
are here anyway.)


Many thanks,
Rustam Aliyev.

<http://www.linkedin.com/in/aliyev>


Re: Distributed counters

2011-01-21 Thread Rustam Aliyev

Hi Kelvin,

Thanks for sharing! That's exactly what I was looking for.

Good luck with the migration.

Regards,
Rustam.


On 20/01/2011 17:40, Kelvin Kakugawa wrote:

Hi Rustam,

All of our large production clusters are still on 0.6.6.

However, we have an 0.7 branch, here:
https://github.com/kakugawa/cassandra/tree/twttr-cassandra-0.7-counts

that is our migration target.  It passes our internal distributed 
tests and will be in production soon.


-Kelvin

On Thu, Jan 20, 2011 at 8:24 AM, Nate McCall <n...@riptano.com> wrote:


On the Hector side, we will be adding this to trunk (and thus moving
Hector trunk to Cassandra 0.8.x) in the next week or two.

On Wed, Jan 19, 2011 at 6:12 PM, Rustam Aliyev <rus...@code.az> wrote:
> Hi,
>
> Does anyone use the CASSANDRA-1072 counters patch with the 0.7 stable
> branch? I need this functionality but can't wait until 0.8.
>
> Also, does the Hector trunk version have any support for these
> counters? (This question is probably for the hector-users group,
> but most of us are here anyway.)
>
> Many thanks,
> Rustam Aliyev.
>
>




Re: Explaining the Replication Factor, N and W and R

2011-02-13 Thread Rustam Aliyev

On 13/02/2011 13:49, Janne Jalkanen wrote:

Folks,

as it seems that wrapping the brain around the R+W>N concept is a big hurdle 
for a lot of users, I made a simple web page that allows you to try out the 
different parameters and see how they affect the system.

http://www.ecyrd.com/cassandracalculator/

Let me know if you have any suggestions to improve the wording, or if you spot a 
bug. I'm trying to go for simplicity and clarity over absolute correctness here, 
as this is meant to help newbies.

(App is completely self-contained HTML and Javascript.)

/Janne


Excellent! How about adding a Hinted Handoff enabled/disabled option?
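
For readers new to the rule behind the calculator, a quick worked example
of the quorum overlap condition (general Dynamo-style math, not specific
to Janne's page):

    N = 3, W = 2, R = 2  ->  R + W = 4 > N, so every read overlaps the
                             latest acknowledged write on at least one replica
    N = 3, W = 1, R = 1  ->  R + W = 2 <= N, so a read may miss the most
                             recent write until repair catches up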

--
Rustam