UUIDType

2011-11-20 Thread Guy Incognito
Am I correct that neither of Cassandra's UUIDTypes (at least in 0.7)
compares UUIDs according to RFC 4122 (i.e. as two unsigned longs)?
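
For reference, the difference in question, sketched in Java: java.util.UUID.compareTo
compares the two 64-bit halves as signed longs, whereas an RFC 4122-style comparison
treats them as unsigned. (Long.compareUnsigned is Java 8+; on older JVMs the same
effect needs an XOR with Long.MIN_VALUE before a signed compare.)

    import java.util.UUID;

    public class UuidCompare {
        // RFC 4122-style ordering: compare each 64-bit half as unsigned.
        static int compareUnsigned(UUID a, UUID b) {
            int c = Long.compareUnsigned(a.getMostSignificantBits(),
                                         b.getMostSignificantBits());
            if (c != 0) return c;
            return Long.compareUnsigned(a.getLeastSignificantBits(),
                                        b.getLeastSignificantBits());
        }

        public static void main(String[] args) {
            // The sign bit of 'big' is set, so a signed compare inverts the order.
            UUID big   = new UUID(0x8000000000000000L, 0L);
            UUID small = new UUID(0x1L, 0L);
            System.out.println(big.compareTo(small));        // -1: signed, big "looks" negative
            System.out.println(compareUnsigned(big, small)); //  1: unsigned, big really is bigger
        }
    }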


Re: Efficiency of Cross Data Center Replication...?

2011-11-20 Thread Boris Yen
A quick question: what if DC2 is down, and after a while it comes back on,
how does the data get synced to DC2 in this case? (assume hinting is disabled)

Thanks in advance.

On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan <
jeremiah.jor...@morningstar.com> wrote:

> Pretty sure data is sent to the coordinating node in DC2 at the same time
> it is sent to replicas in DC1, so I would think 10's of milliseconds after
> the transport time to DC2.
>
> On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote:
>
> On a related note - assuming there are available resources across the
> board (cpu and memory on every node, low network latency, non-saturated
> nics/circuits/disks), what's a reasonable expectation for timing on
> replication? Sub-second? Less than five seconds?
>
> Ernie
>
> On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming 
> wrote:
>
>> Great - thanks Jake
>>
>> B.
>>
>> On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani  wrote:
>>
>>> the former
>>>
>>>
>>> On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming <
>>> bigbrianflem...@gmail.com> wrote:
>>>

 Hi All,

 I have a question about inter-data centre replication : if you have 2
 Data Centers, each with a local RF of 2 (i.e. total RF of 4) and write to a
 node in DC1, how efficient is the replication to DC2 - i.e. is that data :
  - replicated over to a single node in DC2 once and internally
 replicated
  or
  - replicated explicitly to two separate nodes?

 Obviously from a LAN resource utilisation perspective, the former would
 be preferable.

 Many thanks,

 Brian


>>>
>>>
>>> --
>>> http://twitter.com/tjake
>>>
>>
>>
>
>
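
For context, the per-DC replica counts in this thread come from the keyspace's
placement strategy. A sketch of defining such a keyspace through the 0.8-era
Thrift API (class and method names assumed from that interface; keyspace and
data-center names taken from Brian's scenario):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.CfDef;
    import org.apache.cassandra.thrift.KsDef;

    public class CreateMultiDcKeyspace {
        // RF=2 in each of two data centers, i.e. the "total RF of 4" setup.
        static KsDef multiDcKeyspace() {
            Map<String, String> options = new HashMap<String, String>();
            options.put("DC1", "2");
            options.put("DC2", "2");

            KsDef ks = new KsDef();
            ks.setName("MyKeyspace");
            ks.setStrategy_class("org.apache.cassandra.locator.NetworkTopologyStrategy");
            ks.setStrategy_options(options);
            ks.setCf_defs(new ArrayList<CfDef>());
            return ks;
        }

        static void create(Cassandra.Client client) throws Exception {
            client.system_add_keyspace(multiDcKeyspace());
        }
    }

With NetworkTopologyStrategy the write crosses the WAN once and is fanned out
to the other local replica inside DC2, which is the "former" behavior Jake
confirms above.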


Re: Network traffic patterns

2011-11-20 Thread Boris Yen
I am just curious: which partitioner are you using?

On Thu, Nov 17, 2011 at 4:30 PM, Philippe  wrote:

> Hi Todd
> Yes all equal hardware. Nearly no CPU usage and no memory issues.
> Repairs are running in tens of minutes, so I don't understand why
> replication would be backed up.
>
> Any other ideas?
> On Nov 17, 2011 02:33, "Todd Burruss"  wrote:
>
> Are all of your machines equal hardware?  Since those machines are sending
>> data somewhere, maybe they are behind in replicating and are continuously
>> catching up?
>>
>> Use a tool like tcpdump to find out where the data is going
>>
>> From: Philippe 
>> Reply-To: "user@cassandra.apache.org" 
>> Date: Tue, 15 Nov 2011 13:22:38 -0800
>> To: user 
>> Subject: Re: Network traffic patterns
>>
>> Sorry about the previous message, I've enabled keyboard shortcuts on
>> gmail...*sigh*...
>>
>> Hello,
>> I'm trying to understand the network usage I am seeing in my cluster, can
>> anyone shed some light?
>> It's an RF=3, 12-node, cassandra 0.8.6 cluster. repair is performed on
>> each node once a week, with a rolling schedule.
>> The nodes are p13,p14,p15...p24 and are consecutive in that order on the
>> ring. Each node is only a cassandra database. I am hitting the cluster from
>> another server (p4).
>>
>> p4 is doing this with 20 threads in parallel
>>
>>1. read a lot of data (some columns for hundreds to tens of thousands
>>of keys, split into 512-key multigets)
>>2. process the data
>>3. write back a byte array to cassandra (average size is 400 bytes)
>>4. go back to 1
>>
>> According to my munin graphs, network usage is about as follows. I am not
>> surprised at the bias towards p13-p15 as p4 is getting & storing data
>> mainly for keys located on one of those nodes.
>>
>>- p4 : 1.5Mb/s in and out
>>- p13-p15 : 15Mb/s in and 80Mb/s out
>>- p16-p24 : 45Mb/s in and 5Mb/s out
>>
>> What I don't understand is why p4 is only seeing 1.5Mb/s while I see
>> 80Mb/s on p13 & p15.
>>
>> The way I understand this:
>>
>>- p4 makes a multiget to the cluster, electing to use any node in the
>>    cluster (IN traffic describing the query)
>>- coordinator node replays the query on all 3 replicas (so 3 servers
>>each get the IN traffic, mostly p13-p15)
>>- each server replies to coordinator
>>- coordinator chooses matching values and sends back data to p4
>>
>> So if p13-p15 are outputting 80Mb/s why am I not seeing 80Mb/s coming
>> into p4 which is on the receiving end ?
>>
>> Thanks
>>
>> 2011/11/15 Philippe 
>>
>>> Hello,
>>> I'm trying to understand the network usage I am seeing in my cluster,
>>> can anyone shed some light?
>>> It's an RF=3, 12-node, cassandra 0.8.6 cluster. The nodes are
>>> p13,p14,p15...p24 and are consecutive in that order on the ring.
>>> Each node is only a cassandra database. I am hitting the cluster from
>>> another server (p4).
>>>
>>> The pattern on p4 is to
>>>
>>>1. read a lot of data (some columns for hundreds to tens of
>>>thousands of keys, split into 512-key multigets)
>>>2. process the data
>>>3. write back a byte array to cassandra (average size is 400 bytes)
>>>
>>>
>>> p4 reads as
>>>
>>
>>
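
As an aside, the "512-key multigets" described above are plain client-side
batching; a minimal sketch of the splitting step, independent of any
particular client library:

    import java.util.ArrayList;
    import java.util.List;

    public class MultigetBatcher {
        static final int BATCH_SIZE = 512;

        // Split a large key list into batches of at most 512 keys; each
        // batch is then issued to the cluster as one multiget.
        static List<List<byte[]>> split(List<byte[]> keys) {
            List<List<byte[]>> batches = new ArrayList<List<byte[]>>();
            for (int i = 0; i < keys.size(); i += BATCH_SIZE) {
                batches.add(keys.subList(i, Math.min(i + BATCH_SIZE, keys.size())));
            }
            return batches;
        }
    }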


Re: Efficiency of Cross Data Center Replication...?

2011-11-20 Thread Mohit Anchlia
On Sun, Nov 20, 2011 at 4:01 AM, Boris Yen  wrote:
> A quick question: what if DC2 is down, and after a while it comes back on,
> how does the data get synced to DC2 in this case? (assume hinting is disabled)
> Thanks in advance.

Manually: run nodetool repair in a rolling fashion on all the nodes of DC2
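
In script form, "rolling" just means one node at a time. A minimal sketch
driving nodetool from Java (host names are hypothetical; ProcessBuilder.inheritIO
needs Java 7+):

    import java.util.Arrays;
    import java.util.List;

    public class RollingRepair {
        public static void main(String[] args) throws Exception {
            // Hypothetical DC2 nodes; repairing sequentially keeps at most
            // one repair running in the cluster at any time.
            List<String> dc2Nodes = Arrays.asList("dc2-node1", "dc2-node2", "dc2-node3");
            for (String host : dc2Nodes) {
                Process p = new ProcessBuilder("nodetool", "-h", host, "repair")
                        .inheritIO()
                        .start();
                if (p.waitFor() != 0) {
                    throw new RuntimeException("repair failed on " + host);
                }
            }
        }
    }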

> On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan
>  wrote:
>>
>> Pretty sure data is sent to the coordinating node in DC2 at the same time
>> it is sent to replicas in DC1, so I would think 10's of milliseconds after
>> the transport time to DC2.
>> On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote:
>>
>> On a related note - assuming there are available resources across the
>> board (cpu and memory on every node, low network latency, non-saturated
>> nics/circuits/disks), what's a reasonable expectation for timing on
>> replication? Sub-second? Less than five seconds?
>> Ernie
>>
>> On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming 
>> wrote:
>>>
>>> Great - thanks Jake
>>> B.
>>>
>>> On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani  wrote:

 the former

 On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming
  wrote:
>
> Hi All,
>
> I have a question about inter-data centre replication : if you have 2
> Data Centers, each with a local RF of 2 (i.e. total RF of 4) and write to 
> a
> node in DC1, how efficient is the replication to DC2 - i.e. is that data :
>  - replicated over to a single node in DC2 once and internally
> replicated
>  or
>  - replicated explicitly to two separate nodes?
> Obviously from a LAN resource utilisation perspective, the former would
> be preferable.
> Many thanks,
> Brian



 --
 http://twitter.com/tjake
>>>
>>
>>
>
>


Re: Efficiency of Cross Data Center Replication...?

2011-11-20 Thread Jeremiah Jordan
If hinting is off, read repair and manual repair are the only ways data will
get there (just like when a single node is down).

On Nov 20, 2011, at 6:01 AM, Boris Yen wrote:

> A quick question: what if DC2 is down, and after a while it comes back on,
> how does the data get synced to DC2 in this case? (assume hinting is disabled)
> 
> Thanks in advance.
> 
> On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan 
>  wrote:
> Pretty sure data is sent to the coordinating node in DC2 at the same time it 
> is sent to replicas in DC1, so I would think 10's of milliseconds after the 
> transport time to DC2.
> 
> On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote:
> 
>> On a related note - assuming there are available resources across the board 
>> (cpu and memory on every node, low network latency, non-saturated 
>> nics/circuits/disks), what's a reasonable expectation for timing on 
>> replication? Sub-second? Less than five seconds? 
>> 
>> Ernie
>> 
>> On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming  
>> wrote:
>> Great - thanks Jake
>> 
>> B.
>> 
>> On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani  wrote:
>> the former
>> 
>> 
>> On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming  
>> wrote:
>> 
>> Hi All,
>>  
>> I have a question about inter-data centre replication : if you have 2 Data 
>> Centers, each with a local RF of 2 (i.e. total RF of 4) and write to a node 
>> in DC1, how efficient is the replication to DC2 - i.e. is that data :
>>  - replicated over to a single node in DC2 once and internally replicated
>>  or 
>>  - replicated explicitly to two separate nodes?
>> 
>> Obviously from a LAN resource utilisation perspective, the former would be 
>> preferable.
>> 
>> Many thanks,
>> 
>> Brian
>> 
>> 
>> 
>> 
>> -- 
>> http://twitter.com/tjake
>> 
>> 
> 
> 



data agility

2011-11-20 Thread Dotan N.
Hi all,
my question may be more philosophical than related technically
to Cassandra, but please bear with me.

Given that a young startup may not know its product fully at the early
stages, but that it definitely points to ~200M users,
would Cassandra be the right way to go?

That is, the requirement is for a large data store that can move swiftly
with product changes and requirements.

Given that in Cassandra one thinks hard about the queries, and then builds
a model to suit them best, I was thinking of
this situation as problematic.

So here are some questions:

- would it be wiser to start with a more agile data store (such as MongoDB)
and then progress to Cassandra when the product itself solidifies?
- given that we start with Cassandra from the get-go, what is a common (and
quick, in terms of development) way or practice to change data and change
schemas as the product evolves?
- is it even smart to start with Cassandra? would only startups whose core
business is big data start with it from the get-go?
- how would you do map/reduce with Cassandra? how agile is that? (for
example, can you run map/reduce _very_ frequently?)

Thanks!

--
Dotan, @jondot 


Re: data agility

2011-11-20 Thread David McNelis
Dotan,

I think that if you're in the early stages you have a basic idea of what
your product is going to be, architecturally speaking.  While you may
change your business model or features at the display layer, I would think
the data model itself would remain relatively similar
throughout...otherwise you'd have another product on your hands, no?

But, even if your requirements radically shift, Cassandra is schemaless, so
you'd be able to make 'structural' changes to your data without as much
risk as in a traditional RDBMS, e.g. MySQL.

At the end of the day, I don't think you've given enough information about
your proposed data models for anyone to say, "Yes, Cassandra would or would
not be the right choice for your startup."  If well administered, depending
on the services offered, MySQL or Oracle could support a site with 200M
users, and a poorly designed Cassandra data store could work very poorly
for a site supporting 200 users.

I will say that I think it makes a lot of sense to use traditional RDBMS
systems for relational data and a Cassandra-like system when there is a
need for larger data storage, or something that lends itself well to a
structureless design.  If you are using a framework that supports a good
ORM layer (i.e. Hibernate for Java), you can have your build process
update your database schema as you build out your application.  I haven't
done much work in Rails or Django, but I understand those support
transparent schema updating as well.  That sort of setup can work very
effectively in early development...but that is more a discussion for other
communities.

If you're interested in doing Map/Reduce jobs with Cassandra, look into
Brisk, the system created by DataStax (which is also open source) that
allows you to run Hadoop on top of your Cassandra cluster.  This may not be
exactly what you're looking for when asking this question...but it might
give you the insights you're looking for.

Hope this has been at least somewhat helpful.

David

On Sun, Nov 20, 2011 at 1:06 PM, Dotan N.  wrote:

> Hi all,
> my question may be more philosophical than related technically
> to Cassandra, but please bear with me.
>
> Given that a young startup may not know its product full at the early
> stages, but that it definitely points to ~200M users,
> would Cassandra will be the right way to go?
>
> That is, the requirement is for a large data store, that can move with
> product changes and requirements swiftly.
>
> Given that in Cassandra one thinks hard about the queries, and then builds
> a model to suit it best, I was thinking of
> this situation as problematic.
>
> So here are some questions:
>
> - would it be wiser to start with a more agile data store (such as
> mongodb) and then progress onto Cassandra, when the product itself
> solidifies?
> - given that we start with Cassandra from the get go, what is a common
> (and quick in terms of development) way or practice to change data, change
> schemas, as the product evolves?
> - is it even smart to start with Cassandra? would only startups whose core
> business is big data start with it from the get go?
> - how would you do map/reduce with Cassandra? how agile is that? (for
> example, can you run map/reduce _very_ frequently?)
>
> Thanks!
>
> --
> Dotan, @jondot 
>
>


-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*


Re: data agility

2011-11-20 Thread Stephen Connolly
if your startup is bootstrapping then cassandra is sometimes too heavy to
start with.

i.e. it needs to be fed ram... you're not going to seriously run it in less
than 1gb per node... that level of ram commitment can be too much while
bootstrapping.

if your startup has enough cash to pay for 3-5 recommended spec (see wiki)
nodes to be up 24/7 then cassandra is a good fit...

a friend of mine is bootstrapping a startup and had to drop back to mysql
while he finds his pain points and customers... he knows he will end up
jumping back to cassandra when he gets enough customers (or a VC) but for
now the running costs are too much to pay from his own pocket... note that
the jdbc driver and cql will make jumping back easy for him (as he still
tests with c*... just runs at present against mysql nuts eh!)

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random nonsense
words and other nonsense are a direct result of using swype to type on the
screen
On 20 Nov 2011 19:07, "Dotan N."  wrote:

> Hi all,
> my question may be more philosophical than related technically
> to Cassandra, but please bear with me.
>
> Given that a young startup may not know its product full at the early
> stages, but that it definitely points to ~200M users,
> would Cassandra will be the right way to go?
>
> That is, the requirement is for a large data store, that can move with
> product changes and requirements swiftly.
>
> Given that in Cassandra one thinks hard about the queries, and then builds
> a model to suit it best, I was thinking of
> this situation as problematic.
>
> So here are some questions:
>
> - would it be wiser to start with a more agile data store (such as
> mongodb) and then progress onto Cassandra, when the product itself
> solidifies?
> - given that we start with Cassandra from the get go, what is a common
> (and quick in terms of development) way or practice to change data, change
> schemas, as the product evolves?
> - is it even smart to start with Cassandra? would only startups whose core
> business is big data start with it from the get go?
> - how would you do map/reduce with Cassandra? how agile is that? (for
> example, can you run map/reduce _very_ frequently?)
>
> Thanks!
>
> --
> Dotan, @jondot 
>
>


Re: data agility

2011-11-20 Thread Dotan N.
Thanks David.
Stephen: thanks for the tip, we can run a recommended configuration, so
that wouldn't be an issue. I guess I can say that my questions focus on the
complexity of development.

After digesting David's answer, I guess my follow-up questions would be:
- how would you process data in a cassandra cluster, typically? via one-off
coded offline jobs?
- how easy is map/reduce on existing data? (just looked at Brisk, but it
may be unrelated; in any case, there's not too much written about it)
- how would you do analytics over a cassandra cluster?
- given the common examples of time-series, how would you recommend
aggregating (sum, avg, facet) and providing statistics over the collected
data? for example, if it were kinds of logs and you'd like to group all of
certain fields in it, or provide a histogram over it (see the sketch below).
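
On that last question, one common approach is a client-side (or batch-job)
rollup before writing aggregates back to a separate column family. A minimal
sketch of the bucketing arithmetic, with no Cassandra client involved:

    import java.util.Map;
    import java.util.TreeMap;

    public class HourlyRollup {
        private static final long HOUR_MS = 3600000L;

        // Aggregate (timestampMillis -> value) samples into per-hour
        // [sum, count] pairs, from which avg or a histogram can be derived.
        static Map<Long, double[]> rollup(Map<Long, Double> samples) {
            Map<Long, double[]> buckets = new TreeMap<Long, double[]>();
            for (Map.Entry<Long, Double> e : samples.entrySet()) {
                long hour = (e.getKey() / HOUR_MS) * HOUR_MS;
                double[] agg = buckets.get(hour);
                if (agg == null) {
                    agg = new double[2];
                    buckets.put(hour, agg);
                }
                agg[0] += e.getValue();
                agg[1] += 1;
            }
            return buckets;
        }

        static double avg(double[] agg) {
            return agg[1] == 0 ? 0 : agg[0] / agg[1];
        }
    }

The per-hour rows can then be written back under their own keys, so reads for
statistics hit pre-aggregated data instead of the raw samples.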

Thanks!


--
Dotan, @jondot 



On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly <
stephen.alan.conno...@gmail.com> wrote:

> if your startup is bootstrapping then cassandra is sometimes to heavy to
> start with.
>
> i.e. it needs to be fed ram... you're not going to seriously run it in
> less than 1gb per node... that level of ram commitment can be too much
> while bootstrapping.
>
> if your startup has enough cash to pay for 3-5 recommended spec (see wiki)
> nodes to be up 24/7 then cassandra is a good fit...
>
> a friend of mine is bootstrapping a startup and had to drop back to mysql
> while he finds his pain points and customers... he knows he will end up
> jumping back to cassandra when he gets enough customers (or a VC) but for
> now the running costs are too much to pay from his own pocket... note that
> the jdbc driver and cql will make jumping back easy for him (as he still
> tests with c*... just runs at present against mysql nuts eh!)
>
> - Stephen
>
> ---
> Sent from my Android phone, so random spelling mistakes, random nonsense
> words and other nonsense are a direct result of using swype to type on the
> screen
> On 20 Nov 2011 19:07, "Dotan N."  wrote:
>
>> Hi all,
>> my question may be more philosophical than related technically
>> to Cassandra, but please bear with me.
>>
>> Given that a young startup may not know its product full at the early
>> stages, but that it definitely points to ~200M users,
>> would Cassandra will be the right way to go?
>>
>> That is, the requirement is for a large data store, that can move with
>> product changes and requirements swiftly.
>>
>> Given that in Cassandra one thinks hard about the queries, and then
>> builds a model to suit it best, I was thinking of
>> this situation as problematic.
>>
>> So here are some questions:
>>
>> - would it be wiser to start with a more agile data store (such as
>> mongodb) and then progress onto Cassandra, when the product itself
>> solidifies?
>> - given that we start with Cassandra from the get go, what is a common
>> (and quick in terms of development) way or practice to change data, change
>> schemas, as the product evolves?
>> - is it even smart to start with Cassandra? would only startups whose
>> core business is big data start with it from the get go?
>> - how would you do map/reduce with Cassandra? how agile is that? (for
>> example, can you run map/reduce _very_ frequently?)
>>
>> Thanks!
>>
>> --
>> Dotan, @jondot 
>>
>>


Re: data agility

2011-11-20 Thread Jahangir Mohammed
IMHO, you should start with something very simple, like an RDBMS, and
meanwhile get a handle on Cassandra or another NoSQL technology. Start out
simple, but always be aware and conscious of the next thing you will have
in your stack. It takes time to work with a new technology if you are in the
phase of prototyping something fast and geared towards a VC demo. In most
of the cases, you won't need NoSQL for a while unless there is a very
strong case.

Thanks,
Jahangir
On Nov 20, 2011 4:04 PM, "Dotan N."  wrote:

> Thanks David.
> Stephen: thanks for the tip, we can run a recommended configuration, so
> that wouldn't be an issue. I guess I can focus that my questions are on
> complexity of development.
>
> After digesting David's answer, I guess my follow up questions would be
> - how would you process data in a cassandra cluster, typically? via
> one-off coded offline jobs?
> - how easy is map/reduce on existing data (just looked at brisk but it may
> be unrelated, any case, not too much written about it)
> - how would you do analytics over a cassandra cluster
> - given the common examples of time-series, how would you recommend to
> aggregate (sum, avg, facet) and provide statistics over the collected data?
> for example if it were kinds of logs and you'd like to group all of certain
> fields in it, or provide a histogram over it.
>
> Thanks!
>
>
> --
> Dotan, @jondot 
>
>
>
> On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly <
> stephen.alan.conno...@gmail.com> wrote:
>
>> if your startup is bootstrapping then cassandra is sometimes to heavy to
>> start with.
>>
>> i.e. it needs to be fed ram... you're not going to seriously run it in
>> less than 1gb per node... that level of ram commitment can be too much
>> while bootstrapping.
>>
>> if your startup has enough cash to pay for 3-5 recommended spec (see
>> wiki) nodes to be up 24/7 then cassandra is a good fit...
>>
>> a friend of mine is bootstrapping a startup and had to drop back to mysql
>> while he finds his pain points and customers... he knows he will end up
>> jumping back to cassandra when he gets enough customers (or a VC) but for
>> now the running costs are too much to pay from his own pocket... note that
>> the jdbc driver and cql will make jumping back easy for him (as he still
>> tests with c*... just runs at present against mysql nuts eh!)
>>
>> - Stephen
>>
>> ---
>> Sent from my Android phone, so random spelling mistakes, random nonsense
>> words and other nonsense are a direct result of using swype to type on the
>> screen
>> On 20 Nov 2011 19:07, "Dotan N."  wrote:
>>
>>> Hi all,
>>> my question may be more philosophical than related technically
>>> to Cassandra, but please bear with me.
>>>
>>> Given that a young startup may not know its product full at the early
>>> stages, but that it definitely points to ~200M users,
>>> would Cassandra will be the right way to go?
>>>
>>> That is, the requirement is for a large data store, that can move with
>>> product changes and requirements swiftly.
>>>
>>> Given that in Cassandra one thinks hard about the queries, and then
>>> builds a model to suit it best, I was thinking of
>>> this situation as problematic.
>>>
>>> So here are some questions:
>>>
>>> - would it be wiser to start with a more agile data store (such as
>>> mongodb) and then progress onto Cassandra, when the product itself
>>> solidifies?
>>> - given that we start with Cassandra from the get go, what is a common
>>> (and quick in terms of development) way or practice to change data, change
>>> schemas, as the product evolves?
>>> - is it even smart to start with Cassandra? would only startups whose
>>> core business is big data start with it from the get go?
>>> - how would you do map/reduce with Cassandra? how agile is that? (for
>>> example, can you run map/reduce _very_ frequently?)
>>>
>>> Thanks!
>>>
>>> --
>>> Dotan, @jondot 
>>>
>>>
>


Re: read performance problem

2011-11-20 Thread Jahangir Mohammed
There is something wrong with the system. Your benchmarks are way off. How
are you benchmarking? Are you using the included stress tool?
On Nov 19, 2011 8:58 PM, "Kent Tong"  wrote:

> Hi,
>
> On my computer with 2G RAM and a core 2 duo CPU E4600 @ 2.40GHz, I am
> testing the
> performance of Cassandra. The write performance is good: It can write a
> million records
> in 10 minutes. However, the query performance is poor and it takes 10
> minutes to read
> 10K records with sequential keys from 0 to  (about 100 QPS). This is
> far away from
> the 3,xxx QPS found on the net.
>
> Cassandra decided to use 1G as the Java heap size which seems to be fine
> as at the end
> of the benchmark the swap was barely used (only 1M used).
>
> I understand that my computer may be not as powerful as those used in the
> other benchmarks,
> but it shouldn't be that far off (1:30), right?
>
> Any suggestion? Thanks in advance!
>
>
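
A note on methodology: a single-threaded loop over sequential keys measures
per-read latency, not throughput; the 3,xxx QPS figures found on the net
generally come from many concurrent clients. A minimal harness sketch (the
read is a placeholder for whatever client call you use):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class ReadBenchmark {
        public static void main(String[] args) throws Exception {
            final int threads = 20;
            final int opsPerThread = 10000;
            final AtomicLong done = new AtomicLong();

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                pool.submit(new Runnable() {
                    public void run() {
                        for (int i = 0; i < opsPerThread; i++) {
                            doRead(i);   // stand-in for one Cassandra read
                            done.incrementAndGet();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d reads in %.1fs = %.0f QPS%n",
                              done.get(), secs, done.get() / secs);
        }

        private static void doRead(int key) {
            // no-op placeholder; replace with a real client read
        }
    }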


Re: data agility

2011-11-20 Thread Dotan N.
Jahangir, thanks! However, I've noted that we may very well need to scale to
200M users or "entities" within a short amount of time - say a year or two,
with 10M within a few months.


--
Dotan, @jondot 



On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed  wrote:

> IMHO, you should start with something very simple RDBMS and meanwhile
> getting handle over Cassandra or other noSql technology. Start out with
> simple, but always be aware and conscious of the next thing you will have
> in stack. It's timetaking to work with new technology if you are in the
> phase of prototyping something fast and geared towards a Vc demo. In most
> of the cases, you won't need noSql for a while unless there is a very
> strong case.
>
> Thanks,
> Jahangir
> On Nov 20, 2011 4:04 PM, "Dotan N."  wrote:
>
>> Thanks David.
>> Stephen: thanks for the tip, we can run a recommended configuration, so
>> that wouldn't be an issue. I guess I can focus that my questions are on
>> complexity of development.
>>
>> After digesting David's answer, I guess my follow up questions would be
>> - how would you process data in a cassandra cluster, typically? via
>> one-off coded offline jobs?
>> - how easy is map/reduce on existing data (just looked at brisk but it
>> may be unrelated, any case, not too much written about it)
>> - how would you do analytics over a cassandra cluster
>> - given the common examples of time-series, how would you recommend to
>> aggregate (sum, avg, facet) and provide statistics over the collected data?
>> for example if it were kinds of logs and you'd like to group all of certain
>> fields in it, or provide a histogram over it.
>>
>> Thanks!
>>
>>
>> --
>> Dotan, @jondot 
>>
>>
>>
>> On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly <
>> stephen.alan.conno...@gmail.com> wrote:
>>
>>> if your startup is bootstrapping then cassandra is sometimes to heavy to
>>> start with.
>>>
>>> i.e. it needs to be fed ram... you're not going to seriously run it in
>>> less than 1gb per node... that level of ram commitment can be too much
>>> while bootstrapping.
>>>
>>> if your startup has enough cash to pay for 3-5 recommended spec (see
>>> wiki) nodes to be up 24/7 then cassandra is a good fit...
>>>
>>> a friend of mine is bootstrapping a startup and had to drop back to
>>> mysql while he finds his pain points and customers... he knows he will end
>>> up jumping back to cassandra when he gets enough customers (or a VC) but
>>> for now the running costs are too much to pay from his own pocket... note
>>> that the jdbc driver and cql will make jumping back easy for him (as he
>>> still tests with c*... just runs at present against mysql nuts eh!)
>>>
>>> - Stephen
>>>
>>> ---
>>> Sent from my Android phone, so random spelling mistakes, random nonsense
>>> words and other nonsense are a direct result of using swype to type on the
>>> screen
>>> On 20 Nov 2011 19:07, "Dotan N."  wrote:
>>>
 Hi all,
 my question may be more philosophical than related technically
 to Cassandra, but please bear with me.

 Given that a young startup may not know its product full at the early
 stages, but that it definitely points to ~200M users,
 would Cassandra will be the right way to go?

 That is, the requirement is for a large data store, that can move with
 product changes and requirements swiftly.

 Given that in Cassandra one thinks hard about the queries, and then
 builds a model to suit it best, I was thinking of
 this situation as problematic.

 So here are some questions:

 - would it be wiser to start with a more agile data store (such as
 mongodb) and then progress onto Cassandra, when the product itself
 solidifies?
 - given that we start with Cassandra from the get go, what is a common
 (and quick in terms of development) way or practice to change data, change
 schemas, as the product evolves?
 - is it even smart to start with Cassandra? would only startups whose
 core business is big data start with it from the get go?
 - how would you do map/reduce with Cassandra? how agile is that? (for
 example, can you run map/reduce _very_ frequently?)

 Thanks!

 --
 Dotan, @jondot 


>>


Re: Causes of a High Memtable Live Ratio

2011-11-20 Thread Jahangir Mohammed
The memtable live ratio is bounded between 0 and 64: it is clamped to 64
whenever the computed ratio comes out higher. The cap is there for the
health of the node.
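
A simplified sketch of the clamping being described (the real logic lives in
Memtable.java; the bound follows the WARN message quoted below):

    public class LiveRatioClamp {
        // Live ratio: heap bytes the memtable occupies divided by the bytes
        // written through it; capped so one bad estimate can't distort
        // flush sizing for the node.
        static double clampedLiveRatio(double heapBytesUsed, double bytesWritten) {
            return Math.min(64.0, heapBytesUsed / bytesWritten);
        }
    }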
On Nov 19, 2011 6:10 PM, "Caleb Rackliffe"  wrote:

> Hi All,
>
> From what I've read in the source, a Memtable's "live ratio" is the ratio
> of Memtable usage to the current write throughput.  If this is too high, I
> imagine the system could be in a possibly unsafe state, as the comment in
> Memtable.java indicates.
>
> Today, while bulk loading some data, I got the following message:
>
> WARN [pool-1-thread-1] 2011-11-18 21:08:57,331 Memtable.java (line 172)
> setting live ratio to maximum of 64 instead of 78.87903667214012
>
> Should I be worried?  If so, does anybody have any suggestions for how to
> address it?
>
> Thanks :)
>
> *
> Caleb Rackliffe | Software Developer
> M 949.981.0159 | ca...@steelhouse.com
> **
> *
>

Re: data agility

2011-11-20 Thread Aaron Turner
Sounds like you need to figure out what your product is going to do
and what technology will best fit those requirements.  I know you're
worried about being agile and all that, but scaling requires you to
use the right tool for the job. Worry about new requirements when they
rear their ugly head rather than over a dozen "what if" scenarios.

You can scale MySQL/etc and Cassandra, MongoDB, etc to 10-200M "users"
depending on what you're asking your datastore to do.  You haven't
defined that really at all other than some comments about wanting to
do some map/reduce jobs.

Really what you should be doing is figuring out what kind of data you
need to store and your needs like access patterns, availability, ACID
compliance, etc and then figure out what technology is the best fit.
There are tons of "Cassandra vs X" comparisons for every NoSQL DB in
existence.

Other than that, the map/reduce on Cassandra is more job-based rather
than useful for interactive queries, so if that is important then
Cassandra prolly isn't a good fit.  You did mention time series data
too, and that's a sweet spot for Cassandra and not something I
personally would put in a document-based datastore like MongoDB.

Good luck.
-Aaron

On Sun, Nov 20, 2011 at 1:24 PM, Dotan N.  wrote:
> Jahangir, thanks! however I've noted that we may very well need to scale to
> 200M users or "entities" within a short amount of time - say a year or two,
> 10M within few months.
>
> --
> Dotan, @jondot
>
>
> On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed
>  wrote:
>>
>> IMHO, you should start with something very simple RDBMS and meanwhile
>> getting handle over Cassandra or other noSql technology. Start out with
>> simple, but always be aware and conscious of the next thing you will have in
>> stack. It's timetaking to work with new technology if you are in the phase
>> of prototyping something fast and geared towards a Vc demo. In most of the
>> cases, you won't need noSql for a while unless there is a very strong case.
>>
>> Thanks,
>> Jahangir
>>
>> On Nov 20, 2011 4:04 PM, "Dotan N."  wrote:
>>>
>>> Thanks David.
>>> Stephen: thanks for the tip, we can run a recommended configuration, so
>>> that wouldn't be an issue. I guess I can focus that my questions are on
>>> complexity of development.
>>> After digesting David's answer, I guess my follow up questions would be
>>> - how would you process data in a cassandra cluster, typically? via
>>> one-off coded offline jobs?
>>> - how easy is map/reduce on existing data (just looked at brisk but it
>>> may be unrelated, any case, not too much written about it)
>>> - how would you do analytics over a cassandra cluster
>>> - given the common examples of time-series, how would you recommend to
>>> aggregate (sum, avg, facet) and provide statistics over the collected data?
>>> for example if it were kinds of logs and you'd like to group all of certain
>>> fields in it, or provide a histogram over it.
>>> Thanks!
>>>
>>> --
>>> Dotan, @jondot
>>>
>>>
>>> On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly
>>>  wrote:

 if your startup is bootstrapping then cassandra is sometimes to heavy to
 start with.

 i.e. it needs to be fed ram... you're not going to seriously run it in
 less than 1gb per node... that level of ram commitment can be too much 
 while
 bootstrapping.

 if your startup has enough cash to pay for 3-5 recommended spec (see
 wiki) nodes to be up 24/7 then cassandra is a good fit...

 a friend of mine is bootstrapping a startup and had to drop back to
 mysql while he finds his pain points and customers... he knows he will end
 up jumping back to cassandra when he gets enough customers (or a VC) but 
 for
 now the running costs are too much to pay from his own pocket... note that
 the jdbc driver and cql will make jumping back easy for him (as he still
 tests with c*... just runs at present against mysql nuts eh!)

 - Stephen

 ---
 Sent from my Android phone, so random spelling mistakes, random nonsense
 words and other nonsense are a direct result of using swype to type on the
 screen

 On 20 Nov 2011 19:07, "Dotan N."  wrote:
>
> Hi all,
> my question may be more philosophical than related technically
> to Cassandra, but please bear with me.
> Given that a young startup may not know its product full at the early
> stages, but that it definitely points to ~200M users,
> would Cassandra will be the right way to go?
> That is, the requirement is for a large data store, that can move with
> product changes and requirements swiftly.
> Given that in Cassandra one thinks hard about the queries, and then
> builds a model to suit it best, I was thinking of
> this situation as problematic.
> So here are some questions:
> - would it be wiser to start with a more agile data store (such as
> mongodb) and then progress onto Cassandra, whe

Re: What sort of load do the tombstones create on the cluster?

2011-11-20 Thread Jahangir Mohammed
Mostly, they are I/O and CPU intensive during major compaction. If Ganglia
doesn't show anything suspicious there, then where is the performance loss?
Reads or writes?
On Nov 17, 2011 1:01 PM, "Maxim Potekhin"  wrote:

> In view of my unpleasant discovery last week that deletions in Cassandra
> lead to a very real
> and serious performance loss, I'm working on a strategy of moving forward.
>
> If the tombstones do cause such problem, where should I be looking for
> performance bottlenecks?
> Is it disk, CPU or something else? Thing is, I don't see anything
> outstanding in my Ganglia plots.
>
> TIA,
>
> Maxim
>
>


Re: data agility

2011-11-20 Thread Dotan N.
Thanks Aaron, I kept this use-case free so as to focus on the higher-level
description; that might not have been a good idea.
But generally I think I got a better intuition from the various answers,
thanks!


--
Dotan, @jondot 



On Sun, Nov 20, 2011 at 11:52 PM, Aaron Turner  wrote:

> Sounds like you need to figure out what your product is going to do
> and what technology will best fit those requirements.  I know you're
> worried about being agile and all that, but scaling requires you to
> use the right tool for the job. Worry about new requirements when they
> rear their ugly head rather then a dozen of "what if" scenarios.
>
> You can scale MySQL/etc and Cassandra, MongoDB, etc to 10-200M "users"
> depending on what you're asking your datastore to do.  You haven't
> defined that really at all other then some comments about wanting to
> do some map/reduce jobs.
>
> Really what you should be doing is figuring out what kind of data you
> need to store and your needs like access patterns, availability, ACID
> compliance, etc and then figure out what technology is the best fit.
> There are tons of "Cassandra vs X" comparisons for every NoSQL DB in
> existence.
>
> Other then that, the map/reduce on Cassandra is more job based rather
> then useful for interactive queries so if that is important then
> Cassandra prolly isn't a good fit.  You did mention time series data
> too, and that's a sweet spot for Cassandra and not something I
> personally would put in a document based datastore like MonogoDB.
>
> Good luck.
> -Aaron
>
> On Sun, Nov 20, 2011 at 1:24 PM, Dotan N.  wrote:
> > Jahangir, thanks! however I've noted that we may very well need
> to scale to
> > 200M users or "entities" within a short amount of time - say a year or
> two,
> > 10M within few months.
> >
> > --
> > Dotan, @jondot
> >
> >
> > On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed
> >  wrote:
> >>
> >> IMHO, you should start with something very simple RDBMS and meanwhile
> >> getting handle over Cassandra or other noSql technology. Start out with
> >> simple, but always be aware and conscious of the next thing you will
> have in
> >> stack. It's timetaking to work with new technology if you are in the
> phase
> >> of prototyping something fast and geared towards a Vc demo. In most of
> the
> >> cases, you won't need noSql for a while unless there is a very strong
> case.
> >>
> >> Thanks,
> >> Jahangir
> >>
> >> On Nov 20, 2011 4:04 PM, "Dotan N."  wrote:
> >>>
> >>> Thanks David.
> >>> Stephen: thanks for the tip, we can run a recommended configuration, so
> >>> that wouldn't be an issue. I guess I can focus that my questions are on
> >>> complexity of development.
> >>> After digesting David's answer, I guess my follow up questions would be
> >>> - how would you process data in a cassandra cluster, typically? via
> >>> one-off coded offline jobs?
> >>> - how easy is map/reduce on existing data (just looked at brisk but it
> >>> may be unrelated, any case, not too much written about it)
> >>> - how would you do analytics over a cassandra cluster
> >>> - given the common examples of time-series, how would you recommend to
> >>> aggregate (sum, avg, facet) and provide statistics over the collected
> data?
> >>> for example if it were kinds of logs and you'd like to group all of
> certain
> >>> fields in it, or provide a histogram over it.
> >>> Thanks!
> >>>
> >>> --
> >>> Dotan, @jondot
> >>>
> >>>
> >>> On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly
> >>>  wrote:
> 
>  if your startup is bootstrapping then cassandra is sometimes to heavy
> to
>  start with.
> 
>  i.e. it needs to be fed ram... you're not going to seriously run it in
>  less than 1gb per node... that level of ram commitment can be too
> much while
>  bootstrapping.
> 
>  if your startup has enough cash to pay for 3-5 recommended spec (see
>  wiki) nodes to be up 24/7 then cassandra is a good fit...
> 
>  a friend of mine is bootstrapping a startup and had to drop back to
>  mysql while he finds his pain points and customers... he knows he
> will end
>  up jumping back to cassandra when he gets enough customers (or a VC)
> but for
>  now the running costs are too much to pay from his own pocket... note
> that
>  the jdbc driver and cql will make jumping back easy for him (as he
> still
>  tests with c*... just runs at present against mysql nuts eh!)
> 
>  - Stephen
> 
>  ---
>  Sent from my Android phone, so random spelling mistakes, random
> nonsense
>  words and other nonsense are a direct result of using swype to type
> on the
>  screen
> 
>  On 20 Nov 2011 19:07, "Dotan N."  wrote:
> >
> > Hi all,
> > my question may be more philosophical than related technically
> > to Cassandra, but please bear with me.
> > Given that a young startup may not know its product full at the early
> > stages, but that it de

Re: data agility

2011-11-20 Thread Milind Parikh
For 99% of current applications requiring a persistent datastore, Oracle,
PgSQL and MySQL variants will suffice.

For the remaining 1% of applications, consider C* if

 (a) you have given up on distributed transactions ("ACID"LY; but
NOT "BASE"ICLY)
 (b) you are wondering about this new-fangled horizontal scalability
buzzword and wonder why disks cannot spin faster and faster
 (c) need/want to design optimized query paths for your data with a
and b

Rewording a, b and c
  a.1 Cassandra provides best-in-class low-latency asynchronous
replication with battle-hardened mechanisms to manage "eventual
consistency" in an inherently disordered ("entroprophized") world... ACID
versus BASE transactions
  b.1 Cassandra's write path is completely optimized. It will write
as fast as the disk will allow; but the most important feature is that
if you need to write faster than an individual server will allow, add more
servers. The locality-of-data principle, the inexorably faster computations
and anti-entropy services enable you to cloud-scale.
   c.1 Writing is easy; but then you actually need to find the
data. And do it at scale--speed-wise.  The columnar nature of Cassandra,
the design of its internals, and support at the API level
(composite indexes) make it possible to have fast querying capabilities.

Milind


On Sun, Nov 20, 2011 at 2:19 PM, Dotan N.  wrote:

> Thanks Aaron, I kept this use-case free as to focus on the higher level
> description, it might have been a not a good idea.
> But generally I think I got a better intuition from the various answers,
> thanks!
>
>
> --
> Dotan, @jondot 
>
>
>
> On Sun, Nov 20, 2011 at 11:52 PM, Aaron Turner wrote:
>
>> Sounds like you need to figure out what your product is going to do
>> and what technology will best fit those requirements.  I know you're
>> worried about being agile and all that, but scaling requires you to
>> use the right tool for the job. Worry about new requirements when they
>> rear their ugly head rather then a dozen of "what if" scenarios.
>>
>> You can scale MySQL/etc and Cassandra, MongoDB, etc to 10-200M "users"
>> depending on what you're asking your datastore to do.  You haven't
>> defined that really at all other then some comments about wanting to
>> do some map/reduce jobs.
>>
>> Really what you should be doing is figuring out what kind of data you
>> need to store and your needs like access patterns, availability, ACID
>> compliance, etc and then figure out what technology is the best fit.
>> There are tons of "Cassandra vs X" comparisons for every NoSQL DB in
>> existence.
>>
>> Other then that, the map/reduce on Cassandra is more job based rather
>> then useful for interactive queries so if that is important then
>> Cassandra prolly isn't a good fit.  You did mention time series data
>> too, and that's a sweet spot for Cassandra and not something I
>> personally would put in a document based datastore like MonogoDB.
>>
>> Good luck.
>> -Aaron
>>
>> On Sun, Nov 20, 2011 at 1:24 PM, Dotan N.  wrote:
>> > Jahangir, thanks! however I've noted that we may very well need
>> to scale to
>> > 200M users or "entities" within a short amount of time - say a year or
>> two,
>> > 10M within few months.
>> >
>> > --
>> > Dotan, @jondot
>> >
>> >
>> > On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed
>> >  wrote:
>> >>
>> >> IMHO, you should start with something very simple RDBMS and meanwhile
>> >> getting handle over Cassandra or other noSql technology. Start out with
>> >> simple, but always be aware and conscious of the next thing you will
>> have in
>> >> stack. It's timetaking to work with new technology if you are in the
>> phase
>> >> of prototyping something fast and geared towards a Vc demo. In most of
>> the
>> >> cases, you won't need noSql for a while unless there is a very strong
>> case.
>> >>
>> >> Thanks,
>> >> Jahangir
>> >>
>> >> On Nov 20, 2011 4:04 PM, "Dotan N."  wrote:
>> >>>
>> >>> Thanks David.
>> >>> Stephen: thanks for the tip, we can run a recommended configuration,
>> so
>> >>> that wouldn't be an issue. I guess I can focus that my questions are
>> on
>> >>> complexity of development.
>> >>> After digesting David's answer, I guess my follow up questions would
>> be
>> >>> - how would you process data in a cassandra cluster, typically? via
>> >>> one-off coded offline jobs?
>> >>> - how easy is map/reduce on existing data (just looked at brisk but it
>> >>> may be unrelated, any case, not too much written about it)
>> >>> - how would you do analytics over a cassandra cluster
>> >>> - given the common examples of time-series, how would you recommend to
>> >>> aggregate (sum, avg, facet) and provide statistics over the collected
>> data?
>> >>> for example if it were kinds of logs and you'd like to group all of
>> certain
>> >>> fields in it, or provide a histogram over it.
>> >>> Thanks!
>> >>>
>> >>> --
>> >>> Dotan, @jondot
>> >>>
>> >>>
>> >>> On Su

Re: Network traffic patterns

2011-11-20 Thread Philippe
I'm using BOP (the ByteOrderedPartitioner).
On Nov 20, 2011 13:09, "Boris Yen"  wrote:

> I am just curious about which partitioner you are using?
>
> On Thu, Nov 17, 2011 at 4:30 PM, Philippe  wrote:
>
>> Hi Todd
>> Yes all equal hardware. Nearly no CPU usage and no memory issues.
>> Repairs are running in tens of minutes so i don't understand why
>> replication would be backed up.
>>
>> Any other ideas?
>> On Nov 17, 2011 02:33, "Todd Burruss"  wrote:
>>
>> Are all of your machines equal hardware?  Since those machines are
>>> sending data somewhere, maybe they are behind in replicating and are
>>> continuously catching up?
>>>
>>> Use a tool like tcpdump to find out where the data is going
>>>
>>> From: Philippe 
>>> Reply-To: "user@cassandra.apache.org" 
>>> Date: Tue, 15 Nov 2011 13:22:38 -0800
>>> To: user 
>>> Subject: Re: Network traffic patterns
>>>
>>> Sorry about the previous message, I've enabled keyboard shortcuts on
>>> gmail...*sigh*...
>>>
>>> Hello,
>>> I'm trying to understand the network usage I am seeing in my cluster,
>>> can anyone shed some light?
>>> It's an RF=3, 12-node, cassandra 0.8.6 cluster. repair is performed on
>>> each node once a week, with a rolling schedule.
>>> The nodes are p13,p14,p15...p24 and are consecutive in that order on the
>>> ring. Each node is only a cassandra database. I am hitting the cluster from
>>> another server (p4).
>>>
>>> p4 is doing this with 20 threads in parallel
>>>
>>>1. read a lot of data (some columns for hundreds to tens of
>>>thousands of keys, split into 512-key multigets)
>>>2. process the data
>>>3. write back a byte array to cassandra (average size is 400 bytes)
>>>4. go back to 1
>>>
>>> According to my munin graphs, network usage is about as follows. I am
>>> not surprised at the bias towards p13-p15 as p4 is getting & storing data
>>> mainly for keys located on one of those nodes.
>>>
>>>- p4 : 1.5Mb/s in and out
>>>- p13-p15 : 15Mb/s in and 80Mb/s out
>>>- p16-p24 : 45Mb/s in and 5Mb/s out
>>>
>>> What I don't understand is why p4 is only seeing 1.5Mb/s while I see
>>> 80Mb/s on p13 & p15.
>>>
>>> The way I understand this:
>>>
>>>- p4 makes a multiget to the cluster, electing to use any node in
>>>    the cluster (IN traffic describing the query)
>>>- coordinator node replays the query on all 3 replicas (so 3 servers
>>>each get the IN traffic, mostly p13-p15)
>>>- each server replies to coordinator
>>>- coordinator chooses matching values and sends back data to p4
>>>
>>> So if p13-p15 are outputting 80Mb/s why am I not seeing 80Mb/s coming
>>> into p4 which is on the receiving end ?
>>>
>>> Thanks
>>>
>>> 2011/11/15 Philippe 
>>>
 Hello,
 I'm trying to understand the network usage I am seeing in my cluster,
 can anyone shed some light?
 It's an RF=3, 12-node, cassandra 0.8.6 cluster. The nodes are
 p13,p14,p15...p24 and are consecutive in that order on the ring.
 Each node is only a cassandra database. I am hitting the cluster from
 another server (p4).

 The pattern on p4 is to

1. read a lot of data (some columns for hundreds to tens of
thousands of keys, split into 512-key multigets)
2. process the data
3. write back a byte array to cassandra (average size is 400 bytes)


 p4 reads as

>>>
>>>
>


Re: Data Model Design for Login Service

2011-11-20 Thread Maciej Miklas
I will follow exactly this solution - thanks :)

On Fri, Nov 18, 2011 at 9:53 PM, David Jeske  wrote:

> On Thu, Nov 17, 2011 at 1:08 PM, Maciej Miklas 
> wrote:
>
>> A) Skinny rows
>>  - row key contains login name - this is the main search criterion
>>  - login data is replicated - each possible login is stored as a single
>> row which contains all user data - 10 logins for a single customer create
>> 10 rows, where each row has a different key and the same content
>>
>
> To me this seems reasonable. Remember, because of your replication of the
> data values, you will want a quick way to find all the logins for a given ID,
> so you will also want to store a separate dataset like:
>
> 1122 {
>  alfred.tes...@xyz.de =1 (where the login is a column key)
>  alf...@aad.de =1
> }
>
> When you do an update, you'll need to fetch the entire row for the
> user-id, and then update all copies of the data. This can create problems
> if the data is out of sync (which it will be at certain times because of
> eventual consistency, and might be if something bad happens).
>
> ...the other option, of course, is to make a login-name indirection. You
> would have only one copy of the user-data stored by ID, and then you would
> store a separate mapping from login-name-to-ID. Of course this would
> require two roundtrips to get the user information from login-id, which is
> something I know you said you didn't want to do.
>
>
>
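
A minimal sketch of option A's bookkeeping, with in-memory maps standing in
for the two column families (all names hypothetical):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class LoginStoreSketch {
        // "UsersByLogin" stand-in: one full copy of the user data per login name.
        private final Map<String, Map<String, String>> userByLogin =
                new HashMap<String, Map<String, String>>();
        // "LoginsById" stand-in: customer id -> all of its login names
        // (the reverse index David describes).
        private final Map<String, Set<String>> loginsById =
                new HashMap<String, Set<String>>();

        void addLogin(String customerId, String login, Map<String, String> userData) {
            userByLogin.put(login, new HashMap<String, String>(userData));
            Set<String> logins = loginsById.get(customerId);
            if (logins == null) {
                logins = new HashSet<String>();
                loginsById.put(customerId, logins);
            }
            logins.add(login);
        }

        // An update must rewrite every replicated copy, found via the index.
        void updateUser(String customerId, Map<String, String> newData) {
            for (String login : loginsById.get(customerId)) {
                userByLogin.put(login, new HashMap<String, String>(newData));
            }
        }
    }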