Benefit of LOCAL_SERIAL consistency

2016-12-07 Thread Hiroyuki Yamada
Hi,

I have been using lightweight transactions for several months now and
wondering what is the benefit of having LOCAL_SERIAL serial consistency level.

With SERIAL, it achieves global linearizability,
but with LOCAL_SERIAL, it only achieves DC-local linearizability,
which misses the point of linearizability, I think.

So, for example,
once SERIAL has been used,
we can't switch to LOCAL_SERIAL to achieve local linearizability,
since data in the local DC might not be updated yet to meet quorum.
And vice versa:
once LOCAL_SERIAL has been used,
we can't switch to SERIAL to achieve global linearizability,
since data is not yet globally updated to meet quorum.

So, it would be great if we could use LOCAL_SERIAL when possible and
use SERIAL only if the local DC is down or unavailable,
but based on the example above, I think that is not possible, is it?
So, I am not sure what a good use case for LOCAL_SERIAL is.

The only case that I can think of is having a cluster in one DC for
online transactions and another cluster in another DC for analytics.
In this case, I think there is no big point in using SERIAL, since
analytics data sometimes doesn't have to be very correct/fresh and
can be asynchronously replicated to the analytics nodes. (So using
LOCAL_SERIAL for one DC makes sense.)

Could anyone give me some thoughts on this?

Thanks,
Hiro


Re: Benefit of LOCAL_SERIAL consistency

2016-12-07 Thread DuyHai Doan
The reason you don't want to use SERIAL in multi-DC clusters is the
prohibitive cost of lightweight transactions (in terms of latency),
especially if your data centers are separated by continents. A ping from
London to New York takes about 52ms just from the speed of light in optical
fiber. Since a lightweight transaction involves 4 network round-trips, that
means at least ~200ms just for raw network transfer, not even taking into
account the cost of processing the operation.
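As a rough sanity check on that claim, the network-only floor can be computed directly (a back-of-the-envelope sketch, using the ~52ms round-trip figure and the 4-round-trip count from the paragraph above):

```python
# Back-of-the-envelope latency floor for one cross-DC lightweight transaction.
# Figures taken from the message above: ~52 ms round-trip London <-> New York,
# and 4 network round-trips per LWT.
PING_RTT_MS = 52
LWT_ROUND_TRIPS = 4

lwt_network_floor_ms = PING_RTT_MS * LWT_ROUND_TRIPS
print(lwt_network_floor_ms)  # 208 -- before any processing cost at all
```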

You're right to raise a warning about mixing LOCAL_SERIAL with SERIAL.
LOCAL_SERIAL guarantees you linearizability inside a DC; SERIAL guarantees
you linearizability across multiple DCs.

If I have 3 DCs with RF = 3 each (9 replicas total) and I do an INSERT IF
NOT EXISTS with LOCAL_SERIAL in DC1, then it's possible that a subsequent
INSERT IF NOT EXISTS on the same record succeeds when using SERIAL, because
SERIAL on 9 replicas means at least 5 replicas. Those 5 responding replicas
can all come from DC2 and DC3, and thus may not yet have applied the
previous INSERT...
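The quorum arithmetic behind that failure mode can be sketched directly (a minimal model of the majority counts, not driver code): with RF = 3 per DC across 3 DCs, a LOCAL_SERIAL quorum in DC1 and a SERIAL quorum drawn entirely from DC2 + DC3 need not overlap at all.

```python
def quorum(replicas: int) -> int:
    """Majority quorum: floor(n/2) + 1."""
    return replicas // 2 + 1

RF_PER_DC = 3
NUM_DCS = 3
total_replicas = RF_PER_DC * NUM_DCS           # 9

local_serial_quorum = quorum(RF_PER_DC)        # 2 replicas, all inside DC1
serial_quorum = quorum(total_replicas)         # 5 replicas, cluster-wide

# DC2 + DC3 alone hold enough replicas to satisfy the SERIAL quorum,
# so the SERIAL quorum is not guaranteed to intersect DC1's quorum:
replicas_outside_dc1 = total_replicas - RF_PER_DC   # 6
print(serial_quorum <= replicas_outside_dc1)        # True -> no guaranteed overlap
```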



Re: Benefit of LOCAL_SERIAL consistency

2016-12-07 Thread Edward Capriolo
> You're right to raise a warning about mixing LOCAL_SERIAL with SERIAL.
> LOCAL_SERIAL guarantees you linearizability inside a DC, SERIAL guarantees
> you linearizability across multiple DC.

I am not sure what the state of this is anymore, but I was under the
impression that the linearizability of LWT was in question. I never heard it
specifically addressed.

https://issues.apache.org/jira/browse/CASSANDRA-6106

It's hard to follow 6106 because most of the tasks are closed 'fix later'
or closed 'not a problem'.


Re: node decommission throttled

2016-12-07 Thread Eric Evans
On Tue, Dec 6, 2016 at 9:54 AM, Aleksandr Ivanov  wrote:
> I'm trying to decommission one C* node from a 6-node cluster and see that
> outbound network traffic on this node doesn't go over ~30Mb/s.
> It looks like it is throttled somewhere in C*.

Do you use compression?  Try taking a thread dump and see what the
utilization of the sending threads is.


-- 
Eric Evans
john.eric.ev...@gmail.com


Re: node decommission throttled

2016-12-07 Thread Benjamin Roth
Maybe your system cannot stream faster. Is your CPU or HDD/SSD fully
utilized?



Re: Batch size warnings

2016-12-07 Thread Voytek Jarnot
Should've mentioned - running 3.9.  Also - please do not recommend MVs: I
tried, they're broken, we punted.



Re: Batch size warnings

2016-12-07 Thread Benjamin Roth
Could you please be more specific?



Batch size warnings

2016-12-07 Thread Voytek Jarnot
The low default value for batch_size_warn_threshold_in_kb is making me
wonder if I'm perhaps approaching the problem of atomicity in a non-ideal
fashion.

With one data set duplicated/denormalized into 5 tables to support queries,
we use batches to ensure inserts make it to all or 0 tables.  This works
fine, but I've had to bump the warn threshold and fail threshold
substantially (8x higher for the warn threshold).  This - in turn - makes
me wonder, with a default setting so low, if I'm not solving this problem
in the canonical/standard way.

Mostly just looking for confirmation that we're not unintentionally doing
something weird...
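To see how quickly a multi-table logged batch outgrows the 5kb default, a rough size estimate helps (the per-row byte count here is a purely hypothetical assumption for illustration; Cassandra measures the serialized mutation sizes, not the CQL text):

```python
BATCH_SIZE_WARN_THRESHOLD_KB = 5   # Cassandra's default batch_size_warn_threshold_in_kb
num_tables = 5                     # same record denormalized into 5 tables
approx_row_bytes = 1200            # hypothetical serialized mutation size per row

# One logged batch carrying the record to all 5 tables:
batch_bytes = num_tables * approx_row_bytes
print(round(batch_bytes / 1024, 2))  # 5.86 -> already past the 5kb warning threshold
```

So even a modest per-row payload, multiplied by the number of denormalized tables, trips the warning; that is why the thread below discusses raising the threshold versus rethinking the pattern.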


Re: Batch size warnings

2016-12-07 Thread Voytek Jarnot
Sure, about which part?

The default batch size warning threshold is 5kb.
I've increased it to 30kb, and will need to increase it to 40kb (8x the
default) to avoid WARN log messages about batch sizes.  I do realize it's
just a WARNing, but may as well avoid those if I can configure it out.
That said, having to increase it so substantially (and we're only dealing
with 5 tables) is making me wonder if I'm not taking the correct approach
in terms of using batches to guarantee atomicity.



Re: Batch size warnings

2016-12-07 Thread Benjamin Roth
I meant the MV thing.



Re: Batch size warnings

2016-12-07 Thread Voytek Jarnot
It's been about a month since I gave up on it, but it was very much related
to the stuff you're dealing with... basically Cassandra just stepping on its
own, er, tripping over its own feet streaming MVs.



Re: Batch size warnings

2016-12-07 Thread Benjamin Roth
Ok, thanks. I'm investigating a lot. There will be some improvements
coming, but I cannot promise they will solve all existing problems. We will
see and keep working on it.



Re: Batch size warnings

2016-12-07 Thread Edward Capriolo
I have been circling around a thought process on batches. Now that
Cassandra has aggregating functions, it might be possible to write a type
of record that has an END_OF_BATCH-type marker, so that the data can be
suppressed from view until it is all there.

I.e., you write something like a checksum record that an intelligent client
can use to tell whether the rest of the batch is complete.
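A minimal sketch of that idea (purely client-side, with in-memory stand-ins for the tables; the marker scheme is a hypothetical pattern, not an existing Cassandra feature): the writer appends a completeness record counting the rows it wrote, and the reader suppresses the batch until the count matches.

```python
rows = []      # stand-in for the data table
markers = {}   # stand-in for a per-batch completeness/"checksum" record

def write_batch(batch_id, payloads):
    for p in payloads:
        rows.append({"batch_id": batch_id, "payload": p})
    # END_OF_BATCH-style marker, written last, recording how many rows belong:
    markers[batch_id] = len(payloads)

def read_complete(batch_id):
    """Return the batch's rows only if the completeness marker says we have them all."""
    got = [r for r in rows if r["batch_id"] == batch_id]
    if markers.get(batch_id) != len(got):
        return None        # marker missing or rows still in flight -> suppress
    return got

write_batch("b1", ["row-a", "row-b"])
print(read_complete("b1") is not None)  # True -- marker and row count agree
```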



Re: Batch size warnings

2016-12-07 Thread Jonathan Haddad
@Ed, what you just said reminded me a lot of RAMP transactions.  I did a
blog post on it here: http://rustyrazorblade.com/2015/11/ramp-made-easy/

I've been considering doing a follow up on how to do a Cassandra data model
enabling RAMP transactions, but that takes time, and I have almost zero of
that.



Re: Batch size warnings

2016-12-07 Thread Cody Yancey
Hi Voytek,
I think the way you are using it is definitely the canonical way.
Unfortunately, as you learned, there are some gotchas. We tried
substantially increasing the batch size and it worked for a while, until we
reached new scale, and we increased it again, and so forth. It works, but
soon you start getting write timeouts, lots of them. And the thing about
multi-partition batch statements is that they offer atomicity, but not
isolation. This means your database can temporarily be in an inconsistent
state while writes are propagating to the various machines.

For our use case, we could deal with temporary inconsistency, as long as it
was for a strictly bounded period of time, on the order of a few seconds.
Unfortunately, as with all things eventually consistent, it degrades to
"totally inconsistent" when your database is under heavy load and the
time-bounds expand beyond what the application can handle. When a batch
write times out, it often still succeeds (eventually), but your tables can
be inconsistent for minutes, even while nodetool status shows all nodes up
and normal.

But there is another way, that requires us to take a page from our RDBMS
ancestors' book: multi-phase commit.

Similar to logged batch writes, multi-phase commit patterns typically
entail some write amplification cost for the benefit of stronger
consistency guarantees across isolatable units (in Cassandra's case,
*partitions*). However, multi-phase commit offers stronger guarantees that
batch writes, and ALL of the additional write load is completely
distributed as per your load-balancing policy, where as batch writes all go
through one coordinator node, then get written in their entirety to the
batch log on two or three nodes, and then get dispersed in a distributed
fashion from there.

A typical two-phase commit pattern looks like this:

The Write Path

   1. The client code chooses a random UUID.
   2. The client writes the UUID into the IncompleteTransactions table,
   which only has one column, the transactionUUID.
   3. The client makes all of the inserts involved in the transaction, IN
   PARALLEL, with the transactionUUID duplicated in every inserted row.
   4. The client deletes the UUID from IncompleteTransactions table.
   5. The client makes parallel updates to all of the rows it inserted, IN
   PARALLEL, setting the transactionUUID to null.

The Read Path

   1. The client reads some rows from a partition. If this particular
   client request can handle extraneous rows, you are done. If not, read on to
   step #2.
   2. The client gathers the set of unique transactionUUIDs. In the main
   case, they've all been deleted by step #5 in the Write Path. If not, go to
   #3.
   3. For remaining transactionUUIDs (which should be a very small number),
   query the IncompleteTransactions table.
   4. The client code culls rows where the transactionUUID existed in the
   IncompleteTransactions table.

This is just an example, one that is reasonably performant for ledger-style
non-updated inserts. For transactions involving updates to possibly
existing data, more effort is required, generally the client needs to be
smart enough to merge updates based on a timestamp, with a periodic batch
job that cleans out obsolete inserts. If it feels like reinventing the
wheel, that's because it is. But it just might be the quickest path to what
you need.
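The write and read paths above can be sketched with in-memory stand-ins for the tables (a simulation of the pattern, not driver code; a real implementation would issue the step-3 and step-5 writes in parallel through the driver):

```python
import uuid

incomplete_txns = set()   # stand-in for the IncompleteTransactions table
table = {}                # stand-in for one data table: key -> row

def transactional_insert(rows):
    txn = uuid.uuid4()                      # write.1: pick a random UUID
    incomplete_txns.add(txn)                # write.2: record it as in-flight
    for key, data in rows.items():          # write.3: insert rows (in parallel, really)
        table[key] = {"data": data, "txn": txn}
    incomplete_txns.discard(txn)            # write.4: transaction is now complete
    for key in rows:                        # write.5: clear markers (in parallel, really)
        table[key]["txn"] = None

def read_committed(keys):
    result = {k: table[k] for k in keys if k in table}       # read.1
    txns = {r["txn"] for r in result.values() if r["txn"]}   # read.2: leftover UUIDs
    pending = {t for t in txns if t in incomplete_txns}      # read.3: still in flight?
    return {k: r for k, r in result.items()                  # read.4: cull pending rows
            if r["txn"] not in pending}

transactional_insert({"k1": "v1", "k2": "v2"})
print(sorted(read_committed(["k1", "k2"])))  # ['k1', 'k2']
```

A row left with a non-null txnUUID that is still present in IncompleteTransactions (i.e., a writer that died between steps 2 and 4) is invisible to readers, which is the atomicity the batch was buying.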

Thanks,
Cody


Re: Batch size warnings

2016-12-07 Thread Voytek Jarnot
Appreciate the long writeup Cody.

Yeah, we're good with temporary inconsistency (thankfully) as well.  I'm
going to try to ride the batch train and hope it doesn't derail - our load
is fairly static (or, more precisely, increase in load is fairly slow and
can be projected).

Enjoyed your two-phase commit text.  Presumably one would also have some
cleanup implementation that culls any failed updates (write.5), which could
be identified in read.3 / read.4?  There's still a possible disconnect
between write.3 and write.4, but there's always something...

We're insert-only (well, with some deletes via TTL, but anyway), so that's
somewhat tempting, but I'd rather not prematurely optimize.  Unless, of
course, anyone's got experience such that "batches over XXkb are definitely
going to be a problem".

Appreciate everyone's time.
--Voytek Jarnot

On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey  wrote:

> Hi Voytek,
> I think the way you are using it is definitely the canonical way.
> Unfortunately, as you learned, there are some gotchas. We tried
> substantially increasing the batch size and it worked for a while, until we
> reached new scale, and we increased it again, and so forth. It works, but
> soon you start getting write timeouts, lots of them. And the thing about
> multi-partition batch statements is that they offer atomicity, but not
> isolation. This means your database can temporarily be in an inconsistent
> state while writes are propagating to the various machines.
>
> For our use case, we could deal with temporary inconsistency, as long as
> it was for a strictly bounded period of time, on the order of a few
> seconds. Unfortunately, as with all things eventually consistent, it
> degrades to "totally inconsistent" when your database is under heavy load
> and the time-bounds expand beyond what the application can handle. When a
> batch write times out, it often still succeeds (eventually) but your tables
> can be inconsistent for
>
> minutes, even while nodetool status shows all nodes up and normal.
>
> But there is another way, that requires us to take a page from our RDBMS
> ancestors' book: multi-phase commit.
>
> Similar to logged batch writes, multi-phase commit patterns typically
> entail some write amplification cost for the benefit of stronger
> consistency guarantees across isolatable units (in Cassandra's case,
> *partitions*). However, multi-phase commit offers stronger guarantees
> that batch writes, and ALL of the additional write load is completely
> distributed as per your load-balancing policy, where as batch writes all go
> through one coordinator node, then get written in their entirety to the
> batch log on two or three nodes, and then get dispersed in a distributed
> fashion from there.
>
> A typical two-phase commit pattern looks like this:
>
> The Write Path
>
>1. The client code chooses a random UUID.
>2. The client writes the UUID into the IncompleteTransactions table,
>which only has one column, the transactionUUID.
>3. The client makes all of the inserts involved in the transaction, IN
>PARALLEL, with the transactionUUID duplicated in every inserted row.
>4. The client deletes the UUID from IncompleteTransactions table.
>5. The client makes parallel updates to all of the rows it inserted,
>IN PARALLEL, setting the transactionUUID to null.
>
> The Read Path
>
>1. The client reads some rows from a partition. If this particular
>client request can handle extraneous rows, you are done. If not, read on to
>step #2.
>2. The client gathers the set of unique transactionUUIDs. In the main
>case, they've all been deleted by step #5 in the Write Path. If not, go to
>#3.
>3. For remaining transactionUUIDs (which should be a very small
>number), query the IncompleteTransactions table.
>4. The client code culls rows where the transactionUUID existed in the
>IncompleteTransactions table.
>
> This is just an example, one that is reasonably performant for
> ledger-style non-updated inserts. For transactions involving updates to
> possibly existing data, more effort is required, generally the client needs
> to be smart enough to merge updates based on a timestamp, with a periodic
> batch job that cleans out obsolete inserts. If it feels like reinventing
> the wheel, that's because it is. But it just might be the quickest path to
> what you need.
>
> Thanks,
> Cody
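[Editor's note] The write and read paths above can be simulated in plain Python against in-memory stand-ins for the two tables (a sketch only; the table and column names follow Cody's example, everything else here is hypothetical):

```python
import uuid

# In-memory stand-ins for the two tables in the example.
incomplete_transactions = set()          # holds transactionUUIDs still in flight
data_table = []                          # rows: dicts with a txn_uuid field

def write_transaction(rows):
    """Two-phase commit write path from the example above."""
    txn = uuid.uuid4()                   # step 1: choose a random UUID
    incomplete_transactions.add(txn)     # step 2: record the txn as incomplete
    for row in rows:                     # step 3: insert rows tagged with the UUID
        data_table.append({**row, "txn_uuid": txn})
    incomplete_transactions.discard(txn) # step 4: delete the UUID (txn committed)
    for row in data_table:               # step 5: null out the UUID on each row
        if row["txn_uuid"] == txn:
            row["txn_uuid"] = None

def read_rows():
    """Two-phase commit read path: cull rows from incomplete transactions."""
    rows = list(data_table)
    pending = {r["txn_uuid"] for r in rows if r["txn_uuid"] is not None}
    # Only the UUIDs still present in IncompleteTransactions mark bad rows.
    incomplete = {t for t in pending if t in incomplete_transactions}
    return [r for r in rows if r["txn_uuid"] not in incomplete]
```

The key property: a writer that crashes between steps 3 and 4 leaves its UUID in IncompleteTransactions, so readers cull its partial rows until a cleanup job removes them.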
>
> On Wed, Dec 7, 2016 at 10:15 AM, Edward Capriolo 
> wrote:
>
>> I have been circling around a thought process over batches. Now that
>> Cassandra has aggregating functions, it might be possible to write a type of
>> record that has an END_OF_BATCH type marker so that the data can be
>> suppressed from view until it is all there.
>>
>> IE you write something like a checksum record that an intelligent client
>> can use to tell if the rest of the batch is complete.
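[Editor's note] A minimal sketch of that idea, with an in-memory store standing in for a table (all names here are hypothetical): the writer appends a final marker record carrying the expected row count, and the client only surfaces the batch once the marker is present and the count matches.

```python
def write_batch(store, batch_id, rows):
    """Write the batch rows, then an END_OF_BATCH marker with the row count."""
    for seq, row in enumerate(rows):
        store.append({"batch": batch_id, "seq": seq, "row": row})
    # Marker record written last; acts like a checksum for the batch.
    store.append({"batch": batch_id, "seq": "END_OF_BATCH", "count": len(rows)})

def read_complete_batch(store, batch_id):
    """Return the batch's rows, or None if the batch is not yet complete."""
    records = [r for r in store if r["batch"] == batch_id]
    markers = [r for r in records if r["seq"] == "END_OF_BATCH"]
    rows = [r for r in records if r["seq"] != "END_OF_BATCH"]
    if not markers or markers[0]["count"] != len(rows):
        return None  # batch incomplete: suppress it from view
    return [r["row"] for r in sorted(rows, key=lambda r: r["seq"])]
```

This gives read-side suppression of half-written batches, but unlike the two-phase commit pattern it only detects incompleteness; it does not help clean up the orphaned rows.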
>>
>> On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot 
>> wrote:
>>
>>> Been about a month since I gave up on it, bu

Re: Batch size warnings

2016-12-07 Thread Cody Yancey
There is a disconnect between write.3 and write.4, but it can only affect
performance, not consistency. The presence or absence of a row's txnUUID in
the IncompleteTransactions table is the ultimate source of truth, and rows
whose txnUUID is not null will be checked against that truth in the read
path.

And yes, it is a good point, failures with this model will accumulate and
degrade performance if you never clear out old failed transactions. The
tables we have that use this generally use TTLs so we don't really care as
long as irrecoverable transaction failures are very rare.

Thanks,
Cody

On Wed, Dec 7, 2016 at 1:56 PM, Voytek Jarnot 
wrote:

> Appreciate the long writeup Cody.
>
> Yeah, we're good with temporary inconsistency (thankfully) as well.  I'm
> going to try to ride the batch train and hope it doesn't derail - our load
> is fairly static (or, more precisely, increase in load is fairly slow and
> can be projected).
>
> Enjoyed your two-phase commit text.  Presumably one would also have some
> cleanup implementation that culls any failed updates (write.5) which could
> be identified in read.3 / read.4?  Still a disconnect possible between
> write.3 and write.4, but there's always something...
>
> We're insert-only (well, with some deletes via TTL, but anyway), so that's
> somewhat tempting, but I'd rather not prematurely optimize.  Unless, of
> course, anyone's got experience such that "batches over XXkb are definitely
> going to be a problem".
>
> Appreciate everyone's time.
> --Voytek Jarnot
>
> On Wed, Dec 7, 2016 at 11:31 AM, Cody Yancey  wrote:
>
>> Hi Voytek,
>> I think the way you are using it is definitely the canonical way.
>> Unfortunately, as you learned, there are some gotchas. We tried
>> substantially increasing the batch size and it worked for a while, until we
>> reached new scale, and we increased it again, and so forth. It works, but
>> soon you start getting write timeouts, lots of them. And the thing about
>> multi-partition batch statements is that they offer atomicity, but not
>> isolation. This means your database can temporarily be in an inconsistent
>> state while writes are propagating to the various machines.
>>
>> For our use case, we could deal with temporary inconsistency, as long as
>> it was for a strictly bounded period of time, on the order of a few
>> seconds. Unfortunately, as with all things eventually consistent, it
>> degrades to "totally inconsistent" when your database is under heavy load
>> and the time-bounds expand beyond what the application can handle. When a
>> batch write times out, it often still succeeds (eventually) but your tables
>> can be inconsistent for minutes, even while nodetool status shows all
>> nodes up and normal.
>>
>> But there is another way, that requires us to take a page from our RDBMS
>> ancestors' book: multi-phase commit.
>>
>> Similar to logged batch writes, multi-phase commit patterns typically
>> entail some write amplification cost for the benefit of stronger
>> consistency guarantees across isolatable units (in Cassandra's case,
>> *partitions*). However, multi-phase commit offers stronger guarantees
>> than batch writes, and ALL of the additional write load is completely
>> distributed as per your load-balancing policy, whereas batch writes all go
>> through one coordinator node, then get written in their entirety to the
>> batch log on two or three nodes, and then get dispersed in a distributed
>> fashion from there.
>>
>> A typical two-phase commit pattern looks like this:
>>
>> The Write Path
>>
>>1. The client code chooses a random UUID.
>>2. The client writes the UUID into the IncompleteTransactions table,
>>which only has one column, the transactionUUID.
>>3. The client makes all of the inserts involved in the transaction,
>>IN PARALLEL, with the transactionUUID duplicated in every inserted row.
>>4. The client deletes the UUID from IncompleteTransactions table.
>>5. The client makes parallel updates to all of the rows it inserted,
>>IN PARALLEL, setting the transactionUUID to null.
>>
>> The Read Path
>>
>>    1. The client reads some rows from a partition. If this particular
>>    client request can handle extraneous rows, you are done. If not, read
>>    on to step #2.
>>2. The client gathers the set of unique transactionUUIDs. In the main
>>case, they've all been deleted by step #5 in the Write Path. If not, go to
>>#3.
>>3. For remaining transactionUUIDs (which should be a very small
>>number), query the IncompleteTransactions table.
>>4. The client code culls rows where the transactionUUID existed in
>>the IncompleteTransactions table.
>>
>> This is just an example, one that is reasonably performant for
>> ledger-style non-updated inserts. For transactions involving updates to
>> possibly existing data, more effort is required, generally the client needs
>> to be smart enough to merge updates based on a timestamp, with a periodic
>> batch

Huge files in level 1 and level 0 of LeveledCompactionStrategy

2016-12-07 Thread Sotirios Delimanolis
I have a couple of SSTables that are humongous 
-rw-r--r-- 1 user group 138933736915 Dec  1 03:41 lb-29677471-big-Data.db
-rw-r--r-- 1 user group  78444316655 Dec  1 03:58 lb-29677495-big-Data.db
-rw-r--r-- 1 user group 212429252597 Dec  1 08:20 lb-29678145-big-Data.db
sstablemetadata reports that these are all in SSTable Level 0. This table is 
running with
compaction = {'sstable_size_in_mb': '200', 'tombstone_threshold': '0.25', 
'tombstone_compaction_interval': '300', 'class': 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
How could this happen?

Re: Huge files in level 1 and level 0 of LeveledCompactionStrategy

2016-12-07 Thread Harikrishnan Pillai
This can happen as part of node bootstrap, repair, or node rebuild.


From: Sotirios Delimanolis 
Sent: Wednesday, December 7, 2016 4:35:45 PM
To: User
Subject: Huge files in level 1 and level 0 of LeveledCompactionStrategy

I have a couple of SSTables that are humongous

-rw-r--r-- 1 user group 138933736915 Dec  1 03:41 lb-29677471-big-Data.db
-rw-r--r-- 1 user group  78444316655 Dec  1 03:58 lb-29677495-big-Data.db
-rw-r--r-- 1 user group 212429252597 Dec  1 08:20 lb-29678145-big-Data.db

sstablemetadata reports that these are all in SSTable Level 0. This table is 
running with

compaction = {'sstable_size_in_mb': '200', 'tombstone_threshold': '0.25', 
'tombstone_compaction_interval': '300', 'class': 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}

How could this happen?


Re: Huge files in level 1 and level 0 of LeveledCompactionStrategy

2016-12-07 Thread Sotirios Delimanolis
We haven't done any of those recently, on any nodes in this cluster. Would a 
major compaction through 'nodetool compact' cause this? (I think I may have 
done one of those.) 

On Wednesday, December 7, 2016 4:40 PM, Harikrishnan Pillai 
 wrote:
 

This can happen as part of node bootstrap, repair, or node rebuild.
From: Sotirios Delimanolis 
Sent: Wednesday, December 7, 2016 4:35:45 PM
To: User
Subject: Huge files in level 1 and level 0 of LeveledCompactionStrategy

I have a couple of SSTables that are humongous:
-rw-r--r-- 1 user group 138933736915 Dec  1 03:41 lb-29677471-big-Data.db
-rw-r--r-- 1 user group  78444316655 Dec  1 03:58 lb-29677495-big-Data.db
-rw-r--r-- 1 user group 212429252597 Dec  1 08:20 lb-29678145-big-Data.db
sstablemetadata reports that these are all in SSTable Level 0. This table is 
running with
compaction = {'sstable_size_in_mb': '200', 'tombstone_threshold': '0.25', 
'tombstone_compaction_interval': '300', 'class': 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
How could this happen?


CQL datatype for long?

2016-12-07 Thread Check Peck
What is the CQL data type I should use for long? I have to create a column
with long data type. Cassandra version is 2.0.10.

CREATE TABLE storage (
  key text,
  clientid int,
  deviceid long, // this is wrong I guess as I don't see long in CQL?
  PRIMARY KEY (topic, partition)
);

I need to have "deviceid" as a long data type, because I am getting deviceid
as a long and that's how I want to store it.


Re: CQL datatype for long?

2016-12-07 Thread Varun Barala
 use `bigint` for long.


Regards,
Varun Barala

On Thu, Dec 8, 2016 at 10:32 AM, Check Peck  wrote:

> What is the CQL data type I should use for long? I have to create a column
> with long data type. Cassandra version is 2.0.10.
>
> CREATE TABLE storage (
>   key text,
>   clientid int,
>   deviceid long, // this is wrong I guess as I don't see long in CQL?
>   PRIMARY KEY (topic, partition)
> );
>
> I need to have "deviceid" as a long data type, because I am getting deviceid
> as a long and that's how I want to store it.
>


Re: CQL datatype for long?

2016-12-07 Thread Check Peck
And then from the DataStax Java driver, I can use the following. Am I right?

To read:
row.getLong();

To write:
boundStatement.setLong()
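[Editor's note] For reference, the corrected table using bigint might look like this (a sketch: key and clientid are assumed here as the primary key, since the topic and partition columns named in the original PRIMARY KEY are not declared in the table):

```cql
CREATE TABLE storage (
  key text,
  clientid int,
  deviceid bigint,   -- CQL's 64-bit signed integer, maps to a Java long
  PRIMARY KEY (key, clientid)
);
```

On the driver side, row.getLong("deviceid") and boundStatement.setLong("deviceid", value) are the matching accessors for a bigint column.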


On Wed, Dec 7, 2016 at 6:50 PM, Varun Barala 
wrote:

>  use `bigint` for long.
>
>
> Regards,
> Varun Barala
>
> On Thu, Dec 8, 2016 at 10:32 AM, Check Peck 
> wrote:
>
>> What is the CQL data type I should use for long? I have to create a
>> column with long data type. Cassandra version is 2.0.10.
>>
>> CREATE TABLE storage (
>>   key text,
>>   clientid int,
>>   deviceid long, // this is wrong I guess as I don't see long in CQL?
>>   PRIMARY KEY (topic, partition)
>> );
>>
>> I need to have "deviceid" as a long data type, because I am getting
>> deviceid as a long and that's how I want to store it.
>>
>
>


Re: Benefit of LOCAL_SERIAL consistency

2016-12-07 Thread Hiroyuki Yamada
Hi DuyHai,

Thank you for the comments.
Yes, that's exactly what I mean.
(Your comment is very helpful to support my opinion.)

As you said, SERIAL with multi-DCs incurs a latency increase,
but it's a trade-off between latency and high availability, because one
DC can go down in a disaster.
I don't think there is any way to achieve global linearizability
without a latency increase, right ?

> Edward
Thank you for the ticket.
I'll read it through.

Thanks,
Hiro

On Thu, Dec 8, 2016 at 12:01 AM, Edward Capriolo  wrote:
>
>
> On Wed, Dec 7, 2016 at 8:25 AM, DuyHai Doan  wrote:
>>
>> The reason you don't want to use SERIAL in multi-DC clusters is the
>> prohibitive cost of lightweight transactions (in terms of latency),
>> especially if your data centers are separated by continents. A ping from
>> London to New York takes 52ms just by the speed of light in optic cable.
>> Since a LightWeight Transaction involves 4 network round-trips, it means at
>> least 200ms just for raw network transfer, not even taking into account the
>> cost of processing the operation.
>>
>> You're right to raise a warning about mixing LOCAL_SERIAL with SERIAL.
>> LOCAL_SERIAL guarantees you linearizability inside a DC, SERIAL guarantees
>> you linearizability across multiple DC.
>>
>> If I have 3 DCs with RF = 3 each (total 9 replicas) and I did an INSERT IF
>> NOT EXISTS with LOCAL_SERIAL in DC1, then it's possible that a subsequent
>> INSERT IF NOT EXISTS on the same record succeeds when using SERIAL because
>> SERIAL on 9 replicas = at least 5 replicas. Those 5 replicas which respond
>> can come from DC2 and DC3 and thus may not have applied the previous
>> INSERT yet...
>>
>> On Wed, Dec 7, 2016 at 2:14 PM, Hiroyuki Yamada 
>> wrote:
>>>
>>> Hi,
>>>
>>> I have been using lightweight transactions for several months now and
>>> wondering what is the benefit of having LOCAL_SERIAL serial consistency
>>> level.
>>>
>>> With SERIAL, it achieves global linearizability,
>>> but with LOCAL_SERIAL, it only achieves DC-local linearizability,
>>> which misses the point of linearizability, I think.
>>>
>>> So, for example,
>>> once SERIAL is used,
>>> we can't use LOCAL_SERIAL to achieve local linearizability,
>>> since data in the local DC might not be updated yet to meet quorum.
>>> And vice versa:
>>> once LOCAL_SERIAL is used,
>>> we can't use SERIAL to achieve global linearizability,
>>> since data is not globally updated yet to meet quorum.
>>>
>>> So, it would be great if we can use LOCAL_SERIAL if possible and
>>> use SERIAL only if local DC is down or unavailable,
>>> but based on the example above, I think it is not possible, is it ?
>>> So, I am not sure about what is the good use case for LOCAL_SERIAL.
>>>
>>> The only case that I can think of is having a cluster in one DC for
>>> online transactions and
>>> having another cluster in another DC for analytics purpose.
>>> In this case, I think there is no big point in using SERIAL, since data
>>> for analytics sometimes doesn't have to be very correct/fresh, and
>>> data can be asynchronously replicated to the analytics nodes. (So using
>>> LOCAL_SERIAL for one DC makes sense.)
>>>
>>> Could anyone give me some thoughts about it ?
>>>
>>> Thanks,
>>> Hiro
>>
>>
>
> You're right to raise a warning about mixing LOCAL_SERIAL with SERIAL.
> LOCAL_SERIAL guarantees you linearizability inside a DC, SERIAL guarantees
> you linearizability across multiple DC.
>
> I am not sure what the state of this is anymore, but I was under the
> impression that the linearizability of LWT was in question. I never heard it
> specifically addressed.
>
> https://issues.apache.org/jira/browse/CASSANDRA-6106
>
> It's hard to follow 6106 because most of the tasks are closed 'fix later' or
> closed 'not a problem'.