Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-12 Thread Jay Zhuang
We have a similar use case: Streamific, the Ingestion Service for Hadoop
Big Data at Uber Engineering. We had this data ingestion pipeline built on
MySQL/Schemaless before using Cassandra. For Cassandra, we used to do dual
writes to Cassandra/Kafka and are moving to CDC (as dual writes have their
own issues). Here is one of the use cases we open-sourced: Introducing
AthenaX, Uber Engineering's Open Source Streaming Analytics Platform. For
most of our use cases, we cannot put Kafka in front of Cassandra because of
the consistency requirement. We're facing the same challenges for CDC, and
here is what we currently do for de-dup and full-row updates (not perfect,
we're still working on improving it):

Deduplication: currently we de-dup the data in the Kafka consumer instead
of the producer, which means there are 3 (the RF number) copies of the data
in Kafka. We're working on de-dup with a cache as mentioned before (also in
the PPT), but we also want to make sure the downstream consumer is able to
handle duplicated data, as the cache won't de-dup 100% of the data (also,
in our case, the cache layer has a lower SLA).
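
To illustrate, a simplified sketch of the consumer-side de-dup. The topic
name, the way a mutation key is derived, and the cache size are made up for
illustration, and the cache is best-effort, so the downstream still has to
tolerate duplicates:

import java.time.Duration;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DedupConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cdc-dedup");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Bounded LRU "seen" cache; in practice this would be a shared cache
        // service with its own (lower) SLA, as mentioned above.
        Map<String, Boolean> seen = Collections.synchronizedMap(
            new LinkedHashMap<String, Boolean>(100_000, 0.75f, true) {
                @Override protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > 100_000;
                }
            });

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cassandra-cdc"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // The record key identifies the mutation (e.g. partition key + write timestamp).
                    if (seen.putIfAbsent(record.key(), Boolean.TRUE) == null) {
                        process(record.value()); // first of the RF copies wins, the rest are dropped
                    }
                }
            }
        }
    }

    static void process(String value) { /* hand off to the downstream pipeline */ }
}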

Full row update: MySQL provides the full row in the binlog. The Cassandra
commitlog only has the updated fields, but the downstream consumer has all
the historical data, so the merge can happen there: Hudi, Uber Engineering's
Incremental Processing Framework on Hadoop, which is also open-sourced.
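
Conceptually, the downstream merge boils down to overlaying the changed
columns from each mutation onto the last full row it has seen. A minimal
sketch with illustrative column names (Hudi does this far more efficiently
than a naive map merge):

import java.util.HashMap;
import java.util.Map;

public class FullRowMergeSketch {
    // Apply only the changed columns from a CDC mutation on top of the last known full row.
    static Map<String, Object> merge(Map<String, Object> lastFullRow,
                                     Map<String, Object> changedColumns) {
        Map<String, Object> merged = new HashMap<>(lastFullRow);
        merged.putAll(changedColumns); // the later mutation wins, per column
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> full = new HashMap<>();
        full.put("id", 42); full.put("name", "old"); full.put("city", "sf");

        Map<String, Object> mutation = new HashMap<>();
        mutation.put("name", "new"); // the Cassandra commitlog only carries this field

        System.out.println(merge(full, mutation)); // full row with name updated to "new"
    }
}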

Just FYI, Elasticsearch is another consumer of the Kafka topic: Databook:
Turning Big Data into Knowledge with Metadata at Uber. And we open-sourced
the data auditing system for the pipeline: Introducing Chaperone: How Uber
Engineering Audits Kafka End-to-End.
We're also exploring cache invalidation with CDC; currently the update lag
(10 seconds) is the blocking issue for that.

On Wed, Sep 12, 2018 at 2:18 AM DuyHai Doan  wrote:

> The biggest problem with getting CDC working correctly in C* is the
> deduplication issue.
>
> Having a process to read incoming mutations from the commitlog is not that
> hard; having to de-dup them across N replicas is much harder.
>
> The idea is: why don't we generate the CDC event directly on the
> coordinator side? Indeed, the coordinator is the single source of truth for
> each mutation request. As soon as the coordinator receives 1
> acknowledgement from any replica, the mutation can be considered "durable"
> and safely sent downstream to the CDC processor. This approach would
> require changing the write path on the coordinator side and may have an
> impact on performance (if writing to the CDC downstream is blocking or too slow).
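>
> A rough sketch of the idea (not actual Cassandra code; the CdcProcessor hook
> and the first-ack future are hypothetical): publishing to CDC happens off the
> write path, and only after the first ack:
>
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.Executor;
> import java.util.concurrent.Executors;
>
> class CoordinatorCdcSketch {
>     interface CdcProcessor { void publish(byte[] serializedMutation); }
>
>     private final Executor cdcExecutor = Executors.newSingleThreadExecutor();
>     private final CdcProcessor cdc;
>
>     CoordinatorCdcSketch(CdcProcessor cdc) { this.cdc = cdc; }
>
>     // firstReplicaAck completes when any one replica has acknowledged the
>     // mutation; only then is the mutation handed to the CDC processor, on a
>     // separate executor, so a slow CDC consumer cannot block the write path.
>     void onMutation(byte[] serializedMutation, CompletableFuture<Void> firstReplicaAck) {
>         firstReplicaAck.thenRunAsync(() -> cdc.publish(serializedMutation), cdcExecutor);
>     }
> }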
>
> My 2 cents
>
> On Wed, Sep 12, 2018 at 5:56 AM, Joy Gao  wrote:
>
>> Re Rahul:  "Although DSE advanced replication does one way, those are use
>> cases with limited value to me because ultimately it’s still a master slave
>> design."
>> Completely agree. I'm not familiar with Calvin protocol, but that sounds
>> interesting (reading time...).
>>
>> On Tue, Sep 11, 2018 at 8:38 PM Joy Gao  wrote:
>>
>>> Thank you all for the feedback so far.
>>>
>>> The immediate use case for us is setting up a real-time streaming data
>>> pipeline from C* to our Data Warehouse (BigQuery), where other teams can
>>> access the data for reporting/analytics/ad-hoc queries. We already do
>>> this with MySQL, where we stream the MySQL binlog via Debezium's MySQL
>>> Connector to Kafka, and then use a BigQuery Sink Connector to stream the
>>> data to BigQuery.
>>>
>>> Re Jon's comment about why not write to Kafka first? In some cases that
>>> may be ideal; but one potential concern we have with writing to Kafka first
>>> is not having "read-after-write" consistency. The data could be written to
>>> Kafka, but not yet consumed by C*. If the web service issues a (quorum)
>>> read immediately after the (quorum) write, the data that is being returned
>>> could still be outdated if the consumer has not caught up. Having the web
>>> service interact with C* directly solves this problem for us (we could add
>>> a cache before writing to Kafka, but that adds additional operational
>>> complexity to the architecture; alternatively, we could write to Kafka and
>>> C* transactionally, but distributed transactions are slow).
>>>
>>> Having the ability to stream its data to other systems could make C*
>>> more flexible and more easily integrated into a larger data ecosystem. As
>>> Dinesh has mentioned, implementing this in the database layer means there
>>> is a standard approach to getting a change notification stream (unlike
>

Re: Difficulties after nodetool removenode

2019-07-04 Thread Jay Zhuang
Hi Morten, it might be a bug. Which C* version are you using? To guarantee
consistency, it's recommended to run repair on all nodes after removeNode
(for NetworkTopologyStrategy, it could be all nodes in that specific
datacenter).

On Thu, Jul 4, 2019 at 8:30 AM Alain RODRIGUEZ  wrote:

> Hello,
>
> Just for one node, and if you have a strong consistency 'Read CL + Write
> CL > RF', you can:
>
> - force the node out with 'nodetool removenode force' if it's still around
> - run a repair (just on that node, but full repair).
> OR
> - force the node out with 'nodetool removenode force' if it's still around
> - wipe this node and replace it by itself (if you are missing a lot of
> data or are not comfortable with repairs). *If you just lost a node, this
> might not be safe.* Repair is safer if run with the right options/tool.
> OR
> - If the node is still there, you can also re-run 'nodetool removenode'.
> Data will be streamed again (to all nodes) and eventually compacted away.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> Le sam. 29 juin 2019 à 14:36, Morten A. Iversen
>  a écrit :
>
>> Hi,
>>
>>
>> We had a hardware issue with one node in a Cassandra cluster and had to
>> use the "nodetool removenode UUID" command from a different node. This
>> seems to be running fine, but one node was restarted after the "nodetool
>> removenode" command was run, and now it seems all streams going from that
>> node have stopped.
>>
>>
>> On most nodes I can see both "Receiving X files, Y bytes total. Already
>> received Z files, Q bytes total" and "Sending X files, Y bytes total.
>> Already sent Z files, Q bytes total" messages when running nodetool
>> netstats.
>>
>>
>> Nodes are starting to complete this process, but for the node that was
>> restarted after the "nodetool removenode" command I can only see the
>> "receiving" messages, and on the other nodes the progress from that node
>> seems to have stopped. Is there some way to restart the process on only the
>> node that was restarted?
>>
>>
>> Regards
>>
>> Morten Iversen
>>
>>


Re: Pluggable throttling of read and write queries

2017-02-22 Thread Jay Zhuang
Here is the Scheduler interface: 
https://github.com/apache/cassandra/blob/cassandra-3.11/conf/cassandra.yaml#L978


Seems like it could be used for this case.

It was removed in 4.x along with Thrift, not sure why:
https://github.com/apache/cassandra/commit/4881d9c308ccd6b5ca70925bf6ebedb70e7705fc


Thanks,
Jay

On 2/22/17 3:39 PM, Eric Stevens wrote:

We’ve actually had several customers where we’ve done the opposite -

split large clusters apart to separate use cases

We do something similar but for a single application.  We're
functionally sharding data to different clusters from a single
application.  We can have different server classes for different types
of workloads, we can grow and size clusters accordingly, and we also do
things like time sharding so that we can let at-rest data go to cheaper
storage options.

I agree with the general sentiment here that (at least as it stands
today) a monolithic cluster for many applications does not compete with
per-application clusters unless cost is no issue.  At our scale, the
terabytes of C* data we take in per day mean that even very small cost
savings really add up.  And even where cost is no issue, the
additional isolation and workload tailoring is still highly valuable.

On Wed, Feb 22, 2017 at 12:01 PM Edward Capriolo wrote:



On Wed, Feb 22, 2017 at 1:20 PM, Abhishek Verma wrote:

We have lots of dedicated Cassandra clusters for large use
cases, but we have a long tail of (~100) of internal customers
who want to store < 200GB of data with < 5k qps and non-critical
data. It does not make sense to create a 3 node dedicated
cluster for each of these small use cases. So we have a shared
cluster into which we onboard these users.

But once in a while, one of the customers will run an ingest job
from HDFS which will pound the shared cluster and break our SLA
for the cluster for all the other customers. Currently, I don't
see any way to signal backpressure to the ingestion jobs or
throttle their requests. Another example is one customer running a
large number of range queries, which has the same effect.

A simple way to avoid this is to throttle the read or write
requests based on some quota limits for each keyspace or user.
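
As a rough sketch of what I mean, done in an API layer in front of
Cassandra (Guava's RateLimiter assumed; the limit is just an illustrative
number, and this is not something Cassandra provides today):

import com.google.common.util.concurrent.RateLimiter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KeyspaceThrottleSketch {
    private final Map<String, RateLimiter> limiters = new ConcurrentHashMap<>();
    private final double defaultQps = 5000.0; // mirrors the ~5k qps small-tenant limit above

    // Called before every read/write is forwarded to Cassandra.
    public void beforeRequest(String keyspace) {
        RateLimiter limiter = limiters.computeIfAbsent(keyspace, k -> RateLimiter.create(defaultQps));
        if (!limiter.tryAcquire()) {
            throw new IllegalStateException("Quota exceeded for keyspace " + keyspace);
        }
    }
}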

Please see replies inlined:

On Mon, Feb 20, 2017 at 11:46 PM, vincent gromakowski wrote:

Aren't you using mesos Cassandra framework to manage your
multiple clusters ? (Seen a presentation in cass summit)

Yes we are
using https://github.com/mesosphere/dcos-cassandra-service and
contribute heavily to it. I am aware of the presentation
(https://www.youtube.com/watch?v=4Ap-1VT2ChU) at the Cassandra
summit as I was the one who gave it :)
This has helped us automate the creation and management of these
clusters.

What's wrong with your current mesos approach ?

Hardware efficiency: Spinning up dedicated clusters for each use
case wastes a lot of hardware resources. One of the approaches
we have taken is spinning up multiple Cassandra nodes belonging
to different clusters on the same physical machine. However, we
still have overhead of managing these separate multi-tenant
clusters.

I am also thinking it's better to split a large cluster into
smaller ones, unless you also manage the client layer that queries
Cassandra and can put some backpressure or rate limiting in it.

We have an internal storage API layer that some of the clients
use, but there are many customers who use the vanilla DataStax
Java or Python driver. Implementing throttling in each of those
clients does not seem like a viable approach.

On Feb 21, 2017 2:46 AM, "Edward Capriolo" wrote:

Older versions had a request scheduler api.

I am not aware of the history behind it. Can you please point me
to the JIRA tickets and/or why it was removed?

On Monday, February 20, 2017, Ben Slater
 wrote:

We’ve actually had several customers where we’ve
done the opposite - split large clusters apart to
separate use cases. We found that this allowed us
to better align hardware with use case requirements
(for example using AWS c3.2xlarge for very hot data
at low latency, m4.xlarge for more general purpose
data); we can also tune JVM settings, etc., to meet
those use cases.

There have been several instances where we have moved customers
out of the shared cluster to their own

Ask for suggestions to de-duplicate data for Cassandra CDC

2017-06-20 Thread Jay Zhuang
Hi,

For Cassandra CDC feature:
http://cassandra.apache.org/doc/latest/operating/cdc.html

The CDC data is duplicated RF number of times. Let's say the replication
factor is 3 in one DC; the same data will be sent out 3 times.
One solution is adding another DC with RF=1, which would only be used for
CDC; then it won't have duplicated data. But it's very costly to have
a DC only for that job, and if any node goes down in that DC, there will
be update lag.

Our pipeline pushes the data to Kafka and then ingests it into Hive. When
ingesting to Hive, we can keep 20 minutes of data in memory and de-dup
there. But Kafka is going to store 3x the data, which is also costly.

Does anyone have a similar problem? What's your solution? Any feedback
is welcome.

Thanks,
Jay




commitlog_total_space_in_mb tuning

2017-07-05 Thread Jay Zhuang
Hi,

commitlog_total_space_in_mb was increased from 1G to 8G in
CASSANDRA-7031. Sometimes we see the number of dropped mutations spike.
Not sure if that's a sign that we should increase
commitlog_total_space_in_mb?

For bean:
org.apache.cassandra.metrics:name=WaitingOnSegmentAllocation,type=CommitLog
Mean is 48684 microseconds
Max is 1386179 microseconds

I think it's relatively high compared to our other clusters. Does anyone
tune that parameter? Any suggestions?
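
For reference, here is roughly how those numbers can be pulled over JMX
(the host/port are placeholders; 7199 is the default Cassandra JMX port):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CommitLogWaitMetric {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName bean = new ObjectName(
                "org.apache.cassandra.metrics:name=WaitingOnSegmentAllocation,type=CommitLog");
            // The timer values are reported in microseconds, as quoted above.
            System.out.println("Mean: " + mbs.getAttribute(bean, "Mean"));
            System.out.println("Max:  " + mbs.getAttribute(bean, "Max"));
        }
    }
}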

Thanks,
Jay




Re: commitlog_total_space_in_mb tuning

2017-07-05 Thread Jay Zhuang
Thanks Jeff for the quick response.

We're running 3.0.13, which doesn't have the commitlog_segment_recycling
option, so it should be disabled.

I think the CL flush is because the commitlog is full. The commitlog size
is close to 8G:
#mbean =
org.apache.cassandra.metrics:name=TotalCommitLogSize,type=CommitLog:
Value = 8489271296;

We're running with 128G of memory and a 30G heap. Maybe it's a good idea
to increase the commitlog_total_space. On the other hand, even with 8G of
commitlog_total_space, replaying the CL after a restart takes more than 5 minutes.

In our case, the actual problem is that it's causing lots of read repair
timeouts because the repair mutations are dropped, which causes the
Cassandra JVM to hang or sometimes crash.

/Jay

On 7/5/17 2:45 PM, Jeff Jirsa wrote:
> 
> 
> On 2017-07-05 14:32 (-0700), Jay Zhuang  wrote: 
>> Hi,
>>
>> commitlog_total_space_in_mb is increased from 1G to 8G in
>> CASSANDRA-7031. Sometimes we saw the number of dropped mutations spikes.
>> Not sure if it's a sign that we should increase the
>> commitlog_total_space_in_mb?
> 
> 8G seems pretty typical. The real litmus test is in Benedict's comment:
> 
>> We can find that we force memtable flushes as a result of log utilisation 
>> instead of memtable occupancy quite often
> 
> Are your memtable flushes because the memtable is full, or because the 
> commitlog is full?
> 
> Also, what version are you running, and do you have segment recycling 
> enabled? (hopefully not: https://issues.apache.org/jira/browse/CASSANDRA-8771)
> 
> 
> 
> 




Re: Stress test

2017-07-27 Thread Jay Zhuang
The user and password should be in the -mode section, for example:

./cassandra-stress user profile=table.yaml ops\(insert=1\) -mode native cql3 user=** password=**

http://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsCStress.html

/Jay

On 7/27/17 2:46 PM, Greg Lloyd wrote:
> I am trying to use the cassandra stress tool with the user
> profile=table.yaml arguments specified and do authentication at the same
> time. If I use the user profile I get an error Invalid parameter
> user=* if I specify a user and password.
> 
> Is it not possible to specify a yaml and use authentication?




Re: Do not use Cassandra 3.11.0+ or Cassandra 3.0.12+

2017-08-28 Thread Jay Zhuang
We've been using 3.0.12+ for a few months and haven't seen an issue like
that. Do we know what could trigger the problem? Or is 3.0.x really
impacted?

Thanks,
Jay

On 8/28/17 6:02 AM, Hannu Kröger wrote:
> Hello,
> 
> Current latest Cassandra version (3.11.0, possibly also 3.0.12+) has a race
> condition that causes Cassandra to create broken sstables (stats file in
> sstables to be precise).
> 
> Bug described here:
> https://issues.apache.org/jira/browse/CASSANDRA-13752
> 
> This change might be causing it (but not sure):
> https://issues.apache.org/jira/browse/CASSANDRA-13038
> 
> Other related issues:
> https://issues.apache.org/jira/browse/CASSANDRA-13718
> https://issues.apache.org/jira/browse/CASSANDRA-13756
> 
> I would not recommend using 3.11.0 nor upgrading to 3.0.12 or higher before
> this is fixed.
> 
> Cheers,
> Hannu
> 




Is it recommended to enable debug log in production

2018-01-16 Thread Jay Zhuang
Hi,

Do you guys enable debug log in production? Is it recommended?

By default, the Cassandra log level is set to DEBUG:
https://github.com/apache/cassandra/blob/trunk/conf/logback.xml#L100

We’re using 3.0.x, which generates lots of Gossip messages:
FailureDetector.java:456 - Ignoring interval time of 2001193771 for /IP

Probably we should backport
https://github.com/apache/cassandra/commit/9ac01baef5c8f689e96307da9b29314bc0672462
Other than that, do you see any other issues?

Thanks,
Jay



Re: CDC usability and future development

2018-02-01 Thread Jay Zhuang
We did a POC to improve the CDC feature by exposing it as an interface (
https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf),
so the user doesn't have to read the commitlog directly. We deployed the
change to a test cluster and are doing more tests with production traffic;
we will send out the design proposal, POC, and test results soon.

We have the same problem getting the full row value for the CDC downstream
pipeline. We used to do a read-back; right now our CDC downstream stores all
the data (in Hive), so there is no need to read back. For the Cassandra CDC
feature, I don't think it should provide the full row value, as it is
supposed to be Change Data Capture. But it's still a problem for range
deletes, as you cannot read back deleted data. So we're proposing an option
to expand range deletes in CDC if the user really wants it.


On Wed, Jan 31, 2018 at 7:32 AM Josh McKenzie  wrote:

> CDC provides only the mutation as opposed to the full column value, which
>> tends to be of limited use for us. Applications might want to know the full
>> column value, without having to issue a read back. We also see value in
>> being able to publish the full column value both before and after the
>> update. This is especially true when deleting a column since this stream
>> may be joined with others, or consumers may require other fields to
>> properly process the delete.
>
>
> Philosophically, my first pass at the feature prioritized minimizing
> impact to node performance first and usability second, punting a lot of the
> de-duplication and RbW implications of having full column values, or
> materializing stuff off-heap for consumption from a user and flagging as
> persisted to disk etc, for future work on the feature. I don't personally
> have any time to devote to moving the feature forward now but as Jeff
> indicates, Jay and Simon are both active in the space and taking up the
> torch.
>
>
> On Tue, Jan 30, 2018 at 8:35 PM, Jeff Jirsa  wrote:
>
>> Here's a deck of some proposed additions, discussed at one of the NGCC
>> sessions last fall:
>>
>> https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf
>>
>>
>>
>> On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme  wrote:
>>
>> > Hi all,
>> >
>> > We are currently designing a system that allows our Cassandra clusters
>> to
>> > produce a stream of data updates. Naturally, we have been evaluating if
>> CDC
>> > can aid in this endeavor. We have found several challenges in using CDC
>> for
>> > this purpose.
>> >
>> > CDC provides only the mutation as opposed to the full column value,
>> which
>> > tends to be of limited use for us. Applications might want to know the
>> full
>> > column value, without having to issue a read back. We also see value in
>> > being able to publish the full column value both before and after the
>> > update. This is especially true when deleting a column since this stream
>> > may be joined with others, or consumers may require other fields to
>> > properly process the delete.
>> >
>> > Additionally, there is some difficulty with processing CDC itself such
>> as:
>> > - Updates not being immediately available (addressed by CASSANDRA-12148)
>> > - Each node providing an independent streams of updates that must be
>> > unified and deduplicated
>> >
>> > Our question is, what is the vision for CDC development? The current
>> > implementation could work for some use cases, but is a ways from a
>> general
>> > streaming solution. I understand that the nature of Cassandra makes this
>> > quite complicated, but are there any thoughts or desires on the future
>> > direction of CDC?
>> >
>> > Thanks
>> >
>> >
>>
>
>


Re: LEAK DETECTED while minor compaction

2018-02-24 Thread Jay Zhuang
Maybe it's because of https://issues.apache.org/jira/browse/CASSANDRA-12014
We had a similar issue in 3.0.14, when the SSTable had more than about 51
million keys: https://issues.apache.org/jira/browse/CASSANDRA-13785

If you upgrade to 2.2.11+, you should be able to see the real compaction
exception: https://issues.apache.org/jira/browse/CASSANDRA-13833

/Jay

> On Feb 20, 2018, at 11:01 PM, Дарья Меленцова  wrote:
> 
> Hello.
> 
> Could you help me with a LEAK DETECTED error during minor compaction?
> 
> There is a table with a lot of small records, 6.6*10^9 (mapping
> (eventId, boxId) -> cellId).
> Minor compaction starts and then fails at 99% done with an error:
> 
> Stacktrace
> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,032 Ref.java:207 - LEAK
> DETECTED: a reference
> (org.apache.cassandra.utils.concurrent.Ref$State@1ca1bf87) to class
> org.apache.cassandra.io.util.MmappedSegmentedFile$Cleanup@308695651:/storage1/cassandra_events/data/EventsKeyspace/PerBoxEventSeriesEvents-41847c3049a211e6af50b9221207cca8/tmplink-lb-102593-big-Index.db
> was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
> DETECTED: a reference
> (org.apache.cassandra.utils.concurrent.Ref$State@1659d4f7) to class
> org.apache.cassandra.utils.concurrent.WrappedSharedCloseable$1@1398495320:[Memory@[0..dc),
> Memory@[0..898)] was not released before the reference was garbage
> collected
> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
> DETECTED: a reference
> (org.apache.cassandra.utils.concurrent.Ref$State@42978833) to class
> org.apache.cassandra.utils.concurrent.WrappedSharedCloseable$1@1648504648:[[OffHeapBitSet]]
> was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
> DETECTED: a reference
> (org.apache.cassandra.utils.concurrent.Ref$State@3a64a19b) to class
> org.apache.cassandra.io.sstable.format.SSTableReader$DescriptorTypeTidy@863282967:/storage1/cassandra_events/data/EventsKeyspace/PerBoxEventSeriesEvents-41847c3049a211e6af50b9221207cca8/tmplink-lb-102593-big
> was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
> DETECTED: a reference
> (org.apache.cassandra.utils.concurrent.Ref$State@4ddc775a) to class
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Cleanup@1041709510:/storage1/cassandra_events/data/EventsKeyspace/PerBoxEventSeriesEvents-41847c3049a211e6af50b9221207cca8/tmplink-lb-102593-big-Data.db
> was not released before the reference was garbage collected
> 
> I have tried increasing the max heap size (8GB -> 16GB), but got the same error.
> How can I resolve the issue?
> 
> 
> Cassandra parameters and the problem table
> Cassandra v 2.2.9
> MAX_HEAP_SIZE="16G"
> java version "1.8.0_121"
> 
> compaction = {'min_threshold': '4', 'enabled': 'True', 'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32'}
> compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.SnappyCompressor'}
> 
> nodetool tablestats
> Read Count: 1454605
>Read Latency: 2.0174777647540054 ms.
>Write Count: 12034909
>Write Latency: 0.044917336558174224 ms.
>Pending Flushes: 0
>Table: PerBoxEventSeriesEventIds
>SSTable count: 20
>Space used (live): 885969527458
>Space used (total): 885981801994
>Space used by snapshots (total): 0
>Off heap memory used (total): 19706226232
>SSTable Compression Ratio: 0.5722091068132875
>Number of keys (estimate): 6614724684
>Memtable cell count: 375796
>Memtable data size: 31073510
>Memtable off heap memory used: 0
>Memtable switch count: 30
>Local read count: 1454605
>Local read latency: NaN ms
>Local write count: 12034909
>Local write latency: NaN ms
>Pending flushes: 0
>Bloom filter false positives: 0
>Bloom filter false ratio: 0.0
>Bloom filter space used: -4075791744
>Bloom filter off heap memory used: 17399044576
>Index summary off heap memory used: 2091833184
>Compression metadata off heap memory used: 215348472
>Compacted partition minimum bytes: 104
>Compacted partition maximum bytes: 149
>Compacted partition mean bytes: 149
>Average live cells per slice (last five minutes): NaN
>Maximum live cells per slice (last five minutes): 0
>Average tombstones per slice (last five minutes): NaN
>Maximum tombstones per slice (last five minutes): 0
> 
> Thank You
> Dary

Re: Question about how sorted runs are picked for compaction with STCS

2021-11-23 Thread Jay Zhuang
Hi Mark,

Non-adjacent files can be merged. For Cassandra, a point query anyway needs
to consult all sorted runs, since their timestamps are not strictly ordered.
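
Here's a rough sketch (mine, simplified from getBuckets) of why that is:
bucketing only looks at sstable size, so with your example s1 and s3 land in
the same bucket while s2 does not. The 0.5/1.5 factors are the default
bucket_low/bucket_high options; the min_sstable_size special case is left out:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StcsBucketSketch {
    // Group sstable sizes into buckets: a size joins a bucket if it is within
    // [avg * low, avg * high] of that bucket's current average size.
    static List<List<Long>> buckets(List<Long> sizes, double low, double high) {
        List<List<Long>> buckets = new ArrayList<>();
        List<Long> sorted = new ArrayList<>(sizes);
        sorted.sort(Comparator.naturalOrder());
        for (long size : sorted) {
            boolean placed = false;
            for (List<Long> bucket : buckets) {
                double avg = bucket.stream().mapToLong(Long::longValue).average().orElse(0);
                if (size >= avg * low && size <= avg * high) {
                    bucket.add(size);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Long> newBucket = new ArrayList<>();
                newBucket.add(size);
                buckets.add(newBucket);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // s1=100, s2=10, s3=100 from your example: prints [[10], [100, 100]],
        // i.e. s1 and s3 are bucketed together even though s2 sits between them in time.
        System.out.println(buckets(List.of(100L, 10L, 100L), 0.5, 1.5));
    }
}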

Thanks,
Jay

On Tue, Nov 23, 2021 at 9:42 AM MARK CALLAGHAN  wrote:

> I am trying to understand how sorted runs are picked for compaction with
> STCS and the docs here don't have enough detail:
>
> https://cassandra.apache.org/doc/4.0/cassandra/operating/compaction/stcs.html
>
> These blog posts have more detail but I still have a question:
> * http://distributeddatastore.blogspot.com/2021/06/cassandra-compaction.html
> * https://shrikantbang.wordpress.com/2014/04/22/size-tiered-compaction-strategy-in-apache-cassandra
>
> And the blog posts match what I see in the STCS source, especially
> getBuckets:
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java#L251
>
> My question is whether non-adjacent sorted runs can be merged. To be
> precise assume there are 3 sorted runs, described by 5 attributes (min-key,
> max-key, min-ts, max-ts, size) where min-ts and max-ts are the min and max
> commit timestamps for mutations in that sorted run, min-key and max-key are
> the min and max key for mutations in that sorted run, size is the size of
> the sorted run. The sorted runs are:
> * s1 - min-key=A, max-key=D, min-ts=1, max-ts=99, size=100
> * s2 - min-key=A, max-key=D, min-ts=100, max-ts=199, size=10
> * s3 - min-key=A, max-key=D, min-ts=200, max-ts=299, size=100
>
> By "adjacent" I mean adjacent when sorted runs are ordered by commit
> timestamps. From the example above s1 & s2 are adjacent, s2 & s3 are
> adjacent but s1 and s3 are not adjacent.
>
> From what I saw in the STCS source code, s1 & s3 can end up in the same
> bucket without s2 because s2 is much smaller. If s1 & s3 were merged then
> the result might be:
> * s1+s3: min-key=A, max-key=D, min-ts=1, max-ts=299, size likely >= 100
> * s2 - min-key=A, max-key=D, min-ts=100, max-ts=199, size=10
>
> A side-effect from merging non-adjacent sorted runs can be that on a point
> query, something must be fetched from all sorted runs to determine the
> value to return. This differs from what happens with RocksDB where the heap
> code can read from the iterators in (commit timestamp) order and stop once
> it gets a key (with a value or a tombstone).
>
> --
> Mark Callaghan
> mdcal...@gmail.com
>