Re: A Roadmap to Cassandra Analytics 1.0

Jon Haddad Tue, 22 Apr 2025 14:01:45 -0700

Thanks Doug!!

Supporting vnodes is huge and I'm excited to see it being worked on.  My
experience has been that most environments use them, and single-token
clusters are outliers, usually run by folks who have been part of the
project for a long time, often committers.  I'm looking forward to
being able to recommend our analytics library to the teams I work with who
fall outside this category.


I think the thing that'll be a priority (for me at least) will be ensuring
the bulk writer works well with UCS.  Since UCS can (and imo, should)
replace both STCS and LCS in _all_ use cases, proper UCS support seems like
it would be a fairly high priority after the items you've listed.  I didn't
see it listed under C* 5.0 support, maybe I missed it.

Since UCS calculates the level rather than explicitly tracking it in
metadata [1], I *think* it should work if you respect the
target_sstable_size parameter.  I'm guessing that's already there for
writing to LCS, so hopefully it's easy enough to get into 1.0.
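
To make the "calculates the level" point concrete, here is a minimal
sketch.  It assumes, as I understand the UCS docs, that density is the
sstable size divided by the fraction of the token range it spans, and that
the level is floor(log_fanout(density / min size)), or 0 below the min
size.  The function names and numbers are made up for illustration; this
is not the Cassandra code referenced in [1].

```python
MIB = 1024 * 1024


def density(on_disk_bytes: float, span_fraction: float) -> float:
    """Size divided by the fraction of the token range the sstable spans.

    Hypothetical helper; span_fraction is 1.0 for an sstable covering the
    whole range owned by the node.
    """
    if not 0.0 < span_fraction <= 1.0:
        raise ValueError("span_fraction must be in (0, 1]")
    return on_disk_bytes / span_fraction


def level_for(density_bytes: float, min_size_bytes: float, fanout: int) -> int:
    """Derive the level from density alone; nothing is read from metadata.

    Equivalent to floor(log_fanout(density / min_size)), clamped at 0.
    A multiply loop is used instead of math.log to avoid float rounding
    at exact powers of the fanout.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = 0
    threshold = min_size_bytes * fanout
    while density_bytes >= threshold:
        level += 1
        threshold *= fanout
    return level


# Assumed values for illustration only: 100 MiB min size, fanout of 4.
full_span = level_for(density(400 * MIB, 1.0), 100 * MIB, 4)
quarter_span = level_for(density(100 * MIB, 0.25), 100 * MIB, 4)
print(full_span, quarter_span)
```

Under these assumptions, a 100 MiB sstable covering a quarter of the range
lands on the same level as a 400 MiB sstable covering all of it, which is
why the size of what the bulk writer produces matters more than any level
tag.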

Once vnode support is in I'll be happy to fire up a cluster with
easy-cass-lab and give it a test run, and add whatever support makes it
easy to test Spark jobs.

Jon

[1] See
org.apache.cassandra.db.compaction.UnifiedCompactionStrategy#chooseCompactionPick,
org.apache.cassandra.db.compaction.UnifiedCompactionStrategy#formLevels and
org.apache.cassandra.db.compaction.ShardManager#density, and
org.apache.cassandra.db.compaction.ShardManager#rangeSpanned(org.apache.cassandra.io.sstable.format.SSTableReader)


On Tue, Apr 22, 2025 at 10:53 AM Doug Rohrer <[email protected]> wrote:

> Hello folks,
>
> As many of you on the ASF Slack may have noticed, I’ve been creating a
> bunch of new tickets for the Cassandra Analytics project related to a 1.0
> release. Since it was initially contributed, there have been many
> enhancements and fixes to the library, but there are still some gaps that
> need to be addressed. We’re putting together a plan to close those gaps,
> and would love to enlist more folks from the community in making the
> analytics library more useful. The gaps we see today include:
>
>    - vnode support (and optimizations to the existing code if necessary to
>    make it work more efficiently with clusters using vnodes) (
>    CASSANALYTICS-10
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-10>)
>    - Cassandra 5.0 support (this is an epic with lots of subtasks, some
>    of which are already being worked on by a variety of folks) (
>    CASSANALYTICS-23
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-23>)
>    - Documentation, including both docs on cassandra.apache.org and
>    updated/improved developer docs in the repository itself (
>    CASSANALYTICS-6 <https://issues.apache.org/jira/browse/CASSANALYTICS-6>
>    )
>    - Build scripts for release (CASSANALYTICS-22
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-22>)
>    - Miscellaneous bug fixes of known issues/improvements
>       - Analytics writer should support all valid partition/clustering
>       key types (CASSANALYTICS-35
>       <https://issues.apache.org/jira/browse/CASSANALYTICS-35>)
>       - CassandraDataLayer uses configuration list of IPs instead of the
>       full ring/datacenter (CASSANALYTICS-20
>       <https://issues.apache.org/jira/browse/CASSANALYTICS-20>)
>       - Bulk Reader should dynamically calculate number of cores to use
>       to better utilize resources for smaller tables (CASSANALYTICS-36
>       <https://issues.apache.org/jira/browse/CASSANALYTICS-36>)
>
>
> Beyond 1.0, there are a lot of improvements and enhancements on the roadmap
> to date:
>
>    - Cassandra 6.0 Support (CASSANALYTICS-37
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-37>)
>    - Spark 4.0 support (CASSANALYTICS-34
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-34>)
>    - JDK Support Matrix (CASSANALYTICS-38
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-38>)
>    - Improved Compaction/Repair load for bulk writes (CASSANALYTICS-39
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-39>)
>    - Bandwidth reduction (especially cross-dc writes) (CASSANALYTICS-40
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-40>)
>    - Consolidation of SBW-on-S3 and DIRECT mode code (CASSANALYTICS-41
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-41>)
>    - Bulk reads via S3 (CASSANALYTICS-42
>    <https://issues.apache.org/jira/browse/CASSANALYTICS-42>)
>
>
> We’re also looking for input on what others think should be in the 1.0
> release, or the long-term roadmap. If you’ve got ideas, don’t hesitate to
> respond to this thread. I’ll also be checking the existing JIRAs and making
> sure they are incorporated into the plan, which I believe most are already.
>
> I want to thank the folks who have, so far, contributed most of the code
> for the Analytics library, and those in the community who have already
> started to use and improve it. We’re looking forward to getting more
> community members involved. If any of these items sounds interesting,
> please feel free to reach out to folks on Slack or reply on the dev list.
>
> Thanks,
>
> Doug Rohrer
>
