Hello folks,

As many of you on the ASF Slack may have noticed, I’ve been creating a bunch of new tickets for the Cassandra Analytics project related to a 1.0 release. Since the library was initially contributed, there have been many enhancements and fixes, but there are still some gaps that need to be addressed. We’re putting together a plan to close those gaps, and would love to enlist more folks from the community in making the analytics library more useful.

The gaps we see today include:

- vnode support (and optimizations to the existing code if necessary to make it work more efficiently with clusters using vnodes) (CASSANALYTICS-10 <https://issues.apache.org/jira/browse/CASSANALYTICS-10>)
- Cassandra 5.0 support (this is an epic with lots of subtasks, some of which are already being worked on by a variety of folks) (CASSANALYTICS-23 <https://issues.apache.org/jira/browse/CASSANALYTICS-23>)
- Documentation, including both docs on cassandra.apache.org <http://cassandra.apache.org/> and updated/improved developer docs in the repository itself (CASSANALYTICS-6 <https://issues.apache.org/jira/browse/CASSANALYTICS-6>)
- Build scripts for release (CASSANALYTICS-22 <https://issues.apache.org/jira/browse/CASSANALYTICS-22>)
- Miscellaneous bug fixes of known issues/improvements
  - Analytics writer should support all valid partition/clustering key types (CASSANALYTICS-35 <https://issues.apache.org/jira/browse/CASSANALYTICS-35>)
  - CassandraDataLayer uses configuration list of IPs instead of the full ring/datacenter (CASSANALYTICS-20 <https://issues.apache.org/jira/browse/CASSANALYTICS-20>)
  - Bulk Reader should dynamically calculate number of cores to use to better utilize resources for smaller tables (CASSANALYTICS-36 <https://issues.apache.org/jira/browse/CASSANALYTICS-36>)

Beyond 1.0, there are a number of improvements and enhancements on the roadmap to date:

- Cassandra 6.0 support (CASSANALYTICS-37 <https://issues.apache.org/jira/browse/CASSANALYTICS-37>)
- Spark 4.0 support (CASSANALYTICS-34 <https://issues.apache.org/jira/browse/CASSANALYTICS-34>)
- JDK support matrix (CASSANALYTICS-38 <https://issues.apache.org/jira/browse/CASSANALYTICS-38>)
- Improved compaction/repair load for bulk writes (CASSANALYTICS-39 <https://issues.apache.org/jira/browse/CASSANALYTICS-39>)
- Bandwidth reduction (especially cross-DC writes) (CASSANALYTICS-40 <https://issues.apache.org/jira/browse/CASSANALYTICS-40>)
- Consolidation of SBW-on-S3 and DIRECT mode code (CASSANALYTICS-41 <https://issues.apache.org/jira/browse/CASSANALYTICS-41>)
- Bulk reads via S3 (CASSANALYTICS-42 <https://issues.apache.org/jira/browse/CASSANALYTICS-42>)

We’re also looking for input on what others think should be in the 1.0 release or on the long-term roadmap. If you’ve got ideas, don’t hesitate to respond to this thread. I’ll also be checking the existing JIRAs and making sure they are incorporated into the plan; I believe most already are.

I want to thank the folks who have, so far, contributed most of the code for the Analytics library, and those in the community who have already started to use and improve it. We’re looking forward to getting more community members involved. If any of these items sounds interesting, please feel free to reach out to folks on Slack or reply on the dev list.

Thanks,

Doug Rohrer