Hello All, I have a Cassandra cluster with a table similar to the following:
```
CREATE TABLE table (
    pk1 text,
    pk2 int,
    time timestamp,
    ...
    probability list<double>,
    PRIMARY KEY ((pk1, pk2), time)
) WITH CLUSTERING ORDER BY (time DESC)
```

Python processes write to this table using the DataStax Python Cassandra driver package. I am occasionally seeing rows written to the table where the "probability" column list is the same list, duplicated and concatenated, e.g.

    probability
    ---------------
    [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859, 3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]

The code that writes to Cassandra uses "INSERT" statements and validates that "probability" lists must always approximately sum to 1.0, so it does not seem possible that the Python code that writes to Cassandra has a bug which is generating this data. The code may occasionally write to the same row multiple times.

It appears that there may be a bug in either Cassandra or the Python driver package which results in this list column being written to and appended to with the same data. Similar invalid data was also generated by a PySpark data migration script (using the DataStax Spark Cassandra connector) that copied this list data to a new table.

Here are the versions of libraries we are using:

- Cassandra version 3.6
- Spark version 1.6.0-hadoop2.6
- Python Cassandra driver 3.7.1 (https://github.com/datastax/python-driver)

Any help/insight into this problem would be greatly appreciated.

Regards,
Nathan
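
P.S. In case it helps anyone check their own data, here is a rough sketch of the kind of check described above, plus a helper for spotting the doubled lists. This is an illustration only, not the production code: the function names `is_valid_probability` and `looks_doubled` and the `1e-6` tolerance are made up for this example.

```python
import math


def is_valid_probability(probs, tol=1e-6):
    """Return True if the list is non-empty and sums to approximately 1.0.

    The tolerance is an assumed value chosen for illustration.
    """
    return len(probs) > 0 and math.isclose(math.fsum(probs), 1.0, abs_tol=tol)


def looks_doubled(values):
    """Return True if the list is some list concatenated with itself once."""
    half, remainder = divmod(len(values), 2)
    return half > 0 and remainder == 0 and values[:half] == values[half:]


# The first ten values of the corrupted row shown in the post.
sample = [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22,
          1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]

# The corrupted value as it appears in the table: the same list twice.
corrupted = sample + sample
```

With a check like this in front of every INSERT, the ten-element list passes and the twenty-element list is rejected (it sums to roughly 2.0), which is why the duplication looks like it happens after the data leaves the application. `looks_doubled` could be run over rows read back from the table to count how many are affected.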