Hello All,

I have a Cassandra cluster with a table similar to the following:

```
CREATE TABLE table (
    pk1 text,
    pk2 int,
    time timestamp,
    ...
    probability list<double>,
    PRIMARY KEY ((pk1, pk2), time)
) WITH CLUSTERING ORDER BY (time DESC);
```

Python processes write to this table using the DataStax Python driver for
Cassandra. Occasionally I see rows in which the "probability" column
contains the same list twice, concatenated.

e.g.

probability
---------------
[3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
 3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
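
Rows of this shape can be recognised mechanically. The following is a
minimal sketch, assuming the column value has already been read into a
Python list; the helper name `is_doubled` is hypothetical and not part of
any driver API.

```python
def is_doubled(values):
    """Return True if `values` is a non-empty list followed by an exact copy of itself."""
    n = len(values)
    if n == 0 or n % 2 != 0:
        return False
    half = n // 2
    return values[:half] == values[half:]


# The ten distinct values from the example row above.
expected = [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22,
            1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
stored = expected + expected  # the shape seen in the bad rows

print(is_doubled(stored))
print(is_doubled(expected))
```

Note that a legitimately repetitive list such as [0.5, 0.5] would also
match, so a check like this is only a first filter.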

The code that writes to Cassandra uses "INSERT" statements and validates
that each "probability" list sums to approximately 1.0, so it seems
unlikely that a bug in the Python code is generating this data. The code
may occasionally write to the same row multiple times.
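
The check described above is roughly of the following form. This is a
sketch only: the function name `validate_probability`, the tolerance, and
the use of `math.isclose` are assumptions for illustration, not the
production code.

```python
import math


def validate_probability(probs, tol=1e-6):
    """Raise ValueError unless `probs` sums to approximately 1.0 (assumed tolerance)."""
    total = math.fsum(probs)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError("probability list sums to %r, expected ~1.0" % total)
    return probs


# The ten distinct values from the example row.
good = [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22,
        1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]

validate_probability(good)  # accepted: no exception raised

try:
    validate_probability(good + good)  # the duplicated shape seen in the table
except ValueError as exc:
    print("rejected:", exc)
```

A concatenated copy of a valid list sums to roughly 2.0, so a check of
this kind would reject it before the INSERT is issued.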

It appears that there may be a bug in either Cassandra or the Python driver
that causes the same list to be written to this column and then appended
to it again.

Similar invalid data was also generated by a PySpark data migration script
(using the DataStax Spark Cassandra Connector) that copied this list data
to a new table.

Here are the versions of libraries we are using:

Cassandra version 3.6
Spark version 1.6.0-hadoop2.6
Python Cassandra driver 3.7.1
(https://github.com/datastax/python-driver)

Any help/insight into this problem would be greatly appreciated.

Regards,

Nathan
