What query are you using to fetch the rows? Can you share pk1, pk2, and time for the sample rows you pasted?
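It may also help to flag the affected rows on the client side once you have fetched them. Below is a minimal sketch in plain Python, assuming the "probability" column arrives as an ordinary list of floats. The helper names `looks_doubled` and `sums_to_one` and the tolerance value are made up for illustration and are not part of the driver.

```python
import math


def looks_doubled(values):
    """Return True if `values` is some list concatenated with itself."""
    n = len(values)
    if n == 0 or n % 2 != 0:
        return False
    half = n // 2
    return values[:half] == values[half:]


def sums_to_one(values, tol=1e-6):
    """Return True if the values sum to approximately 1.0 (assumed tolerance)."""
    return math.isclose(sum(values), 1.0, abs_tol=tol)


# The list from the sample row in the quoted message, before duplication.
sample_half = [
    3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22,
    1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
]
# The list as it was read back from the table.
sample = sample_half + sample_half

for name, values in (("expected", sample_half), ("as read back", sample)):
    print(name, "doubled:", looks_doubled(values), "sums to 1:", sums_to_one(values))
```

A valid list such as [0.5, 0.5] would also count as "doubled", so the two checks are best used together: a row is suspect only if the list is doubled and no longer sums to about 1.0.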
On 17-Aug-2017 2:20 AM, "Nathan McLean" <nmcl...@kinsolresearch.com> wrote:
> Hello All,
>
> I have a Cassandra cluster with a table similar to the following:
>
> ```
> CREATE TABLE table (
>     pk1 text,
>     pk2 int,
>     time timestamp,
>     ...
>     probability list<double>,
>     PRIMARY KEY ((pk1, pk2), time)
> ) WITH CLUSTERING ORDER BY (time DESC)
> ```
>
> Python processes write to this table using the DataStax python Cassandra
> driver package. I am occasionally seeing rows written to the table where
> the "probability" column list is the same list, duplicated and concatenated.
>
> e.g.
>
> probability
> ---------------
> [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
> 3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
>
> The code that writes to Cassandra uses "INSERT" statements and validates
> that "probability" lists must always approximately sum to 1.0, so it does
> not seem possible that the python code that writes to Cassandra has a bug
> which is generating this data. The code may occasionally write to the same
> row multiple times.
>
> It appears that there may be a bug in either Cassandra or the python
> driver package which results in this list column being written to and
> appended to with the same data.
>
> Similar invalid data was also generated by a PySpark data migration script
> (using the DataStax spark Cassandra connector) that copied this list data
> to a new table.
>
> Here are the versions of libraries we are using:
>
> Cassandra version 3.6
> Spark version 1.6.0-hadoop2.6
> Python Cassandra driver 3.7.1
> (https://github.com/datastax/python-driver)
>
> Any help/insight into this problem would be greatly appreciated.
>
> Regards,
>
> Nathan