@Sagar, a query to fetch the data looks like this (primary key values included):

    SELECT * FROM table WHERE pk1='2269202-onstreet_high' AND pk2=2017 AND time='2017-07-18 03:15:00+0000';

(In practice, the queries in our code select a range of time values.)

@Cristophe, I haven't actually been able to reproduce this problem while testing. Rows like the example I gave just seem to show up very occasionally in our production data.

On Wed, Aug 16, 2017 at 9:11 PM, Sagar Jambhulkar <[email protected]> wrote:

> What is your query to fetch rows. Can you share P1,pk2,time for the sample
> rows you pasted?
>
> On 17-Aug-2017 2:20 AM, "Nathan McLean" <[email protected]>
> wrote:
>
>> Hello All,
>>
>> I have a Cassandra cluster with a table similar to the following:
>>
>> ```
>> CREATE TABLE table (
>>     pk1 text,
>>     pk2 int,
>>     time timestamp,
>>     ...
>>     probability list<double>,
>>     PRIMARY KEY ((pk1, pk2), time)
>> ) WITH CLUSTERING ORDER BY (time DESC)
>> ```
>>
>> Python processes write to this table using the DataStax python Cassandra
>> driver package. I am occasionally seeing rows written to the table where
>> the "probability" column list is the same list, duplicated and concatenated.
>>
>> e.g.
>>
>>  probability
>> ---------------
>> [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
>> 3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
>>
>> The code that writes to Cassandra uses "INSERT" statements and validates
>> that "probability" lists must always approximately sum to 1.0, so it does
>> not seem possible that the python code that writes to Cassandra has a bug
>> which is generating this data. The code may occasionally write to the same
>> row multiple times.
>>
>> It appears that there may be a bug in either Cassandra or the python
>> driver package which results in this list column being written to and
>> appended to with the same data.
>>
>> Similar invalid data was also generated by a PySpark data migration
>> script (using the DataStax spark Cassandra connector) that copied this list
>> data to a new table.
>>
>> Here are the versions of libraries we are using:
>>
>> Cassandra version 3.6
>> Spark version 1.6.0-hadoop2.6
>> Python Cassandra driver 3.7.1
>> (https://github.com/datastax/python-driver)
>>
>> Any help/insight into this problem would be greatly appreciated.
>>
>> Regards,
>>
>> Nathan
>>
>
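
For anyone who wants to check rows they have already read for the symptom described in the quoted message, here is a minimal sketch of a check in plain Python. It is not part of our production code: the function name `looks_duplicated` and the `tol` parameter are made up for illustration, and it assumes a valid "probability" list sums to approximately 1.0, as stated in the quoted message. It only detects the exact doubling shown in the example (the list followed by a copy of itself), not other kinds of corruption.

```python
import math


def looks_duplicated(probability, tol=1e-3):
    """Return True if the list is its first half repeated and sums to ~2.0.

    Hypothetical helper: assumes a valid list sums to approximately 1.0,
    so a list concatenated with itself sums to approximately 2.0.
    """
    n = len(probability)
    if n == 0 or n % 2 != 0:
        # An empty or odd-length list cannot be two identical halves.
        return False
    half = n // 2
    first, second = probability[:half], probability[half:]
    # Compare the two halves element by element.
    halves_match = all(
        math.isclose(a, b, rel_tol=1e-9, abs_tol=0.0)
        for a, b in zip(first, second)
    )
    # Requiring a sum near 2.0 avoids flagging valid lists such as [0.5, 0.5].
    return halves_match and math.isclose(sum(probability), 2.0, abs_tol=tol)
```

The sum check is there because a valid list can legitimately consist of two identical halves; without it, something like `[0.5, 0.5]` would be reported as corrupted.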
