For the example you provided, are you saying you are getting two rows for the same pk1, pk2, time? It may be a problem with your inserts if you are inserting multiple distinct rows. Alternatively, to validate that all nodes are in sync, try fetching with CONSISTENCY ALL in cqlsh.
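For example, in cqlsh (the keyspace name ks is my assumption; substitute your own, and note "table" must be double-quoted because it is a reserved word in CQL):

```
cqlsh> CONSISTENCY ALL;
Consistency level set to ALL.
cqlsh> SELECT pk1, pk2, time, probability FROM ks."table"
   ... WHERE pk1 = '2269202-onstreet_high' AND pk2 = 2017
   ... AND time = '2017-07-18 03:15:00+0000';
```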
On 18-Aug-2017 9:37 PM, "Nathan McLean" <nmcl...@kinsolresearch.com> wrote:

> @Sagar,
>
> A query to get the data looks like this (primary key values included in
> the query):
>
> SELECT * FROM table WHERE pk1='2269202-onstreet_high' AND pk2=2017 AND
> time='2017-07-18 03:15:00+0000';
>
> (In actual practice, the queries in our code would query a range of time
> values.)
>
> @Cristophe,
>
> I actually haven't been able to reproduce this problem while testing. Rows
> like the example I gave just seem to show up very occasionally in our
> production data.
>
> On Wed, Aug 16, 2017 at 9:11 PM, Sagar Jambhulkar <
> sagar.jambhul...@gmail.com> wrote:
>
>> What is your query to fetch the rows? Can you share pk1, pk2, time for
>> the sample rows you pasted?
>>
>> On 17-Aug-2017 2:20 AM, "Nathan McLean" <nmcl...@kinsolresearch.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I have a Cassandra cluster with a table similar to the following:
>>>
>>> ```
>>> CREATE TABLE table (
>>>     pk1 text,
>>>     pk2 int,
>>>     time timestamp,
>>>     ...
>>>     probability list<double>,
>>>     PRIMARY KEY ((pk1, pk2), time)
>>> ) WITH CLUSTERING ORDER BY (time DESC)
>>> ```
>>>
>>> Python processes write to this table using the DataStax Python Cassandra
>>> driver package. I am occasionally seeing rows written to the table where
>>> the "probability" column list is the same list, duplicated and
>>> concatenated, e.g.
>>>
>>> probability
>>> ---------------
>>> [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
>>> 3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
>>> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
>>>
>>> The code that writes to Cassandra uses "INSERT" statements and validates
>>> that "probability" lists always approximately sum to 1.0, so it does not
>>> seem possible that the Python code that writes to Cassandra has a bug
>>> which generates this data. The code may occasionally write to the same
>>> row multiple times.
>>>
>>> It appears that there may be a bug in either Cassandra or the Python
>>> driver package which results in this list column being written and then
>>> appended to with the same data.
>>>
>>> Similar invalid data was also generated by a PySpark data migration
>>> script (using the DataStax Spark Cassandra connector) that copied this
>>> list data to a new table.
>>>
>>> Here are the versions of the libraries we are using:
>>>
>>> Cassandra 3.6
>>> Spark 1.6.0-hadoop2.6
>>> Python Cassandra driver 3.7.1
>>> (https://github.com/datastax/python-driver)
>>>
>>> Any help/insight into this problem would be greatly appreciated.
>>>
>>> Regards,
>>>
>>> Nathan
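For reference, here is a minimal sketch of the write path you describe, using the DataStax Python driver, followed by a read-back at CONSISTENCY ALL. The contact point, keyspace, and helper names are my assumptions, not taken from your code:

```
from datetime import datetime

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Contact point and keyspace are assumptions; adjust to your cluster.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('ks')

# "table" is the placeholder name from the schema above; it is a reserved
# word in CQL, hence the double quotes.
insert = session.prepare(
    'INSERT INTO "table" (pk1, pk2, time, probability) VALUES (?, ?, ?, ?)')

def write_row(pk1, pk2, time, probability):
    # The validation described above: the list must sum to ~1.0 on write.
    assert abs(sum(probability) - 1.0) < 1e-6
    session.execute(insert, (pk1, pk2, time, probability))

# Read back at CONSISTENCY ALL so every replica must agree on the answer.
check = SimpleStatement(
    'SELECT probability FROM "table" '
    'WHERE pk1 = %s AND pk2 = %s AND time = %s',
    consistency_level=ConsistencyLevel.ALL)
rows = session.execute(
    check, ('2269202-onstreet_high', 2017, datetime(2017, 7, 18, 3, 15)))

cluster.shutdown()
```

If the read at CONSISTENCY ALL returns a correct (non-duplicated) list while reads at lower consistency return the doubled one, that would point at replicas being out of sync rather than at the write itself.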