Re: Cassandra Writes Duplicated/Concatenated List Data

Sagar Jambhulkar Wed, 16 Aug 2017 21:11:27 -0700

What query are you using to fetch the rows? Can you share pk1, pk2, and time
for the sample rows you pasted?
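
Once the rows are fetched, a check along these lines could flag lists that
are an exact repetition of a shorter list. This is only a sketch under stated
assumptions: the helper name `find_duplicated_list` and the tolerance are
hypothetical, not part of the driver or of the original code. It assumes, as
described in the quoted message, that a valid "probability" list sums to
approximately 1.0, so a list such as [0.5, 0.5] is not reported.

```python
import math


def find_duplicated_list(values, tol=1e-6):
    """Return the repeated sub-list if `values` is that sub-list concatenated
    two or more times and the sub-list itself sums to ~1.0; otherwise None.

    Hypothetical helper for illustration only.
    """
    values = list(values)
    n = len(values)
    # Try every candidate length that divides the list evenly.
    for size in range(1, n // 2 + 1):
        if n % size != 0:
            continue
        prefix = values[:size]
        # Exact comparison: a duplicated write repeats identical doubles.
        if values == prefix * (n // size):
            # Only report it when the prefix alone is a valid distribution.
            if math.isclose(math.fsum(prefix), 1.0, abs_tol=tol):
                return prefix
    return None


# The sample list from the quoted message (ten values, written twice).
single = [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22,
          1.0165e-17, 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
stored = single + single

print(find_duplicated_list(stored) == single)  # True
print(find_duplicated_list(single))            # None
```

Running this over the partition returned by the SELECT would show which
(pk1, pk2, time) rows are affected.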


On 17-Aug-2017 2:20 AM, "Nathan McLean" <nmcl...@kinsolresearch.com> wrote:

> Hello All,
>
> I have a Cassandra cluster with a table similar to the following:
>
> ```
> CREATE TABLE table (
>     pk1 text,
>     pk2 int,
>     time timestamp,
>     ...
>     probability list<double>,
>     PRIMARY KEY ((pk1, pk2), time)
> ) WITH CLUSTERING ORDER BY (time DESC)
> ```
>
> Python processes write to this table using the DataStax Python driver for
> Cassandra. I am occasionally seeing rows written to the table where the
> "probability" column contains the same list duplicated and concatenated.
>
> e.g.
>
> probability
> ---------------
> [3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859,
>  3.0951e-43, 1.695e-37, 2.7641e-32, 2.8028e-27, 1.9887e-22, 1.0165e-17,
> 3.7058e-13, 9.2127e-09, 0.000141, 0.999859]
>
> The code that writes to Cassandra uses "INSERT" statements and validates
> that each "probability" list sums to approximately 1.0, so it does not seem
> possible that a bug in the Python code is generating this data. The code
> may occasionally write to the same row multiple times.
>
> It appears that there may be a bug in either Cassandra or the Python
> driver package that causes the same data to be appended to this list
> column instead of replacing it.
>
> Similar invalid data was also generated by a PySpark data migration script
> (using the DataStax Spark Cassandra Connector) that copied this list data
> to a new table.
>
> Here are the versions of libraries we are using:
>
> Cassandra version 3.6
> Spark version 1.6.0-hadoop2.6
> Python Cassandra driver 3.7.1
> (https://github.com/datastax/python-driver)
>
> Any help/insight into this problem would be greatly appreciated.
>
> Regards,
>
> Nathan
>
