COPY FROM ordering

Bogdan Gherca Tue, 11 Feb 2020 04:18:18 -0800

Hey Cassandra folks,

I'm trying to change the schema of an existing table by creating a new one
and migrating the data.


The initial table schema looks like this:

    CREATE TABLE IF NOT EXISTS initial_table (
        user_id                 text,
        message_id              timeuuid,
        interaction_state       text,
        interaction_timestamp   timestamp,
        PRIMARY KEY ((user_id), message_id, interaction_state, interaction_timestamp)
    );

We're trying to remove interaction_timestamp from the primary key: the new
table has the same schema but with
*PRIMARY KEY ((user_id), message_id, interaction_state)*.

When importing the .csv dump obtained from *initial_table*, the timestamp
column seems to be written inconsistently. Multiple rows from the old schema
need to be merged into a single row of the new schema. In most cases the
last entry gets copied over to the new table, while in others a seemingly
random one gets copied. See the csv sample and the COPY FROM result below.

*initial_table_dump.csv*

    123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000
    123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-10 00:05:41+0000

*copy new_table(user_id, message_id, interaction_state,
interaction_timestamp) from '~/initial_table_dump.csv';*

*Result:*

     user_id | message_id                           | interaction_state | interaction_timestamp
    ---------+--------------------------------------+-------------------+--------------------------
         123 | ed6c69a0-0add-11b2-8080-808080808080 |         DISMISSED | 2020-01-03 17:50:59+0000

Notice that in this case the first row from the csv gets written into the
new table. Here there are only two rows, but with more rows a seemingly
random one is copied over, not necessarily the first or the last. When the
interaction_timestamp value of the second row is changed as below, the
latest entry is copied to the new table.

*initial_table_dump_2.csv*

    123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000
    123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-05 00:05:41+0000

*- perform same copy from operation - *

*Result:*

     user_id | message_id                           | interaction_state | interaction_timestamp
    ---------+--------------------------------------+-------------------+--------------------------
         123 | ed6c69a0-0add-11b2-8080-808080808080 |         DISMISSED | 2020-01-05 00:05:41+0000

Could someone help me understand why this might happen? Does 'copy from'
follow the order of the rows in the csv when doing the import, or are there
no ordering guarantees?
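
To make the intended merge concrete, here is a minimal Python sketch that collapses the dump to one row per (user_id, message_id, interaction_state) before any import. It assumes the row with the latest interaction_timestamp should survive, which the description above does not state explicitly. The function name `collapse_rows` is hypothetical, and the sketch says nothing about how 'copy from' itself orders writes.

```python
import csv
import io
from datetime import datetime


def collapse_rows(csv_text):
    """Keep one row per (user_id, message_id, interaction_state).

    Assumption: the row with the latest interaction_timestamp wins.
    """
    latest = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines in the dump
        user_id, message_id, state, ts = row
        key = (user_id, message_id, state)  # primary key of the new table
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S%z")
        if key not in latest or parsed > latest[key][0]:
            latest[key] = (parsed, row)
    return [row for _, row in latest.values()]


# The two rows from initial_table_dump.csv shown earlier.
sample = (
    "123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-03 17:50:59+0000\n"
    "123,ed6c69a0-0add-11b2-8080-808080808080,DISMISSED,2020-01-10 00:05:41+0000\n"
)

for row in collapse_rows(sample):
    print(",".join(row))
```

With the dump reduced this way, each key of the new table appears only once in the file, so the result no longer depends on the order in which rows are written.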

I'm using the below cqlsh and Cassandra versions:
*[cqlsh 5.0.1 | Cassandra 2.2.15 | CQL spec 3.3.1 | Native protocol v4]*

Thanks,
Bogdan
