Hi folks, I'm looking at a table whose primary key is defined as "publisher_id text". I've noticed that some of the values appear to start with a UTF-8 BOM and some do not.
https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cql_data_types_c.html says text is a UTF-8 encoded string. If I look at the first 3 bytes of one of these columns:

  $ dd if=~/tmp/sample.data of=/dev/stdout bs=1 count=3 2>/dev/null | hexdump
  0000000 bbef 00bf
  0000003

When I swap the byte order (hexdump's default output groups bytes into little-endian 16-bit words, hence the swap):

  $ dd if=~/tmp/sample.data of=/dev/stdout bs=1 count=3 conv=swab 2>/dev/null | hexdump
  0000000 efbb 00bf
  0000003

That looks like the UTF-8 BOM (ef bb bf). However, not all the rows have this prefix, and I'm wondering whether this is a client issue (the client being inconsistent about how it handles strings) or whether Cassandra is doing something special on its own.

The rest of each value falls within the US-ASCII-compatible range of UTF-8, e.g. something as simple as 'abc', but in some cases it has this marker in front of it. Cassandra treats '<BOM>abc' as a value distinct from 'abc', which makes sense; for efficiency's sake I assume it just compares the raw bytes without layering any meaning on top of them. But that means I'll need to clean the data up to be consistent, and I need to figure out how to prevent the BOM from being reintroduced in the future. I've pasted a rough sketch of the cleanup I have in mind below my signature.

Jim
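The sketch uses the DataStax Python driver against a made-up keyspace and table ("ks", "publishers") where publisher_id text is the only column; my real table has more columns, so treat it as an outline rather than something to run as-is. Since the BOM is in the primary key, each affected row has to be re-inserted under the cleaned key and the old row deleted.

  #!/usr/bin/env python3
  # One-off cleanup sketch: re-key rows whose publisher_id starts with a UTF-8 BOM.
  # Keyspace/table/column names are placeholders for illustration.
  from cassandra.cluster import Cluster

  BOM = '\ufeff'  # the bytes ef bb bf decode to U+FEFF in UTF-8

  cluster = Cluster(['127.0.0.1'])
  session = cluster.connect('ks')

  for row in session.execute('SELECT publisher_id FROM publishers'):
      pid = row.publisher_id
      if pid.startswith(BOM):
          clean = pid.lstrip(BOM)
          # The primary key can't be updated in place: write the row under the
          # clean key (copying any non-key columns), then delete the old one.
          session.execute('INSERT INTO publishers (publisher_id) VALUES (%s)', (clean,))
          session.execute('DELETE FROM publishers WHERE publisher_id = %s', (pid,))

  cluster.shutdown()

On the write path I'd do the same strip on the client before binding the value, so the BOM can't sneak back in.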