Jens-G opened a new pull request, #4813:
URL: https://github.com/apache/cassandra/pull/4813

   ## Summary
   
   `COPY TO` followed by `COPY FROM` corrupts text column values that contain 
backslashes: each round-trip doubles the backslash count. Reported in 
[CASSANDRA-21131](https://issues.apache.org/jira/browse/CASSANDRA-21131).
   
   **Before (one round-trip):**
   - Stored: `V\S` → exported CSV: `V\\\\S` → re-imported: `V\\S` ❌
   - Stored: `\"Marianne"\` → re-imported: `\\"Marianne"\\` ❌
   
   `list<text>`, `set<text>`, `map<text,text>`, tuples and UDTs with text 
fields are affected in the same way.
   
   ## Root Cause
   
   `format_value_text` in `formatting.py` doubles backslashes unconditionally:
   
   ```python
   escapedval = val.replace('\\', '\\\\')
   ```
   
   This is intentional for **terminal display** (SELECT output shows `V\\S` so 
the backslash is visible). However, `ExportProcess.format_value` in 
`copyutil.py` calls the same function when writing CSV. The `csv.writer` 
(configured with `escapechar='\\'`) then escapes backslashes a **second time**, 
quadrupling them in the CSV file. On `COPY FROM` the `csv.reader` unescapes 
once, leaving doubled backslashes in Cassandra.
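
The double escaping described above can be reproduced outside cqlsh with a minimal sketch. The helper names (`display_escape`, `write_csv_field`, `read_csv_field`) are hypothetical stand-ins, not functions from `formatting.py` or `copyutil.py`. The dialect is also an assumption: `quoting=csv.QUOTE_NONE` is used here only because it makes `csv.writer` escape the escape character the same way on every Python version, and cqlsh's real dialect options may differ.

```python
import csv
import io


def display_escape(val):
    # Stand-in for the unconditional doubling done for terminal display.
    return val.replace('\\', '\\\\')


def write_csv_field(field):
    # Write one field with backslash as the escape character (assumed dialect).
    buf = io.StringIO()
    writer = csv.writer(buf, escapechar='\\', quoting=csv.QUOTE_NONE,
                        lineterminator='\n')
    writer.writerow([field])
    return buf.getvalue().rstrip('\n')


def read_csv_field(line):
    # Read the field back with the same assumed dialect.
    reader = csv.reader(io.StringIO(line + '\n'), escapechar='\\',
                        quoting=csv.QUOTE_NONE)
    return next(reader)[0]


stored = r'V\S'                                     # one backslash
exported = write_csv_field(display_escape(stored))  # escaped twice
reimported = read_csv_field(exported)               # unescaped once

print(exported)    # V\\\\S  (four backslashes in the CSV text)
print(reimported)  # V\\S    (two backslashes, no longer equal to stored)
```

The value is escaped once by the display formatter and once by the writer, while the reader undoes only one of the two layers.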
   
   ## Fix
   
   Add an `escape_backslash` parameter (default `True`, preserving existing 
terminal display behaviour) to `format_value_text`, `format_simple_collection`, 
and all collection formatters. Pass `escape_backslash=False` from 
`ExportProcess.format_value` so that `csv.writer` alone is responsible for 
backslash escaping on export.
   
   Changed functions:
   - `format_value_text` — new parameter
   - `format_simple_collection` — new parameter, propagated to element 
`format_value` calls
   - `format_value_list`, `format_value_set`, `format_value_tuple` — new 
parameter, forwarded to `format_simple_collection`
   - `format_value_map` — new parameter, propagated through `subformat`
   - `format_value_utype` — new parameter, propagated through 
`format_field_value`
   - `ExportProcess.format_value` in `copyutil.py` — passes 
`escape_backslash=False`
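
The shape of the change can be sketched as follows. This is a simplified illustration under stated assumptions, not the patched code: `format_text` and `format_list` are hypothetical stand-ins for `format_value_text` and a collection formatter (the real functions take further arguments that are omitted here), and the CSV dialect with `quoting=csv.QUOTE_NONE` is an assumption chosen for version-independent behaviour.

```python
import csv
import io


def format_text(val, escape_backslash=True):
    # Hypothetical stand-in for format_value_text: doubling is now optional.
    return val.replace('\\', '\\\\') if escape_backslash else val


def format_list(values, escape_backslash=True):
    # Hypothetical collection formatter: forwards the flag to every element.
    items = [format_text(v, escape_backslash=escape_backslash) for v in values]
    return '[' + ', '.join(items) + ']'


def csv_round_trip(field):
    # Write and re-read one field with backslash as escapechar (assumed dialect).
    buf = io.StringIO()
    csv.writer(buf, escapechar='\\', quoting=csv.QUOTE_NONE,
               lineterminator='\n').writerow([field])
    buf.seek(0)
    return next(csv.reader(buf, escapechar='\\', quoting=csv.QUOTE_NONE))[0]


stored = r'V\S'

# Export path: no display escaping, so the CSV layer escapes exactly once.
assert csv_round_trip(format_text(stored, escape_backslash=False)) == stored

# Display path (default) still doubles, so it does not survive a round-trip.
assert csv_round_trip(format_text(stored)) != stored

# The flag reaches collection elements as well.
print(format_list([r'a\b', 'c'], escape_backslash=False))  # [a\b, c]
```

Keeping the default at `True` means callers that render values for the terminal need no change; only the export path opts out.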
   
   ## Test Plan
   
   Two standalone Python test scripts (no running Cassandra cluster required) 
are attached to the JIRA ticket; they reproduce the bug and verify the fix:
   
   - `test_cassandra_21131.py` — 10 test cases for plain `text` columns: **5/10 
pass before fix → 10/10 after**
   - `test_cassandra_21131_collections.py` — 12 test cases for `list<text>`, 
`set<text>` and `map<text,text>`: **3/12 pass before fix → 12/12 after**
   
   Integration testing against a live cluster with the exact scenario from the 
bug report (`COPY TO` → `TRUNCATE` → `COPY FROM` → `SELECT`) is needed before 
merge.
   
   ## Notes
   
   - A related bug (`UNICODE_CONTROLCHARS_RE` converting control 
chars like `\n` to repr-notation `\\n` during CSV export) was discovered and 
will be tracked in a separate ticket.
   - The `Generated-by:` commit token is included per [ASF generative tooling 
policy](https://www.apache.org/legal/generative-tooling.html). The fix was 
developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under human review 
and direction. All code has been verified manually.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

