Jon Haddad created CASSANDRA-21197:
--------------------------------------

             Summary: import not importing resulting in data loss with 
analytics jobs
                 Key: CASSANDRA-21197
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21197
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Analytics Library
            Reporter: Jon Haddad


When evaluating the analytics bulk writer I found jobs were reported as 
successful, but the data wasn't actually being imported.  I'm testing with C* 
5.0.6, sidecar trunk (as of yesterday), and recent analytics trunk.

I've traced the network and filesystem calls and found this series of events:

1. The Spark job runs
2. Data lands on disk from sidecar
3. Import is called; C* says there's nothing to import
4. Sidecar then deletes the data files

resulting in all my data being deleted from disk without the import ever 
happening.  I have run this dozens of times a day for almost a week and it has 
happened 100% of the time.

I haven't yet determined why Cassandra doesn't import anything, but given the 
nature of the issue I'm hoping more eyes on this will help.  It's possible 
there's something specific about my setup that's causing this issue - I know 
there are quite a few tests around sidecar, so I'm surprised it's happening. 

That said, if C* isn't correctly importing data, it should have a way of 
signaling that to sidecar so sidecar doesn't delete the results of a bulk write 
job.
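One defensive option on the sidecar side, independent of why the import found nothing: with copyData=false (as in the log below), my understanding is that a successful import moves the component files out of the staging directory, so any {{*.db}} file still present after import means the data was never consumed.  A minimal sketch of that guard - class and method names are hypothetical, not sidecar's actual API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical guard: only clean up an import staging directory once the
// importer has actually consumed the SSTables.  Assumes that with
// copyData=false a successful import moves the component files out of the
// staging directory, so a remaining *.db file means nothing was imported.
public class ImportStagingGuard
{
    // Returns true only when no SSTable component files remain.
    public static boolean safeToDelete(Path stagingDir) throws IOException
    {
        try (Stream<Path> files = Files.list(stagingDir))
        {
            return files.noneMatch(p -> p.getFileName().toString().endsWith(".db"));
        }
    }

    // Deletes the staging directory, but refuses if components remain even
    // though import reported success - surfacing the bug instead of losing data.
    public static void deleteIfDrained(Path stagingDir) throws IOException
    {
        if (!safeToDelete(stagingDir))
            throw new IllegalStateException("SSTable components still present in " + stagingDir
                                            + "; refusing to delete after import");
        try (Stream<Path> files = Files.list(stagingDir))
        {
            for (Path p : (Iterable<Path>) files::iterator)
                Files.delete(p);
        }
        Files.delete(stagingDir);
    }
}
```

Even if C* can't report why nothing was imported, a check like this would turn silent data loss into a loud failure.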

{*}Note{*}: the file names below may not match up across captures; I've 
reproduced this over several days with about a dozen clusters and roughly 100 
Spark jobs.

[The Spark job 
runs|https://github.com/rustyrazorblade/easy-db-lab/blob/main/bin/submit-direct-bulk-writer].
  The data files are written to disk, then renamed.  I've captured that several 
ways; the easiest place to see the rename is this sysdig capture:


{noformat}
sudo sysdig "evt.category=file and (proc.pid=24272 or proc.pid=30444)" | grep 'cassandra/import'{noformat}
Here's the relevant output, where the vertx process (sidecar) performs the 
rename to the expected data file name:
{noformat}
2198732 01:17:36.437748828 1 vert.x-internal (30642) < rename res=0 
oldpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Index.db16346060661306473655.tmp
 
newpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Index.db

2199993 01:17:36.450173069 6 vert.x-internal (30635) < rename res=0 
oldpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Filter.db4989982398684709072.tmp
 
newpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Filter.db{noformat}

Import is then called on the Cassandra process (pid 30528).  I captured the 
filesystem event where it receives 10 directory entries:
{noformat}
sudo strace -p 30528 -e trace=getdents64 -y 2>&1 | grep import

getdents64(402</mnt/db1/cassandra/import/0-0-28c91aa3-fcae-4c97-bf5a-e520f070e1f9-a0a1bdd0-176b-11f1-bc8d-55a3317257c0/bulk_test/data>,
 0x7176a803a0c0 /* 10 entries */, 32768) = 392{noformat}

but the log entry says nothing was imported:

{noformat}
INFO [RMI TCP Connection(92)-127.0.0.1] 2026-03-04 01:44:12,773 SSTableImporter.java:80 - [af506331-6517-4461-a10f-3846baaf30c6] Loading new SSTables for bulk_test/data: Options{srcPaths='[/mnt/db1/cassandra/import/0-0-28c91aa3-fcae-4c97-bf5a-e520f070e1f9-a0a1bdd0-176b-11f1-bc8d-55a3317257c0/bulk_test/data]', resetLevel=true, clearRepaired=true, verifySSTables=true, verifyTokens=true, invalidateCaches=true, extendedVerify=false, copyData=false, failOnMissingIndex=false, validateIndexChecksum=true}
INFO [RMI TCP Connection(92)-127.0.0.1] 2026-03-04 01:44:12,781 SSTableImporter.java:214 - [af506331-6517-4461-a10f-3846baaf30c6] No new SSTables were found for bulk_test/data{noformat}

sidecar then comes around and unlinks the files, resulting in data loss:

{noformat}
2248856 01:17:37.778334683 1 vert.x-internal (30642) < unlink res=0 
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-CompressionInfo.db

2248866 01:17:37.778345865 1 vert.x-internal (30642) < newfstatat res=0 
dirfd=-100(AT_FDCWD) 
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db
 flags=256(AT_SYMLINK_NOFOLLOW) 

2248868 01:17:37.778352848 1 vert.x-internal (30642) < newfstatat res=0 
dirfd=-100(AT_FDCWD) 
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db
 flags=256(AT_SYMLINK_NOFOLLOW) 

2248875 01:17:37.778370298 1 vert.x-internal (30642) < unlink res=0 
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db{noformat}

I haven't yet determined why Cassandra doesn't import the data.  It sees the 
files in the directory listing, but there's no additional debug output 
available to explain why it doesn't consider them valid.
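One hypothesis worth ruling out: my understanding is that the importer keys off the Data.db component when scanning the source directory, so a staging directory holding Index/Filter/Statistics components but no Data.db it accepts would produce exactly "No new SSTables were found".  A quick standalone check one could run against a staging directory - the class is hypothetical and assumes the usual big-format naming (e.g. oa-1-big-Index.db):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Standalone sanity check for an import staging directory.  Groups component
// files by their sstable prefix (e.g. "oa-1-big") and reports any set that is
// missing its Data.db component, which would explain an empty import.
public class StagingDirCheck
{
    public static Map<String, List<String>> componentsBySSTable(Path dir) throws IOException
    {
        Map<String, List<String>> sets = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir))
        {
            for (Path p : (Iterable<Path>) files::iterator)
            {
                String name = p.getFileName().toString();
                int idx = name.lastIndexOf('-');   // split off the component, e.g. "Index.db"
                if (idx < 0 || !name.endsWith(".db"))
                    continue;                      // skip *.tmp leftovers and other files
                sets.computeIfAbsent(name.substring(0, idx), k -> new ArrayList<>())
                    .add(name.substring(idx + 1));
            }
        }
        return sets;
    }

    // Returns the sstable prefixes that have components but no Data.db.
    public static List<String> missingData(Path dir) throws IOException
    {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : componentsBySSTable(dir).entrySet())
            if (!e.getValue().contains("Data.db"))
                missing.add(e.getKey());
        return missing;
    }
}
```

Running something like this against the staging directory right before the unlink would confirm or eliminate an incomplete component set as the cause.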

--
This message was sent by Atlassian Jira
(v8.20.10#820010)