[ https://issues.apache.org/jira/browse/CASSANDRA-21197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jon Haddad updated CASSANDRA-21197:
-----------------------------------
Description:
When evaluating the analytics bulk writer I found jobs were reported as
successful, but the data wasn't being correctly imported. I'm testing using C*
5.0.6, sidecar trunk (as of yesterday), and the latest analytics code as of...
recent. I've verified this is an issue with both single token and 4 token
clusters, using all C* defaults from the tarball release otherwise except these:
{noformat}
---
cluster_name: "test"
num_tokens: 1
seed_provider:
  class_name: "org.apache.cassandra.locator.SimpleSeedProvider"
  parameters:
    seeds: "10.14.1.95"
hints_directory: "/mnt/db1/cassandra/hints"
data_file_directories:
  - "/mnt/db1/cassandra/data"
commitlog_directory: "/mnt/db1/cassandra/commitlog"
concurrent_reads: 64
concurrent_writes: 64
trickle_fsync: true
endpoint_snitch: "Ec2Snitch"{noformat}
I've traced the network and filesystem calls and have found this is the series
of events:
1. Spark job runs
2. data lands on disk from sidecar
3. import is called, C* says nothing to import
4. sidecar then deletes the data files
The result is that all my data is deleted from disk without ever being
imported. I have tested this dozens of times a day for almost a week, and it
has happened 100% of the time.
I haven't yet determined why Cassandra doesn't import anything, but given the
nature of the issue I'm hoping more eyes on this will help. It's possible
there's something specific about my setup that's causing this issue - I know
there are quite a few tests around sidecar, so I'm surprised it's happening.
That said, if C* isn't correctly importing data, it should have a way of
telling sidecar that so sidecar doesn't delete the results of a bulk write job.
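To illustrate the last point, here is a minimal sketch of a cleanup step that keeps the staged files unless the import reports that it actually loaded something. This is not how sidecar or Cassandra is implemented: `ImportResult`, `imported_count`, `failed_dirs`, and `cleanup_after_import` are hypothetical names invented for this example.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ImportResult:
    """Hypothetical summary of one import call; not a real Cassandra type."""
    imported_count: int                               # SSTables actually loaded
    failed_dirs: list = field(default_factory=list)   # directories that failed


def cleanup_after_import(staging_dir: Path, result: ImportResult) -> bool:
    """Delete staging_dir only if the import loaded at least one SSTable.

    Returns True if the directory was removed, False if it was kept.
    """
    if result.imported_count == 0 or result.failed_dirs:
        # Nothing was imported (or something failed): keep the files so the
        # bulk write can be retried or inspected instead of being lost.
        return False
    shutil.rmtree(staging_dir)
    return True
```

With a guard like this, the "No new SSTables were found" case would leave the uploaded files in place, so the job could be failed or retried rather than reported as successful.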
{*}Note{*}: the file names and IDs in the captures below might not match each
other, because I collected them over several days across about a dozen clusters
and 100 Spark jobs.
[Spark job
runs|https://github.com/rustyrazorblade/easy-db-lab/blob/main/bin/submit-direct-bulk-writer].
The data files are written to disk, then renamed. I've captured that several
ways; the easiest to see is the rename, captured with sysdig:
{noformat}
sudo sysdig "evt.category=file and (proc.pid=24272 or proc.pid=30444)" | grep 'cassandra/import'{noformat}
Here's the relevant output, where the vertx process (sidecar) performs the
rename to the expected data file name:
{noformat}
2198732 01:17:36.437748828 1 vert.x-internal (30642) < rename res=0
oldpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Index.db16346060661306473655.tmp
newpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Index.db
2199993 01:17:36.450173069 6 vert.x-internal (30635) < rename res=0
oldpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Filter.db4989982398684709072.tmp
newpath=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Filter.db{noformat}
Import is then called on the Cassandra process (PID 30528). I captured the
getdents64 call where the directory listing returns 10 entries:
{noformat}
sudo strace -p 30528 -e trace=getdents64 -y 2>&1 | grep import
getdents64(402</mnt/db1/cassandra/import/0-0-28c91aa3-fcae-4c97-bf5a-e520f070e1f9-a0a1bdd0-176b-11f1-bc8d-55a3317257c0/bulk_test/data>,
0x7176a803a0c0 /* 10 entries */, 32768) = 392{noformat}
but the log entry says nothing is imported:
{noformat}
INFO [RMI TCP Connection(92)-127.0.0.1] 2026-03-04 01:44:12,773
SSTableImporter.java:80 - [af506331-6517-4461-a10f-3846baaf30c6] Loading new
SSTables for bulk_test/data:
Options{srcPaths='[/mnt/db1/cassandra/import/0-0-28c91aa3-fcae-4c97-bf5a-e520f070e1f9-a0a1bdd0-176b-11f1-bc8d-55a3317257c0/bulk_test/data]',
resetLevel=true, clearRepaired=true, verifySSTables=true, verifyTokens=true,
invalidateCaches=true, extendedVerify=false, copyData= false,
failOnMissingIndex= false, validateIndexChecksum= true}
INFO [RMI TCP Connection(92)-127.0.0.1] 2026-03-04 01:44:12,781
SSTableImporter.java:214 - [af506331-6517-4461-a10f-3846baaf30c6] No new
SSTables were found for bulk_test/data{noformat}
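For anyone checking whether their own cluster hit the same condition, a small sketch that scans Cassandra log lines for this message. It assumes each log entry is on a single line in the log file (the entries above are wrapped by the email) and that the message text matches what is shown here; `find_empty_imports` is a made-up helper name.

```python
import re

# Matches the "No new SSTables were found" entry shown above and captures
# the import ID in square brackets plus the keyspace/table it refers to.
EMPTY_IMPORT = re.compile(
    r"\[(?P<import_id>[0-9a-f-]{36})\] No new SSTables were found for "
    r"(?P<table>\S+)"
)


def find_empty_imports(log_lines):
    """Return (import_id, keyspace/table) for every import that found nothing."""
    hits = []
    for line in log_lines:
        match = EMPTY_IMPORT.search(line)
        if match:
            hits.append((match.group("import_id"), match.group("table")))
    return hits
```

It can be fed an open log file directly, for example `find_empty_imports(open(path))`, where `path` points at the node's log.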
sidecar then comes around and unlinks the files, resulting in data loss:
{noformat}
2248856 01:17:37.778334683 1 vert.x-internal (30642) < unlink res=0
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-CompressionInfo.db
2248866 01:17:37.778345865 1 vert.x-internal (30642) < newfstatat res=0
dirfd=-100(AT_FDCWD)
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db
flags=256(AT_SYMLINK_NOFOLLOW)
2248868 01:17:37.778352848 1 vert.x-internal (30642) < newfstatat res=0
dirfd=-100(AT_FDCWD)
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db
flags=256(AT_SYMLINK_NOFOLLOW)
2248875 01:17:37.778370298 1 vert.x-internal (30642) < unlink res=0
path=/mnt/db1/cassandra/import/0-0-1d50c5e6-8fbe-44c7-98ec-a06132e78c1f-e9293be0-1767-11f1-887e-0ff1d5cec701/bulk_test/data/oa-1-big-Statistics.db{noformat}
I haven't yet determined why Cassandra doesn't import the data. It sees the
files in the listing, but there's no additional debug available to identify why
it doesn't consider them valid.
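In the absence of more debug output, one thing that can be checked from outside is which component files are present for each SSTable in the import directory before sidecar removes them. The sketch below groups file names of the form seen above (`oa-1-big-Index.db`) by their `version-id-format` prefix and reports missing components. The `EXPECTED` set is an assumption for illustration, not a statement of what Cassandra requires, and this does not explain why the importer skips the files.

```python
import re
from collections import defaultdict
from pathlib import Path

# <version>-<id>-<format>-<component>, e.g. oa-1-big-Index.db
NAME = re.compile(
    r"^(?P<version>[a-z]+)-(?P<sstable_id>[^-]+)-(?P<fmt>[a-z]+)-(?P<component>.+)$"
)

# Assumed component set for the "big" format; adjust to your setup.
EXPECTED = {
    "Data.db", "Index.db", "Filter.db", "CompressionInfo.db",
    "Statistics.db", "Summary.db", "Digest.crc32", "TOC.txt",
}


def group_components(filenames):
    """Group component names by SSTable prefix; return (groups, other_files)."""
    groups = defaultdict(set)
    other = []
    for name in filenames:
        match = NAME.match(name)
        if match is None or name.endswith(".tmp"):
            other.append(name)  # leftover temp files or unrelated names
            continue
        key = "-".join(match.group("version", "sstable_id", "fmt"))
        groups[key].add(match.group("component"))
    return groups, other


def missing_components(filenames, expected=EXPECTED):
    """Map each SSTable prefix to the expected components it lacks."""
    groups, _ = group_components(filenames)
    return {key: sorted(expected - found) for key, found in groups.items()}


def list_dir(path):
    """File names directly inside an import directory."""
    return [p.name for p in Path(path).iterdir() if p.is_file()]
```

For example, `missing_components(list_dir(import_dir))`, with `import_dir` set to the `bulk_test/data` directory under the import path, lists per SSTable which of the assumed components are absent at that moment.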
> import not importing resulting in data loss with analytics jobs
> ---------------------------------------------------------------
>
> Key: CASSANDRA-21197
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21197
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Analytics Library, Sidecar
> Reporter: Jon Haddad
> Priority: Normal