Paul Kelly created NIFI-8676:
--------------------------------
Summary: ListS3 and ListGCSBucket sometimes miss objects in very
active buckets
Key: NIFI-8676
URL: https://issues.apache.org/jira/browse/NIFI-8676
Project: Apache NiFi
Issue Type: Bug
Affects Versions: 1.13.2
Reporter: Paul Kelly
Attachments: flow.xml.gz
ListS3 and ListGCSBucket occasionally miss some objects in very active buckets
and never list them. Through testing, it appears that relying exclusively on an
object's last-modified date for state tracking is unreliable when a large batch
of objects of various sizes is uploaded simultaneously. For reasons we have not
pinned down, newer but smaller files sometimes show up in listings before older
but larger files; when that happens, the listing advances the tracked timestamp
of the ListS3 and ListGCSBucket processors past the older objects' last-modified
dates, so those objects are filtered out of every subsequent listing.
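To illustrate the failure mode, here is a deliberately simplified Java sketch
(not NiFi's actual state-management code; the keys and timestamps are made up)
showing how timestamp-only tracking permanently skips an object that becomes
visible after the tracked timestamp has already moved past its last-modified
date:

{code:java}
import java.util.List;

public class TimestampSkipDemo {
    // The only state kept: the newest last-modified value seen so far.
    static long maxSeen = 0L;

    record Obj(String key, long lastModified) {}

    // Emits only objects strictly newer than the tracked timestamp,
    // then advances the tracked timestamp.
    static void list(List<Obj> visible) {
        long newMax = maxSeen;
        for (Obj o : visible) {
            if (o.lastModified() > maxSeen) {
                System.out.println("listed: " + o.key());
                newMax = Math.max(newMax, o.lastModified());
            }
        }
        maxSeen = newMax;
    }

    public static void main(String[] args) {
        // First listing: only the small, newer object has finished uploading.
        list(List.of(new Obj("small.txt", 200)));            // maxSeen -> 200
        // Second listing: the big, older object finally shows up, but its
        // last-modified (100) is below maxSeen (200), so it is never emitted.
        list(List.of(new Obj("small.txt", 200), new Obj("big.bin", 100)));
    }
}
{code}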
We have flows that operate as ListS3 -> FetchS3Object -> DeleteS3Object ->
(downstream processing) and ListGCSBucket -> FetchGCSObject -> DeleteGCSObject
-> (downstream processing). We often notice that files remain in the bucket
until we manually clear the state of the relevant List processor and restart
it. Examining the provenance logs shows that the remaining objects were never
listed, and logs from the downstream processing confirm that those objects
never arrived there.
Attached is a sample flow.xml.gz file that replicates this problem by
simulating extreme conditions for both GCS and S3. Two GenerateFlowFile
processors run on a 0.01-second schedule: one generates flow files of size 1B
and the other generates flow files of size 1GB. These feed into a PutS3Object
or PutGCSObject processor configured with 10 concurrent threads, allowing 10
files to be uploaded simultaneously. The queue connected to the Put processors
does not limit the number or size of flow files, to prevent backpressure from
unbalancing the mix of small and large sample flow files being uploaded at the
same time.
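The sleep-based sketch below (a toy, not an actual S3/GCS upload; the sleep
durations stand in for the 1B vs. 1GB transfer times) shows the core of the
race this flow provokes: with concurrent uploads of mixed sizes, completion
order, and therefore the order in which objects become listable, does not match
start order:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UploadOrderDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        // big.bin starts first but transfers slowly (think 1GB) ...
        pool.submit(() -> upload("big.bin", 500));
        Thread.sleep(50);
        // ... small.txt starts later but finishes almost instantly (think 1B).
        pool.submit(() -> upload("small.txt", 10));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        // Output: small.txt finishes (and becomes listable) before big.bin,
        // even though big.bin's upload began earlier.
    }

    static void upload(String key, long transferMillis) {
        try {
            Thread.sleep(transferMillis);
        } catch (InterruptedException ignored) {
        }
        System.out.println(key + " finished");
    }
}
{code}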
Another flow within the attached sample flow.xml.gz file uses
ListS3/ListGCSBucket -> DeleteS3Object/DeleteGCSObject to mimic the receiving
end where objects are missed. The List processors are set to a run schedule of
0 seconds to cause listing to occur as frequently as possible. After starting
both the sending and receiving flows, you should see within a few seconds to a
minute that the count of flow files put into GCS or S3 is higher than the
count of flow files output by the List processors. Additionally, if you stop
the Put flow but let the receiving flow with its Delete processor continue to
run, objects will remain in the bucket even after all queues are flushed.
Examining provenance logs will confirm that those objects were never listed.
Stopping the List processor, clearing its state, and restarting it will cause
these remaining objects to be listed and then deleted by the Delete processor.
We do not run into this problem with ListAzureBlobStorage, since we can set it
to track entities rather than just timestamps. ListS3 and ListGCSBucket do not
allow tracking by entities and are hard-coded to track only timestamps. It'd be
great if they could track by entities, or if the timestamp issue could be
resolved.
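For comparison, here is a minimal sketch of what entity tracking buys (the
class and method names are illustrative, not NiFi's ListedEntityTracker API):
because state is keyed by object rather than by a single timestamp, an object
that surfaces late with an old last-modified date is still emitted.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EntityTrackingSketch {
    // State keyed by entity: object key -> last-modified at which it was emitted.
    private final Map<String, Long> alreadyListed = new HashMap<>();

    // Emits any object not seen before, or seen again with a newer timestamp;
    // how its timestamp compares to other objects' timestamps is irrelevant.
    public List<String> list(Map<String, Long> visibleObjects) {
        List<String> toEmit = new ArrayList<>();
        visibleObjects.forEach((key, lastModified) -> {
            Long previous = alreadyListed.get(key);
            if (previous == null || lastModified > previous) {
                alreadyListed.put(key, lastModified);
                toEmit.add(key);
            }
        });
        return toEmit;
    }
}
{code}

Under this scheme, the big.bin from the earlier sketch would be emitted on the
second listing despite its older timestamp. The trade-off is that state grows
with the number of tracked objects, which is presumably why ListAzureBlobStorage
pairs entity tracking with a tracking time window.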