[
https://issues.apache.org/jira/browse/NIFI-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418245#comment-17418245
]
Tamas Palfy commented on NIFI-8676:
-----------------------------------
There's already a kind of fix/workaround for this:
https://issues.apache.org/jira/browse/NIFI-4876
A listing doesn't necessarily show an object that was added very recently.
Ordering is not guaranteed either.
I.e. eventual consistency.
The aforementioned change introduces an option to skip very recent objects so
we can make sure we don't skip their "younger siblings".
ListGCSBucket doesn't have this option at the moment, but it would be easy to add.
As long as the whole listing is done by the underlying library, I think using
that property is preferable to adding all the complexity an entity-tracking
strategy would bring.
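To illustrate the idea, here is a minimal sketch (not NiFi's actual code; the class and record names are made up for this example) of timestamp-based listing with a minimum-object-age cutoff: objects modified more recently than the cutoff are deliberately left for a later run, so that an out-of-order sibling sharing their timestamp window cannot be skipped forever once the state advances.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical object summary; illustrative only, not the real client API.
record ObjectSummary(String key, Instant lastModified) {}

public class MinAgeListing {
    // Newest lastModified timestamp we have already emitted (the saved state).
    private Instant lastListedTimestamp = Instant.EPOCH;

    // Emits only objects that are (a) newer than the saved state and
    // (b) older than the min-age cutoff. Recently modified objects stay
    // pending, so the state never advances past them before they appear
    // in a (possibly eventually consistent) listing.
    public List<ObjectSummary> listNew(List<ObjectSummary> bucketListing,
                                       Duration minObjectAge, Instant now) {
        Instant cutoff = now.minus(minObjectAge);
        List<ObjectSummary> result = new ArrayList<>();
        for (ObjectSummary obj : bucketListing) {
            boolean newerThanState = obj.lastModified().isAfter(lastListedTimestamp);
            boolean oldEnough = !obj.lastModified().isAfter(cutoff);
            if (newerThanState && oldEnough) {
                result.add(obj);
            }
        }
        // Advance state only up to the cutoff; anything younger stays pending.
        for (ObjectSummary obj : result) {
            if (obj.lastModified().isAfter(lastListedTimestamp)) {
                lastListedTimestamp = obj.lastModified();
            }
        }
        return result;
    }
}
```

In this simplified sketch a "fresh" object is withheld on the first run and picked up on the next one once it is older than the cutoff, instead of being silently jumped over by the state timestamp.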
> ListS3 and ListGCSObject sometimes miss objects in very active buckets
> ----------------------------------------------------------------------
>
> Key: NIFI-8676
> URL: https://issues.apache.org/jira/browse/NIFI-8676
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.13.2
> Reporter: Paul Kelly
> Priority: Major
> Labels: gcs, s3
> Attachments: flow.xml.gz
>
>
> ListS3 and ListGCSBucket occasionally miss some objects in very active
> buckets and never list them. Through testing, it appears that exclusively
> using an object's last modified date for state tracking is unreliable when a
> large dump of objects of various sizes is uploaded simultaneously. For some
> reason, newer but smaller files are sometimes listed before older but larger
> files, which messes up the timestamp tracking state of the ListS3 and
> ListGCSBucket processors.
> We have flows that operate as ListS3 -> FetchS3Object -> DeleteS3Object ->
> (downstream processing) and ListGCSBucket -> FetchGCSObject ->
> DeleteGCSObject -> (downstream processing). We often notice files remain in
> the bucket until we manually clear the state of the relevant List processor
> and restart it. Examining the provenance logs shows that the objects that
> remained were never listed, which is confirmed by logs within the downstream
> processing showing the objects never made it there.
> Attached is a sample flow.xml.gz file which replicates this problem by
> simulating extreme conditions for both GCS and S3. Two GenerateFlowFile
> processors run with a schedule of 0.01 seconds. One of them generates flow
> files of size 1B and the other generates flow files of size 1GB. These feed
> into a PutS3Object or PutGCSObject processor which is set to use 10
> concurrent threads, thus allowing 10 files to be uploaded simultaneously. The
> queue that is connected to the Put processors does not limit the number or
> size of flow files in order to prevent backpressure from causing the
> number of small and large sample flow files being uploaded simultaneously to
> become unbalanced.
> Another flow within the attached sample flow.xml.gz file uses
> ListS3/ListGCSBucket -> DeleteS3Object/DeleteGCSObject to mimic the receiving
> end where objects are missed. The List processors are set to a run schedule
> of 0 seconds to cause listing to occur as frequently as possible. After
> starting both the sending and receiving flows, you should see within a few
> seconds to a minute that the counts of flow files put into GCS or S3 are
> higher than the count of flow files output by the List processors.
> Additionally, if you stop the Put flow but let the receiving flow with its
> Delete processor continue to run, objects will remain in the bucket even
> after all queues are flushed. Examining provenance logs will confirm that
> those objects were never listed. Stopping the List processor, clearing its
> state, and restarting it will cause these remaining objects to be listed and
> then deleted by the Delete processor.
> We do not run into this problem with ListAzureBlobStorage since we can set it
> to track entities and not just track timestamps. ListS3 and ListGCSBucket do
> not allow tracking by entities and are hard-coded to only track timestamps.
> It'd be great if they could track by entities or if the timestamp issue could
> be resolved.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)