[ https://issues.apache.org/jira/browse/NIFI-9056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420897#comment-17420897 ]

Mark Payne commented on NIFI-9056:
----------------------------------

[~aheys] the content repositories are entirely independent. Node A knows 
nothing about the content repo of Node B. It's entirely okay for them to each 
be different sizes / utilization.

Unfortunately, I have been unable to reproduce this issue. I understand that 
you cannot transfer files out of a private network. But it may be necessary to 
try to recreate the issue in a different network where you can share the 
details and hopefully provide a template/versioned group that demonstrably 
replicates the issue.

The way in which the cleanup works is a bit complex, unfortunately, but it goes 
like this:
 * A Content Claim is a portion of a Resource Claim.
 * A Resource Claim is a facade/wrapper over a file in the Content Repository.
 * Each Resource Claim also has a Claimant Count - a count of how many 
FlowFiles reference the Resource Claim. This allows us to avoid having to 
write each individual FlowFile to a separate file on disk, which would be a 
big performance issue.
 * When we write to a FlowFile, we create a new Content Claim. We increment the 
Claimant Count of the Resource Claim that this new Content Claim belongs to. We 
also decrement the Claimant Count for the ResourceClaim that the FlowFile used 
to point to.
 * When we drop a FlowFile from the system (auto-terminated, or sent via 
site-to-site or a load-balanced connection), we also decrement the Claimant 
Count of the Resource Claim that the FlowFile's Content Claim belongs to.
 * When we checkpoint the FlowFile Repository, the FlowFile Repository is the 
one responsible for actually decrementing the Claimant Counts (this avoids 
encountering a situation in which we could decrement a claimant count to 0, 
have the content repo clean up the file, and then roll back the ProcessSession 
and need to reference the now-deleted file again).
 * When the FlowFile Repository decrements the claimant count, if the new 
claimant count is 0, the Resource Claim gets added to a queue of "Destructable 
Claims" that the
Content Repository is then free to destroy (or archive, depending on your 
configuration).
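The bullet points above can be sketched as a toy reference-counting model. To 
be clear, every class and method name below is illustrative only - this is not 
NiFi's actual ResourceClaim/ContentClaim API, just the counting scheme the 
steps describe:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the reference-counting scheme described above.
public class ClaimantCountDemo {

    // A Resource Claim fronts one file in the content repository and tracks
    // how many FlowFiles still reference it.
    static class ResourceClaim {
        final String backingFile;
        final AtomicInteger claimantCount = new AtomicInteger();
        ResourceClaim(String backingFile) { this.backingFile = backingFile; }
    }

    // A Content Claim is one FlowFile's slice (offset + length) of a
    // Resource Claim's backing file.
    static class ContentClaim {
        final ResourceClaim resourceClaim;
        final long offset, length;
        ContentClaim(ResourceClaim rc, long offset, long length) {
            this.resourceClaim = rc;
            this.offset = offset;
            this.length = length;
            rc.claimantCount.incrementAndGet(); // one more FlowFile references the file
        }
    }

    // Returns how many Resource Claims became destructable after both
    // FlowFiles were dropped.
    static int simulate() {
        Queue<ResourceClaim> destructableClaims = new ArrayDeque<>();
        ResourceClaim rc = new ResourceClaim("content_repo/partition-1/claim-file");

        // Two FlowFiles' content packed into the same backing file.
        ContentClaim a = new ContentClaim(rc, 0, 1024);
        ContentClaim b = new ContentClaim(rc, 1024, 4096);

        // Dropping each FlowFile decrements the count; when it reaches zero,
        // the claim is queued for the content repository to destroy/archive.
        for (ContentClaim claim : new ContentClaim[] { a, b }) {
            if (claim.resourceClaim.claimantCount.decrementAndGet() == 0) {
                destructableClaims.add(claim.resourceClaim);
            }
        }
        return destructableClaims.size();
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints 1
    }
}
```

Note that the single backing file only becomes destructable after the *last* 
referencing FlowFile is dropped - which is exactly why an over- or 
under-counted claimant count leaks (or prematurely frees) the whole file.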

So, having said that, when we've seen similar situations, what happens is that 
for some reason or another, the Claimant Count was not decremented somewhere 
along the way when it should have been, or the Claimant Count was incremented 
somewhere along the way when it shouldn't have been. As a result, on restart, 
when we re-calculate the Claimant Count for each Content Claim, we detect that 
the file is no longer in use and we destroy (or archive) it.
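That restart-time recalculation can be sketched as a simple recount of 
references per backing file. Again, the names here are hypothetical, and this 
assumes the recalculation boils down to walking FlowFile records and counting 
claims:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of restart-time claimant-count recalculation.
public class RestartCleanupDemo {

    // On restart, walk every restored FlowFile record and recount how many
    // still reference each Resource Claim (backing file). Claims with a
    // recomputed count of zero are destroyed (or archived, per configuration).
    static List<String> claimsToDestroy(Map<String, String> flowFileToClaim,
                                        Set<String> allClaims) {
        Map<String, Integer> counts = new HashMap<>();
        for (String claim : flowFileToClaim.values()) {
            counts.merge(claim, 1, Integer::sum);
        }
        List<String> destroy = new ArrayList<>();
        for (String claim : allClaims) {
            if (counts.getOrDefault(claim, 0) == 0) {
                destroy.add(claim); // no FlowFile references it anymore
            }
        }
        return destroy;
    }

    public static void main(String[] args) {
        // One restored FlowFile still points at claim-A; claim-B has no
        // references left, so it is cleaned up on restart.
        Map<String, String> flowFiles = Map.of("flowfile-1", "claim-A");
        Set<String> claims = Set.of("claim-A", "claim-B");
        System.out.println(claimsToDestroy(flowFiles, claims)); // prints [claim-B]
    }
}
```

This is why a claimant count that drifted upward in memory never causes data 
loss while running, but the "extra" files it kept alive vanish on restart.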

[~slyouts] you are right that in NIFI-6849, we did make some updates to the 
FlowFile Repository. Those updates were very much around how we restore 
FlowFiles, and I do not believe they are related to this concern.

If I were to guess, I would guess that you have some custom processor making a 
series of calls to ProcessSession that results in the ProcessSession not 
properly keeping track of Claimant Counts. That would explain why we're not 
seeing the issue elsewhere. But it's hard to say.

One thing worth checking out: when you run the diagnostics [~aheys], you can 
also provide a {{--verbose}} argument. If you do that, the output will contain 
the Claimant Count for every Resource Claim in your Content Repository. If you 
see Resource Claims with a Claimant Count > 0 whose files are nonetheless 
cleaned up after restart, that's an indicator that the Claimant Count is 
somehow off. It will then come down to understanding how the Claimant Count 
got off. And provenance can help there. You can add 
"contentClaimIdentifier" to the list of indexed fields in nifi.properties (the 
nifi.provenance.repository.indexed.fields property).  Note that it's called 
contentClaimIdentifier here and not resourceClaimIdentifier due to some 
renaming/refactoring that happened long ago, but it's the same thing. After 
doing that, and after restarting, any provenance event that gets added will 
allow you to search on the contentClaimIdentifier. So now, if you find a 
Resource Claim that had a Claimant Count > 0, and it was cleaned up on restart, 
you can find out exactly what was written to that Resource Claim and hopefully 
identify where things could have gone awry.
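Concretely, the nifi.properties change looks roughly like this. The 
pre-existing field names shown are only an example of a typical indexed-fields 
list - keep whatever fields your file already lists and append 
contentClaimIdentifier:

```properties
# nifi.properties - append contentClaimIdentifier to the indexed provenance
# fields. The other field names here are illustrative; preserve your own list.
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship, contentClaimIdentifier
```

After the restart, provenance searches can then be filtered on the identifier 
of the suspect Resource Claim.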

> Content Repository Filling Up
> -----------------------------
>
>                 Key: NIFI-9056
>                 URL: https://issues.apache.org/jira/browse/NIFI-9056
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.13.2
>            Reporter: Andrew Heys
>            Priority: Major
>
> We have a clustered nifi setup that has recently been upgraded to 1.13.2 from 
> 1.11.4. After upgrading, one of the issues we have run into is that the 
> Content Repository will fill up to the 
> nifi.content.repository.archive.backpressure.percentage mark and lock the 
> processing & canvas. The only solution is to restart nifi at this point. We 
> have the following properties set:
> nifi.content.repository.archive.backpressure.percentage=95%
> nifi.content.repository.archive.max.usage.percentage=25%
> nifi.content.repository.archive.max.retention.period=2 hours
> The max usage property seems to be completely ignored. Monitoring the nifi 
> cluster disk % for the content repository shows that it slowly fills up over 
> time and never decreases. If we pause the input to the entire nifi flow and 
> let all 
> the processing clear out with 0 flowfiles remaining on the canvas for 15+ 
> minutes, the content repository disk usage does not decrease. Currently, our 
> only solution is to restart nifi on a daily cron schedule. After restarting 
> nifi, it clears out the 80+ GB of the content repository and usage 
> falls to 0%. 
>  
> There seems to be an issue removing the older content claims in 1.13.2.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
