Many many thanks 🙏 Joe, that makes my flow a lot simpler. 

Thanks 
Jens

> On 13 Oct 2021, at 16:50, Joe Witt <[email protected]> wrote:
> 
> Jens
> 
> If you use MergeContent [1] you can create streams of flowfile bundles
> (attributes/content serialized together) in groups of 1 or more.  Then
> on the other end you can use UnpackContent [2].
> 
> Thanks
> Joe
> 
> [1] 
> http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.14.0/org.apache.nifi.processors.standard.MergeContent/index.html
> [2] 
> http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.14.0/org.apache.nifi.processors.standard.UnpackContent/index.html
> 
>> On Tue, Oct 12, 2021 at 11:07 PM Jens M. Kofoed <[email protected]> 
>> wrote:
>> 
>> Dear Joe
>> 
>> Regarding your point 5: this is almost what I'm doing. Last night, writing
>> on my phone, I "just wrote" that we created a hash file. What I'm actually
>> doing is converting the flowfile attributes to JSON.
>> Is there a way for NiFi to export the complete flowfile (attributes and
>> content) into one file, which we can import again on the other side? Right
>> now I do it in two steps.
>> Below is a short description of my flow for transferring data between
>> systems where we can't use S2S.
>> At low side:
>> get data ->
>>  CryptographicHashContent ->
>>    UpdateAttribute: original.filename = ${filename}, rootHash=${content_SHA-256} ->
>>      UpdateAttribute: filename=${UUID()} ->
>>        PutSFTP ->
>>          AttributesToJSON: Destination=flowfile-content ->
>>            UpdateAttribute: filename=${filename:append('.flowfile')} ->
>>              PutSFTP
>> 
>> At high side:
>> ListSFTP: File filter Regex = .*\.flowfile ->
>>  FetchSFTP ->
>>    ExecuteScript: (converting json data into attributes) ->
>>      UpdateAttribute: filename = ${filename:substringBefore('.flowfile')} ->
>>        FetchSFTP ->
>>          CryptographicHashContent ->
>>            RouteOnAttribute: Hash_OK = ${rootHash:equals(${content_SHA-256})} ->
>>              Hash_OK -> following production flow
>>              Unmatched -> Error flow
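
The Hash_OK check at the end of the high-side flow amounts to recomputing the SHA-256 of the fetched content and comparing it with the rootHash value carried in the metadata. A minimal Python sketch of that comparison outside NiFi follows; the function names `sha256_of_file` and `hash_ok` are illustrative assumptions, not NiFi API.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a 100 MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_ok(path, root_hash):
    """True when the recomputed hash equals the recorded rootHash.

    The recorded value is stripped and lower-cased first, so a trailing
    newline or upper-case hex in the metadata does not cause a false mismatch.
    """
    return sha256_of_file(path) == root_hash.strip().lower()
```

A file for which `hash_ok` returns False would correspond to the Unmatched route above.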
>> 
>> Kind regards
>> Jens
>> 
>>> On Tue, 12 Oct 2021, at 21:36, Joe Witt <[email protected]> wrote:
>>> 
>>> Jens
>>> 
>>> For such a setup the very specific details matter and here there are a
>>> lot of details.  It isn't easy to sort through this for me so I'll
>>> keep it high level based on my experience in very similar
>>> situations/setups:
>>> 
>>> 1. I'd generally trust SFTP to be awesome and damn near failure proof
>>> in itself.  I'd focus on other things.
>>> 2. I'd generally trust that data packet corruption in terms of network
>>> transfer is bulletproof and not think that is a problem especially
>>> since SFTP and various protocols employed here offer certain
>>> guarantees themselves (including nifi).
>>> 3. I'd be suspect of one way transfer/guard devices creating issues.
>>> I'd remove that and try to reproduce the problem.
>>> 4. In Linux a cp/mv is not atomic, as I understand it, if data is spanning
>>> across file systems so you could have partially written data scenarios
>>> here potentially.
>>> 5. I'd be careful to avoid multiple file scenarios such as original
>>> content and the sha256.  Instead if the low side is a NiFi and the
>>> high side is a NiFi I'd have lowside nifi write out flowfiles and pass
>>> those over the guard device.  Why?  Because this gives you your
>>> original content AND the flowfile attributes (where I'd have the
>>> sha256).  On the high side NiFi I'd unpack that flowfile and ensure
>>> the content matches the stated sha256.
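
The idea in point 5, one file that carries both the content and the attributes (including the sha256), can be sketched in Python as below. This is only an illustration under stated assumptions: the container layout (a 4-byte length, a JSON attribute header, then the raw content) and the names `pack` and `unpack` are hypothetical and are not NiFi's flowfile package format.

```python
import hashlib
import json
import struct


def pack(content: bytes, attributes: dict) -> bytes:
    """Bundle content and attributes into one blob, recording the content hash."""
    attrs = dict(attributes)
    attrs["rootHash"] = hashlib.sha256(content).hexdigest()
    header = json.dumps(attrs, sort_keys=True).encode("utf-8")
    # Hypothetical layout: 4-byte big-endian header length, JSON header, raw content.
    return struct.pack(">I", len(header)) + header + content


def unpack(bundle: bytes):
    """Split a blob made by pack() and verify the content against the stated hash."""
    if len(bundle) < 4:
        raise ValueError("bundle too short to contain a header length")
    (header_len,) = struct.unpack(">I", bundle[:4])
    if len(bundle) < 4 + header_len:
        raise ValueError("bundle truncated inside the attribute header")
    attrs = json.loads(bundle[4:4 + header_len].decode("utf-8"))
    content = bundle[4 + header_len:]
    if hashlib.sha256(content).hexdigest() != attrs.get("rootHash"):
        raise ValueError("content does not match the stated sha256")
    return content, attrs
```

Because the hash travels inside the same file as the content, the receiving side never has to pair up two separately transferred files.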
>>> 
>>> Joe
>>> 
>>> On Tue, Oct 12, 2021 at 12:25 PM Jens M. Kofoed <[email protected]>
>>> wrote:
>>>> 
>>>> Hi Joe
>>>> 
>>>> I know what you are thinking, but that's not the case.
>>>> Check the very short description of my test flow.
>>>> In my loop the PutSFTP processor uses the default settings, which means
>>>> it uploads the file as .filename and renames it when done. The next
>>>> processor is FetchSFTP, which loads the file as filename. If PutSFTP has
>>>> not finished uploading, the file still has the wrong (dot-prefixed) name,
>>>> the flowfile does not move from PutSFTP to FetchSFTP, and FetchSFTP
>>>> therefore can't fetch the file. So in my test flow that is not the case.
>>>> 
>>>> In our production flow, after NiFi gets its data it calculates the
>>>> sha256 and uploads the data to an SFTP server as .filename, renaming it
>>>> when done (default settings for PutSFTP). Next it creates a new file with
>>>> the value of the hash and saves it as filename.sha256.
>>>> On that SFTP server a bash script looks for non-hidden files every
>>>> 2 seconds with an ls command. If there are files, the bash script does a
>>>> cp filename /archive/filename and sends the data to server 3 via a data
>>>> diode. On the other side another NiFi server reads filename.sha256, reads
>>>> in the hash value and reads in the original data. It calculates a new
>>>> sha256 and compares the two hashes.
>>>> Yesterday there was a corruption again, and we checked the file on the
>>>> first SFTP server, where the first NiFi saved it after creating the first
>>>> hash. Running sha256sum on /archive/filename produced a different hash
>>>> than NiFi. So after the PutSFTP and a Linux cp command the file was
>>>> corrupted.
>>>> We have seen these issues in fewer than 1 in 1,000,000 files. But we
>>>> see them.
>>>> Now we are trying to investigate what causes the issue. Therefore I
>>>> created the small test flow, and already after nearly 9000 iterations of
>>>> the loop the file was corrupted just by being uploaded and downloaded
>>>> again.
>>>> 
>>>> Are we facing a network issue where a data packet is corrupted?
>>>> Are there very rare cases where the SFTP implementation is doing
>>>> something wrong?
>>>> We don't know yet, but we are running more tests on different systems
>>>> to narrow it down.
>>>> 
>>>> Kind regards
>>>> Jens M. Kofoed
>>>> 
>>>>> On 12 Oct 2021, at 19:39, Joe Witt <[email protected]> wrote:
>>>>> 
>>>>> Hello
>>>>> 
>>>>> How does nifi grab the data from the file system?  It sounds like it is
>>>>> doing partial reads due to a competing consumer (data still being
>>>>> written) scenario.
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Mon, Oct 11, 2021 at 10:36 PM Jens M. Kofoed <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Dear Developers
>>>>>> 
>>>>>> We have a situation where we see corrupted files after using PutSFTP and
>>>>>> FetchSFTP in NiFi 1.13.2 with openjdk version "1.8.0_292", OpenJDK Runtime
>>>>>> Environment (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10), OpenJDK 64-Bit
>>>>>> Server VM (build 25.292-b10, mixed mode), running on Ubuntu Server 20.04.
>>>>>> 
>>>>>> We have a flow between 2 separated systems where we use PutSFTP to export
>>>>>> data from one NiFi instance to a data diode and use FetchSFTP to grab the
>>>>>> data on the other end. To be sure the data is not corrupted we calculate a
>>>>>> SHA256 on each side, and transfer the flowfile metadata in a separate
>>>>>> file. In rare cases we have seen that the SHA256 doesn't match on both
>>>>>> sides, and we are investigating where the errors happen. We see 2 errors.
>>>>>> Manually calculating a SHA256 on both sides of the diode, the file is OK,
>>>>>> and we have found that the errors happen between NiFi and the SFTP
>>>>>> servers. And it can happen on both sides.
>>>>>> So for testing I created this little flow:
>>>>>> GenerateFlowFile (size 100MB) (Run once) ->
>>>>>> CryptographicHashContent (SHA256) ->
>>>>>> UpdateAttribute ( hash.root = ${content_SHA-256} , iteration=1) ->
>>>>>> PutSFTP ->
>>>>>> FetchSFTP ->
>>>>>> CryptographicHashContent (SHA256) ->
>>>>>> RouteOnAttribute (compare hash.root vs. content_SHA-256)
>>>>>>   If unmatched ->
>>>>>>       Going to a disabled processor that holds the corrupted file in
>>>>>>       a queue
>>>>>>   If matched ->
>>>>>>       UpdateAttribute ( iteration = ${iteration:plus(1)} ) -> looping
>>>>>>       back to PutSFTP
>>>>>> 
>>>>>> After 8992 iterations the file is corrupted. To test whether the errors
>>>>>> are in the calculation of the SHA256, I have a copy of the flow without
>>>>>> the PutSFTP/FetchSFTP processors, which hasn't had any errors yet.
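
For anyone who wants to reproduce the loop test outside NiFi, the same idea (hash once, then transfer and re-hash until the hashes differ) can be sketched in Python. This is a rough stand-in under assumptions: `first_corrupt_iteration` is a hypothetical name, and the `transfer` callable is where a real SFTP upload plus download would go; a plain local copy such as `shutil.copyfile` is only a placeholder.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_corrupt_iteration(source_path, transfer, max_iterations):
    """Run transfer(src, dst) repeatedly, each time on the previous result.

    Returns the 1-based iteration at which the hash first differs from the
    hash of the original file, or None if no mismatch was seen.
    """
    root_hash = sha256_of_file(source_path)
    current = source_path
    with tempfile.TemporaryDirectory() as workdir:
        for iteration in range(1, max_iterations + 1):
            # Alternate between two scratch files so src and dst always differ.
            target = os.path.join(workdir, f"copy_{iteration % 2}")
            transfer(current, target)
            if sha256_of_file(target) != root_hash:
                return iteration
            current = target
    return None
```

Calling `first_corrupt_iteration(path, shutil.copyfile, 10000)` exercises only the local disk, which mirrors the control flow without the SFTP processors described above.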
>>>>>> 
>>>>>> It is very rare that we see these errors; millions of files go
>>>>>> through without any issues, but sometimes it happens, which is not good.
>>>>>> 
>>>>>> Can anyone please help? Maybe try to set up the same test and see if you
>>>>>> also get a corrupted file after some days.
>>>>>> 
>>>>>> Kind regards
>>>>>> Jens M. Kofoed
>>>>>> 
>>> 
