This is a very interesting area of integration to investigate, Marc. Thank
you for sharing your work!

I looked into this a little after conversations with folks in security
applications, and I wonder whether you investigated approaches to tracking,
reporting, and handling packet loss and error rates here?
The interest was in reasoning about loss rates and the completeness of
received data, which I think a simple merge > put > diode > get > unpack
flow would not manage.
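
A minimal sketch of the kind of receiver-side loss accounting in question,
assuming the sender numbers every chunk and announces the total up front.
The LossTracker class and its method names are hypothetical and not part of
NiFi or of Marc's processors.

```python
from dataclasses import dataclass, field


@dataclass
class LossTracker:
    """Receiver-side bookkeeping for one transfer of `total` numbered chunks."""
    total: int
    seen: set = field(default_factory=set)

    def record(self, seq: int) -> None:
        # Duplicates and reordering are harmless: a set only tracks membership.
        if 0 <= seq < self.total:
            self.seen.add(seq)

    def missing(self) -> list:
        # Sequence numbers that never arrived, in ascending order.
        return [s for s in range(self.total) if s not in self.seen]

    def loss_rate(self) -> float:
        return len(self.missing()) / self.total if self.total else 0.0

    def complete(self) -> bool:
        return len(self.seen) == self.total


tracker = LossTracker(total=5)
for seq in (0, 2, 1, 4, 2):  # chunk 3 never arrives; chunk 2 arrives twice
    tracker.record(seq)
print(tracker.missing(), tracker.loss_rate(), tracker.complete())
```

Because a diode has no return path, the receiver can only report the gaps on
its own side; it cannot ask for a resend.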

I was looking at Longhair <https://github.com/catid/longhair>, and similar
Reed-Solomon approaches, as a method of breaking down arbitrary files and
transmitting them for reconstitution over diodes that may have lossy
behavior in field scenarios.
I also looked a little into transmitting manifests for downstream
reconciliation, but this turned out to be a more complex operation than
would suit a pure NiFi implementation, so I started down the path of
Kafka/Flink as a streaming-reconciliation service, but quickly realised I
was creating a monster without commercial interest :)
Both approaches are more practical for a few large files than for millions
of tiny messages, and if you had very reliable diode transmission the
overhead of ECC/reconciliation may not be worthwhile.
Other implementations I had seen (like ZeroMQ radio/dish or blindFTP)
seemed to talk about provable delivery as a potential requirement, but I
only found the more simplistic 'my network is reliable and any packet loss
is negligible anyway' approaches. I suspect the implementations of these
more robust approaches are reserved for commercial offerings...
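
To illustrate the erasure-coding idea in its simplest form, here is a sketch
using a single XOR parity block. This is a deliberately simpler stand-in,
not Reed-Solomon and not Longhair's API: it can repair only one lost block
per group, and the function names are hypothetical.

```python
def xor_parity(blocks):
    """Return one parity block: the byte-wise XOR of equally sized blocks."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity):
    """Rebuild the single missing block (marked None) from the rest plus parity."""
    gaps = [i for i, b in enumerate(received) if b is None]
    if len(gaps) > 1:
        raise ValueError("one parity block can repair only one lost block")
    restored = list(received)
    if gaps:
        present = [b for b in received if b is not None]
        # XOR of all surviving blocks and the parity yields the lost block.
        restored[gaps[0]] = xor_parity(present + [parity])
    return restored


blocks = [b"diod", b"e-da", b"ta!!"]
parity = xor_parity(blocks)
damaged = [blocks[0], None, blocks[2]]  # block 1 lost in transit
assert recover(damaged, parity) == blocks
```

The sender transmits the data blocks plus the parity block; a real
Reed-Solomon code generalises this to tolerate several losses per group at
the cost of more redundancy and computation.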

Anyway, I appreciate that you may not be able to share more details on
this, but you reminded me how much I enjoyed the investigation when I
looked at it, so I thought I'd say thanks for that.

On Tue, Aug 3, 2021 at 2:55 AM Phil H <[email protected]> wrote:

> Adam, that's true, although if your data size is larger than network
> MTU there can be some disconnect there.
>
> Connection per flow file is pretty slow for sustained high traffic
> flows though (can't recall the establishment times off the top of my
> head, but they are non-trivial).
>
> On Tue, Aug 3, 2021 at 8:39 AM Adam Taft <[email protected]> wrote:
> >
> > Just spitballing a little here. If you set the configuration of the
> > PutTCP processor property "Connection per Flowfile" to 'true' and you
> > leave the "Outgoing Message Delimiter" as blank (none), then I don't
> > think you have the delimiter problem that you both are describing. I
> > could be wrong though?
> >
> > I would consider it a bug if you couldn't send a "raw"
> connection-oriented
> > object over PutTCP.  With that processor, the goal would be to: a) open a
> > socket, b) dump whatever binary you have prepared over it, c) close the
> > socket to signal completion of transfer. If PutTCP doesn't work this way
> > (byte-for-byte), it should probably be flagged as a bug (its original
> > intention was exactly this use case).
> >
> > That being said, I still think custom FlowFile serialization might be
> > something that is outside of the concern of the transport. I personally
> > think serializing/deserializing is a different concern from transport.
> > Arguably, sometimes the semantics of the transport protocol requires you
> to
> > prepare the message itself in a protocol accommodating way (HTTP being an
> > obvious example of this, or packet ordering in Marc's UDP example). But a
> > new JSON flowfile serialization seems like it could be a separate
> > processor, not commingled into an existing one.
> >
> > MergeContent / UnpackContent work in tandem and have a "FlowFile Stream
> > v3" format that can serialize/deserialize multiple flowfiles together
> > into a single byte stream. This allows transport over any protocol,
> > including file-based, socket-based, etc.
> >
> > Marc: Your mention of performance is, of course, appropriate for the
> scale
> > that you're talking about (Gbps). Maybe there's some performance
> > improvements that could be garnered from your work applicable to the
> > "standard" processors I mentioned. And I definitely didn't mean to imply
> > you were doing "anything wrong". Just legitimately curious as to your
> > thought process and design approach.
> >
> > OK, I'll step off a little, because I might be probing too hard here.
> > But I was legitimately curious about the intention of the proposed
> > processor as it relates to the mentioned Diode device.
> >
> > Thanks,
> >
> > Adam
> >
> >
> > On Mon, Aug 2, 2021 at 4:15 PM Phil H <[email protected]> wrote:
> >
> > > Hi Marc,
> > >
> > > Thanks for the additional info.  Just so you know you’re not the only
> > > one, I’ve also had to re-implement a ListenTCP alternative to get
> > > around the byte delimiter issue for binary and multiline text data.
> > >
> > > Phil
> > >
> > >
> > > On Tue, Aug 3, 2021 at 6:59 AM Marc <[email protected]> wrote:
> > > >
> > > > Hi Adam,
> > > >
> > > > More or less it is a ‚merge‘, PutTCP, ListenTCP and unpack. I hope I
> > > > am not wrong, but the NiFi ListenTCP processor uses a delimiter (\n
> > > > by default?). If you are transferring binary data, the processor
> > > > splits the flow into ‚pieces‘. And the attributes are not transferred
> > > > to the destination.
> > > >
> > > > But your idea describes what the processor is doing.
> > > >
> > > > 1. It converts the attributes to a JSON string
> > > > 2. It transfers the JSON string and the payload (there is a header
> > > >    that tells the destination how long the JSON header and the
> > > >    payload are)
> > > > 3. The listener gets the flow and decodes the header (to get the
> > > >    sizes of the JSON header and the payload)
> > > > 4. It writes the payload to a flow
> > > > 5. It parses the JSON string and sets the attributes on the flow
> > > >
> > > > If you do not want to transfer attributes you can configure a
> > > > different decoder. In this case you can just ‚netcat‘ a binary file
> > > > to NiFi.
> > > >
> > > > The UDP version is far more complex. There must be a counter to tell
> > > > the destination what part of the flow file was received (even in a
> > > > diode environment packets are not received in the right order!). And
> > > > you must be fast, very fast. It is a multithreaded architecture
> > > > because one thread cannot receive, decode, and write a gigabit per
> > > > second. I used the Disruptor library: one thread receives a packet,
> > > > another decodes it, and a third writes the content to a flow in the
> > > > right order.
> > > >
> > > > I am still learning (and I am not a professional software
> > > > developer). If I did something wrong or overlooked something, please
> > > > tell me.
> > > >
> > > > Marc
> > > >
> > > > > Am 02.08.2021 um 22:01 schrieb Adam Taft <[email protected]>:
> > > > >
> > > > > Marc,
> > > > >
> > > > > How would this differ from a more generic use of the existing
> > > > > processors, PutTCP/ListenTCP and PutUDP/ListenUDP?  I'm not sure what
> > > > > value is being added above these existing processors, but I'm sure
> > > > > I'm missing something.
> > > > >
> > > > > There's already an ability to serialize flowfiles via
> MergeContent. And
> > > > > there's the deserialize side in UnpackContent. So a dataflow that
> looks
> > > > > like the following would seem a reasonable approach to the problem:
> > > > >
> > > > > MergeContent -> PutTCP -> {diode} -> ListenTCP -> UnpackContent
> > > > >
> > > > > I'm actually very interested in this topic, having a project that
> has
> > > a use
> > > > > case for a "diode". So I'm legitimately asking here, not trying to
> > > derail
> > > > > your work.
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > > Adam
> > > > >
> > > > > On Sun, Aug 1, 2021 at 12:26 PM Marc <[email protected]> wrote:
> > > > >
> > > > >> Greetings,
> > > > >>
> > > > >> there are companies and organizations that strictly separate their
> > > > >> networks for security reasons. Such companies often use diodes to
> > > achieve
> > > > >> this. But of course they still have to exchange data between the
> > > networks
> > > > >> (eg. transfer data from ‚low‘ to ‚high‘). There are at least two
> > > kinds of
> > > > >> diodes. Some hardware-based ones only use one fiber optic to send
> > > data (UDP
> > > > >> based). Others use TCP, but prevent sending in the reverse
> direction.
> > > > >>
> > > > >> Nifi is an amazing tool that allows data to be transferred
> between two
> > > > >> separate networks in a very flexible but also secure way. I have
> > > > >> implemented two processors. The first one ‚merges‘ the attributes
> and
> > > the
> > > > >> content of a flowfile and sends it to the destination. The second
> one
> > > > >> listens on a TCP port, splits attributes and content and creates
> a new
> > > > >> flowfile containing all attributes of the origin flow. You can
> send
> > > the
> > > > >> flow without attributes as well. In this case you can easily
> netcat a
> > > > >> binary file to Nifi.
> > > > >>
> > > > >> These two processors are useful if you do NOT have a bidirectional
> > > > >> communication between two NiFi instances and therefore the
> > > > >> site-to-site mechanism or HTTP(S) cannot be used.
> > > > >>
> > > > >> We have been using these processors for a longer period of time
> > > (exactly
> > > > >> the version for 1.13.2) and would like to share these processors
> with
> > > > >> others. So the question to you all is: Is someone interested in
> these
> > > > >> processors or is this use case too special?
> > > > >>
> > > > >> The current source code can be found on GitHub
> > > > >> (https://github.com/nerdfunk-net/diode/).
> > > > >>
> > > > >> I have also implemented a UDP based version of the processor. Due
> to
> > > the
> > > > >> nature of UDP, this is more complex and these processors are now
> being
> > > > >> tested.
> > > > >>
> > > > >> Best regards
> > > > >> Marc
> > > >
> > >
>
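
The length-prefixed framing Marc describes in the quoted thread (a header
carrying the length of the JSON attributes and the length of the payload)
can be sketched as follows. The header layout here, a 4-byte and an 8-byte
big-endian length, is an assumption for illustration only and is not the
actual wire format of the nerdfunk-net/diode processors.

```python
import json
import struct

# Assumed layout: uint32 JSON length, then uint64 payload length (big-endian).
HEADER = struct.Struct(">IQ")


def encode(attributes: dict, payload: bytes) -> bytes:
    """Frame the attributes (as JSON) and the payload behind a fixed-size header."""
    meta = json.dumps(attributes).encode("utf-8")
    return HEADER.pack(len(meta), len(payload)) + meta + payload


def decode(frame: bytes):
    """Split a frame back into (attributes, payload) using the header lengths."""
    meta_len, payload_len = HEADER.unpack_from(frame)
    start = HEADER.size
    end = start + meta_len + payload_len
    if len(frame) < end:
        raise ValueError("truncated frame")
    attributes = json.loads(frame[start:start + meta_len].decode("utf-8"))
    payload = frame[start + meta_len:end]
    return attributes, payload


frame = encode({"filename": "report.bin"}, b"\x00\x01\nbinary\n")
attrs, body = decode(frame)
assert attrs == {"filename": "report.bin"}
assert body == b"\x00\x01\nbinary\n"
```

Because the receiver reads exact byte counts instead of scanning for a
delimiter, newlines inside binary payloads do not split the flow.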
