Joe,

My concern is that the record reading and writing as it stands isn't as
clear as it could be, and this could make it worse. I personally found it
a little difficult to understand how some of the record-processing
processors worked.

That aside, however, I think that if a "flow level"/Process Group setting
for compression were added, it could work as a general solution. What I'm
thinking here is that as content leaves a processor it is checked to see
whether it is already compressed: if it isn't, it is compressed on the way
to the content repo; if it is, it is left alone. On the reverse path, once
content is read from the content repo it is intercepted again and
decompressed as it is loaded into the processor. There would potentially
need to be a flag a processor can set to indicate to the core that its
input shouldn't be decompressed.
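
To make that concrete, here is a rough sketch of the kind of hook I mean
(the names are hypothetical, gzip stands in for whatever formats we'd
support, and sniffing the leading bytes of outgoing content is glossed
over):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.PushbackInputStream;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Sketch of a hook the core could apply as content moves between a
    // processor and the content repository. Hypothetical, not NiFi code.
    public final class TransparentCompression {

        // On the way to the content repo: compress unless the leading
        // bytes show the content is already gzip (magic 0x1f 0x8b).
        public static OutputStream onWrite(OutputStream toRepo,
                                           byte[] leadingBytes) throws IOException {
            return isGzip(leadingBytes) ? toRepo : new GZIPOutputStream(toRepo);
        }

        // On the way out of the repo: decompress, unless the processor
        // has flagged to the core that it wants the raw stored bytes.
        public static InputStream onRead(InputStream fromRepo,
                                         boolean processorWantsRaw) throws IOException {
            if (processorWantsRaw) {
                return fromRepo;
            }
            PushbackInputStream in = new PushbackInputStream(fromRepo, 2);
            byte[] magic = new byte[2];
            int n = in.read(magic);
            if (n > 0) {
                in.unread(magic, 0, n); // put the sniffed bytes back
            }
            return (n == 2 && isGzip(magic)) ? new GZIPInputStream(in) : in;
        }

        private static boolean isGzip(byte[] b) {
            return b != null && b.length >= 2
                    && (b[0] & 0xff) == 0x1f && (b[1] & 0xff) == 0x8b;
        }
    }

Obviously the real thing would need to handle more than gzip, which is
where detecting the format comes in.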

As for handling the compression algorithms, maybe the plugin-discovery
functionality used for repo implementations could be extended to silently
detect compression formats and algorithms?
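
Something like the standard java.util.ServiceLoader mechanism, say (the
interface below is purely hypothetical):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.ServiceLoader;

    // Hypothetical SPI for pluggable codecs, discovered the same way repo
    // implementations are. Implementations would be listed in
    // META-INF/services files on the classpath.
    public interface CompressionCodec {

        // True if the leading bytes match this codec's magic number.
        boolean matches(byte[] leadingBytes);

        InputStream decompress(InputStream in) throws IOException;

        OutputStream compress(OutputStream out) throws IOException;

        // Silently pick whichever registered codec recognises the
        // content, or null if none do (i.e. it is not compressed).
        static CompressionCodec detect(byte[] leadingBytes) {
            for (CompressionCodec codec : ServiceLoader.load(CompressionCodec.class)) {
                if (codec.matches(leadingBytes)) {
                    return codec;
                }
            }
            return null;
        }
    }

Supporting a new format would then just be a matter of dropping in a new
implementation.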

I think it would work for record and non-record data, be it text or binary.

Edward


On Tue, Jul 30, 2019 at 5:42 PM Joe Witt <[email protected]> wrote:

> Edward,
>
> I like your point/comment regarding separation of concerns/cohesion.  I
> think we could/should consider automatically decompressing data on the
> fly for processors in general, in the event we know a given set of data
> is compressed but is being accessed for plaintext purposes.  For general
> block compression types this is probably fair game and could be quite
> compelling, particularly to avoid the extra read/write/content repo hits
> involved.
>
> That said, for the case of record readers/writers I'm not sure we can
> avoid having a specific solution.  Some compression types can be
> concatenated together and some cannot.  Some record types would remain
> tolerant/still valid and some would not.
>
> Thanks
> Joe
>
> On Tue, Jul 30, 2019 at 12:34 PM Edward Armes <[email protected]>
> wrote:
>
> > So while I agree with this in principle, and it is a good idea on
> > paper, my concern is that it starts to add a bolt-on bloat problem. The
> > Nifi processors as they stand do, in general, follow the Unix
> > philosophy (do one thing, and do it well). My worry is that while it
> > might just be a case of adding a wrapper here, it then becomes an ask
> > to add the same wrapper to other processors for similar functionality,
> > and so on. This does start to cause a technical debt problem and also
> > potentially a detrimental experience for the user. Some of this I
> > mentioned in the previous thread about re-structuring the Nifi core.
> >
> > The reason I suggest doing it either at the repo level, or as the
> > InputStream is handed over to the processor from the core, is that it
> > adds a global piece of functionality that every processor handling data
> > that compresses well could benefit from. Now ideally it would be nice
> > to see it as a "per-flow" setting, but I suspect that would add more
> > complexity than is actually needed.
> >
> > I have seen an issue where, over time, the content repo took up quite a
> > chunk of disk for a multi-tenanted cluster that performed lots of small
> > changes on lots of FlowFiles. While those hosts were under-resourced,
> > being able to compress the content, trading that off against the speed
> > of data through the flow, might have helped that situation quite a bit.
> >
> > Edward
> >
> > On Tue, Jul 30, 2019 at 4:21 PM Joe Witt <[email protected]> wrote:
> >
> > > Malthe
> > >
> > > I do see value in having the Record readers/writers understand and
> > > handle compression directly, as it will avoid the extra disk hit of
> > > decompress/read/compress cycles using the existing processors, and
> > > further there are cases where the compression is record-specific and
> > > not just holistic block compression.
> > >
> > > I think Koji offered a great description of how to start thinking about
> > > this.
> > >
> > > Thanks
> > >
> > > On Tue, Jul 30, 2019 at 10:47 AM Malthe <[email protected]> wrote:
> > >
> > > > In reference to NIFI-6496 [1], I'd like to open a discussion on
> > > > adding compression support to flow files such that a processor such
> > > > as `CompressContent` might function in a streaming or "lazy" mode.
> > > >
> > > > Context, more details and initial feedback can be found in the
> > > > ticket referenced below as well as in a related SO entry [2].
> > > >
> > > > [1] https://issues.apache.org/jira/browse/NIFI-6496
> > > > [2] https://stackoverflow.com/questions/57005564/using-convertrecord-on-compressed-input
