Hi Martin,

So sorry about the late response to this email. I was on leave for all of
August and haven't had a chance to get back to this.

I don't think any existing operators are suitable to me that I know of. I'm
using the ContinuousFileReaderOperator to read large files as it was
designed for, but the trigger to consume a file is a Kafka message
containing the path and formatting instructions. We have some operators
that eventually create an extended TimestampedFileInputSplit that we send
to the Reader.

I'm beginning to believe my best bet is to build a custom alternative to
the ContinuousFileReaderOperator using the MailboxExecutor. There is one
limitation though, the MailboxExecutor interface doesn't expose a way to
explicitly yield to the Default Action of the task. I think I would need to
also copy casting MailboxExecutor to MailboxExecutorImpl to gain access to
isIdle().

Darin

On Thu, Jul 18, 2024 at 10:40 AM Martijn Visser <martijnvis...@apache.org>
wrote:

> Hi Darin,
>
> Your comment "I think our use case is rather unique and I'm not sure who
> else would benefit" makes me think then why we should actually support it
> in Flink, given that you're mentioning the use case is rather unique?
> Especially since the SourceFunction has many flaws, and has been long
> overdue to be cleaned up in favor of the new Source API.
>
> The logical use case I see to read from something in the middle of the
> stream is when you want to perform lookups, which is where I think we
> should have different solutions in place already. Isn't that sufficient for
> your use case?
>
> Best regards,
>
> Martijn
>
>
> On Wed, Jul 17, 2024 at 5:21 PM Darin Amos <darin.a...@instacart.com
> .invalid>
> wrote:
>
> > I wanted to bump this request from last year. As I understand it the new
> > FileSource is not suitable for my existing use case.
> >
> > The TLDR is that we have extensively customized/extended the file splits
> > and input formats so a single ContinuousFileReaderOperator can handle
> > non-homogenous files and acts as a source in the middle of our stream.
> The
> > alternative is we will re-implement this operator on our own, which seems
> > like a reasonable alternative so long as access to the MailboxExecutor
> > won't also be deprecated.
> >
> > If needed, I can create this as a ticket.
> >
> > Cheers
> >
> > Darin
> >
> > On Tue, Nov 21, 2023 at 1:30 PM Darin Amos <darin.a...@instacart.com>
> > wrote:
> >
> > > Hi All!
> > >
> > > I posted on the community slack channel and was referred to this
> mailing
> > > list. I think it would be helpful if the ContinuousFileReaderOperator
> was
> > > made a public class and not removed in Flink 2.0 (or to have an
> > equivalent
> > > created). I have a use case for it where FileSource isn't sufficient,
> at
> > > least not to my knowledge.
> > >
> > > I think our use case is rather unique and I'm not sure who else would
> > > benefit. Essentially this operator acts as a source in the middle of
> our
> > > stream. Our application processes non-homogenous files which are
> > generally,
> > > but not limited to, CSV files. In our case each CSV file has varying
> > > headers (both values and number of header), delimiters and quote
> > > characters.
> > >
> > > Our application will receive a Kafka message with sufficient metadata
> to
> > > parse a file (path, delimiter, quote char - configured by supplier) and
> > > uses an Async operator to pre-download the headers. Afterwards we are
> > able
> > > to generate custom file splits (which contain the parsing instructions
> > and
> > > headers) paired with custom format class to create name-value-pair
> > records
> > > with the ContinuousFileReaderOperator.
> > >
> > > I'm more than happy to share more details about our customization if
> > > required.
> > >
> > > Thanks!
> > >
> > > Darin Amos
> > >
> > >
> > >
> >
>

Reply via email to