Hi Seth,

First of all, thanks for sharing your use case and your plan to contribute
to Flink.

I guess processing files as soon as they are completed is a very common use
case.
Publishing the paths of completed files as a data stream sounds like a nice
idea.
IMO, it would be great to have this feature for the BucketingSink.

Would you mind opening a JIRA for this feature?

Thank you,
Fabian



2017-01-24 18:06 GMT+01:00 Seth Wiesman <swies...@mediamath.com>:

> Hi all,
>
>
>
> I have been working with Flink at work for a while now, and in that time I
> have developed several extensions that I would like to contribute back. I
> wanted to reach out with what has been the most significant modification
> for us and see if it is something the community would be interested in.
>
>
>
>
>
> RichFsSinkFunction
>
>
>
> Currently my pipeline uses the BucketingSink to write out files to S3,
> which are then consumed by other processes. Outputting data to a file
> system is fundamentally different from outputting to something such as
> Kafka, because data can only be consumed once at the file level. The
> BucketingSink moves files through three states: in-progress, pending, and
> complete. If you want to maintain exactly-once guarantees, external
> services should only consume a file once it reaches the complete state.
> One option is to write a _SUCCESS file to a bucket once all files in that
> bucket are done, but that can be difficult to coordinate or may take a
> prohibitively long time; in the case of the BasePathBucketer it will never
> happen. For my use case it is important for external services to be able
> to consume files as soon as they become available. To solve this, we
> modified the BucketingSink so that it is not the end of the pipeline;
> instead, once files reach their final state, it forwards their paths on to
> a final operator.
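>
> A minimal sketch of the extension point this implies (purely
> illustrative; such a listener is not part of the current BucketingSink
> API):
>
>     // Hypothetical callback, invoked by the sink when a part file
>     // moves from the pending state to its final, complete name.
>     public interface CompletedFileListener extends java.io.Serializable {
>         void onFileCompleted(org.apache.hadoop.fs.Path completedFile);
>     }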
>
>
>
> I do not want to fundamentally change the concept of what a sink is; it
> should remain the end of the pipeline. Instead, this simply allows a
> custom ‘onClose’ step: paths can only be forwarded on to an operator of
> parallelism one whose only operation is to add a sink. From there, other
> services can be notified that completed files exist, as in the sketch
> below.
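>
> To make the wiring concrete, here is a hedged sketch of what the
> pipeline could look like (applyBucketingSinkWithPathOutput and
> ManifestAndCopySink are made-up names for illustration; today addSink
> terminates the pipeline):
>
>     DataStream<String> events = ...; // the existing pipeline
>     // Illustrative: a modified sink that, instead of terminating the
>     // stream, also emits the final path of each completed file.
>     DataStream<String> completedPaths =
>         applyBucketingSinkWithPathOutput(events, "s3://bucket/output");
>     completedPaths
>         .addSink(new ManifestAndCopySink()) // user-defined final step
>         .setParallelism(1);                 // single notifier instance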
>
>
>
> To provide a motivating example, after writing files to S3 I need to load
> them into a Redshift cluster. To do this I batch completed files over 1
> minute of processing time, write out a manifest file (a list of completed
> files to load), and run a COPY command:
> http://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
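>
> As a hedged sketch of that final step (the class, table, and IAM role
> names are illustrative; the manifest JSON layout and the MANIFEST option
> of COPY are as documented by Redshift):
>
>     import java.sql.Connection;
>     import java.sql.Statement;
>     import java.util.List;
>     import java.util.stream.Collectors;
>
>     public class RedshiftManifestLoader {
>
>         // Build the manifest JSON Redshift expects:
>         // {"entries": [{"url": "s3://...", "mandatory": true}, ...]}
>         static String buildManifest(List<String> s3Paths) {
>             return s3Paths.stream()
>                 .map(p -> "{\"url\": \"" + p + "\", \"mandatory\": true}")
>                 .collect(Collectors.joining(",\n  ",
>                     "{\"entries\": [\n  ", "\n]}"));
>         }
>
>         // After uploading the manifest to S3 (not shown here),
>         // issue the COPY against the cluster over JDBC.
>         static void copyFromManifest(Connection conn, String manifestUrl)
>                 throws Exception {
>             try (Statement stmt = conn.createStatement()) {
>                 stmt.execute(
>                     "COPY events FROM '" + manifestUrl + "' "
>                     + "IAM_ROLE 'arn:aws:iam::123456789012:role/load' "
>                     + "MANIFEST");
>             }
>         }
>     }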
>
>
>
> Below is an example gist of what this looks like in use.
>
>
>
> Example: https://gist.github.com/sjwiesman/fc99c64f44a93cfc9c7aa62c070a9358
>
>
>
>
> Seth Wiesman | Data Engineer
>
> 4 World Trade Center, 45th Floor, New York, NY 10007
>
>
>
