Hi all,

I have been working with Flink at work for a while now, and in that time I have
developed several extensions that I would like to contribute back. I wanted to
reach out with what has been the most significant modification for us and see
if it is something that the community would be interested in.


RichFsSinkFunction

Currently my pipeline uses the BucketingSink to write out files to S3, which
are then consumed by other processes. Outputting data to a file system is
fundamentally different from outputting to something such as Kafka, because
data can only be consumed once, at the file level. The BucketingSink moves
files through three phases: in-progress, pending, and complete. If you are
interested in maintaining exactly-once guarantees, then you only want your
external services to consume a file once it reaches the complete state. One
option is to write a _SUCCESS file to a bucket once all files in that bucket
are done, but that can be difficult to coordinate or may take a prohibitively
long time, and in the case of the BasePathBucketer it will never happen. For
my use case it is important for external services to be able to consume files
as soon as they become available. To solve this, we modified the bucketing
sink so that it is not the end of the pipeline; instead, once files reach
their final state, it forwards their final paths on to a downstream operator.

I do not want to fundamentally change the concept of what a sink is; it should
remain the end of the pipeline. Instead, this simply allows a custom ‘onClose’
step. To do this, paths can only be forwarded on to an operator of parallelism
one whose only operation is to add a sink. From there, other services can be
notified that completed files exist.
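
To make the shape of this concrete, here is a rough sketch of what that
downstream notification step could look like, written against the existing
RichSinkFunction interface. This is not the code from the gist linked below;
the class name CompletedFileNotifier and the completedFilePaths stream are
placeholders, and that stream of final paths is exactly the piece the modified
bucketing sink would provide.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/**
 * Hypothetical notification step. Each incoming element is the final path of
 * a file that the bucketing sink has moved into its complete state.
 */
public class CompletedFileNotifier extends RichSinkFunction<String> {

    @Override
    public void open(Configuration parameters) throws Exception {
        // Set up a client for whatever external service needs to know that a
        // new file is ready to consume (SQS, a REST endpoint, a database, ...).
    }

    @Override
    public void invoke(String completedFilePath) throws Exception {
        // Notify the external service that this file is now safe to consume.
    }
}

// Wiring sketch: completedFilePaths would be the stream of final paths
// forwarded by the modified bucketing sink; it does not exist in Flink today.
//
// completedFilePaths
//     .addSink(new CompletedFileNotifier())
//     .setParallelism(1);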

To provide a motivating example, after writing files to S3 I need to load them
into a Redshift cluster. To do this I batch completed files for 1 minute of
processing time, write out a manifest file (a list of completed files to
load), and run a COPY command:
http://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html
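
For illustration, a minimal sketch of that loading step could look roughly
like the following, assuming the completed paths have already been gathered
into one-minute batches (for example with a processing-time window) before
reaching this sink. The table name, JDBC URL, credentials, and the S3 upload
of the manifest are placeholders, and COPY options for the actual data format
are omitted; the manifest layout and the MANIFEST keyword come from the
Redshift documentation above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

/**
 * Hypothetical Redshift loader: receives one batch of completed S3 paths per
 * window, writes a manifest listing those files, and issues a COPY command
 * that loads exactly the files named in the manifest.
 */
public class RedshiftManifestCopySink extends RichSinkFunction<List<String>> {

    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Requires the Redshift (or PostgreSQL) JDBC driver on the classpath.
        connection = DriverManager.getConnection(
            "jdbc:redshift://example-cluster:5439/mydb", "user", "password");
    }

    @Override
    public void invoke(List<String> completedFiles) throws Exception {
        // Build the manifest document described in the Redshift docs above.
        StringBuilder manifest = new StringBuilder("{\"entries\": [");
        for (int i = 0; i < completedFiles.size(); i++) {
            if (i > 0) {
                manifest.append(", ");
            }
            manifest.append("{\"url\": \"")
                    .append(completedFiles.get(i))
                    .append("\", \"mandatory\": true}");
        }
        manifest.append("]}");

        // Upload the manifest to S3 next to the data, e.g. with the AWS SDK.
        String manifestPath =
            "s3://my-bucket/manifests/" + System.currentTimeMillis() + ".json";
        // putObject(manifestPath, manifest.toString());   <- omitted here

        // Tell Redshift to load exactly the files listed in the manifest.
        try (Statement stmt = connection.createStatement()) {
            stmt.execute(
                "COPY my_table FROM '" + manifestPath + "' "
                    + "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-role' "
                    + "MANIFEST");
        }
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}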

Below is an example gist of what this looks like to use.

Example: https://gist.github.com/sjwiesman/fc99c64f44a93cfc9c7aa62c070a9358


Seth Wiesman | Data Engineer

4 World Trade Center, 45th Floor, New York, NY 10007


