Hi Eranga and Kien,

Flink has supported asynchronous I/O since version 1.2; see [1] for details.
You basically pack your URL download into the asynchronous part and collect
the resulting string for further processing in your pipeline.

Nico

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html

On Monday, 14 August 2017 17:50:47 CEST Kien Truong wrote:
> Hi,
>
> While this task is quite trivial to do with the Flink DataSet API, using
> readTextFile to read the input and a flatMap function to perform the
> downloading, it might not be a good idea.
>
> The download process is I/O bound and will block the synchronous flatMap
> function, so the throughput will not be very good.
>
> Until Flink supports asynchronous functions, I suggest you look elsewhere.
>
> An example of a master-workers architecture using Akka can be found here:
>
> https://github.com/typesafehub/activator-akka-distributed-workers
>
> Regards,
>
> Kien
>
> On 8/14/2017 10:09 AM, Eranga Heshan wrote:
> > Hi all,
> >
> > I am fairly new to Flink. I have a project where I have a list of
> > URLs (on one node) which need to be crawled in a distributed fashion.
> > Then, for each URL, I need the serialized crawled result to be written
> > to a single text file.
> >
> > I want to know if there are similar projects which I can look into, or
> > an idea on how to implement this.
> >
> > Thanks & Regards,
> >
> > Eranga Heshan
> > /Undergraduate/
> > Computer Science & Engineering
> > University of Moratuwa
> > Mobile: +94 71 138 2686
> > Email: eranga....@gmail.com
> > <https://www.facebook.com/erangaheshan>
> > <https://twitter.com/erangaheshan>
> > <https://www.linkedin.com/in/erangaheshan>
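
To make the asynchronous idea from the top of this message a bit more concrete, here is a rough sketch in plain Java. It is not Flink code: the class AsyncCrawlSketch, the helper crawlAll and the stand-in downloader are hypothetical names chosen for illustration, and a CompletableFuture on a thread pool stands in for the asynchronous part. In Flink, the download call would instead sit inside an AsyncFunction handed to AsyncDataStream, as described in [1].

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

class AsyncCrawlSketch {

    // Hypothetical helper: starts one download per URL without waiting for
    // the previous one to finish, then gathers the results in input order.
    public static List<String> crawlAll(List<String> urls,
                                        Function<String, String> download,
                                        int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Kick off all downloads; each future completes when its download returns.
            List<CompletableFuture<String>> pending = urls.stream()
                .map(url -> CompletableFuture
                    .supplyAsync(() -> download.apply(url), pool)
                    // A failed download becomes an error marker instead of
                    // failing the whole batch.
                    .exceptionally(err -> "ERROR " + url))
                .collect(Collectors.toList());

            // Collect the resulting strings for further processing.
            return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in downloader so the sketch runs offline; a real one would
        // issue an HTTP request and return the response body.
        Function<String, String> fakeDownload = url -> "<html>" + url + "</html>";
        List<String> pages = crawlAll(
            Arrays.asList("http://a.example", "http://b.example"), fakeDownload, 4);
        pages.forEach(System.out::println);
    }
}
```

The point of the pattern is that a slow download only occupies a pool thread, not the operator that feeds it, which is what a blocking flatMap cannot offer.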