Re: Distribute crawling of a URL list using Flink

2017-08-26 Thread Eranga Heshan
Thank you Aljoscha :-) I actually need it for a Kafka stream, so I use the DataStream API anyway. Regards, Eranga Heshan *Undergraduate* Computer Science & Engineering University of Moratuwa Mobile: +94 71 138 2686 Email: eranga@gmail.com

Re: Distribute crawling of a URL list using Flink

2017-08-25 Thread Aljoscha Krettek
Hi, It is not available for the Batch API; you would have to use the DataStream API. Best, Aljoscha > On 15. Aug 2017, at 01:16, Kien Truong wrote: > > Hi, > > Admittedly, I have not suggested this because I thought it was not available > for the batch API. > > Regards, > Kien > On Aug 15,

Re: Distribute crawling of a URL list using Flink

2017-08-14 Thread Kien Truong
Hi, Admittedly, I have not suggested this because I thought it was not available for the batch API. Regards, Kien On Aug 15, 2017, at 00:06, Nico Kruber wrote: >Hi Eranga and Kien, >Flink has supported asynchronous I/O since version 1.2, see [1] for details. > >You basically pack your URL downlo

Re: Distribute crawling of a URL list using Flink

2017-08-14 Thread Eranga Heshan
Thanks for your quick replies, Nico and Kien. Since I am using Flink-1.3.0, I will try Nico's idea. I might bug you again for my future problems. 😊 Regards, Eranga Heshan *Undergraduate* Computer Science & Engineering University of Moratuwa Mobile: +94 71 138 2686 Ema

Re: Distribute crawling of a URL list using Flink

2017-08-14 Thread Nico Kruber
Hi Eranga and Kien, Flink has supported asynchronous I/O since version 1.2, see [1] for details. You basically pack your URL download into the asynchronous part and collect the resulting string for further processing in your pipeline. Nico [1] https://ci.apache.org/projects/flink/flink-docs-releas
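
The idea above (keep many downloads in flight instead of blocking on one at a time) can be sketched with the JDK alone. This is not Flink's async I/O operator, whose API is not shown in this thread; it only illustrates the principle. The class name AsyncDownloadSketch, the helper fakeDownload (which sleeps instead of performing a real HTTP request), and the example URLs are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class AsyncDownloadSketch {

    // Hypothetical stand-in for a real download: sleeps to simulate I/O latency.
    static String fakeDownload(String url) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "content of " + url;
    }

    // Blocking variant: each download must finish before the next one starts,
    // which is what happens inside a synchronous flatMap.
    static List<String> downloadBlocking(List<String> urls) {
        return urls.stream()
                .map(AsyncDownloadSketch::fakeDownload)
                .collect(Collectors.toList());
    }

    // Asynchronous variant: submit every download first, then wait for the results.
    // At most maxInFlight downloads run concurrently; results keep the input order.
    static List<String> downloadAsync(List<String> urls, int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
        try {
            List<CompletableFuture<String>> futures = urls.stream()
                    .map(u -> CompletableFuture.supplyAsync(() -> fakeDownload(u), pool))
                    .collect(Collectors.toList());
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical URLs; nothing is actually fetched.
        List<String> urls = Arrays.asList(
                "http://example.org/a", "http://example.org/b",
                "http://example.org/c", "http://example.org/d");

        long t0 = System.nanoTime();
        downloadBlocking(urls);
        long t1 = System.nanoTime();
        downloadAsync(urls, 4);
        long t2 = System.nanoTime();

        System.out.println("blocking ms: " + (t1 - t0) / 1_000_000);
        System.out.println("async ms:    " + (t2 - t1) / 1_000_000);
    }
}
```

With simulated latency the blocking variant takes roughly the sum of all waits, while the asynchronous variant overlaps them. In a real job the same trade-off applies per parallel operator instance, and the number of requests in flight would need a bound, as maxInFlight does here.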

Re: Distribute crawling of a URL list using Flink

2017-08-14 Thread Kien Truong
Hi, While this task is quite trivial to do with the Flink DataSet API, using readTextFile to read the input and a flatMap function to perform the downloading, it might not be a good idea. The download process is I/O bound, and will block the synchronous flatMap function, so the throughput will

Distribute crawling of a URL list using Flink

2017-08-13 Thread Eranga Heshan
Hi all, I am fairly new to Flink. I have this project where I have a list of URLs (on one node) which need to be crawled in a distributed fashion. Then for each URL, I need the serialized crawled result to be written to a single text file. I want to know if there are similar projects which I can look into o