Re: Flink s3 streaming performance

2020-06-06 Thread venkata sateesh` kolluru
Thanks Arvid! Will try to increase the property you recommended and will post the update. On Sat, Jun 6, 2020, 7:33 AM Arvid Heise wrote: > Hi Venkata, > > you can find them on the Hadoop AWS page (we are just using it as a > library) [1]. > > [1] > https://hadoop.apache.org/docs/current/hadoop

Re: Flink s3 streaming performance

2020-06-06 Thread Arvid Heise
Hi Venkata, you can find them on the Hadoop AWS page (we are just using it as a library) [1]. [1] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration

Re: Flink s3 streaming performance

2020-06-05 Thread venkata sateesh` kolluru
Hi Kostas and Arvid, Thanks for your suggestions. The small files were already created, and I am trying to roll a few of them into one big file while sinking. But because of the custom bucket assigner, it is hard to get more files within the same prefix in the specified checkpointing time. For example: /prefix1/

Re: Flink s3 streaming performance

2020-06-05 Thread Kostas Kloudas
Hi all, @Venkata, Do you have many small files being created as Arvid suggested? If yes, then I tend to agree that S3 is probably not the best sink. Although I did not get that from your description. In addition, instead of PrintStream you can have a look at the code of the SimpleStringEncoder in

Re: Flink s3 streaming performance

2020-06-05 Thread Arvid Heise
Hi Venkata, are the many small files intended or is it rather an issue of our commit on checkpointing? If so then FLINK-11499 [1] should help you. Design is close to done, unfortunately implementation will not make it into 1.11. In any case, I'd look at the parameter fs.s3a.connection.maximum, as

Re: Flink s3 streaming performance

2020-06-01 Thread Jörn Franke
I think S3 is the wrong storage backend for this volume of small messages. Try a NoSQL database, or write multiple messages into one file in S3 (1 or 10). If you still want to go with your scenario, then try a network-optimized instance, use s3a in Flink, and configure S3 entropy.

Re: Flink s3 streaming performance

2020-05-31 Thread venkata sateesh` kolluru
Hi David, The average size of each file is around 30 KB, and I have a checkpoint interval of 5 minutes. Some files are as small as 1 KB; because of checkpointing, some files are merged into one big file of around 300 MB. With 120 million files and 4 TB, if the rate of transfer is 300 per minute, it is taking weeks to writ
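The figures in this message can be checked with a rough back-of-the-envelope calculation. The sketch below assumes that "300 per minute" means 300 files per minute at a constant rate, that each file is written as one request, and that 4 TB means 4 * 10^12 bytes; none of this is stated explicitly in the thread, and the function names are made up for illustration.

```python
# Rough estimate only. Assumptions (not stated in the thread):
# constant rate of 300 files per minute, one request per file, 1 TB = 10**12 bytes.

TOTAL_FILES = 120_000_000
TOTAL_BYTES = 4 * 10**12
FILES_PER_MINUTE = 300


def minutes_to_write(total_files: int, files_per_minute: float) -> float:
    """Minutes needed to write all files at a constant rate."""
    return total_files / files_per_minute


def average_file_size_kb(total_bytes: int, total_files: int) -> float:
    """Average file size in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / total_files / 1000


minutes = minutes_to_write(TOTAL_FILES, FILES_PER_MINUTE)
days = minutes / (60 * 24)
weeks = days / 7

print(f"average file size: {average_file_size_kb(TOTAL_BYTES, TOTAL_FILES):.1f} KB")
print(f"time at {FILES_PER_MINUTE} files/min: {days:.1f} days ({weeks:.1f} weeks)")
```

Under these assumptions the average file size comes out near 33 KB, consistent with the "around 30 KB" quoted above, and the full transfer would take roughly 278 days, or about 40 weeks.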

Re: Flink s3 streaming performance

2020-05-31 Thread David Magalhães
Hi Venkata. 300 requests per minute works out to about 200 ms per request, which should be a normal response time for sending a file if there isn't any speed limitation (how big are the files?). Have you raised the parallelism above 1? I also recommend limiting the source parallelism, be
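The 200 ms figure follows from dividing one minute by the request count. A minimal sketch of that arithmetic is shown here; it assumes the requests are issued one after another by each writer, and the `parallelism` parameter and function name are hypothetical, added only to show how the per-request time budget changes if several writers share the same overall rate.

```python
def avg_seconds_per_request(requests_per_minute: float, parallelism: int = 1) -> float:
    """Average time each sequential writer spends per request.

    Assumes every writer issues its requests strictly one after another,
    so `parallelism` writers together reach `requests_per_minute`.
    """
    return 60.0 * parallelism / requests_per_minute


# One writer at 300 requests per minute: 60 s / 300 = 0.2 s per request.
single = avg_seconds_per_request(300)
print(f"parallelism 1: {single * 1000:.0f} ms per request")

# Hypothetical: if 4 writers together only reached 300 requests per minute,
# each request would be taking 0.8 s, which would point to a bottleneck.
shared = avg_seconds_per_request(300, parallelism=4)
print(f"parallelism 4: {shared * 1000:.0f} ms per request")
```

Read the other way around, a single sequential writer that needs about 200 ms per request cannot exceed roughly 300 requests per minute, so total throughput depends on how many writers run in parallel.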

Flink s3 streaming performance

2020-05-30 Thread venkata sateesh` kolluru
Hello, I posted the same question on Stack Overflow but didn't get any response, so I am posting it here for help. https://stackoverflow.com/questions/62068787/flink-s3-write-performance-optimization?noredirect=1#comment109814428_62068787 Details: I am working on a Flink application on Kubernetes (EKS) whi