Re: Integrate Flink with S3 on EMR cluster

2017-03-10 Thread Robert Metzger
Hi Vinay, using the HADOOP_CLASSPATH variable on the client machine is the recommended way to solve this problem. I'll update the documentation accordingly. On Wed, Mar 8, 2017 at 10:26 AM, vinay patil wrote: > Hi , > > @Shannon - I am not facing any issue while writing to S3, was getting > N

Re: Integrate Flink with S3 on EMR cluster

2017-03-08 Thread vinay patil
Hi , @Shannon - I am not facing any issue while writing to S3, was getting NoClassDef errors when reading the file from S3. ''Hadoop File System" - I mean I am using FileSystem class of Hadoop to read the file from S3. @Stephan - I tried with 1.1.4 , was getting the same issue. The easiest way

Re: Integrate Flink with S3 on EMR cluster

2017-03-07 Thread Stephan Ewen
@vinay patil - Can you see if the same problem occurs if you use Flink 1.1 - to see if this is a regression in Flink 1.2? On Tue, Mar 7, 2017 at 6:43 PM, Shannon Carey wrote: > Generally, using S3 filesystem in EMR with Flink has worked pretty well > for me in Flink < 1.2 (unless you run out o

Re: Integrate Flink with S3 on EMR cluster

2017-03-07 Thread Shannon Carey
Generally, using S3 filesystem in EMR with Flink has worked pretty well for me in Flink < 1.2 (unless you run out of connections in your HTTP pool). When you say, "using Hadoop File System class", what do you mean? In my experience, it's sufficient to just use the "s3://" filesystem protocol and

Re: Integrate Flink with S3 on EMR cluster

2017-03-07 Thread vinay patil
Hi Guys, Has anyone got this error before ? If yes, have you found any other solution apart from copying the jar files to flink lib folder Regards, Vinay Patil On Mon, Mar 6, 2017 at 8:21 PM, vinay patil [via Apache Flink User Mailing List archive.] wrote: > Hi Guys, > > I am getting the same

Re: Integrate Flink with S3 on EMR cluster

2017-03-06 Thread vinay patil
Hi Guys, I am getting the same exception: EMRFileSystem not Found I am trying to read encrypted S3 file using Hadoop File System class. (using Flink 1.2.0) When I copy all the libs from /usr/share/aws/emrfs/lib and /usr/lib/hadoop to Flink lib folder , it works. However I see that all these lib

Re: Integrate Flink with S3 on EMR cluster

2016-04-10 Thread Stephan Ewen
You can always explicitly request a broadcast join, via "joinWithTiny", "joinWithHuge", or by supplying a JoinHint. Greetings, Stephan On Sat, Apr 9, 2016 at 1:56 AM, Timur Fayruzov wrote: > Thank you Robert. One of my test cases is broadcast join, so I need to > make statistics work. The only

Re: Integrate Flink with S3 on EMR cluster

2016-04-08 Thread Timur Fayruzov
Thank you Robert. One of my test cases is broadcast join, so I need to make statistics work. The only workaround I have found so far is to copy the contents of /usr/share/aws/emr/emrfs/lib/, /usr/share/aws/aws-java-sdk/ and /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp-2.2.0.jar to flink/lib. Puttin

Re: Integrate Flink with S3 on EMR cluster

2016-04-08 Thread Robert Metzger
Hi Timur, the Flink optimizer runs on the client, so the exception is thrown from the JVM running the ./bin/flink client. Since the statistics sampling is an optional step, its surrounded by a try / catch block that just logs the error message. More answers inline below On Thu, Apr 7, 2016 at 1

Re: Integrate Flink with S3 on EMR cluster

2016-04-07 Thread Timur Fayruzov
The exception does not show up in the console when I run the job, it only shows in the logs. I thought it means that it happens either on AM or TM (I assume what I see in stdout is client log). Is my thinking right? On Thu, Apr 7, 2016 at 12:29 PM, Ufuk Celebi wrote: > Hey Timur, > > Just had a

Re: Integrate Flink with S3 on EMR cluster

2016-04-07 Thread Ufuk Celebi
Hey Timur, Just had a chat with Robert about this. I agree that the error message is confusing, but it is fine it this case. The file system classes are not on the class path of the client process, which is submitting the job. It fails to sample the input file sizes, but this is just an optimizati

Re: Integrate Flink with S3 on EMR cluster

2016-04-07 Thread Timur Fayruzov
There's one more filesystem integration failure that I have found. My job on a toy dataset succeeds, but Flink log contains the following message: 2016-04-07 18:10:01,339 ERROR org.apache.flink.api.common.io.DelimitedInputFormat - Unexpected problen while getting the file statistics for f

Re: Integrate Flink with S3 on EMR cluster

2016-04-06 Thread Ufuk Celebi
Yes, for sure. I added some documentation for AWS here: https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html Would be nice to update that page with your pull request. :-) – Ufuk On Wed, Apr 6, 2016 at 4:58 AM, Chiwan Park wrote: > Hi Timur, > > Great! Bootstrap action fo

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Chiwan Park
Hi Timur, Great! Bootstrap action for Flink is good for AWS users. I think the bootstrap action scripts would be placed in `flink-contrib` directory. If you want, one of people in PMC of Flink will be assign FLINK-1337 to you. Regards, Chiwan Park > On Apr 6, 2016, at 3:36 AM, Timur Fayruzov

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Timur Fayruzov
Yes, Hadoop version was the culprit. It turns out that EMRFS requires at least 2.4.0 (judging from the exception in the initial post, I was not able to find the official requirements). Rebuilding Flink with Hadoop 2.7.1 and with Scala 2.11 worked like a charm and I was able to run WordCount using

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Ufuk Celebi
Hey Timur, if you are using EMR with IAM roles, Flink should work out of the box. You don't need to change the Hadoop config and the IAM role takes care of setting up all credentials at runtime. You don't need to hardcode any keys in your application that way and this is the recommended way to go

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Timur Fayruzov
Hello Ufuk, I'm using EMR 4.4.0. Thanks, Timur On Tue, Apr 5, 2016 at 2:18 AM, Ufuk Celebi wrote: > Hey Timur, > > which EMR version are you using? > > – Ufuk > > On Tue, Apr 5, 2016 at 1:43 AM, Timur Fayruzov > wrote: > > Thanks for the answer, Ken. > > > > My understanding is that file syst

Re: Integrate Flink with S3 on EMR cluster

2016-04-05 Thread Ufuk Celebi
Hey Timur, which EMR version are you using? – Ufuk On Tue, Apr 5, 2016 at 1:43 AM, Timur Fayruzov wrote: > Thanks for the answer, Ken. > > My understanding is that file system selection is driven by the following > sections in core-site.xml: > > fs.s3.impl > > org.apache.hadoop.fs.s3na

Re: Integrate Flink with S3 on EMR cluster

2016-04-04 Thread Timur Fayruzov
Thanks for the answer, Ken. My understanding is that file system selection is driven by the following sections in core-site.xml: fs.s3.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem fs.s3n.impl com.amazon.ws.emr.hadoop.fs.EmrFileSystem If I run the program using configurat

Re: Integrate Flink with S3 on EMR cluster

2016-04-04 Thread Ken Krugler
Normally in Hadoop jobs you’d want to use s3n:// as the protocol, not s3. Though EMR has some support for magically treating the s3 protocol as s3n (or maybe s3a now, with Hadoop 2.6 or later) What happens if you use s3n:/// for the --input parameter? — Ken > On Apr 4, 2016, at 2:51pm, Timur