Re: Processing S3 data with Apache Flink

2015-10-05 Thread Robert Metzger
Hi Kostia, thank you for writing to the Flink mailing list. I actually started to try out our S3 File system support after I saw your question on StackOverflow [1]. I found that our S3 connector is very broken. I had to resolve two more issues with it before I was able to get the same exception y

Processing S3 data with Apache Flink

2015-10-05 Thread Kostiantyn Kudriavtsev
Hi guys, I'm trying to get Apache Flink 0.9.1 working on EMR, basically to read data from S3. I tried the following path for data s3://mybucket.s3.amazonaws.com/folder, but it throws the following exception: java.io.IOException: Cannot establish connection to Amazon S3: com.amazonaws.services
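A likely culprit here is the path itself: with Hadoop's S3 file systems, the bucket name goes directly after the scheme (e.g. s3n://mybucket/folder), not the bucket's virtual-hosted hostname. In that Flink/Hadoop era, credentials were supplied through the Hadoop configuration; the following is a hedged sketch of a core-site.xml using standard Hadoop S3N keys, with placeholder values:

```xml
<!-- core-site.xml: keys are standard Hadoop S3N settings; values are placeholders -->
<configuration>
  <property>
    <name>fs.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With that in place, the data would be addressed as s3n://mybucket/folder. Note that Robert's follow-up above reports the connector itself had bugs at the time, so correct configuration alone may not have been sufficient.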

Re: Error trying to access JM through proxy

2015-10-05 Thread Robert Metzger
I filed a bug for this issue in our bug tracker https://issues.apache.org/jira/browse/FLINK-2821 (even though we can not do much about it, we should track the resolution of the issue). On Mon, Oct 5, 2015 at 5:34 AM, Stephan Ewen wrote: > I think this is yet another problem caused by Akka's over

Re: Running Flink on an Amazon Elastic MapReduce cluster

2015-10-05 Thread Maximilian Michels
Hi Hanen, It appears that the environment variables are not set. Thus, Flink cannot pick up the Hadoop configuration. Could you please paste the output of "echo $HADOOP_HOME" and "echo $HADOOP_CONF_DIR" here? In any case, your problem looks similar to the one discussed here: http://stackoverflow.
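The missing variables Max asks about can be set before launching the YARN session. A minimal sketch, assuming the Hadoop layout typical of an EMR master node (the exact paths vary by AMI version, so verify them on the cluster):

```shell
# Typical EMR locations -- confirm these paths on your master node.
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Verify before starting the YARN session:
echo "$HADOOP_HOME"
echo "$HADOOP_CONF_DIR"
```

Flink's YARN client reads these variables to locate the Hadoop configuration, so they must be set in the same shell that runs ./bin/yarn-session.sh.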

Running Flink on an Amazon Elastic MapReduce cluster

2015-10-05 Thread Hanen Borchani
Hi all, I tried to start a Yarn session on an Amazon EMR cluster with Hadoop 2.6.0 following the instructions provided in this link and using Flink 0.9.1 for Hadoop 2.6.0 https://ci.apache.org/projects/flink/flink-docs-release-0.9/setup/yarn_setup.html Running the following command line: ./bin/ya

Re: For each element in a dataset, do something with another dataset

2015-10-05 Thread Fabian Hueske
Hi Pieter, a FlatMapFunction can only return values when the map() method is called. However, in your use case, you would like to return values *after* the function was called the last time. This is not possible with a FlatMapFunction, because you cannot identify the last map() call. The MapPartit
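Fabian's point is that a MapPartitionFunction receives the whole partition as an iterator, so it can emit records after the last element has been consumed. The sketch below illustrates that semantics in self-contained plain Java; the Collector interface and method names are local stand-ins, not Flink's actual API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Plain-Java sketch of the mapPartition idea: because the function sees the
// whole partition as an iterator, it can emit extra records after consuming
// the last element -- which a map()/flatMap() invoked once per element cannot.
public class MapPartitionSketch {

    interface Collector<T> { void collect(T value); }

    static void mapPartition(Iterator<Integer> values, Collector<String> out) {
        int count = 0;
        while (values.hasNext()) {
            out.collect("element: " + values.next());
            count++;
        }
        // Emitted only after the last element was processed.
        out.collect("count: " + count);
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        mapPartition(List.of(1, 2, 3).iterator(), results::add);
        results.forEach(System.out::println);
    }
}
```

The trailing "count" record is exactly the kind of after-the-last-element output that a FlatMapFunction cannot produce.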

Re: Config files content read

2015-10-05 Thread Fabian Hueske
Hi Flavio, I don't think this is a feature that needs to go into Flink core. To me it looks like this could be implemented as a utility method by anybody who needs it without major effort. Best, Fabian 2015-10-02 15:27 GMT+02:00 Flavio Pompermaier : > Hi to all, > > in many of my jobs I have to read

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Stephan Ewen
Okay, nice to hear it works out! On Mon, Oct 5, 2015 at 1:50 PM, Pieter Hameete wrote: > Hi Stephan, > > it was not the SimpleInputProjection, because that is a stateless object. > The boolean endReached was not reset upon opening a new file however, so > for each consecutive file no records wer

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Pieter Hameete
Hi Stephan, it was not the SimpleInputProjection, because that is a stateless object. The boolean endReached was not reset upon opening a new file however, so for each consecutive file no records were parsed. Thanks a lot for your help! - Pieter 2015-10-05 12:50 GMT+02:00 Stephan Ewen : > If yo

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Stephan Ewen
If you have more files than task slots, then some tasks will get multiple files. That means that open() and close() are called multiple times on the input format. Make sure that your input format tolerates that and does not get confused with lingering state (maybe create a new SimpleInputProjectio
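Stephan's advice can be sketched in self-contained plain Java, mirroring Flink's InputFormat lifecycle without using its actual interface: open() runs once per file split, so any per-file state such as an end-of-input flag must be reset there. The class and method names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the InputFormat lifecycle: one instance may be opened for several
// splits, so per-split state (like an end-of-input flag) must be reset in
// open() -- otherwise every split after the first is silently skipped, which
// is the bug Pieter found in his format.
public class ResettableFormat {

    private boolean endReached;   // per-split state
    private List<String> lines;
    private int pos;

    public void open(List<String> split) {
        this.lines = split;
        this.pos = 0;
        this.endReached = false;  // the crucial reset
    }

    public boolean reachedEnd() {
        if (pos >= lines.size()) {
            endReached = true;
        }
        return endReached;
    }

    public String nextRecord() {
        return lines.get(pos++);
    }

    public static void main(String[] args) {
        ResettableFormat format = new ResettableFormat();
        List<String> all = new ArrayList<>();
        // The same instance processes two "files", as happens when there are
        // more input files than task slots.
        for (List<String> file : List.of(List.of("a", "b"), List.of("c"))) {
            format.open(file);
            while (!format.reachedEnd()) {
                all.add(format.nextRecord());
            }
        }
        System.out.println(all);  // records from both files, not just the first
    }
}
```

Without the reset in open(), reachedEnd() would stay true after the first file and the second file's records would be lost, matching the varying dataset sizes reported in this thread.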

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Pieter Hameete
Hi Stephan, it concerns the DataSet API. The program I'm running can be found at https://github.com/PHameete/dawn-flink/blob/development/src/main/scala/wis/dawnflink/performance/xmark/XMarkQuery11.scala The Custom Input Format at https://github.com/PHameete/dawn-flink/blob/development/src/main/sca

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Stephan Ewen
I assume this concerns the streaming API? Can you share your program and/or the custom input format code? On Mon, Oct 5, 2015 at 12:33 PM, Pieter Hameete wrote: > Hello Flinkers! > > I run into some strange behavior when reading from a folder of input files. > > When the number of input files i

Re: Error trying to access JM through proxy

2015-10-05 Thread Stephan Ewen
I think this is yet another problem caused by Akka's overly strict message routing. An actor system bound to a certain URL can only receive messages that are sent to that exact URL. All other messages are dropped. This has many problems: - Proxy routing (as described here, send to the proxy UR

Reading from multiple input files with fewer task slots

2015-10-05 Thread Pieter Hameete
Hello Flinkers! I run into some strange behavior when reading from a folder of input files. When the number of input files in the folder exceeds the number of task slots, I noticed that the size of my datasets varies with each run. It seems as if the transformations don't wait for all input files

Re: JM/TM startup time

2015-10-05 Thread Robert Schmidtke
Thanks, the off-heap solution is indeed faster: 15s instead of 45s for the amounts of memory I allocate. On Fri, Oct 2, 2015 at 6:09 PM, Stephan Ewen wrote: > Yeah, registration is fast, JVM heat-up is what takes time. > > You can try two things: > > - Use the off-heap memory variant and see if
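The off-heap variant Stephan suggests is a flink-conf.yaml switch. A hedged sketch (the off-heap key became available in later Flink releases, and the 4096 MB heap size is just an example value):

```yaml
# flink-conf.yaml -- example values; the off-heap switch avoids the long JVM
# heap warm-up that makes TaskManager startup slow for large memory sizes.
taskmanager.memory.off-heap: true
taskmanager.heap.mb: 4096
```

With off-heap memory, Flink's managed memory is allocated outside the Java heap, so the JVM does not need to touch and grow a huge heap at startup, which is consistent with the 45s-to-15s improvement Robert reports.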

Re: For each element in a dataset, do something with another dataset

2015-10-05 Thread Pieter Hameete
Hi Fabian, I have a question regarding the first approach. Is there a benefit gained from choosing a RichMapPartitionFunction over a RichMapFunction in this case? I assume that each broadcasted dataset is sent only once to each task manager? If I would broadcast dataset B, then I could for each e
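The broadcast-variable pattern Pieter is asking about can be sketched in self-contained plain Java: the broadcast data set is handed to the function once in open(), and map() then consults it for every element. The class and method names below are illustrative stand-ins, not Flink's actual RichMapFunction/withBroadcastSet API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java sketch of the broadcast-variable pattern: dataset B is received
// once per task in open() (not once per element), and map() then uses it for
// each element of dataset A.
public class BroadcastJoinSketch {

    private List<Integer> broadcastB;  // received once, reused for every element

    public void open(List<Integer> broadcast) {
        this.broadcastB = broadcast;
    }

    public long map(int elementOfA) {
        // For each element of A, count matching elements of broadcast set B.
        return broadcastB.stream().filter(b -> b == elementOfA).count();
    }

    public static void main(String[] args) {
        BroadcastJoinSketch fn = new BroadcastJoinSketch();
        fn.open(List.of(1, 2, 2, 3));                  // dataset B, broadcast once
        List<Long> counts = List.of(1, 2, 4).stream()  // dataset A
                .map(fn::map)
                .collect(Collectors.toList());
        System.out.println(counts);  // [1, 2, 0]
    }
}
```

As for the question in the thread: in Flink a broadcast set is materialized per task instance before processing starts, so whether the surrounding function is a RichMapFunction or a RichMapPartitionFunction mainly changes how the elements of A are handed to you, not how B is distributed.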

Re: Destroy StreamExecutionEnv

2015-10-05 Thread Stephan Ewen
Matthias' solution should work in most cases. In cases where you do not control the source (or the source can never be finite, like the Kafka source), we often use a trick in the tests, which is throwing a special type of exception (a SuccessException). You can catch this exception on env.execute
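The SuccessException trick Stephan describes can be sketched without Flink: a sink throws a marker exception once it has seen the expected data, and the driver unwraps the failure cause around execute(). SuccessException and runJob() below are illustrative stand-ins (Flink wraps sink failures before rethrowing from env.execute(), which is what the cause check mimics):

```java
// Sketch of the SuccessException pattern used in Flink's own tests: throwing
// a marker exception is the only way to stop a job whose source never ends.
public class SuccessExceptionSketch {

    static class SuccessException extends RuntimeException {}

    // Stand-in for env.execute(): failures are wrapped before rethrowing.
    static void runJob() throws Exception {
        try {
            for (int i = 0; ; i++) {               // "infinite" source
                if (i == 3) {
                    throw new SuccessException();  // sink saw what it needed
                }
            }
        } catch (RuntimeException e) {
            throw new Exception("Job execution failed.", e);
        }
    }

    public static void main(String[] args) {
        try {
            runJob();
        } catch (Exception e) {
            if (e.getCause() instanceof SuccessException) {
                System.out.println("test passed");  // expected termination
            } else {
                throw new RuntimeException(e);      // a real failure
            }
        }
    }
}
```

The key design point is distinguishing the deliberate shutdown from a genuine job failure by inspecting the exception's cause chain.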

Re: Destroy StreamExecutionEnv

2015-10-05 Thread Matthias J. Sax
Hi, you just need to terminate your source (i.e., return from the run() method if you implement your own source function). This will finish the complete program. For already available sources, just make sure you read finite input. Hope this helps. -Matthias On 10/05/2015 12:15 AM, jay vyas wrote: > H
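Matthias' point can be illustrated in self-contained plain Java; SourceFunction here is a local stand-in for Flink's interface, with a list in place of the source context:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a streaming job ends on its own once every source's run() method
// returns. A finite source emits its records and simply falls off the end of
// run(), instead of looping forever.
public class FiniteSourceSketch {

    interface SourceFunction<T> {
        void run(List<T> out);  // emit into out; returning ends the source
    }

    static class CountingSource implements SourceFunction<Integer> {
        public void run(List<Integer> out) {
            for (int i = 0; i < 3; i++) {
                out.add(i);
            }
            // Returning here (rather than blocking in an endless loop) lets
            // the whole streaming program finish without external shutdown.
        }
    }

    public static void main(String[] args) {
        List<Integer> emitted = new ArrayList<>();
        new CountingSource().run(emitted);
        System.out.println(emitted);  // [0, 1, 2] -- then the job terminates
    }
}
```

For sources you cannot make finite (such as a Kafka consumer), see the SuccessException approach Stephan describes above in this thread.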

Re: Flink program compiled with Janino fails

2015-10-05 Thread Till Rohrmann
I'm not a Janino expert, but it might be related to the fact that Janino does not fully support generic types (see http://unkrig.de/w/Janino under limitations). Maybe it works if you use the untyped MapFunction type. Cheers, Till On Sat, Oct 3, 2015 at 8:04 PM, Giacomo Licari wrote: > Hi guys, > I