error message

2015-09-30 Thread Lydia Ickler
Hi, what jar am I missing? The error is: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.flink.api.scala.ExecutionEnvironment.readCsvFile$default$4()Z
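A NoSuchMethodError on a Scala `$default$` accessor like this usually indicates a binary mismatch rather than a missing jar: the job was compiled against a different version of flink-scala than the one on the runtime classpath. A hedged Maven fragment illustrating the usual fix (the version property is a placeholder, not taken from this thread):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-scala</artifactId>
  <!-- must match the Flink version deployed on the cluster -->
  <version>${flink.version}</version>
</dependency>
```

Rebuilding the job against the cluster's exact Flink version is typically what makes the default-argument method resolve again.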

Re: All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Schmidtke
Hi Robert, thanks for your reply. It got me digging into my setup, and I discovered that one TM was scheduled next to the JM. When specifying -yn 7, the documentation suggests that this is the number of TMs (of which I wanted 7), and I thought an additional container would be used for the JM (my YAR

Re: All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Metzger
Hi Robert, the problem here is that YARN's scheduler (there are different schedulers in YARN: FIFO, CapacityScheduler, ...) is not giving Flink's ApplicationMaster/JobManager all the containers it is requesting. By increasing the size of the AM/JM container, there is probably no memory left to fit

Re: All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Schmidtke
I should say I'm running the current Flink master branch. On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke wrote: > It's me again. This is a strange issue, I hope I managed to find the right > keywords. I got 8 machines, 1 for the JM, the other 7 are TMs with 64G of > memory each. > > When runn

All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Schmidtke
It's me again. This is a strange issue; I hope I managed to find the right keywords. I have 8 machines: 1 for the JM, the other 7 are TMs with 64G of memory each. When running my job like so: $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 . The job completes without any

Re: OutOfMemoryError in netty local transport

2015-09-30 Thread Robert Schmidtke
Hi Max, thanks for your quick reply. I found the relevant code and commented it out for testing, seems to be working. Happily waiting for the fix. Thanks again. Robert On Wed, Sep 30, 2015 at 1:42 PM, Maximilian Michels wrote: > Hi Robert, > > This is a regression on the current master due to

Re: OutOfMemoryError in netty local transport

2015-09-30 Thread Maximilian Michels
Hi Robert, This is a regression on the current master due to changes in the way Flink calculates the memory and sets the maximum direct memory size. We introduced these changes when we merged support for off-heap memory. This is not a problem in the way Flink deals with managed memory, just -XX:Ma

Fwd: OutOfMemoryError in netty local transport

2015-09-30 Thread Robert Schmidtke
Hi everyone, I'm constantly running into OutOfMemoryErrors and for the life of me I cannot figure out what's wrong. Let me describe my setup. I'm running the current master branch of Flink on YARN (Hadoop 2.7.0). My job is an unfinished implementation of TPC-H Q2 ( https://github.com/robert-schmid

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Pieter Hameete
Hi Gabor, Fabian, thank you for your suggestions. I am intending to scale up so that I'm sure that both A and B won't fit in memory. I'll see if I can come up with a nice way to partition the datasets, but if that takes too much time I'll just have to accept that it won't work on large datasets.

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Gábor Gévay
Hello, Alternatively, if dataset B fits in memory, but dataset A doesn't, then you can do it with broadcasting B to a RichMapPartitionFunction on A: In the open method of mapPartition, you sort B. Then, for each element of A, you do a binary search in B, and look at the index found by the binary s
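The sort-once-then-binary-search step described above can be sketched outside Flink with plain JDK collections. This is a minimal illustration of the idea only (the actual thread is about doing this inside a RichMapPartitionFunction's open method); `countSmaller` is a hypothetical helper name:

```java
import java.util.Arrays;

public class BinarySearchRank {
    // For a sorted array b, return the number of elements strictly smaller than x.
    static int countSmaller(long[] b, long x) {
        int pos = Arrays.binarySearch(b, x);
        if (pos >= 0) {
            // binarySearch may land on any duplicate; walk left to the first occurrence.
            while (pos > 0 && b[pos - 1] == x) pos--;
            return pos;
        }
        // Not found: -pos - 1 is the insertion point, i.e. the count of smaller elements.
        return -pos - 1;
    }

    public static void main(String[] args) {
        long[] b = {7, 3, 9, 3, 1};
        Arrays.sort(b); // done once, as in the open() method of the thread's suggestion
        System.out.println(countSmaller(b, 4)); // prints 3 (elements 1, 3, 3)
        System.out.println(countSmaller(b, 3)); // prints 1 (only 1 is smaller)
    }
}
```

Sorting B once and probing with binary search gives O(|B| log |B| + |A| log |B|) instead of the O(|A| * |B|) a cross would cost.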

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
The idea is to partition both datasets by range. Assume your dataset A is [1,2,3,4,5,6] you create two partitions: p1: [1,2,3] and p2: [4,5,6]. Each partition is given to a different instance of a MapPartition operator (this is a bit tricky, because you cannot use broadcastSet. You could load the c
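The range-partitioning example above ([1,2,3] to p1, [4,5,6] to p2) can be sketched as a simple boundary lookup. This is only an illustration of the partitioning rule, not Flink's partitioner API; `partitionOf` and the boundary convention are assumptions made for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RangePartitionSketch {
    // boundaries[i] is the inclusive upper bound of partition i;
    // anything above the last boundary falls into the final partition.
    static int partitionOf(int[] boundaries, int value) {
        for (int i = 0; i < boundaries.length; i++) {
            if (value <= boundaries[i]) return i;
        }
        return boundaries.length;
    }

    public static void main(String[] args) {
        int[] boundaries = {3}; // p0: values <= 3, p1: values > 3
        Map<Integer, List<Integer>> partitions = new TreeMap<>();
        for (int v : new int[]{1, 2, 3, 4, 5, 6}) {
            partitions.computeIfAbsent(partitionOf(boundaries, v),
                    k -> new ArrayList<>()).add(v);
        }
        System.out.println(partitions); // prints {0=[1, 2, 3], 1=[4, 5, 6]}
    }
}
```

Both datasets partitioned with the same boundaries end up pairing only the ranges that can actually match, which is what makes the per-partition MapPartition work feasible.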

Re: data flow example on cluster

2015-09-30 Thread Maximilian Michels
Hi Lydia, Till already pointed you to the documentation. If you want to run the WordCount example, you can do so by executing the following command: ./bin/flink run -c com.dataartisans.flink.dataflow.examples.WordCount /path/to/dataflow.jar --input /path/to/input --output /path/to/output If you

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Pieter Hameete
Hi Fabian, thanks for your tips! Do you have some pointers for getting started with the 'tricky range partitioning'? I am quite keen to get this working with large datasets ;-) Cheers, Pieter 2015-09-30 10:24 GMT+02:00 Fabian Hueske : > Hi Pieter, > > cross is indeed too expensive for this ta

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
Hi Pieter, cross is indeed too expensive for this task. If dataset A fits into memory, you can do the following: Use a RichMapPartitionFunction to process dataset B and add dataset A as a broadcastSet. In the open method of mapPartition, you can load the broadcasted set and sort it by a.propertyX
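The broadcast-then-sort pattern described above can be mimicked with plain JDK types: the "broadcast set" A is sorted once per partition (as in open()), then each element of the partition of B is probed against it. `mapPartition` here is a hypothetical stand-in for Flink's RichMapPartitionFunction, not its real signature:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class BroadcastPatternSketch {
    // Processes one partition of B against the broadcast dataset A.
    static List<String> mapPartition(List<Integer> broadcastA, Iterator<Integer> partitionB) {
        List<Integer> sortedA = new ArrayList<>(broadcastA);
        Collections.sort(sortedA); // done once per partition, like open()
        List<String> out = new ArrayList<>();
        while (partitionB.hasNext()) {
            int b = partitionB.next();
            int pos = Collections.binarySearch(sortedA, b);
            // If b occurs in A, pos may point at any matching index; for exact
            // duplicate handling a leftmost search would be needed.
            int smaller = pos >= 0 ? pos : -pos - 1;
            out.add(b + " -> " + smaller + " smaller elements in A");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(5, 1, 3); // the broadcast set
        List<String> result = mapPartition(a, Arrays.asList(2, 4, 6).iterator());
        result.forEach(System.out::println);
    }
}
```

The design point of the thread: broadcasting works only while A fits in each TM's memory; once it doesn't, the range-partitioning approach from the sibling reply is the fallback.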

Re: data flow example on cluster

2015-09-30 Thread Till Rohrmann
It's described here: https://ci.apache.org/projects/flink/flink-docs-release-0.9/quickstart/setup_quickstart.html#run-example Cheers, Till On Wed, Sep 30, 2015 at 8:24 AM, Lydia Ickler wrote: > Hi all, > > I want to run the data-flow Wordcount example on a Flink Cluster. > The local execution w