PartitionedByHash input annotation?

2015-05-18 Thread Alexander Alexandrov
In the early days of Flink, when Flink operators were still called PACTs (short for Parallelization Contracts), the system supported so-called "output contracts" via custom annotations that could be attached to the UDF (the ForwardedFields annotation is a descendant of that concept). Amongst others

Re: problem on yarn cluster

2015-05-18 Thread Robert Metzger
You are not seeing the issue in your local environment because everything there runs in one JVM, so the needed class is on the classpath. In the distributed case, we have a special user-code classloader which can load classes from the user's jar. On Mon, May 18, 2015 at 9:16 PM, Michele Bert
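The local/cluster difference Robert describes can be reproduced without any cluster at all: a class that happens to be on the application classpath locally simply fails to resolve in a JVM whose classloaders have never seen the user jar. A minimal pure-Java sketch (the class name com.example.MyUdf is made up for illustration, not an actual Flink class):

```java
public class ClassLoadingDemo {
    public static void main(String[] args) {
        // Locally, driver and operators run in one JVM, so user classes
        // resolve through the normal application classloader.
        // On a cluster, a TaskManager JVM only sees user classes through
        // Flink's user-code classloader, which reads them from the job jar.
        try {
            Class.forName("com.example.MyUdf"); // hypothetical user class
            System.out.println("class found");
        } catch (ClassNotFoundException e) {
            // This is what a ClassNotFoundException on the cluster boils
            // down to: the class is not visible to the resolving classloader.
            System.out.println("class not found: " + e.getMessage());
        }
    }
}
```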

Re: problem on yarn cluster

2015-05-18 Thread Michele Bertoni
Got it! Thank you. Best, Michele. On 18 May 2015, at 19:29, Robert Metzger <rmetz...@apache.org> wrote: Hi, ./yarn-session.sh -n 3 -s 1 -jm 1024 -tm 4096 will start a YARN session with 4 containers (3 workers, 1 master). Once the session is running, you can submit a

Re: problem on yarn cluster

2015-05-18 Thread Michele Bertoni
Hi! Thanks for the answer! For 1 you are right: I am using the collect method on a custom object; tomorrow I will try moving back to the snapshot. Why is it working correctly in my local environment (which is also milestone-1)? On the other hand, I am not collecting large results: I am running on a test datas

Re: inconsistency in count and print

2015-05-18 Thread Stephan Ewen
Hi Michele! I cannot tell you what the problem is at first glance, but here are some pointers that may help you find it: Input split creation determinism - the number of input splits is not really deterministic; it depends on the parallelism of the source (this tells the system how m

Re: problem on yarn cluster

2015-05-18 Thread Robert Metzger
Hi, ./yarn-session.sh -n 3 -s 1 -jm 1024 -tm 4096 will start a YARN session with 4 containers (3 workers, 1 master). Once the session is running, you can submit as many jobs as you want to this session, using ./bin/flink run ./path/to/jar The YARN session will create a hidden file in conf/ whic

Re: problem on yarn cluster

2015-05-18 Thread Stephan Ewen
Hi Michele! It looks like there are quite a few things going wrong in your case. Let me see what I can deduce from the output you are showing me: 1) You seem to run into a bug that exists in the 0.9-milestone-1 and has been fixed in the master: As far as I can tell, you call "collect()" on a da

Re: Informing the runtime about data already repartitioned using "output contracts"

2015-05-18 Thread Fabian Hueske
Hi Mustafa, I'm afraid this is not possible. Although you can annotate DataSources with partitioning information, this is not enough to avoid repartitioning for a CoGroup. The reason for that is that CoGroup requires co-partitioning of both inputs, i.e., both inputs must be equally partitioned (s
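The co-partitioning requirement Fabian mentions can be illustrated without Flink itself. The sketch below is plain Java, not Flink code: it shows that equal keys only land in the same partition index when both inputs agree on the key extraction, the hash function, and the parallelism, which is exactly what a CoGroup needs in order to match groups locally without a shuffle.

```java
import java.util.Arrays;
import java.util.List;

public class CoPartitioningDemo {
    // The partitioner both CoGroup inputs would have to agree on:
    // same hash function AND same parallelism.
    static int partition(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d");
        int p = 3;
        for (String k : keys) {
            // With identical function and parallelism on both inputs,
            // equal keys always map to the same partition index.
            assert partition(k, p) == partition(k, p);
        }
        // But partitioning one input with parallelism 3 and the other
        // with parallelism 4 breaks co-partitioning: "d" ends up in
        // different partitions on the two sides.
        System.out.println("p=3: " + partition("d", 3)
                + ", p=4: " + partition("d", 4)); // prints p=3: 1, p=4: 0
    }
}
```

This is also why annotating a single DataSource as hash-partitioned is not enough: the runtime would still have to verify that the other CoGroup input was partitioned by the same scheme.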

problem on yarn cluster

2015-05-18 Thread Michele Bertoni
Hi, I have a problem running my app on a YARN cluster. I developed it on my computer and everything is working fine. Then we set up the environment on Amazon EMR, reading data from HDFS, not S3. We run it with these commands: ./yarn-session.sh -n 3 -s 1 -jm 1024 -tm 4096 ./flink run -m yarn-cluste

Informing the runtime about data already repartitioned using "output contracts"

2015-05-18 Thread Mustafa Elbehery
Hi, I am writing a Flink job in which I have three datasets. I have partitionedByHash the first two before coGrouping them. My plan is to spill the result of the coGrouping to disk, and then re-read it again before coGrouping with the third dataset. My question is: is there any way to inform Flink

Re: Spark and Flink

2015-05-18 Thread Robert Metzger
Hi, I would really recommend putting your Flink and Spark dependencies into different Maven modules. Having them both in the same project will be very hard, if not impossible. Both projects depend on similar projects with slightly different versions. I would suggest a Maven module structure lik
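Since the message is cut off before the structure itself, here is one plausible shape of such a layout, as a sketch only (the module names are made up, not Robert's actual suggestion): a parent POM with packaging "pom" and one module per engine, so the conflicting transitive dependencies never meet on one classpath.

```
my-project/
  pom.xml             parent, <packaging>pom</packaging>, lists the modules
  common/pom.xml      shared code, no Flink or Spark dependencies
  flink-jobs/pom.xml  depends on common + Flink artifacts only
  spark-jobs/pom.xml  depends on common + Spark artifacts only
```

Each jobs module can then pin its own versions of shared libraries (e.g. Jackson, Guava) to whatever its engine expects, which is the usual way around the version clashes described later in this thread.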

k-means core function for temporal geo data

2015-05-18 Thread Pa Rö
Hello, I want to cluster geo data (lat, long, timestamp) with k-means. Now I am searching for a good core function, but I cannot find good papers or other sources for that. Currently I multiply the time distance and the space distance: public static double dis(GeoData input1, GeoData input2) { double timeDis = Math
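The code in the message is cut off, so here is one common way to combine the two components, as a hedged sketch rather than Pa Rö's actual function: the haversine formula for the spatial part and an absolute time difference for the temporal part, combined as a weighted sum (the field layout and the weight alpha are assumptions). A weighted sum avoids a drawback of multiplying the two distances: a product collapses to zero whenever either component is zero, e.g. for two events at the same place but days apart.

```java
public class GeoDistance {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Haversine great-circle distance in kilometres.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADius_fix(a);
    }

    private static double EARTH_RADius_fix(double a) {
        return EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Combined distance: spatial distance in km plus a weighted temporal
    // distance in hours. alpha balances the two units and has to be tuned
    // for the data set (timestamps assumed to be in seconds).
    static double dis(double lat1, double lon1, long t1,
                      double lat2, double lon2, long t2, double alpha) {
        double spaceKm = haversineKm(lat1, lon1, lat2, lon2);
        double timeHours = Math.abs(t1 - t2) / 3600.0;
        return spaceKm + alpha * timeHours;
    }

    public static void main(String[] args) {
        // Same point, 2 hours apart, alpha = 1.0 -> combined distance 2.0
        System.out.println(dis(52.5, 13.4, 0, 52.5, 13.4, 7200, 1.0)); // prints 2.0
    }
}
```

Note that plain k-means assumes a Euclidean-style space; with a custom distance like this one, a k-medoids variant is often the safer choice, since centroid averaging of lat/long/time is not always meaningful.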

Fwd: Spark and Flink

2015-05-18 Thread Pa Rö
Hi, if I add your dependency I get over 100 errors. Now I changed the version number: com.fasterxml.jackson.module:jackson-module-scala_2.10:2.4.4 and com.google.guava:guava; now the pom is fin