How groupBy works

2015-10-29 Thread Jinfeng Li
Hi, I find that WordCount on Flink is slow, and 75% of the time is spent in the groupBy operator. The dataset is 90G, with only 1000 distinct words. Could you tell me how groupBy is implemented? Best Regards, Jeffrey

Re: How best to deal with wide, structured tuples?

2015-10-29 Thread Stephan Ewen
Hi Johann! You can try and use the Table API, it has logical tuples that you program with, rather than tuple classes. Have a look here: https://ci.apache.org/projects/flink/flink-docs-master/libs/table.html Stephan On Thu, Oct 29, 2015 at 6:53 AM, Fabian Hueske wrote: > Hi Johann, > > I see

Re: How best to deal with wide, structured tuples?

2015-10-29 Thread Fabian Hueske
Hi Johann, I see three options for your use case. 1) Generate POJO code at planning time, i.e., when the program is composed. This does not work when the program is already running. The benefit is that you can use key expressions, have typed fields, and get type-specific serializers and comparators.

Re: Flink on EC"

2015-10-29 Thread KOSTIANTYN Kudriavtsev
Hi Thomas, Try switching to EMR AMI 3.5 and register Hadoop's S3 FileSystem instead of the one packaged with Flink. On Oct 29, 2015 4:36 AM, "Thomas Götzinger" wrote: > Hello Flink Team, > > We at IESE Fraunhofer are evaluating Flink for a project and I'm a bit > frustrated i

Re: Flink Dashboard and wrong job parallelism info

2015-10-29 Thread Flavio Pompermaier
Ok, thanks a lot for the info guys! On Thu, Oct 29, 2015 at 11:30 AM, Maximilian Michels wrote: > Here's the jira issue for the cancel button: > https://issues.apache.org/jira/browse/FLINK-2939 > > On Thu, Oct 29, 2015 at 11:28 AM, Aljoscha Krettek > wrote: > >> Hi >> yes, a lot of people have

Re: Flink Dashboard and wrong job parallelism info

2015-10-29 Thread Maximilian Michels
Here's the jira issue for the cancel button: https://issues.apache.org/jira/browse/FLINK-2939 On Thu, Oct 29, 2015 at 11:28 AM, Aljoscha Krettek wrote: > Hi > yes, a lot of people have complained about the missing cancel button > already. :D (myself included) > > The number of retained jobs can

Re: Flink Dashboard and wrong job parallelism info

2015-10-29 Thread Aljoscha Krettek
Hi yes, a lot of people have complained about the missing cancel button already. :D (myself included) The number of retained jobs can be configured in conf/flink-conf.yaml by setting the configuration key “jobmanager.web.history” to a different number. Cheers, Aljoscha > On 29 Oct 2015, at 11
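As a minimal sketch of the setting Aljoscha describes, the entry in conf/flink-conf.yaml might look like the fragment below. The value 20 is an arbitrary illustrative number, not one taken from the thread.

```yaml
# conf/flink-conf.yaml
# Number of completed jobs retained in the web dashboard history.
# 20 is an assumed example value; choose whatever suits your setup.
jobmanager.web.history: 20
```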

Re: Flink Dashboard and wrong job parallelism info

2015-10-29 Thread Flavio Pompermaier
Yes, I was referring exactly to that :) Thanks for the clarification, Aljoscha. Are there plans to add buttons for managing jobs to the dashboard (cancel, for example, would be useful when running tests)? And where do I set the number of completed jobs to show in the history? On Thu, Oct 29, 2015

Re: Flink Dashboard and wrong job parallelism info

2015-10-29 Thread Aljoscha Krettek
Hi, are you referring to the “Job statistics/Accumulators” tab? This tab does not display actual information; it is a placeholder page that we forgot to remove. It will be removed before the 0.10 release, and a pull request to remove it is currently open. Cheers, Aljoscha > On 29 Oct 2015, at

Flink Dashboard and wrong job parallelism info

2015-10-29 Thread Flavio Pompermaier
Hi all, I'm using Flink 0.10-SNAPSHOT and tested the new Dashboard on my cluster a few days ago. The parallelism shown in the job info was wrong (I see 2, but it's 36). Does this happen only to me? Best, Flavio

Re: Flink on EC"

2015-10-29 Thread Fabian Hueske
Hi Thomas, until recently, Flink provided its own implementation of an S3FileSystem, which wasn't fully tested and was buggy. We removed that implementation and now use Hadoop's S3 implementation by default (in 0.10-SNAPSHOT). If you want to continue using 0.9.1 you can configure Flink to use Hado

Flink on EC"

2015-10-29 Thread Thomas Götzinger
Hello Flink Team, We at IESE Fraunhofer are evaluating Flink for a project and I'm a bit frustrated at the moment. I've written a few test cases with the Flink API and want to deploy them to a Flink EC2 cluster. I set up the cluster using the Karamel recipe which was addressed in the following video