Re: data sink stops method

2015-10-08 Thread Florian Heyl
Hey Stephan and Pieter, That was what I thought as well, so I simply changed the code like this:

    original.writeAsCsv(outputPath, "\n", " ", WriteMode.OVERWRITE)
    env.execute()
    transformPred.writeAsCsv(outputPath2, "\n", " ", WriteMode.OVERWRITE)
    env.execute()

But it still does not execute the two co…
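As an alternative to calling execute() twice, both sinks can be declared first and submitted together; a minimal sketch, assuming `original`, `transformPred`, `outputPath`, and `outputPath2` are defined as in the thread:

    // Both writeAsCsv calls only declare lazy sinks; a single execute()
    // then submits one plan that writes both outputs in the same job.
    original.writeAsCsv(outputPath, "\n", " ", WriteMode.OVERWRITE)
    transformPred.writeAsCsv(outputPath2, "\n", " ", WriteMode.OVERWRITE)
    env.execute("write both CSV outputs")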

Re: Debug OutOfMemory

2015-10-08 Thread KOSTIANTYN Kudriavtsev
Yes it is, I'm checking the number of columns per line to filter out malformed rows.
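A sketch of the kind of column-count filter described here (the path, field delimiter, and expected column count are assumptions, not from the thread):

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    val expectedColumns = 10  // assumed value, for illustration only
    val clean = env.readTextFile("s3://bucket/input.csv")
      // split with limit -1 so trailing empty fields still count as columns
      .filter(line => line.split(",", -1).length == expectedColumns)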

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
There is probably a different CSV input format implementation which drops invalid lines (too long lines). Is that actually the desired behavior, simply dropping malformed input?

Re: Debug OutOfMemory

2015-10-08 Thread KOSTIANTYN Kudriavtsev
Hm, you were right. I checked all the files one by one and found an issue with a line in one of them... It's really unexpected for me, since I ran a Spark job on the same dataset and the "wrong" rows were filtered out without issues. Thanks for the help! Thank you, Konstantin Kudryavtsev

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
Ah, that makes sense! The problem is not in the core runtime, it is in the delimited input format. It probably looks for the line split character and never finds it, so it starts buffering a super large line (gigabytes), which leads to the OOM exception. Can you check whether the line split c…
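One way to rule the delimiter out is to state it explicitly when reading, so the input format never has to guess; a sketch against the 0.10-era Scala API (path and tuple type are placeholders):

    // If the file is terminated with "\r\n" or an unusual character, declaring
    // it here prevents the format from buffering gigabytes while searching for
    // a delimiter that never occurs in the data.
    val data = env.readCsvFile[(String, Int)](
      "s3://bucket/input.csv",
      lineDelimiter = "\n",
      fieldDelimiter = ",")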

Re: Debug OutOfMemory

2015-10-08 Thread KOSTIANTYN Kudriavtsev
What confuses me: I'm running Flink on YARN with the following command: ./yarn-session.sh -n 4 -jm 2096 -tm 5000, so I expect each TaskManager to have almost 5 GB of RAM available, but in the TaskManager panel I found that each task manager has the following conf: Flink Managed Memory: 2460 MB, CPU cores: 4

Re: Debug OutOfMemory

2015-10-08 Thread KOSTIANTYN Kudriavtsev
10/08/2015 16:25:48 CHAIN DataSource (at com.epam.AirSetJobExample$.main(AirSetJobExample.scala:31) (org.apache.flink.api.java.io.TextInputFormat)) -> Filter (Filter at com.epam.AirSetJobExample$.main(AirSetJobExample.scala:31)) -> FlatMap (FlatMap at count(DataSet.scala:523))(1/1) switched to

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
Can you paste the exception stack trace?

Re: Debug OutOfMemory

2015-10-08 Thread KOSTIANTYN Kudriavtsev
It's a DataSet program that performs simple filtering, a cross join, and aggregation. I'm using the Hadoop S3 FileSystem (not EMR's), since Flink's S3 connector doesn't work at all. Currently I have 3 TaskManagers with 5000 MB each, but I tried different configurations and all lead to the same exception.
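For reference, a skeleton of the kind of filter/cross/aggregate pipeline described (all names, paths, and types are hypothetical):

    val a = env.readTextFile("s3://bucket/a.csv").filter(_.nonEmpty)
    val b = env.readTextFile("s3://bucket/b.csv")
    // cross produces every (a, b) pair, so its output has |a| * |b| elements;
    // this quadratic blow-up is a common source of memory pressure
    val pairs = a.cross(b)
    val total = pairs.map(_ => 1L).reduce(_ + _)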

Re: Debug OutOfMemory

2015-10-08 Thread Fabian Hueske
Hi Konstantin, Flink uses managed memory only for its internal processing (sorting, hash tables, etc.). If you allocate too much memory in your user code, it can still fail with an OOME. This can also happen for large broadcast sets. Can you check how much memory the JVM allocated and how muc…
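If user code or broadcast sets need more heap, the split between managed memory and free heap can be shifted in flink-conf.yaml; a sketch, assuming the era's default fraction of 0.7:

    # flink-conf.yaml
    # Give Flink's managed memory 50% of the available heap instead of the
    # default 70%, leaving more room for user code and broadcast sets.
    taskmanager.memory.fraction: 0.5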

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
Can you give us a bit more background? What exactly is your program doing?
- Are you running a DataSet program, or a DataStream program?
- Is it one simple source that reads from S3, or are there multiple sources?
- What operations do you apply on the CSV file?
- Are you using Flink's S3…

Debug OutOfMemory

2015-10-08 Thread KOSTIANTYN Kudriavtsev
Hi guys, I'm running Flink on EMR with 2 m3.xlarge instances (16 GB RAM each) and trying to process 3.8 GB of CSV data from S3. I was surprised that Flink failed with OutOfMemory: Java heap space. I tried to find the reason: 1) identify the TaskManager with the command ps aux | grep TaskManager, 2) then bui…

Re: data sink stops method

2015-10-08 Thread Stephan Ewen
Yes, sinks in Flink are lazy and do not trigger execution automatically. We made this choice to allow multiple concurrent sinks (splitting the streams and writing to many outputs concurrently). That requires explicit execution triggers (env.execute()). The exceptions are, as mentioned, the "eager"…
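A sketch of the lazy-vs-eager distinction described here (dataset name and path are hypothetical):

    // Lazy: writeAsCsv only declares a sink; nothing runs yet.
    result.writeAsCsv("hdfs:///out/result.csv", "\n", " ", WriteMode.OVERWRITE)
    env.execute("csv write")   // explicit trigger: now the job actually runs

    // Eager: collect() and print() trigger execution by themselves,
    // because they must ship results back to the client.
    val rows = result.collect()
    result.print()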

Re: data sink stops method

2015-10-08 Thread Pieter Hameete
Hi Florian, I believe that when you call JoinPredictionAndOriginal.collect, the environment will execute your program up until that point. The CSV writes come after this point, so in order to execute these steps I think you would have to call .execute() after the CSV writes to trigger the execut…

data sink stops method

2015-10-08 Thread Florian Heyl
Hi, I need some help to figure out why one method of mine in a pipeline stops the execution on HDFS. I am working with the 0.10-SNAPSHOT and the code is the following (see below). The method stops on HDFS by calling the collect method (JoinPredictionAndOriginal.collect), creating a data s…

Re: Extracting weights from linear regression model

2015-10-08 Thread Theodore Vasiloudis
Hello Trevor, I assume you are using the MultipleLinearRegression class in a manner similar to our examples, i.e.:

    // Create multiple linear regression learner
    val mlr = MultipleLinearRegression()
      .setIterations(10)
      .setStepsize(0.5)
      .setConvergenceThreshold(0.001)
    // Obtain training and testing data set…
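Assuming the model above has been fit, the learned weights can then be pulled out of the trainer's weightsOption, which FlinkML populates during fit() (a sketch, not from the original mail; trainingDS is a hypothetical DataSet[LabeledVector]):

    mlr.fit(trainingDS)
    // weightsOption is Option[DataSet[WeightVector]]; it is None until fit() runs
    val weights = mlr.weightsOption match {
      case Some(ds) => ds.collect()   // Seq[WeightVector]
      case None     => sys.error("fit() has not been called yet")
    }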

Re: Flink batch runs OK but Yarn container fails in batch mode with -m yarn-cluster

2015-10-08 Thread Maximilian Michels
Hi Arnaud, I've looked into the problem but I couldn't reproduce it using Flink 0.9.0, Flink 0.9.1, or the current master snapshot (f332fa5). I always ended up with the final state SUCCEEDED. Which version of Flink were you using? Best regards, Max

reduce error

2015-10-08 Thread Michele Bertoni
Hi everybody, I am facing a very strange problem. Sometimes when I run my program, one line of the result is totally wrong, and if I repeat the execution with the same input it can change. The algorithm takes two datasets and executes something like a left outer join where, if there is a match, it incre…