Insufficient number of network buffers

2016-05-02 Thread Tarandeep Singh
Hi, I have written ETL jobs in Flink (DataSet API). When I execute them in IDE, they run and finish fine. When I try to run them on my cluster, I get "Insufficient number of network buffers" error. I have 5 machines in my cluster with 4 cores each. TaskManager is given 3GB each. I increased the n
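This error is typically resolved by raising `taskmanager.network.numberOfBuffers` in flink-conf.yaml. The Flink 1.0 documentation gives a rule of thumb of `#slots-per-TM^2 * #TMs * 4`. A sketch of that sizing arithmetic for the cluster described above (5 machines, 4 cores each, assuming one slot per core):

```scala
// Rule of thumb from the Flink configuration docs:
// buffers = slotsPerTaskManager^2 * numberOfTaskManagers * 4
def requiredNetworkBuffers(slotsPerTM: Int, numTMs: Int): Int =
  slotsPerTM * slotsPerTM * numTMs * 4

// 5 machines with 4 slots each
val buffers = requiredNetworkBuffers(slotsPerTM = 4, numTMs = 5) // 320
```

The default in Flink 1.0 is 2048 buffers, so jobs with high parallelism and all-to-all shuffles can exceed it quickly.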

TimeWindow overload?

2016-05-02 Thread Elias Levy
Looking over the code, I see that Flink creates a TimeWindow object each time the WindowAssigner is created. I have not yet tested this, but I am wondering if this can become problematic if you have a very long sliding window with a small slide, such as a 24 hour window with a 1 minute slide. It s
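The scale of the concern is easy to quantify: with sliding windows, each element is assigned to roughly windowSize / slide windows, so a 24-hour window sliding by 1 minute assigns every element to 1440 `TimeWindow` objects. A small arithmetic sketch (the helper name is hypothetical, not Flink API):

```scala
// Number of sliding windows each element falls into:
// ceil(windowSize / slide); when the slide divides the size this is size/slide.
def windowsPerElement(windowSizeMs: Long, slideMs: Long): Long =
  (windowSizeMs + slideMs - 1) / slideMs

val oneDayMs   = 24L * 60 * 60 * 1000
val oneMinMs   = 60L * 1000
val perElement = windowsPerElement(oneDayMs, oneMinMs) // 1440
```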

Re: How to perform this join operation?

2016-05-02 Thread Elias Levy
Thanks for the suggestion. I ended up implementing it a different way. What is needed is a mechanism to give each stream a different window assigner, and then let Flink perform the join normally given the assigned windows. Specifically, for my use case what I need is a sliding window for one str

Re: Scala compilation error

2016-05-02 Thread Srikanth
Sorry for the previous incomplete email. Didn't realize I hit send! I was facing a weird compilation error in Scala when I did val joinedStream = stream1.connect(stream2) .transform("funName", outTypeInfo, joinOperator) It turned out to be due to a difference in API signature between Scala and Ja

Scala compilation error

2016-05-02 Thread Srikanth
Hello, I'm fac val stream = env.addSource(new FlinkKafkaConsumer09[String]("test-topic", new SimpleStringSchema(), properties)) val bidderStream: KeyedStream[BidderRawLogs, Int] = stream.flatMap(b => BidderRawLogs(b)).keyBy(b => b.strategyId) val metaStrategy: KeyedStream[(Int, String), Int] = e

Re: Measuring latency in a DataStream

2016-05-02 Thread Igor Berman
1. why are you doing join instead of something like System.currentTimeInMillis()? at the end you have tuple of your data with timestamp anyways...so why just not to wrap you data in tuple2 with additional info of creation ts? 2. are you sure that consumer/producer machines' clocks are in sync? you
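The wrapping Igor suggests can be sketched as follows: stamp each record with its creation time at the source and subtract at the sink. The names here are hypothetical, and the caveat from point 2 applies: the result is only meaningful if producer and consumer clocks are in sync.

```scala
// Carry the creation timestamp alongside the payload (instead of joining
// against a separate timestamp stream).
case class Stamped[T](payload: T, createdAtMs: Long)

def stamp[T](payload: T, nowMs: Long = System.currentTimeMillis()): Stamped[T] =
  Stamped(payload, nowMs)

// At the consuming end: latency is the difference of wall clocks.
def latencyMs[T](rec: Stamped[T], nowMs: Long): Long =
  nowMs - rec.createdAtMs

val rec = stamp("event", nowMs = 1000L)
val lat = latencyMs(rec, nowMs = 1250L) // 250 ms
```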

Re: Multiple windows with large number of partitions

2016-05-02 Thread Christopher Santiago
Hi Aljoscha, Yes, there is still a high partition/window count since I have to keyby the userid so that I get unique users. I believe what I see happening is that the second window with the timeWindowAll is not getting all the results or the results from the previous window are changing when the

Re: S3 Checkpoint Storage

2016-05-02 Thread Fabian Hueske
Hi John, S3 keys are configured via Hadoop's configuration files. Check out the documentation for AWS setups [1]. Cheers, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html 2016-05-02 20:22 GMT+02:00 John Sherwood : > Hello all, > > I'm attempting to set up a
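Concretely, the setup described in the linked AWS docs has two parts: the checkpoint directory in flink-conf.yaml, and the S3 credentials in Hadoop's core-site.xml (which Flink picks up via `fs.hdfs.hadoopconf`). A minimal sketch, with a hypothetical bucket name:

```yaml
# flink-conf.yaml
state.backend: filesystem
state.backend.fs.checkpointdir: s3://my-bucket/flink-checkpoints

# point Flink at the Hadoop config directory containing core-site.xml,
# where the S3 access and secret keys are configured per the AWS setup guide
fs.hdfs.hadoopconf: /etc/hadoop/conf
```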

Re: Problem in creating quickstart project using archetype (Scala)

2016-05-02 Thread nsengupta
Fantastic! Many thanks for clarifying, Aljoscha! I blindly followed what that page said. Instead, I should have tried with the stable version. - Nirmalya -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Problem-in-creating-quickstart-projec

Re: Count of Grouped DataSet

2016-05-02 Thread nsengupta
Hello Fabian, Thanks for taking time to provide your recommendation, This is how I have implemented: case class Something(f1: Int,f2: Int,f3: Int,f4: String ) // My application's data structure *val k = envDefault .fromElements(Something(1,1,2,"A"),Something(1,1,2,"B"),Something(2,1,3,

S3 Checkpoint Storage

2016-05-02 Thread John Sherwood
Hello all, I'm attempting to set up a taskmanager cluster using S3 as the highly-available store. It looks like the main thing is just setting the ` state.backend.fs.checkpointdir` to the appropriate s3:// URI, but as someone rather new to accessing S3 from Java, how should I provide Flink with th

RE: Flink on Azure HDInsight

2016-05-02 Thread Brig Lamoreaux
Thanks Stephan, Turns out Azure Table is slightly different than Azure HDInsight. Both use Azure Storage however, HDInsight allows HDFS over Azure Storage. I’d be curious if anyone has tried to use Flink on top of Azure HDInsight. Thanks, Brig From: ewenstep...@gmail.com [mailto:ewenstep...@gm

Measuring latency in a DataStream

2016-05-02 Thread Robert Schmidtke
Hi everyone, I have implemented a way to measure latency in a DataStream (I hope): I'm consuming a Kafka topic and I'm union'ing the resulting stream with a custom source that emits a (machine-local) timestamp every 1000ms (using currentTimeMillis). On the consuming end I'm distinguishing between

Re: Regarding Broadcast of datasets in streaming context

2016-05-02 Thread Biplob Biswas
Hi Gyula, Could you explain a bit why i wouldn't want the centroids to be collected after every point? I mean, once I get a streamed point via map1 function .. i would want to compare the distance of the point with a centroid which arrives via map2 function and i keep on comparing for every cent

Re: Regarding Broadcast of datasets in streaming context

2016-05-02 Thread Gyula Fóra
Hey, I think you got the good idea :) So your coflatmap will get all the centroids that you have sent to the stream in the closeWith call. This means that whenever you collect a new set of centroids they will be iterated back. This means you don't always want to send the centroids out on the coll
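The per-point work inside that coflatmap is the usual k-means assignment step: compare each incoming point against the current centroid set and pick the nearest. A plain-Scala sketch of just that step (not Flink API; squared Euclidean distance in 2D for illustration):

```scala
type Point = (Double, Double)

// Squared Euclidean distance; sufficient for nearest-neighbor comparison.
def dist2(a: Point, b: Point): Double = {
  val dx = a._1 - b._1
  val dy = a._2 - b._2
  dx * dx + dy * dy
}

// What map1 would do with a data point, given the centroids seen so far via map2.
def nearestCentroid(p: Point, centroids: Seq[Point]): Point =
  centroids.minBy(c => dist2(p, c))

val centroids = Seq((0.0, 0.0), (10.0, 10.0))
val assigned  = nearestCentroid((1.0, 2.0), centroids) // (0.0, 0.0)
```

In the iterative stream, the centroids fed back via closeWith would replace `centroids` here, which is why one only emits updated centroids periodically rather than after every point.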

Re: Understanding Sliding Windows

2016-05-02 Thread Aljoscha Krettek
Hi, there is no way to skip the first 5 minutes since Flink doesn't know where your "time" begins. Elements are just put into window "buckets" that are emitted at the appropriate time. Cheers, Aljoscha On Wed, 27 Apr 2016 at 07:01 Piyush Shrivastava wrote: > Hello Dominik, > > Thanks for the in

Re: Flink Iterations Ordering

2016-05-02 Thread Aljoscha Krettek
Hi, as I understand it the order of elements will not be preserved across iteration supersteps. But maybe some-one else knows more. Cheers, Aljoscha On Thu, 28 Apr 2016 at 00:23 David Kim wrote: > Hello all, > > I read the documentation at [1] on iterations and had a question on > whether an ass

Re: join performance

2016-05-02 Thread Aljoscha Krettek
Hi Henry, yes, with early firings you would have the problem of duplicate emission. I'm afraid I don't have a solution for that right now. For the "another question" I think you are right that this would be session windowing. Please have a look at this blog post that I wrote recently: http://data-

first() function in DataStream

2016-05-02 Thread subash basnet
Hello all, In the DataSet API the *first(n)* function can be called to get the first 'n' elements of the DataSet. How could a similar operation be done on a DataStream to get 'n' elements from the current DataStream? Best Regards, Subash Basnet

Re: Checking actual config values used by TaskManager

2016-05-02 Thread Maximilian Michels
Hi Ken, When you're running Yarn, the Flink configuration is created once and shared among all nodes (JobManager and TaskManagers). Please have a look at the JobManager tab on the web interface. It shows you the configuration. Cheers, Max On Fri, Apr 29, 2016 at 3:18 PM, Ken Krugler wrote: > Hi

Re: Regarding Broadcast of datasets in streaming context

2016-05-02 Thread Biplob Biswas
Hi Gyula, I understand more now how this thing might work and its fascinating. Although I still have one question with the coflatmap function. First, let me explain what I understand and whether its correct or not: 1. The connected iterative stream ensures that the coflatmap function receive the

Re: Perform a groupBy on an already groupedDataset

2016-05-02 Thread Punit Naik
It solved my problem! On Mon, May 2, 2016 at 3:45 PM, Fabian Hueske wrote: > Grouping a grouped dataset is not supported. > You can group on multiple keys: dataSet.groupBy(1,2). > > Can you describe your use case if that does not solve the problem? > > > > 2016-05-02 10:34 GMT+02:00 Punit Naik :

Re: Unable to write stream as csv

2016-05-02 Thread Aljoscha Krettek
I think there is a problem with the interaction of legacy OutputFormats and streaming programs. Flush is not called, and since the CsvOutputFormat only writes in flush(), we don't see any results. On Mon, 2 May 2016 at 11:59 Fabian Hueske wrote: > Have you checked the log files as well? > > 2016

Re: Perform a groupBy on an already groupedDataset

2016-05-02 Thread Fabian Hueske
Grouping a grouped dataset is not supported. You can group on multiple keys: dataSet.groupBy(1,2). Can you describe your use case if that does not solve the problem? 2016-05-02 10:34 GMT+02:00 Punit Naik : > Hello > > I wanted to perform a groupBy on an already grouped dataset. How do I do > t
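Fabian's point is that the two grouping levels collapse into a single grouping on a composite key. Plain Scala collections show the shape; in Flink the equivalent is `dataSet.groupBy(1, 2)` on the two key fields:

```scala
// Hypothetical record with two key fields (a, b) and a value v.
case class Row(a: String, b: String, v: Int)

val rows = Seq(Row("x", "p", 1), Row("x", "q", 2), Row("x", "p", 3))

// One grouping on the composite key (a, b) — not a groupBy of a groupBy.
val grouped: Map[(String, String), Int] =
  rows.groupBy(r => (r.a, r.b)).map { case (k, rs) => k -> rs.map(_.v).sum }

// grouped(("x", "p")) == 4
```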

Re: Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-02 Thread Fabian Hueske
Yes, it looks like the connector only creates the connection once when it starts and fails if the host is no longer reachable. It should be possible to catch that failure and try to re-open the connection. I opened a JIRA for this issue (FLINK-3857). Would you like to implement the improvement? 2
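The improvement described, catching the failure and re-opening the connection, can be sketched generically. The `Connection` trait below is hypothetical, not the actual Elasticsearch connector API; the point is that re-opening on every retry re-resolves the DNS name, so a replaced cluster becomes reachable again:

```scala
import scala.util.{Failure, Try}

trait Connection { def send(doc: String): Unit }

// Retry a send, re-opening the connection (and re-resolving DNS) on each attempt.
def sendWithReconnect(open: () => Connection, doc: String, attempts: Int = 3): Try[Unit] = {
  var last: Try[Unit] = Failure(new IllegalStateException("no attempt made"))
  var i = 0
  while (i < attempts && last.isFailure) {
    last = Try(open().send(doc)) // fresh connection per attempt
    i += 1
  }
  last
}
```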

Re: Unable to write stream as csv

2016-05-02 Thread Fabian Hueske
Have you checked the log files as well? 2016-05-01 14:07 GMT+02:00 subash basnet : > Hello there, > > If anyone could help me know why the below *result* DataStream get's > written as text, but not as csv?. As it's in a tuple format I guess it > should be the same for both text and csv. It shows

Re: Count of Grouped DataSet

2016-05-02 Thread Fabian Hueske
Hi Nirmalya, the solution with List.size() won't use a combiner and won't be efficient for large data sets with large groups. I would recommend to add a 1 and use GroupedDataSet.sum(). 2016-05-01 12:48 GMT+02:00 nsengupta : > Hello all, > > This is how I have moved ahead with the implementation
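The recommended pattern, sketched here with Scala collections to show the shape of the computation: map each element to a 1 and sum per group, rather than materializing each group as a list and calling size. In Flink this would be along the lines of `dataSet.map(x => (key(x), 1)).groupBy(0).sum(1)`, which can use a combiner to pre-aggregate before the shuffle:

```scala
val data = Seq("a", "b", "a", "a", "c")

// Attach a 1 to each element, then sum the ones per key.
val counts: Map[String, Int] =
  data.map(x => (x, 1))
      .groupBy(_._1)
      .map { case (k, ones) => k -> ones.map(_._2).sum }

// counts("a") == 3
```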

Re: EMR vCores and slot allocation

2016-05-02 Thread Fabian Hueske
The slot configuration should depend on the complexity of jobs. Since each slot runs a "slice" of a program, one slot might potentially execute many concurrent tasks. For complex jobs you should allocate more than one core for each slot. 2016-05-02 10:12 GMT+02:00 Robert Metzger : > Hi Ken, > s

Re: TypeVariable problems

2016-05-02 Thread Martin Neumann
Hi Aljoscha, Thanks for your answer! I tried using returns but it does not work since the only place where I could call it is within the function that has all the generic types so there is no useful type hint to give. I could make the user hand over the class definition for the type as well but that

Re: TypeVariable problems

2016-05-02 Thread Aljoscha Krettek
Hi, for user functions that have generics, such as you have, you have to manually specify the types somehow. This can either be done using InputTypeConfigurable/OutputTypeConfigurable or maybe using stream.returns(). Cheers, Aljoscha On Fri, 29 Apr 2016 at 12:25 Martin Neumann wrote: > Hej, > >

Perform a groupBy on an already groupedDataset

2016-05-02 Thread Punit Naik
Hello I wanted to perform a groupBy on an already grouped dataset. How do I do this? -- Thank You Regards Punit Naik

Re: Multiple windows with large number of partitions

2016-05-02 Thread Aljoscha Krettek
Hi, what do you mean by "still experiencing the same issues"? Is the key count still very high, i.e. 500k windows? For the watermark generation, specifying a lag of 2 days is very conservative. If the watermark is this conservative I guess there will never arrive elements that are behind the wate

Re: Problem with writeAsText

2016-05-02 Thread Punit Naik
Please ignore this question as I forgot to do a env.execute On Mon, May 2, 2016 at 11:45 AM, Punit Naik wrote: > I have a Dataset which contains only strings. But when I execute a > writeAsText and supply a folder inside the string, it finishes with the > following output but does not write any

Re: Problem in creating quickstart project using archetype (Scala)

2016-05-02 Thread Aljoscha Krettek
Hi, I'm sorry for the inconvenience, for the -SNAPSHOT release versions one must also append the address of the repository to the command, like this: $ mvn archetype:generate \ -DarchetypeGroupId=org.apache.flink \ -DarchetypeArtifactId=flink-quickstart-scala \ -DarchetypeVersion=1.1-SNAP

Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-02 Thread Sendoh
Hi, When using the Elasticsearch connector, is there a way to reflect an IP change of the Elasticsearch cluster? We use the DNS name of Elasticsearch in the data sink, e.g. elasticsearch-dev.foo.de. However, when we replace the old Elasticsearch cluster with a new one, the Elasticsearch connector cannot write into the ne

Re: EMR vCores and slot allocation

2016-05-02 Thread Robert Metzger
Hi Ken, sorry for the late response. The number of CPU cores we show in the web interface is based on what the JVM tells us from "Runtime.getRuntime().availableProcessors();". I'm not sure how the processor count behaves on Amazon VMs. Given that each of your servers has 8 vCores, I would set the