Re: Hello, a question about Dashboard in Flink

2016-01-28 Thread Philip Lee
Thanks. Is there any way to measure shuffle data (read and write) in Flink or the Dashboard? I did not find a network usage metric in it. Best, Phil On Mon, Jan 25, 2016 at 5:06 PM, Fabian Hueske wrote: > You can start a job and then periodically request and store information > about the runnin…
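Fabian's suggestion amounts to polling the JobManager's monitoring REST API (served alongside the Dashboard, default port 8081). A minimal polling sketch, assuming a /joboverview endpoint; the exact paths, and whether network/shuffle byte counters are exposed at all, vary by Flink version:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class JobStatsPoller {
        public static void main(String[] args) throws Exception {
            // Hypothetical host; the endpoint path may differ between versions.
            URL url = new URL("http://jobmanager-host:8081/joboverview");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON with per-job statistics
                }
            }
        }
    }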

RE: Window stream using timestamp key for time

2016-01-28 Thread Emmanuel
Nice, you guys rock! From: fhue...@gmail.com Date: Thu, 28 Jan 2016 23:34:58 +0100 Subject: Re: Window stream using timestamp key for time To: user@flink.apache.org Hi Emmanuel, the feature you are looking for is called event time processing in Flink. These blog posts should help you to become f…

Re: Window stream using timestamp key for time

2016-01-28 Thread Fabian Hueske
Hi Emmanuel, the feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink: http://fl…
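A minimal event-time sketch along the lines of those posts, assuming records carry a millisecond timestamp; the names MyEvent, MyEventSource, timestamp, key, and value are made up, and AscendingTimestampExtractor is from the 1.x streaming API (older releases spell this differently):

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    env.addSource(new MyEventSource()) // hypothetical source of MyEvent records
        .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<MyEvent>() {
            @Override
            public long extractAscendingTimestamp(MyEvent e) {
                return e.timestamp; // window on the embedded event time, not arrival time
            }
        })
        .keyBy("key")
        .timeWindow(Time.minutes(5))
        .sum("value");

With the timestamps assigned this way, windows are defined by the data's own time, so it no longer matters that the records arrive in upstream batches or bursts.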

Window stream using timestamp key for time

2016-01-28 Thread Emmanuel
Hello, I have used Flink to stream data and do analytics on the stream, using time windows... Now, this is assuming the data is effectively coming in real time. However, I have a use case where the data is 'batched' upstream and comes in bursts, but has a timestamp. It obviously messes up the win…

How to register aggregation convergence criterion to bulk iteration in scala API?

2016-01-28 Thread Fridtjof Sander
Hi, I want to register a custom aggregation convergence criterion to a bulk iteration and I want to use the scala API. It appears to me that this is not possible at the moment, right? The AggregatorRegistry is exposed by IterativeDataSet.java, which is hidden by DataSet.scala: def iterate…
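For comparison, the Java API does expose this hook. A hedged sketch: LongSumAggregator ships with Flink, while MyConvergence stands in for a hypothetical ConvergenceCriterion implementation and stepFunction() for the elided iteration body:

    IterativeDataSet<Tuple2<Long, Double>> iteration = input.iterate(maxIterations);

    iteration.registerAggregationConvergenceCriterion(
            "my.aggregator",          // name the step function uses to find the aggregator
            new LongSumAggregator(),  // accumulates one value per superstep
            new MyConvergence());     // inspects the aggregate and decides when to stop

    DataSet<Tuple2<Long, Double>> result =
            iteration.closeWith(stepFunction(iteration)); // stepFunction is hypothetical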

Re: Mixing Batch & Streaming

2016-01-28 Thread Nick Dimiduk
If the dataset is too large for a file, you can put it behind a service and have your stream operators query the service for enrichment. You can even support updates to that dataset in a style very similar to the "lambda architecture" discussed elsewhere. On Thursday, January 28, 2016, Fabian Hues…
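A hedged sketch of that pattern, querying a service per element from a rich function; EnrichmentClient and its lookup() method are hypothetical stand-ins for whatever client your service provides:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class EnrichFunction extends RichMapFunction<Event, EnrichedEvent> {
        private transient EnrichmentClient client; // not serializable, created per task

        @Override
        public void open(Configuration parameters) {
            client = new EnrichmentClient("enrichment-service:8080"); // hypothetical endpoint
        }

        @Override
        public EnrichedEvent map(Event e) {
            // A real implementation would cache results and handle lookup failures.
            return new EnrichedEvent(e, client.lookup(e.getKey()));
        }
    }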

Writing Parquet files with Flink

2016-01-28 Thread Flavio Pompermaier
Hi to all, I was reading about optimal Parquet file size and HDFS block size. The ideal situation for Parquet is when its block size (and thus the maximum size of each row group) is equal to the HDFS block size. The default behaviour of Flink is that the output file's size depends on the output pa…
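The alignment itself is a matter of Hadoop configuration. A hedged sketch, assuming the standard dfs.blocksize and parquet.block.size keys; whether a given Flink OutputFormat actually picks these up depends on how it is wired to the Hadoop Configuration:

    org.apache.hadoop.conf.Configuration hadoopConf =
            new org.apache.hadoop.conf.Configuration();
    long blockSize = 128L * 1024 * 1024;                 // 128 MB, a common HDFS default
    hadoopConf.setLong("dfs.blocksize", blockSize);      // HDFS block size for new files
    hadoopConf.setLong("parquet.block.size", blockSize); // Parquet row-group size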

Re: about blob.storage.dir and .buffer files

2016-01-28 Thread Stephan Ewen
Hi Gwenhael! Let's look into this and fix anything we find. Can you briefly tell us: - How much data is in the blob-store directory, versus in the buffer files? - How many buffer files do you have, and how large are they on average? Greetings, Stephan On Thu, Jan 28, 2016 at 10:18 AM, Ti…

Re: Streaming left outer join

2016-01-28 Thread Aljoscha Krettek
Hi, I think hacking the StreamJoinOperator could work for you. In Flink, a StreamOperator is essentially a more low-level version of a FlatMap. It receives elements via the processElement() method and can emit elements using the Output object output, which is a souped-up Collector that also allows…
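A hedged skeleton of such an operator; the StreamOperator interfaces are internal API, so the exact signatures vary across versions:

    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.api.watermark.Watermark;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    public class PassThroughOperator<T>
            extends AbstractStreamOperator<T>
            implements OneInputStreamOperator<T, T> {

        @Override
        public void processElement(StreamRecord<T> element) throws Exception {
            // A join would buffer one side here and emit matches (or, for an outer
            // join, unmatched records once the window closes).
            output.collect(element);
        }

        @Override
        public void processWatermark(Watermark mark) throws Exception {
            output.emitWatermark(mark); // forward watermarks downstream unchanged
        }
    }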

Re: Streaming left outer join

2016-01-28 Thread Alexander Gryzlov
Hello Stephan, Yes, I've seen this one before, but AFAIU this is a different use case: they need an inner join with 2 different windows, whereas I'm OK with a single window but need an outer join with different semantics... Their StreamJoinOperator, however, looks roughly fitting, so I'll probably…

Re: Flume source support

2016-01-28 Thread Alexandr Dzhagriev
Hi Robert, Thanks for the quick reply. I'll try. Best regards, Alex. On Thu, Jan 28, 2016 at 11:09 AM, Robert Metzger wrote: > Hi Alex, > > I'm sorry that you found that class commented out. I think so far nobody was > interested in using the Flume source. It was commented out during a > refactorin…

Re: Flume source support

2016-01-28 Thread Robert Metzger
Hi Alex, I'm sorry that you found that class commented out. I think so far nobody was interested in using the Flume source. It was commented out during a refactoring of the stream sources: https://github.com/apache/flink/pull/659#discussion-diff-29854032. I just looked a bit through the (commented) cod…

Re: Kafka+Flink

2016-01-28 Thread Robert Metzger
Hi Vinaya, this blog post explains how to connect Flink and Kafka: http://data-artisans.com/kafka-flink-a-practical-how-to/ On Thu, Jan 28, 2016 at 1:28 AM, Vinaya M S wrote: > Hi, > > I have a 3-node Kafka cluster. In the server.properties file of each of them > I'm setting > advertised.host.name:…
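For reference, the consumer setup from that era looks roughly like the sketch below; the class name FlinkKafkaConsumer082 is tied to the 0.10.x connector (newer releases use different class names), and the broker/zookeeper addresses and topic are placeholders:

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
    props.setProperty("zookeeper.connect", "zk1:2181"); // required by the 0.8-style consumer
    props.setProperty("group.id", "my-flink-group");

    DataStream<String> stream = env.addSource(
            new FlinkKafkaConsumer082<>("my-topic", new SimpleStringSchema(), props));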

Re: cluster execution

2016-01-28 Thread Till Rohrmann
Hi Lydia, what do you mean by master? Usually, when you submit a program to the cluster and don't specify the parallelism in your program, it will be executed with the parallelism.default value as its parallelism. You can specify the value in your cluster configuration flink-conf.yaml file. Al…
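Besides parallelism.default in the configuration, the parallelism can also be set per job or per operator; a minimal sketch (the input path and MyMapper are hypothetical):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(3); // whole-job parallelism, overrides the configured default

    env.readTextFile("hdfs:///path/to/input") // hypothetical path
       .map(new MyMapper())
       .setParallelism(2);                    // per-operator override

The -p flag of bin/flink run sets the job parallelism at submission time as well.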

Re: about blob.storage.dir and .buffer files

2016-01-28 Thread Till Rohrmann
Hi Gwenhael, in theory the blob storage files can be any binary data. At the moment, this is however only used to distribute the user code jars. The jars are kept around as long as the job is running. At every library-cache-manager.cleanup.interval, the files are checked and those which are n…

Re: Mixing Batch & Streaming

2016-01-28 Thread Fabian Hueske
Hi, this is currently not supported yet. However, this feature is on our roadmap and has been requested a few times, so I hope somebody will pick it up soon. If the static data set is small enough, you can read the full data set (e.g., as a file) in the open() method of a FlatMapFunction, build a h…
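A hedged sketch of that approach; the file path, the one-key-value-per-line CSV format, and the Event/getKey() types are assumptions:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class JoinWithStatic extends RichFlatMapFunction<Event, Tuple2<Event, String>> {
        private transient Map<String, String> lookup;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Load the small static data set once per task instance.
            lookup = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get("/data/static.csv"))) {
                String[] parts = line.split(",", 2);
                lookup.put(parts[0], parts[1]); // key -> enrichment value
            }
        }

        @Override
        public void flatMap(Event e, Collector<Tuple2<Event, String>> out) {
            String match = lookup.get(e.getKey());
            if (match != null) {
                out.collect(new Tuple2<>(e, match)); // inner-join semantics
            }
        }
    }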

cluster execution

2016-01-28 Thread Lydia Ickler
Hi all, I am doing some operations on a DataSet<…> (see code below). When I run my program on a cluster with 3 machines, I can see within the web client that only my master is executing the program. Do I have to specify somewhere that all machines have to participate? Usually the cluster execute…