RE: about blob.storage.dir and .buffer files

2016-01-29 Thread Gwenhael Pasquiers
Hi ! Here are the answers : - How much data is in the blob-store directory, versus in the buffer files? : around 20MB per application versus 20gb ( one specific windowing app may be). - How many buffer files do you have and how large are they in average? : standard apps ( no buffer file

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
Yes, make both block sizes the same and you're good. I think you can neglect the overhead, unless we are not talking about 1000's of small files (smaller than block size). 2016-01-29 12:06 GMT+01:00 Flavio Pompermaier : > So there's no need to worry about the number of parquet files size from > t

Re: Writing Parquet files with Flink

2016-01-29 Thread Flavio Pompermaier
So there's no need to worry about the number of parquet files size from the Flink point of view if I set correctly the parquet block size (equal to the HDFS block size)... It only affects the Parquet file overhead (header and footer present in each file) and the HDFS resources required to handle th

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
The number of input splits does not depend on the number of files but on the number of HDFS blocks of all files. Reading a single file with 100 HDFS blocks and reading of 100 files with 1 block each should be divided into 100 input splits which can be read by 100 tasks concurrently (or less tasks w

Re: Writing Parquet files with Flink

2016-01-29 Thread Flavio Pompermaier
Hi Fabian, thanks for the response! >From what is my understanding (correct me if I'm wrong) once I produce some Parquet dir that I want to read later, the number of files in the dir affects the initial parallelism of the next job, i.e.: - If I have less files than available tasks I will not fully

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
Hi Flavio, using a default FileOutputFormat, Flink writes one output file for each data sink task, i.e., as many files as the defined parallelism. The size of these files depends on the total output size and the distribution. If you write to HDFS, a file consists of one or more HDFS blocks. Parque

Re: Issue on watermark meaning

2016-01-29 Thread Till Rohrmann
Hi Lorenzo, you're right that we should stick to the same terminology between the online documentation and the code, otherwise it's confusing. In this case, though, a lower numeric timestamp is equivalent to an older event. The older an element is, the lower is its timestamp. However, there is a

Re: Flink stream data ordering/sequence

2016-01-29 Thread Fabian Hueske
Hi Sana, The feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink: http://flink.

Flink stream data ordering/sequence

2016-01-29 Thread Sane Lee
Hi all, Do flink have mechanism for dealing with ordered streams? As far as I know , flink attaches timestamps to data items once they are processed. What about the timestamps are already in data items? For example, if some data item is missing from particular window (according to its ordering/seq

Issue on watermark meaning

2016-01-29 Thread Lorenzo Affetti
Hi everybody, I want to signal that I think there is a mismatch between what is the meaning of emitting a watermark between the code and the documentation: from Flink docs : A watermark with a certa

Re: How it is used internally Akka framework in Flink?

2016-01-29 Thread Stephan Ewen
Hi! This wiki page describes roughly how Flink uses Akka for coordination between JobManager and TaskManager https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors Stephan On Fri, Jan 29, 2016 at 9:25 AM, Lorena Reis wrote: > Hi all, > > I'd like to know how it is used internally

Re: Hello, a question about Dashborad in Flink

2016-01-29 Thread Stephan Ewen
Hi! The REST monitoring interface and extended web dashboard were added in version 0.10 Greetings, Stephan On Fri, Jan 29, 2016 at 9:55 AM, Philip Lee wrote: > Great, > > you menat the difference between narrow shuffle and global shuffle? > > I use Flink version 0.9, > but it did not not work

Re: Hello, a question about Dashborad in Flink

2016-01-29 Thread Philip Lee
Great, you menat the difference between narrow shuffle and global shuffle? I use Flink version 0.9, but it did not not work to access REST interface when I use "ssh tunnel" to remote server. it is from version of probelm? Best, Phil On Fri, Jan 29, 2016 at 9:46 AM, Fabian Hueske wrote: > T

Re: Hello, a question about Dashborad in Flink

2016-01-29 Thread Fabian Hueske
The REST interface does also provide metrics about the number of records and the size of the input and output of all tasks. See: - /jobs//vertices/ - /jobs//vertices//subtasks//attempts/ in https://ci.apache.org/projects/flink/flink-docs-release-0.10/internals/monitoring_rest_api.html#details-of-a-

How it is used internally Akka framework in Flink?

2016-01-29 Thread Lorena Reis
Hi all, I'd like to know how it is used internally Akka framework in Flink. I just found this https://flink.apache.org/news/2015/06/24/announcing-apache-flink-0.9.0-release.html where it mentions: "Flinkā€™s RPC system has been replaced by the widely adopted Akka framework. Akka...