Re: Som question about Flink stream sql

2017-01-05 Thread Hongyuhong
Thanks, Jark. Do you have any design documents or instructions about row windows? We are interested in adding window support on the StreamSQL side, and it would be great to have some reference. Thanks very much. Best, Yuhong From: Jark Wu [mailto:wuchong...@alibaba-inc.com] Sent: January 6, 2017 11:01 To:

re: Does Flink cluster security works in Flink 1.1.4 release?

2017-01-05 Thread Zhangrucong
Hi Stephan, Thanks for your reply. Do you mean that the CLI, JM, TM, and WebUI support Kerberos authentication only in YARN cluster mode in the 1.1.x release? From: Stephan Ewen [mailto:se...@apache.org] Sent: January 4, 2017 17:23 To: user@flink.apache.org Subject: Re: Does Fl

Re: Som question about Flink stream sql

2017-01-05 Thread Jark Wu
Hi Yuhong, I have an issue assigned for the tumble row window, but it is still under design and discussion. Implementing row windows is more complex than implementing group windows. I will push this issue forward in the next few days. - Jark Wu > On January 5, 2017, at 7:00 PM, Hongyuhong wrote: > > Hi Fabian, >

Re: Triggering a saveppoint failed the Job

2017-01-05 Thread Yassine MARZOUGUI
Hi Stephan, Thank you for creating the JIRA issue. I attached a job that reproduces the bug to the issue page and commented on it. Best, Yassine 2017-01-04 12:55 GMT+01:00 Stephan Ewen : > Hi! > > Thanks for reporting this. > > I created a JIRA issue for it: https://issues.apache.org/ > jira/browse/FL

Re: Regarding ordering of events

2017-01-05 Thread Kostas Kloudas
Hi Abdul, Every window is handled by a single machine, if this is what you mean by “partition”. Kostas > On Jan 5, 2017, at 9:21 PM, Abdul Salam Shaikh > wrote: > > Thanks Fabian and Kostas, > > How can I put the power of Flink as a distributed system to use? > > In cases where we have

Re: Regarding ordering of events

2017-01-05 Thread Abdul Salam Shaikh
Thanks Fabian and Kostas, How can I put the power of Flink as a distributed system to use? In cases where we have multiple windows, is a single window handled entirely by one partition, or is it spread across several partitions? On Thu, Jan 5, 2017 at 12:25 PM, Fabian Hueske wrote: > Flink

failure-rate restart strategy not working?

2017-01-05 Thread Shannon Carey
I recently updated my cluster with the following config: restart-strategy: failure-rate restart-strategy.failure-rate.max-failures-per-interval: 3 restart-strategy.failure-rate.failure-rate-interval: 5 min restart-strategy.failure-rate.delay: 10 s I see the settings inside the JobManager web UI,
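
One way to read the settings quoted above is as a sliding-window check: the job keeps restarting until too many failures fall within one interval. The following is a minimal Python sketch of that reading, not Flink's implementation. The names `FailureRateTracker` and `simulate` are hypothetical, and the exact boundary behaviour (which failure within the interval stops the job) is an assumption.

```python
from collections import deque


class FailureRateTracker:
    """Hypothetical model of a failure-rate restart policy (not Flink code)."""

    def __init__(self, max_failures_per_interval, interval_seconds):
        self.max_failures = max_failures_per_interval
        self.interval = interval_seconds
        self.restarts = deque()  # timestamps of the most recent restarts

    def can_restart(self, now):
        # Assumed rule: refuse a restart once `max_failures` restarts have
        # already happened within the last `interval` seconds.
        if len(self.restarts) < self.max_failures:
            return True
        return now - self.restarts[0] > self.interval

    def record_restart(self, now):
        # Keep only the most recent `max_failures` restart timestamps.
        if len(self.restarts) == self.max_failures:
            self.restarts.popleft()
        self.restarts.append(now)


def simulate(failure_times, max_failures=3, interval_seconds=300):
    """Return the time of the first failure that is not restarted, or None."""
    tracker = FailureRateTracker(max_failures, interval_seconds)
    for t in failure_times:
        if not tracker.can_restart(t):
            return t
        tracker.record_restart(t)
    return None
```

With the quoted values (3 failures per 5 min), `simulate([0, 60, 120, 180])` stops at the fourth failure under this model, while failures spaced 200 seconds apart never exhaust the budget.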

Re: Speedup of Flink Applications

2017-01-05 Thread Hanna Prinz
Hey Fabian and Timur, Thank you for your helpful answers, especially since I'm aware that there is no simple answer to that. Also, I've just started to work with Flink, so I might not have understood everything yet :) From the documentation on task scheduling, I assumed that the JobManager

Kafka KeyedStream source

2017-01-05 Thread Niels Basjes
Hi, In my scenario I have clickstream data that I persist in Kafka. I use the sessionId as the key to instruct Kafka to put everything with the same sessionId into the same Kafka partition. That way I already have all events of a visitor in a single Kafka partition in a fixed order. When I read
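
The keying scheme described above can be illustrated with a small Python sketch, assuming a simple hash-modulo partitioner. The function names and the CRC32 hash are stand-ins chosen for illustration; Kafka's default partitioner uses a different hash function.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Stand-in hash (assumption): any deterministic hash sends equal keys
    # to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(events, num_partitions):
    """Append (session_id, payload) events to the partition chosen by key."""
    partitions = defaultdict(list)
    for session_id, payload in events:
        index = partition_for(session_id, num_partitions)
        partitions[index].append((session_id, payload))
    return partitions


def events_per_session(partitions):
    """Read each partition in order and collect the payloads per session."""
    sessions = defaultdict(list)
    for records in partitions.values():
        for session_id, payload in records:
            sessions[session_id].append(payload)
    return dict(sessions)
```

Because every event of a session lands in one partition and partitions are append-only, reading a partition front to back yields each session's events in the order they were written, even though events of different sessions may interleave.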

Re: Consistency guarantees on multiple sinks

2017-01-05 Thread Stephan Ewen
Expanding on what Paris said: If you have exactly-once sinks (like the Rolling/Bucketing file sink or the Cassandra write-ahead sink), then all of them are correctly adjusted to preserve exactly-once semantics. That holds regardless of whether there are one, two, or n sinks. On Thu, Jan 5, 2017 at 2:47 PM, Pari

Re: Sequential/ordered map

2017-01-05 Thread Fabian Hueske
Please avoid collecting the data to the client using collect(). This operation looks convenient but is only meant for very small data sets; even if it worked for large data sets, it would be a lot slower and less robust. Instead, set the parallelism of the operator to 1. Fabian 2017-01-05 13:18 GMT+

Re: High virtual memory usage

2017-01-05 Thread Stephan Ewen
Happy to hear that! On Thu, Jan 5, 2017 at 1:34 PM, Paulo Cezar wrote: > Hi Stephan, thanks for your support. > > I was able to track down the problem a few days ago. Unirest was the one to > blame; I was using it in some map functions to connect to external services > and for some reason it was usi

Re: Rapidly failing job eventually causes "Not enough free slots"

2017-01-05 Thread Stephan Ewen
Another thought on the container failure: in 1.1, the user code is loaded dynamically whenever a Task is started. That means that on every task restart the code is reloaded. For that to work properly, class unloading needs to happen, or the PermGen will eventually overflow. It can happen that class

Re: Consistency guarantees on multiple sinks

2017-01-05 Thread Paris Carbone
Hi Nancy, Flink’s vanilla rollback recovery mechanism restarts computation from a global checkpoint, so duplicates in the sink output (job output) can occur no matter how many sinks are declared; the whole computation in the failed execution graph will roll back. Cheers, Paris > On 5 Jan 2017, at 14:24,

Consistency guarantees on multiple sinks

2017-01-05 Thread Nancy Estrada
Hi, If a job declares more than one sink, what happens when a failure occurs? Do all the sink operations get aborted (atomically, as in a transactional environment), or are the exactly-once processing guarantees provided only when one sink is declared per job? Is it recommend

Re: Running into memory issues while running on Yarn

2017-01-05 Thread Yury Ruchin
Hi, Your containers were killed by YARN for exceeding virtual memory limits. For some reason your containers allocate virtual memory aggressively while physical memory is still free. There are some gotchas regarding this issue on CentOS, caused by OS-specific aggressive virtual memory allocation: [1],

Increasing parallelism skews/increases overall job processing time linearly

2017-01-05 Thread Chakravarthy varaga
Hi All, I have a job as attached. I have a 16-core blade running RHEL 7. The TaskManager default number of slots is set to 1. The source is a Kafka stream, and each of the 2 sources (topics) has 2 partitions. *What I notice is that when I deploy a job to run with #parallelism=2 the total pro

Re: High virtual memory usage

2017-01-05 Thread Paulo Cezar
Hi Stephan, thanks for your support. I was able to track down the problem a few days ago. Unirest was the one to blame; I was using it in some map functions to connect to external services and for some reason it was using insane amounts of virtual memory. Paulo Cezar On Mon, Dec 19, 2016 at 11:30 AM S

Re: Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hi Chesnay, thanks for the input. Finding a word's first occurrence is part of the algorithm. To be exact, I'm trying to implement Adler's text authorship tracking in Flink (http://www2007.org/papers/paper692.pdf, page 266). Thanks, Sebastian

Re: Rapidly failing job eventually causes "Not enough free slots"

2017-01-05 Thread Yury Ruchin
Hi, I've faced a similar issue recently. I hope sharing my findings will help. The problem can be split into 2 parts: *Source of container failures* The logs you provided indicate that YARN kills its containers for exceeding memory limits. The important point here is that memory limit = JVM heap memory

Re: Sequential/ordered map

2017-01-05 Thread Chesnay Schepler
So given an ordered list of texts, for each word find the earliest text it appears in? As Kostas said, when splitting the text into words, wrap them in a Tuple2 containing the word and text index, and group them by the word. As far as I can tell the next step would be a simple reduce that finds
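
The suggestion above can be sketched in plain Python, without Flink. The reduce step is cut off in the archived message, so taking the minimum text index per word is an assumption based on the stated goal ("find the earliest text it appears in"). The function name `first_occurrence` and the whitespace tokenization are hypothetical.

```python
def first_occurrence(texts):
    """Map each word to the 1-based index of the earliest text containing it."""
    # "flatMap" step: emit one (word, text_index) pair per word.
    pairs = [
        (word, index)
        for index, text in enumerate(texts, start=1)
        for word in text.split()
    ]
    # "groupBy word" followed by a reduce that keeps the smallest index.
    earliest = {}
    for word, index in pairs:
        if word not in earliest or index < earliest[word]:
            earliest[word] = index
    return earliest
```

Because taking a minimum is associative and commutative, the pairs can be processed in any order, which is what makes this formulation parallelizable even though the texts themselves are ordered.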

Re: Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hi Kostas, thanks for the quick reply. > If T_1 must be processed before T_i, i>1, then you cannot parallelize the > algorithm. What would be the best way to process it anyway? DataSet.collect() -> loop over List -> env.fromCollection(...) ? Or with a parallelism of 1 and a .map(...) ? Howeve

Re: Regarding ordering of events

2017-01-05 Thread Fabian Hueske
Flink is a distributed system and does not preserve order across partitions. The number prefix (e.g., 1>, 2>, ...) tells you the parallel instance of the printing operator. You can set the parallelism to 1 to have the stream in order. Fabian 2017-01-05 12:16 GMT+01:00 Kostas Kloudas : > Hi Abdu

Re: Flink Checkpoint runs slow for low load stream

2017-01-05 Thread Chakravarthy varaga
BRILLIANT!!! Checkpoint times are consistent with 1.1.4... Thanks for your formidable support! Best Regards CVP On Wed, Jan 4, 2017 at 5:33 PM, Fabian Hueske wrote: > Hi CVP, > > we recently released Flink 1.1.4, i.e., the next bugfix release of the > 1.1.x series with major robustness impro

Re: Sequential/ordered map

2017-01-05 Thread Kostas Kloudas
Hi Sebastian, If T_1 must be processed before T_i, i>1, then you cannot parallelize the algorithm. If this is not a restriction, then you could: 1) split the text into words and also attach the id of the text they appear in, 2) do a groupBy that will send all the same words to the same node, 3) k

Re: Debugging Python-Api fails with NoClassDefFoundError

2017-01-05 Thread Chesnay Schepler
Hello, all Flink dependencies of the Python API are marked as *provided* in the pom.xml, similar to most connectors. By removing the provided tags in the pom.xml you should be able to run the PythonPlanBinder from the IDE. This was done to exclude these dependencies from the flink-python jar; s

Re: Regarding ordering of events

2017-01-05 Thread Kostas Kloudas
Hi Abdul, Flink provides no ordering guarantees on the elements within a window. The only “order” it guarantees is that the results referring to window-1 are going to be emitted before those of window-2 (assuming that window-1 precedes window-2). Thanks, Kostas > On Jan 5, 2017, at 11:57 AM, Ab

Re: Reply: Som question about Flink stream sql

2017-01-05 Thread Fabian Hueske
There are several JIRA issues for row windows, and some of them might be assigned. I haven't seen any pull requests, and nothing has been merged yet. I would ask about the current status on the JIRA issue. Best, Fabian 2017-01-05 12:00 GMT+01:00 Hongyuhong : > Hi Fabian, > > > > Thanks for the reply. > > As you

Reply: Som question about Flink stream sql

2017-01-05 Thread Hongyuhong
Hi Fabian, Thanks for the reply. As you noticed, row windows are already supported by Calcite, and FLIP-11 has them planned. Can you tell us something about the progress of row windows in the Table API? Regards, Yuhong From: Fabian Hueske [mailto:fhue...@gmail.com] Sent: January 5, 2017 17:43 To: user@flin

Regarding ordering of events

2017-01-05 Thread Abdul Salam Shaikh
Hi, I am using a JSON file as the source for the streaming (in ascending order of the field Umlaufsekunde), which has events as follows: {"event":[{"*Umlaufsekunde*":115}]} {"event":[{"*Umlaufsekunde*":135}]} {"event":[{"*Umlaufsekunde*":135}]} {"event":[{"*Umlaufsekunde*":145}]} {"event":[{"*U

Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hello, I'd like to implement an algorithm which doesn't really look parallelizable to me, but maybe there's a way around it: In general the algorithm looks like this: 1. Take a list of texts T_1 ... T_n 2. For every text T_i (i > 1) do 2.1: Split text into a list of words W_1 ... W_m 2.2: For ev

Re: Speedup of Flink Applications

2017-01-05 Thread Fabian Hueske
Hi Hanna, I assume you are asking about the possible speed up of batch analysis programs and not about streaming applications (please correct me if I'm wrong). Timur raised very good points about data size and skew. Given evenly distributed data (no skewed key distribution for a grouping or join

Re: Caching collected objects in .apply()

2017-01-05 Thread Fabian Hueske
Hi Matt, I think your approach should be fine. Although the second keyBy is logically a shuffle, the data will not be sent over the wire to a different machine if the parallelism of the first and second window operators is identical. It only costs one serialization / deserialization step. I would be

Re: Som question about Flink stream sql

2017-01-05 Thread Fabian Hueske
Hi Yuhong, as you noticed, FLIP-11 is about the window operations on the Table API and does not include SQL. The reason is that the Table API is entirely within Flink's domain, i.e., we can design and implement the API. For SQL we have a dependency on Calcite. You are right that Calcite's JIRA issue fo

Re: Caching collected objects in .apply()

2017-01-05 Thread Matt
I'm still looking for an answer to this question. Hope you can give me some insight! On Thu, Dec 22, 2016 at 6:17 PM, Matt wrote: > Just to be clear, the stream is of String elements. The first part of the > pipeline (up to the first .apply) receives those strings, and returns > objects of anoth

Running into memory issues while running on Yarn

2017-01-05 Thread Sachin Goel
Hey! I'm running locally under this configuration (copied from the NodeManager logs): physical-memory=8192 virtual-memory=17204 virtual-cores=8 Before starting a Flink deployment, memory usage stats show 3.7 GB used on the system, indicating lots of free memory for Flink containers. However, after I submi