RE: Expected duration for cascading-flink tests?

2016-03-29 Thread Ken Krugler
Hi Fabian, > From: Fabian Hueske > Sent: March 29, 2016 3:51:08pm PDT > To: dev@flink.apache.org > Subject: Re: Expected duration for cascading-flink tests? > > Hi Ken, > > no, this is definitely not expected. The tests complete in about 30 mins on > my machine. > Is it possible that you have an

[jira] [Created] (FLINK-3680) Remove or improve (not set) text in the Job Plan UI

2016-03-29 Thread Jamie Grier (JIRA)
Jamie Grier created FLINK-3680: -- Summary: Remove or improve (not set) text in the Job Plan UI Key: FLINK-3680 URL: https://issues.apache.org/jira/browse/FLINK-3680 Project: Flink Issue Type: Bug

Re: a typical ML algorithm flow

2016-03-29 Thread Dmitriy Lyubimov
BTW thank you for educating me on this. I think it's actually a wonderful capability, along with the capability of broadcasting distributed sets to map operators, it means (I hope) that fine-grained, centralized scheduling and centralized broadcasting we find in Spark analogous algorithms could be

Re: a typical ML algorithm flow

2016-03-29 Thread Dmitriy Lyubimov
Thanks. Regardless of the rationale, i wanted to confirm if the iteration is lazily evaluated-only thing and it sounds eager evaluation inside (and collection) is not possible, and the algorithms that need it, just will have to work around this. I think this answers my question -- thanks! -d On

Re: Range partitioning

2016-03-29 Thread Fabian Hueske
Hi Dawid, this is expected behavior. A partitioning will only be valid to the point that you change the parallelism. In the modified program the data will be correctly partitioned (lets say into 8 partitions if the default parallelism is 8). After the partitioning, the 8 partitions have to be red

Re: Expected duration for cascading-flink tests?

2016-03-29 Thread Fabian Hueske
Hi Ken, no, this is definitely not expected. The tests complete in about 30 mins on my machine. Is it possible that you have another Flink process running on your machine (maybe a debug thread in your IDE)? That could explain the "Address already in use" exceptions. Best, Fabian 2016-03-29 20:36

Re: A whole bag of ML issues

2016-03-29 Thread Trevor Grant
I was thinking that all IterativeSolvers would benefit from a setOptimizer method. I didn't realize you had been working on GLM. If that is the case (which I think is wise) then feel free to put a setOptimizer in GLM, I'll leave it in my NeuralNetworks, and lets just try to have some consistency i

[jira] [Created] (FLINK-3679) DeserializationSchema should handle zero or more outputs for every input

2016-03-29 Thread Jamie Grier (JIRA)
Jamie Grier created FLINK-3679: -- Summary: DeserializationSchema should handle zero or more outputs for every input Key: FLINK-3679 URL: https://issues.apache.org/jira/browse/FLINK-3679 Project: Flink

Re: A whole bag of ML issues

2016-03-29 Thread Theodore Vasiloudis
> Adding a setOptimizer to IterativeSolver. Do you mean MLR here? IterativeSolver is implemented by different solvers, I don't think adding a method like this makes sense there. In the case of MLR a better alternative that includes a bit more work is to create a Generalized Linear Model framework

Re: A whole bag of ML issues

2016-03-29 Thread Trevor Grant
OK, I'm trying to respond to you and Till in one thread so someone call me out if I missed a point but here goes: SGD Predicting Vectors : There was discussion in the past regarding this- at the time it was decided to go with only Doubles for simplicity. I feel strongly that there is cause now fo

Range partitioning

2016-03-29 Thread Dawid Wysakowicz
Hi all, recently I am working on FLINK-2946 and I am supposed to use range partitioning, but I am not sure about the behaviour. I've adjusted a little bit PartitionITCase#testRangePartitionerOnSequenceData so to set custom parallelism after partit

RE: Expected duration for cascading-flink tests?

2016-03-29 Thread Ken Krugler
An update (and a nudge)… So far it's been more than 20 hours, and the tests are still running. Most tests seem to fail with one of two different errors… 1. Address already in use cascading.flow.FlowException: [test] unhandled exception at cascading.flow.BaseFlow.complete(BaseFlow.java:9

Re: 答复: 答复: Effort to add SQL / StreamSQL to Flink

2016-03-29 Thread Vasiliki Kalavri
Great to see people excited about this :) SQL is indeed coming up next. We should have the SQL on DataSets programs (see FLINK-3640 [1]) pretty soon. -Vasia. [1]: https://issues.apache.org/jira/browse/FLINK-3640 On 29 March 2016 at 14:02, Jiangsong (Hi) wrote: > So excited!! SQL on Flink is

Re: [DISCUSS] Release 1.0.1 Bugfix release

2016-03-29 Thread Suneel Marthi
what about PR# 1829? We need it real bad for an upcoming release. On Tue, Mar 29, 2016 at 1:17 PM, Ufuk Celebi wrote: > I just merged #1830 and will create a RC after cherry picking > '[FLINK-3636] Add ThrottledIterator to WindowJoin jar' to the > release-1.0 branch. > > Thanks for kicking off t

Re: [DISCUSS] Release 1.0.1 Bugfix release

2016-03-29 Thread Ufuk Celebi
I just merged #1830 and will create a RC after cherry picking '[FLINK-3636] Add ThrottledIterator to WindowJoin jar' to the release-1.0 branch. Thanks for kicking off the discussion Aljoscha! – Ufuk On Thu, Mar 24, 2016 at 5:07 AM, Chiwan Park wrote: > +1 > > Regards, > Chiwan Park > >> On Mar

withBroadcastSet for a DataStream missing?

2016-03-29 Thread Stavros Kontopoulos
H i am new here... I am trying to implement online k-means as here https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html with flink. I dont see anywhere a withBroadcastSet call to save intermediate results is this currently supported? Is intermediate results state

[jira] [Created] (FLINK-3678) Make Flink logs directory configurable

2016-03-29 Thread Stefano Baghino (JIRA)
Stefano Baghino created FLINK-3678: -- Summary: Make Flink logs directory configurable Key: FLINK-3678 URL: https://issues.apache.org/jira/browse/FLINK-3678 Project: Flink Issue Type: Improvem

[jira] [Created] (FLINK-3677) FileInputFormat: Allow to specify include/exclude file name patterns

2016-03-29 Thread Maximilian Michels (JIRA)
Maximilian Michels created FLINK-3677: - Summary: FileInputFormat: Allow to specify include/exclude file name patterns Key: FLINK-3677 URL: https://issues.apache.org/jira/browse/FLINK-3677 Project:

Re: RichMapPartitionFunction - problems with collect

2016-03-29 Thread Sergio Ramírez
Hi again, I've not been able to solve the problem with the instruction you gave me. I've tried with static variables (matrices) also unsuccessfully. I've also tried this simpler code: def mapPartition(it: java.lang.Iterable[LabeledVector], out: Collector[((Int, Int), Int)]): Unit = {

Re: a typical ML algorithm flow

2016-03-29 Thread Theodore Vasiloudis
@Shannon What you are talking about is available for the DataSet API through the iterateWithTermination function. See the API docs and Iterations page

[jira] [Created] (FLINK-3676) WebClient hasn't been removed from the docs

2016-03-29 Thread Maximilian Michels (JIRA)
Maximilian Michels created FLINK-3676: - Summary: WebClient hasn't been removed from the docs Key: FLINK-3676 URL: https://issues.apache.org/jira/browse/FLINK-3676 Project: Flink Issue Typ

[jira] [Created] (FLINK-3675) YARN ship folder incosistent behavior

2016-03-29 Thread Stefano Baghino (JIRA)
Stefano Baghino created FLINK-3675: -- Summary: YARN ship folder incosistent behavior Key: FLINK-3675 URL: https://issues.apache.org/jira/browse/FLINK-3675 Project: Flink Issue Type: Bug

Re: a typical ML algorithm flow

2016-03-29 Thread Shannon Quinn
Apologies for hijacking, but this thread hits right at my last message to this list (looking to implement native iterations in the PyFlink API). I'm particularly interested in custom convergence criteria, often centered around measuring some sort of squared loss and checking if it falls below

Re: Behavior of lib directory shipping on YARN

2016-03-29 Thread Stefano Baghino
Yup, I shall open an issue for both this one and my other thread (re: Kerberos). Thanks for the pointer on this issue. On Tue, Mar 29, 2016 at 12:44 PM, Maximilian Michels wrote: > Hi Stefano, > > Thanks for pointing out this bug. Your analysis is correct. The per-job > cluster does not ship the

答复: 答复: Effort to add SQL / StreamSQL to Flink

2016-03-29 Thread Jiangsong (Hi)
So excited!! SQL on Flink is ready? Are there any show case or howto use? -邮件原件- 发件人: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] 代表 Stephan Ewen 发送时间: 2016年3月29日 20:00 收件人: dev@flink.apache.org 主题: Re: 答复: Effort to add SQL / StreamSQL to Flink Cool stuff! SQL coming up

Re: 答复: Effort to add SQL / StreamSQL to Flink

2016-03-29 Thread Stephan Ewen
Cool stuff! SQL coming up next? ;-) On Tue, Mar 29, 2016 at 1:39 PM, Maximilian Michels wrote: > Yeah! I'm a little late to the party but exciting stuff! :) > > On Fri, Mar 18, 2016 at 3:15 PM, Vasiliki Kalavri < > vasilikikala...@gmail.com > > wrote: > > > Hi all, > > > > tableOnCalcite has b

Re: Apache Flink: aligning watermark among parallel tasks

2016-03-29 Thread Maximilian Michels
Hi Ozan, You probably want to look at a custom Trigger implementation. Please see the different triggers in org/apache/flink/streaming/api/windowing/triggers/. You can write your own event-based trigger. Best thing would be to extend the EventTimeTrigger with your logic. Then you can use windowed

Re: 答复: Effort to add SQL / StreamSQL to Flink

2016-03-29 Thread Maximilian Michels
Yeah! I'm a little late to the party but exciting stuff! :) On Fri, Mar 18, 2016 at 3:15 PM, Vasiliki Kalavri wrote: > Hi all, > > tableOnCalcite has been merged to master :) > > Cheers, > -Vasia. > > On 17 March 2016 at 11:11, Fabian Hueske wrote: > > > Thanks for the initiative Vasia! > > I w

Re: Behavior of lib directory shipping on YARN

2016-03-29 Thread Maximilian Michels
Hi Stefano, Thanks for pointing out this bug. Your analysis is correct. The per-job cluster does not ship the /lib directory by default. Would you like to open an issue/PR? We should let the ship_path default to the /lib directory. The mechanism with the environment variables is the same. They us

[jira] [Created] (FLINK-3674) Add an interface for EventTime aware User Function

2016-03-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-3674: --- Summary: Add an interface for EventTime aware User Function Key: FLINK-3674 URL: https://issues.apache.org/jira/browse/FLINK-3674 Project: Flink Issue Type: Ne

[jira] [Created] (FLINK-3673) Annotations for code generation

2016-03-29 Thread Gabor Horvath (JIRA)
Gabor Horvath created FLINK-3673: Summary: Annotations for code generation Key: FLINK-3673 URL: https://issues.apache.org/jira/browse/FLINK-3673 Project: Flink Issue Type: Sub-task

[jira] [Created] (FLINK-3672) Code generation for POJO comparators

2016-03-29 Thread Gabor Horvath (JIRA)
Gabor Horvath created FLINK-3672: Summary: Code generation for POJO comparators Key: FLINK-3672 URL: https://issues.apache.org/jira/browse/FLINK-3672 Project: Flink Issue Type: Sub-task

[jira] [Created] (FLINK-3671) Code generation for POJO serializer

2016-03-29 Thread Gabor Horvath (JIRA)
Gabor Horvath created FLINK-3671: Summary: Code generation for POJO serializer Key: FLINK-3671 URL: https://issues.apache.org/jira/browse/FLINK-3671 Project: Flink Issue Type: Sub-task

Re: a typical ML algorithm flow

2016-03-29 Thread Till Rohrmann
Hi, Chiwan’s example is perfectly fine and it should also work with general EM algorithms. Moreover, it is the recommended way how to implement iterations with Flink. The iterateWithTermination API call generates a lazily evaluated data flow with an iteration operator. This plan will only be execu

Re: A whole bag of ML issues

2016-03-29 Thread Till Rohrmann
Hi Trevor, great to hear that you have a working prototype :-) And it is also good that you shared your insights you gained when implementing it. Flink’s ML library is far from perfect and, thus, all kinds of feedback is highly valuable. In general it is always good to contribute code back if you

Re: A whole bag of ML issues

2016-03-29 Thread Theodore Vasiloudis
Hello Trevor, These are indeed a lot of issues, let's see if we can fit the discussion for all of them in one thread. I'll add some comments inline. - Expand SGD to allow for predicting vectors instead of just Doubles. We have discussed this in the past and at that point decided that it didn't

Re: Kerberos for Streaming & Kafka

2016-03-29 Thread Maximilian Michels
Hi Eron, Thank you for your feedback! Indeed, we have seen in the past, that Hadoop's Delegation Tokens are not meant to renewed over a long period. Plus, they have a number of subtle bugs in older versions that sometimes prevent renewal. What you suggest, sounds like a good approach to me. It wo

[jira] [Created] (FLINK-3670) Kerberos: Improving long-running streaming jobs

2016-03-29 Thread Maximilian Michels (JIRA)
Maximilian Michels created FLINK-3670: - Summary: Kerberos: Improving long-running streaming jobs Key: FLINK-3670 URL: https://issues.apache.org/jira/browse/FLINK-3670 Project: Flink Issue