Hi Timo, cool stuff! I agree with Stephan. A separate repository is not necessary because this feature is opaque to users (except for the activation switch) and might therefore be added to flink-core, IMO.
The handling of forwarded fields for group-wise operators in the optimizer is not fully sorted out yet, so that part might need to be adapted (see FLINK-1656 and PR #525).

For the switch we could offer three options:
- deactivated
- activated hinting (write the extracted semantic information to the log)
- activated optimizing (use the extracted semantic information in the optimizer)

Regarding additional checks, we could:
- detect whether a Filter function modifies the record
- check whether a Reduce function returns a new record or the first(?) input record

(Rough sketches of the annotations the analysis would derive and of a possible activation switch are at the bottom of this mail, below the quoted issue.)

2015-03-24 13:07 GMT+01:00 Maximilian Michels (JIRA) <j...@apache.org>:

>
> [ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377761#comment-14377761 ]
>
> Maximilian Michels commented on FLINK-1319:
> -------------------------------------------
>
> This looks like a very promising way to automatically optimize Flink jobs.
>
> +1 for including it in {{flink-staging}}.
> +1 for a switch in the {{ExecutionEnvironment}} to manually turn it on.
>
> > Add static code analysis for UDFs
> > ---------------------------------
> >
> >             Key: FLINK-1319
> >             URL: https://issues.apache.org/jira/browse/FLINK-1319
> >         Project: Flink
> >      Issue Type: New Feature
> >      Components: Java API, Scala API
> >        Reporter: Stephan Ewen
> >        Assignee: Timo Walther
> >        Priority: Minor
> >
> > Flink's optimizer takes information that tells it, for each UDF, which fields of the input elements are accessed, modified, or forwarded/copied. This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can often eliminate sorts and shuffles, which are costly.
> > Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstantFields("0->3, 1, 2->1")}}).
> > We have worked with static code analysis of UDFs before to determine this information automatically. This is an incredible feature, as it "magically" makes programs faster.
> > For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the "Soot" toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far.
> > I propose to add this functionality to Flink in the form of a drop-in addition, to work around the LGPL incompatibility with ASL 2.0. Users could simply download a special "flink-code-analysis.jar" and drop it into the "lib" folder to enable this functionality. We may even add a script to "tools" that downloads that library automatically into the lib folder. This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly).
> > Prior work on this has been done by [~aljoscha] and [~skunert], which could provide a code base to start with.
> > *Appendix*
> > Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/
> > Papers on static analysis and optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf
> > Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6)
> > Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3)
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
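
To make the discussion more concrete, here is the kind of semantic information the analysis would derive automatically. The issue text uses the older {{@ConstantFields}} notation; the sketch below uses the ForwardedFields annotation from the Java API. Please take the exact annotation name and field-expression syntax as an assumption, since these have changed between releases.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.functions.FunctionAnnotation;
    import org.apache.flink.api.java.tuple.Tuple2;

    // Field f0 is copied unchanged to the output; only f1 is recomputed.
    // With this hint the optimizer can keep an existing partitioning or
    // sort order on f0 across the map instead of re-establishing it.
    @FunctionAnnotation.ForwardedFields("f0")
    public class ScaleSecondField
            implements MapFunction<Tuple2<Long, Double>, Tuple2<Long, Double>> {

        @Override
        public Tuple2<Long, Double> map(Tuple2<Long, Double> value) {
            return new Tuple2<>(value.f0, value.f1 * 2.0);
        }
    }

The static analysis would produce exactly this kind of hint (or the equivalent semantic properties) without the user having to write it.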
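
And a rough sketch of how the three-way switch could be exposed. The enum and setter names below are purely illustrative assumptions for the discussion, not an existing API:

    import org.apache.flink.api.java.ExecutionEnvironment;

    public class CodeAnalysisSwitchSketch {

        // Hypothetical modes matching the three options above.
        enum CodeAnalysisMode {
            DISABLED,  // no static analysis at all
            HINT,      // analyze UDFs and only log the extracted semantic information
            OPTIMIZE   // analyze UDFs and feed the information to the optimizer
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Assumed setter, shown for illustration only; whether it lives on the
            // ExecutionEnvironment or the ExecutionConfig is open for discussion.
            // env.getConfig().setCodeAnalysisMode(CodeAnalysisMode.HINT);

            // ... build and execute the program as usual ...
        }
    }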