Unfortunately, making code generation a separate module would introduce a cyclic dependency. Code generation requires the TypeInfo, which is available in flink-core, and flink-core requires the generated serializers from the code generation module. Do you have a solution for this?
I think if we can come up with a solution, I will implement it as a separate Scala module; otherwise I will stick to Java.

BR,
Gábor

On 18 April 2016 at 12:40, Fabian Hueske <fhue...@gmail.com> wrote:

> +1 for not mixing Java and Scala in flink-core.
>
> Maybe it makes sense to implement the code generated serializers /
> comparators as a separate module which can be plugged-in. This could be
> pure Scala.
> In general, I think it would be good to have some kind of "version
> management" for serializers in place. With features such as savepoints that
> depend on the implementation of serializers, it would be good to have a
> mechanism to switch between implementations.
>
> Best, Fabian
>
> 2016-04-18 10:01 GMT+02:00 Chiwan Park <chiwanp...@apache.org>:
>
> > Yes, I know Janino is a pure Java project. I meant if we add Scala code to
> > flink-core, we should add Scala dependency to flink-core and it could be
> > confusing.
> >
> > Regards,
> > Chiwan Park
> >
> > > On Apr 18, 2016, at 2:49 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
> > >
> > > Chiwan, just to clarify Janino is a Java project. [1]
> > >
> > > [1] https://github.com/aunkrig/janino
> > >
> > > On Mon, Apr 18, 2016 at 3:40 AM, Chiwan Park <chiwanp...@apache.org> wrote:
> > >
> > >> I prefer to avoid Scala dependencies in flink-core. If flink-core includes
> > >> Scala dependencies, Scala version suffix (_2.10 or _2.11) should be added.
> > >> I think that users could be confused.
> > >>
> > >> Regards,
> > >> Chiwan Park
> > >>
> > >>> On Apr 17, 2016, at 3:49 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
> > >>>
> > >>> Hi Gábor,
> > >>>
> > >>> I think that adding the Janino dep to flink-core should be fine, as it has
> > >>> quite slim dependencies [1,2] which are generally orthogonal to Flink's
> > >>> main dependency line (also it is already used elsewhere).
> > >>>
> > >>> As for mixing Scala code that is used from the Java parts of the same maven
> > >>> module I am skeptical. We have seen IDE compilation issues with projects
> > >>> using this setup and have decided that the community-wide potential IDE
> > >>> setup pain outweighs the individual implementation convenience with Scala.
> > >>>
> > >>> [1] https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
> > >>> [2] https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
> > >>>
> > >>> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <xazax....@gmail.com> wrote:
> > >>>
> > >>>> Hi!
> > >>>>
> > >>>> Table API already uses code generation and the Janino compiler [1]. Is it a
> > >>>> dependency that is ok to add to flink-core? In case it is ok, I think I
> > >>>> will use the same in order to be consistent with the other code generation
> > >>>> efforts.
> > >>>>
> > >>>> I started to look at the Table API code generation [2] and it uses Scala
> > >>>> extensively. There are several Scala features that can make Java code
> > >>>> generation easier such as pattern matching and string interpolation. I did
> > >>>> not see any Scala code in flink-core yet. Is it ok to implement the code
> > >>>> generation inside the flink-core using Scala?
> > >>>>
> > >>>> Regards,
> > >>>> Gábor
> > >>>>
> > >>>> [1] http://unkrig.de/w/Janino
> > >>>> [2] https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
> > >>>>
> > >>>> On 18 March 2016 at 19:37, Gábor Horváth <xazax....@gmail.com> wrote:
> > >>>>
> > >>>>> Thank you! I finalized the project.
> > >>>>>
> > >>>>> On 18 March 2016 at 10:29, Márton Balassi <balassi.mar...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Thanks Gábor, now I also see it on the internal GSoC interface. I have
> > >>>>>> indicated that I wish to mentor your project, I think you can hit finalize
> > >>>>>> on your project there.
> > >>>>>>
> > >>>>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <xazax....@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> I have updated this draft to include preliminary benchmarks, mentioned the
> > >>>>>>> interaction of annotations with savepoints, extended it with a timeline,
> > >>>>>>> and some notes about Scala case classes.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Gábor
> > >>>>>>>
> > >>>>>>> On 9 March 2016 at 16:12, Gábor Horváth <xazax....@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Hi!
> > >>>>>>>>
> > >>>>>>>> As far as I can see the formatting was not correct in my previous mail. A
> > >>>>>>>> better formatted version is available here:
> > >>>>>>>> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
> > >>>>>>>> Sorry for that.
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Gábor
> > >>>>>>>>
> > >>>>>>>> On 9 March 2016 at 15:51, Gábor Horváth <xazax....@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi,
> > >>>>>>>>>
> > >>>>>>>>> I did not want to send this proposal out before I have some
> > >>>>>>>>> initial benchmarks, but this issue was mentioned on the mailing list
> > >>>>>>>>> (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html),
> > >>>>>>>>> and I wanted to make this information available to be able to incorporate
> > >>>>>>>>> this into that discussion.
> > >>>>>>>>> I have written this draft with the help of Gábor Gévay and
> > >>>>>>>>> Márton Balassi and I am open to every suggestion.
> > >>>>>>>>>
> > >>>>>>>>> The proposal draft:
> > >>>>>>>>>
> > >>>>>>>>> Code Generation in Serializers and Comparators of Apache Flink
> > >>>>>>>>>
> > >>>>>>>>> I am doing my last semester of my MSc studies and I’m a former GSoC
> > >>>>>>>>> student in the LLVM project. I plan to improve the serialization code in
> > >>>>>>>>> Flink during this summer. The current implementation of the serializers can
> > >>>>>>>>> be a performance bottleneck in some scenarios. These performance problems
> > >>>>>>>>> were also reported on the mailing list recently [1]. I plan to implement
> > >>>>>>>>> code generation into the serializers to improve the performance (as Stephan
> > >>>>>>>>> Ewen also suggested).
> > >>>>>>>>>
> > >>>>>>>>> TODO: I plan to include some preliminary benchmarks in this section.
> > >>>>>>>>>
> > >>>>>>>>> Performance problems with the current serializers
> > >>>>>>>>>
> > >>>>>>>>> 1. PojoSerializer uses reflection for accessing the fields, which is
> > >>>>>>>>>    slow (e.g. [2])
> > >>>>>>>>>    - This is also a serious problem for the comparators
> > >>>>>>>>> 2. When deserializing fields of primitive types (e.g. int), the reusing
> > >>>>>>>>>    overload of the corresponding field serializers cannot really do any
> > >>>>>>>>>    reuse, because boxed primitive types are immutable in Java. This results
> > >>>>>>>>>    in lots of object creations. [3][7]
> > >>>>>>>>> 3. The loop to call the field serializers makes virtual function calls
> > >>>>>>>>>    that cannot be speculatively devirtualized by the JVM or predicted by
> > >>>>>>>>>    the CPU, because different serializer subclasses are invoked for the
> > >>>>>>>>>    different fields. (And the loop cannot be unrolled, because the number
> > >>>>>>>>>    of iterations is not a compile time constant.) See also the following
> > >>>>>>>>>    discussion on the mailing list [1].
> > >>>>>>>>> 4. A POJO field can have the value null, so the serializer inserts 1
> > >>>>>>>>>    byte null tags, which wastes space. (Also, the type extractor logic does
> > >>>>>>>>>    not distinguish between primitive types and their boxed versions, so
> > >>>>>>>>>    even an int field has a null tag.)
> > >>>>>>>>> 5. Subclass tags also add a byte at the beginning of every POJO
> > >>>>>>>>> 6. getLength() does not know the size in most cases [4]
> > >>>>>>>>>    Knowing the size of a type when serialized has numerous performance
> > >>>>>>>>>    benefits throughout Flink:
> > >>>>>>>>>    1. Sorters can work in-place when the type is small [5]
> > >>>>>>>>>    2. Chaining hash tables do not need resizes, because they know how
> > >>>>>>>>>       many buckets to allocate upfront [6]
> > >>>>>>>>>    3. Different hash table architectures could be used, e.g. open
> > >>>>>>>>>       addressing with linear probing instead of some chaining
> > >>>>>>>>>    4. It is possible to deserialize, modify, and then serialize back a
> > >>>>>>>>>       record to its original place, because it cannot happen that the
> > >>>>>>>>>       modified version does not fit in the place allocated there for the
> > >>>>>>>>>       old version (see CompactingHashTable and ReduceHashTable for concrete
> > >>>>>>>>>       instances of this problem)
> > >>>>>>>>>
> > >>>>>>>>> Note that 2. and 3. are problems with not just the PojoSerializer, but
> > >>>>>>>>> also with the TupleSerializer.
> > >>>>>>>>>
> > >>>>>>>>> Solution approaches
> > >>>>>>>>>
> > >>>>>>>>> 1. Run time code generation for every POJO
> > >>>>>>>>>    - Problems 1. and 3. would be automatically solved if the serializers
> > >>>>>>>>>      for POJOs were generated on-the-fly (by, for example, Javassist)
> > >>>>>>>>>    - Problem 2. also needs code generation, and also some extra effort in
> > >>>>>>>>>      the type extractor to distinguish between primitive types and their
> > >>>>>>>>>      boxed versions
> > >>>>>>>>>    - could be used for PojoComparator as well (which could greatly
> > >>>>>>>>>      increase the performance of sorting)
> > >>>>>>>>> 2. Annotations on POJOs (by the users)
> > >>>>>>>>>    - Concretely:
> > >>>>>>>>>      - annotate fields that will never be null -> no null tag needed
> > >>>>>>>>>        before every field!
> > >>>>>>>>>      - make a POJO final -> no subclass tag needed
> > >>>>>>>>>      - annotate a POJO that will never be null -> no top level null
> > >>>>>>>>>        tag needed
> > >>>>>>>>>    - These would also help with the getLength problem (6.), because the
> > >>>>>>>>>      length is often not known because currently anything can be null or a
> > >>>>>>>>>      subclass can appear anywhere
> > >>>>>>>>>    - These annotations could be done without code generation, but then
> > >>>>>>>>>      they would add some overhead when there are no annotations present, so
> > >>>>>>>>>      this would work better together with the code generation
> > >>>>>>>>>    - Tuples would become a special case of POJOs, where nothing can be
> > >>>>>>>>>      null, and no subclass can appear, so maybe we could eliminate the
> > >>>>>>>>>      TupleSerializer
> > >>>>>>>>>    - We could annotate some internal types in Flink libraries (Gelly
> > >>>>>>>>>      (Vertex, Edge), FlinkML)
> > >>>>>>>>>
> > >>>>>>>>> TODO: what is the situation with Scala case classes? Run time code
> > >>>>>>>>> generation is probably easier in Scala? (with quasiquotes)
> > >>>>>>>>>
> > >>>>>>>>> About me
> > >>>>>>>>>
> > >>>>>>>>> I am in the last year of my Computer Science MSc studies at Eotvos Lorand
> > >>>>>>>>> University in Budapest, and planning to start a PhD in the autumn. I have
> > >>>>>>>>> been working for almost three years at Ericsson on static analysis tools
> > >>>>>>>>> for C++. In 2014 I participated in GSoC, working on the LLVM project, and I
> > >>>>>>>>> have been a frequent contributor ever since.
> > >>>>>>>>> The next summer I was interning at Apple.
> > >>>>>>>>>
> > >>>>>>>>> I learned about the Flink project not too long ago and I like it so far.
> > >>>>>>>>> The last few weeks I was working on some tickets to familiarize myself with
> > >>>>>>>>> the codebase:
> > >>>>>>>>>
> > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3422
> > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3322
> > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3457
> > >>>>>>>>>
> > >>>>>>>>> My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
> > >>>>>>>>>
> > >>>>>>>>> References
> > >>>>>>>>>
> > >>>>>>>>> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
> > >>>>>>>>> [2] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
> > >>>>>>>>> [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
> > >>>>>>>>> [4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
> > >>>>>>>>> [5] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
> > >>>>>>>>> [6] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
> > >>>>>>>>> [7] https://issues.apache.org/jira/browse/FLINK-3277
> > >>>>>>>>>
> > >>>>>>>>> Best Regards,
> > >>>>>>>>>
> > >>>>>>>>> Gábor
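
The reflection and null-tag points in the proposal above can be illustrated with a small, self-contained Java sketch. This is not Flink code and does not use Janino or Javassist: the class PojoSerializerSketch, the Point POJO, and both serialize methods are hypothetical names made up for this illustration, and the hand-written serializeSpecialized method only stands in for what a generated serializer might roughly look like. The byte layout is likewise an assumption chosen to mirror the "1 byte null tag per field" description, not Flink's actual wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.Field;

public class PojoSerializerSketch {

    /** Hypothetical POJO, used only for this illustration. */
    public static class Point {
        public int x;
        public String label;

        public Point(int x, String label) {
            this.x = x;
            this.label = label;
        }
    }

    /** Reflection-based path: boxed field access and a 1-byte null tag per field. */
    public static byte[] serializeReflective(Object pojo) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Field field : pojo.getClass().getFields()) {
                Object value = field.get(pojo);   // the int field is boxed here
                out.writeBoolean(value == null);  // null tag, written even for int
                if (value instanceof Integer) {
                    out.writeInt((Integer) value);
                } else if (value instanceof String) {
                    out.writeUTF((String) value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    /** Hand-written stand-in for a generated serializer specialized to Point. */
    public static byte[] serializeSpecialized(Point p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.x);                  // primitive: no boxing, no null tag
            out.writeBoolean(p.label == null);  // only the reference field needs a tag
            if (p.label != null) {
                out.writeUTF(p.label);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Point p = new Point(42, "a");
        System.out.println("reflective:  " + serializeReflective(p).length + " bytes");
        System.out.println("specialized: " + serializeSpecialized(p).length + " bytes");
    }
}
```

For this Point(42, "a") the reflective path writes 9 bytes and the specialized path 8: the only difference is the null tag on the int field. The sketch says nothing about speed; it only shows where the extra tag byte and the boxing come from.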