I prefer to avoid Scala dependencies in flink-core. If flink-core included Scala dependencies, a Scala version suffix (_2.10 or _2.11) would have to be added to the artifact name, which I think could confuse users.
Regards,
Chiwan Park

> On Apr 17, 2016, at 3:49 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
>
> Hi Gábor,
>
> I think that adding the Janino dep to flink-core should be fine, as it has
> quite slim dependencies [1,2] which are generally orthogonal to Flink's
> main dependency line (it is also already used elsewhere).
>
> As for mixing Scala code that is used from the Java parts of the same maven
> module, I am skeptical. We have seen IDE compilation issues with projects
> using this setup and have decided that the community-wide potential IDE
> setup pain outweighs the individual implementation convenience of Scala.
>
> [1] https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
> [2] https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
>
> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <xazax....@gmail.com> wrote:
>
>> Hi!
>>
>> The Table API already uses code generation and the Janino compiler [1].
>> Is that a dependency that is OK to add to flink-core? If it is, I think
>> I will use the same one, in order to be consistent with the other code
>> generation efforts.
>>
>> I started to look at the Table API code generation [2], and it uses
>> Scala extensively. Several Scala features, such as pattern matching and
>> string interpolation, can make Java code generation easier. I have not
>> seen any Scala code in flink-core yet. Is it OK to implement the code
>> generation inside flink-core using Scala?
>>
>> Regards,
>> Gábor
>>
>> [1] http://unkrig.de/w/Janino
>> [2] https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>>
>> On 18 March 2016 at 19:37, Gábor Horváth <xazax....@gmail.com> wrote:
>>
>>> Thank you! I finalized the project.
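Janino, which the thread above discusses as a dependency, compiles generated Java source at run time. Below is a minimal sketch of that technique; to keep the sketch dependency-free it uses the JDK's javax.tools compiler instead of Janino's own API, and the GeneratedAdder class it compiles is purely illustrative, not actual Flink generated code.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class RuntimeCompileDemo {

    // Compiles a string of Java source at run time, loads the resulting
    // class, and invokes a static method on it.
    public static int compileAndRun() throws Exception {
        // Hypothetical generated source -- stands in for the serializer
        // code a generator would emit.
        String src =
            "public class GeneratedAdder {\n"
          + "    public static int add(int a, int b) { return a + b; }\n"
          + "}\n";

        Path dir = Files.createTempDirectory("codegen");
        Path file = dir.resolve("GeneratedAdder.java");
        Files.write(file, src.getBytes("UTF-8"));

        // Janino would compile this fully in memory; the JDK compiler is
        // used here only so the sketch needs no external dependency.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null
                || compiler.run(null, null, null, file.toString()) != 0) {
            throw new IllegalStateException("compilation failed");
        }

        try (URLClassLoader loader =
                 new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
            Class<?> cls = loader.loadClass("GeneratedAdder");
            return (int) cls.getMethod("add", int.class, int.class)
                           .invoke(null, 21, 21);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compileAndRun()); // prints 42
    }
}
```

Janino does the same job in memory and with a much lighter footprint, which is one reason the Table API uses it.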
>>>
>>> On 18 March 2016 at 10:29, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>
>>>> Thanks Gábor, now I also see it on the internal GSoC interface. I have
>>>> indicated that I wish to mentor your project; I think you can hit
>>>> finalize on your project there.
>>>>
>>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have updated this draft to include preliminary benchmarks, mentioned
>>>>> the interaction of annotations with savepoints, and extended it with a
>>>>> timeline and some notes about Scala case classes.
>>>>>
>>>>> Regards,
>>>>> Gábor
>>>>>
>>>>> On 9 March 2016 at 16:12, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> As far as I can see, the formatting was not correct in my previous
>>>>>> mail. A better formatted version is available here:
>>>>>> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
>>>>>> Sorry for that.
>>>>>>
>>>>>> Regards,
>>>>>> Gábor
>>>>>>
>>>>>> On 9 March 2016 at 15:51, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I did not want to send this proposal out before I had some initial
>>>>>>> benchmarks, but the issue was mentioned on the mailing list (
>>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>>>>>>> ), and I wanted to make this information available so it can be
>>>>>>> incorporated into that discussion. I have written this draft with
>>>>>>> the help of Gábor Gévay and Márton Balassi, and I am open to every
>>>>>>> suggestion.
>>>>>>>
>>>>>>> The proposal draft:
>>>>>>> Code Generation in the Serializers and Comparators of Apache Flink
>>>>>>>
>>>>>>> I am doing the last semester of my MSc studies and I am a former
>>>>>>> GSoC student in the LLVM project.
I plan to improve the serialization
>>>>>>> code in Flink during this summer. The current implementation of the
>>>>>>> serializers can be a performance bottleneck in some scenarios. These
>>>>>>> performance problems were also reported on the mailing list recently
>>>>>>> [1]. I plan to implement code generation in the serializers to
>>>>>>> improve their performance (as Stephan Ewen also suggested).
>>>>>>>
>>>>>>> TODO: I plan to include some preliminary benchmarks in this section.
>>>>>>>
>>>>>>> Performance problems with the current serializers
>>>>>>>
>>>>>>> 1. PojoSerializer uses reflection for accessing the fields, which is
>>>>>>> slow (e.g. [2]). This is also a serious problem for the comparators.
>>>>>>>
>>>>>>> 2. When deserializing fields of primitive types (e.g. int), the
>>>>>>> reusing overload of the corresponding field serializers cannot
>>>>>>> really do any reuse, because boxed primitive types are immutable in
>>>>>>> Java. This results in lots of object creations. [3][7]
>>>>>>>
>>>>>>> 3. The loop that calls the field serializers makes virtual function
>>>>>>> calls that cannot be speculatively devirtualized by the JVM or
>>>>>>> predicted by the CPU, because different serializer subclasses are
>>>>>>> invoked for the different fields. (And the loop cannot be unrolled,
>>>>>>> because the number of iterations is not a compile-time constant.)
>>>>>>> See also the discussion on the mailing list [1].
>>>>>>>
>>>>>>> 4. A POJO field can have the value null, so the serializer inserts
>>>>>>> 1-byte null tags, which wastes space. (Also, the type extractor
>>>>>>> logic does not distinguish between primitive types and their boxed
>>>>>>> versions, so even an int field has a null tag.)
>>>>>>>
>>>>>>> 5.
>>>>>>> Subclass tags also add a byte at the beginning of every POJO.
>>>>>>>
>>>>>>> 6. getLength() does not know the size in most cases [4]. Knowing
>>>>>>> the size of a type when serialized has numerous performance benefits
>>>>>>> throughout Flink:
>>>>>>>
>>>>>>>    1. Sorters can sort in-place when the type is small [5].
>>>>>>>    2. Chaining hash tables do not need resizes, because they know
>>>>>>>       how many buckets to allocate upfront [6].
>>>>>>>    3. Different hash table architectures could be used, e.g. open
>>>>>>>       addressing with linear probing instead of some form of
>>>>>>>       chaining.
>>>>>>>    4. It is possible to deserialize, modify, and then serialize a
>>>>>>>       record back to its original place, because it cannot happen
>>>>>>>       that the modified version does not fit in the place allocated
>>>>>>>       for the old version (see CompactingHashTable and
>>>>>>>       ReduceHashTable for concrete instances of this problem).
>>>>>>>
>>>>>>> Note that 2. and 3. are problems not just with the PojoSerializer,
>>>>>>> but also with the TupleSerializer.
>>>>>>>
>>>>>>> Solution approaches
>>>>>>>
>>>>>>> 1. Run-time code generation for every POJO
>>>>>>>
>>>>>>>    - 1. and 3. would be automatically solved if the serializers for
>>>>>>>      POJOs were generated on the fly (by, for example, Javassist).
>>>>>>>    - 2. also needs code generation, plus some extra effort in the
>>>>>>>      type extractor to distinguish between primitive types and
>>>>>>>      their boxed versions.
>>>>>>>    - It could be used for the PojoComparator as well (which could
>>>>>>>      greatly increase the performance of sorting).
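To make the first solution approach concrete, here is a hand-written sketch of the shape of code a run-time generator could emit for a two-field POJO: flat, monomorphic field accesses with no reflection, no per-field virtual dispatch, no null tags, and a serialized size that is known upfront. SensorReading and its fields are invented for illustration; a real implementation would plug into Flink's TypeSerializer interface rather than use static methods.

```java
import java.nio.ByteBuffer;

// Hypothetical POJO; the name and fields are illustrative only.
class SensorReading {
    int id;
    double value;
    SensorReading(int id, double value) { this.id = id; this.value = value; }
}

// The shape of code a run-time generator could emit for SensorReading.
public class GeneratedSensorReadingSerializer {
    // int + double: the serialized size is a compile-time constant,
    // which is exactly what getLength() cannot promise today.
    public static final int LENGTH = 4 + 8;

    public static void serialize(SensorReading r, ByteBuffer out) {
        out.putInt(r.id);        // direct field access, no boxing
        out.putDouble(r.value);  // monomorphic call, no null tag
    }

    public static SensorReading deserialize(ByteBuffer in) {
        return new SensorReading(in.getInt(), in.getDouble());
    }

    // Convenience round trip used in the demo below.
    public static SensorReading roundTrip(SensorReading r) {
        ByteBuffer buf = ByteBuffer.allocate(LENGTH);
        serialize(r, buf);
        buf.flip();
        return deserialize(buf);
    }

    public static void main(String[] args) {
        SensorReading copy = roundTrip(new SensorReading(7, 3.5));
        System.out.println(copy.id + " " + copy.value); // prints 7 3.5
    }
}
```

The generator would produce one such class per POJO type, so the JIT can inline and unroll everything that today hides behind the generic field-serializer loop.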
>>>>>>> 2. Annotations on POJOs (by the users)
>>>>>>>
>>>>>>>    - Concretely:
>>>>>>>      - annotate fields that will never be null -> no null tag
>>>>>>>        needed before every field!
>>>>>>>      - make a POJO final -> no subclass tag needed
>>>>>>>      - annotate a POJO as never being null -> no top-level null
>>>>>>>        tag needed
>>>>>>>    - These would also help with the getLength problem (6.), because
>>>>>>>      the length is often unknown only because currently anything
>>>>>>>      can be null or a subclass can appear anywhere.
>>>>>>>    - These annotations could be honored without code generation,
>>>>>>>      but then they would add some overhead when no annotations are
>>>>>>>      present, so this would work better together with the code
>>>>>>>      generation.
>>>>>>>    - Tuples would become a special case of POJOs, where nothing can
>>>>>>>      be null and no subclass can appear, so maybe we could
>>>>>>>      eliminate the TupleSerializer.
>>>>>>>    - We could annotate some internal types in Flink libraries
>>>>>>>      (Gelly (Vertex, Edge), FlinkML).
>>>>>>>
>>>>>>> TODO: what is the situation with Scala case classes? Run-time code
>>>>>>> generation is probably easier in Scala (with quasiquotes)?
>>>>>>>
>>>>>>> About me
>>>>>>>
>>>>>>> I am in the last year of my Computer Science MSc studies at Eotvos
>>>>>>> Lorand University in Budapest, and I am planning to start a PhD in
>>>>>>> the autumn. I have been working for almost three years at Ericsson
>>>>>>> on static analysis tools for C++. In 2014 I participated in GSoC,
>>>>>>> working on the LLVM project, and I have been a frequent contributor
>>>>>>> ever since. The following summer I interned at Apple.
>>>>>>>
>>>>>>> I learned about the Flink project not too long ago and I like it so far.
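The null-tag savings promised by the annotation approach above can be sketched as follows. The @NeverNull annotation is hypothetical (no such Flink API exists); the point is only the one tag byte that can be dropped per field once the user has promised the field is never null.

```java
import java.nio.ByteBuffer;

// Hypothetical user-facing annotation -- not an existing Flink API.
@interface NeverNull {}

public class NullTagDemo {

    // Current scheme: a 1-byte null tag precedes every nullable field.
    // Returns the number of bytes written.
    public static int bytesWithNullTag(Integer v) {
        ByteBuffer out = ByteBuffer.allocate(5);
        if (v == null) {
            out.put((byte) 1); // tag only, no payload
            return 1;
        }
        out.put((byte) 0);     // "not null" tag
        out.putInt(v);
        return 5;
    }

    // With a @NeverNull guarantee the tag byte disappears and the field
    // has a fixed serialized size of 4 bytes.
    public static int bytesNonNull(int v) {
        ByteBuffer out = ByteBuffer.allocate(4);
        out.putInt(v);
        return 4;
    }

    public static void main(String[] args) {
        System.out.println(bytesWithNullTag(42)); // prints 5
        System.out.println(bytesNonNull(42));     // prints 4
    }
}
```

One byte per field sounds small, but it also makes the record's size a constant again, which is what unlocks the fixed-length optimizations listed under problem 6.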
>>>>>>> In the last few weeks I have been working on some tickets to
>>>>>>> familiarize myself with the codebase:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/FLINK-3422
>>>>>>> https://issues.apache.org/jira/browse/FLINK-3322
>>>>>>> https://issues.apache.org/jira/browse/FLINK-3457
>>>>>>>
>>>>>>> My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
>>>>>>>
>>>>>>> References
>>>>>>>
>>>>>>> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>>>>>>> [2] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
>>>>>>> [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
>>>>>>> [4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
>>>>>>> [5] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
>>>>>>> [6] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
>>>>>>> [7] https://issues.apache.org/jira/browse/FLINK-3277
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Gábor
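As a footnote to problem 1 of the proposal, here is a minimal sketch contrasting reflective field access (what PojoSerializer effectively does today) with the direct access that generated code could use instead. The POJO and its field are invented for illustration.

```java
import java.lang.reflect.Field;

// Hypothetical POJO used only for illustration.
public class ReflectionAccessDemo {
    public int count = 41;

    // What PojoSerializer effectively does today: a Field.get() per
    // field, which boxes the primitive and defeats JIT inlining.
    public static int readReflectively(ReflectionAccessDemo p) throws Exception {
        Field f = ReflectionAccessDemo.class.getDeclaredField("count");
        f.setAccessible(true);
        return (int) f.get(p); // returns a boxed Integer, then unboxes
    }

    // What generated code would do: a direct, monomorphic field read
    // that the JIT can inline to a single load instruction.
    public static int readDirectly(ReflectionAccessDemo p) {
        return p.count;
    }

    public static void main(String[] args) throws Exception {
        ReflectionAccessDemo p = new ReflectionAccessDemo();
        System.out.println(readReflectively(p)); // prints 41
        System.out.println(readDirectly(p));     // prints 41
    }
}
```

Both reads return the same value; the difference is that the reflective path allocates a boxed Integer and goes through the reflection machinery on every field of every record.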