Chiwan, just to clarify: Janino is a Java project. [1]

[1] https://github.com/aunkrig/janino

On Mon, Apr 18, 2016 at 3:40 AM, Chiwan Park <chiwanp...@apache.org> wrote:

> I prefer to avoid Scala dependencies in flink-core. If flink-core includes
> Scala dependencies, Scala version suffix (_2.10 or _2.11) should be added.
> I think that users could be confused.
>
> Regards,
> Chiwan Park
>
> > On Apr 17, 2016, at 3:49 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
> >
> > Hi Gábor,
> >
> > I think that adding the Janino dep to flink-core should be fine, as it has
> > quite slim dependencies [1,2] which are generally orthogonal to Flink's
> > main dependency line (also it is already used elsewhere).
> >
> > As for mixing Scala code that is used from the Java parts of the same maven
> > module I am skeptical. We have seen IDE compilation issues with projects
> > using this setup and have decided that the community-wide potential IDE
> > setup pain outweighs the individual implementation convenience with Scala.
> >
> > [1] https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
> > [2] https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
> >
> > On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <xazax....@gmail.com> wrote:
> >
> >> Hi!
> >>
> >> Table API already uses code generation and the Janino compiler [1]. Is it a
> >> dependency that is ok to add to flink-core? In case it is ok, I think I
> >> will use the same in order to be consistent with the other code generation
> >> efforts.
> >>
> >> I started to look at the Table API code generation [2] and it uses Scala
> >> extensively. There are several Scala features that can make Java code
> >> generation easier such as pattern matching and string interpolation. I did
> >> not see any Scala code in flink-core yet. Is it ok to implement the code
> >> generation inside the flink-core using Scala?
> >>
> >> Regards,
> >> Gábor
> >>
> >> [1] http://unkrig.de/w/Janino
> >> [2] https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
> >>
> >> On 18 March 2016 at 19:37, Gábor Horváth <xazax....@gmail.com> wrote:
> >>
> >>> Thank you! I finalized the project.
> >>>
> >>> On 18 March 2016 at 10:29, Márton Balassi <balassi.mar...@gmail.com> wrote:
> >>>
> >>>> Thanks Gábor, now I also see it on the internal GSoC interface. I have
> >>>> indicated that I wish to mentor your project, I think you can hit finalize
> >>>> on your project there.
> >>>>
> >>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <xazax....@gmail.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have updated this draft to include preliminary benchmarks, mentioned the
> >>>>> interaction of annotations with savepoints, extended it with a timeline,
> >>>>> and some notes about scala case classes.
> >>>>>
> >>>>> Regards,
> >>>>> Gábor
> >>>>>
> >>>>> On 9 March 2016 at 16:12, Gábor Horváth <xazax....@gmail.com> wrote:
> >>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> As far as I can see the formatting was not correct in my previous mail. A
> >>>>>> better formatted version is available here:
> >>>>>> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
> >>>>>> Sorry for that.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Gábor
> >>>>>>
> >>>>>> On 9 March 2016 at 15:51, Gábor Horváth <xazax....@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I did not want to send this proposal out before I have some
> >>>>>>> initial benchmarks, but this issue was mentioned on the mailing list
> >>>>>>> (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html),
> >>>>>>> and I wanted to make this information available to be able to incorporate
> >>>>>>> this into that discussion. I have written this draft with the help of Gábor
> >>>>>>> Gévay and Márton Balassi and I am open to every suggestion.
> >>>>>>>
> >>>>>>>
> >>>>>>> The proposal draft:
> >>>>>>> Code Generation in Serializers and Comparators of Apache Flink
> >>>>>>>
> >>>>>>> I am doing my last semester of my MSc studies and I’m a former GSoC
> >>>>>>> student in the LLVM project. I plan to improve the serialization code in
> >>>>>>> Flink during this summer. The current implementation of the serializers can
> >>>>>>> be a performance bottleneck in some scenarios. These performance problems
> >>>>>>> were also reported on the mailing list recently [1]. I plan to implement
> >>>>>>> code generation into the serializers to improve the performance (as Stephan
> >>>>>>> Ewen also suggested.)
> >>>>>>>
> >>>>>>> TODO: I plan to include some preliminary benchmarks in this section.
> >>>>>>>
> >>>>>>> Performance problems with the current serializers
> >>>>>>>
> >>>>>>> 1. PojoSerializer uses reflection for accessing the fields, which is
> >>>>>>>    slow (e.g. [2])
> >>>>>>>    - This is also a serious problem for the comparators
> >>>>>>> 2. When deserializing fields of primitive types (e.g. int), the reusing
> >>>>>>>    overload of the corresponding field serializers cannot really do any reuse,
> >>>>>>>    because boxed primitive types are immutable in Java. This results in lots
> >>>>>>>    of object creations. [3][7]
> >>>>>>> 3. The loop to call the field serializers makes virtual function calls
> >>>>>>>    that cannot be speculatively devirtualized by the JVM or predicted by the
> >>>>>>>    CPU, because different serializer subclasses are invoked for the different
> >>>>>>>    fields. (And the loop cannot be unrolled, because the number of iterations
> >>>>>>>    is not a compile time constant.) See also the following discussion on the
> >>>>>>>    mailing list [1].
> >>>>>>> 4. A POJO field can have the value null, so the serializer inserts 1
> >>>>>>>    byte null tags, which wastes space. (Also, the type extractor logic does
> >>>>>>>    not distinguish between primitive types and their boxed versions, so even
> >>>>>>>    an int field has a null tag.)
> >>>>>>> 5. Subclass tags also add a byte at the beginning of every POJO
> >>>>>>> 6. getLength() does not know the size in most cases [4]
> >>>>>>>    Knowing the size of a type when serialized has numerous performance
> >>>>>>>    benefits throughout Flink:
> >>>>>>>    1. Sorters can sort in place when the type is small [5]
> >>>>>>>    2. Chaining hash tables do not need resizes, because they know how
> >>>>>>>       many buckets to allocate upfront [6]
> >>>>>>>    3. Different hash table architectures could be used, e.g. open
> >>>>>>>       addressing with linear probing instead of some chaining
> >>>>>>>    4. It is possible to deserialize, modify, and then serialize back a
> >>>>>>>       record to its original place, because it cannot happen that the modified
> >>>>>>>       version does not fit in the place allocated there for the old version (see
> >>>>>>>       CompactingHashTable and ReduceHashTable for concrete instances of this
> >>>>>>>       problem)
> >>>>>>>
> >>>>>>> Note that 2. and 3. are problems with not just the PojoSerializer, but
> >>>>>>> also with the TupleSerializer.
> >>>>>>>
> >>>>>>> Solution approaches
> >>>>>>>
> >>>>>>> 1. Run time code generation for every POJO
> >>>>>>>    - 1. and 3. would be automatically solved if the serializers for
> >>>>>>>      POJOs were generated on-the-fly (by, for example, Javassist)
> >>>>>>>    - 2. also needs code generation, and also some extra effort in the
> >>>>>>>      type extractor to distinguish between primitive types and their boxed
> >>>>>>>      versions
> >>>>>>>    - could be used for PojoComparator as well (which could greatly
> >>>>>>>      increase the performance of sorting)
> >>>>>>> 2. Annotations on POJOs (by the users)
> >>>>>>>    - Concretely:
> >>>>>>>      - annotate fields that will never be null -> no null tag needed
> >>>>>>>        before every field!
> >>>>>>>      - make a POJO final -> no subclass tag needed
> >>>>>>>      - annotate a POJO to declare that it will never be null -> no top-level
> >>>>>>>        null tag needed
> >>>>>>>    - These would also help with the getLength problem (6.), because the
> >>>>>>>      length is often not known because currently anything can be null or a
> >>>>>>>      subclass can appear anywhere
> >>>>>>>    - These annotations could be done without code generation, but then
> >>>>>>>      they would add some overhead when there are no annotations present, so this
> >>>>>>>      would work better together with the code generation
> >>>>>>>    - Tuples would become a special case of POJOs, where nothing can be
> >>>>>>>      null, and no subclass can appear, so maybe we could eliminate the
> >>>>>>>      TupleSerializer
> >>>>>>>    - We could annotate some internal types in Flink libraries (Gelly
> >>>>>>>      (Vertex, Edge), FlinkML)
> >>>>>>>
> >>>>>>> TODO: what is the situation with Scala case classes? Run time code
> >>>>>>> generation is probably easier in Scala? (with quasiquotes)
> >>>>>>>
> >>>>>>> About me
> >>>>>>>
> >>>>>>> I am in the last year of my Computer Science MSc studies at Eotvos Lorand
> >>>>>>> University in Budapest, and planning to start a PhD in the autumn. I have
> >>>>>>> been working for almost three years at Ericsson on static analysis tools
> >>>>>>> for C++. In 2014 I participated in GSoC, working on the LLVM project, and I
> >>>>>>> have been a frequent contributor ever since. The next summer I was interning at
> >>>>>>> Apple.
> >>>>>>>
> >>>>>>> I learned about the Flink project not too long ago and I like it so far.
> >>>>>>> The last few weeks I was working on some tickets to familiarize myself with
> >>>>>>> the codebase:
> >>>>>>>
> >>>>>>> https://issues.apache.org/jira/browse/FLINK-3422
> >>>>>>>
> >>>>>>> https://issues.apache.org/jira/browse/FLINK-3322
> >>>>>>>
> >>>>>>> https://issues.apache.org/jira/browse/FLINK-3457
> >>>>>>>
> >>>>>>> My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
> >>>>>>>
> >>>>>>> References
> >>>>>>>
> >>>>>>> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
> >>>>>>>
> >>>>>>> [2] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
> >>>>>>>
> >>>>>>> [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
> >>>>>>>
> >>>>>>> [4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
> >>>>>>>
> >>>>>>> [5] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
> >>>>>>>
> >>>>>>> [6] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
> >>>>>>>
> >>>>>>> [7] https://issues.apache.org/jira/browse/FLINK-3277
> >>>>>>>
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>>
> >>>>>>> Gábor
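
To make the reflection, boxing, and null-tag points from the quoted proposal a bit more concrete, here is a minimal plain-JDK sketch. It is not Flink's PojoSerializer and it does not use Janino or Javassist. The class PojoSerializerSketch, the Point POJO, and both method names are made up for this mail, and the "generated" variant is written by hand only to suggest roughly what a generated serializer for one specific POJO might look like. The sketch assumes public int fields only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch, not Flink code: contrasts reflective and specialized field access.
final class PojoSerializerSketch {

    /** Made-up user POJO with two primitive int fields. */
    public static class Point {
        public int x;
        public int y;
    }

    /** Reflective variant: each int is boxed and preceded by a 1-byte null tag. */
    public static byte[] serializeReflective(Object pojo) {
        try {
            Field[] fields = pojo.getClass().getFields();
            // getFields() gives no ordering guarantee, so sort by name for a stable layout.
            Arrays.sort(fields, Comparator.comparing(Field::getName));
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (Field field : fields) {
                Object value = field.get(pojo);    // reflective read, boxes int to Integer
                out.writeBoolean(value == null);   // null tag, always false for a primitive
                if (value != null) {
                    out.writeInt((Integer) value); // assumption: this sketch only handles int fields
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException | IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hand-written stand-in for what a generated serializer for Point could look like. */
    public static byte[] serializeGenerated(Point p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream(8);
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(p.x); // direct field access: no reflection, no boxing, no null tag
            out.writeInt(p.y);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point p = new Point();
        p.x = 1;
        p.y = 2;
        System.out.println("reflective bytes: " + serializeReflective(p).length); // 10
        System.out.println("generated bytes:  " + serializeGenerated(p).length);  // 8
    }
}
```

In this toy case the specialized variant always writes exactly 8 bytes, which hints at why a fixed serialized length becomes knowable once null tags are dropped for fields that can never be null. Real measurements would of course have to come from benchmarks on the actual serializers.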