I prefer to avoid Scala dependencies in flink-core. If flink-core included Scala dependencies, a Scala version suffix (_2.10 or _2.11) would have to be added to the artifact name, which I think could confuse users.
Regards,
Chiwan Park

> On Apr 17, 2016, at 3:49 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
>
> Hi Gábor,
>
> I think that adding the Janino dep to flink-core should be fine, as it has
> quite slim dependencies [1,2] which are generally orthogonal to Flink's
> main dependency line (it is also already used elsewhere).
>
> As for mixing Scala code that is used from the Java parts of the same maven
> module, I am skeptical. We have seen IDE compilation issues with projects
> using this setup and have decided that the community-wide potential IDE
> setup pain outweighs the individual implementation convenience of Scala.
>
> [1] https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
> [2] https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
>
> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <xazax....@gmail.com> wrote:
>
>> Hi!
>>
>> The Table API already uses code generation and the Janino compiler [1].
>> Is that a dependency that is OK to add to flink-core? If it is, I think
>> I will use the same one, in order to be consistent with the other code
>> generation efforts.
>>
>> I started to look at the Table API code generation [2], and it uses
>> Scala extensively. Several Scala features, such as pattern matching and
>> string interpolation, can make Java code generation easier. I have not
>> seen any Scala code in flink-core yet. Is it OK to implement the code
>> generation inside flink-core using Scala?
>>
>> Regards,
>> Gábor
>>
>> [1] http://unkrig.de/w/Janino
>> [2] https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>>
>> On 18 March 2016 at 19:37, Gábor Horváth <xazax....@gmail.com> wrote:
>>
>>> Thank you! I finalized the project.
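Janino, which the thread above discusses as a dependency, compiles generated Java source at run time. Below is a minimal sketch of that technique; to keep the sketch dependency-free it uses the JDK's javax.tools compiler instead of Janino's own API, and the GeneratedAdder class it compiles is purely illustrative, not actual Flink generated code.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class RuntimeCompileDemo {

    // Compiles a string of Java source at run time, loads the resulting
    // class, and invokes a static method on it.
    public static int compileAndRun() throws Exception {
        // Hypothetical generated source -- stands in for the serializer
        // code a generator would emit.
        String src =
            "public class GeneratedAdder {\n"
          + "    public static int add(int a, int b) { return a + b; }\n"
          + "}\n";

        Path dir = Files.createTempDirectory("codegen");
        Path file = dir.resolve("GeneratedAdder.java");
        Files.write(file, src.getBytes("UTF-8"));

        // Janino would compile this fully in memory; the JDK compiler is
        // used here only so the sketch needs no external dependency.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null
                || compiler.run(null, null, null, file.toString()) != 0) {
            throw new IllegalStateException("compilation failed");
        }

        try (URLClassLoader loader =
                 new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
            Class<?> cls = loader.loadClass("GeneratedAdder");
            return (int) cls.getMethod("add", int.class, int.class)
                           .invoke(null, 21, 21);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(compileAndRun()); // prints 42
    }
}
```

Janino does the same job in memory and with a much lighter footprint, which is one reason the Table API uses it.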
>>>
>>> On 18 March 2016 at 10:29, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>
>>>> Thanks Gábor, now I also see it on the internal GSoC interface. I have
>>>> indicated that I wish to mentor your project; I think you can hit
>>>> finalize on your project there.
>>>>
>>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have updated this draft to include preliminary benchmarks, mentioned
>>>>> the interaction of annotations with savepoints, and extended it with a
>>>>> timeline and some notes about Scala case classes.
>>>>>
>>>>> Regards,
>>>>> Gábor
>>>>>
>>>>> On 9 March 2016 at 16:12, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> As far as I can see, the formatting was not correct in my previous
>>>>>> mail. A better formatted version is available here:
>>>>>> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
>>>>>> Sorry for that.
>>>>>>
>>>>>> Regards,
>>>>>> Gábor
>>>>>>
>>>>>> On 9 March 2016 at 15:51, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I did not want to send this proposal out before I had some initial
>>>>>>> benchmarks, but the issue was mentioned on the mailing list (
>>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>>>>>>> ), and I wanted to make this information available so it can be
>>>>>>> incorporated into that discussion. I have written this draft with
>>>>>>> the help of Gábor Gévay and Márton Balassi, and I am open to every
>>>>>>> suggestion.
>>>>>>>
>>>>>>> The proposal draft:
>>>>>>> Code Generation in the Serializers and Comparators of Apache Flink
>>>>>>>
>>>>>>> I am doing the last semester of my MSc studies and I am a former
>>>>>>> GSoC student in the LLVM project.
I plan to improve the serialization
>>>>>>> code in Flink during this summer. The current implementation of the
>>>>>>> serializers can be a performance bottleneck in some scenarios. These
>>>>>>> performance problems were also reported on the mailing list recently
>>>>>>> [1]. I plan to implement code generation in the serializers to
>>>>>>> improve their performance (as Stephan Ewen also suggested).
>>>>>>>
>>>>>>> TODO: I plan to include some preliminary benchmarks in this section.
>>>>>>>
>>>>>>> Performance problems with the current serializers
>>>>>>>
>>>>>>> 1. PojoSerializer uses reflection for accessing the fields, which is
>>>>>>> slow (e.g. [2]). This is also a serious problem for the comparators.
>>>>>>>
>>>>>>> 2. When deserializing fields of primitive types (e.g. int), the
>>>>>>> reusing overload of the corresponding field serializers cannot
>>>>>>> really do any reuse, because boxed primitive types are immutable in
>>>>>>> Java. This results in lots of object creations. [3][7]
>>>>>>>
>>>>>>> 3. The loop that calls the field serializers makes virtual function
>>>>>>> calls that cannot be speculatively devirtualized by the JVM or
>>>>>>> predicted by the CPU, because different serializer subclasses are
>>>>>>> invoked for the different fields. (And the loop cannot be unrolled,
>>>>>>> because the number of iterations is not a compile-time constant.)
>>>>>>> See also the discussion on the mailing list [1].
>>>>>>>
>>>>>>> 4. A POJO field can have the value null, so the serializer inserts
>>>>>>> 1-byte null tags, which wastes space. (Also, the type extractor
>>>>>>> logic does not distinguish between primitive types and their boxed
>>>>>>> versions, so even an int field has a null tag.)
>>>>>>>
>>>>>>> 5.
>>>>>>> Subclass tags also add a byte at the beginning of every POJO.
>>>>>>>
>>>>>>> 6. getLength() does not know the size in most cases [4]. Knowing
>>>>>>> the size of a type when serialized has numerous performance benefits
>>>>>>> throughout Flink:
>>>>>>>
>>>>>>>    1. Sorters can sort in-place when the type is small [5].
>>>>>>>    2. Chaining hash tables do not need resizes, because they know
>>>>>>>       how many buckets to allocate upfront [6].
>>>>>>>    3. Different hash table architectures could be used, e.g. open
>>>>>>>       addressing with linear probing instead of some form of
>>>>>>>       chaining.
>>>>>>>    4. It is possible to deserialize, modify, and then serialize a
>>>>>>>       record back to its original place, because it cannot happen
>>>>>>>       that the modified version does not fit in the place allocated
>>>>>>>       for the old version (see CompactingHashTable and
>>>>>>>       ReduceHashTable for concrete instances of this problem).
>>>>>>>
>>>>>>> Note that 2. and 3. are problems not just with the PojoSerializer,
>>>>>>> but also with the TupleSerializer.
>>>>>>>
>>>>>>> Solution approaches
>>>>>>>
>>>>>>> 1. Run-time code generation for every POJO
>>>>>>>
>>>>>>>    - 1. and 3. would be automatically solved if the serializers for
>>>>>>>      POJOs were generated on the fly (by, for example, Javassist).
>>>>>>>    - 2. also needs code generation, plus some extra effort in the
>>>>>>>      type extractor to distinguish between primitive types and
>>>>>>>      their boxed versions.
>>>>>>>    - It could be used for the PojoComparator as well (which could
>>>>>>>      greatly increase the performance of sorting).
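To make the first solution approach concrete, here is a hand-written sketch of the shape of code a run-time generator could emit for a two-field POJO: flat, monomorphic field accesses with no reflection, no per-field virtual dispatch, no null tags, and a serialized size that is known upfront. SensorReading and its fields are invented for illustration; a real implementation would plug into Flink's TypeSerializer interface rather than use static methods.

```java
import java.nio.ByteBuffer;

// Hypothetical POJO; the name and fields are illustrative only.
class SensorReading {
    int id;
    double value;
    SensorReading(int id, double value) { this.id = id; this.value = value; }
}

// The shape of code a run-time generator could emit for SensorReading.
public class GeneratedSensorReadingSerializer {
    // int + double: the serialized size is a compile-time constant,
    // which is exactly what getLength() cannot promise today.
    public static final int LENGTH = 4 + 8;

    public static void serialize(SensorReading r, ByteBuffer out) {
        out.putInt(r.id);        // direct field access, no boxing
        out.putDouble(r.value);  // monomorphic call, no null tag
    }

    public static SensorReading deserialize(ByteBuffer in) {
        return new SensorReading(in.getInt(), in.getDouble());
    }

    // Convenience round trip used in the demo below.
    public static SensorReading roundTrip(SensorReading r) {
        ByteBuffer buf = ByteBuffer.allocate(LENGTH);
        serialize(r, buf);
        buf.flip();
        return deserialize(buf);
    }

    public static void main(String[] args) {
        SensorReading copy = roundTrip(new SensorReading(7, 3.5));
        System.out.println(copy.id + " " + copy.value); // prints 7 3.5
    }
}
```

The generator would produce one such class per POJO type, so the JIT can inline and unroll everything that today hides behind the generic field-serializer loop.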
>>>>>>> 2. Annotations on POJOs (by the users)
>>>>>>>
>>>>>>>    - Concretely:
>>>>>>>      - annotate fields that will never be null -> no null tag
>>>>>>>        needed before every field!
>>>>>>>      - make a POJO final -> no subclass tag needed
>>>>>>>      - annotate a POJO as never being null -> no top-level null
>>>>>>>        tag needed
>>>>>>>    - These would also help with the getLength problem (6.), because
>>>>>>>      the length is often unknown only because currently anything
>>>>>>>      can be null or a subclass can appear anywhere.
>>>>>>>    - These annotations could be honored without code generation,
>>>>>>>      but then they would add some overhead when no annotations are
>>>>>>>      present, so this would work better together with the code
>>>>>>>      generation.
>>>>>>>    - Tuples would become a special case of POJOs, where nothing can
>>>>>>>      be null and no subclass can appear, so maybe we could
>>>>>>>      eliminate the TupleSerializer.
>>>>>>>    - We could annotate some internal types in Flink libraries
>>>>>>>      (Gelly (Vertex, Edge), FlinkML).
>>>>>>>
>>>>>>> TODO: what is the situation with Scala case classes? Run-time code
>>>>>>> generation is probably easier in Scala (with quasiquotes)?
>>>>>>>
>>>>>>> About me
>>>>>>>
>>>>>>> I am in the last year of my Computer Science MSc studies at Eotvos
>>>>>>> Lorand University in Budapest, and I am planning to start a PhD in
>>>>>>> the autumn. I have been working for almost three years at Ericsson
>>>>>>> on static analysis tools for C++. In 2014 I participated in GSoC,
>>>>>>> working on the LLVM project, and I have been a frequent contributor
>>>>>>> ever since. The following summer I interned at Apple.
>>>>>>>
>>>>>>> I learned about the Flink project not too long ago and I like it so far.
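The null-tag savings promised by the annotation approach above can be sketched as follows. The @NeverNull annotation is hypothetical (no such Flink API exists); the point is only the one tag byte that can be dropped per field once the user has promised the field is never null.

```java
import java.nio.ByteBuffer;

// Hypothetical user-facing annotation -- not an existing Flink API.
@interface NeverNull {}

public class NullTagDemo {

    // Current scheme: a 1-byte null tag precedes every nullable field.
    // Returns the number of bytes written.
    public static int bytesWithNullTag(Integer v) {
        ByteBuffer out = ByteBuffer.allocate(5);
        if (v == null) {
            out.put((byte) 1); // tag only, no payload
            return 1;
        }
        out.put((byte) 0);     // "not null" tag
        out.putInt(v);
        return 5;
    }

    // With a @NeverNull guarantee the tag byte disappears and the field
    // has a fixed serialized size of 4 bytes.
    public static int bytesNonNull(int v) {
        ByteBuffer out = ByteBuffer.allocate(4);
        out.putInt(v);
        return 4;
    }

    public static void main(String[] args) {
        System.out.println(bytesWithNullTag(42)); // prints 5
        System.out.println(bytesNonNull(42));     // prints 4
    }
}
```

One byte per field sounds small, but it also makes the record's size a constant again, which is what unlocks the fixed-length optimizations listed under problem 6.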
>>>>>>> In the last few weeks I have been working on some tickets to
>>>>>>> familiarize myself with the codebase:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/FLINK-3422
>>>>>>> https://issues.apache.org/jira/browse/FLINK-3322
>>>>>>> https://issues.apache.org/jira/browse/FLINK-3457
>>>>>>>
>>>>>>> My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
>>>>>>>
>>>>>>> References
>>>>>>>
>>>>>>> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>>>>>>> [2] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
>>>>>>> [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
>>>>>>> [4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
>>>>>>> [5] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
>>>>>>> [6] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
>>>>>>> [7] https://issues.apache.org/jira/browse/FLINK-3277
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Gábor
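As a footnote to problem 1 of the proposal, here is a minimal sketch contrasting reflective field access (what PojoSerializer effectively does today) with the direct access that generated code could use instead. The POJO and its field are invented for illustration.

```java
import java.lang.reflect.Field;

// Hypothetical POJO used only for illustration.
public class ReflectionAccessDemo {
    public int count = 41;

    // What PojoSerializer effectively does today: a Field.get() per
    // field, which boxes the primitive and defeats JIT inlining.
    public static int readReflectively(ReflectionAccessDemo p) throws Exception {
        Field f = ReflectionAccessDemo.class.getDeclaredField("count");
        f.setAccessible(true);
        return (int) f.get(p); // returns a boxed Integer, then unboxes
    }

    // What generated code would do: a direct, monomorphic field read
    // that the JIT can inline to a single load instruction.
    public static int readDirectly(ReflectionAccessDemo p) {
        return p.count;
    }

    public static void main(String[] args) throws Exception {
        ReflectionAccessDemo p = new ReflectionAccessDemo();
        System.out.println(readReflectively(p)); // prints 41
        System.out.println(readDirectly(p));     // prints 41
    }
}
```

Both reads return the same value; the difference is that the reflective path allocates a boxed Integer and goes through the reflection machinery on every field of every record.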