Yes, I know Janino is a pure Java project. I meant that if we add Scala code to flink-core, we would have to add a Scala dependency to flink-core, and that could be confusing.

Regards,
Chiwan Park
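
For readers who have not used it: Janino compiles Java source handed to it as plain strings, in memory and at runtime. Below is a minimal sketch of that pattern; this is not Flink or Table API code, and the "Adder" class and its method are made up purely for illustration.

    import java.lang.reflect.Method;
    import org.codehaus.janino.SimpleCompiler;

    public class JaninoSketch {
        public static void main(String[] args) throws Exception {
            // Source code of a class "generated" at runtime as a plain string.
            String source =
                "public final class Adder {\n" +
                "    public static int add(int a, int b) { return a + b; }\n" +
                "}\n";

            SimpleCompiler compiler = new SimpleCompiler();
            compiler.cook(source);                        // compile the string in memory
            Class<?> adder = compiler.getClassLoader().loadClass("Adder");
            Method add = adder.getMethod("add", int.class, int.class);
            System.out.println(add.invoke(null, 40, 2));  // prints 42
        }
    }

A generated serializer could follow the same pattern: assemble the class body as a string for the concrete POJO or tuple type, cook it, load the class, and cache an instance of it.
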
> On Apr 18, 2016, at 2:49 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
>
> Chiwan, just to clarify, Janino is a Java project. [1]
>
> [1] https://github.com/aunkrig/janino
>
> On Mon, Apr 18, 2016 at 3:40 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>
>> I prefer to avoid Scala dependencies in flink-core. If flink-core includes
>> Scala dependencies, the Scala version suffix (_2.10 or _2.11) would have to
>> be added. I think that users could be confused.
>>
>> Regards,
>> Chiwan Park
>>
>>> On Apr 17, 2016, at 3:49 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>
>>> Hi Gábor,
>>>
>>> I think that adding the Janino dependency to flink-core should be fine, as it
>>> has quite slim dependencies [1,2] which are generally orthogonal to Flink's
>>> main dependency line (and it is already used elsewhere).
>>>
>>> As for mixing in Scala code that is used from the Java parts of the same
>>> Maven module, I am skeptical. We have seen IDE compilation issues with
>>> projects using this setup and have decided that the community-wide potential
>>> IDE setup pain outweighs the individual implementation convenience of Scala.
>>>
>>> [1] https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
>>> [2] https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
>>>
>>> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <xazax....@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> The Table API already uses code generation and the Janino compiler [1].
>>>> Is it a dependency that is OK to add to flink-core? If it is, I think I will
>>>> use the same in order to be consistent with the other code generation efforts.
>>>>
>>>> I started to look at the Table API code generation [2] and it uses Scala
>>>> extensively. Several Scala features, such as pattern matching and string
>>>> interpolation, can make generating Java code easier. I did not see any Scala
>>>> code in flink-core yet. Is it OK to implement the code generation inside
>>>> flink-core using Scala?
>>>>
>>>> Regards,
>>>> Gábor
>>>>
>>>> [1] http://unkrig.de/w/Janino
>>>> [2] https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>>>>
>>>> On 18 March 2016 at 19:37, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>
>>>>> Thank you! I finalized the project.
>>>>>
>>>>> On 18 March 2016 at 10:29, Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Gábor, now I also see it on the internal GSoC interface. I have
>>>>>> indicated that I wish to mentor your project, so I think you can hit
>>>>>> finalize on your project there.
>>>>>>
>>>>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have updated this draft to include preliminary benchmarks, mention the
>>>>>>> interaction of annotations with savepoints, and extend it with a timeline
>>>>>>> and some notes about Scala case classes.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gábor
>>>>>>>
>>>>>>> On 9 March 2016 at 16:12, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> As far as I can see, the formatting was not correct in my previous mail.
>>>>>>>> A better formatted version is available here:
>>>>>>>> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
>>>>>>>> Sorry for that.
>>>>>>>> Regards,
>>>>>>>> Gábor
>>>>>>>>
>>>>>>>> On 9 March 2016 at 15:51, Gábor Horváth <xazax....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I did not want to send this proposal out before I had some initial
>>>>>>>>> benchmarks, but this issue was mentioned on the mailing list (
>>>>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html ),
>>>>>>>>> and I wanted to make this information available so that it can be
>>>>>>>>> incorporated into that discussion. I have written this draft with the
>>>>>>>>> help of Gábor Gévay and Márton Balassi, and I am open to every suggestion.
>>>>>>>>>
>>>>>>>>> The proposal draft:
>>>>>>>>>
>>>>>>>>> Code Generation in Serializers and Comparators of Apache Flink
>>>>>>>>>
>>>>>>>>> I am in the last semester of my MSc studies and I am a former GSoC
>>>>>>>>> student in the LLVM project. I plan to improve the serialization code in
>>>>>>>>> Flink during this summer. The current implementation of the serializers
>>>>>>>>> can be a performance bottleneck in some scenarios. These performance
>>>>>>>>> problems were also reported on the mailing list recently [1]. I plan to
>>>>>>>>> implement code generation in the serializers to improve the performance
>>>>>>>>> (as Stephan Ewen also suggested).
>>>>>>>>>
>>>>>>>>> TODO: I plan to include some preliminary benchmarks in this section.
>>>>>>>>>
>>>>>>>>> Performance problems with the current serializers
>>>>>>>>>
>>>>>>>>> 1. PojoSerializer uses reflection for accessing the fields, which is
>>>>>>>>>    slow (e.g. [2]).
>>>>>>>>>    - This is also a serious problem for the comparators.
>>>>>>>>> 2. When deserializing fields of primitive types (e.g. int), the reusing
>>>>>>>>>    overload of the corresponding field serializers cannot really do any
>>>>>>>>>    reuse, because boxed primitive types are immutable in Java. This
>>>>>>>>>    results in lots of object creations. [3][7]
>>>>>>>>> 3. The loop that calls the field serializers makes virtual function
>>>>>>>>>    calls that cannot be speculatively devirtualized by the JVM or
>>>>>>>>>    predicted by the CPU, because different serializer subclasses are
>>>>>>>>>    invoked for the different fields. (And the loop cannot be unrolled,
>>>>>>>>>    because the number of iterations is not a compile-time constant.)
>>>>>>>>>    See also the following discussion on the mailing list [1].
>>>>>>>>> 4. A POJO field can have the value null, so the serializer inserts
>>>>>>>>>    1-byte null tags, which wastes space. (Also, the type extractor logic
>>>>>>>>>    does not distinguish between primitive types and their boxed
>>>>>>>>>    versions, so even an int field has a null tag.)
>>>>>>>>> 5. Subclass tags also add a byte at the beginning of every POJO.
>>>>>>>>> 6. getLength() does not know the size in most cases [4]. Knowing the
>>>>>>>>>    size of a type when serialized has numerous performance benefits
>>>>>>>>>    throughout Flink:
>>>>>>>>>    1. Sorters can work in-place when the type is small [5].
>>>>>>>>>    2. Chaining hash tables do not need resizes, because they know how
>>>>>>>>>       many buckets to allocate upfront [6].
>>>>>>>>>    3. Different hash table architectures could be used, e.g. open
>>>>>>>>>       addressing with linear probing instead of some chaining.
>>>>>>>>>    4. It is possible to deserialize, modify, and then serialize back a
>>>>>>>>>       record to its original place, because it cannot happen that the
>>>>>>>>>       modified version does not fit in the place allocated for the old
>>>>>>>>>       version (see CompactingHashTable and ReduceHashTable for concrete
>>>>>>>>>       instances of this problem).
>>>>>>>>>
>>>>>>>>> Note that 2. and 3. are problems not just with the PojoSerializer, but
>>>>>>>>> also with the TupleSerializer.
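
To make problems 1-4 above concrete, here is a deliberately simplified sketch; it is not actual Flink code, the SerializationSketch class, the FieldWriter interface, and the Point POJO are invented for illustration, and StringBuilder merely stands in for Flink's DataOutputView to keep the sketch self-contained:

    import java.lang.reflect.Field;

    class SerializationSketch {
        // Stand-in for a per-field serializer; a different subclass is used per field type.
        interface FieldWriter { void write(Object boxedValue, StringBuilder out); }

        // Reflection-based path, roughly the shape of the problem described above.
        static void serializeReflective(Object pojo, Field[] fields, FieldWriter[] writers,
                                        StringBuilder out) throws IllegalAccessException {
            for (int i = 0; i < fields.length; i++) {
                Object value = fields[i].get(pojo); // reflective access; boxes an int into an Integer
                out.append(value == null ? 1 : 0);  // 1-byte null tag, even for primitive-typed fields
                if (value != null) {
                    writers[i].write(value, out);   // megamorphic virtual call, hard to devirtualize
                }
            }
        }

        // What code generated for one concrete POJO could look like instead:
        // direct field access, no boxing, no null tags, no virtual dispatch.
        static final class Point { int x; int y; }

        static void serializePoint(Point p, StringBuilder out) {
            out.append(p.x).append(p.y);
        }
    }
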
>>>>>>>>> Solution approaches
>>>>>>>>>
>>>>>>>>> 1. Run-time code generation for every POJO
>>>>>>>>>    - 1. and 3. would be automatically solved if the serializers for
>>>>>>>>>      POJOs were generated on-the-fly (by, for example, Javassist).
>>>>>>>>>    - 2. also needs code generation, plus some extra effort in the type
>>>>>>>>>      extractor to distinguish between primitive types and their boxed
>>>>>>>>>      versions.
>>>>>>>>>    - It could be used for the PojoComparator as well (which could
>>>>>>>>>      greatly increase the performance of sorting).
>>>>>>>>> 2. Annotations on POJOs (by the users)
>>>>>>>>>    - Concretely:
>>>>>>>>>      - annotate fields that will never be null -> no null tag needed
>>>>>>>>>        before every field!
>>>>>>>>>      - make a POJO final -> no subclass tag needed
>>>>>>>>>      - annotate a POJO that it will never be null -> no top-level null
>>>>>>>>>        tag needed
>>>>>>>>>    - These would also help with the getLength problem (6.), because the
>>>>>>>>>      length is often not known simply because currently anything can be
>>>>>>>>>      null or a subclass can appear anywhere.
>>>>>>>>>    - These annotations could be handled without code generation, but
>>>>>>>>>      then they would add some overhead when no annotations are present,
>>>>>>>>>      so this would work better together with the code generation.
>>>>>>>>>    - Tuples would become a special case of POJOs, where nothing can be
>>>>>>>>>      null and no subclass can appear, so maybe we could eliminate the
>>>>>>>>>      TupleSerializer.
>>>>>>>>>    - We could annotate some internal types in the Flink libraries
>>>>>>>>>      (Gelly (Vertex, Edge), FlinkML).
>>>>>>>>>
>>>>>>>>> TODO: What is the situation with Scala case classes? Run-time code
>>>>>>>>> generation is probably easier in Scala (with quasiquotes)?
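
To make the annotation idea (solution approach 2) a bit more tangible, here is a purely hypothetical sketch in Java; the @NeverNull annotation and the ExamplePojo class do not exist in Flink and are invented only for illustration:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    class AnnotationSketch {

        // Hypothetical marker: the annotated field is never null, so a generated
        // serializer could omit the per-field null tag.
        @Retention(RetentionPolicy.RUNTIME)
        @Target(ElementType.FIELD)
        @interface NeverNull {}

        // A final POJO cannot have subclasses, so the subclass tag could be
        // omitted as well. If, additionally, every field has a statically known
        // serialized size, getLength() could return a constant (point 6 above).
        static final class ExamplePojo {
            long id;                 // primitive: no null tag, once primitives and boxed types are distinguished
            @NeverNull String name;  // reference field declared never-null: no null tag
            double weight;
        }
    }

A code-generating serializer factory could then inspect these annotations (and the finality of the class) and simply leave out the corresponding tag bytes and branches when emitting the serializer source.
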
>>>>>>>>> About me
>>>>>>>>>
>>>>>>>>> I am in the last year of my Computer Science MSc studies at Eotvos
>>>>>>>>> Lorand University in Budapest, and I am planning to start a PhD in the
>>>>>>>>> autumn. I have been working for almost three years at Ericsson on
>>>>>>>>> static analysis tools for C++. In 2014 I participated in GSoC, working
>>>>>>>>> on the LLVM project, and I have been a frequent contributor ever since.
>>>>>>>>> The following summer I interned at Apple.
>>>>>>>>>
>>>>>>>>> I learned about the Flink project not too long ago and I like it so
>>>>>>>>> far. Over the last few weeks I have been working on some tickets to
>>>>>>>>> familiarize myself with the codebase:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3422
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3322
>>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3457
>>>>>>>>>
>>>>>>>>> My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
>>>>>>>>>
>>>>>>>>> References
>>>>>>>>>
>>>>>>>>> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>>>>>>>>> [2] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
>>>>>>>>> [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
>>>>>>>>> [4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
>>>>>>>>> [5] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
>>>>>>>>> [6] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
>>>>>>>>> [7] https://issues.apache.org/jira/browse/FLINK-3277
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>>
>>>>>>>>> Gábor