Re: GSoC Project Proposal Draft: Code Generation in Serializers

Márton Balassi Sat, 16 Apr 2016 23:50:17 -0700

Hi Gábor,

I think that adding the Janino dep to flink-core should be fine, as it has
quite slim dependencies [1,2] which are generally orthogonal to Flink's
main dependency line (also it is already used elsewhere).


As for mixing in Scala code that is called from the Java parts of the same
Maven module, I am skeptical. We have seen IDE compilation issues in projects
with this setup, and have decided that the potential community-wide IDE setup
pain outweighs the individual implementation convenience of Scala.

[1]
https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
[2]
https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom

On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <[email protected]> wrote:

> Hi!
>
> Table API already uses code generation and the Janino compiler [1]. Is it a
> dependency that is ok to add to flink-core? In case it is ok, I think I
> will use the same in order to be consistent with the other code generation
> efforts.
>
> I started to look at the Table API code generation [2] and it uses Scala
> extensively. There are several Scala features that can make Java code
> generation easier such as pattern matching and string interpolation. I did
> not see any Scala code in flink-core yet. Is it ok to implement the code
> generation inside the flink-core using Scala?
>
> Regards,
> Gábor
>
> [1] http://unkrig.de/w/Janino
> [2] https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>
> On 18 March 2016 at 19:37, Gábor Horváth <[email protected]> wrote:
>
> > Thank you! I finalized the project.
> >
> >
> > On 18 March 2016 at 10:29, Márton Balassi <[email protected]>
> > wrote:
> >
> >> Thanks Gábor, now I also see it on the internal GSoC interface. I have
> >> indicated that I wish to mentor your project. I think you can hit
> >> finalize on your project there.
> >>
> >> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <[email protected]>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I have updated this draft to include preliminary benchmarks, mentioned
> >> > the interaction of annotations with savepoints, and extended it with a
> >> > timeline and some notes about Scala case classes.
> >> >
> >> > Regards,
> >> > Gábor
> >> >
> >> > On 9 March 2016 at 16:12, Gábor Horváth <[email protected]> wrote:
> >> >
> >> > > Hi!
> >> > >
> >> > > As far as I can see the formatting was not correct in my previous
> >> > > mail. A better formatted version is available here:
> >> > >
> >> > > https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
> >> > >
> >> > > Sorry for that.
> >> > >
> >> > > Regards,
> >> > > Gábor
> >> > >
> >> > > On 9 March 2016 at 15:51, Gábor Horváth <[email protected]> wrote:
> >> > >
> >> > >> Hi,
> >> > >>
> >> > >> I did not want to send this proposal out before I had some initial
> >> > >> benchmarks, but this issue was mentioned on the mailing list
> >> > >> (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html),
> >> > >> and I wanted to make this information available so that it can be
> >> > >> incorporated into that discussion. I have written this draft with the
> >> > >> help of Gábor Gévay and Márton Balassi, and I am open to every
> >> > >> suggestion.
> >> > >>
> >> > >>
> >> > >> The proposal draft:
> >> > >> Code Generation in Serializers and Comparators of Apache Flink
> >> > >>
> >> > >> I am doing the last semester of my MSc studies and I’m a former GSoC
> >> > >> student in the LLVM project. I plan to improve the serialization code
> >> > >> in Flink during this summer. The current implementation of the
> >> > >> serializers can be a performance bottleneck in some scenarios. These
> >> > >> performance problems were also reported on the mailing list recently
> >> > >> [1]. I plan to implement code generation in the serializers to improve
> >> > >> the performance (as Stephan Ewen also suggested).
> >> > >>
> >> > >> TODO: I plan to include some preliminary benchmarks in this section.
> >> > >>
> >> > >> Performance problems with the current serializers
> >> > >>
> >> > >>    1. PojoSerializer uses reflection for accessing the fields, which is
> >> > >>       slow (e.g. [2])
> >> > >>
> >> > >>       - This is also a serious problem for the comparators
> >> > >>
> >> > >>    2. When deserializing fields of primitive types (e.g. int), the
> >> > >>       reusing overload of the corresponding field serializers cannot
> >> > >>       really do any reuse, because boxed primitive types are immutable
> >> > >>       in Java. This results in lots of object creations. [3][7]
> >> > >>
> >> > >>    3. The loop that calls the field serializers makes virtual function
> >> > >>       calls that cannot be speculatively devirtualized by the JVM or
> >> > >>       predicted by the CPU, because different serializer subclasses are
> >> > >>       invoked for the different fields. (And the loop cannot be
> >> > >>       unrolled, because the number of iterations is not a compile time
> >> > >>       constant.) See also the following discussion on the mailing list
> >> > >>       [1].
> >> > >>
> >> > >>    4. A POJO field can have the value null, so the serializer inserts
> >> > >>       1 byte null tags, which wastes space. (Also, the type extractor
> >> > >>       logic does not distinguish between primitive types and their
> >> > >>       boxed versions, so even an int field has a null tag.)
> >> > >>
> >> > >>    5. Subclass tags also add a byte at the beginning of every POJO
> >> > >>
> >> > >>    6. getLength() does not know the size in most cases [4]
> >> > >>       Knowing the size of a type when serialized has numerous
> >> > >>       performance benefits throughout Flink:
> >> > >>
> >> > >>       1. Sorters can sort in place when the type is small [5]
> >> > >>       2. Chaining hash tables do not need resizes, because they know
> >> > >>          how many buckets to allocate upfront [6]
> >> > >>       3. Different hash table architectures could be used, e.g. open
> >> > >>          addressing with linear probing instead of some chaining
> >> > >>       4. It is possible to deserialize, modify, and then serialize a
> >> > >>          record back to its original place, because it cannot happen
> >> > >>          that the modified version does not fit in the place allocated
> >> > >>          there for the old version (see CompactingHashTable and
> >> > >>          ReduceHashTable for concrete instances of this problem)
> >> > >>
> >> > >>
> >> > >> Note that problems 2. and 3. affect not just the PojoSerializer, but
> >> > >> also the TupleSerializer.
> >> > >>
> >> > >> Solution approaches
> >> > >>
> >> > >>    1. Run time code generation for every POJO
> >> > >>
> >> > >>       - Problems 1. and 3. would be automatically solved if the
> >> > >>         serializers for POJOs were generated on the fly (by, for
> >> > >>         example, Javassist)
> >> > >>       - Problem 2. also needs code generation, and also some extra
> >> > >>         effort in the type extractor to distinguish between primitive
> >> > >>         types and their boxed versions
> >> > >>       - Code generation could be used for PojoComparator as well (which
> >> > >>         could greatly increase the performance of sorting)
> >> > >>
> >> > >>    2. Annotations on POJOs (by the users)
> >> > >>
> >> > >>       - Concretely:
> >> > >>         - annotate fields that will never be null -> no null tag needed
> >> > >>           before every field!
> >> > >>         - make a POJO final -> no subclass tag needed
> >> > >>         - annotate a POJO as never null -> no top level null tag needed
> >> > >>       - These would also help with the getLength problem (6.), because
> >> > >>         the length is often not known, as currently anything can be
> >> > >>         null or a subclass can appear anywhere
> >> > >>       - These annotations could be done without code generation, but
> >> > >>         then they would add some overhead when there are no annotations
> >> > >>         present, so this would work better together with the code
> >> > >>         generation
> >> > >>       - Tuples would become a special case of POJOs, where nothing can
> >> > >>         be null and no subclass can appear, so maybe we could eliminate
> >> > >>         the TupleSerializer
> >> > >>       - We could annotate some internal types in Flink libraries (Gelly
> >> > >>         (Vertex, Edge), FlinkML)
> >> > >>
> >> > >>
> >> > >> TODO: what is the situation with Scala case classes? Run time code
> >> > >> generation is probably easier in Scala? (with quasiquotes)
> >> > >>
> >> > >> About me
> >> > >>
> >> > >> I am in the last year of my Computer Science MSc studies at Eotvos
> >> > >> Lorand University in Budapest, and I am planning to start a PhD in the
> >> > >> autumn. I have been working for almost three years at Ericsson on
> >> > >> static analysis tools for C++. In 2014 I participated in GSoC, working
> >> > >> on the LLVM project, and I have been a frequent contributor ever
> >> > >> since. The following summer I interned at Apple.
> >> > >>
> >> > >> I learned about the Flink project not too long ago and I like it so
> >> > >> far. Over the last few weeks I have been working on some tickets to
> >> > >> familiarize myself with the codebase:
> >> > >>
> >> > >> https://issues.apache.org/jira/browse/FLINK-3422
> >> > >>
> >> > >> https://issues.apache.org/jira/browse/FLINK-3322
> >> > >>
> >> > >> https://issues.apache.org/jira/browse/FLINK-3457
> >> > >>
> >> > >> My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
> >> > >>
> >> > >> References
> >> > >>
> >> > >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
> >> > >> [2] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
> >> > >> [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
> >> > >> [4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
> >> > >> [5] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
> >> > >> [6] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
> >> > >> [7] https://issues.apache.org/jira/browse/FLINK-3277
> >> > >>
> >> > >>
> >> > >> Best Regards,
> >> > >>
> >> > >> Gábor
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
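
To make the reflection and per-field dispatch problems from the quoted proposal concrete, here is a minimal Java sketch that contrasts reflective field access with the kind of straight-line code a generator could emit for one specific POJO. This is not Flink's PojoSerializer and not the output of Janino or Javassist: Point, serializeReflectively and serializeDirectly are hypothetical names used only for illustration, and the "generated" method is written by hand.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.Arrays;

final class SerializerSketch {

    // Hypothetical POJO, used only for illustration.
    static final class Point {
        public int x;
        public long y;

        Point(int x, long y) {
            this.x = x;
            this.y = y;
        }
    }

    // Reflective path: every field value comes back as a boxed Object and is
    // dispatched on its runtime type.
    static byte[] serializeReflectively(Point p)
            throws IOException, ReflectiveOperationException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        Field[] fields = {Point.class.getField("x"), Point.class.getField("y")};
        for (Field field : fields) {
            Object value = field.get(p); // boxes the primitive
            if (value instanceof Integer) {
                out.writeInt((Integer) value);
            } else if (value instanceof Long) {
                out.writeLong((Long) value);
            } else {
                throw new IllegalStateException("unsupported type: " + field.getType());
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Hand-written stand-in for generated code: direct field reads, no boxing,
    // no loop over field serializers.
    static byte[] serializeDirectly(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(p.x);
        out.writeLong(p.y);
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Point p = new Point(7, 42L);
        byte[] reflective = serializeReflectively(p);
        byte[] direct = serializeDirectly(p);
        // Both paths produce the same 12 bytes (4 for the int, 8 for the long).
        System.out.println(Arrays.equals(reflective, direct) + " " + direct.length);
    }
}
```

The two methods write identical bytes; the difference is only in how the field values are obtained, which is the part the proposal wants to move from reflection to generated code.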

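The annotation idea from the quoted proposal (no null tag for fields that are never null, no subclass tag for final POJOs) also determines when a fixed serialized length can be known. The sketch below assumes a hypothetical @NeverNull field annotation, which is not an existing Flink annotation, and handles only int and long fields. It returns -1 when the length is not fixed, following the convention of TypeSerializer.getLength() for variable-length types.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

final class LengthSketch {

    // Hypothetical marker annotation; RUNTIME retention so reflection can see it.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface NeverNull {}

    // Hypothetical POJOs, used only for illustration.
    static final class Plain {
        public Integer id;
        public Long timestamp;
    }

    static final class Annotated {
        @NeverNull public Integer id;
        @NeverNull public Long timestamp;
    }

    static class Open {
        public int id;
    }

    // Returns the fixed serialized length in bytes, or -1 if it is not fixed.
    static int fixedLength(Class<?> pojo) {
        if (!Modifier.isFinal(pojo.getModifiers())) {
            return -1; // a subclass could appear, so a subclass tag is needed
        }
        int total = 0;
        for (Field field : pojo.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            int size = valueSize(field.getType());
            boolean canBeNull = !field.getType().isPrimitive()
                    && !field.isAnnotationPresent(NeverNull.class);
            if (size < 0 || canBeNull) {
                return -1; // unknown type, or a null tag makes the length vary
            }
            total += size;
        }
        return total;
    }

    // Size of the value itself; only int and long are covered in this sketch.
    static int valueSize(Class<?> type) {
        if (type == int.class || type == Integer.class) {
            return 4;
        }
        if (type == long.class || type == Long.class) {
            return 8;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(fixedLength(Plain.class));     // -1
        System.out.println(fixedLength(Annotated.class)); // 12
        System.out.println(fixedLength(Open.class));      // -1
    }
}
```

A real implementation would do this check in the type extractor rather than per call, and would need to cover every supported field type.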