Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth Wed, 20 Apr 2016 06:53:51 -0700

Hi Fabian,

I agree that it would be awesome to move this to its own module/plugin.
However in order to be able to write the code generation in Scala I would
need to rewrite the type information to use Scala as well. I think I will
not
have time to do this during the summer, so I think I will stick to Java and
this modularization can be done later.


Thanks,
Gábor

On 19 April 2016 at 11:50, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Gabor,
>
> you are right, a codegen serializer module would depend on flink-core and
> in the current design flink-core would need to know about the type infos /
> serializers / comparators.
>
> Decoupling implementations of type info, serializers, and comparators from
> flink-core and resolving the cyclic dependency would be what the plugin
> architecture would be for.
> Maybe this can be done by some mechanism to dynamically load
> TypeInformations for types with overridden serializers / comparators.
> This would require some design document and discussion in the community.
>
> Cheers, Fabian
>
>
>
>
>
> 2016-04-18 21:19 GMT+02:00 Gábor Horváth <xazax....@gmail.com>:
>
> > Unfortunately making code generation a separate module would introduce
> > cyclic dependency.
> > Code generation requires the TypeInfo which is available in flink-core
> and
> > flink-core requires
> > the generated serializers from the code generation module. Do you have a
> > solution for this?
> >
> > I think if we can come up with a solution I will implement it as a
> separate
> > Scala module
> > otherwise I will stick to Java.
> >
> > BR,
> > Gábor
> >
> > On 18 April 2016 at 12:40, Fabian Hueske <fhue...@gmail.com> wrote:
> >
> > > +1 for not mixing Java and Scala in flink-core.
> > >
> > > Maybe it makes sense to implement the code generated serializers /
> > > comparators as a separate module which can be plugged-in. This could be
> > > pure Scala.
> > > In general, I think it would be good to have some kind of "version
> > > management" for serializers in place. With features such as safepoints
> > that
> > > depend on the implementation of serializers, it would be good to have a
> > > mechanism to switch between implementations.
> > >
> > > Best, Fabian
> > >
> > > 2016-04-18 10:01 GMT+02:00 Chiwan Park <chiwanp...@apache.org>:
> > >
> > > > Yes, I know Janino is a pure Java project. I meant if we add Scala
> code
> > > to
> > > > flink-core, we should add Scala dependency to flink-core and it could
> > be
> > > > confusing.
> > > >
> > > > Regards,
> > > > Chiwan Park
> > > >
> > > > > On Apr 18, 2016, at 2:49 PM, Márton Balassi <
> > balassi.mar...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Chiwan, just to clarify Janino is a Java project. [1]
> > > > >
> > > > > [1] https://github.com/aunkrig/janino
> > > > >
> > > > > On Mon, Apr 18, 2016 at 3:40 AM, Chiwan Park <
> chiwanp...@apache.org>
> > > > wrote:
> > > > >
> > > > >> I prefer to avoid Scala dependencies in flink-core. If flink-core
> > > > includes
> > > > >> Scala dependencies, Scala version suffix (_2.10 or _2.11) should
> be
> > > > added.
> > > > >> I think that users could be confused.
> > > > >>
> > > > >> Regards,
> > > > >> Chiwan Park
> > > > >>
> > > > >>> On Apr 17, 2016, at 3:49 PM, Márton Balassi <
> > > balassi.mar...@gmail.com>
> > > > >> wrote:
> > > > >>>
> > > > >>> Hi Gábor,
> > > > >>>
> > > > >>> I think that adding the Janino dep to flink-core should be fine,
> as
> > > it
> > > > >> has
> > > > >>> quite slim dependencies [1,2] which are generally orthogonal to
> > > Flink's
> > > > >>> main dependency line (also it is already used elsewhere).
> > > > >>>
> > > > >>> As for mixing Scala code that is used from the Java parts of the
> > same
> > > > >> maven
> > > > >>> module I am skeptical. We have seen IDE compilation issues with
> > > > projects
> > > > >>> using this setup and have decided that the community-wide
> potential
> > > IDE
> > > > >>> setup pain outweighs the individual implementation convenience
> with
> > > > >> Scala.
> > > > >>>
> > > > >>> [1]
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
> > > > >>> [2]
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
> > > > >>>
> > > > >>> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <
> > xazax....@gmail.com>
> > > > >> wrote:
> > > > >>>
> > > > >>>> Hi!
> > > > >>>>
> > > > >>>> Table API already uses code generation and the Janino compiler
> > [1].
> > > Is
> > > > >> it a
> > > > >>>> dependency that is ok to add to flink-core? In case it is ok, I
> > > think
> > > > I
> > > > >>>> will use the same in order to be consistent with the other code
> > > > >> generation
> > > > >>>> efforts.
> > > > >>>>
> > > > >>>> I started to look at the Table API code generation [2] and it
> uses
> > > > Scala
> > > > >>>> extensively. There are several Scala features that can make Java
> > > code
> > > > >>>> generation easier such as pattern matching and string
> > > interpolation. I
> > > > >> did
> > > > >>>> not see any Scala code in flink-core yet. Is it ok to implement
> > the
> > > > code
> > > > >>>> generation inside the flink-core using Scala?
> > > > >>>>
> > > > >>>> Regards,
> > > > >>>> Gábor
> > > > >>>>
> > > > >>>> [1] http://unkrig.de/w/Janino
> > > > >>>> [2]
> > > > >>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
> > > > >>>>
> > > > >>>> On 18 March 2016 at 19:37, Gábor Horváth <xazax....@gmail.com>
> > > wrote:
> > > > >>>>
> > > > >>>>> Thank you! I finalized the project.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On 18 March 2016 at 10:29, Márton Balassi <
> > > balassi.mar...@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Thanks Gábor, now I also see it on the internal GSoC
> interface.
> > I
> > > > have
> > > > >>>>>> indicated that I wish to mentor your project, I think you can
> > hit
> > > > >>>> finalize
> > > > >>>>>> on your project there.
> > > > >>>>>>
> > > > >>>>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <
> > > > xazax....@gmail.com>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hi,
> > > > >>>>>>>
> > > > >>>>>>> I have updated this draft to include preliminary benchmarks,
> > > > >> mentioned
> > > > >>>>>> the
> > > > >>>>>>> interaction of annotations with savepoints, extended it with
> a
> > > > >>>> timeline,
> > > > >>>>>>> and some notes about scala case classes.
> > > > >>>>>>>
> > > > >>>>>>> Regards,
> > > > >>>>>>> Gábor
> > > > >>>>>>>
> > > > >>>>>>> On 9 March 2016 at 16:12, Gábor Horváth <xazax....@gmail.com
> >
> > > > wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi!
> > > > >>>>>>>>
> > > > >>>>>>>> As far as I can see the formatting was not correct in my
> > > previous
> > > > >>>>>> mail. A
> > > > >>>>>>>> better formatted version is available here:
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
> > > > >>>>>>>> Sorry for that.
> > > > >>>>>>>>
> > > > >>>>>>>> Regards,
> > > > >>>>>>>> Gábor
> > > > >>>>>>>>
> > > > >>>>>>>> On 9 March 2016 at 15:51, Gábor Horváth <
> xazax....@gmail.com>
> > > > >>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hi,I did not want to send this proposal out before the I
> have
> > > > some
> > > > >>>>>>>>> initial benchmarks, but this issue was mentioned on the
> > mailing
> > > > >>>> list
> > > > >>>>>> (
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
> > > > >>>>>>> ),
> > > > >>>>>>>>> and I wanted to make this information available to be able
> to
> > > > >>>>>>> incorporate
> > > > >>>>>>>>> this into that discussion. I have written this draft with
> the
> > > > help
> > > > >>>> of
> > > > >>>>>>> Gábor
> > > > >>>>>>>>> Gévay and Márton Balassi and I am open to every suggestion.
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> The proposal draft:
> > > > >>>>>>>>> Code Generation in Serializers and Comparators of Apache
> > Flink
> > > > >>>>>>>>>
> > > > >>>>>>>>> I am doing my last semester of my MSc studies and I’m a
> > former
> > > > GSoC
> > > > >>>>>>>>> student in the LLVM project. I plan to improve the
> > > serialization
> > > > >>>>>> code in
> > > > >>>>>>>>> Flink during this summer. The current implementation of the
> > > > >>>>>> serializers
> > > > >>>>>>> can
> > > > >>>>>>>>> be a performance bottleneck in some scenarios. These
> > > performance
> > > > >>>>>>> problems
> > > > >>>>>>>>> were also reported on the mailing list recently [1]. I plan
> > to
> > > > >>>>>> implement
> > > > >>>>>>>>> code generation into the serializers to improve the
> > performance
> > > > (as
> > > > >>>>>>> Stephan
> > > > >>>>>>>>> Ewen also suggested.)
> > > > >>>>>>>>>
> > > > >>>>>>>>> TODO: I plan to include some preliminary benchmarks in this
> > > > >>>> section.
> > > > >>>>>>>>> Performance problems with the current serializers
> > > > >>>>>>>>>
> > > > >>>>>>>>>  1.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  PojoSerializer uses reflection for accessing the fields,
> > which
> > > > >>>> is
> > > > >>>>>>>>>  slow (eg. [2])
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>  -
> > > > >>>>>>>>>
> > > > >>>>>>>>>  This is also a serious problem for the comparators
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>  1.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  When deserializing fields of primitive types (eg. int),
> the
> > > > >>>>>> reusing
> > > > >>>>>>>>>  overload of the corresponding field serializers cannot
> > really
> > > do
> > > > >>>>>> any
> > > > >>>>>>> reuse,
> > > > >>>>>>>>>  because boxed primitive types are immutable in Java. This
> > > > >>>> results
> > > > >>>>>> in
> > > > >>>>>>> lots
> > > > >>>>>>>>>  of object creations. [3][7]
> > > > >>>>>>>>>  2.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  The loop to call the field serializers makes virtual
> > function
> > > > >>>>>> calls,
> > > > >>>>>>>>>  that cannot be speculatively devirtualized by the JVM or
> > > > >>>> predicted
> > > > >>>>>>> by the
> > > > >>>>>>>>>  CPU, because different serializer subclasses are invoked
> for
> > > the
> > > > >>>>>>> different
> > > > >>>>>>>>>  fields. (And the loop cannot be unrolled, because the
> number
> > > of
> > > > >>>>>>> iterations
> > > > >>>>>>>>>  is not a compile time constant.) See also the following
> > > > >>>> discussion
> > > > >>>>>>> on the
> > > > >>>>>>>>>  mailing list [1].
> > > > >>>>>>>>>  3.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  A POJO field can have the value null, so the serializer
> > > inserts
> > > > >>>> 1
> > > > >>>>>>>>>  byte null tags, which wastes space. (Also, the type
> > extractor
> > > > >>>>>> logic
> > > > >>>>>>> does
> > > > >>>>>>>>>  not distinguish between primitive types and their boxed
> > > > >>>> versions,
> > > > >>>>>> so
> > > > >>>>>>> even
> > > > >>>>>>>>>  an int field has a null tag.)
> > > > >>>>>>>>>  4.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  Subclass tags also add a byte at the beginning of every
> POJO
> > > > >>>>>>>>>  5.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  getLength() does not know the size in most cases [4]
> > > > >>>>>>>>>  Knowing the size of a type when serialized has numerous
> > > > >>>>>> performance
> > > > >>>>>>>>>  benefits throughout Flink:
> > > > >>>>>>>>>  1.
> > > > >>>>>>>>>
> > > > >>>>>>>>>     Sorters can do in-place, when the type is small [5]
> > > > >>>>>>>>>     2.
> > > > >>>>>>>>>
> > > > >>>>>>>>>     Chaining hash tables do not need resizes, because they
> > know
> > > > >>>> how
> > > > >>>>>>>>>     many buckets to allocate upfront [6]
> > > > >>>>>>>>>     3.
> > > > >>>>>>>>>
> > > > >>>>>>>>>     Different hash table architectures could be used, eg.
> > open
> > > > >>>>>>>>>     addressing with linear probing instead of some chaining
> > > > >>>>>>>>>     4.
> > > > >>>>>>>>>
> > > > >>>>>>>>>     It is possible to deserialize, modify, and then
> serialize
> > > > >>>> back
> > > > >>>>>> a
> > > > >>>>>>>>>     record to its original place, because it cannot happen
> > that
> > > > >>>> the
> > > > >>>>>>> modified
> > > > >>>>>>>>>     version does not fit in the place allocated there for
> the
> > > old
> > > > >>>>>>> version (see
> > > > >>>>>>>>>     CompactingHashTable and ReduceHashTable for concrete
> > > > >>>> instances
> > > > >>>>>> of
> > > > >>>>>>> this
> > > > >>>>>>>>>     problem)
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> Note, that 2. and 3. are problems with not just the
> > > > PojoSerializer,
> > > > >>>>>> but
> > > > >>>>>>>>> also with the TupleSerializer.
> > > > >>>>>>>>> Solution approaches
> > > > >>>>>>>>>
> > > > >>>>>>>>>  1.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  Run time code generation for every POJO
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>  -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     1. and 3 . would be automatically solved, if the
> > > serializers
> > > > >>>>>> for
> > > > >>>>>>>>>     POJOs would be generated on-the-fly (by, for example,
> > > > >>>>>> Javassist)
> > > > >>>>>>>>>     -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     2. also needs code generation, and also some extra
> effort
> > > in
> > > > >>>>>> the
> > > > >>>>>>>>>     type extractor to distinguish between primitive types
> and
> > > > >>>> their
> > > > >>>>>>> boxed
> > > > >>>>>>>>>     versions
> > > > >>>>>>>>>     -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     could be used for PojoComparator as well (which could
> > > greatly
> > > > >>>>>>>>>     increase the performance of sorting)
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>  1.
> > > > >>>>>>>>>
> > > > >>>>>>>>>  Annotations on POJOs (by the users)
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>  -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     Concretely:
> > > > >>>>>>>>>     -
> > > > >>>>>>>>>
> > > > >>>>>>>>>        annotate fields that will never be nulls -> no null
> > tag
> > > > >>>>>> needed
> > > > >>>>>>>>>        before every field!
> > > > >>>>>>>>>        -
> > > > >>>>>>>>>
> > > > >>>>>>>>>        make a POJO final -> no subclass tag needed
> > > > >>>>>>>>>        -
> > > > >>>>>>>>>
> > > > >>>>>>>>>        annotating a POJO that it will not be null -> no top
> > > level
> > > > >>>>>> null
> > > > >>>>>>>>>        tag needed
> > > > >>>>>>>>>        -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     These would also help with the getLength problem (6.),
> > > > >>>> because
> > > > >>>>>> the
> > > > >>>>>>>>>     length is often not known because currently anything
> can
> > be
> > > > >>>>>> null
> > > > >>>>>>> or a
> > > > >>>>>>>>>     subclass can appear anywhere
> > > > >>>>>>>>>     -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     These annotations could be done without code
> generation,
> > > but
> > > > >>>>>> then
> > > > >>>>>>>>>     they would add some overhead when there are no
> > annotations
> > > > >>>>>>> present, so this
> > > > >>>>>>>>>     would work better together with the code generation
> > > > >>>>>>>>>     -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     Tuples would become a special case of POJOs, where
> > nothing
> > > > >>>> can
> > > > >>>>>> be
> > > > >>>>>>>>>     null, and no subclass can appear, so maybe we could
> > > eliminate
> > > > >>>>>> the
> > > > >>>>>>>>>     TupleSerializer
> > > > >>>>>>>>>     -
> > > > >>>>>>>>>
> > > > >>>>>>>>>     We could annotate some internal types in Flink
> libraries
> > > > >>>> (Gelly
> > > > >>>>>>>>>     (Vertex, Edge), FlinkML)
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> TODO: what is the situation with Scala case classes? Run
> time
> > > > code
> > > > >>>>>>>>> generation is probably easier in Scala? (with quasiquotes)
> > > > >>>>>>>>>
> > > > >>>>>>>>> About me
> > > > >>>>>>>>>
> > > > >>>>>>>>> I am in the last year of my Computer Science MSc studies at
> > > > Eotvos
> > > > >>>>>>> Lorand
> > > > >>>>>>>>> University in Budapest, and planning to start a PhD in the
> > > > autumn.
> > > > >>>> I
> > > > >>>>>>> have
> > > > >>>>>>>>> been working for almost three years at Ericsson on static
> > > > analysis
> > > > >>>>>> tools
> > > > >>>>>>>>> for C++. In 2014 I participated in GSoC, working on the
> LLVM
> > > > >>>> project,
> > > > >>>>>>> and I
> > > > >>>>>>>>> am a frequent contributor ever since. The next summer I was
> > > > >>>>>> interning at
> > > > >>>>>>>>> Apple.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I learned about the Flink project not too long ago and I
> like
> > > it
> > > > so
> > > > >>>>>> far.
> > > > >>>>>>>>> The last few weeks I was working on some tickets to
> > familiarize
> > > > >>>>>> myself
> > > > >>>>>>> with
> > > > >>>>>>>>> the codebase:
> > > > >>>>>>>>>
> > > > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3422
> > > > >>>>>>>>>
> > > > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3322
> > > > >>>>>>>>>
> > > > >>>>>>>>> https://issues.apache.org/jira/browse/FLINK-3457
> > > > >>>>>>>>>
> > > > >>>>>>>>> My CV is available here:
> > > > http://xazax.web.elte.hu/files/resume.pdf
> > > > >>>>>>>>> References
> > > > >>>>>>>>>
> > > > >>>>>>>>> [1]
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
> > > > >>>>>>>>>
> > > > >>>>>>>>> [2]
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369
> > > > >>>>>>>>>
> > > > >>>>>>>>> [3]
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73
> > > > >>>>>>>>>
> > > > >>>>>>>>> [4]
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98
> > > > >>>>>>>>>
> > > > >>>>>>>>> [5]
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java
> > > > >>>>>>>>>
> > > > >>>>>>>>> [6]
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861
> > > > >>>>>>>>> [7] https://issues.apache.org/jira/browse/FLINK-3277
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best Regards,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Gábor
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: GSoC Project Proposal Draft: Code Generation in Serializers

Reply via email to