Re: GSoC Project Proposal Draft: Code Generation in Serializers

2016-05-29 Thread Gábor Horváth
Hi!

I would like to give you some status updates on the Google Summer of Code
project. I started to implement the proposed features [1].

Status of code generation in general:
* I can compile the generated code using the Janino compiler
* I can load the compiled classes and use them
* For some mysterious reason, during deserialization of a Janino-compiled
object, the readObject method is not invoked. When the same code is
compiled with another compiler, it works as intended. I am investigating
this issue. In case any of you have an idea what the problem might be,
don't keep it a secret :) While I am trying to solve this issue, I
continue to work on code generation. I can still test the generated code
by registering it manually.
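
For reference, the compile-and-load step with Janino looks roughly like
this (a minimal sketch, not the actual project code; the class name and
source string are placeholders):

// Sketch: compile generated Java source in memory with Janino and load it.
import org.codehaus.janino.SimpleCompiler;

public class CodegenDemo {
   public static void main(String[] args) throws Exception {
      String source =
         "public class GeneratedSerializer {"
         + "  public String describe() { return \"generated\"; }"
         + "}";

      // Compile the source string in memory; no class files touch disk.
      SimpleCompiler compiler = new SimpleCompiler();
      compiler.cook(source);

      // Load the compiled class through Janino's class loader and call it.
      Class<?> clazz =
         compiler.getClassLoader().loadClass("GeneratedSerializer");
      Object serializer = clazz.getDeclaredConstructor().newInstance();
      System.out.println(clazz.getMethod("describe").invoke(serializer));
   }
}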

Status of generated POJO serializers:
* I could use the generated code on the WordCountPojo example.
* Everything is implemented except for copying stateful serializers and
serializing subclasses.
* There are several potential performance advantages of the generated
serializers (see the sketch after this list):
  - The serialization/deserialization of the fields is not done in a loop,
giving the JVM a better chance to inline and devirtualize
  - Null checks are eliminated for primitive types
  - Subclass checks are eliminated for final classes
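
To illustrate the unrolling, the serialize method of a generated serializer
could look roughly like the sketch below (hand-written for this mail under
assumed field names, not actual generator output):

import org.apache.flink.core.memory.DataOutputView;

// Sketch: each field is written directly instead of looping over a
// Field[] array and dispatching to per-field serializers.
public final class WordWithCountSerializerSketch {

   // WordWithCount and its fields are hypothetical.
   public static final class WordWithCount {
      public String word;
      public int count;
   }

   public void serialize(WordWithCount record, DataOutputView target)
         throws java.io.IOException {
      // 'count' is a primitive int, so no null check is generated for it.
      target.writeInt(record.count);

      // 'word' is a String, a final class, so no subclass tag is needed,
      // only a null marker.
      if (record.word == null) {
         target.writeBoolean(true);
      } else {
         target.writeBoolean(false);
         target.writeUTF(record.word);
      }
   }
}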

Status of generated POJO comparators:
* I started to implement them.

I did some preliminary benchmarks with the generated code using the
WordCountPojo example.
* In the baseline (using the default Flink serializers),
PojoSerializer.deserialize was one of the hottest methods (with over 11
percent of the samples).
* Using the generated serializers, the percentage of samples in the
deserialize method dropped below 3 percent.
* A very significant amount of time is spent in the comparators, so there
are potential performance gains there as well.

What's next? I am trying to solve the problem with the readObject method,
and in the meantime I am trying to get the generated comparators working
on the WordCountPojo example. Once that is done, I will do a more detailed
performance case study. After that I will add support for handling
subclasses in the generated code. At that point the generated code will
have all the required features.

Note: I did not change the serialization format, so the generated code can
work with the default serializers. This is crucial for backward
compatibility with savepoints.

Regards,
Gábor

[1] https://github.com/Xazax-hun/flink/commits/serializer_codegen

On 23 April 2016 at 10:33, Gábor Horváth  wrote:

> Hi,
>
> The GSoC project proposal was accepted! Thank you for all your support. I
> will do my best to live up to the challenges and deliver everything that
> was planned for this summer.
>
> Best Regards,
> Gábor
>
> On 20 April 2016 at 16:18, Gábor Horváth  wrote:
>
>> On second thought, I think you are right. I had the impression that
>> there was a cyclic dependency between TypeInformation and the serializers,
>> but that is not the case. So no rewrite of TypeInformation is needed in
>> order to be able to use Scala for the serializers.
>>
>> According to the proposal, unless someone utilizes the annotations, the
>> generated serializers would be compatible with the current ones. There
>> could be a configuration option for whether to try to make the layout
>> more compact based on the annotations.
>>
>> On 20 April 2016 at 16:03, Fabian Hueske  wrote:
>>
>>> Why would you need to rewrite the TypeInformation in Scala?
>>> I think we need a way to replace Serializer implementations anyway, unless
>>> the generated serializers are compatible with the current ones.
>>>
>>> 2016-04-20 15:53 GMT+02:00 Gábor Horváth :
>>>
>>> > Hi Fabian,
>>> >
>>> > I agree that it would be awesome to move this to its own module/plugin.
>>> > However, in order to be able to write the code generation in Scala, I
>>> > would need to rewrite the type information to use Scala as well. I
>>> > think I will not have time to do this during the summer, so I will
>>> > stick to Java and this modularization can be done later.
>>> >
>>> > Thanks,
>>> > Gábor
>>> >
>>> > On 19 April 2016 at 11:50, Fabian Hueske  wrote:
>>> >
>>> > > Hi Gabor,
>>> > >
>>> > > you are right, a codegen serializer module would depend on flink-core,
>>> > > and in the current design flink-core would need to know about the
>>> > > type infos / serializers / comparators.
>>> > >
>>> > > Decoupling the implementations of type infos, serializers, and
>>> > > comparators from flink-core and resolving the cyclic dependency is
>>> > > what the plugin architecture would be for.
>>> > > Maybe this can be done by some mechanism to dynamically load
>>> > > TypeInformations for types with overridden serializers / comparators.
>>> > > This would require a design document and discussion in the community.
>>> > >
>>> > > Cheers, Fabian
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > 2016-04-18 21:19 GMT+02:00 Gábor Horváth :
>>> > >
>>> > > > Unfortunately making code generation a separate 

Re: Would like to add a connector to DistributedLog.

2016-05-29 Thread Yijie Shen
Hi Jia, I'm currently working on a stream connector for DL. Do you mind if
we cooperate on FLINK-3987?
If you don't mind, I would do some cleanup locally in the next few days and
share it here.

Hi Sijie, I think we should publish the DL jars to a public repository so
that they can be referenced by other projects.
I filed https://github.com/twitter/distributedlog/issues/20 a couple of
days ago.

Best,
Yijie

On Sun, May 29, 2016 at 2:55 PM, Sijie Guo  wrote:

> Ah, that's cool. Looking forward to your Flink connector. We are willing
> to offer any help on DL.
>
> - Sijie
>
> On Sat, May 28, 2016 at 9:33 PM, Jia Zhai  wrote:
>
> > +DistributedLog group to get potential help and instructions.
> > Thanks.
> >
> > On Sun, May 29, 2016 at 11:33 AM, Jia Zhai  wrote:
> >
> >> Hi,
> >> I would like to add a connector to DistributedLog, which was recently
> >> published by Twitter. :)
> >> Are there any special instructions or best practices for adding a new
> >> connector? Any suggestions are much appreciated.
> >>
> >> The JIRA tickets (FLINK-3987 and FLINK-3988) have been created for
> >> this work.
> >>
> >> All the information about DistributedLog can be found here:
> >> http://distributedlog.io.
> >>
> >> Thanks a lot.
> >> -Jia
> >>
> >


[jira] [Created] (FLINK-3990) ZooKeeperLeaderElectionTest kills the JVM

2016-05-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-3990:
---

 Summary: ZooKeeperLeaderElectionTest kills the JVM
 Key: FLINK-3990
 URL: https://issues.apache.org/jira/browse/FLINK-3990
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.1.0
Reporter: Stephan Ewen
Priority: Critical


Frequently, the {{ZooKeeperLeaderElectionTest}} causes the JVM to exit, failing 
the build.

https://api.travis-ci.org/jobs/133759814/log.txt?deansi=true





Side-effects of DataSet::count

2016-05-29 Thread Eron Wright
I was curious as to how the `count` method on DataSet worked, and was
surprised to see that it executes the entire program graph. Wouldn’t this
cause undesirable side-effects like writing to sinks? Also strange that
the graph is mutated with the addition of a sink (that isn’t subsequently
removed).

Surveying the Flink code, there aren’t many situations where the program
graph is implicitly executed (`collect` is another). Nonetheless, this has
deepened my appreciation for how dynamic the application might be.

// DataSet.java
public long count() throws Exception {
   final String id = new AbstractID().toString();

   output(new Utils.CountHelper<T>(id)).name("count()");

   JobExecutionResult res = getExecutionEnvironment().execute();
   return res.<Long> getAccumulatorResult(id);
}
Eron

Re: Side-effects of DataSet::count

2016-05-29 Thread Márton Balassi
Hey Eron,

Yes, the DataSet#collect and count methods implicitly trigger a JobGraph
execution, so they also trigger writing to any previously defined sinks.
The idea behind this behavior is to enable interactive querying (the kind
you are used to getting from a shell environment), and it is also a great
debugging tool.
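
For illustration, here is a minimal sketch of that side-effect (the path
and the data are made up for this mail):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class CountSideEffectDemo {
   public static void main(String[] args) throws Exception {
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
      DataSet<String> words = env.fromElements("to", "be", "or", "not");

      // Sink defined here; nothing has been executed yet.
      words.writeAsText("/tmp/words-out");

      // count() implicitly submits the whole program graph, so
      // /tmp/words-out is also written as a side effect of this call.
      long n = words.count();
      System.out.println(n); // 4
   }
}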

Best,

Marton

On Sun, May 29, 2016 at 11:28 PM, Eron Wright  wrote:

> I was curious as to how the `count` method on DataSet worked, and was
> surprised to see that it executes the entire program graph. Wouldn’t this
> cause undesirable side-effects like writing to sinks? Also strange that
> the graph is mutated with the addition of a sink (that isn’t subsequently
> removed).
>
> Surveying the Flink code, there aren’t many situations where the program
> graph is implicitly executed (`collect` is another). Nonetheless, this
> has deepened my appreciation for how dynamic the application might be.
>
> // DataSet.java
> public long count() throws Exception {
>    final String id = new AbstractID().toString();
>
>    output(new Utils.CountHelper<T>(id)).name("count()");
>
>    JobExecutionResult res = getExecutionEnvironment().execute();
>    return res.<Long> getAccumulatorResult(id);
> }
> Eron


Re: Would like to add a connector to DistributedLog.

2016-05-29 Thread Jia Zhai
Hi Gordon.
Thanks a lot for the information and the Jira ticket.

On Sun, May 29, 2016 at 1:37 PM, Tzu-Li (Gordon) Tai 
wrote:

> Hi Jia,
>
> There's currently no dedicated documentation for composing streaming
> connectors, but as examples, I would say that the Elasticsearch and NiFi
> connectors are a good place to start learning. For a more full-blown
> example, I recommend taking a look at the Kafka connector. Basically,
> you'll extend RichParallelSourceFunction for sources (or RichSinkFunction
> for sinks) and implement the required methods that will be called during
> the source / sink tasks' life cycle.
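>
> For illustration, a minimal source skeleton along those lines could look
> like the sketch below (the class name and the reading logic are
> hypothetical placeholders, not DL-specific code):
>
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
>
> public class DistributedLogSource extends RichParallelSourceFunction<String> {
>    private volatile boolean running = true;
>
>    @Override
>    public void open(Configuration parameters) throws Exception {
>       // Open the connection to the log here (life cycle: before run()).
>    }
>
>    @Override
>    public void run(SourceContext<String> ctx) throws Exception {
>       while (running) {
>          String record = readNextRecord(); // hypothetical helper
>          synchronized (ctx.getCheckpointLock()) {
>             ctx.collect(record);
>          }
>       }
>    }
>
>    @Override
>    public void cancel() {
>       running = false;
>    }
>
>    @Override
>    public void close() throws Exception {
>       // Release the log connection here (life cycle: after run()).
>    }
>
>    private String readNextRecord() {
>       return "..."; // placeholder for the actual log read
>    }
> }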
>
> I'd be happy to help you with any questions you may run into while
> implementing the connector. Also, I'm thinking about adding documentation
> for composing a new connector (filed in
> https://issues.apache.org/jira/browse/FLINK-3989), as it's a frequently
> asked question, so any feedback during your learning process will be
> quite helpful!
>
>
>
>