Very nice proposal!
On Wed, Mar 9, 2016 at 6:35 PM, Stephan Ewen <se...@apache.org> wrote:

> Thanks for posting this.
>
> I think it is not super urgent (in the sense of weeks or a few months), so
> results around mid summer is probably good.
> The background in LLVM is a very good base for this!
>
> On Wed, Mar 9, 2016 at 3:56 PM, Gábor Horváth <xazax....@gmail.com> wrote:
>
>> Hi,
>>
>> In the meantime I sent out the current version of the proposal draft [1].
>> Hopefully it will help you triage this task and contribute to the
>> discussion of the problem.
>> How urgent is this issue? In what time frame should there be results?
>>
>> Best Regards,
>> Gábor
>>
>> [1]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/GSoC-Project-Proposal-Draft-Code-Generation-in-Serializers-td10702.html
>>
>> On 9 March 2016 at 14:49, Stephan Ewen <se...@apache.org> wrote:
>>
>> > Do we have consensus that we want to "reserve" this topic for a GSoC
>> > student?
>> >
>> > It is becoming a feature that gains more importance. To see whether we
>> > can "hold off" on working on it, it would be good to know a bit more,
>> > like:
>> > - when is it decided whether this project takes place?
>> > - when would results be there?
>> > - can we expect the results to be usable, i.e., how good is the student?
>> > (no offence, but so far the results in GSoC were everywhere between very
>> > good and super bad)
>> >
>> > Greetings,
>> > Stephan
>> >
>> > On Tue, Mar 8, 2016 at 4:28 PM, Márton Balassi <balassi.mar...@gmail.com> wrote:
>> >
>> > > @Fabian: That is my bad, but I think we should still be on time.
>> > > Pinged Uli just to make sure. Proposal from Gabor and Jira from me
>> > > are coming soon.
>> > >
>> > > On Tue, Mar 8, 2016 at 11:43 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>> > >
>> > > > Hi Gabor,
>> > > >
>> > > > I did not find any Flink proposals for this year's GSoC in JIRA
>> > > > (they should be labeled with gsoc2016).
>> > > > I am also not sure if any of the Flink committers signed up as a
>> > > > GSoC mentor.
>> > > > Maybe there is still time to do that, but as it looks right now
>> > > > there are no GSoC projects offered by Flink.
>> > > >
>> > > > Best, Fabian
>> > > >
>> > > > 2016-03-08 11:22 GMT+01:00 Gábor Horváth <xazax....@gmail.com>:
>> > > >
>> > > > > Hi!
>> > > > >
>> > > > > I am planning to do GSoC and I would like to work on the
>> > > > > serializers. More specifically, I would like to implement code
>> > > > > generation. I am planning to send the first draft of the proposal
>> > > > > to the mailing list early next week. If everything goes well, it
>> > > > > will include some preliminary benchmarks of how much performance
>> > > > > gain can be expected from hand-written serializers.
>> > > > >
>> > > > > Best regards,
>> > > > > Gábor
>> > > > >
>> > > > > On 8 March 2016 at 10:47, Stephan Ewen <se...@apache.org> wrote:
>> > > > >
>> > > > > > Ah, very good, that makes sense!
>> > > > > >
>> > > > > > I would guess that this performance difference could probably be
>> > > > > > seen at various points where generic serializers and comparators
>> > > > > > are used (also for Comparable, Writable) or where the
>> > > > > > TupleSerializer delegates to a sequence of other TypeSerializers.
>> > > > > >
>> > > > > > I guess creating more specialized serializers would solve some
>> > > > > > of these problems, like in your IntValue vs. LongValue case.
>> > > > > >
>> > > > > > The best way to solve that would probably be through code
>> > > > > > generation in the serializers. That has actually been my wish
>> > > > > > for quite a while. If you are also into these kinds of low-level
>> > > > > > performance topics, we could start a discussion on that.
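The code-generation idea discussed above can be sketched roughly as follows. This is an illustrative example only, not Flink's actual TypeSerializer API: it contrasts a generic serializer that dispatches per field through an interface with the kind of flat, type-specific code a generator could emit. All names (`FieldWriter`, `writeGeneric`, `writeSpecialized`) are hypothetical.

```java
import java.util.Arrays;

// Sketch contrasting a generic, per-field dispatching serializer with the
// flat code a serializer generator could emit. Illustrative names only,
// not Flink's actual TypeSerializer API.
public class CodegenSketch {

    // Generic path: one virtual call per field, values boxed as Object.
    interface FieldWriter {
        int write(Object value, byte[] target, int offset);
    }

    static final FieldWriter INT_WRITER = (value, target, offset) -> {
        int v = (Integer) value; // unboxing on every record
        target[offset]     = (byte) (v >>> 24);
        target[offset + 1] = (byte) (v >>> 16);
        target[offset + 2] = (byte) (v >>> 8);
        target[offset + 3] = (byte) v;
        return offset + 4;
    };

    static void writeGeneric(Object[] record, FieldWriter[] writers, byte[] target) {
        int offset = 0;
        for (int i = 0; i < writers.length; i++) {
            // This call site sees a different FieldWriter per field type and
            // can degrade to a megamorphic dispatch, which blocks inlining.
            offset = writers[i].write(record[i], target, offset);
        }
    }

    // Specialized path: what generated code for an (int, int) record could
    // look like -- no loop, no boxing, straight-line code the JIT can inline.
    static void writeSpecialized(int f0, int f1, byte[] target) {
        target[0] = (byte) (f0 >>> 24);
        target[1] = (byte) (f0 >>> 16);
        target[2] = (byte) (f0 >>> 8);
        target[3] = (byte) f0;
        target[4] = (byte) (f1 >>> 24);
        target[5] = (byte) (f1 >>> 16);
        target[6] = (byte) (f1 >>> 8);
        target[7] = (byte) f1;
    }

    public static void main(String[] args) {
        byte[] a = new byte[8];
        writeGeneric(new Object[] {1, 2},
                     new FieldWriter[] {INT_WRITER, INT_WRITER}, a);
        byte[] b = new byte[8];
        writeSpecialized(1, 2, b);
        // Both paths produce the same bytes; the specialized one just gets
        // there with fewer virtual calls and no boxing.
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```

The design point is that both paths are byte-identical on the wire, so code generation is purely a throughput optimization and does not change the serialization format.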
>> > > > > >
>> > > > > > Greetings,
>> > > > > > Stephan
>> > > > > >
>> > > > > > On Mon, Mar 7, 2016 at 11:25 PM, Greg Hogan <c...@greghogan.com> wrote:
>> > > > > >
>> > > > > > > The issue is not with the Tuple hierarchy (running Gelly
>> > > > > > > examples had no effect on runtime, and as you note there
>> > > > > > > aren't any subclass overrides) but with CopyableValue. I had
>> > > > > > > been using IntValue exclusively but had switched to using
>> > > > > > > LongValue for graph generation. CopyableValueComparator and
>> > > > > > > CopyableValueSerializer are now working with multiple types.
>> > > > > > >
>> > > > > > > If I create IntValue- and LongValue-specific versions of
>> > > > > > > CopyableValueComparator and CopyableValueSerializer and modify
>> > > > > > > ValueTypeInfo to return these, then I see the expected
>> > > > > > > performance.
>> > > > > > >
>> > > > > > > Greg
>> > > > > > >
>> > > > > > > On Mon, Mar 7, 2016 at 5:18 AM, Stephan Ewen <se...@apache.org> wrote:
>> > > > > > >
>> > > > > > > > Hi Greg!
>> > > > > > > >
>> > > > > > > > Sounds very interesting.
>> > > > > > > >
>> > > > > > > > Do you have a hunch which "virtual" Tuple methods are being
>> > > > > > > > used that become less JIT-able? In many cases, tuples use
>> > > > > > > > only field accesses (like "value.f1") in the user functions.
>> > > > > > > >
>> > > > > > > > I have to dig into the serializers to see if they could
>> > > > > > > > suffer from that. The "getField(pos)" method, for example,
>> > > > > > > > should always have many overrides (though few would be
>> > > > > > > > loaded at any time, because one usually does not use all
>> > > > > > > > Tuple classes at the same time).
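The effect Greg pinpoints can be sketched as follows. The types below are simplified stand-ins, not Flink's actual CopyableValue API: a single shared "generic" serializer has one call site that dispatches on every concrete value type it has ever handled, while per-type serializers (the IntValue-/LongValue-specific versions Greg created) keep each call site seeing exactly one receiver type, which the JIT can devirtualize and inline.

```java
// Sketch of shared (polymorphic) vs. per-type (monomorphic) value copying.
// Val, IntVal, and LongVal are simplified stand-ins, not Flink's actual
// CopyableValue/CopyableValueSerializer classes.
public class SpecializationSketch {

    interface Val {
        void copyTo(Val target);
    }

    static final class IntVal implements Val {
        int v;
        IntVal(int v) { this.v = v; }
        public void copyTo(Val target) { ((IntVal) target).v = v; }
    }

    static final class LongVal implements Val {
        long v;
        LongVal(long v) { this.v = v; }
        public void copyTo(Val target) { ((LongVal) target).v = v; }
    }

    // Shared "generic" serializer: this one call site dispatches on every
    // concrete type it has seen. With one type it stays monomorphic and
    // inlines; as more implementations show up it degrades (bimorphic, then
    // megamorphic), and the JIT stops inlining.
    static void copyGeneric(Val from, Val to) {
        from.copyTo(to);
    }

    // Per-type specialization, as in IntValue-/LongValue-specific
    // serializers: each call site can only ever see one receiver type.
    static void copyInt(IntVal from, IntVal to) { to.v = from.v; }
    static void copyLong(LongVal from, LongVal to) { to.v = from.v; }

    public static void main(String[] args) {
        IntVal i = new IntVal(42), i2 = new IntVal(0);
        LongVal l = new LongVal(7L), l2 = new LongVal(0L);
        copyGeneric(i, i2);
        copyGeneric(l, l2); // same call site, second receiver type
        copyInt(i, i2);     // always exactly one receiver type
        copyLong(l, l2);
        System.out.println(i2.v + " " + l2.v); // prints "42 7"
    }
}
```

Both paths are behaviorally identical; the difference is only in how many concrete types each call site is exposed to, which is exactly what swapping in type-specific serializers changes.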
>> > > > > > > >
>> > > > > > > > Greetings,
>> > > > > > > > Stephan
>> > > > > > > >
>> > > > > > > > On Fri, Mar 4, 2016 at 11:37 PM, Greg Hogan <c...@greghogan.com> wrote:
>> > > > > > > >
>> > > > > > > > > I am noticing what looks like the same drop-off in
>> > > > > > > > > performance when introducing TupleN subclasses as
>> > > > > > > > > described in "Understanding the JIT and tuning the
>> > > > > > > > > implementation" [1].
>> > > > > > > > >
>> > > > > > > > > I start my single-node cluster, run an algorithm which
>> > > > > > > > > relies purely on Tuples, and measure the runtime. I then
>> > > > > > > > > execute a separate jar which runs essentially the same
>> > > > > > > > > algorithm but using Gelly's Edge (which subclasses Tuple3
>> > > > > > > > > but does not add any extra fields), and now both the
>> > > > > > > > > Tuple and Edge algorithms take twice as long.
>> > > > > > > > >
>> > > > > > > > > Has this been previously discussed? If not, I can work up
>> > > > > > > > > a demonstration.
>> > > > > > > > >
>> > > > > > > > > [1] https://flink.apache.org/news/2015/09/16/off-heap-memory.html
>> > > > > > > > >
>> > > > > > > > > Greg
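The demonstration Greg offers could take roughly this shape: a naive harness (not JMH, single fork, so timings are indicative at best) that times a hot Tuple3-typed call site before and after a field-less subclass has flowed through it. The classes are hypothetical stand-ins for Tuple3 and Gelly's Edge. Note that because the subclass overrides nothing, a minimal reproduction like this may well show no slowdown, which is consistent with Greg's later finding upthread that the regression was in the CopyableValue serializers, not the Tuple hierarchy itself.

```java
// Naive harness for the drop-off scenario: time a hot call site over Tuple3
// instances, then over instances of a no-op subclass, then over Tuple3 again.
// Tuple3 and Edge here are hypothetical stand-ins, not Flink's/Gelly's
// classes; timings from a single unforked run are only indicative.
public class DropoffDemo {

    static class Tuple3 {
        long f0, f1, f2;
        Tuple3(long f0, long f1, long f2) { this.f0 = f0; this.f1 = f1; this.f2 = f2; }
        long fieldSum() { return f0 + f1 + f2; } // virtual method under test
    }

    // Subclasses Tuple3 but adds no fields and overrides nothing --
    // mirroring how Gelly's Edge relates to the real Tuple3.
    static class Edge extends Tuple3 {
        Edge(long s, long t, long v) { super(s, t, v); }
    }

    static long sum(Tuple3[] data) {
        long sum = 0;
        for (Tuple3 t : data) sum += t.fieldSum(); // hot call site
        return sum;
    }

    static long timedSum(String label, Tuple3[] data) {
        long start = System.nanoTime();
        long result = sum(data);
        System.out.println(label + ": " + (System.nanoTime() - start) + " ns");
        return result;
    }

    public static void main(String[] args) {
        Tuple3[] tuples = new Tuple3[100_000];
        Tuple3[] edges = new Tuple3[100_000];
        for (int i = 0; i < tuples.length; i++) {
            tuples[i] = new Tuple3(i, i, i);
            edges[i] = new Edge(i, i, i);
        }
        long a = timedSum("tuples (baseline)", tuples); // only Tuple3 seen so far
        long b = timedSum("edges (subclass)", edges);   // subclass enters the profile
        long c = timedSum("tuples (again)", tuples);    // re-measure original workload
        // Correctness check: all three runs compute the same arithmetic.
        System.out.println(a == c && a == b);
    }
}
```

A trustworthy version of this measurement would use a proper benchmarking harness with warmup iterations and separate JVM forks, since a single timed pass mostly measures interpreter and compilation noise.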