In any case, thank you guys for the exhaustive discussion :D

Aljoscha Krettek <aljos...@apache.org> wrote (on Fri, 31 Jul 2015, 13:52):
> Yes, I wouldn't deal with that now, that's orthogonal to the Types issue.
>
> On Fri, 31 Jul 2015 at 12:09 Chesnay Schepler <c.schep...@web.de> wrote:
>
> > I feel like we drifted away from the original topic a bit, but alright.
> >
> > I don't consider it a pity that we created a proprietary protocol. We know
> > exactly how it works and what it is capable of. It is also made exactly
> > for our use case, in contrast to general-purpose libraries. If we ever
> > decide that the current implementation is lacking, we can always look for
> > better alternatives and swap stuff out fairly easily. Bonus points for
> > being able to swap out only part of the system, since there is a clear
> > distinction between *what* is exchanged (how the data is serialized) and
> > *how* it is exchanged (tcp/mmap), something that, in my opinion, is way
> > too often bundled together.
> >
> > On the other hand, let's assume we had gone with one of these magic
> > libraries from the start. If you then notice it lacks something (let's
> > say it's too slow), and can't find a different library without those
> > faults, you are so screwed. Now you have to re-implement these magic
> > libraries, with all their supported features, without those faults, or
> > otherwise you break a lot of user programs that were built upon them.
> >
> > The current implementation was the safer approach imo. It has its faults,
> > and did provide me with some very nerve-wracking afternoons, but I'd feel
> > really uncomfortable relying on some library that I have no control over
> > for the most performance-impacting component.
> >
> > On 31.07.2015 11:18, Maximilian Michels wrote:
> >
> > > py4j looks really nice and the communication works both ways. There is
> > > also another Python-to-Java communication library called javabridge. I
> > > think it is a pity we chose to implement a proprietary protocol for the
> > > network communication of the Python API. This could have been
> > > abstracted more nicely, and we have already seen that you can run into
> > > problems if you implement that yourself.
> > >
> > > Serializers could be created dynamically if Python passed its
> > > dynamically determined types to Java at runtime. Then everything should
> > > work on the Java side.
> > >
> > > On Fri, Jul 31, 2015 at 11:01 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> > >
> > > > Zeppelin uses py4j [1] to transfer data between a Python process and
> > > > a JVM. That way they can run a Python interpreter and a Java
> > > > interpreter and easily share state between them. Spark also uses py4j
> > > > as a bridge between Java and Python; however, I don't know for what
> > > > exactly. And I also don't know what the performance penalty of py4j
> > > > is. But programming is a lot of fun with it :-)
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > [1] https://www.py4j.org/
> > > >
> > > > On Fri, Jul 31, 2015 at 10:34 AM, Stephan Ewen <se...@apache.org> wrote:
> > > >
> > > > > I think, in short: Spark never worried about types; to it, the data
> > > > > is just something arbitrary.
> > > > >
> > > > > Flink worries about types, for memory management.
> > > > >
> > > > > Aljoscha's suggestion is a good one: have a PythonTypeInfo that is
> > > > > dynamic. Till also found a pretty nice way to connect Python and
> > > > > Java in his Zeppelin-based demo at the meetup.
> > > > >
> > > > > On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schep...@web.de> wrote:
> > > > >
> > > > > > If it's just a single array, how would you define group/sort keys?
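A minimal sketch of the "key blob" idea Aljoscha describes in the mail quoted just below, assuming a deterministic per-field encoding; the helper names and the tag/length encoding are made up for illustration and are not the Python API's actual wire format. All fields go into one value blob, the key fields into a separate key blob, and grouping on the Java side would then reduce to byte-wise comparison of key blobs:

    import struct

    def encode_field(value):
        # Deterministically encode one field: equal values must yield equal bytes.
        # Only ints and strings are handled in this sketch.
        if isinstance(value, int):
            return b'I' + struct.pack('>q', value)              # tag + big-endian 64-bit int
        if isinstance(value, str):
            data = value.encode('utf-8')
            return b'S' + struct.pack('>i', len(data)) + data   # tag + length + UTF-8 bytes
        raise TypeError("unsupported field type: %r" % type(value))

    def to_blobs(record, key_fields):
        # Pack every field into a "value blob" and the key fields into a "key blob";
        # the Java side would only ever see these two byte arrays.
        value_blob = b''.join(encode_field(f) for f in record)
        key_blob = b''.join(encode_field(record[i]) for i in key_fields)
        return key_blob, value_blob

    records = [("b", 2), ("a", 1), ("a", 3)]
    blobs = [to_blobs(r, key_fields=[0]) for r in records]
    # Byte-wise ordering of the key blobs is enough to bring equal keys together,
    # which is all a groupBy() needs.
    blobs.sort(key=lambda kv: kv[0])

Byte-wise ordering like this gives grouping (equal keys end up adjacent) but not necessarily a meaningful sort order (negative numbers, locale-aware strings), which matches the caveat below that a proper sort order would have to be produced on the Python side.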
> > > > > > On 31.07.2015 07:03, Aljoscha Krettek wrote:
> > > > > >
> > > > > > > I think then the Python part would just serialize all the tuple
> > > > > > > fields to a big byte array, and all the key fields to another
> > > > > > > array, so that the Java side can do comparisons on the whole
> > > > > > > "key blob".
> > > > > > >
> > > > > > > Maybe it's overly simplistic, but it might work. :D
> > > > > > >
> > > > > > > On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de> wrote:
> > > > > > >
> > > > > > > > I can see this working for basic types, but am unsure how it
> > > > > > > > would work with Tuples. Wouldn't the Java API still need to
> > > > > > > > know the arity to set up serializers?
> > > > > > > >
> > > > > > > > On 30.07.2015 23:02, Aljoscha Krettek wrote:
> > > > > > > >
> > > > > > > > > I believe it should be possible to create a special
> > > > > > > > > PythonTypeInfo where the Python side is responsible for
> > > > > > > > > serializing data to a byte array; to the Java side it is just
> > > > > > > > > a byte array, and all the comparisons are also performed on
> > > > > > > > > these byte arrays. I think partitioning and sorting should
> > > > > > > > > still work, since the sorting is (in most cases) only used to
> > > > > > > > > group the elements for a groupBy(). If a proper sort order
> > > > > > > > > were required, it would have to be done on the Python side.
> > > > > > > > >
> > > > > > > > > On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de> wrote:
> > > > > > > > >
> > > > > > > > > > To be perfectly honest, I never really managed to work my
> > > > > > > > > > way through Spark's Python API; it's a whole bunch of magic
> > > > > > > > > > to me, and not even the general structure is understandable.
> > > > > > > > > >
> > > > > > > > > > With "pure Python", do you mean doing everything in Python,
> > > > > > > > > > as in just having serialized data on the Java side?
> > > > > > > > > >
> > > > > > > > > > I believe the way to do this with Flink is to add a switch
> > > > > > > > > > that
> > > > > > > > > > a) disables all type checks
> > > > > > > > > > b) creates serializers dynamically at runtime.
> > > > > > > > > >
> > > > > > > > > > a) should be fairly straightforward, b) on the other hand...
> > > > > > > > > >
> > > > > > > > > > Btw, the Python API itself doesn't require the type
> > > > > > > > > > information; it already does the b) part.
> > > > > > > > > >
> > > > > > > > > > On 30.07.2015 22:11, Gyula Fóra wrote:
> > > > > > > > > >
> > > > > > > > > > > That I understand, but could you please tell me how this
> > > > > > > > > > > is done differently in Spark, for instance?
> > > > > > > > > > >
> > > > > > > > > > > What would we need to change to make this work with pure
> > > > > > > > > > > Python (as it seems to be possible)? This probably has
> > > > > > > > > > > large performance implications, though.
> > > > > > > > > > >
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > Chesnay Schepler <c.schep...@web.de> wrote (on Thu, 30 Jul 2015 at 22:04):
> > > > > > > > > > >
> > > > > > > > > > > > Because it still goes through the Java API, which
> > > > > > > > > > > > requires some kind of type information. Imagine a Java
> > > > > > > > > > > > API program where you omit all generic types; it just
> > > > > > > > > > > > wouldn't work as of now.
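To make the "creates serializers dynamically at runtime" part above a bit more concrete: a rough sketch of how the Python side could derive a type description from a sample record at runtime and hand it to the JVM, along the lines of Maximilian's suggestion earlier in the thread. The descriptor format and the idea of sampling a record are assumptions made for illustration, not how the Python API actually does it:

    def describe_type(value):
        # Map a sample value to a simple textual type descriptor.
        # The descriptor names are purely illustrative.
        if isinstance(value, bool):
            return "BOOLEAN"
        if isinstance(value, int):
            return "LONG"
        if isinstance(value, float):
            return "DOUBLE"
        if isinstance(value, str):
            return "STRING"
        if isinstance(value, (bytes, bytearray)):
            return "BYTES"
        if isinstance(value, tuple):
            return "TUPLE(" + ", ".join(describe_type(f) for f in value) + ")"
        raise TypeError("cannot describe %r" % type(value))

    # A plan/sampling phase could run the user function once on a sample record
    # and ship the resulting descriptor string to the Java side, which would
    # then pick or build matching serializers.
    sample = ("word", 1)
    print(describe_type(sample))   # -> TUPLE(STRING, LONG)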
> > > > > > > > > > > > On 30.07.2015 21:17, Gyula Fóra wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Could anyone briefly tell me what exactly the reason
> > > > > > > > > > > > > is that we force users of the Python API to declare
> > > > > > > > > > > > > types for operators?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don't really understand how this works in different
> > > > > > > > > > > > > systems, but I am just curious why Flink has types and
> > > > > > > > > > > > > why Spark doesn't, for instance.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you could give me some pointers to read, that would
> > > > > > > > > > > > > also be fine :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you,
> > > > > > > > > > > > > Gyula
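As an aside on the py4j bridge Till mentions further up in the thread: on the Python side it is only a few lines. A minimal sketch, assuming a py4j GatewayServer has already been started inside the JVM (e.g. by Zeppelin); nothing here is Flink-specific:

    from py4j.java_gateway import JavaGateway

    # Connects to a GatewayServer that must already be running in the JVM.
    gateway = JavaGateway()

    # Arbitrary JVM classes are reachable through gateway.jvm ...
    random = gateway.jvm.java.util.Random()
    print(random.nextInt(100))

    # ... and the JVM side can expose a designated "entry point" object to Python.
    entry_point = gateway.entry_point

The performance question Till raises mostly concerns the data path; py4j is typically used for control calls between a Python driver and the JVM rather than for shipping records.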