Yes, I wouldn't deal with that now; that's orthogonal to the Types issue.

On Fri, 31 Jul 2015 at 12:09 Chesnay Schepler <c.schep...@web.de> wrote:
> I feel like we drifted away from the original topic a bit, but alright.
>
> I don't consider it a pity that we created a proprietary protocol. we know
> exactly how it works and what it is capable of. It is also made exactly
> for our use case, in contrast to general-purpose libraries. If we ever
> decide that the current implementation is lacking, we can always look for
> better alternatives and swap stuff out fairly easily. bonus points for
> being able to swap out only part of the system, since there is a clear
> distinction between *what* (how the data is serialized) and
> *how* (tcp/mmap) data is exchanged, something that, in my opinion, is way
> too often bundled together.
>
> on the other hand, let's assume we went from the start with one of these
> magic libraries. if you then notice it lacks something (let's say it's too
> slow), and can't find a different library without these faults, you are
> so screwed. now you have to re-implement these magic libraries, with all
> their supported features, without these faults, or otherwise you break
> a lot of user programs that built upon these.
>
> The current implementation was a safer approach imo. It has its faults,
> and did provide me with some very nerve-wracking afternoons, but I'd
> feel really uncomfortable relying on some library that i have no control
> over for the most performance-impacting component.
>
> On 31.07.2015 11:18, Maximilian Michels wrote:
> > py4j looks really nice and the communication works in both ways. There is
> > also another Python to Java communication library called javabridge. I
> > think it is a pity we chose to implement a proprietary protocol for the
> > network communication of the Python API. This could have been abstracted
> > more nicely and we have already seen that you can run into problems if you
> > implement that yourself.
> >
> > Serializers could be created dynamically if Python passed its dynamically
> > determined types to Java at runtime.
> > Then everything should work on the Java side.
> >
> > On Fri, Jul 31, 2015 at 11:01 AM, Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> >> Zeppelin uses py4j [1] to transfer data between a Python process and a JVM.
> >> That way they can run a Python interpreter and Java interpreter and easily
> >> share state between them. Spark also uses py4j as a bridge between Java and
> >> Python. However, I don't know for what exactly. And I also don't know
> >> what's the performance penalty of py4j. But programming is a lot of fun
> >> with it :-)
> >>
> >> Cheers,
> >> Till
> >>
> >> [1] https://www.py4j.org/
> >>
> >> On Fri, Jul 31, 2015 at 10:34 AM, Stephan Ewen <se...@apache.org> wrote:
> >>
> >>> I think in short: Spark never worried about types. It is just something
> >>> arbitrary.
> >>>
> >>> Flink worries about types, for memory management.
> >>>
> >>> Aljoscha's suggestion is a good one: have a PythonTypeInfo that is dynamic.
> >>>
> >>> Till also found a pretty nice way to connect Python and Java in his
> >>> Zeppelin-based demo at the meetup.
> >>>
> >>> On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schep...@web.de>
> >>> wrote:
> >>>
> >>>> if it's just a single array, how would you define group/sort keys?
> >>>>
> >>>> On 31.07.2015 07:03, Aljoscha Krettek wrote:
> >>>>
> >>>>> I think then the Python part would just serialize all the tuple fields to
> >>>>> a big byte array. And all the key fields to another array, so that the java
> >>>>> side can do comparisons on the whole "key blob".
> >>>>>
> >>>>> Maybe it's overly simplistic, but it might work. :D
> >>>>>
> >>>>> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de> wrote:
> >>>>>
> >>>>>> I can see this working for basic types, but am unsure how it would work
> >>>>>> with Tuples. Wouldn't the java API still need to know the arity to set up
> >>>>>> serializers?
> >>>>>>
> >>>>>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
> >>>>>>
> >>>>>>> I believe it should be possible to create a special PythonTypeInfo where
> >>>>>>> the python side is responsible for serializing data to a byte array, and to
> >>>>>>> the java side it is just a byte array and all the comparisons are also
> >>>>>>> performed on these byte arrays. I think partitioning and sort should still
> >>>>>>> work, since the sorting is (in most cases) only used to group the elements
> >>>>>>> for a groupBy(). If proper sort order were required, this would have to
> >>>>>>> be done on the python side.
> >>>>>>>
> >>>>>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de> wrote:
> >>>>>>>
> >>>>>>>> To be perfectly honest i never really managed to work my way through
> >>>>>>>> Spark's python API, it's a whole bunch of magic to me; not even the
> >>>>>>>> general structure is understandable.
> >>>>>>>>
> >>>>>>>> With "pure python" do you mean doing everything in python? as in just
> >>>>>>>> having serialized data on the java side?
> >>>>>>>>
> >>>>>>>> I believe the way to do this with Flink is to add a switch that
> >>>>>>>> a) disables all type checks
> >>>>>>>> b) creates serializers dynamically at runtime.
> >>>>>>>>
> >>>>>>>> a) should be fairly straightforward, b) on the other hand....
> >>>>>>>>
> >>>>>>>> btw., the Python API itself doesn't require the type information, it
> >>>>>>>> already does the b part.
> >>>>>>>>
> >>>>>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
> >>>>>>>>
> >>>>>>>>> That I understand, but could you please tell me how this is done
> >>>>>>>>> differently in Spark for instance?
> >>>>>>>>>
> >>>>>>>>> What would we need to change to make this work with pure python (as it
> >>>>>>>>> seems to be possible)?
> >>>>>>>>> This probably has large performance implications though.
> >>>>>>>>>
> >>>>>>>>> Gyula
> >>>>>>>>>
> >>>>>>>>> Chesnay Schepler <c.schep...@web.de> wrote (on Thu, 30 Jul 2015, 22:04):
> >>>>>>>>>
> >>>>>>>>>> because it still goes through the Java API that requires some kind of
> >>>>>>>>>> type information. imagine a java api program where you omit all generic
> >>>>>>>>>> types, it just wouldn't work as of now.
> >>>>>>>>>>
> >>>>>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey!
> >>>>>>>>>>>
> >>>>>>>>>>> Could anyone briefly tell me what exactly is the reason why we force the
> >>>>>>>>>>> users in the Python API to declare types for operators?
> >>>>>>>>>>>
> >>>>>>>>>>> I don't really understand how this works in different systems but I am just
> >>>>>>>>>>> curious why Flink has types and why Spark doesn't for instance.
> >>>>>>>>>>>
> >>>>>>>>>>> If you give me some pointers to read that would also be fine :)
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you,
> >>>>>>>>>>> Gyula
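
For readers following the "key blob" idea from Aljoscha's messages in this thread, here is a rough Python sketch of what the Python side might do. It is only an illustration under stated assumptions: `encode_field` and `to_blobs` are hypothetical names, the tag/length encoding is made up for the example, and none of this is taken from Flink's actual Python API or its wire protocol.

```python
import struct


def encode_field(value):
    """Encode one field as a type tag, a 4-byte length, and a payload.

    Hypothetical encoding for illustration only. Equal values of the same
    type always yield identical bytes, which is what grouping needs.
    """
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        tag, payload = b"b", struct.pack(">?", value)
    elif isinstance(value, int):
        tag, payload = b"i", struct.pack(">q", value)  # 64-bit signed only
    elif isinstance(value, float):
        tag, payload = b"f", struct.pack(">d", value)
    elif isinstance(value, str):
        tag, payload = b"s", value.encode("utf-8")
    else:
        raise TypeError("unsupported field type: %r" % type(value))
    return tag + struct.pack(">I", len(payload)) + payload


def to_blobs(record, key_positions):
    """Return (key_blob, value_blob) for one tuple.

    key_blob holds only the key fields, so the receiving side can compare
    keys as opaque bytes; value_blob holds every field of the record.
    """
    key_blob = b"".join(encode_field(record[i]) for i in key_positions)
    value_blob = b"".join(encode_field(field) for field in record)
    return key_blob, value_blob


# Grouping by the opaque key blob, the way a byte-comparing receiver could.
records = [("a", 1, 2.0), ("b", 1, 1.0), ("a", 1, 3.5)]
groups = {}
for rec in records:
    key_blob, value_blob = to_blobs(rec, key_positions=(0, 1))
    groups.setdefault(key_blob, []).append(value_blob)
```

Byte equality is enough for a groupBy(), and the arity never has to be declared because each blob is self-describing. Byte order is not the natural order, though: with this encoding the bytes for -1 compare greater than the bytes for 1, which matches the caveat in the thread that a proper sort order would have to be produced on the Python side.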