Zeppelin uses py4j [1] to transfer data between a Python process and a JVM. That way it can run a Python interpreter and a Java interpreter side by side and easily share state between them. Spark also uses py4j as a bridge between Java and Python. However, I don't know what exactly Spark uses it for, and I also don't know how large the performance penalty of py4j is. But programming with it is a lot of fun :-)
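
To make the byte-array idea from the thread below a bit more concrete, here is a minimal Python sketch. It is not Flink or Zeppelin code: `to_blobs` and `group_by_key_blob` are hypothetical names, JSON is only a stand-in serializer, and the "Java side" is simulated in Python by a function that never looks inside the blobs and only compares raw bytes.

```python
import json
from itertools import groupby


def to_blobs(record, key_fields):
    """Python side: serialize the key fields and the whole tuple separately."""
    key = [record[i] for i in key_fields]
    # Compact JSON gives identical bytes for equal keys of basic types.
    key_blob = json.dumps(key, separators=(",", ":")).encode("utf-8")
    value_blob = json.dumps(list(record), separators=(",", ":")).encode("utf-8")
    return key_blob, value_blob


def group_by_key_blob(pairs):
    """Stand-in for the Java side: sort and group on the opaque key bytes only."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        (key_blob, [value for _, value in group])
        for key_blob, group in groupby(ordered, key=lambda kv: kv[0])
    ]


records = [("a", 10, 1.5), ("b", 9, 2.0), ("a", 10, 3.0)]
pairs = [to_blobs(r, key_fields=(0, 1)) for r in records]
groups = group_by_key_blob(pairs)

# Back on the Python side, the value blobs are decoded into tuples again.
decoded = {k: [tuple(json.loads(v)) for v in vs] for k, vs in groups}
```

Grouping works because equal keys produce equal bytes. The byte order of the blobs is not the natural order of the keys, though, so a proper sort order would still have to be established on the Python side, as Aljoscha points out.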

Cheers,
Till

[1] https://www.py4j.org/

On Fri, Jul 31, 2015 at 10:34 AM, Stephan Ewen <se...@apache.org> wrote:

> I think in short: Spark never worried about types. It is just something
> arbitrary.
>
> Flink worries about types, for memory management.
>
> Aljoscha's suggestion is a good one: have a PythonTypeInfo that is dynamic.
>
> Till also found a pretty nice way to connect Python and Java in his
> Zeppelin-based demo at the meetup.
>
> On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schep...@web.de>
> wrote:
>
> > if it's just a single array, how would you define group/sort keys?
> >
> > On 31.07.2015 07:03, Aljoscha Krettek wrote:
> >
> >> I think then the Python part would just serialize all the tuple
> >> fields to a big byte array. And all the key fields to another array,
> >> so that the java side can do comparisons on the whole "key blob".
> >>
> >> Maybe it's overly simplistic, but it might work. :D
> >>
> >> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de>
> >> wrote:
> >>
> >>> I can see this working for basic types, but am unsure how it would
> >>> work with Tuples. Wouldn't the java API still need to know the
> >>> arity to set up serializers?
> >>>
> >>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
> >>>
> >>>> I believe it should be possible to create a special PythonTypeInfo
> >>>> where the python side is responsible for serializing data to a
> >>>> byte array and to the java side it is just a byte array and all
> >>>> the comparisons are also performed on these byte arrays. I think
> >>>> partitioning and sort should still work, since the sorting is (in
> >>>> most cases) only used to group the elements for a groupBy(). If
> >>>> proper sort order would be required this would have to be done on
> >>>> the python side.
> >>>>
> >>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de>
> >>>> wrote:
> >>>>
> >>>>> To be perfectly honest I never really managed to work my way
> >>>>> through Spark's python API, it's a whole bunch of magic to me;
> >>>>> not even the general structure is understandable.
> >>>>>
> >>>>> With "pure python" do you mean doing everything in python? as in
> >>>>> just having serialized data on the java side?
> >>>>>
> >>>>> I believe the way to do this with Flink is to add a switch that
> >>>>> a) disables all type checks
> >>>>> b) creates serializers dynamically at runtime.
> >>>>>
> >>>>> a) should be fairly straightforward, b) on the other hand....
> >>>>>
> >>>>> btw., the Python API itself doesn't require the type information,
> >>>>> it already does the b part.
> >>>>>
> >>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
> >>>>>
> >>>>>> That I understand, but could you please tell me how is this done
> >>>>>> differently in Spark for instance?
> >>>>>>
> >>>>>> What would we need to change to make this work with pure python
> >>>>>> (as it seems to be possible)? This probably has large
> >>>>>> performance implications though.
> >>>>>>
> >>>>>> Gyula
> >>>>>>
> >>>>>> Chesnay Schepler <c.schep...@web.de> wrote (on Thu, 30 Jul 2015,
> >>>>>> 22:04):
> >>>>>>
> >>>>>>> because it still goes through the Java API that requires some
> >>>>>>> kind of type information. imagine a java api program where you
> >>>>>>> omit all generic types, it just wouldn't work as of now.
> >>>>>>>
> >>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
> >>>>>>>
> >>>>>>>> Hey!
> >>>>>>>>
> >>>>>>>> Could anyone briefly tell me what exactly is the reason why we
> >>>>>>>> force the users in the Python API to declare types for
> >>>>>>>> operators?
> >>>>>>>>
> >>>>>>>> I don't really understand how this works in different systems
> >>>>>>>> but I am just curious why Flink has types and why Spark
> >>>>>>>> doesn't for instance.
> >>>>>>>>
> >>>>>>>> If you give me some pointers to read that would also be fine :)
> >>>>>>>>
> >>>>>>>> Thank you,
> >>>>>>>> Gyula