Zeppelin uses py4j [1] to transfer data between a Python process and a JVM.
That way it can run a Python interpreter and a Java interpreter and easily
share state between them. Spark also uses py4j as a bridge between Java and
Python, though I don't know exactly for what, and I also don't know what
the performance penalty of py4j is. But programming with it is a lot of
fun :-)

Cheers,
Till

[1] https://www.py4j.org/
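As a rough pure-Python sketch of the "key blob" idea discussed in the thread
below (the helper names are made up for illustration; this is not Flink or
Spark code): the Python side serializes whole records and their key fields to
opaque byte arrays, so the other side can partition and group on raw bytes
without knowing arity or field types.

```python
import pickle
from itertools import groupby

def to_blobs(record, key_indices):
    """Serialize a tuple into (key_blob, value_blob).

    The value blob carries all fields; the key blob carries only the key
    fields, so grouping can compare opaque bytes without knowing the
    tuple's arity or field types. pickle is deterministic here because
    equal simple tuples pickle identically within one process.
    """
    key = tuple(record[i] for i in key_indices)
    return pickle.dumps(key), pickle.dumps(record)

def group_by_key_blob(records, key_indices):
    """Group records by comparing key blobs only.

    Note: byte-wise ordering of pickled keys is NOT a meaningful sort
    order; as the thread points out, it is only good enough to bring
    equal keys together for a groupBy-style operation.
    """
    blobs = [to_blobs(r, key_indices) for r in records]
    blobs.sort(key=lambda kv: kv[0])  # byte-wise sort clusters equal keys
    return [
        [pickle.loads(v) for _, v in grp]
        for _, grp in groupby(blobs, key=lambda kv: kv[0])
    ]

records = [("a", 1), ("b", 2), ("a", 3)]
groups = group_by_key_blob(records, key_indices=(0,))
```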

On Fri, Jul 31, 2015 at 10:34 AM, Stephan Ewen <se...@apache.org> wrote:

> I think in short: Spark never worried about types. It is just something
> arbitrary.
>
> Flink worries about types, for memory management.
>
> Aljoscha's suggestion is a good one: have a PythonTypeInfo that is dynamic.
>
> Till also found a pretty nice way to connect Python and Java in his
> Zeppelin-based demo at the meetup.
>
> On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schep...@web.de>
> wrote:
>
> > if it's just a single array, how would you define group/sort keys?
> >
> >
> > On 31.07.2015 07:03, Aljoscha Krettek wrote:
> >
> >> I think then the Python part would just serialize all the tuple fields
> >> to a big byte array. And all the key fields to another array, so that
> >> the Java side can do comparisons on the whole "key blob".
> >>
> >> Maybe it's overly simplistic, but it might work. :D
> >>
> >> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de>
> >> wrote:
> >>
> >>> I can see this working for basic types, but am unsure how it would
> >>> work with Tuples. Wouldn't the Java API still need to know the arity
> >>> to set up serializers?
> >>>
> >>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
> >>>
> >>>> I believe it should be possible to create a special PythonTypeInfo
> >>>> where the Python side is responsible for serializing data to a byte
> >>>> array, and to the Java side it is just a byte array; all the
> >>>> comparisons are also performed on these byte arrays. I think
> >>>> partitioning and sorting should still work, since sorting is (in most
> >>>> cases) only used to group the elements for a groupBy(). If a proper
> >>>> sort order were required, it would have to be done on the Python side.
> >>>>
> >>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de>
> >>>> wrote:
> >>>>
> >>>>> To be perfectly honest, I never really managed to work my way
> >>>>> through Spark's Python API; it's a whole bunch of magic to me, and
> >>>>> not even the general structure is understandable.
> >>>>>
> >>>>> With "pure Python" do you mean doing everything in Python? As in
> >>>>> just having serialized data on the Java side?
> >>>>>
> >>>>> I believe the way to do this with Flink is to add a switch that
> >>>>> a) disables all type checks
> >>>>> b) creates serializers dynamically at runtime.
> >>>>>
> >>>>> a) should be fairly straightforward, b) on the other hand....
> >>>>>
> >>>>> Btw., the Python API itself doesn't require the type information; it
> >>>>> already does the b) part.
> >>>>>
> >>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
> >>>>>
> >>>>>> That I understand, but could you please tell me how this is done
> >>>>>> differently in Spark, for instance?
> >>>>>>
> >>>>>> What would we need to change to make this work with pure Python (as
> >>>>>> it seems to be possible)? This probably has large performance
> >>>>>> implications though.
> >>>>>>
> >>>>>> Gyula
> >>>>>>
> >>>>>> Chesnay Schepler <c.schep...@web.de> wrote (on Thu, 30 Jul 2015 at
> >>>>>> 22:04):
> >>>>>>
> >>>>>>> Because it still goes through the Java API, which requires some
> >>>>>>> kind of type information. Imagine a Java API program where you
> >>>>>>> omit all generic types; it just wouldn't work as of now.
> >>>>>>>
> >>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
> >>>>>>>
> >>>>>>>> Hey!
> >>>>>>>>
> >>>>>>>> Could anyone briefly tell me what exactly the reason is that we
> >>>>>>>> force users of the Python API to declare types for operators?
> >>>>>>>>
> >>>>>>>> I don't really understand how this works in different systems,
> >>>>>>>> but I am just curious why Flink has types and why Spark, for
> >>>>>>>> instance, doesn't.
> >>>>>>>>
> >>>>>>>> If you could give me some pointers to read, that would also be
> >>>>>>>> fine :)
> >>>>>>>>
> >>>>>>>> Thank you,
> >>>>>>>> Gyula
> >>>>>>>>
> >>>>>>>>
> >>>
> >
>
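
As a rough pure-Python sketch of the "creates serializers dynamically at
runtime" point from the thread above (the type-to-format mapping and function
names are illustrative assumptions, not actual Flink internals): the
serialization format is derived from the first record seen at runtime instead
of being declared up front.

```python
import struct

# Map Python field types to fixed-width struct codes; illustrative only.
_CODES = {int: "q", float: "d", bool: "?"}

def make_serializer(sample):
    """Build a (serialize, deserialize) pair from one sample tuple.

    Instead of requiring the user to declare types, the struct format is
    derived at runtime by inspecting the fields of the first record.
    """
    fmt = ">" + "".join(_CODES[type(f)] for f in sample)
    def serialize(record):
        return struct.pack(fmt, *record)
    def deserialize(blob):
        return struct.unpack(fmt, blob)
    return serialize, deserialize

# Derive the format from a sample record, then round-trip another record.
ser, de = make_serializer((1, 2.5, True))
blob = ser((7, 3.14, False))
```

This only covers fixed-width primitive fields; variable-length types such as
strings would need length-prefixed encoding on top of the same idea.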
