Re: Types in the Python API

Aljoscha Krettek Fri, 31 Jul 2015 04:53:50 -0700

Yes, I wouldn't deal with that now, that's orthogonal to the Types issue.

On Fri, 31 Jul 2015 at 12:09 Chesnay Schepler <c.schep...@web.de> wrote:


> I feel like we drifted away from the original topic a bit, but alright.
>
> I don't consider it a pity that we created a proprietary protocol. We know
> exactly how it works and what it is capable of. It is also made exactly
> for our use case, in contrast to general-purpose libraries. If we ever
> decide that the current implementation is lacking, we can always look for
> better alternatives and swap stuff out fairly easily. Bonus points for
> being able to swap out only part of the system, since there is a clear
> distinction between *what* (how the data is serialized) and
> *how* (tcp/mmap) data is exchanged, something that, in my opinion, is way
> too often bundled together.
>
> On the other hand, let's assume we went with one of these magic
> libraries from the start. If you then notice it lacks something (let's
> say it's too slow) and can't find a different library without these
> faults, you are so screwed. Now you have to re-implement these magic
> libraries, with all their supported features but without these faults,
> or otherwise you break a lot of user programs that were built upon them.
>
> The current implementation was a safer approach imo. It has its faults,
> and did provide me with some very nerve-wracking afternoons, but I'd
> feel really uncomfortable relying on some library that I have no control
> over for the most performance-impacting component.
>
> On 31.07.2015 11:18, Maximilian Michels wrote:
> > py4j looks really nice and the communication works in both ways. There is
> > also another Python to Java communication library called javabridge. I
> > think it is a pity we chose to implement a proprietary protocol for the
> > network communication of the Python API. This could have been abstracted
> > more nicely and we have already seen that you can run into problems if
> > you implement that yourself.
> >
> > Serializers could be created dynamically if Python passed its dynamically
> > determined types to Java at runtime. Then everything should work on the
> > Java side.
> >
> > On Fri, Jul 31, 2015 at 11:01 AM, Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> >> Zeppelin uses py4j [1] to transfer data between a Python process and a
> >> JVM. That way they can run a Python interpreter and Java interpreter and
> >> easily share state between them. Spark also uses py4j as a bridge between
> >> Java and Python. However, I don't know for what exactly. And I also don't
> >> know what's the performance penalty of py4j. But programming is a lot of
> >> fun with it :-)
> >>
> >> Cheers,
> >> Till
> >>
> >> [1] https://www.py4j.org/
> >>
> >> On Fri, Jul 31, 2015 at 10:34 AM, Stephan Ewen <se...@apache.org>
> >> wrote:
> >>
> >>> I think in short: Spark never worried about types. It is just something
> >>> arbitrary.
> >>>
> >>> Flink worries about types, for memory management.
> >>>
> >>> Aljoscha's suggestion is a good one: have a PythonTypeInfo that is
> >>> dynamic.
> >>> Till also found a pretty nice way to connect Python and Java in his
> >>> Zeppelin-based demo at the meetup.
> >>>
> >>> On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schep...@web.de>
> >>> wrote:
> >>>
> >>>> If it's just a single array, how would you define group/sort keys?
> >>>>
> >>>>
> >>>> On 31.07.2015 07:03, Aljoscha Krettek wrote:
> >>>>
> >>>>> I think then the Python part would just serialize all the tuple fields
> >>>>> to a big byte array. And all the key fields to another array, so that
> >>>>> the java side can do comparisons on the whole "key blob".
> >>>>>
> >>>>> Maybe it's overly simplistic, but it might work. :D
> >>>>>
> >>>>> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de>
> >>>>> wrote:
> >>>>>
> >>>>>> I can see this working for basic types, but am unsure how it would work
> >>>>>> with Tuples. Wouldn't the java API still need to know the arity to set
> >>>>>> up serializers?
> >>>>>>
> >>>>>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
> >>>>>>
> >>>>>>> I believe it should be possible to create a special PythonTypeInfo where
> >>>>>>> the python side is responsible for serializing data to a byte array and
> >>>>>>> to the java side it is just a byte array and all the comparisons are
> >>>>>>> also performed on these byte arrays. I think partitioning and sort
> >>>>>>> should still work, since the sorting is (in most cases) only used to
> >>>>>>> group the elements for a groupBy(). If proper sort order would be
> >>>>>>> required this would have to be done on the python side.
> >>>>>>>
> >>>>>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> To be perfectly honest, I never really managed to work my way through
> >>>>>>>> Spark's python API, it's a whole bunch of magic to me; not even the
> >>>>>>>> general structure is understandable.
> >>>>>>>>
> >>>>>>>> With "pure python" do you mean doing everything in python? As in just
> >>>>>>>> having serialized data on the java side?
> >>>>>>>>
> >>>>>>>> I believe the way to do this with Flink is to add a switch that
> >>>>>>>> a) disables all type checks
> >>>>>>>> b) creates serializers dynamically at runtime.
> >>>>>>>>
> >>>>>>>> a) should be fairly straightforward, b) on the other hand....
> >>>>>>>>
> >>>>>>>> btw., the Python API itself doesn't require the type information, it
> >>>>>>>> already does the b part.
> >>>>>>>>
> >>>>>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
> >>>>>>>>
> >>>>>>>>> That I understand, but could you please tell me how this is done
> >>>>>>>>> differently in Spark for instance?
> >>>>>>>>>
> >>>>>>>>> What would we need to change to make this work with pure python (as it
> >>>>>>>>> seems to be possible)? This probably has large performance
> >>>>>>>>> implications though.
> >>>>>>>>>
> >>>>>>>>> Gyula
> >>>>>>>>>
> >>>>>>>>> Chesnay Schepler <c.schep...@web.de> wrote (on Thu, 30 Jul 2015,
> >>>>>>>>> 22:04):
> >>>>>>>>>
> >>>>>>>>>> Because it still goes through the Java API, which requires some kind of
> >>>>>>>>>> type information. Imagine a java api program where you omit all generic
> >>>>>>>>>> types, it just wouldn't work as of now.
> >>>>>>>>>>
> >>>>>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey!
> >>>>>>>>>>>
> >>>>>>>>>>> Could anyone briefly tell me what exactly is the reason why we force
> >>>>>>>>>>> the users in the Python API to declare types for operators?
> >>>>>>>>>>>
> >>>>>>>>>>> I don't really understand how this works in different systems but I am
> >>>>>>>>>>> just curious why Flink has types and why Spark doesn't for instance.
> >>>>>>>>>>>
> >>>>>>>>>>> If you give me some pointers to read that would also be fine :)
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you,
> >>>>>>>>>>> Gyula
> >>>>>>>>>>>
> >>>>>>>>>>>
>
>
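
The "key blob" idea from the quoted thread (the Python side serializes the
whole record to one byte array and the key fields to another, and the other
side only ever compares bytes) could look roughly like the sketch below. It
is only an illustration under assumptions: to_blobs and group_by_key_blob are
hypothetical names, JSON stands in for whatever wire format would really be
used, and the grouping function is a pure-Python stand-in for the Java side.
The one real requirement it tries to show is that the key encoding must be
canonical, i.e. equal keys must always produce identical bytes.

```python
import json
from collections import defaultdict


def to_blobs(record, key_fields):
    """Split a tuple into (key_blob, value_blob), both opaque byte strings.

    Hypothetical helper: JSON is used only because it encodes equal
    ints/strings to identical bytes, which byte-wise grouping relies on.
    """
    key = [record[i] for i in key_fields]
    key_blob = json.dumps(key, separators=(",", ":")).encode("utf-8")
    value_blob = json.dumps(list(record), separators=(",", ":")).encode("utf-8")
    return key_blob, value_blob


def group_by_key_blob(pairs):
    """Stand-in for the JVM side: it never decodes, it only compares bytes."""
    groups = defaultdict(list)
    for key_blob, value_blob in pairs:
        groups[key_blob].append(value_blob)
    return groups


records = [("a", 1), ("b", 2), ("a", 3)]
groups = group_by_key_blob(to_blobs(r, [0]) for r in records)
```

Byte equality is enough to bring equal keys together, but byte order is not
value order: with this encoding the blob for key 10 sorts before the blob
for key 9. That matches the remark in the thread that a proper sort order
would have to be produced on the Python side.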

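The other suggestion in the thread, having Python pass its dynamically
determined types to Java at runtime so that serializers can be created there,
also touches the arity question raised for Tuples. A minimal sketch of the
Python half, assuming a made-up descriptor format: describe_type and strings
such as "Tuple2<Long, String>" are hypothetical and are not Flink's actual
type information API.

```python
def describe_type(value):
    """Return a hypothetical type descriptor string for a Python value.

    The descriptor names are invented for illustration; a real bridge would
    need a format agreed on with the Java side.
    """
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Long"
    if isinstance(value, float):
        return "Double"
    if isinstance(value, str):
        return "String"
    if isinstance(value, bytes):
        return "Bytes"
    if isinstance(value, tuple):
        # The arity is part of the descriptor, so the receiver can pick a
        # serializer with the right number of fields.
        inner = ", ".join(describe_type(v) for v in value)
        return "Tuple{}<{}>".format(len(value), inner)
    raise TypeError("no descriptor for {!r}".format(type(value)))


descriptor = describe_type((1, "a", 2.5))
```

The descriptor can only be derived once a first record exists, which is why
this approach implies creating serializers at runtime rather than when the
plan is built.
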