I think, in short: Spark never worried about types; to Spark, the data is just arbitrary objects.
Flink worries about types for the sake of memory management. Aljoscha's suggestion is a good one:
have a PythonTypeInfo that is dynamic. Till also found a pretty nice way to connect Python and
Java in his Zeppelin-based demo at the meetup.
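Roughly, what "dynamic" could look like (a sketch only; the class and method names below are
invented for illustration and are not Flink's actual TypeInformation/TypeSerializer interfaces):
the Java side would only ever move two opaque, length-prefixed byte blobs per record, a key blob
and a payload blob, both produced by the Python side.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch only: the Java side never interprets the data, it just moves
// length-prefixed byte blobs that the Python process already serialized.
// (These class names are invented for the example; this is not the actual
// Flink API.)
public final class PythonBlobFormat {

    /** A record as the Java side would see it: an opaque key blob plus an
     *  opaque payload blob, both serialized by the Python side. */
    public static final class PythonRecord {
        public final byte[] keyBlob;     // only hashed/compared, never decoded
        public final byte[] payloadBlob; // never inspected on the Java side

        public PythonRecord(byte[] keyBlob, byte[] payloadBlob) {
            this.keyBlob = keyBlob;
            this.payloadBlob = payloadBlob;
        }
    }

    /** Write two length-prefixed blobs; no type information is needed,
     *  so arity and field types can stay unknown to the Java side. */
    public static void serialize(PythonRecord record, DataOutputStream out)
            throws IOException {
        out.writeInt(record.keyBlob.length);
        out.write(record.keyBlob);
        out.writeInt(record.payloadBlob.length);
        out.write(record.payloadBlob);
    }

    /** Read back, symmetric to serialize(). */
    public static PythonRecord deserialize(DataInputStream in) throws IOException {
        byte[] key = new byte[in.readInt()];
        in.readFully(key);
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return new PythonRecord(key, payload);
    }
}

This would also answer the arity question raised below: the Java side would not need the arity at
all, because a tuple is just however many fields Python packed into the payload blob.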
On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schep...@web.de> wrote:

> If it's just a single array, how would you define group/sort keys?
>
> On 31.07.2015 07:03, Aljoscha Krettek wrote:
>
>> I think then the Python part would just serialize all the tuple fields
>> to a big byte array, and all the key fields to another array, so that
>> the Java side can do comparisons on the whole "key blob".
>>
>> Maybe it's overly simplistic, but it might work. :D
>>
>> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de> wrote:
>>
>>> I can see this working for basic types, but am unsure how it would work
>>> with Tuples. Wouldn't the Java API still need to know the arity to set
>>> up serializers?
>>>
>>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
>>>
>>>> I believe it should be possible to create a special PythonTypeInfo
>>>> where the Python side is responsible for serializing data to a byte
>>>> array; to the Java side it is just a byte array, and all comparisons
>>>> are also performed on these byte arrays. I think partitioning and sort
>>>> should still work, since the sorting is (in most cases) only used to
>>>> group the elements for a groupBy(). If a proper sort order were
>>>> required, it would have to be done on the Python side.
>>>>
>>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de> wrote:
>>>>
>>>>> To be perfectly honest, I never really managed to work my way through
>>>>> Spark's Python API; it's a whole bunch of magic to me, and not even
>>>>> the general structure is understandable.
>>>>>
>>>>> With "pure Python" do you mean doing everything in Python, as in just
>>>>> having serialized data on the Java side?
>>>>>
>>>>> I believe the way to do this with Flink is to add a switch that
>>>>> a) disables all type checks
>>>>> b) creates serializers dynamically at runtime.
>>>>>
>>>>> a) should be fairly straightforward, b) on the other hand....
>>>>>
>>>>> Btw., the Python API itself doesn't require the type information; it
>>>>> already does the b) part.
>>>>>
>>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
>>>>>
>>>>>> That I understand, but could you please tell me how this is done
>>>>>> differently in Spark, for instance?
>>>>>>
>>>>>> What would we need to change to make this work with pure Python (as
>>>>>> it seems to be possible)? This probably has large performance
>>>>>> implications, though.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> Chesnay Schepler <c.schep...@web.de> wrote (on 30 Jul 2015, Thu, 22:04):
>>>>>>
>>>>>>> Because it still goes through the Java API, which requires some kind
>>>>>>> of type information. Imagine a Java API program where you omit all
>>>>>>> generic types; it just wouldn't work as of now.
>>>>>>>
>>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
>>>>>>>
>>>>>>>> Hey!
>>>>>>>>
>>>>>>>> Could anyone briefly tell me what exactly is the reason why we
>>>>>>>> force the users in the Python API to declare types for operators?
>>>>>>>>
>>>>>>>> I don't really understand how this works in different systems, but
>>>>>>>> I am just curious why Flink has types and why Spark doesn't, for
>>>>>>>> instance.
>>>>>>>>
>>>>>>>> If you give me some pointers to read, that would also be fine :)
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Gyula
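
To make Aljoscha's "comparisons on the key blob" point above concrete, here is a minimal sketch
(again with invented names, and assuming the Python side serializes keys deterministically, so
that equal keys always produce identical bytes):

import java.util.Arrays;

public final class KeyBlobComparator {

    /** Lexicographic comparison on unsigned bytes. Equal keys serialize to
     *  identical blobs, so equal keys become adjacent after sorting, which
     *  is all a groupBy() needs. The resulting order is NOT a meaningful
     *  sort order for the original values (think negative ints or UTF-8
     *  strings); a proper sort would have to happen on the Python side. */
    public static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    /** Hash for partitioning: identical blobs land on the same partition. */
    public static int hash(byte[] keyBlob) {
        return Arrays.hashCode(keyBlob);
    }
}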