I don't know yet. :D Maybe the sorting will have to be delegated to Python. I don't think it's possible to always get a meaningful order when sorting only on the serialized bytes; it should, however, work for grouping.
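To illustrate the grouping-vs-sorting point, a minimal sketch in plain Java (the serialize() helper is a made-up stand-in for whatever bytes the Python side would ship, not actual Flink code): equal values always serialize to equal bytes, so grouping on the blob is safe, but the byte-wise order of the blobs need not match the order of the values.

import java.nio.ByteBuffer;
import java.util.Arrays;

public class KeyBlobDemo {

    // Hypothetical stand-in for the bytes the Python side would produce;
    // here: big-endian two's-complement ints.
    static byte[] serialize(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        // Grouping only needs equality on the serialized bytes, and
        // equal values yield equal blobs:
        System.out.println(
            Arrays.equals(serialize(42), serialize(42))); // true

        // But byte-wise order is not value order: -1 serializes to
        // 0xFFFFFFFF and compares greater than 1 (0x00000001), so a
        // meaningful sort would have to happen on the Python side.
        System.out.println(
            Arrays.compareUnsigned(serialize(-1), serialize(1)) > 0); // true
    }
}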
On Fri, 31 Jul 2015 at 10:31 Chesnay Schepler <c.schep...@web.de> wrote:

> If it's just a single array, how would you define group/sort keys?
>
> On 31.07.2015 07:03, Aljoscha Krettek wrote:
> > I think then the Python part would just serialize all the tuple fields
> > to a big byte array, and all the key fields to another array, so that
> > the java side can do comparisons on the whole "key blob".
> >
> > Maybe it's overly simplistic, but it might work. :D
> >
> > On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de> wrote:
> >
> >> I can see this working for basic types, but am unsure how it would work
> >> with Tuples. Wouldn't the java API still need to know the arity to set
> >> up serializers?
> >>
> >> On 30.07.2015 23:02, Aljoscha Krettek wrote:
> >>> I believe it should be possible to create a special PythonTypeInfo
> >>> where the python side is responsible for serializing data to a byte
> >>> array; to the java side it is just a byte array, and all the
> >>> comparisons are also performed on these byte arrays. I think
> >>> partitioning and sort should still work, since the sorting is (in most
> >>> cases) only used to group the elements for a groupBy(). If a proper
> >>> sort order were required, it would have to be established on the
> >>> python side.
> >>>
> >>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de> wrote:
> >>>
> >>>> To be perfectly honest, I never really managed to work my way through
> >>>> Spark's python API; it's a whole bunch of magic to me, and not even
> >>>> the general structure is understandable.
> >>>>
> >>>> With "pure python", do you mean doing everything in python, as in
> >>>> just having serialized data on the java side?
> >>>>
> >>>> I believe the way to do this with Flink is to add a switch that
> >>>> a) disables all type checks
> >>>> b) creates serializers dynamically at runtime.
> >>>>
> >>>> a) should be fairly straightforward; b), on the other hand....
> >>>>
> >>>> Btw., the Python API itself doesn't require the type information; it
> >>>> already does the b) part.
> >>>>
> >>>> On 30.07.2015 22:11, Gyula Fóra wrote:
> >>>>> That I understand, but could you please tell me how this is done
> >>>>> differently in Spark, for instance?
> >>>>>
> >>>>> What would we need to change to make this work with pure python (as
> >>>>> it seems to be possible)? This probably has large performance
> >>>>> implications, though.
> >>>>>
> >>>>> Gyula
> >>>>>
> >>>>> On Thu, 30 Jul 2015 at 22:04, Chesnay Schepler <c.schep...@web.de> wrote:
> >>>>>
> >>>>>> Because it still goes through the Java API, which requires some
> >>>>>> kind of type information. Imagine a java api program where you omit
> >>>>>> all generic types; it just wouldn't work as of now.
> >>>>>>
> >>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
> >>>>>>> Hey!
> >>>>>>>
> >>>>>>> Could anyone briefly tell me what exactly is the reason why we
> >>>>>>> force the users in the Python API to declare types for operators?
> >>>>>>>
> >>>>>>> I don't really understand how this works in different systems, but
> >>>>>>> I am just curious why Flink has types and why Spark doesn't, for
> >>>>>>> instance.
> >>>>>>>
> >>>>>>> If you give me some pointers to read, that would also be fine :)
> >>>>>>>
> >>>>>>> Thank you,
> >>>>>>> Gyula
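For what it's worth, a rough sketch of what the "opaque blob" record discussed above could look like on the Java side; all names here are made up for illustration, not the actual Flink PythonTypeInfo:

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical wrapper: the Python side serializes every record to two
// blobs, and the Java side only ever touches the raw bytes.
public class PythonRecord implements Serializable {

    final byte[] payload; // all tuple fields, serialized by Python
    final byte[] keyBlob; // just the key fields, serialized by Python

    PythonRecord(byte[] payload, byte[] keyBlob) {
        this.payload = payload;
        this.keyBlob = keyBlob;
    }

    // Hashing/partitioning and grouping can work purely on the key blob...
    int keyHash() {
        return Arrays.hashCode(keyBlob);
    }

    // ...and so can the sort used internally by groupBy(): it only has to
    // bring equal keys next to each other, not produce a meaningful order.
    static final Comparator<PythonRecord> KEY_BLOB_ORDER =
            (l, r) -> Arrays.compareUnsigned(l.keyBlob, r.keyBlob);
}

The point being that the Java side would never need TypeInformation for the individual fields, only a serializer and comparator for opaque byte arrays; the arity question would then live entirely on the Python side.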