In short, I think: Spark never worried about types; to Spark, the data is just
something arbitrary.

Flink worries about types, for memory management.

Aljoscha's suggestion is a good one: have a PythonTypeInfo that is dynamic.

Till also found a pretty nice way to connect Python and Java in his
Zeppelin-based demo at the meetup.
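
To make the "dynamic PythonTypeInfo" idea concrete, here is a rough,
self-contained Java sketch of how the Java side could treat a Python record as
a completely opaque pair of byte blobs. The class and method names are made up
for illustration; this is not the actual Flink TypeInformation API.

import java.util.Arrays;

/**
 * Illustrative only: an opaque record as the Java side might see it if the
 * Python side did all of the (de)serialization. Equality and hashing use only
 * the key blob, so grouping works without any knowledge of the Python types.
 */
public final class OpaquePythonRecord {
    private final byte[] keyBlob;     // key fields, pre-serialized by Python
    private final byte[] payloadBlob; // all tuple fields, pre-serialized by Python

    public OpaquePythonRecord(byte[] keyBlob, byte[] payloadBlob) {
        this.keyBlob = keyBlob;
        this.payloadBlob = payloadBlob;
    }

    public byte[] keyBlob()     { return keyBlob; }
    public byte[] payloadBlob() { return payloadBlob; }

    @Override
    public boolean equals(Object o) {
        return o instanceof OpaquePythonRecord
                && Arrays.equals(keyBlob, ((OpaquePythonRecord) o).keyBlob);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(keyBlob);
    }
}

The Java side never needs to know the arity or the field types; it only hashes
and compares the key blob that Python produced.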

On Fri, Jul 31, 2015 at 10:30 AM, Chesnay Schepler <c.schep...@web.de>
wrote:

> If it's just a single array, how would you define group/sort keys?
>
>
> On 31.07.2015 07:03, Aljoscha Krettek wrote:
>
>> I think then the Python part would just serialize all the tuple fields to
>> a big byte array, and all the key fields to another array, so that the Java
>> side can do comparisons on the whole "key blob".
>>
>> Maybe it's overly simplistic, but it might work. :D
>>
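
(To illustrate the two-blob idea in plain Java, with nothing Flink-specific
and all names invented: sorting by an unsigned byte-wise comparison of the key
blob is enough to bring identical keys together, so a groupBy() can collapse
runs of equal blobs without the Java side knowing what the key fields are.)

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyBlobGrouping {

    /** Unsigned lexicographic comparison of two byte blobs. */
    static int compareBlobs(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Each record arrives as { keyBlob, payloadBlob }, both produced by Python.
        List<byte[][]> records = new ArrayList<>(Arrays.asList(
                new byte[][] { {2, 7}, {10} },
                new byte[][] { {1},    {20} },
                new byte[][] { {2, 7}, {30} }));

        // The sort order carries no meaning; it only brings records with
        // identical key blobs next to each other ...
        records.sort((r1, r2) -> compareBlobs(r1[0], r2[0]));

        // ... so grouping is just collapsing runs of equal key blobs.
        int i = 0;
        while (i < records.size()) {
            int j = i;
            while (j < records.size()
                    && compareBlobs(records.get(i)[0], records.get(j)[0]) == 0) {
                j++;
            }
            System.out.println("group of " + (j - i) + " record(s) for key "
                    + Arrays.toString(records.get(i)[0]));
            i = j;
        }
    }
}
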
>> On Thu, 30 Jul 2015 at 23:35 Chesnay Schepler <c.schep...@web.de> wrote:
>>
>>> I can see this working for basic types, but am unsure how it would work
>>> with Tuples. Wouldn't the Java API still need to know the arity to set up
>>> serializers?
>>>
>>> On 30.07.2015 23:02, Aljoscha Krettek wrote:
>>>
>>>> I believe it should be possible to create a special PythonTypeInfo where
>>>> the Python side is responsible for serializing the data to a byte array;
>>>> to the Java side it is just a byte array, and all comparisons are also
>>>> performed on these byte arrays. I think partitioning and sorting should
>>>> still work, since sorting is (in most cases) only used to group the
>>>> elements for a groupBy(). If a proper sort order were required, it would
>>>> have to be established on the Python side.
>>>>
>>>> On Thu, 30 Jul 2015 at 22:21 Chesnay Schepler <c.schep...@web.de>
>>>> wrote:
>>>>
>>>>> To be perfectly honest, I never really managed to work my way through
>>>>> Spark's Python API; it's a whole bunch of magic to me, and not even its
>>>>> general structure is understandable.
>>>>>
>>>>> With "pure Python", do you mean doing everything in Python, as in just
>>>>> having serialized data on the Java side?
>>>>>
>>>>> I believe the way to do this with Flink is to add a switch that
>>>>> a) disables all type checks
>>>>> b) creates serializers dynamically at runtime.
>>>>>
>>>>> a) should be fairly straightforward; b), on the other hand....
>>>>>
>>>>> By the way, the Python API itself doesn't require the type information;
>>>>> it already does the b) part.
>>>>>
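
(A minimal sketch of what b) could look like for opaque records, in plain Java
with invented names, not the actual Flink TypeSerializer interface: since every
record reaches the Java side already serialized by Python, a serializer can be
created at runtime that simply length-prefixes the blob, with no compile-time
type information at all.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Illustrative runtime "serializer" for opaque, Python-serialized records. */
public final class OpaqueBlobSerializer {

    /** Frame the blob with a length prefix so it can be written to a stream. */
    public void serialize(byte[] record, DataOutput out) throws IOException {
        out.writeInt(record.length);
        out.write(record);
    }

    /** Read one length-prefixed blob back; its contents stay opaque to Java. */
    public byte[] deserialize(DataInput in) throws IOException {
        byte[] record = new byte[in.readInt()];
        in.readFully(record);
        return record;
    }
}
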
>>>>> On 30.07.2015 22:11, Gyula Fóra wrote:
>>>>>
>>>>>> That I understand, but could you please tell me how this is done
>>>>>> differently in Spark, for instance?
>>>>>>
>>>>>> What would we need to change to make this work with pure Python (as it
>>>>>> seems to be possible)? This would probably have large performance
>>>>>> implications, though.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> Chesnay Schepler <c.schep...@web.de> wrote (on Thu, 30 Jul 2015, at 22:04):
>>>>>>
>>>>>>> Because it still goes through the Java API, which requires some kind of
>>>>>>> type information. Imagine a Java API program where you omit all generic
>>>>>>> types; it just wouldn't work as of now.
>>>>>>>
>>>>>>> On 30.07.2015 21:17, Gyula Fóra wrote:
>>>>>>>
>>>>>>>> Hey!
>>>>>>>>
>>>>>>>> Could anyone briefly tell me what exactly the reason is that we force
>>>>>>>> users of the Python API to declare types for operators?
>>>>>>>>
>>>>>>>> I don't really understand how this works in different systems, but I am
>>>>>>>> just curious why Flink has types and why Spark doesn't, for instance.
>>>>>>>>
>>>>>>>> If you give me some pointers to read, that would also be fine :)
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Gyula
>>>>>>>>
>>>>>>>>
>>>
>
