Cool stuff! Moving it to the ASF would be a great first step. I think you might want to check the IP Clearance template: http://incubator.apache.org/ip-clearance/ip-clearance-template.html
This is the one being used when donating the Airflow Kubernetes operator from Google to the ASF: http://mail-archives.apache.org/mod_mbox/airflow-dev/201909.mbox/%3cca+aakm-ahq7wni6+nazfnrxfnfh1wy34gcvyavsq4xlcwh2...@mail.gmail.com%3e

I don't expect anything weird, but it might be a good idea to check that the licenses are in the files: https://github.com/zero323/pyspark-stubs/pull/458

Also check whether any dependencies carry licenses that conflict with the Apache 2.0 license, but it looks good to me.

Looking forward, are we going to keep this as a separate repository?

While adding the licenses I noticed that there is a lingering annotation: https://github.com/zero323/pyspark-stubs/pull/459 This file has been removed in Spark upstream because we've bumped the Python version.

As mentioned in the pull request earlier, I would be a big fan of putting the annotations and the code in the same repository. I'm fine with keeping them separate in a .pyi as well. Otherwise, it is very easy for them to run out of sync.

Please let me know what comes out of the meeting.

Cheers, Fokko

On Mon, 3 Aug 2020 at 10:59, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Okay, it seems like we can create a separate repo as apache/spark? E.g.
> https://issues.apache.org/jira/browse/INFRA-20470
> We can also think about porting the files as they are.
> I will try to have a short sync with the author, Maciej, and share what we
> discussed offline.
>
> On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz
> <mszymkiew...@gmail.com> wrote:
>
>> On Wednesday, July 22, 2020, Driesprong, Fokko <fo...@driesprong.frl>
>> wrote:
>>
>>> That's probably one-time overhead, so it is not a big issue. In my
>>> opinion, a bigger one is the possible complexity. Annotations tend to
>>> introduce a lot of cyclic dependencies in the Spark codebase. This can
>>> be addressed, but it doesn't look great.
>>>
>>> This is not true (anymore). With Python 3.6 you can use string
>>> annotations -> 'DenseVector', and in the future with Python 3.7 this is
>>> fixed by having postponed evaluation:
>>> https://www.python.org/dev/peps/pep-0563/
>>
>> As far as I recall, the linked PEP addresses forward references, not
>> cyclic dependencies, which weren't a big issue in the first place.
>>
>> What I mean is actually cyclic stuff: for example, pyspark.context
>> depends on pyspark.rdd and the other way around. These dependencies are
>> not explicit at the moment.
>>
>>> Merging stubs into the project structure, on the other hand, has almost
>>> no overhead.
>>>
>>> This feels awkward to me; it is like having the docstring in a
>>> separate file. In my opinion you want to have the signatures and the
>>> functions together for transparency and maintainability.
>>
>> I guess that's a matter of preference. From a maintainability
>> perspective it is actually much easier to have separate objects.
>>
>> For example, there are different types of objects that are required for
>> meaningful checking, which don't really exist in the real code
>> (protocols, aliases, code-generated signatures for complex overloads),
>> as well as some monkey-patched entities.
>>
>> Additionally, it is often easier to see inconsistencies when typing is
>> separate.
>>
>> However, I am not implying that this should be a persistent state.
>>
>> In general I see two non-breaking paths here:
>>
>> - Merge pyspark-stubs as a separate subproject within the main Spark
>> repo, keep it in sync there with a common CI pipeline, and transfer
>> ownership of the PyPI package to the ASF.
>> - Move the stubs directly into python/pyspark and then apply individual
>> stubs to the modules of choice.
>>
>> Of course, the first proposal could be an initial step for the latter
>> one.
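[Editor's note: a minimal sketch of the string-annotation / PEP 563 point discussed above. The `DenseVector` here is a toy stand-in, not the actual pyspark.ml class; the method names are made up for illustration.]

```python
from __future__ import annotations  # PEP 563: postpone evaluation of annotations (Python 3.7+)


class DenseVector:
    def __init__(self, values: list[float]) -> None:
        self.values = list(values)

    # Before PEP 563, a method returning the class currently being defined
    # had to spell the type as a string literal (a "forward reference"),
    # because the name DenseVector does not exist yet at definition time:
    def scale_quoted(self, factor: float) -> "DenseVector":
        return DenseVector([v * factor for v in self.values])

    # With the __future__ import, all annotations are stored as strings and
    # resolved lazily, so the bare name works without quoting:
    def scale(self, factor: float) -> DenseVector:
        return DenseVector([v * factor for v in self.values])
```

As the thread notes, this helps with forward references within a module; it does not by itself remove runtime import cycles such as pyspark.context ↔ pyspark.rdd, which is a separate problem.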
>>> I think DBT is a very nice project where they use annotations very
>>> well:
>>> https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py
>>>
>>> Also, they left the types out of the docstrings, since they are
>>> available in the annotations themselves.
>>>
>>> In practice, the biggest advantage is actually support for completion,
>>> not type checking (which works in simple cases).
>>>
>>> Agreed.
>>>
>>> Would you be interested in writing up the Outreachy proposal for work
>>> on this?
>>>
>>> I would be, and also happy to mentor. But I think we first need to
>>> agree as a Spark community whether we want to add the annotations to
>>> the code, and to what extent.
>>>
>>> At some point (in general, when things are heavy in generics, which is
>>> the case here), annotations become somewhat painful to write.
>>>
>>> That's true, but that might also be a pointer that it is time to
>>> refactor the function/code :)
>>
>> That might be the case, but it is more often a matter of capturing
>> useful properties, combined with the requirement to keep things in sync
>> with the Scala counterparts.
>>
>>> For now, I tend to think adding type hints to the code makes it
>>> difficult to backport or revert, and more difficult to discuss typing
>>> on its own, especially considering typing is arguably premature yet.
>>>
>>> This feels a bit weird to me, since you want to keep this in sync,
>>> right? Do you provide different stubs for different versions of Python?
>>> I had to look up the literals: https://www.python.org/dev/peps/pep-0586/
>>
>> I think it is more about portability between Spark versions.
>>
>>> Cheers, Fokko
>>>
>>> On Wed, 22 Jul 2020 at 09:40, Maciej Szymkiewicz
>>> <mszymkiew...@gmail.com> wrote:
>>>
>>>> On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
>>>> > For now, I tend to think adding type hints to the code makes it
>>>> > difficult to backport or revert, and more difficult to discuss
>>>> > typing on its own, especially considering typing is arguably
>>>> > premature yet.
>>>>
>>>> About being premature: since the typing ecosystem evolves much faster
>>>> than Spark, it might be preferable to keep the annotations as a
>>>> separate project (preferably under the ASF / Spark umbrella). It
>>>> allows for faster iterations and supporting new features (for example,
>>>> Literals proved to be very useful), without waiting for the next Spark
>>>> release.
>>>>
>>>> --
>>>> Best regards,
>>>> Maciej Szymkiewicz
>>>>
>>>> Web: https://zero323.net
>>>> Keybase: https://keybase.io/zero323
>>>> Gigs: https://www.codementor.io/@zero323
>>>> PGP: A30CEF0C31A501EC
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
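[Editor's note: a minimal sketch of why Literal types (PEP 586, mentioned in the thread) are useful in stubs. The `load` function is entirely hypothetical, not a real PySpark API; it only illustrates how a checker can pick a precise return type from a constant string argument.]

```python
from typing import Literal, overload

# In a .pyi stub, only the @overload signatures would appear; they let a
# type checker infer dict for load(p, "json") and str for load(p, "text").
@overload
def load(path: str, fmt: Literal["json"]) -> dict: ...
@overload
def load(path: str, fmt: Literal["text"]) -> str: ...

def load(path: str, fmt: str):
    # Runtime behavior; the overloads above exist only for the checker.
    if fmt == "json":
        return {"path": path}
    return path
```

Without Literal, the best available annotation would be a union return type, forcing callers to narrow it manually even when the format is known at the call site.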