Re: [PySpark] Revisiting PySpark type annotations

Felix Cheung Tue, 04 Aug 2020 09:45:30 -0700

What would be the reason for separate git repo?

________________________________
From: Hyukjin Kwon <[email protected]>
Sent: Monday, August 3, 2020 1:58:55 AM
To: Maciej Szymkiewicz <[email protected]>
Cc: Driesprong, Fokko <[email protected]>; Holden Karau 
<[email protected]>; Spark Dev List <[email protected]>
Subject: Re: [PySpark] Revisiting PySpark type annotations

Okay, it seems we can create a separate repo like apache/spark? See e.g.
https://issues.apache.org/jira/browse/INFRA-20470
We can also think about porting the files as they are.
I will try to have a short sync with the author Maciej, and share what we 
discussed offline.

On Wed, Jul 22, 2020 at 10:43 PM, Maciej Szymkiewicz
<[email protected]> wrote:

On Wednesday, 22 July 2020, Driesprong, Fokko <[email protected]> wrote:
That's probably a one-time overhead, so it is not a big issue. In my opinion, a
bigger one is possible complexity. Annotations tend to introduce a lot of
cyclic dependencies in the Spark codebase. This can be addressed, but it doesn't
look great.

This is not true (anymore). With Python 3.6 you can add string annotations -> 
'DenseVector', and in the future with Python 3.7 this is fixed by having 
postponed evaluation: https://www.python.org/dev/peps/pep-0563/
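
To illustrate the point about string annotations, here is a minimal sketch. The `Vector` class and its methods are hypothetical stand-ins, not PySpark's `DenseVector`.

```python
from typing import List


class Vector:
    """Hypothetical stand-in for a class such as DenseVector."""

    def __init__(self, values: List[float]) -> None:
        self.values = list(values)

    # The name Vector is not bound yet while the class body executes, so the
    # annotation is quoted. With `from __future__ import annotations`
    # (PEP 563, Python 3.7+) as the first statement of the module, the quotes
    # could be dropped, because annotations are then never evaluated eagerly.
    def scale(self, factor: float) -> 'Vector':
        return Vector([v * factor for v in self.values])

    def add(self, other: 'Vector') -> 'Vector':
        return Vector([a + b for a, b in zip(self.values, other.values)])
```

The quoted annotation is stored as a plain string in `__annotations__`, and a type checker resolves it to the class.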

As far as I recall, the linked PEP addresses backreferences, not cyclic
dependencies, which weren't a big issue in the first place.

What I mean is actual cyclic stuff - for example pyspark.context depends on
pyspark.rdd and the other way around. These dependencies are not explicit at the
moment.
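
One common way to annotate mutually dependent modules without a runtime import cycle is a `typing.TYPE_CHECKING` guard. This is a single-file sketch under stated assumptions: `Context`, `RDD`, and the `mypkg` paths are hypothetical stand-ins and not the actual PySpark code.

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # This branch is skipped at runtime and read only by type checkers.
    # In a real two-module layout, the import that would otherwise close
    # the cycle goes here, e.g. (hypothetical path):
    #     from mypkg.context import Context
    pass


class RDD:
    """Hypothetical stand-in; would live in mypkg/rdd.py."""

    # Quoted annotation: Context is needed for checking only, not at runtime.
    def __init__(self, ctx: "Context", data: List[int]) -> None:
        self.ctx = ctx
        self.data = data


class Context:
    """Hypothetical stand-in; would live in mypkg/context.py."""

    def parallelize(self, data: List[int]) -> RDD:
        return RDD(self, list(data))
```

The trade-off is that the dependency becomes explicit in the source even though it is never imported at runtime.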

Merging stubs into the project structure, on the other hand, has almost no
overhead.

This feels awkward to me, this is like having the docstring in a separate file. 
In my opinion you want to have the signatures and the functions together for 
transparency and maintainability.

I guess that's a matter of preference. From a maintainability perspective, it is
actually much easier to have separate objects.

For example, there are different types of objects that are required for
meaningful checking, which don't really exist in real code (protocols, aliases,
code-generated signatures for complex overloads) as well as some monkey
patched entities.
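
A rough sketch of the kinds of typing-only objects listed above (an alias, a protocol, and overloads). All names here are hypothetical and chosen for illustration, and `Protocol` assumes Python 3.8+.

```python
from typing import List, Protocol, Union, overload


class Column:
    def __init__(self, name: str) -> None:
        self.name = name


# Type alias: exists only to give a readable name to a union.
ColumnOrName = Union[Column, str]


class SupportsName(Protocol):
    """Structural type: any object with a `name` attribute matches."""

    name: str


# Overloads describe how the return type depends on the argument type.
@overload
def select(cols: str) -> Column: ...
@overload
def select(cols: List[str]) -> List[Column]: ...


def select(cols: Union[str, List[str]]) -> Union[Column, List[Column]]:
    # Single runtime implementation behind the two overloaded signatures.
    if isinstance(cols, str):
        return Column(cols)
    return [Column(c) for c in cols]


def to_column(col: ColumnOrName) -> Column:
    return col if isinstance(col, Column) else Column(col)


def describe(obj: SupportsName) -> str:
    return f"column {obj.name}"
```

None of the alias, the protocol, or the overload signatures changes runtime behaviour; they exist for the checker and for editor completion.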

Additionally it is often easier to see inconsistencies when typing is separate.

However, I am not implying that this should be a persistent state.

In general I see two non breaking paths here.

- Merge pyspark-stubs as a separate subproject within the main Spark repo, keep
it in sync there with a common CI pipeline, and transfer ownership of the PyPI
package to the ASF.
- Move stubs directly into python/pyspark and then apply individual stubs to
modules of choice.

Of course, the first proposal could be an initial step for the latter one.

I think DBT is a very nice project where they use annotations very well: 
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left out the types in the docstring, since they are available in the
annotations themselves.

In practice, the biggest advantage is actually support for completion, not type 
checking (which works in simple cases).

Agreed.

Would you be interested in writing up the Outreachy proposal for work on this?

I would be, and I am also happy to mentor. But I think we first need to agree as
a Spark community on whether we want to add the annotations to the code, and to
what extent.

At some point (in general when things are heavy in generics, which is the case 
here), annotations become somewhat painful to write.

That's true, but that might also be a pointer that it is time to refactor the 
function/code :)

That might be the case, but it is more often a matter of capturing useful
properties combined with the requirement to keep things in sync with Scala
counterparts.

For now, I tend to think adding type hints to the code makes it difficult to
backport or revert, and makes it more difficult to discuss typing on its own,
especially considering that typing is arguably still premature.

This feels a bit weird to me, since you want to keep this in sync, right? Do you
provide different stubs for different versions of Python? I had to look up the
literals: https://www.python.org/dev/peps/pep-0586/

I think it is more about portability between Spark versions

Cheers, Fokko

On Wed, 22 Jul 2020 at 09:40, Maciej Szymkiewicz
<[email protected]> wrote:

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the codes make it
> difficult to backport or revert and
> more difficult to discuss about typing only especially considering
> typing is arguably premature yet.

About being premature: since the typing ecosystem evolves much faster than
Spark, it might be preferable to keep annotations as a separate project
(preferably under the ASF / Spark umbrella). It allows for faster iterations
and supporting new features (for example Literals proved to be very
useful), without waiting for the next Spark release.
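
As an illustration of why Literal types (PEP 586) can be useful, here is a small
sketch assuming Python 3.8+. The `JoinType` alias and the `describe_join`
function are hypothetical and not taken from pyspark-stubs.

```python
from typing import Literal, get_args

# Hypothetical alias: restricts a string parameter to a fixed set of values.
JoinType = Literal["inner", "left", "right", "outer"]


def describe_join(how: JoinType = "inner") -> str:
    # A type checker can flag describe_join("innner") before the code runs;
    # the runtime check below is only a fallback for unchecked callers.
    if how not in get_args(JoinType):
        raise ValueError(f"unsupported join type: {how!r}")
    return f"{how} join"
```

Editors can also use the literal values to offer completion for the argument.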

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC

