Re: [PySpark] Revisiting PySpark type annotations

zero323 Sat, 26 Jan 2019 06:10:25 -0800

As already pointed out by Nicholas, there is no Python 2 conflict here.
Moreover, although I used a Python 3 specific feature, Python 2 users can
benefit from the annotations as well in some circumstances (the already
mentioned MyPy is one option, PyCharm another, and maybe some other tools
as well, if not natively then, like Jupyter, through MyPy).


Nonetheless there are many factors to consider here.

First and foremost is whether the project has enough manpower to spare to
actually maintain manually curated annotations. Simple annotations can be
generated automatically (static ones can be created with stubgen, or by
reflection with MonkeyType), but these are fairly limited and sometimes
truly monstrous. At this moment PySpark annotations consist of ~ 5KLOCs -
some parts are close to trivial, others are rather involved and sometimes
require additional definitions. Since standards and tools evolve, this is
code that has to be actively maintained. This potentially means another
stream of JIRA tickets to handle.

Additionally, if annotations are to be used, maintainers should set clear
goals. Annotations can vary from dynamic Any -> Any, through detailed
annotations including generics (that's where most of the annotations for
PySpark are at this point), to in-depth constraints on values (simple
dependent types). One can also choose between documenting factual
relationships and recommendations (in other words, rejecting in the type
system some values that are allowed in practice). There is also a trade-off
between completeness and the cost of maintenance. Finally, it should be
decided whether annotations should cover only the public API (my choice) or
internals as well, and whether they should be mandatory for the chosen API
or optional.
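
To make that spectrum a bit more concrete, here is a minimal sketch of the
three levels. All names below (apply_dynamic, Box, JoinType, join_kind) are
hypothetical and made up for illustration; they are not taken from PySpark
or from the existing annotations. It assumes Python 3.8+ for typing.Literal.

```python
from typing import Any, Callable, Generic, List, Literal, TypeVar

T = TypeVar("T")
U = TypeVar("U")


# Level 1: fully dynamic (Any -> Any). Always accepted by a type checker,
# but it documents nothing and catches nothing.
def apply_dynamic(f: Any, xs: Any) -> Any:
    return [f(x) for x in xs]


# Level 2: generics. The element type is tracked through the
# transformation, so a checker can infer Box[str] from Box[int].map(str).
class Box(Generic[T]):
    def __init__(self, items: List[T]) -> None:
        self.items = items

    def map(self, f: Callable[[T], U]) -> "Box[U]":
        return Box([f(x) for x in self.items])


# Level 3: constraints on values. Only the listed strings pass the type
# checker, even though the function body accepts any string at runtime.
# This is the "recommendation" flavour: the type is stricter than the code.
JoinType = Literal["inner", "left", "right"]


def join_kind(how: JoinType) -> str:
    return how.upper()
```

Note that nothing here is enforced at runtime: join_kind("cross") runs fine
and is only flagged by a static checker. Whether that kind of strictness is
desirable is exactly the "factual vs. recommended" choice described above.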

Furthermore there are some challenges when it comes to PySpark dependencies,
many of which don't have their own annotations. And there is of course the
matter of annotating Py4j interfaces.

Last but not least there is the question of testing and acceptance. Ideally
one would run the type checker of choice against examples and source, and
accept annotations if there is no conflict. In reality, however, available
tools have limitations and can reject correct code (generics are
particularly problematic here), not to mention regressions and backward
incompatible changes. On the other hand, checking only internal consistency
(the primary acceptance criterion used with the annotations-only project)
can miss some obvious problems. There are possible solutions, but these
don't come without a cost.

Now the question is what the possible advantages are of merging annotations
into the official repository versus keeping them outside. Keeping things in
sync and tapping into the existing pool of contributors are the most obvious
ones. Additionally, it means bringing some benefits of annotations even if
the end user is not aware of or not interested in typing at all (see the
PyCharm case).

On the other hand, if the user is aware of Python typing, there is little
overhead in having a separate package. It is a lightweight dependency with
no executable code, and it is not required on the worker nodes. There is
also more room for experimentation without a strict release schedule.

Anyway... On my side I can donate the existing annotations, help with the
migration process, and provide some support during the transition period if
the decision to include annotations in the main repository is made. However,
I don't have a strong opinion on whether such a transition is required or
not.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

