Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Hyukjin Kwon
Thanks Maciej and Fokko. On Fri, Aug 28, 2020 at 6:09 AM, Maciej wrote: > On my side, I'll try to identify any possible problems by the end of the week or so (at somewhat crude inspection there is nothing unexpected or particularly hard to resolve, but sometimes problems occur when you try to refine things) …

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
On my side, I'll try to identify any possible problems by the end of the week or so (on a somewhat crude inspection there is nothing unexpected or particularly hard to resolve, but sometimes problems occur when you try to refine things) and I'll post an update. Maybe we could take it from there? In general…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
Oh, this is probably because of how annotations are handled. In general, stubs take preference over inline annotations and are considered the only source of type hints, unless the package is marked as partially typed (https://www.python.org/dev/peps/pep-0561/#id21). In such a case, however, it is all-or-nothing…
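For context, a minimal sketch of the PEP 561 mechanism referred to above; the directory and package names are only illustrative, not the actual pyspark-stubs layout. A stub-only package normally shadows inline annotations completely, unless it ships a py.typed marker containing "partial", in which case type checkers fall back to inline hints for anything the stubs do not cover.

    # A minimal sketch (assumed paths, for illustration only): create a
    # PEP 561 "partial" marker so a type checker merges stub files with
    # inline annotations instead of treating the stubs as authoritative.
    from pathlib import Path

    stub_pkg = Path("pyspark-stubs") / "pyspark"   # hypothetical stub package
    stub_pkg.mkdir(parents=True, exist_ok=True)

    # Without this marker, the stub package is treated as the only source
    # of type information; with "partial\n", checkers fall back to inline
    # annotations for anything the stubs do not cover.
    (stub_pkg / "py.typed").write_text("partial\n")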

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Driesprong, Fokko
Looking at it a second time, I think it is only mypy that's complaining:
    MacBook-Pro-van-Fokko:spark fokkodriesprong$ git diff
    diff --git a/python/pyspark/accumulators.pyi b/python/pyspark/accumulators.pyi
    index f60de25704..6eafe46a46 100644
    --- a/python/pyspark/accumulators.pyi
    +++ b/python/pyspark/accumulators.pyi
    …

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
Well, technically speaking, annotations and the actual code are not the same thing. Many parts of the Spark API might require heavy overloads to either capture relationships between arguments (for example in the case of ML) or to capture at least rudimentary relationships between inputs and outputs (i.e. udfs). Just…
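As a rough illustration of the kind of overloads mentioned here, the signatures below are invented for the example and are not PySpark's actual API. typing.overload lets a stub express that the return type of a udf-like helper depends on the function passed in, which a single annotation with Any cannot do.

    # A sketch only; make_udf and its signatures are hypothetical, not a
    # real PySpark API. The overloads capture a rudimentary input/output
    # relationship for the type checker.
    from typing import Callable, Union, overload

    @overload
    def make_udf(f: Callable[[int], int]) -> Callable[[int], int]: ...
    @overload
    def make_udf(f: Callable[[str], str]) -> Callable[[str], str]: ...

    def make_udf(f: Callable[..., Union[int, str]]) -> Callable[..., Union[int, str]]:
        # Runtime behaviour is irrelevant here; only the overload
        # signatures matter to the type checker.
        return f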

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Maciej
That doesn't sound right. Would it be a problem for you to provide a reproducible example? On 8/27/20 6:09 PM, Driesprong, Fokko wrote: > Today I've updated [SPARK-17333][PYSPARK] Enable mypy on the repository and while doing so I've noticed that all…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Driesprong, Fokko
> …stubs (https://github.com/zero323/pyspark-stubs) is still incomplete and does not annotate types in some other APIs (by using Any). Correct me if I am wrong, Maciej. …

Re: [PySpark] Revisiting PySpark type annotations

2020-08-27 Thread Hyukjin Kwon
> …am wrong, Maciej. > I’ll have a short sync with him and share to understand better, since he’d probably know the context best in PySpark type hints and I…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-20 Thread Driesprong, Fokko
> Indeed, though the possible advantage is that in theory, you can…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-20 Thread Hyukjin Kwon
I know some contexts in ASF and Apache Spark. On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz wrote: > Indeed, though the possible advantage is that in theory, you can have a different release cycle than for the main repo…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-20 Thread Driesprong, Fokko
…for the main repo (I am not sure if that's feasible in practice or if that was the intention). I guess it all depends on how we envision the future of annotations (including, but not limited to, how conservative we want to be in the future). Which is probably…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-05 Thread Driesprong, Fokko
On 8/4/20 11:06 PM, Felix Cheung wrote: > So IMO maintaining outside in a separate repo is going to be harder. That was why I asked. > From: Maciej Szymkiewicz > Sent: Tuesday, August 4…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Hyukjin Kwon
…Sent: Tuesday, August 4, 2020 12:59 PM > To: Sean Owen > Cc: Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark Dev List > Subject: Re: [PySpark] Revisiting PySpark type annotations > On 8/4/20 9:35 PM, Sean Owen wrote: > Yes, but the general…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
…why I asked. > From: Maciej Szymkiewicz > Sent: Tuesday, August 4, 2020 12:59 PM > To: Sean Owen > Cc: Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark Dev List > Subject: Re: [PySpark] Revisiting PySpark type annotations…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Felix Cheung
…Re: [PySpark] Revisiting PySpark type annotations. On 8/4/20 9:35 PM, Sean Owen wrote: > Yes, but the general argument you make here is: if you tie this project to the main project, it will _have_ to be maintained by everyone. That's good, but also exactly I think the downside we want to avoid…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
On 8/4/20 9:35 PM, Sean Owen wrote: > Yes, but the general argument you make here is: if you tie this project to the main project, it will _have_ to be maintained by everyone. That's good, but also exactly I think the downside we want to avoid at this stage (I thought?) I understand for some…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Sean Owen
On Tue, Aug 4, 2020 at 2:32 PM Maciej Szymkiewicz wrote: > First of all, why ASF ownership? > For a project of this size, maintaining high-quality annotations independent of the actual codebase (it is not hard to use stubgen or monkeytype, but the resulting annotations are rather simplistic)…
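For readers unfamiliar with the tools named here: stubgen ships with mypy and can draft stubs from an installed package (for example, something along the lines of stubgen -p pyspark -o out), but its output is typically far looser than hand-written annotations, which is the point being made. A hedged sketch of the contrast, with a made-up function rather than a real PySpark API:

    # Illustrative only; group_by_key is a made-up function, not a real API.
    # A generated (stubgen/monkeytype-style) annotation is typically this loose:
    from typing import Any, Callable, Dict, Iterable, Tuple

    def group_by_key(data: Any, key: Any) -> Any: ...

    # A hand-written stub can state the actual relationship between inputs
    # and outputs, which is the extra maintenance effort being discussed:
    def group_by_key_typed(
        data: Iterable[Tuple[str, int]],
        key: Callable[[Tuple[str, int]], str],
    ) -> Dict[str, Iterable[Tuple[str, int]]]: ...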

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
> …reason for a separate git repo? > From: Hyukjin Kwon > Sent: Monday, August 3, 2020 1:58:55 AM > To: Maciej Szymkiewicz > Cc: Driesprong, Fokko; Holden Karau; Spark Dev List > Subject: Re: [PySpark] Revisiting PySpark type annotations…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Sean Owen
> Cc: Driesprong, Fokko; Holden Karau; Spark Dev List > Subject: Re: [PySpark] Revisiting PySpark type annotations > Okay, seems like we can create a separate repo as apache/spark? e.g.) https://issues.apache.org/jira/browse/INFRA-20470 > We can also think about porting…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Felix Cheung
What would be the reason for a separate git repo? From: Hyukjin Kwon Sent: Monday, August 3, 2020 1:58:55 AM To: Maciej Szymkiewicz Cc: Driesprong, Fokko; Holden Karau; Spark Dev List Subject: Re: [PySpark] Revisiting PySpark type annotations Okay, seems like…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-03 Thread Driesprong, Fokko
Cool stuff! Moving it to the ASF would be a great first step. I think you might want to check the IP Clearance template: http://incubator.apache.org/ip-clearance/ip-clearance-template.html This is the one that was used when donating the Airflow Kubernetes operator from Google to the ASF: http://mail…

Re: [PySpark] Revisiting PySpark type annotations

2020-08-03 Thread Hyukjin Kwon
Okay, seems like we can create a separate repo as apache/spark? e.g.) https://issues.apache.org/jira/browse/INFRA-20470 We can also think about porting the files as they are. I will try to have a short sync with the author Maciej, and share what we discussed offline. On Wed, Jul 22, 2020 at 10:43 PM, Maciej…

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On Wednesday, 22 July 2020, Driesprong, Fokko wrote: > That's probably one-time overhead so it is not a big issue. In my opinion, a bigger one is possible complexity. Annotations tend to introduce a lot of cyclic dependencies in the Spark codebase. This can be addressed, but doesn't look great…

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Driesprong, Fokko
> That's probably one-time overhead so it is not a big issue. In my opinion, a bigger one is possible complexity. Annotations tend to introduce a lot of cyclic dependencies in the Spark codebase. This can be addressed, but doesn't look great.
This is not true (anymore). With Python 3.6 you can add string…
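A small sketch of what Fokko is alluding to; cache_result below is a hypothetical helper, not part of PySpark, though the DataFrame import path follows PySpark's layout. String (forward-reference) annotations plus typing.TYPE_CHECKING mean the import that would otherwise create a cycle only happens during type checking, not at runtime.

    # Sketch only: the function is made up, the technique is standard.
    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # Seen only by the type checker, so no runtime import cycle.
        from pyspark.sql.dataframe import DataFrame

    def cache_result(df: "DataFrame") -> "DataFrame":
        # The quoted annotation is resolved lazily, so this module does not
        # need pyspark.sql.dataframe at import time.
        return df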

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/22/20 3:45 AM, Hyukjin Kwon wrote: > For now, I tend to think adding type hints to the code makes it difficult to backport or revert and more difficult to discuss about typing only, especially considering typing is arguably premature yet. About being premature ‒ since the typing ecosystem…

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/21/20 9:40 PM, Holden Karau wrote: > Yeah I think this could be a great project now that we're only Python 3.5+. One potential is making this an Outreachy project to get more folks from different backgrounds involved in Spark. I am honestly not sure if that's really the case. At the moment…

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/22/20 3:45 AM, Hyukjin Kwon wrote: > Yeah, I tend to be positive about leveraging the Python type hints in general. > However, just to clarify, I don’t think we should just port the type hints into the main code yet but maybe think about having/porting Maciej's work, pyi files as stubs…

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Hyukjin Kwon
Yeah, I tend to be positive about leveraging the Python type hints in general. However, just to clarify, I don’t think we should just port the type hints into the main code yet, but maybe think about having/porting Maciej's work, the pyi files, as stubs. For now, I tend to think adding type hints to the code makes it difficult to backport or revert…

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Fully agree, Holden, it would be great to include the Outreachy project. Adding annotations is a very friendly way to get familiar with the codebase. I've also created a PR to see what's needed to get mypy in: https://github.com/apache/spark/pull/29180 From there on we can start adding annotations. …

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Holden Karau
Yeah I think this could be a great project now that we're only Python 3.5+. One potential is making this an Outreachy project to get more folks from different backgrounds involved in Spark. On Tue, Jul 21, 2020 at 12:33 PM Driesprong, Fokko wrote: > Since we've recently dropped support for Python <=3.5…

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread Driesprong, Fokko
Since we've recently dropped support for Python <=3.5, I think it would be nice to add support for type annotations. Having this in the main repository allows us to do type checking using MyPy in the CI itself.
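As an illustration of what running MyPy in CI could look like, here is a minimal sketch using mypy's Python API; the flags and the target path are assumptions for the example, not what the actual Spark PR configures.

    # Minimal sketch of a CI check script; path and flags are assumptions.
    import sys
    from mypy import api

    stdout, stderr, exit_status = api.run(
        ["--ignore-missing-imports", "python/pyspark"]
    )
    print(stdout, end="")
    print(stderr, end="", file=sys.stderr)
    # A non-zero status fails the CI job when mypy reports errors.
    sys.exit(exit_status)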

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread zero323
Given a discussion related to the SPARK-32320 PR, I'd like to resurrect this thread. Is there any interest in migrating annotations to the main repository? -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

Re: [PySpark] Revisiting PySpark type annotations

2020-07-21 Thread zero323
Given a discussion related to the SPARK-32320 PR, I'd like to resurrect this thread. -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

Re: [PySpark] Revisiting PySpark type annotations

2019-01-26 Thread zero323
As already pointed out by Nicholas, there is no Python 2 conflict here. Moreover, despite the fact that I used a Python 3 specific feature, Python 2 users can benefit from the annotations as well in some circumstances (the already mentioned MyPy is one option, PyCharm another, maybe some other tools as well…

Re: [PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Nicholas Chammas
I think the annotations are compatible with Python 2 since Maciej implemented them via stub files, which Python 2 simply ignores. Folks using mypy to check types will get the benefit whether they're on Python 2 or 3…
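A brief sketch of the stub-file mechanism described here; the function is invented for illustration and is not a real PySpark API. The .pyi file carries the annotations for type checkers, while the .py module stays annotation-free and therefore remains valid Python 2.

    # Hypothetical module pyspark/example_module.py: no annotations, so it
    # still parses under Python 2.
    def add_one(x):
        return x + 1

    # Its companion stub pyspark/example_module.pyi is seen only by type
    # checkers such as mypy or PyCharm and is never imported at runtime:
    #
    #   def add_one(x: int) -> int: ...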

Re: [PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Reynold Xin
If we can make the annotations compatible with Python 2, why don’t we add type annotations to make life easier for users of Python 3 (with type)? On Fri, Jan 25, 2019 at 7:53 AM Maciej Szymkiewicz wrote: > Hello everyone, > I'd like to revisit the topic of adding PySpark type annotations in 3.0…

[PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Maciej Szymkiewicz
Hello everyone, I'd like to revisit the topic of adding PySpark type annotations in 3.0. It has been discussed before (http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html and http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySp…