Okay, it took me a while because I had to check the options and feasibility we discussed here.
TL;DR: I think we can port the pyi files directly, as they are, into the main PySpark repository.

I would like to share only the key points here because it looks like Maciej, the people here, and I agree with this direction.

- The PySpark stubs seem stable enough to port directly into the main repository. At least they cover most of the user-facing APIs. So there would be few advantages to running them separately (versus the overhead of creating and maintaining a separate repo).
- There's a possibility that the approach to type hinting will change drastically, but that will be manageable because it will be handled within the same pyi files.
- We'll need some tests for that.
- We'll make sure this doesn't break any external user apps.

There will likely be some other meta work, such as adding tests and/or documentation. So I filed an umbrella JIRA for that: SPARK-32681 <https://issues.apache.org/jira/browse/SPARK-32681>.

If there are no objections to this direction, I think we can hopefully start. Let me know if you guys have thoughts on this.

Thanks!

On Thu, Aug 20, 2020 at 8:39 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:

> No worries, thanks for the update!
>
> On Thu, Aug 20, 2020 at 12:50, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Yeah, we had a short meeting. I had to check a few other things so some
>> delays happened. I will share soon.
>>
>> On Thu, Aug 20, 2020 at 7:14 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>
>>> Hi Maciej, Hyukjin,
>>>
>>> Did you find any time to discuss adding the types to the Python
>>> repository? Would love to know what came out of it.
>>>
>>> Cheers, Fokko
>>>
>>> On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko
>>> <fo...@driesprong.frl> wrote:
>>>
>>>> Mostly echoing stuff that we've discussed in
>>>> https://github.com/apache/spark/pull/29180, but good to have this also
>>>> on the dev-list.
>>>>
>>>> > So IMO maintaining outside in a separate repo is going to be harder.
>>>> > That was why I asked.
>>>>
>>>> I agree with Felix, having this inside of the project would make it
>>>> much easier to maintain. Having it inside of the ASF might be easier to
>>>> port the pyi files to the actual Spark repository.
>>>>
>>>> > FWIW, NumPy took this approach. they made a separate repo, and merged
>>>> > it into the main repo after it became stable.
>>>>
>>>> As Maciej pointed out:
>>>>
>>>> > As of POC ‒ we have stubs, which have been maintained over three
>>>> > years now and cover versions between 2.3 (though these are fairly limited)
>>>> > to, with some lag, current master.
>>>>
>>>> What would be required to mark it as stable?
>>>>
>>>> > I guess all depends on how we envision the future of annotations
>>>> > (including, but not limited to, how conservative we want to be in the
>>>> > future). Which is probably something that should be discussed here.
>>>>
>>>> I'm happy to motivate people to contribute type hints, and I believe it
>>>> is a very accessible way to get more people involved in the Python
>>>> codebase. Using the ASF model we can ensure that we require committers/PMC
>>>> to sign off on the annotations.
>>>>
>>>> > Indeed, though the possible advantage is that in theory, you can have
>>>> > different release cycle than for the main repo (I am not sure if that's
>>>> > feasible in practice or if that was the intention).
>>>>
>>>> Personally, I don't think we need a different cycle if the type
>>>> hints are part of the code itself.
>>>>
>>>> > If my understanding is correct, pyspark-stubs is still incomplete and
>>>> > does not annotate types in some other APIs (by using Any). Correct me if I
>>>> > am wrong, Maciej.
>>>>
>>>> For me, it is a bit like code coverage. You want this to be high to
>>>> make sure that you cover most of the APIs, but it will take some time to
>>>> make it complete.
>>>>
>>>> For me, it feels a bit like a chicken and egg problem. Because the type
>>>> hints are in a separate repository, they will always lag behind.
>>>> Also, it is harder to spot where the gaps are.
>>>>
>>>> Cheers, Fokko
>>>>
>>>> On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>
>>>>> Oh I think I caused some confusion here.
>>>>> Just for clarification, I wasn’t saying we must port this into a
>>>>> separate repo now. I was saying it can be one of the options we can
>>>>> consider.
>>>>>
>>>>> For a bit of more context:
>>>>> This option was considered as, roughly speaking, an invalid option and
>>>>> it might need an incubation process as a separate project.
>>>>> After some investigations, I found that this is still a valid option
>>>>> and we can take this as the part of Apache Spark but in a separate repo.
>>>>>
>>>>> FWIW, NumPy took this approach. they made a separate repo
>>>>> <https://github.com/numpy/numpy-stubs>, and merged it into the main
>>>>> repo <https://github.com/numpy/numpy-stubs> after it became stable.
>>>>>
>>>>> My only major concerns are:
>>>>>
>>>>>    - the possibility to fundamentally change the approach in
>>>>>    pyspark-stubs <https://github.com/zero323/pyspark-stubs>. It’s not
>>>>>    because how it was done is wrong but because how Python type hinting
>>>>>    itself evolves.
>>>>>    - If my understanding is correct, pyspark-stubs
>>>>>    <https://github.com/zero323/pyspark-stubs> is still incomplete and
>>>>>    does not annotate types in some other APIs (by using Any). Correct me
>>>>>    if I am wrong, Maciej.
>>>>>
>>>>> I’ll have a short sync with him and share to understand better since
>>>>> he’d probably know the context best in PySpark type hints and I know some
>>>>> contexts in ASF and Apache Spark.
>>>>>
>>>>> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:
>>>>>
>>>>>> Indeed, though the possible advantage is that in theory, you can
>>>>>> have different release cycle than for the main repo (I am not sure
>>>>>> if that's feasible in practice or if that was the intention).
>>>>>>
>>>>>> I guess all depends on how we envision the future of annotations
>>>>>> (including, but not limited to, how conservative we want to be in
>>>>>> the future). Which is probably something that should be discussed
>>>>>> here.
>>>>>>
>>>>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>>>>
>>>>>> So IMO maintaining outside in a separate repo is going
>>>>>> to be harder. That was why I asked.
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com>
>>>>>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>>>>>> *To:* Sean Owen
>>>>>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko;
>>>>>> Holden Karau; Spark Dev List
>>>>>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>>>>>
>>>>>> On 8/4/20 9:35 PM, Sean Owen wrote
>>>>>> > Yes, but the general argument you make here is: if you tie this
>>>>>> > project to the main project, it will _have_ to be maintained by
>>>>>> > everyone.
>>>>>> > That's good, but also exactly I think the downside we want
>>>>>> > to avoid at this stage (I thought?) I understand for some
>>>>>> > undertakings, it's just not feasible to start outside the main
>>>>>> > project, but is there no proof of concept even possible before taking
>>>>>> > this step -- which more or less implies it's going to be owned and
>>>>>> > merged and have to be maintained in the main project.
>>>>>>
>>>>>> I think we have a bit different understanding here ‒ I believe we have
>>>>>> reached a conclusion that maintaining annotations within the project is
>>>>>> OK, we only differ when it comes to specific form it should take.
>>>>>>
>>>>>> As of POC ‒ we have stubs, which have been maintained over three years
>>>>>> now and cover versions between 2.3 (though these are fairly limited) to,
>>>>>> with some lag, current master. There is some evidence there are used in
>>>>>> the wild
>>>>>> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
>>>>>> there are a few contributors
>>>>>> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
>>>>>> least some use cases (https://stackoverflow.com/q/40163106/). So,
>>>>>> subjectively speaking, it seems we're already beyond POC.
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Maciej Szymkiewicz
>>>>>>
>>>>>> Web: https://zero323.net
>>>>>> Keybase: https://keybase.io/zero323
>>>>>> Gigs: https://www.codementor.io/@zero323
>>>>>> PGP: A30CEF0C31A501EC