Okay, it took me a while because I had to check the options and feasibility
we discussed here.

TL;DR: I think we can just port the pyi files directly, as they are, into the
main PySpark repository.

I would like to share only the key points here because it looks like I,
Maciej and people here agree with this direction.

- The PySpark stubs seem stable enough to port directly into the main
repository.
    At least they cover most of the user-facing APIs, so there would not be
many advantages to running them separately (versus the overhead of creating
and maintaining a separate repo).
- There's a possibility that the way type hinting is done will change
drastically, but that will be manageable given that it will be handled within
the same pyi files.
- We'll need some tests for that.
- We'll make sure this doesn't break any external user apps.
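
As a rough illustration of the points above, here is a minimal sketch of what
stub-style annotations and a simple check on them could look like. The names
below (`Column`, `col`, `lit`, `any_ratio`) are hypothetical stand-ins that only
mirror PySpark for the sake of the example; they are not the actual contents of
pyspark-stubs, and the signatures are assumptions.

```python
from typing import Any, get_type_hints


class Column:
    """Hypothetical stand-in for a PySpark column (illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name


# In a .pyi stub the body would simply be "...", e.g.:
#     def col(name: str) -> Column: ...
def col(name: str) -> Column:
    return Column(name)


def lit(value: Any) -> Column:
    # Any marks a spot the annotations do not yet describe precisely.
    return Column(repr(value))


def any_ratio(*funcs) -> float:
    """Fraction of annotations (parameters and returns) that are just Any."""
    hints = [h for f in funcs for h in get_type_hints(f).values()]
    return sum(h is Any for h in hints) / len(hints)
```

With these definitions, `any_ratio(col, lit)` evaluates to 0.25, since one of
the four annotations is `Any`. A check along these lines could be one way to
track how complete the annotations are once they live next to the code.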

There will likely be some other meta work, such as adding tests and/or
documentation, so I filed an umbrella JIRA for that: SPARK-32681
<https://issues.apache.org/jira/browse/SPARK-32681>.
If there are no objections to this direction, I think we can start.
Let me know if you guys have thoughts on this.

Thanks!



On Thu, Aug 20, 2020 at 8:39 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:

> No worries, thanks for the update!
>
> On Thu, Aug 20, 2020 at 12:50, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Yeah, we had a short meeting. I had to check a few other things so some
>> delays happened. I will share soon.
>>
>> On Thu, Aug 20, 2020 at 7:14 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>
>>> Hi Maciej, Hyukjin,
>>>
>>> Did you find any time to discuss adding the types to the Python
>>> repository? Would love to know what came out of it.
>>>
>>> Cheers, Fokko
>>>
>>> On Wed, Aug 5, 2020 at 10:14, Driesprong, Fokko
>>> <fo...@driesprong.frl> wrote:
>>>
>>>> Mostly echoing stuff that we've discussed in
>>>> https://github.com/apache/spark/pull/29180, but good to have this also
>>>> on the dev-list.
>>>>
>>>> > So IMO maintaining outside in a separate repo is going to be harder.
>>>> > That was why I asked.
>>>>
>>>> I agree with Felix: having this inside of the project would make it
>>>> much easier to maintain. Having it inside of the ASF might also make it
>>>> easier to port the pyi files to the actual Spark repository.
>>>>
>>>> > FWIW, NumPy took this approach. they made a separate repo, and merged
>>>> > it into the main repo after it became stable.
>>>>
>>>> As Maciej pointed out:
>>>>
>>>> > As of POC ‒ we have stubs, which have been maintained over three
>>>> > years now and cover versions between 2.3 (though these are fairly
>>>> > limited) to, with some lag, current master.
>>>>
>>>> What would be required to mark it as stable?
>>>>
>>>> > I guess all depends on how we envision the future of annotations
>>>> > (including, but not limited to, how conservative we want to be in the
>>>> > future). Which is probably something that should be discussed here.
>>>>
>>>> I'm happy to motivate people to contribute type hints, and I believe it
>>>> is a very accessible way to get more people involved in the Python
>>>> codebase. Using the ASF model we can ensure that we require committers/PMC
>>>> to sign off on the annotations.
>>>>
>>>> > Indeed, though the possible advantage is that in theory, you can have
>>>> > different release cycle than for the main repo (I am not sure if that's
>>>> > feasible in practice or if that was the intention).
>>>>
>>>> Personally, I don't think we need a different cycle if the type
>>>> hints are part of the code itself.
>>>>
>>>> > If my understanding is correct, pyspark-stubs is still incomplete and
>>>> > does not annotate types in some other APIs (by using Any). Correct me
>>>> > if I am wrong, Maciej.
>>>>
>>>> For me, it is a bit like code coverage. You want this to be high to
>>>> make sure that you cover most of the APIs, but it will take some time to
>>>> make it complete.
>>>>
>>>> For me, it feels a bit like a chicken and egg problem. Because the type
>>>> hints are in a separate repository, they will always lag behind. Also, it
>>>> is harder to spot where the gaps are.
>>>>
>>>> Cheers, Fokko
>>>>
>>>>
>>>>
>>>> On Wed, Aug 5, 2020 at 05:51, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>
>>>>> Oh I think I caused some confusion here.
>>>>> Just for clarification, I wasn’t saying we must port this into a
>>>>> separate repo now. I was saying it can be one of the options we can
>>>>> consider.
>>>>>
>>>>> For a bit more context:
>>>>> This option was considered, roughly speaking, an invalid one, and
>>>>> it might need an incubation process as a separate project.
>>>>> After some investigation, I found that this is still a valid option
>>>>> and we can take this on as part of Apache Spark, but in a separate repo.
>>>>>
>>>>> FWIW, NumPy took this approach. They made a separate repo
>>>>> <https://github.com/numpy/numpy-stubs>, and merged it into the main
>>>>> repo <https://github.com/numpy/numpy-stubs> after it became stable.
>>>>>
>>>>> My only major concerns are:
>>>>>
>>>>>    - The possibility that the approach in pyspark-stubs
>>>>>    <https://github.com/zero323/pyspark-stubs> will need to change
>>>>>    fundamentally. That is not because how it was done is wrong, but
>>>>>    because Python type hinting itself keeps evolving.
>>>>>
>>>>>    - If my understanding is correct, pyspark-stubs
>>>>>    <https://github.com/zero323/pyspark-stubs> is still incomplete and
>>>>>    does not annotate types in some other APIs (by using Any). Correct me
>>>>>    if I am wrong, Maciej.
>>>>>
>>>>> I’ll have a short sync with him and share the outcome, so we understand
>>>>> this better, since he probably knows the context of PySpark type hints
>>>>> best and I know some of the context in ASF and Apache Spark.
>>>>>
>>>>> On Wed, Aug 5, 2020 at 6:31 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Indeed, though the possible advantage is that in theory, you can
>>>>>> have different release cycle than for the main repo (I am not sure
>>>>>> if that's feasible in practice or if that was the intention).
>>>>>>
>>>>>> I guess all depends on how we envision the future of annotations
>>>>>> (including, but not limited to, how conservative we want to be in
>>>>>> the future). Which is probably something that should be discussed
>>>>>> here.
>>>>>>
>>>>>> On 8/4/20 11:06 PM, Felix Cheung wrote:
>>>>>>
>>>>>> So IMO maintaining outside in a separate repo is going
>>>>>> to be harder. That was why I asked.
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Maciej Szymkiewicz <mszymkiew...@gmail.com>
>>>>>> *Sent:* Tuesday, August 4, 2020 12:59 PM
>>>>>> *To:* Sean Owen
>>>>>> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko;
>>>>>> Holden Karau; Spark Dev List
>>>>>> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>>>>>>
>>>>>> On 8/4/20 9:35 PM, Sean Owen wrote:
>>>>>> > Yes, but the general argument you make here is: if you tie this
>>>>>> > project to the main project, it will _have_ to be maintained by
>>>>>> > everyone. That's good, but also exactly I think the downside we want
>>>>>> > to avoid at this stage (I thought?) I understand for some
>>>>>> > undertakings, it's just not feasible to start outside the main
>>>>>> > project, but is there no proof of concept even possible before taking
>>>>>> > this step -- which more or less implies it's going to be owned and
>>>>>> > merged and have to be maintained in the main project.
>>>>>>
>>>>>> I think we have a bit different understanding here ‒ I believe we have
>>>>>> reached a conclusion that maintaining annotations within the project is
>>>>>> OK, we only differ when it comes to specific form it should take.
>>>>>>
>>>>>> As of POC ‒ we have stubs, which have been maintained over three years
>>>>>> now and cover versions between 2.3 (though these are fairly limited) to,
>>>>>> with some lag, current master. There is some evidence they are used in
>>>>>> the wild
>>>>>> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
>>>>>> there are a few contributors
>>>>>> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
>>>>>> least some use cases (https://stackoverflow.com/q/40163106/). So,
>>>>>> subjectively speaking, it seems we're already beyond POC.
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Maciej Szymkiewicz
>>>>>>
>>>>>> Web: https://zero323.net
>>>>>> Keybase: https://keybase.io/zero323
>>>>>> Gigs: https://www.codementor.io/@zero323
>>>>>> PGP: A30CEF0C31A501EC
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
