Yeah, that's a good point, Georg. I think we will port it as is first and discuss the indexing system further. We should probably either add a non-index mode or switch to a distributed default index type that minimizes the side effects on the query plan. We still have some months left; I will very likely raise another discussion about it in a PR or on the dev mailing list after finishing the initial porting.
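For reference, a minimal sketch of how the default index type can be switched today via the existing Koalas option (assuming the option name carries over unchanged after the port):

    # Sketch only: 'compute.default_index_type' is the current Koalas option
    # name; it may be renamed during the port.
    import databricks.koalas as ks

    # 'distributed' assigns monotonically-increasing (but non-sequential) IDs
    # per partition, so no window functions are chained into the query plan.
    ks.set_option("compute.default_index_type", "distributed")

    kdf = ks.DataFrame({"a": [1, 2, 3]})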
On Wed, Mar 17, 2021 at 8:33 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:

> Would you plan to keep the existing indexing mechanism then?
>
> https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index
>
> For me, even when trying to use the distributed version, it always resulted in various window functions being chained, a query plan different from the default one, and slower execution of the job due to this overhead.
>
> Especially since some people here are thinking about making it the default / replacing the regular API, I would strongly suggest defaulting to an indexing mechanism that does not change the query plan.
>
> Best,
> Georg
>
> On Wed, Mar 17, 2021 at 12:13 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> > Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>>
>> It's roughly 75% done so far (in Series, DataFrame and Index).
>>
>> Yes, and it properly throws an exception that says an API is not implemented yet (or is intentionally not implemented, e.g., Series.__iter__, which would otherwise easily let users shoot themselves in the foot via, for example, a for loop).
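>>
>> For example, a rough sketch of that guard (exact error wording may differ):
>>
>>     import databricks.koalas as ks
>>
>>     ser = ks.Series([1, 2, 3])
>>
>>     # In pandas this would silently iterate; Koalas raises immediately,
>>     # because iterating would otherwise collect the distributed data row
>>     # by row on the driver.
>>     for v in ser:
>>         print(v)
>>     # PandasNotImplementedError: The method `pd.Series.__iter__()` is not
>>     # implemented. Collect explicitly with to_numpy() or to_pandas() instead.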
>>
>> On Wed, Mar 17, 2021 at 2:17 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> +1, the proposal sounds good to me. Having a familiar API built in will really help new users who might only have Pandas experience get into using Spark. It sounds like the maintenance costs should be manageable once the hurdle of setting up tests is cleared. Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>>>
>>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to Arrow buffers, and then transmits them over the socket to Python. On the other side, pyspark takes the Arrow buffer and converts it to a Pandas dataframe. Unfortunately, the default Pandas representation of a list-type column turns what were contiguous value/offset arrays in Arrow into deserialized Python objects for each row. Obviously, this kills performance.
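>>>>
>>>> A minimal standalone sketch of the effect (pyarrow only, assuming it is installed):
>>>>
>>>>     import pyarrow as pa
>>>>
>>>>     # In Arrow, a list column is stored as contiguous values plus offsets.
>>>>     arr = pa.array([[1, 2], [3], [4, 5, 6]])
>>>>
>>>>     # Converting to pandas materializes one Python list object per row.
>>>>     s = arr.to_pandas()
>>>>     print(s.dtype)  # object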
>>>>
>>>> A PR to extend the pyspark API to elide the pandas conversion (https://github.com/apache/spark/pull/26783) was submitted and rejected, which is unfortunate, but perhaps this proposed integration would provide the hooks, via Pandas' ExtensionArray interface, to allow Spark to performantly interchange jagged/ragged lists to/from Python UDFs.
>>>>
>>>> Cheers,
>>>> Andrew
>>>>
>>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>
>>>> > Thank you guys for all your feedback. I will start working on the SPIP with the Koalas team.
>>>> > I would expect the SPIP can be sent late this week or early next week.
>>>> >
>>>> > I inlined and answered the unanswered questions below:
>>>> >
>>>> > Is the community developing the pandas API layer for Spark interested in being part of Spark, or do they prefer having their own release cycle?
>>>> >
>>>> > Yeah, the Koalas team used to have its own release cycle so it could develop and move quickly. Now that it has become pretty mature, reaching 1.7.0, the team thinks it's fine to have less frequent releases, and they are happy to work together with Spark by contributing to it. The active contributors in the Koalas community will continue to make their contributions in Spark.
>>>> >
>>>> > How about test code? Does it fit into the PySpark test framework?
>>>> >
>>>> > Yes, this will be one of the places that needs some effort. Koalas currently uses pytest with various dependency version combinations (e.g., Python version, conda vs pip), whereas PySpark uses plain unittest with fewer dependency version combinations.
>>>> >
>>>> > For pytest in Koalas <> unittest in PySpark: I am currently thinking we will have to convert the Koalas tests to use unittest to match PySpark for now. Migrating PySpark to pytest is a feasible option too, but it would need extra effort to make it work seamlessly with our own PySpark testing framework. The Koalas team (presumably, and most likely I) will take a look in any event.
>>>> >
>>>> > For the combinations of dependency versions: due to the lack of resources in GitHub Actions, I currently plan to just add the Koalas tests into the matrix PySpark is currently using.
>>>> >
>>>> > one question I have; what's the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented? Or the basic set of them?
>>>> >
>>>> > The goal of the proposal is to port all of the Koalas project into PySpark. For example,
>>>> >
>>>> > import koalas
>>>> >
>>>> > will be equivalent to
>>>> >
>>>> > # Names, etc. might change in the final proposal or during the review
>>>> > from pyspark.sql import pandas
>>>> >
>>>> > Koalas supports the pandas APIs with a separate layer that covers some differences between the DataFrame structures in pandas and PySpark, e.g., other types as column names (labels), an index (something like a row number in DBMSs), and so on. So I think it would make more sense to port the whole layer instead of a subset of the APIs.
>>>> >
>>>> > On Wed, Mar 17, 2021 at 12:32 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>> >>
>>>> >> +1, it's great to have Pandas support in Spark out of the box.
>>>> >>
>>>> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>> >>>
>>>> >>> +1; the pandas interfaces are pretty popular, and supporting them in pyspark looks promising, I think.
>>>> >>> One question I have: what's the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented? Or the basic set of them?
>>>> >>>
>>>> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> +1
>>>> >>>>
>>>> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring benefits for everyone (more eyes to use/see/fix/improve the API) as well as better alignment with core Spark improvements; the extra weight looks manageable.
>>>> >>>>
>>>> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>> >>>> >
>>>> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <r...@databricks.com> wrote:
>>>> >>>> >>
>>>> >>>> >> I don't think we should deprecate existing APIs.
>>>> >>>> >
>>>> >>>> > +1
>>>> >>>> >
>>>> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.
>>>> >>>> >
>>>> >>>> > For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.
>>>> >>>
>>>> >>> --
>>>> >>> ---
>>>> >>> Takeshi Yamamuro