Would you plan to keep the existing indexing mechanism then?
https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index

In my experience, even when trying to use the distributed variant, it always resulted in various window functions being chained, a query plan different from the default one, and slower job execution due to this overhead.
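For reference, a minimal sketch of pinning the default index type and checking its effect on the plan (assuming databricks-koalas 1.x; the option is the one described on the best-practices page above, and the spark_df/"id" names in the commented-out lines are only illustrative):

    import databricks.koalas as ks

    # "sequence" (default), "distributed-sequence", or "distributed"
    ks.set_option("compute.default_index_type", "distributed")

    kdf = ks.range(10)    # a default index is attached using the configured mechanism
    kdf.spark.explain()   # compare the plans produced by the three index types

    # Alternatively, avoid a generated index altogether by promoting an existing
    # column to the index when converting a Spark DataFrame:
    # kdf = spark_df.to_koalas(index_col="id")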
Especially since some people here are thinking about making it the default and replacing the regular API, I would strongly suggest defaulting to an indexing mechanism that does not change the query plan.

Best,
Georg

On Wed, Mar 17, 2021 at 12:13 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> > Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>
> It's roughly 75% done so far (in Series, DataFrame and Index). Yeah, and it properly throws an exception saying the API is not implemented yet (or is intentionally not implemented, e.g. Series.__iter__, which would otherwise easily let users shoot themselves in the foot with, for example, a for loop ...).
>
> On Wed, Mar 17, 2021 at 2:17 PM Bryan Cutler <cutl...@gmail.com> wrote:
>
>> +1, the proposal sounds good to me. Having a familiar API built in will really help new users who might only have Pandas experience get into using Spark. It sounds like the maintenance costs should be manageable once the hurdle of setting up tests is cleared. Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>>
>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to arrow buffers, then transmits them over the socket to Python. On the other side, pyspark takes the arrow buffer and converts it to a Pandas dataframe. Unfortunately, the default Pandas representation of a list-type column turns what were contiguous value/offset arrays in Arrow into deserialized Python objects for each row. Obviously, this kills performance.
>>>
>>> A PR to extend the pyspark API to elide the pandas conversion (https://github.com/apache/spark/pull/26783) was submitted and rejected, which is unfortunate, but perhaps this proposed integration would provide the hooks, via Pandas' ExtensionArray interface, to allow Spark to performantly interchange jagged/ragged lists to/from python UDFs.
>>>
>>> Cheers
>>> Andrew
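For reference, a minimal sketch of the conversion described above (assuming PySpark 3.0+ with pyarrow installed; the column and function names are only illustrative): an array column arrives in a pandas UDF as one deserialized object per row instead of contiguous Arrow value/offset buffers.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1.0, 2.0, 3.0],), ([4.0],)], ["vals"])

    @pandas_udf("double")
    def sum_vals(v: pd.Series) -> pd.Series:
        # Each element of `v` is a separate numpy array (one object per row),
        # i.e. the contiguous Arrow list representation has already been unpacked.
        return v.map(lambda arr: float(sum(arr)))

    df.select(sum_vals("vals")).show()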
>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>> >
>>> > Thank you guys for all your feedback. I will start working on the SPIP with the Koalas team. I would expect the SPIP can be sent late this week or early next week.
>>> >
>>> > I inlined and answered the unanswered questions below:
>>> >
>>> > Is the community developing the pandas API layer for Spark interested in being part of Spark, or do they prefer having their own release cycle?
>>> >
>>> > Yeah, the Koalas team used to have its own release cycle so it could develop and move quickly. It has now become pretty mature, having reached 1.7.0, and the team thinks it is fine to have less frequent releases; they are happy to work together with Spark by contributing to it. The active contributors in the Koalas community will continue to make their contributions in Spark.
>>> >
>>> > How about test code? Does it fit into the PySpark test framework?
>>> >
>>> > Yes, this is one of the places where some effort is needed. Koalas currently uses pytest with various dependency version combinations (e.g., Python version, conda vs pip), whereas PySpark uses plain unittest with fewer dependency version combinations.
>>> >
>>> > For pytest in Koalas <> unittest in PySpark: I am currently thinking we will have to convert the Koalas tests to unittest to match PySpark for now. Migrating PySpark to pytest is also a feasible option, but it would need extra effort to make it work seamlessly with our own PySpark testing framework. The Koalas team (presumably and likely I) will take a look in any event.
>>> >
>>> > For the combinations of dependency versions: due to the lack of resources in GitHub Actions, I currently plan to just add the Koalas tests into the matrix PySpark is already using.
>>> >
>>> > One question I have: what is the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented, or just a basic set of them?
>>> >
>>> > The goal of the proposal is to port the whole Koalas project into PySpark. For example,
>>> >
>>> >     import koalas
>>> >
>>> > will be equivalent to
>>> >
>>> >     # Names, etc. might change in the final proposal or during the review
>>> >     from pyspark.sql import pandas
>>> >
>>> > Koalas supports the pandas APIs with a separate layer to cover the differences between DataFrame structures in pandas and PySpark, e.g. other types as column names (labels), an index (something like a row number in DBMSs), and so on. So I think it would make more sense to port the whole layer instead of a subset of the APIs.
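For illustration, a minimal sketch of that extra layer (assuming databricks-koalas 1.x; the ported import path mentioned above is tentative, so the current package name is used here): an index and non-string column labels carried on top of a Spark DataFrame.

    import pandas as pd
    import databricks.koalas as ks

    # Non-string column labels and an explicit index, handled by the Koalas layer
    kdf = ks.from_pandas(pd.DataFrame({0: [1, 2], 1: [3, 4]}, index=["r1", "r2"]))

    print(kdf)                    # printed with its index (row labels), like pandas
    print(kdf.columns)            # integer column labels preserved by the layer
    kdf.to_spark().printSchema()  # the plain Spark DataFrame: no index, string column names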
>>> > On Wed, Mar 17, 2021 at 12:32 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>> >>
>>> >> +1, it's great to have Pandas support in Spark out of the box.
>>> >>
>>> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>> >>>
>>> >>> +1; the pandas interfaces are pretty popular, and supporting them in pyspark looks promising, I think. One question I have: what is the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented, or just a basic set of them?
>>> >>>
>>> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>> >>>>
>>> >>>> +1
>>> >>>>
>>> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring benefits for everyone (more eyes to use/see/fix/improve the API) as well as better alignment with core Spark improvements; the extra weight looks manageable.
>>> >>>>
>>> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> >>>> >
>>> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <r...@databricks.com> wrote:
>>> >>>> >>
>>> >>>> >> I don't think we should deprecate existing APIs.
>>> >>>> >
>>> >>>> > +1
>>> >>>> >
>>> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.
>>> >>>> >
>>> >>>> > For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.