For this purpose I've created my own implicit withColumnsRenamed, which simply accepts a map of string→string and calls the existing rename method once per entry.
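For reference, a minimal sketch of what such an implicit could look like (an illustration only, not the exact code; names like DataFrameImplicits / RichDataFrame are placeholders, and it relies only on the public withColumnRenamed method):

import org.apache.spark.sql.DataFrame

object DataFrameImplicits {
  implicit class RichDataFrame(df: DataFrame) {
    // Fold over the map and call the public withColumnRenamed once per entry.
    def withColumnsRenamed(renames: Map[String, String]): DataFrame =
      renames.foldLeft(df) { case (acc, (oldName, newName)) =>
        acc.withColumnRenamed(oldName, newName)
      }
  }
}

// usage: import DataFrameImplicits._; df.withColumnsRenamed(Map("a" -> "x", "b" -> "y"))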
Tue, May 4, 2021 at 10:22, Yikun Jiang <yikunk...@gmail.com>:

> @Saurabh @Mr.Powers Thanks for the input information.
>
> I personally prefer to introduce `withColumns` because it brings a
> friendlier development experience than select(*).
>
> This is the PR to add `withColumns`:
> https://github.com/apache/spark/pull/32431
>
> Regards,
> Yikun
>
> Saurabh Chawla <s.saurabh...@gmail.com> wrote on Fri, Apr 30, 2021 at 1:13 PM:
>
>> Hi All,
>>
>> I also had a scenario where, at runtime, I needed to loop over a
>> dataframe and call withColumn many times.
>>
>> To be on the safe side, I used reflection to access withColumns and
>> avoid a java.lang.StackOverflowError:
>>
>> val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
>> val newConfigurationMethod =
>>   dataSetClass.getMethod("withColumns", classOf[Seq[String]], classOf[Seq[Column]])
>> newConfigurationMethod.invoke(
>>   baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
>>
>> It would be great if we could use "withColumns" directly rather than
>> reflection code like this, or change the code to merge the new Project
>> with the existing Project in the plan instead of adding a new Project
>> every time we call "withColumn".
>>
>> +1 for exposing *withColumns*
>>
>> Regards
>> Saurabh Chawla
>>
>> On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang <yikunk...@gmail.com> wrote:
>>
>>> Hi, all
>>>
>>> *Background:*
>>>
>>> Currently, there is a withColumns
>>> <https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>
>>> [1] method that helps users/devs add or replace multiple columns at once.
>>> But this method is private and not exposed as a public API, which means
>>> users cannot call it directly, and it is not available in the PySpark API
>>> either.
>>>
>>> As a dataframe user, I can only call withColumn() multiple times:
>>>
>>> df.withColumn("key1", col("key1")).withColumn("key2",
>>> col("key2")).withColumn("key3", col("key3"))
>>>
>>> rather than:
>>>
>>> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"),
>>> col("key3")])
>>>
>>> Multiple calls add cost in both developer experience and performance;
>>> in a PySpark scenario especially, multiple calls mean multiple py4j calls.
>>>
>>> As mentioned by @Hyukjin
>>> <https://github.com/apache/spark/pull/32276#issuecomment-824461143>,
>>> there were some previous discussions on SPARK-12225
>>> <https://issues.apache.org/jira/browse/SPARK-12225> [2].
>>>
>>> [1]
>>> https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
>>> [2] https://issues.apache.org/jira/browse/SPARK-12225
>>>
>>> *Potential solution:*
>>> Looks like there are 2 potential solutions if we want to support it:
>>>
>>> 1. Introduce a *withColumns* API for Scala/Python.
>>> A separate public withColumns API would be added to the Scala and Python APIs.
>>>
>>> 2. Make withColumn accept a *single col* as well as a *list of cols*.
>>> I did an experimental try for PySpark in
>>> https://github.com/apache/spark/pull/32276
>>> but, just like Maciej said
>>> <https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>,
>>> it brings some confusion with naming.
>>>
>>> Thanks for reading; feel free to reply if you have any other
>>> concerns or suggestions!
>>>
>>> Regards,
>>> Yikun
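As an aside, until a public withColumns is exposed, the select(*)-style workaround mentioned above can be wrapped in a small helper so that many columns are added in a single projection (one Project node in the plan) rather than one per withColumn call. This is only a rough sketch: the helper name addColumns is made up here, and unlike the private withColumns it appends columns rather than replacing existing ones with the same name.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Add several columns in one select so the plan grows by a single Project.
def addColumns(df: DataFrame, newCols: Map[String, Column]): DataFrame =
  df.select(df.columns.map(col) ++ newCols.map { case (name, c) => c.as(name) }: _*)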