@Saurabh @Mr.Powers Thanks for the input. I personally prefer introducing `withColumns`, because it brings a friendlier development experience than the select(*) workaround.
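For example (just a rough sketch: I'm assuming the public API keeps the existing private signature withColumns(colNames: Seq[String], cols: Seq[Column]), and the column names here are hypothetical):

    import org.apache.spark.sql.functions.col

    // df is an existing DataFrame with (hypothetical) columns a and b

    // the select(*) workaround today: re-project everything plus the new columns
    val viaSelect = df.select(
      col("*"),
      (col("a") + 1).as("a_plus_1"),
      (col("b") * 2).as("b_times_2"))

    // with a public withColumns: one call, one extra Project in the plan
    val viaWithColumns = df.withColumns(
      Seq("a_plus_1", "b_times_2"),
      Seq(col("a") + 1, col("b") * 2))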
This is the PR to add `withColumns`: https://github.com/apache/spark/pull/32431

Saurabh's runtime-loop scenario below is a good example of the pain point; a rough sketch of how it could look with a public withColumns is at the bottom of this mail, after the quoted thread.

Regards,
Yikun

Saurabh Chawla <s.saurabh...@gmail.com> wrote on Fri, Apr 30, 2021 at 1:13 PM:

> Hi All,
>
> I also had a scenario where, at runtime, I needed to loop through a
> dataframe and call withColumn many times.
>
> To be on the safer side, I used reflection to access withColumns and
> avoid any java.lang.StackOverflowError:
>
> val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
> val newConfigurationMethod =
>   dataSetClass.getMethod("withColumns", classOf[Seq[String]], classOf[Seq[Column]])
> newConfigurationMethod.invoke(
>   baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
>
> It would be great if we could use "withColumns" directly rather than
> reflection code like this, or change the code to merge the new Project
> with the existing Project in the plan instead of adding a new Project
> every time we call "withColumn".
>
> +1 for exposing the *withColumns*
>
> Regards
> Saurabh Chawla
>
> On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang <yikunk...@gmail.com> wrote:
>
>> Hi, all
>>
>> *Background:*
>>
>> Currently, there is a withColumns
>> <https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402> [1]
>> method to help users/devs add or replace multiple columns at once.
>> But this method is private and isn't exposed as a public API, which
>> means users cannot call it directly, and it is not available in the
>> PySpark API either.
>>
>> As a dataframe user, I can only call withColumn() multiple times:
>>
>> df.withColumn("key1", col("key1")).withColumn("key2", col("key2")).withColumn("key3", col("key3"))
>>
>> rather than:
>>
>> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), col("key3")])
>>
>> Multiple calls add cost in both developer experience and performance.
>> Especially in a PySpark scenario, multiple calls mean multiple py4j calls.
>>
>> As mentioned
>> <https://github.com/apache/spark/pull/32276#issuecomment-824461143> by
>> @Hyukjin, there were some previous discussions on SPARK-12225
>> <https://issues.apache.org/jira/browse/SPARK-12225> [2].
>>
>> [1] https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
>> [2] https://issues.apache.org/jira/browse/SPARK-12225
>>
>> *Potential solutions:*
>> There are two potential solutions if we want to support this:
>>
>> 1. Introduce a *withColumns* API for Scala/Python.
>>    A separate public withColumns API would be added to the Scala and Python APIs.
>>
>> 2. Make withColumn accept a *single col* as well as a *list of cols*.
>>    I did an experimental try in PySpark on
>>    https://github.com/apache/spark/pull/32276
>>    As Maciej said
>>    <https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>,
>>    it would bring some confusion around naming.
>>
>> Thanks for reading; feel free to reply if you have any other concerns
>> or suggestions!
>>
>> Regards,
>> Yikun
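Below is a rough sketch of how the runtime-loop scenario in Saurabh's mail could look once withColumns is public. It assumes the public method keeps the current private signature withColumns(colNames: Seq[String], cols: Seq[Column]); baseDataFrame is the DataFrame from his snippet, and the generated column names/values are hypothetical placeholders:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.lit

    // build the column names and expressions at runtime, as in the loop scenario
    val derived: Seq[(String, Column)] = (1 to 50).map(i => (s"flag_$i", lit(i)))
    val (names, cols) = derived.unzip

    // a single call adds one Project to the plan, instead of the 50 nested
    // Projects that repeated withColumn calls produce (which is what Saurabh's
    // mail says can lead to java.lang.StackOverflowError)
    val result = baseDataFrame.withColumns(names, cols)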