Thanks Michael, hopefully those will get some attention in a not-too-distant release. Do you think this is related to, or separate from, a similar issue [1] that I filed a bit earlier, regarding the way that StringIndexer (and perhaps other ML components) handles some of these columns? (I've dug through a bit of the source, but it's not entirely clear to me, since I'm not a Scala hacker, how transparently (or non-transparently) column names are passed through to the underlying DataFrame methods.)
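
For what it's worth, here's a minimal sketch of the kind of interaction I have in mind. It reuses the SparkContext/SQLContext setup from the example below and assumes org.apache.spark.ml.feature.StringIndexer is on the classpath; the column and output names are only illustrative, and I haven't confirmed that StringIndexer actually fails this way (my guess is that the input column name gets passed straight through to DataFrame methods):

    /* Hypothetical illustration: a string column whose name contains a dot. */
    StructType mlSchema = new StructType(new StructField[] {
            DataTypes.createStructField("a.b", DataTypes.StringType, false)
    });
    List<Row> mlRows = Arrays.asList(RowFactory.create("x"), RowFactory.create("y"));
    DataFrame mlDf = sqlContext.createDataFrame(sparkContext.parallelize(mlRows), mlSchema);

    /* If StringIndexer resolves its input column the same way drop()/col() do,
     * I'd expect it to trip over "a.b" as well (an assumption, not verified). */
    StringIndexerModel model = new StringIndexer()
            .setInputCol("a.b")
            .setOutputCol("a_b_index")   // illustrative output column name
            .fit(mlDf);
    model.transform(mlDf).show();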
Joshua

[1]: https://issues.apache.org/jira/browse/SPARK-12965

On Mon, Jan 25, 2016 at 4:08 PM, Michael Armbrust <mich...@databricks.com> wrote:
> Looks like you found a bug. I've filed them here:
>
> SPARK-12987 - Drop fails when columns contain dots
> SPARK-12988 - Can't drop columns that contain dots
>
> On Fri, Jan 22, 2016 at 3:18 PM, Joshua TAYLOR <joshuaaa...@gmail.com> wrote:
>>
>> (Apologies if this comes through twice; I sent it once before I'd
>> confirmed my mailing list subscription.)
>>
>> I've been having lots of trouble today with DataFrames whose columns have dots
>> in their names. I know that in many places backticks can be used to quote
>> column names, but the problem I'm running into now is that I can't drop a
>> column that has *no* dots in its name when there are *other* columns in the
>> table that do. Here's some code that tries four ways of dropping the column.
>> One throws a weird exception, one is a semi-expected no-op, and the other
>> two work.
>>
>>     public class SparkExample {
>>         public static void main(String[] args) {
>>             /* Get the Spark and SQL contexts. Setting spark.ui.enabled to false
>>              * keeps Spark from using its built-in dependency on Jersey. */
>>             SparkConf conf = new SparkConf()
>>                     .setMaster("local[*]")
>>                     .setAppName("test")
>>                     .set("spark.ui.enabled", "false");
>>             JavaSparkContext sparkContext = new JavaSparkContext(conf);
>>             SQLContext sqlContext = new SQLContext(sparkContext);
>>
>>             /* Create a schema with two columns, one of which has no dots (a_b),
>>              * and the other of which does (a.c). */
>>             StructType schema = new StructType(new StructField[] {
>>                     DataTypes.createStructField("a_b", DataTypes.StringType, false),
>>                     DataTypes.createStructField("a.c", DataTypes.IntegerType, false)
>>             });
>>
>>             /* Create an RDD of Rows, and then convert it into a DataFrame. */
>>             List<Row> rows = Arrays.asList(
>>                     RowFactory.create("t", 2),
>>                     RowFactory.create("u", 4));
>>             JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>>             DataFrame df = sqlContext.createDataFrame(rdd, schema);
>>
>>             /* Four ways to attempt dropping a_b from the DataFrame.
>>              * We'll try calling each one of these and looking at
>>              * the results (or the resulting exception). */
>>             Function<DataFrame,DataFrame> x1 = d -> d.drop("a_b");          // exception
>>             Function<DataFrame,DataFrame> x2 = d -> d.drop("`a_b`");        // no-op
>>             Function<DataFrame,DataFrame> x3 = d -> d.drop(d.col("a_b"));   // works
>>             Function<DataFrame,DataFrame> x4 = d -> d.drop(d.col("`a_b`")); // works
>>
>>             int i = 0;
>>             for (Function<DataFrame,DataFrame> x : Arrays.asList(x1, x2, x3, x4)) {
>>                 System.out.println("Case " + i++);
>>                 try {
>>                     x.apply(df).show();
>>                 } catch (Exception e) {
>>                     e.printStackTrace(System.out);
>>                 }
>>             }
>>         }
>>     }
>>
>> Here's the output. Case 1 is a no-op, which I think I can understand,
>> because DataFrame.drop(String) doesn't do any resolution (it doesn't need
>> to), so d.drop("`a_b`") doesn't do anything because there's no column whose
>> name is literally "`a_b`". The third and fourth cases work, because
>> DataFrame.col() does do resolution, and both "a_b" and "`a_b`" resolve
>> correctly. But why does the first case fail? And why with the message that
>> it does? Why is it trying to resolve "a.c" at all in this case?
>>
>> Case 0
>> org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input columns a_b, a.c;
>>     at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>     at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>>     at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>     at scala.collection.immutable.List.foreach(List.scala:318)
>>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>     at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>>     at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>>     at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>     at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>     at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>>     at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>>     at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>>     at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>>     at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>>     at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>>     at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
>>     at org.apache.spark.sql.DataFrame.drop(DataFrame.scala:1286)
>>     at SparkExample.lambda$0(SparkExample.java:45)
>>     at SparkExample.main(SparkExample.java:54)
>> Case 1
>> +---+---+
>> |a_b|a.c|
>> +---+---+
>> |  t|  2|
>> |  u|  4|
>> +---+---+
>>
>> Case 2
>> +---+
>> |a.c|
>> +---+
>> |  2|
>> |  4|
>> +---+
>>
>> Case 3
>> +---+
>> |a.c|
>> +---+
>> |  2|
>> |  4|
>> +---+
>>
>> Thanks in advance,
>> Joshua
>>
>> --
>> Joshua Taylor, http://www.cs.rpi.edu/~tayloj/
>
>

--
Joshua Taylor, http://www.cs.rpi.edu/~tayloj/
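
A minimal workaround sketch, building on the cases above: go through Column objects rather than plain strings, and backtick-quote dotted names when resolving them with col(). The df here is the same two-column DataFrame from the example; treat this as an illustration of the behavior reported in the thread, not as verified behavior on any particular Spark version:

    /* Dropping the dot-free column: go through df.col(...), which resolves
     * correctly (cases 2 and 3 above). */
    DataFrame withoutAB = df.drop(df.col("a_b"));
    withoutAB.show();

    /* Referring to the dotted column itself: backticks keep col() from
     * treating "a.c" as a struct-field access (an assumption based on the
     * backtick handling shown above). */
    DataFrame justAC = df.select(df.col("`a.c`"));
    justAC.show();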