RE: Select all columns except some

Saif.A.Ellafi Fri, 17 Jul 2015 06:16:13 -0700

Hello, thank you for your time.

Seq[String] works perfectly fine. I also tried running a for loop through all 
elements to see if any access to a value was broken, but no, they are alright.


For now, I solved it by calling the following. Sadly, it takes a lot of time, 
but it works:

var data_sas = sqlContext.read.format("com.github.saurfang.sas.spark").load("/path/to/file.s")
data_sas.cache
for (col <- clean_cols) {
    data_sas = data_sas.drop(col)
}
data_sas.unpersist

Saif


From: Yana Kadiyska [mailto:[email protected]]
Sent: Thursday, July 16, 2015 12:58 PM
To: Ellafi, Saif A.
Cc: [email protected]<mailto:[email protected]>
Subject: Re: Select all columns except some

Have you tried examining what clean_cols contains? I'm suspicious of this 
part: mkString(", ").
Try this:
val clean_cols : Seq[String] = df.columns...

If you get a type error, you need to work on clean_cols (I suspect yours is of 
type String at the moment and presents itself to Spark as a single column name 
with commas embedded).
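
To illustrate the type difference described above, here is a minimal 
plain-Scala sketch with no Spark involved. The column names asd_dt and 
industry_area come from the exception message in the original post; STATE_1 
and the object name CleanColsTypes are made up for illustration.

```scala
object CleanColsTypes {
  // Stand-in for df.columns, which is an Array[String]. STATE_1 is hypothetical.
  val columns: Array[String] = Array("asd_dt", "industry_area", "STATE_1")

  // mkString collapses the filtered names into ONE String:
  // "asd_dt, industry_area". Passed to select, that is a single column name.
  val joined: String = columns.filterNot(_.startsWith("STATE_")).mkString(", ")

  // Without mkString the result stays a collection with one element per column.
  val kept: Seq[String] = columns.filterNot(_.startsWith("STATE_")).toSeq
}
```

With the explicit Seq[String] annotation, leaving mkString in place would fail 
to compile, which is the type error mentioned above.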

Not sure why the .drop call hangs but in either case drop returns a new 
dataframe -- it's not a setter call....
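
A hedged sketch of both approaches, assuming the Spark 1.4-era DataFrame API 
used in this thread. The object and method names (ColumnSelection, keepNames, 
selectExcept, dropPrefixed) are hypothetical helpers, not part of Spark.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper object; only select, col, drop and columns are Spark API.
object ColumnSelection {
  // Column names that do NOT start with the given prefix.
  def keepNames(all: Seq[String], prefix: String): Seq[String] =
    all.filterNot(_.startsWith(prefix))

  // select takes Column varargs, so the sequence is expanded with `: _*`.
  def selectExcept(df: DataFrame, prefix: String): DataFrame =
    df.select(keepNames(df.columns.toSeq, prefix).map(name => df.col(name)): _*)

  // drop returns a new DataFrame, so each result is threaded through the fold
  // instead of being discarded.
  def dropPrefixed(df: DataFrame, prefix: String): DataFrame =
    df.columns
      .filter(_.startsWith(prefix))
      .foldLeft(df)((acc, name) => acc.drop(name))
}
```

Either function returns a new DataFrame, so the result has to be assigned, for 
example val cleaned = ColumnSelection.selectExcept(df, "STATE_"). Whether a 
single select is faster than repeated drop calls for a given data source is 
not something this thread establishes.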

On Thu, Jul 16, 2015 at 10:57 AM, 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

In a dataframe with a hundred columns, I wish to either select all of them 
except some, or drop the ones I don't want.

I am failing at such a simple task. I tried two ways:

val clean_cols = df.columns.filterNot(col_name => col_name.startsWith("STATE_")).mkString(", ")
df.select(clean_cols)

But this throws exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'asd_dt, industry_area,...'
 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
 at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
 at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
 at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)

The other thing I tried is

val cols = df.columns.filter(col_name => col_name.startsWith("STATE_"))
for (col <- cols) df.drop(col)

But this other thing doesn’t do anything or hangs up.

Saif
