question on transforms for spark 2.0 dataset

Bill Schwanitz Wed, 01 Mar 2017 08:22:18 -0800

Hi all,

I'm fairly new to Spark and Scala, so bear with me.


I'm working with a dataset containing a set of columns / fields. The data is
stored in HDFS as Parquet and is sourced from a Postgres box, so fields and
values are reasonably well formed. We are in the process of trying out a
switch from Pentaho and various SQL databases to pulling data into HDFS and
applying transforms / building new datasets, with the processing being done
in Spark (and other tools that are still under evaluation).

A rough version of the code I'm running so far:

val sample_data = spark.read.parquet("my_data_input")

val example_row = spark.sql("select * from parquet.my_data_input where id = 123").head

I want to apply a trim operation on a set of fields - let's call them
field1, field2, field3 and field4.

What is the best way to go about applying those trims and creating a new
dataset? Can I apply the trim to all fields in a single map, or do I need
to apply multiple map functions?

When I try the map (even with a single field):

scala> val transformed_data = sample_data.map(
     |   _.trim(col("field1"))
     |   .trim(col("field2"))
     |   .trim(col("field3"))
     |   .trim(col("field4"))
     | )

I end up with the following error:

<console>:26: error: value trim is not a member of org.apache.spark.sql.Row
         _.trim(col("field1"))
           ^
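
For context, one possible reading of this error: map on a DataFrame hands
each Row to the function, and Row has no trim method; trim is a column
function in org.apache.spark.sql.functions. Below is a minimal sketch of a
column-based alternative, assuming the four fields are string columns. The
inline sample rows and the fieldsToTrim name are hypothetical stand-ins for
the parquet input, not part of the original code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

// In spark-shell this returns the existing session; the local master is an
// assumption for running the sketch standalone.
val spark = SparkSession.builder().master("local[*]").appName("trim-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for spark.read.parquet("my_data_input")
val sample_data = Seq(
  (123, " a ", "b  ", "  c", " d "),
  (124, "e", " f", "g ", "h")
).toDF("id", "field1", "field2", "field3", "field4")

val fieldsToTrim = Seq("field1", "field2", "field3", "field4")

// withColumn with an existing column name replaces that column, so folding
// over the names trims each field and leaves the other columns untouched.
val transformed_data = fieldsToTrim.foldLeft(sample_data) { (df, name) =>
  df.withColumn(name, trim(col(name)))
}

transformed_data.show()
```

This stays in the DataFrame API, so no map over rows is needed and all four
fields are handled in one pass over the column names.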

Any ideas / guidance would be appreciated!
