Hi,

I think you need a UDF if you want to transform a column.

HTH
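A minimal sketch of the UDF route, assuming field1 through field4 are string columns (the names and sample_data come from the question below):

import org.apache.spark.sql.functions.{col, udf}

// Null-safe trim: leave null cells as null instead of throwing an NPE.
val trimUdf = udf((s: String) => if (s == null) null else s.trim)

// withColumn returns a new DataFrame each time, so fold over the field
// names rather than repeating the call four times by hand.
val transformed_data = Seq("field1", "field2", "field3", "field4")
  .foldLeft(sample_data)((df, c) => df.withColumn(c, trimUdf(col(c))))

The map in your quoted code fails because Dataset.map hands you org.apache.spark.sql.Row objects, and Row has no trim method; per-column transforms go through the DataFrame/Column API instead. There is also a non-UDF alternative sketched after the quoted message.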
On 1 Mar 2017 4:22 pm, "Bill Schwanitz" <bil...@bilsch.org> wrote:

> Hi all,
>
> I'm fairly new to Spark and Scala, so bear with me.
>
> I'm working with a dataset containing a set of columns/fields. The data
> is stored in HDFS as Parquet and is sourced from a Postgres box, so fields
> and values are reasonably well formed. We are in the process of trying out
> a switch from Pentaho and various SQL databases to pulling data into HDFS
> and applying transforms / building new datasets, with processing done in
> Spark (and other tools, still under evaluation).
>
> A rough version of the code I'm running so far:
>
> val sample_data = spark.read.parquet("my_data_input")
>
> val example_row = spark.sql("select * from parquet.my_data_input where id = 123").head
>
> I want to apply a trim operation on a set of fields - let's call them
> field1, field2, field3 and field4.
>
> What is the best way to go about applying those trims and creating a new
> dataset? Can I apply the trim to all fields in a single map, or do I need
> to apply multiple map functions?
>
> When I try the map (even with a single field):
>
> scala> val transformed_data = sample_data.map(
>      | _.trim(col("field1"))
>      | .trim(col("field2"))
>      | .trim(col("field3"))
>      | .trim(col("field4"))
>      | )
>
> I end up with the following error:
>
> <console>:26: error: value trim is not a member of org.apache.spark.sql.Row
>        _.trim(col("field1"))
>          ^
>
> Any ideas / guidance would be appreciated!
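For completeness, a sketch that skips the UDF entirely: Spark already ships a trim function in org.apache.spark.sql.functions, so the four columns can be rewritten directly (again assuming they are string-typed; the trimmed name is just illustrative):

import org.apache.spark.sql.functions.{col, trim}

// Each withColumn replaces the named column with its trimmed value and
// returns a new DataFrame; sample_data itself is left unchanged.
val trimmed = sample_data
  .withColumn("field1", trim(col("field1")))
  .withColumn("field2", trim(col("field2")))
  .withColumn("field3", trim(col("field3")))
  .withColumn("field4", trim(col("field4")))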