For #2, you can filter out rows whose first column has length 0.
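A minimal sketch of that filter (untested; df and the column names are taken from the thread below, and functions.length needs Spark 1.5+):

import org.apache.spark.sql.functions.{col, length}

// isNotNull alone does not catch empty strings, which is why
// substring(1) still fails on the [,,,,] rows; filtering on the
// length of the first column drops them before the map runs
val nonEmpty = df.filter(length(col("Invoice Number")) > 0)

Cheers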
> On Feb 20, 2016, at 6:59 AM, Mich Talebzadeh <[email protected]> wrote:
>
> Thanks
>
> So what I did was
>
> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
> df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]
>
> scala> df.printSchema
> root
>  |-- Invoice Number: string (nullable = true)
>  |-- Payment date: string (nullable = true)
>  |-- Net: string (nullable = true)
>  |-- VAT: string (nullable = true)
>  |-- Total: string (nullable = true)
>
> So all the columns are Strings.
>
> Then I tried to exclude null rows by filtering on all columns not being null and mapping the rest:
>
> scala> val a = df.filter(col("Invoice Number").isNotNull and col("Payment date").isNotNull and col("Net").isNotNull and col("VAT").isNotNull and col("Total").isNotNull).map(x => (x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble, x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
> a: org.apache.spark.rdd.RDD[(String, Double, Double, Double)] = MapPartitionsRDD[176] at map at <console>:21
>
> This still comes up with a "String index out of range" error:
>
> 16/02/20 11:50:51 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 18)
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>
> My questions are:
>
> 1. In the map, map(x => (x.getString(1), ...)), can I replace x.getString(1) with the actual column name, say "Invoice Number", and so forth for the other columns as well?
>
> 2. It sounds like it crashes because of these rows at the end:
>
> [421,02/10/2015,£1,187.50,£237.50,£1,425.00] \\ example good one
> [,,,,] \\ bad one, empty one
> [Net income,,£182,531.25,£14,606.25,£197,137.50]
> [,,,,] \\ bad one, empty one
> [year 2014,,£113,500.00,£0.00,£113,500.00]
> [Year 2015,,£69,031.25,£14,606.25,£83,637.50]
>
> 3. Also, to clarify, I want to drop those two empty lines -> [,,,,] if I can. Unfortunately the drop() call does not work:
>
> a.drop()
> <console>:24: error: value drop is not a member of org.apache.spark.rdd.RDD[(String, Double, Double, Double)]
>        a.drop()
>          ^
>
> Thanks again,
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> From: Michał Zieliński [mailto:[email protected]]
> Sent: 20 February 2016 08:59
> To: Mich Talebzadeh <[email protected]>
> Cc: user @spark <[email protected]>
> Subject: Re: Checking for null values when mapping
>
> You can use filter and isNotNull on Column before the map.
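On question 1 above: yes, Row has a getAs[T](fieldName: String) method, so the positional getString calls can be replaced with the actual column names. A sketch (untested, reusing nonEmpty from the sketch at the top of this mail):

val a = nonEmpty.map { x =>
  // same tuple as before, but fields looked up by column name instead of position
  (x.getAs[String]("Payment date"),
   x.getAs[String]("Net").substring(1).replace(",", "").toDouble,
   x.getAs[String]("VAT").substring(1).replace(",", "").toDouble,
   x.getAs[String]("Total").substring(1).replace(",", "").toDouble)
}

On question 3: drop() is not defined on RDDs, which is exactly what the compiler error says; the empty [,,,,] rows are best removed by the length filter above, before the map ever sees them.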
> > On 20 February 2016 at 08:24, Mich Talebzadeh <[email protected]> wrote:
> >
> > I have a DF like below, reading a csv file:
> >
> > val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
> >
> > val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble, x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
> >
> > For most of the rows read from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below:
> >
> > [421,02/10/2015,£1,187.50,£237.50,£1,425.00]
> > [,,,,]
> > [Net income,,£182,531.25,£14,606.25,£197,137.50]
> > [,,,,]
> > [year 2014,,£113,500.00,£0.00,£113,500.00]
> > [Year 2015,,£69,031.25,£14,606.25,£83,637.50]
> >
> > However, I get:
> >
> > a.collect.foreach(println)
> > 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
> > java.lang.StringIndexOutOfBoundsException: String index out of range: -1
> >
> > I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error.
> >
> > The easiest solution seems to be to check whether x above is not null before doing the substring operation. Can this be done without using a UDF?
> >
> > Thanks
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > http://talebzadehmich.wordpress.com
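For reference, the whole pipeline can also be written null- and empty-safe without a UDF, combining the isNotNull suggestion with a plain Scala function inside the map (a sketch only: toAmount is a hypothetical helper name, and the leading currency symbol is assumed to be exactly one character):

import org.apache.spark.sql.functions.{col, length}

// hypothetical helper: map null/empty fields to 0.0 so substring
// never sees an empty string, then strip the currency symbol and
// the thousands separators
def toAmount(s: String): Double =
  if (s == null || s.isEmpty) 0.0
  else s.substring(1).replace(",", "").toDouble

val a = df
  .filter(col("Invoice Number").isNotNull && length(col("Invoice Number")) > 0)
  .map(x => (x.getString(0), x.getString(1),
             toAmount(x.getString(2)), toAmount(x.getString(3)), toAmount(x.getString(4))))

A plain function applied inside map is ordinary Scala in the closure, not a Spark SQL UDF, so this stays within the "no UDF" constraint of the question.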
