For #2, you can filter out rows whose first column has length 0.
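A minimal sketch of that filter (untested; df and the column names are taken from the thread below, and functions.length needs Spark 1.5+):

import org.apache.spark.sql.functions.{col, length}

// isNotNull alone does not catch empty strings, which is why
// substring(1) still fails on the [,,,,] rows; filtering on the
// length of the first column drops them before the map runs
val nonEmpty = df.filter(length(col("Invoice Number")) > 0)

Cheers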
> On Feb 20, 2016, at 6:59 AM, Mich Talebzadeh <[email protected]> wrote:
>
> Thanks
>
> So what I did was
>
> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
> df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]
>
> scala> df.printSchema
> root
>  |-- Invoice Number: string (nullable = true)
>  |-- Payment date: string (nullable = true)
>  |-- Net: string (nullable = true)
>  |-- VAT: string (nullable = true)
>  |-- Total: string (nullable = true)
>
> So all the columns are Strings.
>
> Then I tried to exclude null rows by filtering on all columns not being null and mapping the rest:
>
> scala> val a = df.filter(col("Invoice Number").isNotNull and col("Payment date").isNotNull and col("Net").isNotNull and col("VAT").isNotNull and col("Total").isNotNull).map(x => (x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble, x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
> a: org.apache.spark.rdd.RDD[(String, Double, Double, Double)] = MapPartitionsRDD[176] at map at <console>:21
>
> This still comes up with a "String index out of range" error:
>
> 16/02/20 11:50:51 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 18)
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>
> My questions are:
>
> 1. In the map, map(x => (x.getString(1), ...)), can I replace x.getString(1) with the actual column name, say "Invoice Number", and so forth for the other columns as well?
>
> 2. It sounds like it crashes because of these rows at the end:
>
> [421,02/10/2015,£1,187.50,£237.50,£1,425.00] \\ example good one
> [,,,,] \\ bad one, empty one
> [Net income,,£182,531.25,£14,606.25,£197,137.50]
> [,,,,] \\ bad one, empty one
> [year 2014,,£113,500.00,£0.00,£113,500.00]
> [Year 2015,,£69,031.25,£14,606.25,£83,637.50]
>
> 3. Also, to clarify, I want to drop those two empty lines -> [,,,,] if I can. Unfortunately the drop() call does not work:
>
> a.drop()
> <console>:24: error: value drop is not a member of org.apache.spark.rdd.RDD[(String, Double, Double, Double)]
>        a.drop()
>          ^
>
> Thanks again,
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> From: Michał Zieliński [mailto:[email protected]]
> Sent: 20 February 2016 08:59
> To: Mich Talebzadeh <[email protected]>
> Cc: user @spark <[email protected]>
> Subject: Re: Checking for null values when mapping
>
> You can use filter and isNotNull on Column before the map.
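On question 1 above: yes, Row has a getAs[T](fieldName: String) method, so the positional getString calls can be replaced with the actual column names. A sketch (untested, reusing nonEmpty from the sketch at the top of this mail):

val a = nonEmpty.map { x =>
  // same tuple as before, but fields looked up by column name instead of position
  (x.getAs[String]("Payment date"),
   x.getAs[String]("Net").substring(1).replace(",", "").toDouble,
   x.getAs[String]("VAT").substring(1).replace(",", "").toDouble,
   x.getAs[String]("Total").substring(1).replace(",", "").toDouble)
}

On question 3: drop() is not defined on RDDs, which is exactly what the compiler error says; the empty [,,,,] rows are best removed by the length filter above, before the map ever sees them.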
> > On 20 February 2016 at 08:24, Mich Talebzadeh <[email protected]> wrote:
> >
> > I have a DF like below, reading a csv file:
> >
> > val df = HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
> >
> > val a = df.map(x => (x.getString(0), x.getString(1), x.getString(2).substring(1).replace(",", "").toDouble, x.getString(3).substring(1).replace(",", "").toDouble, x.getString(4).substring(1).replace(",", "").toDouble))
> >
> > For most of the rows read from the csv file the above mapping works fine. However, at the bottom of the csv there are a couple of empty rows, as below:
> >
> > [421,02/10/2015,£1,187.50,£237.50,£1,425.00]
> > [,,,,]
> > [Net income,,£182,531.25,£14,606.25,£197,137.50]
> > [,,,,]
> > [year 2014,,£113,500.00,£0.00,£113,500.00]
> > [Year 2015,,£69,031.25,£14,606.25,£83,637.50]
> >
> > However, I get:
> >
> > a.collect.foreach(println)
> > 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
> > java.lang.StringIndexOutOfBoundsException: String index out of range: -1
> >
> > I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error.
> >
> > The easiest solution seems to be to check whether x above is not null before doing the substring operation. Can this be done without using a UDF?
> >
> > Thanks
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> > http://talebzadehmich.wordpress.com
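For reference, the whole pipeline can also be written null- and empty-safe without a UDF, combining the isNotNull suggestion with a plain Scala function inside the map (a sketch only: toAmount is a hypothetical helper name, and the leading currency symbol is assumed to be exactly one character):

import org.apache.spark.sql.functions.{col, length}

// hypothetical helper: map null/empty fields to 0.0 so substring
// never sees an empty string, then strip the currency symbol and
// the thousands separators
def toAmount(s: String): Double =
  if (s == null || s.isEmpty) 0.0
  else s.substring(1).replace(",", "").toDouble

val a = df
  .filter(col("Invoice Number").isNotNull && length(col("Invoice Number")) > 0)
  .map(x => (x.getString(0), x.getString(1),
             toAmount(x.getString(2)), toAmount(x.getString(3)), toAmount(x.getString(4))))

A plain function applied inside map is ordinary Scala in the closure, not a Spark SQL UDF, so this stays within the "no UDF" constraint of the question.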
