Re: Treadting NaN fields in Spark

Mich Talebzadeh Thu, 29 Sep 2016 08:29:57 -0700

Good points :) it took take "-" as a negative number -123456?

At this moment in time this is what the code does



   1. csv is imported into HDFS as is. No cleaning done for rogue columns
   done at shell level
   2. Spark programs does the following filtration:
   3. val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer")
   > 0)

So my first line of defence is to check for !== "-" which is a dash,
commonly used for not available. The next filter is for volume column > 0
(there was trades on this stock), otherwise the calculation could skew the
results.  Note that a filter with AND with !== will not work.

scala> val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)
<console>:40: error: value && is not a member of String
       val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)

Will throw an error.
But this equality === works!

scala> val rs = df2.filter($"Open" *=== "-"* && $"Volume".cast("Integer") >
0)
rs: org.apache.spark.sql.Dataset[columns] = [Stock: string, Ticker: string
... 6 more fields]


Another alternative is to check for all digits here

 scala> def isAllPostiveNumber (price: String) = price forall
Character.isDigit
isAllPostiveNumber: (price: String)Boolean
Retuns Boolean true or false.  But does not work unless someone tells me
what is wrong with this below!

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)
<console>:1: error: not a legal formal parameter.
Note: Tuples cannot be directly destructured in method or function
parameters.
      Either create a single parameter accepting the Tuple1,
      or consider a pattern matching anonymous function: `{ case (param1,
param1) => ... }
val rs = df2.filter(isAllPostiveNumber("Open") => true)


Thanks












Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 13:45, Michael Segel <msegel_had...@hotmail.com>
wrote:

> Hi,
>
> Just a few thoughts so take it for what its worth…
>
> Databases have static schemas and will reject a row’s column on insert.
>
> In your case… you have one data set where you have a column which is
> supposed to be a number but you have it as a string.
> You want to convert this to a double in your final data set.
>
>
> It looks like your problem is that your original data set that you
> ingested used a ‘-‘ (dash) to represent missing data, rather than a NULL
> value.
> In fact, looking at the rows… you seem to have a stock that didn’t trade
> for a given day. (All have Volume as 0. ) Why do you need this?  Wouldn’t
> you want to represent this as null or no row for a given date?
>
> The reason your ‘-‘ check failed when isnan() is that ‘-‘ actually could
> be represented as a number.
>
> If you replaced the ‘-‘ with a String that is wider than the width of a
> double … the isnan should flag the row.
>
> (I still need more coffee, so I could be wrong) ;-)
>
> HTH
>
> -Mike
>
> On Sep 28, 2016, at 5:56 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>
> This is an issue in most databases. Specifically if a field is NaN.. --> (
> *NaN*, standing for not a number, is a numeric data type value
> representing an undefined or unrepresentable value, especially in
> floating-point calculations)
>
> There is a method called isnan() in Spark that is supposed to handle this
> scenario . However, it does not return correct values! For example I
> defined column "Open" as String  (it should be Float) and it has the
> following 7 rogue entries out of 1272 rows in a csv
>
> df2.filter( $"OPen" === 
> "-").select((changeToDate("TradeDate").as("TradeDate")),
> 'Open, 'High, 'Low, 'Close, 'Volume).show
>
> +----------+----+----+---+-----+------+
> | TradeDate|Open|High|Low|Close|Volume|
> +----------+----+----+---+-----+------+
> |2011-12-23|   -|   -|  -|40.56|     0|
> |2011-04-21|   -|   -|  -|45.85|     0|
> |2010-12-30|   -|   -|  -|38.10|     0|
> |2010-12-23|   -|   -|  -|38.36|     0|
> |2008-04-30|   -|   -|  -|32.39|     0|
> |2008-04-29|   -|   -|  -|33.05|     0|
> |2008-04-28|   -|   -|  -|32.60|     0|
> +----------+----+----+---+-----+------+
>
> However, the following does not work!
>
>  df2.filter(isnan($"Open")).show
> +-----+------+---------+----+----+---+-----+------+
> |Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
> +-----+------+---------+----+----+---+-----+------+
> +-----+------+---------+----+----+---+-----+------+
>
> Any suggestions?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>

Re: Treadting NaN fields in Spark

Reply via email to