This is an issue in most databases, specifically when a field is NaN.
(*NaN*, standing for "not a number", is a numeric data type value
representing an undefined or unrepresentable value, especially in
floating-point calculations.)
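
To make the definition concrete, here is a minimal plain-Scala sketch (no Spark involved) of how NaN behaves, and of how it differs from text that is simply not numeric. The object name NanBasics and the helper parseCell are hypothetical names made up for this illustration.

```scala
import scala.util.Try

object NanBasics {
  // Hypothetical helper: read a raw CSV cell as a Double.
  // Text that is not numeric (such as "-") gives None rather than NaN.
  def parseCell(raw: String): Option[Double] = Try(raw.trim.toDouble).toOption

  def main(args: Array[String]): Unit = {
    val nan = Double.NaN

    println(nan.isNaN)     // NaN is detected with isNaN
    println(nan == nan)    // NaN never compares equal, not even to itself

    println(parseCell("-"))                    // no Double can be parsed from "-"
    println(parseCell("40.56").map(_.isNaN))   // an ordinary number is not NaN
  }
}
```

The point of the sketch is that NaN is a value a floating-point field can hold, whereas a dash in a text field is not a floating-point value at all.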

There is a function called isnan() in Spark that is supposed to handle this
scenario. However, it does not return the values I expect! For example, I
defined column "Open" as String (it should be Float), and it has the
following 7 rogue entries out of 1272 rows in a CSV file:

df2.filter($"Open" === "-")
  .select(changeToDate("TradeDate").as("TradeDate"),
    'Open, 'High, 'Low, 'Close, 'Volume)
  .show

+----------+----+----+---+-----+------+
| TradeDate|Open|High|Low|Close|Volume|
+----------+----+----+---+-----+------+
|2011-12-23|   -|   -|  -|40.56|     0|
|2011-04-21|   -|   -|  -|45.85|     0|
|2010-12-30|   -|   -|  -|38.10|     0|
|2010-12-23|   -|   -|  -|38.36|     0|
|2008-04-30|   -|   -|  -|32.39|     0|
|2008-04-29|   -|   -|  -|33.05|     0|
|2008-04-28|   -|   -|  -|32.60|     0|
+----------+----+----+---+-----+------+

However, the following returns no rows!

df2.filter(isnan($"Open")).show
+-----+------+---------+----+----+---+-----+------+
|Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
+-----+------+---------+----+----+---+-----+------+
+-----+------+---------+----+----+---+-----+------+
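
One possible explanation, offered as a sketch rather than a confirmed diagnosis: isnan() tests for the floating-point NaN value, while "-" in a String column is just text that cannot be parsed as a number. Assuming the default (non-ANSI) cast behaviour, casting such text to double gives null rather than NaN, so a filter on the cast being null may be closer to what is wanted. The object name DashIsNotNaN, the helper unparseableOpen, and the three sample rows are hypothetical stand-ins for df2.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, isnan}

object DashIsNotNaN {
  // Hypothetical helper: rows whose "Open" text is present but cannot be
  // read as a number (assumes a failed cast yields null, i.e. non-ANSI mode).
  def unparseableOpen(df: DataFrame): DataFrame =
    df.filter(col("Open").isNotNull && col("Open").cast("double").isNull)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dash-is-not-nan")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample standing in for df2, with Open kept as String.
    val sample = Seq(
      ("2011-12-23", "-"),
      ("2011-12-22", "40.10"),
      ("2011-04-21", "-")
    ).toDF("TradeDate", "Open")

    sample.filter(isnan(col("Open"))).show()  // the filter from this email
    unparseableOpen(sample).show()            // the cast-based alternative

    spark.stop()
  }
}
```

If the assumption about casts holds, the first filter should come back empty, as it did above, and the second should pick out the rows containing "-".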

Any suggestions?

Thanks


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
