This is an issue in most databases, specifically when a field is NaN. (NaN, standing for "not a number", is a numeric data type value representing an undefined or unrepresentable value, especially in floating-point calculations.)
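To make the definition concrete, here is a minimal sketch in plain Scala (no Spark needed). The names NaNDemo, isRealNaN and parse are made up for illustration. It is only meant to show that NaN is a specific floating-point value, and that a non-numeric string such as "-" is a different thing: it simply fails to parse.

```scala
import scala.util.Try

object NaNDemo {
  // True only for a genuine floating-point NaN value.
  def isRealNaN(x: Double): Boolean = x.isNaN

  // Try to read text as a Double; None when the text is not numeric.
  def parse(s: String): Option[Double] = Try(s.trim.toDouble).toOption

  def main(args: Array[String]): Unit = {
    val nan = 0.0 / 0.0
    println(isRealNaN(nan))  // true
    println(nan == nan)      // false: NaN is not equal to itself
    println(parse("40.56"))  // Some(40.56)
    println(parse("-"))      // None: not a number, but not NaN either
    println(parse("NaN"))    // Some(NaN): the literal text "NaN" does parse
  }
}
```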
There is a function called isnan() in Spark that is supposed to handle this scenario. However, it does not return correct values! For example, I defined the column "Open" as String (it should be Float), and it has the following 7 rogue entries out of 1272 rows in a CSV:

    df2.filter($"Open" === "-").select((changeToDate("TradeDate").as("TradeDate")), 'Open, 'High, 'Low, 'Close, 'Volume).show

    +----------+----+----+---+-----+------+
    | TradeDate|Open|High|Low|Close|Volume|
    +----------+----+----+---+-----+------+
    |2011-12-23|   -|   -|  -|40.56|     0|
    |2011-04-21|   -|   -|  -|45.85|     0|
    |2010-12-30|   -|   -|  -|38.10|     0|
    |2010-12-23|   -|   -|  -|38.36|     0|
    |2008-04-30|   -|   -|  -|32.39|     0|
    |2008-04-29|   -|   -|  -|33.05|     0|
    |2008-04-28|   -|   -|  -|32.60|     0|
    +----------+----+----+---+-----+------+

However, the following does not work!

    df2.filter(isnan($"Open")).show

    +-----+------+---------+----+----+---+-----+------+
    |Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
    +-----+------+---------+----+----+---+-----+------+
    +-----+------+---------+----+----+---+-----+------+

Any suggestions?

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
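One possible explanation for the empty result, offered as a sketch rather than a confirmed diagnosis: isnan() only matches a floating-point NaN, while "-" in a String column is just text, and casting it to double gives null rather than NaN. This assumes Spark's non-ANSI cast behaviour; with spark.sql.ansi.enabled set to true the cast would raise an error instead. The object name NonNumericOpen, the helper nonNumeric and the sample rows other than the first are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, isnan}

object NonNumericOpen {
  // Rows where `column` holds a value that cannot be read as a number.
  // Assumption: non-ANSI cast semantics, where a failed cast yields null.
  def nonNumeric(df: DataFrame, column: String): DataFrame =
    df.filter(col(column).isNotNull && col(column).cast("double").isNull)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("non-numeric-open")
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample; Open is a String column, as in the post.
    val df = Seq(
      ("2011-12-23", "-", "40.56"),
      ("2011-12-22", "40.10", "40.30"),
      ("2011-12-21", "NaN", "39.90")
    ).toDF("TradeDate", "Open", "Close")

    // Should pick out the "-" row: the value is text, not a number.
    nonNumeric(df, "Open").show()

    // isnan only matches values that really are NaN after the cast,
    // so the "-" row is not expected here.
    df.filter(isnan(col("Open").cast("double"))).show()

    spark.stop()
  }
}
```

If this reading is right, filtering on the failed cast (or reading the CSV with "-" treated as a null value) would be the way to find the rogue rows, and isnan() would be reserved for columns that are already Float or Double.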