Need to convert Dataset to HashMap

2018-09-27 Thread rishmanisation
I am writing a data-profiling application that needs to iterate over a large .gz file (imported as a Dataset). Each key-value pair in the hashmap will be a distinct value from a column and the number of times it occurs in that column. There is one hashmap per column, and they are all combined into a JSON document at the end.
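
A minimal sketch of that kind of per-column counting, assuming the distinct values of each column fit in driver memory (column names and types here are whatever the Dataset actually has; nothing is specific to the poster's data):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.count

// One frequency map per column, keyed by the stringified cell value.
// Assumes the distinct values of each column fit on the driver.
def columnFrequencies(df: DataFrame): Map[String, Map[String, Long]] =
  df.columns.map { colName =>
    val freq = df.groupBy(colName)
      .agg(count("*").as("cnt"))
      .collect()
      .map(row => (Option(row.get(0)).map(_.toString).getOrElse("null"), row.getLong(1)))
      .toMap
    colName -> freq
  }.toMap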

Re: Need to convert Dataset to HashMap

2018-09-28 Thread rishmanisation
Thanks for the response! I'm not sure caching 'freq' would make sense, since there are multiple columns in the file and it will be different for each column. The original data format is .gz (gzip). I am a newbie to Spark, so could you please give a little more detail on the appropri…
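
One point worth knowing about the .gz input: gzip is not a splittable format, so Spark reads the whole file with a single task. A hedged sketch of reading it and spreading the work out afterwards (path, options and partition count are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("profiler").getOrCreate()

// The entire .gz file is decompressed by one task; repartitioning right after
// the read lets the later per-column aggregations run in parallel, and caching
// keeps the decompressed rows in memory across those passes.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv.gz")
  .repartition(64)

df.cache()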

Re: Need to convert Dataset to HashMap

2018-09-28 Thread rishmanisation
Thanks for the help so far. I tried caching, but the operation seems to take forever. Any tips on how I can speed it up? Also, I am not sure a case class would work, since different files have different structures (I am parsing a 1 GB file right now, but there are a few different files…
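
If the file layouts differ, one option (a sketch, not the poster's code) is to skip case classes entirely and build or infer the schema at run time; for value-frequency profiling, reading every column as a string is usually enough. Field names below are illustrative placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val spark = SparkSession.builder().getOrCreate()

// Build a StructType from whatever column names a given file has,
// rather than defining a case class per file layout.
def readWithColumns(path: String, columnNames: Seq[String]) = {
  val schema = StructType(columnNames.map(StructField(_, StringType, nullable = true)))
  spark.read.schema(schema).option("header", "true").csv(path)
}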

Application crashes when encountering Oracle timestamp

2018-10-16 Thread rishmanisation
I am writing a Spark application to profile an Oracle database. The application works perfectly on tables without timestamp columns, but when I try to profile a table with a timestamp column I run into the following error: Exception in thread "main" java.sql.SQLException: Unrecognized SQL type -10
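
This error typically comes from an Oracle-specific timestamp variant (such as TIMESTAMP WITH TIME ZONE) that Spark's JDBC dialect cannot map. A commonly used workaround, sketched here with placeholder table, column and connection details, is to cast the column to a plain TIMESTAMP (or a string) in a pushed-down subquery so the driver returns a type Spark recognizes:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Cast the problematic column on the Oracle side before Spark sees it.
val query = "(SELECT id, CAST(created_ts AS TIMESTAMP) AS created_ts FROM my_table) t"

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", query)
  .option("user", "scott")
  .option("password", "secret")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()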