I am writing a data-profiling application that needs to iterate over a large
.gz file (imported as a Dataset). There is one HashMap per column: each
key-value pair is a distinct value from that column and the number of times it
occurs. All of the maps are written out as JSON at the end.
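A minimal sketch of this kind of per-column frequency count in Spark (the input path, the CSV/header options, and collecting the counts to the driver are all assumptions here, not my exact code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

object ColumnProfiler {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("profiler").getOrCreate()

    // Hypothetical input; Spark decompresses .gz transparently on read
    val df = spark.read.option("header", "true").csv("data/input.csv.gz")

    // One value -> count map per column, collected to the driver
    val freqPerColumn: Map[String, Map[String, Long]] = df.columns.map { c =>
      val freq = df.groupBy(col(c)).agg(count("*").as("cnt"))
        .collect()
        .map(r => Option(r.get(0)).map(_.toString).getOrElse("null") -> r.getAs[Long]("cnt"))
        .toMap
      c -> freq
    }.toMap

    // freqPerColumn is what gets serialized to JSON at the end
    println(freqPerColumn)
  }
}
```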
Thanks for the response! I'm not sure caching 'freq' would make sense, since
there are multiple columns in the file and 'freq' is different for each column.
Original data format is .gz (gzip).
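If caching is worth doing at all here, the usual candidate would be the parsed source DataFrame rather than any per-column 'freq' result, since every column pass re-reads the same input. A minimal sketch (the path, partition count, and storage level are placeholder assumptions):

```scala
import org.apache.spark.storage.StorageLevel

// Cache the parsed source once so every per-column pass reads from memory
// instead of re-parsing the gzip file (.gz is not splittable, so the initial
// read is a single task; repartition spreads the rows out afterwards).
val df = spark.read
  .option("header", "true")
  .csv("data/input.csv.gz")              // hypothetical path
  .repartition(8)                        // arbitrary example value
  .persist(StorageLevel.MEMORY_AND_DISK)

df.count()                               // materialize the cache up front
```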
I am a newbie to Spark, so could you please give a little more detail on the
appropri
Thanks for the help so far. I tried caching, but the operation seems to be
taking forever. Any tips on how I can speed it up?
Also, I am not sure a case class would work, since different files have
different structures (I am parsing a 1 GB file right now, but there are a few
different files).
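What I'm hoping for is something schema-agnostic, roughly along these lines (the path and the inferSchema option below are just placeholders, not my real code):

```scala
// Schema discovered per file at runtime, so no case class is needed;
// df.columns / df.schema adapt to whichever file is being profiled.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // optional: types guessed from the data
  .csv("data/another_file.csv.gz") // hypothetical second file with a different layout

df.printSchema()                   // structure differs per file, same profiling code
```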
I am writing a Spark application to profile an Oracle database. The
application works fine when there are no timestamp columns, but when I try
to profile a database with a timestamp column I run into the following
error:
Exception in thread "main" java.sql.SQLException: Unrecognized SQL type -10
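For reference, a commonly suggested workaround for unrecognized Oracle JDBC type codes is to register a custom JdbcDialect that maps the offending type code to a Catalyst type Spark understands (casting the column inside the dbtable subquery is another option). A sketch, where the -10 -> StringType mapping is only an assumption about how the column should be read:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Custom dialect that maps the unrecognized vendor-specific type code to a
// Catalyst type. The -10 check and the StringType mapping are assumptions;
// adjust them to whatever the column actually holds.
object OracleProfilingDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int,
      typeName: String,
      size: Int,
      md: MetadataBuilder): Option[DataType] =
    if (sqlType == -10) Some(StringType) else None // fall back to the default mapping otherwise
}

// Register the dialect before reading from Oracle via spark.read.jdbc(...)
JdbcDialects.registerDialect(OracleProfilingDialect)
```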