Need to convert Dataset to HashMap

2018-09-27 Thread rishmanisation
I am writing a data-profiling application that needs to iterate over a large
.gz file (imported as a Dataset). Each key-value pair in the hashmap maps a
row value to the number of times it occurs in the column. There is one
hashmap per column, and they are all added to a JSON document at the end.
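
For illustration, a minimal sketch of that structure (the names here are
illustrative, not from the actual code):

import java.util.HashMap;
import java.util.Map;

// One frequency map per column, keyed by column name; the outer map is
// what gets serialized to JSON at the end.
Map<String, Map<String, String>> columnProfiles = new HashMap<>();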

For now, I am using the following logic to generate the hashmap for a
column:

import java.util.HashMap;
import java.util.Iterator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> freq = df
    .groupBy(columnName)
    .count();

HashMap<String, String> myHashMap = new HashMap<>();

// Stream the aggregated rows to the driver one partition at a time.
Iterator<Row> rowIterator = freq.toLocalIterator();
while (rowIterator.hasNext()) {
    Row currRow = rowIterator.next();
    // Read the fields directly instead of parsing currRow.toString(),
    // which breaks whenever a value contains a comma.
    String value = currRow.isNullAt(0) ? "null" : currRow.get(0).toString();
    long count = currRow.getLong(1);   // count() always yields a long
    double percent = count * 100.0 / numOfRows;
    myHashMap.put(value, Double.toString(percent));
}

I have also tried converting to an RDD and using collectAsMap(), but both
approaches take a very long time (about 5 minutes per column, where each
column has approximately 30 million rows). Is there a more efficient way to
achieve the same?
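
For reference, the collectAsMap() attempt was roughly along these lines (a
sketch, not the exact code I ran):

import java.util.Map;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// Collect the aggregated (value, count) pairs to the driver as a single map.
Map<String, Long> counts = freq.javaRDD()
    .mapToPair((Row row) -> new Tuple2<String, Long>(
        row.isNullAt(0) ? "null" : row.get(0).toString(),
        row.getLong(1)))
    .collectAsMap();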






Re: Need to convert Dataset to HashMap

2018-09-28 Thread rishmanisation
Thanks for the response! I'm not sure caching 'freq' would make sense, since
there are multiple columns in the file and 'freq' is different for each of
them.
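
If the caching is meant to go on the source DataFrame rather than on each
per-column 'freq', I assume it would look roughly like this (sketch):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Cache the source DataFrame once, since every per-column groupBy re-reads
// it otherwise; each per-column 'freq' result is only used once.
df.cache();
for (String columnName : df.columns()) {
    Dataset<Row> freq = df.groupBy(columnName).count();
    // ... build the per-column map as in my first message ...
}
df.unpersist();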

Original data format is .gz (gzip).

I am a newbie to Spark, so could you please give a little more detail on
the appropriate case class?

Thanks!






Re: Need to convert Dataset to HashMap

2018-09-28 Thread rishmanisation
Thanks for the help so far. I tried caching but the operation seems to be
taking forever. Any tips on how I can speed up this operation?

Also, I am not sure a case class would work, since different files have
different structures (I am parsing a 1 GB file right now, but there are a few
other files that I also need to run this on).
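
To clarify what I mean: because the structure differs per file, the column
list would have to come from the schema Spark infers at runtime rather than
from a fixed case class, something like this sketch:

import org.apache.spark.sql.types.StructField;

// Walk the schema inferred for this particular file, so no per-file
// case class / bean is needed.
for (StructField field : df.schema().fields()) {
    String columnName = field.name();
    // profile this column as before
}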






Application crashes when encountering Oracle timestamp

2018-10-16 Thread rishmanisation
I am writing a Spark application to profile an Oracle database. The
application works perfectly when there are no timestamp columns, but when I
try to profile a database that has a timestamp column, I run into the
following error:

Exception in thread "main" java.sql.SQLException: Unrecognized SQL type -102
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:238)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:307)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:307)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:306)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at DatabaseProfiling.main(DatabaseProfiling.java:209)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

It is a Java application. I have tried using both ojdbc7 and ojdbc14 jars
but neither has helped.

Code:

Dataset<Row> df = spark
    .read()
    .format("jdbc")
    .option("url", dbPath)
    .option("oracle.jdbc.mapDateToTimestamp", "false")
    .option("dbtable", tableName)
    .option("user", username)
    .option("password", password)
    .option("driver", driverClass)
    .load();

df.show(); //crashes right here and doesn't go any further

Has anyone encountered this issue and been able to fix it? Please advise.
Thanks!
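
One thing I am considering trying (an untested sketch with hypothetical table
and column names) is casting the column to a plain TIMESTAMP inside a dbtable
subquery, since type -102 appears to be Oracle's TIMESTAMP WITH LOCAL TIME
ZONE:

// Untested sketch: hypothetical table MY_TABLE with timestamp column TS_COL.
// The CAST makes the driver report a plain TIMESTAMP instead of SQL type -102.
String query = "(SELECT ID, CAST(TS_COL AS TIMESTAMP) AS TS_COL FROM MY_TABLE) t";

Dataset<Row> df = spark
    .read()
    .format("jdbc")
    .option("url", dbPath)
    .option("dbtable", query)
    .option("user", username)
    .option("password", password)
    .option("driver", driverClass)
    .load();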


