Need to convert Dataset to HashMap
I am writing a data-profiling application that needs to iterate over a large .gz file (imported as a Dataset). Each key-value pair in the HashMap will be a row value and the frequency with which it occurs in the column. There is one HashMap for each column, and they are all added to a JSON document at the end. For now, I am using the following logic to generate the HashMap for a column:

    Dataset<Row> freq = df
        .groupBy(columnName)
        .count();

    HashMap<String, String> myHashMap = new HashMap<>();
    Iterator<Row> rowIterator = freq.toLocalIterator();
    while (rowIterator.hasNext()) {
        Row currRow = rowIterator.next();
        // Read the two fields directly; parsing Row.toString() breaks
        // whenever a value itself contains a comma.
        String value = String.valueOf(currRow.get(0));
        long count = currRow.getLong(1);
        double percent = count * 100.0 / numOfRows;
        myHashMap.put(value, Double.toString(percent));
    }

I have also tried converting to an RDD and using the collectAsMap() function, but both approaches take a very long time (about 5 minutes per column, where each column has approx. 30 million rows). Is there a more efficient way to achieve the same result?
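A variant that may speed this up (a sketch, not from the thread; it reuses df, columnName, and numOfRows from the snippet above): push the percentage computation into Spark and collect the already-aggregated result in a single call, instead of streaming rows through toLocalIterator():

    import static org.apache.spark.sql.functions.col;

    // Aggregate and convert counts to percentages on the executors; only
    // the (already small) distinct-value summary reaches the driver.
    Dataset<Row> freq = df
        .groupBy(columnName)
        .count()
        .withColumn("percent", col("count").multiply(100.0).divide(numOfRows));

    HashMap<String, String> myHashMap = new HashMap<>();
    for (Row r : freq.select(columnName, "percent").collectAsList()) {
        myHashMap.put(String.valueOf(r.get(0)), Double.toString(r.getDouble(1)));
    }

The aggregated result has one row per distinct value, so collecting it is cheap; the expensive part remains the groupBy shuffle itself.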
Re: Need to convert Dataset to HashMap
Thanks for the response! I'm not sure caching 'freq' would make sense, since there are multiple columns in the file and the grouped result will be different for each column. The original data format is .gz (gzip). I am a newbie to Spark, so could you please give a few more details on what the appropriate case class would look like? Thanks!
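For later readers: the caching suggestion presumably applies to the source DataFrame rather than to each per-column 'freq'. A minimal sketch, assuming the df and per-column loop from the first message:

    // Materialize the parsed input once; each per-column aggregation then
    // reads the cached rows instead of re-reading the gzip file.
    df.cache();
    df.count(); // forces the cache to be populated

    for (String columnName : df.columns()) {
        Dataset<Row> freq = df.groupBy(columnName).count();
        // ... build the per-column HashMap as before ...
    }

    df.unpersist();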
Re: Need to convert Dataset to HashMap
Thanks for the help so far. I tried caching, but the operation seems to take forever. Any tips on how I can speed it up? Also, I am not sure a case class would work, since different files have different structures (I am parsing a 1 GB file right now, but there are a few other files that I also need to run this on).
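One plausible cause of the slow caching, offered as an assumption rather than a diagnosis: gzip is not a splittable format, so the whole file lands in a single partition and a single task does all the parsing. Repartitioning before the cache spreads the cached data (and the later aggregations) across the cluster; the partition count below is illustrative:

    // The .gz file arrives as one partition; spread it out before caching
    // so the groupBy work can run in parallel.
    df = df.repartition(64); // 64 is an arbitrary example value
    df.cache();
    df.count();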
Application crashes when encountering Oracle timestamp
I am writing a Spark application to profile an Oracle database. The application works perfectly on tables without any timestamp columns, but when I try to profile a table with a timestamp column I run into the following error:

    Exception in thread "main" java.sql.SQLException: Unrecognized SQL type -102
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:238)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:307)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:307)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:306)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
        at DatabaseProfiling.main(DatabaseProfiling.java:209)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

It is a Java application. I have tried using both the ojdbc7 and ojdbc14 jars, but neither has helped. Code:

    Dataset<Row> df = spark
        .read()
        .format("jdbc")
        .option("url", dbPath)
        .option("oracle.jdbc.mapDateToTimestamp", "false")
        .option("dbtable", tableName)
        .option("user", username)
        .option("password", password)
        .option("driver", driverClass)
        .load();

    df.show(); // crashes right here and doesn't go any further

Has anyone encountered this issue and been able to fix it? Please advise. Thanks!
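SQL type -102 is Oracle's TIMESTAMP WITH LOCAL TIME ZONE, which the built-in Oracle dialect in this Spark version does not map. One known workaround is to register a custom JdbcDialect before the read that supplies the missing mapping. A minimal sketch (the class name is made up, and mapping to TimestampType discards the local-time-zone semantics):

    import org.apache.spark.sql.jdbc.JdbcDialect;
    import org.apache.spark.sql.jdbc.JdbcDialects;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.MetadataBuilder;
    import scala.Option;

    // Maps Oracle's TIMESTAMP WITH LOCAL TIME ZONE (JDBC type code -102)
    // to Spark's TimestampType so schema resolution no longer fails.
    public class OracleLtzDialect extends JdbcDialect {
        @Override
        public boolean canHandle(String url) {
            return url.toLowerCase().startsWith("jdbc:oracle");
        }

        @Override
        public Option<DataType> getCatalystType(
                int sqlType, String typeName, int size, MetadataBuilder md) {
            if (sqlType == -102) {
                return Option.apply(DataTypes.TimestampType);
            }
            return Option.empty();
        }
    }

Register it once before building the reader:

    JdbcDialects.registerDialect(new OracleLtzDialect());
    Dataset<Row> df = spark.read().format("jdbc") /* ... options as before ... */ .load();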