I am using Spark to read from a database and write to HDFS as a Parquet file. Here is the code snippet:
private long etlFunction(SparkSession spark) {
    spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "snappy");

    Properties properties = new Properties();
    properties.put("driver", "oracle.jdbc.driver.OracleDriver");
    properties.put("fetchSize", "5000");

    Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);
    dataset.write().format("parquet").save("pdfs-path");
    return dataset.count();
}

When I look at the Spark UI, the write stage already reports the number of records written; it is visible in the SQL tab under the query plan. The count(), however, is a heavy job of its own, because the dataset is not cached and computing the count re-reads everything from the database. Can someone suggest the most optimized way to get the count? Thanks all.
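For reference, here is a minimal sketch of one possible workaround (my own addition, assuming the data fits in memory plus local disk): persisting the dataset before the write, so the subsequent count() is served from the cache instead of triggering a second JDBC read. The method name etlFunctionWithCache is illustrative; jdbcUrl and query are assumed to be defined elsewhere, as in the snippet above.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;
import java.util.Properties;

// Sketch only: persist() caches the JDBC result so that write() and count()
// share one scan of the source table instead of scanning it twice.
private long etlFunctionWithCache(SparkSession spark) {
    spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "snappy");

    Properties properties = new Properties();
    properties.put("driver", "oracle.jdbc.driver.OracleDriver");
    properties.put("fetchSize", "5000");

    Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties)
            .persist(StorageLevel.MEMORY_AND_DISK()); // spill to disk if it does not fit in memory

    dataset.write().format("parquet").save("pdfs-path");
    long count = dataset.count(); // answered from the cache, no second database read
    dataset.unpersist();          // release the cached blocks once done
    return count;
}

The trade-off is the extra memory/disk spent on the cache, so I would still prefer a way to reuse the "records written" statistic that the write itself already produces in the UI.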