Hello,
I have a requirement where I need to compute, per group, the total row count and
the count of failed rows.
The code looks like below:
myDataset.createOrReplaceTempView("temp_view");
Dataset<Row> countDataset = sparkSession.sql(
    "SELECT column1, column2, column3, column4, column5, column6, column7, column8, "
        + "count(*) AS totalRows, "
        + "sum(CASE WHEN column8 IS NULL THEN 1 ELSE 0 END) AS failedRows "
        + "FROM temp_view "
        + "GROUP BY column1, column2, column3, column4, column5, column6, column7, column8");
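For reference, the aggregation I am after is just a per-group tally: count every
row in the group, and additionally count the rows whose column8 is NULL. A minimal
sketch of that logic in plain Java (no Spark; the row type and its fields are
hypothetical, and it is simplified to two grouping columns instead of eight):

```java
import java.util.*;

public class GroupCounts {
    // Per-group tallies: total rows, and rows where the checked column is null.
    static final class Counts {
        long totalRows = 0;
        long failedRows = 0;
    }

    // Hypothetical row type: two grouping columns plus the nullable column8.
    record DemoRow(String column1, String column2, String column8) {}

    // Equivalent of: GROUP BY column1, column2 with count(*) AS totalRows and
    // sum(CASE WHEN column8 IS NULL THEN 1 ELSE 0 END) AS failedRows.
    static Map<List<String>, Counts> aggregate(List<DemoRow> rows) {
        Map<List<String>, Counts> result = new HashMap<>();
        for (DemoRow r : rows) {
            List<String> key = List.of(r.column1(), r.column2());
            Counts c = result.computeIfAbsent(key, k -> new Counts());
            c.totalRows++;
            if (r.column8() == null) {
                c.failedRows++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<DemoRow> rows = List.of(
            new DemoRow("a", "x", "ok"),
            new DemoRow("a", "x", null),
            new DemoRow("b", "y", null));
        Counts ax = aggregate(rows).get(List.of("a", "x"));
        System.out.println(ax.totalRows + "," + ax.failedRows); // prints 2,1
    }
}
```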
Up to around 50 million records the query performed acceptably. Beyond that it
fails, usually with an out-of-memory exception.
I have read the documentation and several blogs; most of them give examples using
RDD.reduceByKey, but here I am working with Datasets and Spark SQL.
What am I missing here?
Any help will be appreciated.
Thanks!
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/