[pyspark 2.3+] Querying non-partitioned @TB data table is too slow

Rishi Shah Sun, 09 Jun 2019 14:51:42 -0700

Hi All,

I have a table with 3 TB of data, stored as Parquet with Snappy compression,
with 100 columns. I am filtering the DataFrame on a date column (date between
20190501 and 20190530), selecting only 20 columns, and counting. This
operation takes about 45 minutes!


Shouldn't Parquet apply predicate pushdown and filter the data without
scanning the entire dataset?
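
For context on the question above, here is a minimal sketch of how row-group
skipping behaves in principle. It is a toy model in plain Python, not Spark's
or Parquet's actual reader: the RowGroup class, the helper functions, the
group size, and the synthetic dates are all hypothetical. It assumes the
reader can only skip a row group when that group's min/max statistics for the
date column fall entirely outside the filter range, and it compares a table
written in date order with one written in arbitrary order.

```python
import random
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for the min/max statistics of one row group."""
    min_date: int
    max_date: int


def build_row_groups(dates, rows_per_group):
    """Cut the rows into consecutive chunks and record each chunk's min/max."""
    groups = []
    for start in range(0, len(dates), rows_per_group):
        chunk = dates[start:start + rows_per_group]
        groups.append(RowGroup(min(chunk), max(chunk)))
    return groups


def groups_to_read(groups, lo, hi):
    """Keep only the groups whose [min, max] range overlaps [lo, hi]."""
    return [g for g in groups if g.max_date >= lo and g.min_date <= hi]


# Synthetic data (an assumption for illustration): six months of yyyymmdd
# dates, 28 days per month, 100 rows per date.
dates = [
    20190000 + month * 100 + day
    for month in range(1, 7)
    for day in range(1, 29)
    for _ in range(100)
]

# Same rows, but in arbitrary order, as in a table not laid out by date.
shuffled = dates[:]
random.Random(0).shuffle(shuffled)

ROWS_PER_GROUP = 1000
LO, HI = 20190501, 20190530

sorted_groups = build_row_groups(dates, ROWS_PER_GROUP)
shuffled_groups = build_row_groups(shuffled, ROWS_PER_GROUP)

sorted_hits = groups_to_read(sorted_groups, LO, HI)
shuffled_hits = groups_to_read(shuffled_groups, LO, HI)

print("row groups in table:", len(sorted_groups))
print("read when sorted by date:", len(sorted_hits))
print("read when unsorted:", len(shuffled_hits))
```

In this toy model the filter only prunes data when rows with similar dates
sit in the same row groups. If every row group spans the whole date range,
the pushed-down filter cannot rule any of them out, which would be
consistent with a full scan of a non-partitioned table.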

-- 
Regards,

Rishi Shah
