Hi all, I'm relatively new to Spark and I'm struggling to optimize a sort-merge join over data loaded from Parquet.
My work consists of computing statistics on purchases for a retail company. For example, I have to calculate the mean purchase over a period, for a segment of products and a segment of clients. This information is spread across several tables, so I have to join them:

1. a client table: ID_CLIENT, CLIENT_SEG
2. a ticket table: ID_CLIENT, ID_TICKET, DATE
3. a detailed ticket table: ID_CLIENT, ID_TICKET, ID_PRODUCT, PRODUCT_SEG

To improve speed, I tried saving the Parquet files after a hash repartition on the join keys, but reloading those Parquet files still requires a lot of shuffling for the sort-merge join.

How can I shuffle the data once and for all, so that subsequent queries run faster?

Thanks,

Antoine Bonnin
Data scientist, C-Ways
antoine.bon...@c-ways.com
06 65 37 99 60
www.c-ways.com
@cways_fr (https://twitter.com/cways_fr)
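P.S. For concreteness, here is a simplified sketch of what I am doing. The column names match the schema above, but the paths, the partition count, the date range, the segment values, and the AMOUNT column are stand-ins:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PurchaseStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("purchase-stats").getOrCreate()

    // Step 1: hash-repartition each table on the join key before saving,
    // hoping the layout survives the Parquet round trip (paths are made up).
    val numParts = 200
    for (table <- Seq("clients", "tickets", "ticket_details")) {
      spark.read.parquet(s"/raw/$table")
        .repartition(numParts, col("ID_CLIENT"))
        .write.mode("overwrite").parquet(s"/prepared/$table")
    }

    // Step 2: reload and join. Spark no longer knows the files are
    // hash-partitioned, so the sort-merge join plan still contains an
    // Exchange (shuffle) on every input.
    val clients = spark.read.parquet("/prepared/clients")
    val tickets = spark.read.parquet("/prepared/tickets")
    val details = spark.read.parquet("/prepared/ticket_details")

    // Mean purchase over a period, for one client segment and one product
    // segment. AMOUNT is a stand-in for whatever measure I average.
    val meanPurchase = details
      .join(tickets, Seq("ID_CLIENT", "ID_TICKET"))
      .join(clients, Seq("ID_CLIENT"))
      .where(col("DATE").between("2016-01-01", "2016-06-30"))
      .where(col("CLIENT_SEG") === "A" && col("PRODUCT_SEG") === "X")
      .agg(avg(col("AMOUNT")))

    meanPurchase.explain() // still shows Exchange hashpartitioning nodes

    spark.stop()
  }
}

Even after the round trip through Parquet, explain() shows shuffle exchanges on every side of each join, which is exactly the cost I was hoping to pay only once.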