https://docs.databricks.com/spark/latest/spark-sql/skew-join.html
The link above might help if you are using a join.
On Mon, Jul 23, 2018 at 4:49 AM, 崔苗 wrote:
> but how do we get count(distinct userId) group by company from count(distinct
> userId) group by company+x?
> count(userId) is different
Try divide and conquer: create a column x for the first character of userId,
and group by company+x. Since each userId falls into exactly one bucket, the
per-bucket distinct counts can be summed per company without double counting.
If the buckets are still too large, try the first two characters.
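Here is a minimal sketch of that idea in plain Python (in Spark you would do the same with a derived bucket column and two aggregation stages); the function and sample data are illustrative, not from the original thread:

```python
from collections import defaultdict

def distinct_users_per_company(rows):
    """Divide and conquer: bucket userIds by their first character,
    count distinct userIds per (company, bucket), then sum the
    per-bucket counts. Each userId lands in exactly one bucket,
    so the per-bucket distinct counts add up without double counting."""
    # Stage 1: distinct userIds per (company, first-character bucket).
    buckets = defaultdict(set)
    for company, user_id in rows:
        buckets[(company, user_id[0])].add(user_id)

    # Stage 2: sum the per-bucket distinct counts for each company.
    totals = defaultdict(int)
    for (company, _bucket), users in buckets.items():
        totals[company] += len(users)
    return dict(totals)

rows = [
    ("acme", "alice"), ("acme", "bob"), ("acme", "alice"),
    ("acme", "ann"), ("beta", "bob"),
]
print(distinct_users_per_company(rows))  # → {'acme': 3, 'beta': 1}
```

In Spark the first stage keeps each partial aggregation small enough to fit in memory, and the second stage only has to shuffle one count per (company, bucket) pair.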
On 17 July 2018 at 02:25, 崔苗 wrote:
> 30G of user data: how do we get the distinct user count after creating a
> composite key based on company and userId?