How to estimate the RDD size before the RDD result is written to disk

2019-12-19 Thread zhangliyun
Hi all: I want to ask how to estimate the size of an RDD (in bytes) before it is saved to disk, because the job takes a long time if the output is very large and the number of output partitions is small. The following steps are what I have come up with for this problem: 1. sample 0.01 's or
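
The sampling idea mentioned in this snippet can be illustrated with a minimal plain-Python sketch. It is not the poster's code and does not use Spark: the function name `estimate_total_bytes`, the 1% default fraction, and the use of `pickle` as a size proxy are all assumptions made for illustration. In Spark, the sampling step would correspond to something like `RDD.sample(False, 0.01)`. The real on-disk size also depends on file format and compression, so the result is only a rough guide.

```python
import pickle
import random


def estimate_total_bytes(records, fraction=0.01, seed=42):
    """Estimate the serialized size of `records` from a random sample.

    Hypothetical helper: pickle size is only a proxy for what a real
    writer (Parquet, ORC, text) would produce on disk.
    """
    rng = random.Random(seed)
    # Keep each record with probability `fraction` (Bernoulli sampling).
    sample = [r for r in records if rng.random() < fraction]
    if not sample:
        return 0
    sample_bytes = sum(len(pickle.dumps(r)) for r in sample)
    # Scale by the share actually sampled, not the nominal fraction.
    return int(sample_bytes * len(records) / len(sample))


# Illustrative data: 100,000 small tuples.
records = [("user-%06d" % i, i, float(i)) for i in range(100_000)]
estimate = estimate_total_bytes(records)
exact = sum(len(pickle.dumps(r)) for r in records)
print("estimated bytes:", estimate)
print("exact bytes:    ", exact)
```

An estimate like this could then be divided by a target file size to pick the number of output partitions before the write starts.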

A question about RDD byte size

2019-12-01 Thread zhangliyun
Hi: I want to get the total bytes of a DataFrame using the following function, but when I insert the DataFrame into Hive, I found that the value returned by the function differs from spark.sql.statistics.totalSize. The spark.sql.statistics.totalSize is less than the result of the following function getRDDB

A question about skew join hint

2019-11-04 Thread zhangliyun
Hi all: I saw the skewed join hint optimization in https://docs.azuredatabricks.net/delta/join-performance/skew-join.html. It is a great feature that helps users avoid the problems caused by skewed data. My questions: 1. In which version will we have this? I have not found the feature in the ma

Re:Re: A question about broadcast nest loop join

2019-10-23 Thread zhangliyun
generated by using a NOT IN (subquery). If you are OK with slightly different NULL semantics, then you could use NOT EXISTS (subquery). The latter should perform a lot better. On Wed, Oct 23, 2019 at 12:02 PM zhangliyun wrote: Hi all: I want to ask a question about broadcast nestloop join?
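
The "slightly different NULL semantics" mentioned in this reply can be seen in a small example. The sketch below uses Python's built-in `sqlite3` as a stand-in, not Spark, and the tables `a` and `b` are hypothetical. It relies on standard SQL three-valued logic, which Spark SQL is understood to follow for these two constructs as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (NULL);
""")

# NOT IN: a NULL in the subquery makes "id NOT IN (...)" evaluate to
# NULL (unknown) for every non-matching row, so no rows pass the filter.
not_in = conn.execute(
    "SELECT id FROM a WHERE id NOT IN (SELECT id FROM b) ORDER BY id"
).fetchall()

# NOT EXISTS: the correlated equality never matches a NULL row, so rows
# without a match in b are returned as one would usually expect.
not_exists = conn.execute(
    "SELECT id FROM a "
    "WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id) ORDER BY id"
).fetchall()

print("NOT IN:    ", not_in)
print("NOT EXISTS:", not_exists)
conn.close()
```

When the subquery column can contain NULL, the two forms return different rows, so the rewrite is only safe if that difference is acceptable or the column is known to be non-null.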

Re:Re: A question about broadcast nest loop join

2019-10-23 Thread zhangliyun
hen OOM happens. Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but currently Spark is only able to deal with it using broadcast. On Wed, Oct 23, 2019 at 6:02 PM zhangliyun wrote: Hi all: I want to ask a question about broadcast nest

A question about broadcast nest loop join

2019-10-23 Thread zhangliyun
Hi all: I want to ask a question about broadcast nested loop join. From Google I learned that left outer/semi joins and right outer/semi joins will use broadcast nested loop join, and that in some cases, when the input data is very small, it is suitable to use. So how is "very small" input data defined here?