Hi all:
I want to ask how to estimate the size of an RDD (in bytes) when it has not
been saved to disk, because the job takes a long time if the output is very
large and the number of output partitions is small.
The following steps are what I have come up with for this problem:
1.sample 0.01 's or
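
A sampling-based estimate along these lines could look roughly like the sketch
below. It is only an illustration under stated assumptions: a plain Python list
stands in for the RDD, pickle stands in for the serializer, and
`estimate_bytes` with its parameters is a hypothetical helper, not a Spark API.
On a real RDD the sample would presumably come from `rdd.sample(False, 0.01)`.

```python
import pickle
import random


def estimate_bytes(records, fraction=0.01, seed=42):
    """Estimate the total serialized size of `records` from a random sample.

    Hypothetical helper for illustration; `records` is a list standing in
    for an RDD, and pickle stands in for whatever serializer the job uses.
    """
    rng = random.Random(seed)
    # Keep each record with probability `fraction` (Bernoulli sampling).
    sample = [r for r in records if rng.random() < fraction]
    if not sample:
        return 0
    sampled_bytes = sum(len(pickle.dumps(r)) for r in sample)
    # Scale by the share of records actually sampled, not the nominal fraction.
    return int(sampled_bytes * len(records) / len(sample))


# Made-up records, only to exercise the helper.
records = [("user-%06d" % i, i * 1.5) for i in range(100_000)]
estimate = estimate_bytes(records)
exact = sum(len(pickle.dumps(r)) for r in records)
print(estimate, exact)
```

The estimate is only as good as the sample is representative: if record sizes
are heavily skewed, a 1% sample can be noticeably off.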
Hi:
I want to get the total size in bytes of a DataFrame using the following
function, but when I insert the DataFrame into Hive, I find that the function's
result differs from spark.sql.statistics.totalSize. The
spark.sql.statistics.totalSize value is smaller than the result of the following function
getRDDB
Hi all:
I saw the skewed join hint optimization in
https://docs.azuredatabricks.net/delta/join-performance/skew-join.html.
It is a great feature that helps users avoid the problems caused by skewed
data. My questions:
1. Which version will have this? I have not found the feature in the ma
generated by using a NOT IN (subquery). If you are OK with slightly different
NULL semantics, you could use NOT EXISTS (subquery) instead.
The latter should perform a lot better.
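
To make the NULL-semantics difference concrete, here is a small sketch. It uses
SQLite from the Python standard library as a stand-in, on the assumption that
it is enough to show the standard SQL three-valued logic that Spark SQL also
follows. The table names and data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER);
    CREATE TABLE blocked (customer_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);
    -- Note the NULL in the subquery's source table.
    INSERT INTO blocked VALUES (2), (NULL);
""")

# NOT IN: comparing against a NULL yields UNKNOWN, so no row can qualify.
not_in = conn.execute("""
    SELECT customer_id FROM orders
    WHERE customer_id NOT IN (SELECT customer_id FROM blocked)
    ORDER BY customer_id
""").fetchall()

# NOT EXISTS: the NULL row never matches the equality, so it is ignored.
not_exists = conn.execute("""
    SELECT o.customer_id FROM orders o
    WHERE NOT EXISTS (
        SELECT 1 FROM blocked b WHERE b.customer_id = o.customer_id
    )
    ORDER BY o.customer_id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(1,), (3,)]
```

If the subquery column can never be NULL, the two forms return the same rows,
so the rewrite is safe in that case.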
On Wed, Oct 23, 2019 at 12:02 PM zhangliyun wrote:
Hi all:
I want to ask a question about broadcast nested loop join.
hen OOM happens.
Maybe there is an algorithm to implement left/right join in a distributed
environment without broadcast, but currently Spark is only able to deal with it
using broadcast.
On Wed, Oct 23, 2019 at 6:02 PM zhangliyun wrote:
Hi all:
I want to ask a question about broadcast nest
Hi all:
I want to ask a question about broadcast nested loop join. From Google I
learned that left outer/semi joins and right outer/semi joins will use
broadcast nested loop join, and that in some cases, when the input data is
very small, it is a suitable choice. So my question is:
how is "very small" input data defined?