When I run this job in local mode (spark-submit --master local[4]) with

spark = SparkSession.builder \
    .appName("tests") \
    .enableHiveSupport() \
    .getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
df3.explain(extended=True)

and no caching, I see this p
Hi, Mich:
Thanks for your reply, but maybe I didn't make my question clear.
I am looking for a way to compute the count of each element in an array
without "exploding" the array, and to output a Map structure as a column.
For example, for an array such as ('a', 'b', 'a'), I want to output a column
containing the map ('a' -> 2, 'b' -> 1).
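One way to sketch this (an assumption on my part, not necessarily the only or best approach) is a plain Python function built on collections.Counter, which could then be wrapped as a PySpark UDF returning a MapType column. The counting logic itself runs per row, so no explode is needed:

```python
from collections import Counter

def count_elements(arr):
    # Build an {element: count} dict from one array value,
    # mirroring the desired Map column, without exploding.
    return dict(Counter(arr)) if arr is not None else None

# Hypothetical PySpark wiring (df and column name "arr" are assumptions):
# from pyspark.sql import functions as F
# from pyspark.sql.types import MapType, StringType, IntegerType
# count_udf = F.udf(count_elements, MapType(StringType(), IntegerType()))
# df = df.withColumn("counts", count_udf(F.col("arr")))

print(count_elements(['a', 'b', 'a']))  # → {'a': 2, 'b': 1}
```

A UDF does cost serialization overhead; on Spark 2.4+ the same counting could likely be expressed with the built-in higher-order array/map functions instead, at the price of a more involved SQL expression.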
I do not think InMemoryFileIndex means the data is being cached. Caches
show up as InMemoryTableScan; InMemoryFileIndex is just for partition
discovery and partition pruning.
Any read will always show up as a scan from an InMemoryFileIndex. It is not
cached data. It is a cached file index. Please
When you run this in YARN mode, it uses a Broadcast Hash Join for the join
operation, as shown in the following output. The datasets here are the same
size, so Spark broadcasts one dataset to all of the executors, then streams
the other dataset through a hash join.
That is typical of joins. No surprises here.
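The mechanics behind that plan can be sketched in plain Python (a simplified, single-process sketch; the row shapes and column name are assumptions): the broadcast side is materialized as a hash table, and the other side is streamed and probed against it.

```python
from collections import defaultdict

def broadcast_hash_join(broadcast_rows, streamed_rows, key):
    """Simplified sketch of a broadcast hash join over lists of dicts."""
    # Build phase: hash the broadcast side by the join key
    # (in Spark this table is shipped to every executor).
    table = defaultdict(list)
    for row in broadcast_rows:
        table[row[key]].append(row)

    # Probe phase: stream the other side and look up matches.
    for row in streamed_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": "p"}, {"id": 3, "b": "q"}]
print(list(broadcast_hash_join(left, right, "id")))
# → [{'id': 1, 'a': 'x', 'b': 'p'}]
```

In real PySpark you would not write this yourself; hinting with F.broadcast(df2) in df1.join(F.broadcast(df2), "id") is the usual way to request this strategy when the optimizer does not pick it automatically.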