I have a Hadoop cluster that uses Apache Spark to query Parquet files stored on
HDFS. For example, I use the following PySpark code to find a word in the
Parquet files:
df = spark.read.parquet("hdfs://test/parquets/*")
df.filter(df['word'] == "jhon").show()
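(Here spark is the session the pyspark shell provides; a minimal standalone
equivalent, with an app name I made up just for this test, would be:)

from pyspark.sql import SparkSession

# Minimal standalone setup; in the pyspark shell this session already exists.
spark = SparkSession.builder.appName("locality-test").getOrCreate()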
After running this code, I go to the stages tab of the Spark application UI
and see that the locality level summary is ANY. In contrast, given the nature
of this query, it should run locally, at the NODE_LOCAL locality level at
least. When I check the cluster's network I/O while the query runs, I find
that it does use the network (network I/O increases while the query is
running). The strange part is that the numbers shown in the shuffle section of
the Spark UI are very small.
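One thing I know how to check is the scheduler's locality wait; a minimal
sketch from the same session (the config key and its 3s default are standard
Spark settings, and the fsck path matches my example above):

# spark.locality.wait is how long the scheduler holds a task for a
# NODE_LOCAL slot before degrading to RACK_LOCAL and then ANY
# (standard Spark config, default "3s"; "0" disables the wait).
print(spark.conf.get("spark.locality.wait", "3s"))

# Where did HDFS actually place the blocks? From a shell:
#   hdfs fsck /test/parquets -files -blocks -locations
# If those DataNodes are not the hosts running my executors,
# NODE_LOCAL is impossible and ANY would be expected.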
How can I find the root cause of this problem and fix it?
Stack Overflow link:
https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache