I have a Hadoop cluster and use Apache Spark to query Parquet files stored on 
HDFS. For example, I use the following PySpark code to find a word in the 
Parquet files:
# read every Parquet file under the directory and filter on the 'word' column
df = spark.read.parquet("hdfs://test/parquets/*")
df.filter(df['word'] == "jhon").show()
After running this code, I go to the Spark application UI, Stages tab, and see 
that the locality level summary is Any. Given the nature of this query, which 
only scans and filters data already stored on the cluster, it should run at 
NODE_LOCAL locality level at least. Yet when I check the cluster's network I/O 
while the query runs, I find that it does use the network (network I/O 
increases for the duration of the query). The strange part is that the numbers 
shown in the Spark UI's shuffle section are very small.
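
For reference, the scheduler settings that control how long Spark waits for a 
local slot before downgrading a task to a less local level can be read from 
the same session. A minimal sketch, assuming the usual `spark` session 
variable is available (the "unset" fallback string is just my placeholder):

# Print the locality-wait settings that govern when the scheduler
# downgrades a task from PROCESS_LOCAL/NODE_LOCAL toward RACK_LOCAL/ANY.
for key in ("spark.locality.wait",
            "spark.locality.wait.process",
            "spark.locality.wait.node",
            "spark.locality.wait.rack"):
    print(key, "=", spark.conf.get(key, "unset"))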
How can I find the root cause of this problem and fix it?
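
One thing I considered trying is raising the locality wait so the scheduler 
holds out longer for a node-local slot before scheduling the task anywhere. A 
sketch of how that would look when building the session (the 10s values and 
the app name are arbitrary, for illustration only):

from pyspark.sql import SparkSession

# Give the scheduler more time to find a node-local slot before it
# falls back to ANY; 10s is an arbitrary value, not a recommendation.
# These configs take effect only if no SparkContext exists yet.
spark = (SparkSession.builder
         .appName("locality-test")
         .config("spark.locality.wait", "10s")
         .config("spark.locality.wait.node", "10s")
         .getOrCreate())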
Stack Overflow link: 
https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
