Hi. We use Spark 3.0.1 on an HDFS cluster, and we store our files as Parquet with Snappy compression and dictionary encoding enabled. We run a simple query:

parquetFile = spark.read.parquet("path/to/hadf")
parquetFile.createOrReplaceTempView("parquetFile")
spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY timestamp LIMIT 10000")

The condition in the WHERE clause is chosen on purpose so that no record matches it.
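For reference, whether the filter actually reaches the Parquet reader can be checked from the physical plan (a minimal sketch, using the same SparkSession and the placeholder names from the query above):

df = spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' ORDER BY timestamp LIMIT 10000")

# The FileScan node of the plan should list the condition under
# PushedFilters, e.g. [IsNotNull(field1), EqualTo(field1,value)].
# When it does, row groups whose min/max statistics exclude 'value'
# can be skipped without decompressing their pages.
df.explain(True)

# Pushdown is controlled by this setting (true by default in Spark 3.x):
print(spark.conf.get("spark.sql.parquet.filterPushdown"))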
However, this query takes about 5-6 minutes to complete on our cluster (all tasks NODE_LOCAL) with a simple Spark configuration. (The input data is about 8 TB in the tests below, but it can be much larger.)

We decided to measure disk, network, memory, and CPU consumption in order to find the bottleneck of this query, but the results were stranger than expected, as described below. We set up dashboards for network, disk, memory, and CPU usage with our monitoring tools so that we could observe the cluster while the query was running.

1) First, we increased the amount of CPU allocated to Spark to 2 and then about 2.5 times its initial value. After the last increase, not all of the dedicated CPU was used, yet we saw no change in the duration of the query. (As a general note, across all tests the execution time varied by 0 to 20 seconds, which is insignificant against a 5-minute baseline.)

2) We then similarly increased the memory allocated to Spark to 2 and then about 2.5 times its original value. After the last increase, the query did not use all of the memory allocated to Spark, but again the duration of the query did not change.

3) Throughout these tests we monitored network traffic (sent and received) across the whole cluster. We can run other queries whose network consumption is 2 to almost 3 times that of the query above, which shows that this query does not come close to the cluster's network capacity. It did seem strange that a query which matches no records and runs completely node-local needs the network at all, but we assumed it is needed for some internal reasons; even under that assumption, we were still far from the network capacity.

4) We also checked the disk read and write rates in all of these tests. As in the third case, we can run queries with about 4 times the read and write rate of this query, and our disks can sustain much more. For this query the write rate is almost zero (as we expected) and the read rate is moderate, far below the disks' capacity, so disk cannot be the bottleneck either.

After all these tests, with the query still taking 5 minutes, we do not know what else could be the limiting factor, and it is strange to us that such a simple query needs this much time (with such execution times, heavier queries take even longer).

Does it make sense for this query to take so long? Is there something we should pay attention to, or something we can change to improve it?

Thanks,
Amin Borjian