Hi.

We use Spark 3.0.1 on an HDFS cluster, and we store our files as Parquet with 
Snappy compression and dictionary encoding enabled. We try to perform a simple 
query:

parquetFile = spark.read.parquet("path/to/hadf")
parquetFile.createOrReplaceTempView("parquetFile")
spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' "
          "ORDER BY timestamp LIMIT 10000")

The condition in the WHERE clause is deliberately chosen so that no record 
matches. Nevertheless, this query takes about 5-6 minutes to complete on our 
cluster (with NODE_LOCAL tasks) under a simple Spark configuration. (The input 
data is about 8 TB in the following tests, but can be much larger.)

We decided to measure disk, network, memory, and CPU consumption in order to 
find the bottleneck in this query. However, the results were much stranger 
than expected, as we discuss below.

We set up dashboards in our monitoring tools for network, disk, memory, and 
CPU usage, so we could observe cluster conditions while the query was running.

1) First, we increased the amount of CPU allocated to Spark to 2x and then to 
about 2.5x its initial value (see the configuration sketch after this list). 
Even though, after the last increase, not all of the dedicated CPU was being 
used, we saw no change in the duration of the query. (As a general point, 
across all tests the execution times varied by only 0-20 seconds, which is 
insignificant relative to 5 minutes.)

2) We then similarly increased the amount of memory allocated to Spark to 2x 
and then 2.5x its original value (again, see the sketch after this list). With 
the last increase, the query did not use all of the memory allocated to Spark, 
but once more we saw no change in the duration of the query.

3) Throughout these tests, we monitored network traffic (bytes sent and 
received) across the whole cluster. We can run other queries whose network 
consumption is 2-3x that of the query in this email, which shows that we are 
nowhere near the cluster's network capacity here. Admittedly, it was strange 
to us that a query that matches no records and runs completely node-local 
needs the network at all, but we assumed it probably needs it for some reason, 
and even under that assumption we were still very far from network capacity.

4) In all these tests, we also tracked disk reads and writes. As in the third 
case, we can run queries with about 4x the read and write rate of the query in 
this email, and our disks can sustain much more. For this query, the write 
rate is almost zero (as we expected) and the read rate is moderate, far below 
the disks' capacity, so disk cannot be the bottleneck either.
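
For completeness, the resource changes in points 1 and 2 were made through the 
standard Spark settings; the values below are only illustrative, not our exact 
numbers:

from pyspark.sql import SparkSession

# Illustrative resource configuration; our real values differ.
spark = (SparkSession.builder
         .appName("parquet-scan-test")
         .config("spark.executor.instances", "20")  # scaled up in test 1
         .config("spark.executor.cores", "4")       # scaled up in test 1
         .config("spark.executor.memory", "8g")     # scaled up in test 2
         .getOrCreate())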

After all these tests, with the query still taking 5 minutes, we did not know 
what else could be the limiting factor, and it was strange to us that the 
simple query above needs such a run time (especially since, with such a 
baseline, heavier queries take even longer).
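
One isolation test we are considering (a sketch; it assumes the same temp view 
as above) is to time the scan and filter alone against the full query, to see 
whether the ORDER BY adds anything on top of the raw Parquet scan:

import time

# Scan + filter only, no sort.
t0 = time.time()
spark.sql("SELECT count(*) FROM parquetFile WHERE field1 = 'value'").collect()
print("scan + filter only: %.1f s" % (time.time() - t0))

# Full query with the sort and limit.
t0 = time.time()
spark.sql("SELECT * FROM parquetFile WHERE field1 = 'value' "
          "ORDER BY timestamp LIMIT 10000").collect()
print("scan + filter + sort: %.1f s" % (time.time() - t0))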

Does it make sense for this query to take so long? Is there something we need 
to pay attention to, or something we can change to improve it?

Thanks,
Amin Borjian
