Hi Julien,

I assume you mean you are using JDBC drivers to retrieve data from the source
table in Hive (older version) to the target table in Hive (newer version).

1) What JDBC drivers are you using?
2) Are both environments kerberized?
3) Have you considered other JDBC drivers for Hive? For example:

 hive_driver: "org.apache.hive.jdbc.HiveDriver"   ## default
hive_driver: com.cloudera.hive.jdbc41.HS2Driver ## Cloudera
hive_driver: "com.ddtek.jdbcx.hive.HiveDataSource" ## Progress direct
hive_driver: "com.ddtek.jdbc.hive.HiveDriver" ## Progress direct

Besides JDBC, I think there may be other disk-read issues involved (you can get
stats on those from Unix tools like iostat etc.).

Another thing to try is to read from the same table on both the old and new
clusters, record the timings and compare the reads.
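
Something along these lines would give you comparable rows/sec figures for
both clusters. Just a sketch; the URLs, credentials and the query are
placeholders, not your actual values:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CompareReads {

    // Run the same query against one cluster and report rows/sec for the fetch
    static void timeRead(String url, String sql) throws Exception {
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            long start = System.nanoTime();
            long rows = 0;
            while (rs.next()) {
                rows++;   // consume every row so the full fetch cost is measured
            }
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%s -> %d rows in %.1f s (%.0f rows/s)%n",
                              url, rows, secs, rows / secs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder query and URLs -- substitute your own table and hosts
        String sql = "SELECT * FROM some_table WHERE t_date >= '2021-02-01'";
        timeRead("jdbc:hive2://old-cluster:10000/default", sql);
        timeRead("jdbc:hive2://new-cluster:10000/default", sql);
    }
}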

If the issue is the throughput through JDBC itself, then you can test another
driver for it.
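
One more knob worth checking (an assumption on my part, not something from
your mail): the JDBC fetch size, i.e. how many rows the driver pulls per
round trip, can make a noticeable difference to retrieval throughput. A
minimal sketch, with the URL, table and the value 10000 purely illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeTest {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://new-cluster:10000/default";   // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {
            // Larger batches mean fewer client/server round trips per row fetched
            stmt.setFetchSize(10000);
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM some_table")) {
                long rows = 0;
                while (rs.next()) {
                    rows++;
                }
                System.out.println("Fetched " + rows + " rows");
            }
        }
    }
}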

As a matter of interest, are you doing all of this through a Java, Python,
etc. interface?

HTH



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 12 Feb 2021 at 15:10, Julien Tane <j...@solute.de> wrote:

> Dear all,
>
> we are in the process of switching from our old cluster with HDP 2.5:
>
> HDFS    2.7.3
> YARN    2.7.3
> Tez     0.7.0
> Hive    1.2.1000
> to a new cluster with HDP 3.1:
>
> HDFS    3.1.1.3.1
> YARN    3.1.0
> HIVE    3.0.0.3.1
> Tez     0.9.0.3.1
>
> We (1st) query and (2nd) retrieve data from table_0 from old cluster to 
> target-machine machine_a.
> We (1st) query and (2nd) retrieve data from table_1 from new cluster to same 
> target-machine machine_a.
> table_0 and table_1 are defined in the exact same way (partitioned by t_date) 
> and hold the exact same data.
>
> The querying from table_0 on old cluster and the querying from table_1 from 
> the new cluster show the same performance. All is good so far.
>
> After the query is processed and data is ready to be retrieved, we start data 
> retrieval with JDBC-driver. The data-retrieval-performance from the old 
> cluster is ca. 40'000 rows/sec whereas the data-retrieval-performance from 
> the new cluster is ca. 20'000 rows/sec. This big performance decrease is a 
> problem!
>
> Things we tried:
>     - Made sure that there's no bandwidth issue with the new version.
>     - We tried downloading and uploading from and to HDFS on both the old and new 
> cluster using HDFScli. We observed a difference in data-transfer performance, 
> with the new cluster being ca. 1.5x slower than the old cluster.
>     - We made following observations while experimenting:
>         When we filled the table with only 3 days' worth of data, the new 
> cluster loaded faster than the old one.
>
>         When we filled the table with 2 years' worth of data and selected only 
> 3 days in the SQL statement, the new cluster loaded slower than the old one.
>
>         The old cluster loaded at the same speed each time (regardless of the 
> number of days) whereas the new cluster varied between 25,000 and 42,000 rows/s 
> for a low number of days.
>
>         So it seems that if number of partitions increases, the 
> data-retrieval-performance from the new cluster decreases whereas the 
> data-retrieval-performance from the old cluster stays approx. the same.
>
> Questions:
>     q1) Do you have an idea about what this low data-retrieval-performance 
> could be caused by?
>     q2) How do we use the Hive Logging/Debug Infrastructure to find out what 
> the row throughput is?
>     q3) How do we use the HDFS Logging/Debug Infrastructure to find out what 
> the row throughput is?
>     q4) What are the parameters and settings we could use to make sure the 
> data-retrieval-performance is (as) high (as possible)?
>     q5) Could the garbage collector be slowing down the data-retrieval to 
> this extent? How can we find out?
>
> Looking forward to your ideas,
> Julien Tane
>
>
>
> Julien Tane
> Big Data Engineer
>
> Tel.: +49 721 98993-393
> Fax: +49 721 98993-66
> E-Mail: j...@solute.de
>
> solute GmbH
> Zeppelinstraße 15
> 76185 Karlsruhe
> Germany
>
>
>
> Marken der solute GmbH | brands of solute GmbH
> Geschäftsführer | Managing Director: Dr. Thilo Gans, Bernd Vermaaten
> Webseite | www.solute.de
> Sitz | Registered Office: Karlsruhe
> Registergericht | Register Court: Amtsgericht Mannheim
> Registernummer | Register No.: HRB 110579
> USt-ID | VAT ID: DE234663798
>
> Informationen zum Datenschutz | Information about privacy policy
> https://www.solute.de/ger/datenschutz/grundsaetze-der-datenverarbeitung.php
>
>
>
>
