Re: Performance issue with hive metastore

Peter Vary Thu, 30 Jan 2020 01:08:22 -0800

Hi Nirav,

There are several configurations which could affect the number of parallel 
queries running in your environment, depending on your Hive version.


The Thrift client is not thread safe, and this causes a bottleneck in both the 
client - HS2 and the HS2 - HMS communication.
Hive solves this by creating its own connections at the Session level.

I am not sure what spark.sql does exactly, but my guess is that it reuses the 
HS2 connection and with it the Session. You might be able to increase your 
throughput by creating multiple connections.
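
To illustrate the "multiple connections" idea, here is a minimal Python sketch 
of a worker pool where every thread opens and keeps its own connection instead 
of sharing one. It is only an assumption about how the client side could be 
organised: `connect` is a hypothetical factory you would supply yourself, and 
the object it returns is assumed to have an `execute(sql)` method. Nothing here 
is a Hive or Spark API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThreadConnections:
    """Hands each worker thread its own connection instead of sharing one."""

    def __init__(self, connect):
        self._connect = connect          # hypothetical factory: () -> connection
        self._local = threading.local()  # storage that is private to each thread
        self._lock = threading.Lock()    # protects the counter below
        self.opened = 0                  # number of connections created so far

    def get(self):
        # Reuse this thread's connection if it already has one.
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = self._connect()
            self._local.conn = conn
            with self._lock:
                self.opened += 1
        return conn


def run_all(statements, connect, workers=8):
    """Run every statement on a pool where each thread owns one connection."""
    conns = PerThreadConnections(connect)

    def run_one(sql):
        # Assumption: the connection object exposes execute(sql).
        return conns.get().execute(sql)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_one, statements))
    return results, conns.opened
```

With a pool of N workers this opens at most N connections, so each one maps to 
its own Session on the server side. The sketch never closes the connections; a 
real client would need to do that when the pool shuts down.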

Thanks,
Peter


> On Jan 30, 2020, at 02:04, Nirav Patel <npa...@xactlycorp.com> wrote:
> 
> 
> Hi,
> 
> I am trying to run 1000s of update operations on parquet partitions of 
> different Hive tables in parallel from my client application. I am using 
> Spark SQL with Hive enabled in my application to submit the Hive queries.
> 
> spark.sql(" ALTER TABLE mytable PARTITION (a=3, b=3) SET LOCATION 
>         '/newdata/mytable/a=3/b=3/part.parquet'")
> 
> I can see that all the queries are submitted via different threads from my 
> fork-join pool. However, I couldn't scale this operation no matter how I 
> tweaked the thread pool. Then I started observing the Hive metastore logs, and 
> I see that only one thread is making all the writes.
> 
>     2020-01-29T16:27:15,638  INFO [pool-6-thread-163] 
> metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb 
> tbl=mytable1
> 2020-01-29T16:27:15,638  INFO [pool-6-thread-163] HiveMetaStore.audit: 
> ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb 
> tbl=mytable1    
> 2020-01-29T16:27:15,653  INFO [pool-6-thread-163] metastore.HiveMetaStore: 
> 163: source:10.250.70.14 get_database: mydb
> 2020-01-29T16:27:15,653  INFO [pool-6-thread-163] HiveMetaStore.audit: 
> ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb  
> 2020-01-29T16:27:15,655  INFO [pool-6-thread-163] metastore.HiveMetaStore: 
> 163: source:10.250.70.14 get_table : db=mydb tbl=mytable2
> 2020-01-29T16:27:15,656  INFO [pool-6-thread-163] HiveMetaStore.audit: 
> ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb 
> tbl=mytable2    
> 2020-01-29T16:27:15,670  INFO [pool-6-thread-163] metastore.HiveMetaStore: 
> 163: source:10.250.70.14 get_database: mydb
> 2020-01-29T16:27:15,670  INFO [pool-6-thread-163] HiveMetaStore.audit: 
> ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb  
> 2020-01-29T16:27:15,672  INFO [pool-6-thread-163] metastore.HiveMetaStore: 
> 163: source:10.250.70.14 get_table : db=mydb tbl=mytable3
> 2020-01-29T16:27:15,672  INFO [pool-6-thread-163] HiveMetaStore.audit: 
> ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb 
> tbl=mytable3
> 
> All actions are performed by only one thread, pool-6-thread-163. I have 
> scanned 100s of lines and it is always the same thread. I don't see much 
> logged in the hiveserver.log file.
> 
> I see the following default values in the Hive documentation:
> 
> hive.metastore.server.min.threads Default Value: 200 
> hive.metastore.server.max.threads Default Value: 100000
> 
> which should be good enough, so why is just one thread doing all the work? Is 
> it bound to the consumer IP? That would make sense, as I am submitting all 
> jobs from a single machine.
> 
> 
> 
> Am I missing any configuration or is there any issue with this approach from 
> my application side?
> 
> 
> 
> Thanks,
> 
> Nirav
> 
> 
