> Am 20.06.2016 um 20:20 schrieb Gopal Vijayaraghavan <gop...@apache.org>:
> 
> 
>> is hosting the HiveServer2 is merely sending data with around 3 MB/sec.
>> Our network is capable of much more. Playing around with `fetchSize` did
>> not increase throughput.
> ...
>> --hiveconf 
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
>> \
> 
> The current implementation you have is CPU-bound in HiveServer2, and the
> compression generally makes it worse.
> 
> The fetch size does help, but mainly by reducing how often the system
> performs synchronized operations (pausing every 50 rows is too often;
> the default is now 10000 rows).
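Just to check my understanding of the fetch-size point: each fetch is a synchronized client/server round trip, so the batch size bounds the number of requests. A back-of-envelope sketch (the 10M row count is made up):

```java
// Back-of-envelope: number of round trips needed to stream a result
// set, given the fetch (batch) size. The row count is hypothetical.
public class FetchRoundTrips {
    static long roundTrips(long totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 10_000_000L; // hypothetical result size
        System.out.println(roundTrips(rows, 50));     // old 50-row batches
        System.out.println(roundTrips(rows, 10_000)); // current default
    }
}
```

So going from 50 to 10000 rows per batch cuts the synchronization points by a factor of 200, which matches why tuning it helps less once the server itself is CPU-bound.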
> 
>>   -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
>> month=6 AND day=1 AND hour=10)' > /dev/null
> 
> Quick q - are year/month/day/hour partition columns? If so, there might be
> a very different fix to this problem.

Yes, year, month, day and hour are partition columns, i.e. I want to export 
exactly one partition. In my real use case I will also apply another filter (WHERE 
some_other_column = <x>), but in the case above it is exactly the data of one 
partition that I want.

> 
>> In all cases, Hive is able only to utilize a tiny fraction of the
>> bandwidth that is available. Is there a possibility to increase network
>> throughput?
> 
> A series of work-items is in progress for fixing large-row-set
> performance in HiveServer2:
> 
> https://issues.apache.org/jira/browse/HIVE-11527
> 
> https://issues.apache.org/jira/browse/HIVE-12427
> 
> It would be great to attach a profiler to your HiveServer2 and see which
> functions are hot; that will help fix those codepaths as part of the
> joint effort with the ODBC driver teams.

I’ll see what I can do. I can’t restart the server at will though, since other 
teams are using it as well. 

> 
> Cheers,
> Gopal
> 

Thank you :)
-David
