> On 20.06.2016 at 20:20, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>
>> is hosting the HiveServer2 is merely sending data with around 3 MB/sec.
>> Our network is capable of much more. Playing around with `fetchSize` did
>> not increase throughput.
> ...
>> --hiveconf
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
>> \
>
> The current implementation you have is CPU bound in HiveServer2, and the
> compression generally makes it worse.
>
> The fetch size does help, but it only prevents the system from doing
> synchronized operations too frequently (pausing every 50 rows is too
> often; the default is now 10000 rows).
>
>> -e 'SELECT <a lot of columns> FROM `db`.`table` WHERE (year=2016 AND
>> month=6 AND day=1 AND hour=10)' > /dev/null
>
> Quick q - are year/month/day/hour partition columns? If so, there might be
> a very different fix to this problem.
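For anyone following along, here is a minimal sketch of what raising the fetch size looks like from beeline. The host, port, and database are placeholders for my setup, and the `!set fetchsize` step reflects beeline's sqlline heritage, so treat the whole thing as an assumption rather than a verified recipe:

```shell
# Hypothetical session; hs2-host:10000 and db are placeholders.
# Raise the client fetch size before running the export query.
beeline -u 'jdbc:hive2://hs2-host:10000/db' --incremental=true <<'EOF'
!set fetchsize 10000
SELECT <a lot of columns> FROM `db`.`table`
WHERE year=2016 AND month=6 AND day=1 AND hour=10;
EOF
```

As Gopal notes, this only reduces per-batch synchronization overhead; it does not help once HiveServer2 itself is CPU bound.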
Yes, year, month, day and hour are partition columns, i.e. I want to export
exactly one partition. In my real use case I want to apply another filter
(WHERE some_other_column = <x>), but for this case right here it is exactly
the data of one partition that I want.

>> In all cases, Hive is able only to utilize a tiny fraction of the
>> bandwidth that is available. Is there a possibility to increase network
>> throughput?
>
> A series of work-items are in progress for fixing the large row-set
> performance in HiveServer2
>
> https://issues.apache.org/jira/browse/HIVE-11527
> https://issues.apache.org/jira/browse/HIVE-12427
>
> What would be great would be to attach a profiler to your HiveServer2 &
> see which functions are hot, that will help fix those codepaths as part of
> the joint effort with the ODBC driver teams.

I'll see what I can do. I can't restart the server at will, though, since
other teams are using it as well.

> Cheers,
> Gopal

Thank you :)

-David
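P.S. For anyone finding this thread later: since the export target is exactly one partition, one possible "very different fix" (my speculation, not necessarily what Gopal has in mind, and the warehouse path below assumes the default layout) would be to skip HiveServer2 entirely and copy the partition's files straight out of HDFS:

```shell
# Assumed default warehouse layout; adjust the path for your cluster.
# Copies the partition's underlying files without going through Thrift.
hdfs dfs -get \
  '/user/hive/warehouse/db.db/table/year=2016/month=6/day=1/hour=10' \
  /local/export/
```

That only works when no additional filter is needed; with the extra WHERE some_other_column = <x>, writing the result to HDFS with INSERT OVERWRITE DIRECTORY and copying it out afterwards would likewise avoid the HiveServer2 result-set path.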