> another case of a query hangin' in v2.1.0.

I'm not sure that's a hang. If you can repro this, can you please grab a jstack 
while it is "hanging" (a jstack of HiveServer2 or the CLI)?

I have a theory that you're hitting a slow path in HDFS remote reads, based on 
the following stack trace:

        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:700)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.hadoop.io.SequenceFile$Reader.readBlock(SequenceFile.java:2101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2508)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:484)

Notice that it is firing off a 4-byte HDFS read call without any buffering - 
probably because compression usually acts as the natural buffering layer for 
SequenceFiles.

Uncompressed data may therefore trigger a 4-byte remote read directly, which 
would be an extremely slow way to read data out of HDFS.
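
To make that concrete, here is a minimal Java sketch - not Hive's actual code 
path, and hdfsStream is just a stand-in for the DFSInputStream - of the 
difference a buffer wrapper makes:

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    class BufferedReadSketch {
        // Unbuffered: readInt() pulls its 4 bytes straight off the
        // underlying stream - a remote round-trip when that stream is
        // a raw DFSInputStream.
        static int slow(InputStream hdfsStream) throws IOException {
            return new DataInputStream(hdfsStream).readInt();
        }

        // Buffered: the wrapper fetches 64K per remote read, so the
        // readInt() calls that follow are served from local memory.
        static int fast(InputStream hdfsStream) throws IOException {
            return new DataInputStream(
                    new BufferedInputStream(hdfsStream, 64 * 1024)).readInt();
        }
    }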

> * so empty result expected.

An empty result is the worst-case scenario for the FetchTask optimization, 
because it means the CLI deserializes every single row in a single thread, only 
to find that none of them match.

ORC, which has internal indexes, is somewhat safe from that.
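
To illustrate (the table and predicate here are made up): with fetch conversion 
enabled, a selective query like the one below never launches a cluster job - 
the client itself scans and deserializes the entire table just to return 
nothing.

    -- hypothetical: large, uncompressed SequenceFile table, no matching rows
    SELECT * FROM web_logs WHERE user_id = 'no-such-user';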

> set hive.fetch.task.conversion=none;
> but not sure its the right thing to set globally just yet.

No, it's not - the right fix is to tune the size threshold for that 
optimization:

    hive.fetch.task.conversion.threshold

Setting that to <= 1GB can be a win, while setting it to -1 (no size limit) can 
cause a lot of pain.
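
Per-session, that could look like this (the 1GB value is just an illustration):

    set hive.fetch.task.conversion=more;
    -- convert to a direct fetch only when the input is under 1GB
    set hive.fetch.task.conversion.threshold=1073741824;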

Cheers,
Gopal
