Re: Review Request 30739: HIVE-9574 Lazy computing in HiveBaseFunctionResultList may hurt performance [Spark Branch]

Jimmy Xiang Mon, 09 Feb 2015 18:27:07 -0800


> On Feb. 9, 2015, 2:51 a.m., Rui Li wrote:
> >
> 
> Rui Li wrote:
>     A high-level question: do we still need two buffers? And does it make 
> sense to use something like a queue instead of an array as the buffer?
> 
> Jimmy Xiang wrote:
>     A queue should work too. Using two buffers makes it easier to switch 
> between read and write. Switching itself is cheap here. For RowContainer, it 
> is expensive to switch because of first()/clear(), etc.
> 
> Rui Li wrote:
>     Thanks for the explanation, Jimmy. I was just wondering if we can use a 
> single queue as the buffer and avoid switching between two arrays and 
> managing the cursors. That should make it less complicated, right?


You are right. We could use a single queue if not for the thread-safety issue. 
Since the cache must be thread safe, with one queue it is hard to maintain the 
state when some of the data has been flushed to disk.
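
To illustrate the two-buffer switching discussed above, here is a minimal Java 
sketch. It is not the code in the patch: the class name TwoBufferSketch and its 
methods are hypothetical, spilling to disk is left out entirely, and thread 
safety is approximated by simply synchronizing every method. It only shows why 
the switch is cheap: the two arrays trade roles and the cursors are reset.

```java
import java.util.NoSuchElementException;

/** Hypothetical illustration only; not the actual HiveKVResultCache. */
class TwoBufferSketch<T> {
  private Object[] writeBuffer;
  private Object[] readBuffer;
  private int writeCursor = 0; // next free slot in writeBuffer
  private int readCursor = 0;  // next unread slot in readBuffer
  private int readLimit = 0;   // number of valid rows in readBuffer

  TwoBufferSketch(int inMemoryNumRows) {
    writeBuffer = new Object[inMemoryNumRows];
    readBuffer = new Object[inMemoryNumRows];
  }

  /** True when the write buffer cannot take another row. */
  synchronized boolean isFull() {
    return writeCursor == writeBuffer.length;
  }

  synchronized void add(T row) {
    if (writeCursor == writeBuffer.length) {
      // A real cache would flush rows to disk here; this sketch does not.
      throw new IllegalStateException("write buffer is full");
    }
    writeBuffer[writeCursor++] = row;
  }

  synchronized boolean hasNext() {
    return readCursor < readLimit || writeCursor > 0;
  }

  @SuppressWarnings("unchecked")
  synchronized T next() {
    if (readCursor >= readLimit) {
      if (writeCursor == 0) {
        throw new NoSuchElementException();
      }
      // Read side is drained: swap the two arrays and reset the cursors.
      Object[] tmp = readBuffer;
      readBuffer = writeBuffer;
      writeBuffer = tmp;
      readLimit = writeCursor;
      readCursor = 0;
      writeCursor = 0;
    }
    T row = (T) readBuffer[readCursor];
    readBuffer[readCursor++] = null; // drop the reference once consumed
    return row;
  }
}
```

Because the read buffer is always drained before the swap, rows come out in the 
order they were added. Once some rows may also live on disk, a single queue 
would need extra bookkeeping to keep that ordering, which is the state 
management concern mentioned above.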


> On Feb. 9, 2015, 2:51 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java, 
> > line 54
> > <https://reviews.apache.org/r/30739/diff/4/?file=853475#file853475line54>
> >
> >     If I understand correctly, this can be renamed to something like 
> > IN_MEMORY_NUM_ROWS?
> 
> Jimmy Xiang wrote:
>     Yes, you are right. Both are ok. Any strong reason for renaming it?
> 
> Rui Li wrote:
>     No, I just feel that "cache size" sounds more like a size in bytes.

I see. Good point.


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30739/#review71597
-----------------------------------------------------------


On Feb. 9, 2015, 7:41 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30739/
> -----------------------------------------------------------
> 
> (Updated Feb. 9, 2015, 7:41 p.m.)
> 
> 
> Review request for hive, Rui Li and Xuefu Zhang.
> 
> 
> Bugs: HIVE-9574
>     https://issues.apache.org/jira/browse/HIVE-9574
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The result KV cache no longer uses RowContainer, since RowContainer has logic 
> we don't need and that adds overhead. We also don't start lazy computing right 
> away; instead we wait until the cache is close to spilling.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 78ab680 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java 8ead0cb 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunction.java 7a09b4d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunctionResultList.java e92e299 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 070ea4d 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java d4ff37c 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/KryoSerializer.java 286816b 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 0df4598 
> 
> Diff: https://reviews.apache.org/r/30739/diff/
> 
> 
> Testing
> -------
> 
> Unit test, test on cluster
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>
