[jira] Commented: (HIVE-1641) add map joined table to distributed cache

He Yongqiang (JIRA) Wed, 20 Oct 2010 14:06:51 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923168#action_12923168
 ]


He Yongqiang commented on HIVE-1641:
------------------------------------

Some tests are failing because of plan change.

Can you refresh the diff?


And some more minor comments, you can fix them in the following up jiras or in 
your next patch (some of them are just few lines of change).
1. 
NOTSKIPBIGTABLE is defined in both AbstractMapJoinOperator and 
CommonJoinOperator. And let's not use 'static'.

2.
In MapJoinObjectKey, metadataTag is always -1, and we serialize and deserialize 
it for each key. We can avoid it by simply assume that metadataTag is -1.

3.
In JDBMSinkOperator, 

if (hashTable.cacheSize() > 0) {
  o.setObj(res);
  needNewKey = false;
}

has no effect. 

Even hashTable.cacheSize() > 0, and then needNewKey = true

In the following code,
if (needNewKey){
...
hashTable.put(keyObj, valueObj);
}

the keyObj and valueObj is already in hashTable, so the put also has no effect 
except put the value to the head of MRUList. But at the put time, it is already 
in the head because of the get()

So ideally, 

we should put most code into 

if (o == null) {
....

 if (metadataValueTag[tag] == -1) {
 .....
 }

 if (needNewKey) { //this is always true here
 
 }
} else {
        res = o.getObj();
        res.add(value);
}

These maybe beneficial to the client performance, and that will be good since 
now we are now putting all the process work of small tables at the client. 

4. 
In JDBMSinkOperator's close(), put hashTable.close(); before uploading jdbm 
file. That way, JDBM itself may want to do some cleanup work in the close 
before uploading jdbm file.

5.
In JDBMSinkOperator, remove getPersistentFilePath(). there is no referenced to 
it.

6.
In MapjoinOperator's loadJDBM, remove line "int alias;"
In loadJDBM(), remove code:
"
for(int i = 0;i<localFiles.length; i++){
  Path path = localFiles[i];
}
"
7. 
Instead of resolving the file name mapping at runtime. should do it at compile 
time. Need to open a follow up jira for this.

8.
In MapredLocalTask, remove line:
"
private MapOperator mo;
private File jdbmFile;
"
Maybe we should print some progress information in startForward(). That way, 
client will not think it is not responsible.



AWESOME work! Can you open the follow up jiras for the offline review comments?

> add map joined table to distributed cache
> -----------------------------------------
>
>                 Key: HIVE-1641
>                 URL: https://issues.apache.org/jira/browse/HIVE-1641
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Namit Jain
>            Assignee: Liyin Tang
>             Fix For: 0.7.0
>
>         Attachments: Hive-1641(3).txt, Hive-1641(4).patch, 
> Hive-1641(5).patch, Hive-1641.patch
>
>
> Currently, the mappers directly read the map-joined table from HDFS, which 
> makes it difficult to scale.
> We end up getting lots of timeouts once the number of mappers are beyond a 
> few thousand, due to 
> concurrent mappers.
> It would be good idea to put the mapped file into distributed cache and read 
> from there instead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1641) add map joined table to distributed cache

Reply via email to