[ https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923168#action_12923168 ]
He Yongqiang commented on HIVE-1641: ------------------------------------ Some tests are failing because of plan change. Can you refresh the diff? And some more minor comments, you can fix them in the following up jiras or in your next patch (some of them are just few lines of change). 1. NOTSKIPBIGTABLE is defined in both AbstractMapJoinOperator and CommonJoinOperator. And let's not use 'static'. 2. In MapJoinObjectKey, metadataTag is always -1, and we serialize and deserialize it for each key. We can avoid it by simply assume that metadataTag is -1. 3. In JDBMSinkOperator, if (hashTable.cacheSize() > 0) { o.setObj(res); needNewKey = false; } has no effect. Even hashTable.cacheSize() > 0, and then needNewKey = true In the following code, if (needNewKey){ ... hashTable.put(keyObj, valueObj); } the keyObj and valueObj is already in hashTable, so the put also has no effect except put the value to the head of MRUList. But at the put time, it is already in the head because of the get() So ideally, we should put most code into if (o == null) { .... if (metadataValueTag[tag] == -1) { ..... } if (needNewKey) { //this is always true here } } else { res = o.getObj(); res.add(value); } These maybe beneficial to the client performance, and that will be good since now we are now putting all the process work of small tables at the client. 4. In JDBMSinkOperator's close(), put hashTable.close(); before uploading jdbm file. That way, JDBM itself may want to do some cleanup work in the close before uploading jdbm file. 5. In JDBMSinkOperator, remove getPersistentFilePath(). there is no referenced to it. 6. In MapjoinOperator's loadJDBM, remove line "int alias;" In loadJDBM(), remove code: " for(int i = 0;i<localFiles.length; i++){ Path path = localFiles[i]; } " 7. Instead of resolving the file name mapping at runtime. should do it at compile time. Need to open a follow up jira for this. 8. In MapredLocalTask, remove line: " private MapOperator mo; private File jdbmFile; " Maybe we should print some progress information in startForward(). That way, client will not think it is not responsible. AWESOME work! Can you open the follow up jiras for the offline review comments? > add map joined table to distributed cache > ----------------------------------------- > > Key: HIVE-1641 > URL: https://issues.apache.org/jira/browse/HIVE-1641 > Project: Hive > Issue Type: Improvement > Components: Query Processor > Affects Versions: 0.7.0 > Reporter: Namit Jain > Assignee: Liyin Tang > Fix For: 0.7.0 > > Attachments: Hive-1641(3).txt, Hive-1641(4).patch, > Hive-1641(5).patch, Hive-1641.patch > > > Currently, the mappers directly read the map-joined table from HDFS, which > makes it difficult to scale. > We end up getting lots of timeouts once the number of mappers are beyond a > few thousand, due to > concurrent mappers. > It would be good idea to put the mapped file into distributed cache and read > from there instead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.