wa-ooo edited a comment on pull request #92:
URL: https://github.com/apache/sqoop/pull/92#issuecomment-800087918

> Hi hong,
>
> I've reviewed your changes (both GitHub and issues.apache.org), but TBH in the current state I'm concerned both about the intention of the change and about its correctness.
>
> First of all:
> Could you please provide a bit more detail about what performance gain you expect from this change and how you measured it? Could you please also provide an automated test case which would show the effect of this gain and ensure we don't lose it in the future?
>
> On the front of correctness:
> SQOOP-3149 introduced the line you'd like to remove, and if I remember correctly, absolutely intentionally. For this reason:
> Could you please provide automated test cases which ensure that the SQOOP-3149 changes won't be undone by your change (so we keep the current correctness around NULL column updates)?
>
> Many thanks in advance,
> Attila Szabo

----------

Hi @maugly24, thanks for reviewing this PR.

Our production environment was upgraded from CDH-5.13.0 to CDH-6.3.2, and we found that the job importing data from RDM into HBase took 3 to 4 hours longer on the 6.3.2 cluster (about 50 million records). The number of output records in the MR log was also much higher than on 5.13.0. The problem is hard to notice when importing small tables, but the larger the data volume, the more significant the delay. So I compared the changes to the HBase import job in Sqoop between the two versions and found the problem here.

I thought this would be an easy fix for HBase developers to follow, so I did not write much description in the issue. The change is simple: the Put is already added to mutationList when it is initialized, so it does not need to be added again afterwards. Otherwise, the same Put is recorded repeatedly in the generated HFile.

I looked at SQOOP-3149, and there is no explanation for this line.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org