I'm using HDP 2.5.0 with 1.2.1 Hive. Performing some tests I noticed that my query works better if I don't use limit clause.
My query is: insert into table *results_table *partition (task_id=xxx) select * from *data_table * where dt=20171102 and ..... limit 1000000 This query runs in about 30 seconds, but without limit clause I can get about 20 seconds. Query execution plan with limit <https://pastebin.com/Cmp2rPNr>, and without <https://pastebin.com/z1ps2EhG>. I can't remove limit clause because in some cases there's to much results and I don't whant to store them all in result table. Why limit affects performance so much? Intuitively, it seems that with limit clause it should work faster. What can I do to improve prefomance?