With different parameters or column number dense_rank function gets different count distinct results

ericni.ni Mon, 27 Oct 2014 21:46:04 -0700

Hi Hive users,
We created a table with SQL that uses the dense_rank function, and then ran 
count distinct on that table.
We found that different dense_rank parameters, or even a different number of 
columns, give different count distinct results:
1. With less data the results are consistent (in our test case, 200 million 
rows give the same results, but 300 million rows give different results).
2. Different dense_rank parameters may give different results, e.g. 
"dense_rank() over(distribute by a,b sort by c desc)" and "dense_rank() 
over(distribute by a sort by c desc)".
3. All window functions (rank, row_number, dense_rank) have this problem.
4. With fewer columns the results may be consistent.
5. count(1) is consistent, but count distinct gives different results.
6. It seems that some rows have been lost and some rows duplicated.
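
The kind of check described in the list above could be sketched roughly as follows. This is only an illustration, not the linked test SQL: the table names (src, ranked_ab, ranked_a) and the columns id and d are hypothetical, and only a, b, c and the two over() clauses come from the examples quoted above. Since a window function neither adds nor drops rows, all three queries at the end would be expected to return the same pair of numbers.

```sql
-- Hypothetical names: src, ranked_ab, ranked_a, id and d are assumptions
-- made for illustration; a, b, c come from the over() clauses quoted above.

-- Variant 1: rank within each (a, b) group.
CREATE TABLE ranked_ab AS
SELECT id, a, b, c, d,
       dense_rank() OVER (DISTRIBUTE BY a, b SORT BY c DESC) AS rk
FROM src;

-- Variant 2: rank within each a group.
CREATE TABLE ranked_a AS
SELECT id, a, b, c, d,
       dense_rank() OVER (DISTRIBUTE BY a SORT BY c DESC) AS rk
FROM src;

-- A window function emits exactly one output row per input row,
-- so both counts should be identical across the three tables.
SELECT count(1), count(DISTINCT id) FROM src;
SELECT count(1), count(DISTINCT id) FROM ranked_ab;
SELECT count(1), count(DISTINCT id) FROM ranked_a;
```

If count(1) agrees everywhere but count(DISTINCT id) differs between the tables, that would match the symptom in items 5 and 6: the row total is preserved while some rows are missing and others appear more than once.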


Test data (the file is too large to attach):
http://pan.baidu.com/s/1hqnCzze

Test SQL:
http://pan.baidu.com/s/1eQna8q2

