Hi,

I am trying to test some of the optimizations that partitioning and clustering 
tables can provide, but I have a doubt about how the SORTED BY clause of a 
table works.
The case is the following:
I create a simple bucketed table as:

CREATE TABLE USERS(ID INT,NAME STRING, OTHER INT)
CLUSTERED BY (ID) SORTED BY(ID) into 4 buckets;

I am setting some configuration parameters:

* set hive.enforce.bucketing=true;
* set hive.enforce.sorting=true;

Suppose that I have 1 million rows of sample data for this table, and the data 
is stored automatically ordered by the ID column.
Now I am trying to check whether queries are optimized for the sorted data. I 
inserted the data with ids in no particular order, and some ids are duplicated. 
Because I told Hive that the data is sorted by ID, I expected that a search by 
id would stop once it finds the first match and there are no more consecutive 
matches for the same id, returning only the first row. In practice this is not 
happening, and the query returns all the rows with the same id.
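
To make the case concrete, here is a minimal sketch of the kind of lookup I 
mean. The id value 42 is only an illustrative assumption, not a value from my 
real data:

```sql
-- Hypothetical lookup on the bucketed, sorted USERS table defined above.
-- The literal 42 is an assumed example id.
SELECT ID, NAME, OTHER
FROM USERS
WHERE ID = 42;
```

A query like this returns every row whose ID matches, not only the first one.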

Is my understanding of how the SORTED BY clause helps the query wrong, or is 
something else happening?

Sorry for my bad English.
I hope you can help.
Regards


