Hey Nakul!
It's not clear which version you are using;
I've checked this issue on apache/master and the 3.1.2 release - and both of
them returned accurate results.
You could execute: 'select version()' ; or run 'hive --version' in a commandline
cheers,
Zoltan
On 2/11/20 11:38 AM, Nakul Khanna (BLOOMBERG/ LONDON) wrote:
Creating the table
CREATE TABLE foo
(tags MAP<String, String>, size int);
INSERT INTO foo VALUES
(map("country", "USA"), 10),
(map("country", "USA"), 20),
(map("country", "UK"), 30);
SELECT * FROM foo
+--------------------+-----------+
| foo.tags | foo.size |
+--------------------+-----------+
| {"country":"USA"} | 10 |
| {"country":"USA"} | 20 |
| {"country":"UK"} | 30 |
+--------------------+-----------+
Aggregating the Table
SELECT DISTINCT tags["country"] from foo;
+-------+
| _c0 |
+-------+
| USA |
| NULL |
+-------+
SELECT tags["country"], sum(size) FROM foo GROUP BY tags["country"];
+-------+------+
| _c0 | _c1 |
+-------+------+
| USA | 10 |
| NULL | 50 |
+-------+------+
And even more strangely, with a subquery:
SELECT flattened.country, sum(flattened.size)
FROM (
SELECT tags["country"] as country, size
FROM foo
WHERE tags["country"] IS NOT NULL
) as flattened
GROUP BY flattened.country;
+--------------------+------+
| flattened.country | _c1 |
+--------------------+------+
| USA | 10 |
+--------------------+------+
---------------
Is there any way to stop this from happening and get the correct aggregation behaviour? The only method I've found is to create a new table using the query, write that to
disk and then do the aggregation on that.