Re: Issues with aggregating on map values

Zoltan Haindrich Wed, 12 Feb 2020 04:32:24 -0800

Hey Nakul!

It's not clear which version you are using;
I've checked this issue on apache/master and the 3.1.2 release - and both of 
them returned accurate results.
You could execute: 'select version()' ; or run 'hive --version' in a commandline


cheers,
Zoltan

On 2/11/20 11:38 AM, Nakul Khanna (BLOOMBERG/ LONDON) wrote:

Creating the table

CREATE TABLE foo
(tags MAP<String, String>, size int);
INSERT INTO foo VALUES
(map("country", "USA"), 10),
(map("country", "USA"), 20),
(map("country", "UK"), 30);

SELECT * FROM foo

+--------------------+-----------+
| foo.tags | foo.size |
+--------------------+-----------+
| {"country":"USA"} | 10 |
| {"country":"USA"} | 20 |
| {"country":"UK"} | 30 |
+--------------------+-----------+

Aggregating the Table

SELECT DISTINCT tags["country"] from foo;

+-------+
| _c0 |
+-------+
| USA |
| NULL |
+-------+

SELECT tags["country"], sum(size) FROM foo GROUP BY tags["country"];

+-------+------+
| _c0 | _c1 |
+-------+------+
| USA | 10 |
| NULL | 50 |
+-------+------+

And even more strangely, with a subquery:

SELECT flattened.country, sum(flattened.size)
FROM (
SELECT tags["country"] as country, size
FROM foo
WHERE tags["country"] IS NOT NULL
) as flattened
GROUP BY flattened.country;

+--------------------+------+
| flattened.country | _c1 |
+--------------------+------+
| USA | 10 |
+--------------------+------+

---------------

Is there any way to stop this from happening and get the correct aggregation behaviour? The only method I've found is to create a new table using the query, write that todisk and then do the aggregation on that.

Re: Issues with aggregating on map values

Reply via email to