I tested the example on Hive 2.3.6, and it returned correct results. Hive 3.1.2 and 4.0.0-SNAPSHOT also returned correct results. So, I guess, if this is a bug, it was introduced somewhere around Hive 3.0 and fixed in 3.1.2.
On Hive 2.3.6, I used these commands instead: create table dummy(a string); insert into table dummy values ('a'); insert into foo select map("country", "USA"), 10 from dummy; insert into foo select map("country", "USA"), 20 from dummy; insert into foo select map("country", "UK"), 30 from dummy; Cheers, --- Sungwoo On Wed, Feb 12, 2020 at 10:44 PM Nakul Khanna (BLOOMBERG/ LONDON) < nkhann...@bloomberg.net> wrote: > Hey Zoltan, > > Thanks for the response. When I call "select version()" I get: > > 3.1.0.3.1.4.0-315 re8d79f440455fa4400daf79974666b3055f1730f > > So a couple of patch versions old - any idea if this was a known bug > before? > > Regards, > Nakul > > From: user@hive.apache.org At: 02/12/20 12:31:33 > To: Nakul Khanna (BLOOMBERG/ LONDON ) <nkhann...@bloomberg.net>, > user@hive.apache.org > Cc: Jacky Lee (BLOOMBERG/ PRINCETON ) <jlee2...@bloomberg.net>, He Chen > (BLOOMBERG/ PRINCETON ) <hchen...@bloomberg.net>, Peter Babinski > (BLOOMBERG/ PRINCETON ) <pbabin...@bloomberg.net>, Bernat Gabor > (BLOOMBERG/ LONDON ) <bgab...@bloomberg.net>, Shashank Singh (BLOOMBERG/ > PRINCETON ) <ssingh...@bloomberg.net> > Subject: Re: Issues with aggregating on map values > > Hey Nakul! > > It's not clear which version you are using; > I've checked this issue on apache/master and the 3.1.2 release - and both > of > them returned accurate results. > You could execute: 'select version()' ; or run 'hive --version' in a > commandline > > cheers, > Zoltan > > On 2/11/20 11:38 AM, Nakul Khanna (BLOOMBERG/ LONDON) wrote: > > Creating the table > > > > CREATE TABLE foo > > (tags MAP<String, String>, size int); > > INSERT INTO foo VALUES > > (map("country", "USA"), 10), > > (map("country", "USA"), 20), > > (map("country", "UK"), 30); > > > > SELECT * FROM foo > > > > +--------------------+-----------+ > > | foo.tags | foo.size | > > +--------------------+-----------+ > > | {"country":"USA"} | 10 | > > | {"country":"USA"} | 20 | > > | {"country":"UK"} | 30 | > > +--------------------+-----------+ > > > > Aggregating the Table > > > > SELECT DISTINCT tags["country"] from foo; > > > > +-------+ > > | _c0 | > > +-------+ > > | USA | > > | NULL | > > +-------+ > > > > SELECT tags["country"], sum(size) FROM foo GROUP BY tags["country"]; > > > > +-------+------+ > > | _c0 | _c1 | > > +-------+------+ > > | USA | 10 | > > | NULL | 50 | > > +-------+------+ > > > > And even more strangely, with a subquery: > > > > SELECT flattened.country, sum(flattened.size) > > FROM ( > > SELECT tags["country"] as country, size > > FROM foo > > WHERE tags["country"] IS NOT NULL > > ) as flattened > > GROUP BY flattened.country; > > > > +--------------------+------+ > > | flattened.country | _c1 | > > +--------------------+------+ > > | USA | 10 | > > +--------------------+------+ > > > > --------------- > > > > Is there any way to stop this from happening and get the correct > aggregation > behaviour? The only method I've found is to create a new table using the > query, > write that to > > disk and then do the aggregation on that. > > >