I tested the example on Hive 2.3.6, and it returned correct results. Hive
3.1.2 and 4.0.0-SNAPSHOT also returned correct results. So, I guess, if
this is a bug, it was introduced somewhere around Hive 3.0 and fixed in
3.1.2.

On Hive 2.3.6, I used these commands instead:
create table dummy(a string); insert into table dummy values ('a');
insert into foo select map("country", "USA"), 10 from dummy;
insert into foo select map("country", "USA"), 20 from dummy;
insert into foo select map("country", "UK"), 30 from dummy;

Cheers,

--- Sungwoo


On Wed, Feb 12, 2020 at 10:44 PM Nakul Khanna (BLOOMBERG/ LONDON) <
nkhann...@bloomberg.net> wrote:

> Hey Zoltan,
>
> Thanks for the response. When I call "select version()" I get:
>
> 3.1.0.3.1.4.0-315 re8d79f440455fa4400daf79974666b3055f1730f
>
> So a couple of patch versions old - any idea if this was a known bug
> before?
>
> Regards,
> Nakul
>
> From: user@hive.apache.org At: 02/12/20 12:31:33
> To: Nakul Khanna (BLOOMBERG/ LONDON ) <nkhann...@bloomberg.net>,
> user@hive.apache.org
> Cc: Jacky Lee (BLOOMBERG/ PRINCETON ) <jlee2...@bloomberg.net>, He Chen
> (BLOOMBERG/ PRINCETON ) <hchen...@bloomberg.net>, Peter Babinski
> (BLOOMBERG/ PRINCETON ) <pbabin...@bloomberg.net>, Bernat Gabor
> (BLOOMBERG/ LONDON ) <bgab...@bloomberg.net>, Shashank Singh (BLOOMBERG/
> PRINCETON ) <ssingh...@bloomberg.net>
> Subject: Re: Issues with aggregating on map values
>
> Hey Nakul!
>
> It's not clear which version you are using;
> I've checked this issue on apache/master and the 3.1.2 release - and both
> of
> them returned accurate results.
> You could execute: 'select version()' ; or run 'hive --version' in a
> commandline
>
> cheers,
> Zoltan
>
> On 2/11/20 11:38 AM, Nakul Khanna (BLOOMBERG/ LONDON) wrote:
> > Creating the table
> >
> > CREATE TABLE foo
> > (tags MAP<String, String>, size int);
> > INSERT INTO foo VALUES
> > (map("country", "USA"), 10),
> > (map("country", "USA"), 20),
> > (map("country", "UK"), 30);
> >
> > SELECT * FROM foo
> >
> > +--------------------+-----------+
> > | foo.tags | foo.size |
> > +--------------------+-----------+
> > | {"country":"USA"} | 10 |
> > | {"country":"USA"} | 20 |
> > | {"country":"UK"} | 30 |
> > +--------------------+-----------+
> >
> > Aggregating the Table
> >
> > SELECT DISTINCT tags["country"] from foo;
> >
> > +-------+
> > | _c0 |
> > +-------+
> > | USA |
> > | NULL |
> > +-------+
> >
> > SELECT tags["country"], sum(size) FROM foo GROUP BY tags["country"];
> >
> > +-------+------+
> > | _c0 | _c1 |
> > +-------+------+
> > | USA | 10 |
> > | NULL | 50 |
> > +-------+------+
> >
> > And even more strangely, with a subquery:
> >
> > SELECT flattened.country, sum(flattened.size)
> > FROM (
> > SELECT tags["country"] as country, size
> > FROM foo
> > WHERE tags["country"] IS NOT NULL
> > ) as flattened
> > GROUP BY flattened.country;
> >
> > +--------------------+------+
> > | flattened.country | _c1 |
> > +--------------------+------+
> > | USA | 10 |
> > +--------------------+------+
> >
> > ---------------
> >
> > Is there any way to stop this from happening and get the correct
> aggregation
> behaviour? The only method I've found is to create a new table using the
> query,
> write that to
> > disk and then do the aggregation on that.
>
>
>

Reply via email to