Hi Julien,

I am not sure about the MR code path, but I seem to remember a case where the MR 
plan was optimized in such a way that the table was read only once (with the wrong 
configuration) instead of twice with different configurations. When I asked 
around, I was told that the issue had been fixed for Tez. This looks like the same 
situation to me.
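
If you need a stopgap on the MR side, a more selective version of the workaround 
you describe could look roughly like the sketch below. It is plain Java with no 
Hive dependencies. The class name ReadColumnFallback and the method resolve are 
hypothetical, and the rule "fall back to the full schema when the property is 
empty or names a column this table does not have" is only a heuristic I am 
assuming here, not something Hive provides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: decides which columns to read
// from the raw value of "hive.io.file.readcolumn.names".
final class ReadColumnFallback {

    private ReadColumnFallback() {
    }

    /**
     * @param projected  raw property value as read from the job configuration (may be null)
     * @param allColumns every column of the table this reader serves
     * @return the projected columns if they look trustworthy, otherwise all columns
     */
    static List<String> resolve(String projected, List<String> allColumns) {
        // An empty value was observed in the thread when the only selected
        // column is also used in the predicate, so it is treated as unknown.
        if (projected == null || projected.trim().isEmpty()) {
            return new ArrayList<>(allColumns);
        }
        List<String> result = new ArrayList<>();
        for (String name : projected.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Assumed heuristic: a name outside this table's schema suggests the
            // value was computed for a different table scan, so read everything.
            if (!allColumns.contains(trimmed)) {
                return new ArrayList<>(allColumns);
            }
            result.add(trimmed);
        }
        return result.isEmpty() ? new ArrayList<>(allColumns) : result;
    }
}
```

This only avoids reading every column in the cases where the property looks 
plausible for the table at hand. It cannot detect a stale value whose column 
names happen to exist in both tables.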

Thanks,
Peter

> On 2022. May 18., at 21:41, Julien Phalip <jpha...@gmail.com> wrote:
> 
> Hi,
> 
> /cc Peter, as you might have some thoughts based on your experience with 
> Iceberg :)
> 
> I've noticed another odd behavior with the "hive.io.file.readcolumn.names" 
> property.
> 
> Consider this query that reads from two separate tables at once:
> 
> SELECT * FROM (
>     SELECT
>             num as number,
>             str_val as text
>     FROM t1
>     UNION ALL
>     SELECT *
>     FROM t2
> ) unioned_table ORDER BY number
> 
> When using the "mr" execution engine, the value of the 
> "hive.io.file.readcolumn.names" property cannot be relied on as it seems to 
> be stuck on the fields of just one of the tables. As a workaround, I have to 
> use all of the tables' columns when querying the external storage in my 
> custom storage handler, which is unfortunately quite inefficient.
> 
> Interestingly, that issue doesn't occur with Tez.
> 
> I've noticed that the Iceberg storage handler does this:
> 
> jobConf.set("tez.mrreader.config.update.properties", 
> "hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids");
> 
> Link: 
> https://github.com/apache/hive/blob/3b3da9ed7f3813bae3e959670df55682fea648d3/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L538
> 
> However, it still works fine for me with Tez even without setting 
> "tez.mrreader.config.update.properties".
> 
> Do you know what's causing this? Is there a workaround for the "mr" engine to 
> consistently get the proper value for "hive.io.file.readcolumn.names"?
> 
> Thank you,
> 
> Julien
> 
> On 2022/05/16 04:03:11 Julien Phalip wrote:
> > Also, I forgot to mention, I'm using Hive v3.1.2.
> > 
> > On 2022/05/16 03:09:19 Julien Phalip wrote:
> > > Hi,
> > >
> > > I've noticed an odd behavior with the 'hive.io.file.readcolumn.names' conf
> > > property.
> > >
> > > Imagine a simple table "mytable" with two fields: "text" and "number".
> > >
> > > - If you run the query "SELECT * FROM mytable", then the
> > > "hive.io.file.readcolumn.names" has the value: "text,number". Makes sense
> > > so far.
> > > - If you run the query "SELECT text FROM mytable", then the
> > > "hive.io.file.readcolumn.names" has the value: "text". Still makes sense.
> > >
> > > However, if you add a predicate (WHERE clause), then the behavior of that
> > > property seems strange to me:
> > >
> > > - If you run the query "SELECT * FROM mytable WHERE number = 999", then the
> > > "hive.io.file.readcolumn.names" has the value: "text". The "number" column
> > > is missing from the property.
> > > - If you run the query "SELECT number FROM mytable WHERE number = 999",
> > > then the "hive.io.file.readcolumn.names" has the value: "" (empty string).
> > > The "number" column is still missing from the property.
> > >
> > > In other words, it looks like a column that is part of a predicate is
> > > omitted from the "hive.io.file.readcolumn.names" property. Do you know
> > > why that is?
> > >
> > > I'm writing a custom StorageHandler and so I would need to know exactly
> > > what columns the user is requesting. Is there a way to consistently
> > > retrieve all the requested columns either from the configuration or from
> > > within the InputFormat class, even when there is a WHERE clause?
> > >
> > > Thanks,
> > >
> > > Julien
> > >
> >
