Hi Julien,

I am not sure about the MR codepath, but I seem to remember a case where the MR plan was optimized in such a way that the table was read only once (with the wrong configuration) instead of twice with different configurations. When I asked around, I was told that the issue had been fixed for Tez. This looks like the same situation to me.
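
In case it is useful, here is a minimal sketch of a more selective version of the workaround you describe for "mr". It is only an illustration under stated assumptions: the class and method names (ReadColumnsSketch, columnsToRead) are hypothetical, a plain Map stands in for the JobConf, and the t2 column names in the demo are made up. The idea is to trust "hive.io.file.readcolumn.names" only when every listed name belongs to the table being read, and to fall back to all of that table's columns otherwise (including when the value is empty).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hive. A Map stands in for the JobConf.
final class ReadColumnsSketch {
    // Property name as quoted in this thread.
    static final String READ_COLUMN_NAMES = "hive.io.file.readcolumn.names";

    /**
     * Returns the projected columns when they all belong to this table's
     * schema; otherwise returns every column of the table (the inefficient
     * but safe fallback).
     */
    static List<String> columnsToRead(Map<String, String> conf, List<String> tableColumns) {
        String raw = conf.get(READ_COLUMN_NAMES);
        if (raw == null || raw.trim().isEmpty()) {
            // Missing or empty projection: read everything.
            return new ArrayList<>(tableColumns);
        }
        List<String> projected = new ArrayList<>();
        for (String name : raw.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                projected.add(trimmed);
            }
        }
        // A projection naming columns this table does not have was most
        // likely computed for a different table in the same plan.
        if (projected.isEmpty() || !tableColumns.containsAll(projected)) {
            return new ArrayList<>(tableColumns);
        }
        return projected;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(READ_COLUMN_NAMES, "num,str_val");
        // t1 columns come from the query in this thread.
        System.out.println(columnsToRead(conf, Arrays.asList("num", "str_val")));
        // t2 column names are assumed for the demo only.
        System.out.println(columnsToRead(conf, Arrays.asList("number", "text")));
    }
}
```

This check cannot detect a stale projection when both tables happen to share the listed column names, so it narrows the cases where you read all columns but does not replace a real fix in the plan.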
Thanks,
Peter

> On 2022. May 18., at 21:41, Julien Phalip <jpha...@gmail.com> wrote:
>
> Hi,
>
> /cc Peter, as you might have some thoughts based on your experience with
> Iceberg :)
>
> I've noticed another odd behavior with the "hive.io.file.readcolumn.names"
> property.
>
> Consider this query that reads from two separate tables at once:
>
>   SELECT * FROM (
>     SELECT
>       num as number,
>       str_val as text
>     FROM t1
>     UNION ALL
>     SELECT *
>     FROM t2
>   ) unioned_table ORDER BY number
>
> When using the "mr" execution engine, the value of the
> "hive.io.file.readcolumn.names" property cannot be relied on as it seems to
> be stuck on the fields of just one of the tables. As a workaround, I have to
> use all of the tables' columns when querying the external storage in my
> custom storage handler, which is unfortunately quite inefficient.
>
> Interestingly, that issue doesn't occur with Tez.
>
> I've noticed that the Iceberg storage handler does this:
>
>   jobConf.set("tez.mrreader.config.update.properties",
>       "hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids");
>
> Link:
> https://github.com/apache/hive/blob/3b3da9ed7f3813bae3e959670df55682fea648d3/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L538
>
> However, it still works fine for me with Tez even without setting
> "tez.mrreader.config.update.properties".
>
> Do you know what's causing this? Is there a workaround for the "mr" engine to
> consistently get the proper value for "hive.io.file.readcolumn.names"?
>
> Thank you,
>
> Julien
>
> On 2022/05/16 04:03:11 Julien Phalip wrote:
> > Also, I forgot to mention, I'm using Hive v3.1.2.
> >
> > On 2022/05/16 03:09:19 Julien Phalip wrote:
> > > Hi,
> > >
> > > I've noticed an odd behavior with the 'hive.io.file.readcolumn.names' conf
> > > property.
> > >
> > > Imagine a simple table "mytable" with two fields: "text" and "number".
> > >
> > > - If you run the query "SELECT * FROM mytable", then the
> > > "hive.io.file.readcolumn.names" has the value: "text,number". Makes sense
> > > so far.
> > > - If you run the query "SELECT text FROM mytable", then the
> > > "hive.io.file.readcolumn.names" has the value: "text". Still makes sense.
> > >
> > > However, if you add a predicate (WHERE clause), then the behavior of that
> > > property seems strange to me:
> > >
> > > - If you run the query "SELECT * FROM mytable WHERE number = 999", then the
> > > "hive.io.file.readcolumn.names" has the value: "text". The "number" column
> > > is missing from the property.
> > > - If you run the query "SELECT number FROM mytable WHERE number = 999",
> > > then the "hive.io.file.readcolumn.names" has the value: "" (empty string).
> > > The "number" column is still missing from the property.
> > >
> > > In other words, it looks like if a column is part of a predicate, then it
> > > is omitted from the "hive.io.file.readcolumn.names" property. Do you know
> > > why that is?
> > >
> > > I'm writing a custom StorageHandler and so I would need to know exactly
> > > what columns the user is requesting. Is there a way to consistently
> > > retrieve all the requested columns either from the configuration or from
> > > within the InputFormat class, even when there is a WHERE clause?
> > >
> > > Thanks,
> > >
> > > Julien