[ https://issues.apache.org/jira/browse/FLINK-36594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias Pohl updated FLINK-36594:
----------------------------------
    Affects Version/s: 1.20.0
                           (was: 1.20.1)

> HiveCatalog should set HiveConf.hiveSiteLocation back
> -----------------------------------------------------
>
>                 Key: FLINK-36594
>                 URL: https://issues.apache.org/jira/browse/FLINK-36594
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Hive
>    Affects Versions: 1.17.1, 1.17.2, 1.13, 1.20.0, 2.0-preview
>            Reporter: slankka
>            Priority: Major
>              Labels: pull-request-available
>
> h3. {*}Background{*}:
> Recently, I have been using HiveCatalog together with Hudi sync to HMS.
> HiveCatalog can cause subsequent retrieval of the Hive configuration to fail.
> In my case, Hudi cannot get the hive-site conf provided on the classpath.
>
> h3. *TL;DR:*
> Once HiveCatalog is initialized, it turns the lookup off by setting
> *HiveConf.hiveSiteLocation* to null.
> After that, no instance of HiveConf will ever load hive-site.xml, no matter how
> the user puts it on the classpath (for example, provided by YARN).
>
> h3. {*}Detail{*}:
> HiveCatalog can load hive-site.xml itself without this variable. However, the
> ordinary code that runs afterwards still assumes that HiveConf 'searches' for
> hive-site.xml on the classpath.
> Related change: https://issues.apache.org/jira/browse/FLINK-22092
>
> h3. *Cause:*
> hive-site.xml is loaded only if you call addResource {*}explicitly{*}, set the
> location back, or let Hive find it in the user's uber jar, which takes extra effort.
> My point is, {+}big data developers will be confused about how to provide
> core-site.xml, hive-site.xml, hbase-site.xml and so on{+}.
> On the other hand, developers look for it here and there, and cannot
> be sure it is right.
> As a consequence, users and cloud providers put their xxx-site.xml everywhere:
> # /etc/hive/conf, /etc/hadoop/conf
> # FLINK_HOME/lib, SPARK_HOME/conf
> # yarn.provided.lib.dirs ( resource prefix ./lib, ./plugin/ )
> # packed in their uber jar
> # --files of Apache Spark, --yarnship hive-site.xml (works)
>
> Because deployments differ (yarn-per-job vs. yarn-application), the
> main() of an application can run in different places.
> The simplest way to provide xxx-site.xml is to put it on both the client-side
> classpath and the container classpath (root path). By the way, if I am a cloud
> infrastructure provider, I like to put it in 1., 2. and 3.; if I am a Flink
> user, I do not trust them, so I pack it in my jar and ask the cloud provider
> to give me xxx-site.xml.
>
> In addition, the classes below are similar in that each uses its private method
> *findConfigFile* to search for *hiveSiteLocation* on the classpath:
> * org.apache.hadoop.hive.conf.HiveConf
> * org.apache.hadoop.hive.metastore.conf.MetastoreConf
>
> {*}Conclusion{*}:
> # HiveConf calls findConfigFile and caches hiveSiteLocation only once, during
> class initialization.
> # MetastoreConf searches for hiveSiteLocation again even if somebody has set it
> to null. (This is better.)
> # Both HiveConf and MetastoreConf recognize hive-site.xml only at the first
> level of the classpath, e.g. "lib/hive-site.xml" is invalid.
>
> {code:java}
> class org.apache.hadoop.hive.metastore.conf.MetastoreConf
> private MetastoreConf() {
>   throw new RuntimeException("You should never be creating one of these!");
> }
>
> public static Configuration newMetastoreConf() {
>   ...
>   if (hiveSiteURL == null) {
>     hiveSiteURL = findConfigFile(classLoader, "hive-site.xml");
>   }
>   ...
> }{code}
>
> {code:java}
> class org.apache.hadoop.hive.conf.HiveConf
> // HiveConf's static initializer searches for hive-site.xml, and only once.
> static {
>   hiveSiteURL = findConfigFile(classLoader, "hive-site.xml", true);
> }
> ...
> private void initialize(Class<?> cls) {
>   ...
>   if (hiveSiteURL != null) {
>     addResource(hiveSiteURL);
>   }
>   ...
> }{code}
>
> {code:java}
> String name = "myhive";
> String defaultDatabase = "mydatabase";
> String hiveConfDir = "/opt/hive-conf";
> HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
> tableEnv.registerCatalog("myhive", hive);
> // set the HiveCatalog as the current catalog of the session
> tableEnv.useCatalog("myhive"); {code}
> After running the code above:
> {code:java}
> // Another framework that uses Hive natively:
> HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);
> // or directly
> HiveConf hiveConf = new HiveConf(); {code}
> The hiveConf *DOES NOT* load hive-site.xml from the classpath, which causes
> configuration loading to fail.
>
> Example code from HiveSyncConfig of Apache Hudi:
> {code:java}
> public HiveSyncConfig(Properties props, Configuration hadoopConf) {
>   super(props, hadoopConf);
>   HiveConf hiveConf = new HiveConf();
>   // HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
>   hiveConf.addResource(hadoopConf);
>   setHadoopConf(hiveConf);
>   validateParameters();
> } {code}
>
> The temporary fix for this issue is to search again :)
> {code:java}
> HiveConf.setHiveSiteLocation(classLoader.getResource(HiveCatalog.HIVE_SITE_FILE));
>
> HiveConf hiveConf = new HiveConf();{code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
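
Editorial sketch of the "set it back" idea named in the issue title. This is only an illustration under stated assumptions, not the actual Flink change. `FakeConf` and `withSiteLocationCleared` are hypothetical names. `FakeConf` stands in for HiveConf and its static hiveSiteURL, so the example runs on the JDK alone. With the real class, the equivalent calls would presumably be the static HiveConf.getHiveSiteLocation() and HiveConf.setHiveSiteLocation(URL).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.function.Supplier;

public class SiteLocationRestoreSketch {

    /** Hypothetical stand-in for HiveConf: one static location shared by all instances. */
    static final class FakeConf {
        // Plays the role of HiveConf's static hiveSiteURL.
        private static URL siteLocation;
        private final boolean loadedSite;

        FakeConf() {
            // Like HiveConf.initialize(): the site file is added only if the static URL is non-null.
            this.loadedSite = siteLocation != null;
        }

        static URL getSiteLocation() {
            return siteLocation;
        }

        static void setSiteLocation(URL url) {
            siteLocation = url;
        }

        boolean hasLoadedSite() {
            return loadedSite;
        }
    }

    /** Clears the static location while the action runs, then restores the saved value. */
    static <T> T withSiteLocationCleared(Supplier<T> action) {
        URL saved = FakeConf.getSiteLocation();
        FakeConf.setSiteLocation(null);
        try {
            return action.get();
        } finally {
            // The restore step: later callers see the original location again.
            FakeConf.setSiteLocation(saved);
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        // Pretend class initialization found the file under /etc/hive/conf.
        FakeConf.setSiteLocation(URI.create("file:///etc/hive/conf/hive-site.xml").toURL());

        FakeConf inside = withSiteLocationCleared(FakeConf::new);
        FakeConf after = new FakeConf();

        System.out.println("created while cleared, loaded site: " + inside.hasLoadedSite()); // false
        System.out.println("created afterwards, loaded site: " + after.hasLoadedSite());     // true
    }
}
```

The try/finally keeps the null value scoped to the one construction that needs it, so a framework that later calls `new HiveConf()` would still see the location that was found at class initialization.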