[jira] [Updated] (FLINK-36594) HiveCatalog should set HiveConf.hiveSiteLocation back

Matthias Pohl (Jira) Mon, 02 Dec 2024 01:12:06 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-36594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matthias Pohl updated FLINK-36594:
----------------------------------
    Affects Version/s: 1.20.0
                           (was: 1.20.1)

> HiveCatalog should set HiveConf.hiveSiteLocation back
> -----------------------------------------------------
>
>                 Key: FLINK-36594
>                 URL: https://issues.apache.org/jira/browse/FLINK-36594
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Hive
>    Affects Versions: 1.17.1, 1.17.2, 1.13, 1.20.0, 2.0-preview
>            Reporter: slankka
>            Priority: Major
>              Labels: pull-request-available
>
> h3. {*}Background{*}:
> Recently I have been using HiveCatalog together with Hudi's sync to the Hive 
> Metastore (HMS).
> Once HiveCatalog has been used, later attempts to retrieve the Hive 
> configuration can fail. 
> In my case, Hudi could not read the hive-site.xml provided on the classpath. 
>  
> h3. *TL;DR:*
> Once HiveCatalog is initialized, it turns off the hive-site.xml lookup by 
> setting *HiveConf.hiveSiteLocation* to null.
> After that, no new HiveConf instance loads hive-site.xml, no matter how the 
> user puts it on the classpath (for example, provided by YARN). 
>  
> h3. {*}Detail{*}:
> HiveCatalog can load hive-site.xml itself without this variable. However, 
> other code that runs afterwards still assumes that HiveConf searches for 
> hive-site.xml on the classpath. 
> Related change:  https://issues.apache.org/jira/browse/FLINK-22092
>  
> h3. *Cause:*
> After that, hive-site.xml is loaded only if you call addResource 
> {*}explicitly{*}, set the location back, or let Hive find it in the user's 
> uber jar, which takes extra effort.
> My point is, {+}big data developers will be confused about how to provide 
> core-site.xml, hive-site.xml, hbase-site.xml and so on{+}.
> On the other hand, developers look for these files here and there and cannot 
> be sure they have found the right one.
> As a consequence, users and cloud providers put their xxx-site.xml everywhere:
>  # /etc/hive/conf, /etc/hadoop/conf
>  # FLINK_HOME/lib, SPARK_HOME/conf
>  # yarn.provided.lib.dir ( resource prefix ./lib, ./plugin/ )
>  # packed in their uber jar
>  # --files of Apache Spark, --yarnship hive-site.xml (works)
> Because the deployment modes differ (yarn-per-job and yarn-application), the 
> main() of an application can run in different places.
> The simplest way to provide xxx-site.xml is to put it on both the client-side 
> classpath and the container classpath (root path). By the way, if I were a 
> cloud infrastructure provider, I would put it in locations 1, 2 and 3; as a 
> Flink user, I would not trust those locations, so I would pack it into my jar 
> and ask the cloud provider to give me the xxx-site.xml.
>  
> In addition, the classes below are similar in that each uses its private 
> method *findConfigFile* to locate *hiveSiteLocation* on the classpath:
>  * org.apache.hadoop.hive.conf.HiveConf
>  * org.apache.hadoop.hive.metastore.conf.MetastoreConf
>  
> {*}Conclusion{*}:
>  # HiveConf calls findConfigFile and caches hiveSiteLocation only once, 
> during class initialization.
>  # MetastoreConf searches for hiveSiteLocation again even if somebody has set 
> it to null. (This is better.)
>  # Both HiveConf and MetastoreConf only recognize hive-site.xml at the first 
> level of the classpath, e.g. "lib/hive-site.xml" is not found.
>  
> {code:java}
> // class org.apache.hadoop.hive.metastore.conf.MetastoreConf
> private MetastoreConf() {
>   throw new RuntimeException("You should never be creating one of these!");
> }
>  
> public static Configuration newMetastoreConf() {
> ...
>   if(hiveSiteURL == null) {
>     hiveSiteURL = findConfigFile(classLoader, "hive-site.xml");
>   }
> ...
> }{code}
>  
> {code:java}
> // class org.apache.hadoop.hive.conf.HiveConf
> // HiveConf's static initializer searches for hive-site.xml, and only once.
> static {
>   hiveSiteURL = findConfigFile(classLoader, "hive-site.xml", true);
> }
> ...
> private void initialize(Class<?> cls) {
>   ...
>   if (hiveSiteURL != null) {
>     addResource(hiveSiteURL);
>   }
>   ...
> }{code}
>  
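The third conclusion above can be checked without Hive on the classpath, assuming the lookup comes down to a plain ClassLoader.getResource("hive-site.xml") call as the quoted snippets suggest. The sketch below is a JDK-only illustration, not Hive's findConfigFile. The class and helper names are made up for this example. It creates two throwaway directories, one with hive-site.xml at the root and one with it under lib/, and uses each as the only entry of a URLClassLoader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; nothing here comes from Hive or Flink.
final class ClasspathRootLookupDemo {

    /** Returns true if a loader over this single directory resolves "hive-site.xml". */
    static boolean findsHiveSite(Path classpathRoot) {
        try (URLClassLoader loader =
                // parent = null: only the given directory is searched, not the JVM classpath
                new URLClassLoader(new URL[] {classpathRoot.toUri().toURL()}, null)) {
            return loader.getResource("hive-site.xml") != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp directory containing an empty file at the given relative path. */
    static Path dirWithFile(String relativePath) {
        try {
            Path root = Files.createTempDirectory("cp-demo");
            Path file = root.resolve(relativePath);
            Files.createDirectories(file.getParent());
            Files.createFile(file);
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // File sits directly at the classpath root
        System.out.println("root level: " + findsHiveSite(dirWithFile("hive-site.xml")));
        // File sits one level down, so the bare name does not resolve
        System.out.println("under lib/: " + findsHiveSite(dirWithFile("lib/hive-site.xml")));
    }
}
```

Running main prints true for the root-level file and false for the one under lib/. A classpath entry has to point at the directory that directly contains hive-site.xml, not at its parent.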
> {code:java}
> String name            = "myhive";
> String defaultDatabase = "mydatabase";
> String hiveConfDir     = "/opt/hive-conf";
> HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
> tableEnv.registerCatalog("myhive", hive);
> // set the HiveCatalog as the current catalog of the session
> tableEnv.useCatalog("myhive"); {code}
> After running the code above:
> {code:java}
> // Another framework that naturally uses Hive:
> HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class); 
> // or directly
> HiveConf hiveConf = new HiveConf(); {code}
> This hiveConf *DOES NOT* load hive-site.xml from the classpath, which causes 
> configuration loading to fail.
>  
> Example code from HiveSyncConfig of Apache Hudi:
> {code:java}
> public HiveSyncConfig(Properties props, Configuration hadoopConf) {
>     super(props, hadoopConf);
>     HiveConf hiveConf = new HiveConf();
>     // HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
>     hiveConf.addResource(hadoopConf);
>     setHadoopConf(hiveConf);
>     validateParameters();
> } {code}
>  
> A temporary workaround is to search the classpath again and set the location 
> back before creating the HiveConf :)
> {code:java}
> HiveConf.setHiveSiteLocation(classLoader.getResource(HiveCatalog.HIVE_SITE_FILE));
>  
> HiveConf hiveConf = new HiveConf();{code}
>  
>  
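The behaviour the issue title asks for, setting the location back, can be sketched as a save-and-restore around the code that needs the location cleared. The example below is a toy model under stated assumptions, not Flink or Hive code. ToyConf is a hypothetical stand-in for a class that keeps a site-file location in a static field. Its getter and setter, the example location, and withLocationCleared are all invented for illustration and say nothing about which accessors HiveConf actually offers.

```java
import java.net.URI;
import java.util.function.Supplier;

// Hypothetical demo; ToyConf only mimics "one static location shared by all instances".
final class RestoreLocationDemo {

    /** Runs the action with the static location cleared, then puts the old value back. */
    static <T> T withLocationCleared(Supplier<T> action) {
        URI saved = ToyConf.getSiteLocation();
        ToyConf.setSiteLocation(null);
        try {
            return action.get();
        } finally {
            // Restored even if the action throws
            ToyConf.setSiteLocation(saved);
        }
    }

    public static void main(String[] args) {
        // Inside the guarded section, new instances see no site file
        boolean insideGuard = withLocationCleared(() -> new ToyConf().isSiteLoaded());
        // Afterwards, unrelated code sees the original location again
        boolean afterGuard = new ToyConf().isSiteLoaded();
        System.out.println("inside guard: " + insideGuard);
        System.out.println("after guard:  " + afterGuard);
    }
}

/** Toy stand-in for a config class that caches a site-file location statically. */
final class ToyConf {
    // Assumed example location; resolved once and shared by every instance
    private static URI siteLocation = URI.create("file:///etc/hive/conf/hive-site.xml");

    private final boolean siteLoaded;

    ToyConf() {
        // Instances only consult the static field; they never search again
        this.siteLoaded = siteLocation != null;
    }

    boolean isSiteLoaded() {
        return siteLoaded;
    }

    static URI getSiteLocation() {
        return siteLocation;
    }

    static void setSiteLocation(URI location) {
        siteLocation = location;
    }
}
```

Because the field is static, another thread that constructs a ToyConf while the guarded action runs would still see null. The pattern therefore narrows the window in which the location is cleared, but does not remove it.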



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
