The classic XML configuration files like hive-site.xml have the following
patterns:

 <property><name></name><value></value><description></description></properly>

In order to read these xml into Hive, you need to do the following:

### 1. Brings each record to one line (deletes the root element
configuration tags):
### 2. Need to get rid of space between tags
###
<property><name></name><value></value><description></description></properly>
### 3. Each xml record needs to be in one line!
### 4. The xml file needs the same start and end tag for each line like
property tags. Since you have only one root tag for the whole document,
that is not going to work. Also
###    need to get rid of root tag at the beginning and end plus the xml
information at the beginning (i.e ?xml version)

This simple shell command does work (hive-site.xml, hdfs-site.xml etc)

cat ${HIVE_HOME}/conf/hive-site.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' '
| sed 's|</property>|</property>\n|g'| sed 's/<configuration>//g'|sed
's|</configuration>||g' \
 | sed -e 's/^[            ]*//g' | sed -e 's/>[            ]*/>/g' | grep
-v "<?xml version=" > ${XML_FILE}

replace ${XML_FILE} with any name

The below code will work for hive-ste.xml

drop table if exists xml_temp;
-- create a load table
create table xml_temp (xmlData string);
-- load data from local xml file
load data local inpath "${XML_FILE}" into table xml_temp;
select * from xml_temp;
drop table if exists hive_site_temp;
create table if not exists hive_site_temp(
       rownum int,
       Pname array<string>,
       Pvalue array<string>,
       Pdescription array<string>)
       row format delimited
       fields terminated by '|'
;
--desc hive_site_temp;
insert overwrite table hive_site_temp
select
          row_number() over()
        , xpath(xmlData,'property/name/text()')
        , xpath(xmlData,'property/value/text()')
        , xpath(xmlData,'property/description/text()')
from xml_temp
;
drop table if exists hive_site;
create table if not exists hive_site
     (
       rownum int,
       name string,
       value string,
       description string
     )
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")
;
insert overwrite table hive_site
select
          rownum
        , name
        , value
        , description
from
hive_site_temp



*lateral view explode(Pname) a as namelateral view explode(Pvalue) a as
valuelateral view explode(Pdescription) a as description*;
select * from hive_site
order by rownum


However, for hdfs-site.xml, by default there is no "description" tag. Still
you can read data in no problem. However, "lateral view
explode(Pdescription) a as description" will not work and table will have
no rows!

One way is to exclude that column from insert. However, I was wondering if
explode function can check if column has any value before populating it!

For example for hdfs-site.xml these are my values:

0: jdbc:hive2://rhes75:10099/default> select
xpath(xmlData,'property/description/text()') as description from xml_temp;
+--------------+
| description  |
+--------------+
| []           |
| []           |
| []           |
| []           |
| []           |
| []           |
| []           |
| []           |
| []           |
+--------------+


Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Reply via email to