Re: Reading XML into Hive table, handing null columns

Mich Talebzadeh Thu, 28 Jun 2018 09:46:43 -0700

Hi,

One way of avowing null values in an xml tag when displaying the results is
the use of Outer Lateral Views
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews>
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews



The explanation below summarises it


The user can specify the optional OUTER keyword to generate rows even when
a LATERAL VIEW usually would not generate a row. This happens when the UDTF
used does not generate any rows which happens easily with explode when the
column to explode is empty. In this case the source row would never appear
in the results. *OUTER can be used to prevent that and rows will be
generated with NULL values in the columns coming from the UDTF.*


So for hdfs-site where description is null one can do a test for null value
for every column


insert overwrite table hdfs_site
select
          rownum
        , name
        , value
        , description
from
hdfs_site_temp
lateral view outer explode(Pname) a as name
lateral view outer explode(Pvalue) a as value
lateral view outer explode(Pdescription) a as description
;

0: jdbc:hive2://rhes75:10099/default> select * from hdfs_site
. . . . . . . . . . . . . . . . . . > order by rownum
. . . . . . . . . . . . . . . . . . > ;
+-------------------+----------------------------------------+----------------------------------------------------+------------------------+
| hdfs_site.rownum  |             hdfs_site.name
|                  hdfs_site.value                   |
hdfs_site.description  |
+-------------------+----------------------------------------+----------------------------------------------------+------------------------+
| 1                 | dfs.namenode.ec.system.default.policy  |
RS-6-3-1024k                                       | NULL
|
| 2                 | dfs.safemode.threshold.pct             |
0                                                  | NULL
|
| 3                 | hadoop.tmp.dir                         |
/tmp                                               | NULL
|
| 4                 | dfs.image.transfer.timeout             |
600000                                             | NULL
|
| 5                 | dfs.permissions                        |
false                                              | NULL
|
| 6                 | dfs.block.size                         |
134217728                                          | NULL
|
| 7                 | dfs.balance.bandwidthPerSec            |
10                                                 | NULL
|
| 8                 | dfs.datanode.data.dir                  |
file:/d4T/hduser/hadoop/hadoop_store/hdfs/datanode | NULL
|
| 9                 | dfs.namenode.name.dir                  |
file:/d4T/hduser/hadoop/hadoop_store/hdfs/namenode | NULL
|
+-------------------+----------------------------------------+----------------------------------------------------+------------------------+

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 28 Jun 2018 at 09:40, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> The classic XML configuration files like hive-site.xml have the following
> patterns:
>
>
>  <property><name></name><value></value><description></description></properly>
>
> In order to read these xml into Hive, you need to do the following:
>
> ### 1. Brings each record to one line (deletes the root element
> configuration tags):
> ### 2. Need to get rid of space between tags
> ###
> <property><name></name><value></value><description></description></properly>
> ### 3. Each xml record needs to be in one line!
> ### 4. The xml file needs the same start and end tag for each line like
> property tags. Since you have only one root tag for the whole document,
> that is not going to work. Also
> ###    need to get rid of root tag at the beginning and end plus the xml
> information at the beginning (i.e ?xml version)
>
> This simple shell command does work (hive-site.xml, hdfs-site.xml etc)
>
> cat ${HIVE_HOME}/conf/hive-site.xml | tr -d '&' | tr '\n' ' ' | tr '\r' '
> ' | sed 's|</property>|</property>\n|g'| sed 's/<configuration>//g'|sed
> 's|</configuration>||g' \
>  | sed -e 's/^[            ]*//g' | sed -e 's/>[            ]*/>/g' | grep
> -v "<?xml version=" > ${XML_FILE}
>
> replace ${XML_FILE} with any name
>
> The below code will work for hive-ste.xml
>
> drop table if exists xml_temp;
> -- create a load table
> create table xml_temp (xmlData string);
> -- load data from local xml file
> load data local inpath "${XML_FILE}" into table xml_temp;
> select * from xml_temp;
> drop table if exists hive_site_temp;
> create table if not exists hive_site_temp(
>        rownum int,
>        Pname array<string>,
>        Pvalue array<string>,
>        Pdescription array<string>)
>        row format delimited
>        fields terminated by '|'
> ;
> --desc hive_site_temp;
> insert overwrite table hive_site_temp
> select
>           row_number() over()
>         , xpath(xmlData,'property/name/text()')
>         , xpath(xmlData,'property/value/text()')
>         , xpath(xmlData,'property/description/text()')
> from xml_temp
> ;
> drop table if exists hive_site;
> create table if not exists hive_site
>      (
>        rownum int,
>        name string,
>        value string,
>        description string
>      )
> STORED AS PARQUET
> TBLPROPERTIES ("parquet.compression"="SNAPPY")
> ;
> insert overwrite table hive_site
> select
>           rownum
>         , name
>         , value
>         , description
> from
> hive_site_temp
>
>
>
> *lateral view explode(Pname) a as namelateral view explode(Pvalue) a as
> valuelateral view explode(Pdescription) a as description*;
> select * from hive_site
> order by rownum
>
>
> However, for hdfs-site.xml, by default there is no "description" tag.
> Still you can read data in no problem. However, "lateral view
> explode(Pdescription) a as description" will not work and table will have
> no rows!
>
> One way is to exclude that column from insert. However, I was wondering if
> explode function can check if column has any value before populating it!
>
> For example for hdfs-site.xml these are my values:
>
> 0: jdbc:hive2://rhes75:10099/default> select
> xpath(xmlData,'property/description/text()') as description from xml_temp;
> +--------------+
> | description  |
> +--------------+
> | []           |
> | []           |
> | []           |
> | []           |
> | []           |
> | []           |
> | []           |
> | []           |
> | []           |
> +--------------+
>
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>

Re: Reading XML into Hive table, handing null columns

Reply via email to