Elliot West created HIVE-12860:
----------------------------------

             Summary: Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
                 Key: HIVE-12860
                 URL: https://issues.apache.org/jira/browse/HIVE-12860
             Project: Hive
          Issue Type: New Feature
          Components: Hive
            Reporter: Elliot West
            Assignee: Elliot West


_As a Hive user_
_I'd like the option to seamlessly write out a header row to file-system-based 
result sets_
_So that I can generate reports whose specification mandates a header row._

h4. Motivations
There is a significant use-case where Hive is used to construct a scheduled 
data processing pipeline that generates a report in HDFS for consumption by 
some third party (internal or external). This report may then be transferred 
out of the system for consumption by other tools or processes. It is not 
uncommon for the third party to specify that the report includes a header row 
at the start of the file. The current options for adding headers are difficult 
to use effectively and elegantly.

h4. Acceptance criteria
* {{INSERT OVERWRITE DIRECTORY}} commands can be invoked with an option to 
include a header row at the start of the result set file.
* The header row will contain the column names derived from the accompanying 
{{SELECT}} query.
* Multiple tasks will likely write the final file of the query result set. In 
that case, only the task writing the first chunk of the file should emit the 
header row.

h4. Proposed HQL changes
{code}
1.  INSERT OVERWRITE [LOCAL] DIRECTORY directory1
2.    [ROW FORMAT row_format] [STORED AS file_format]
3.    [WITH HEADER]
4.    SELECT ... FROM ...
{code}
It is proposed that the {{WITH HEADER}} stanza at line 3 be introduced to 
enable this feature.
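
As an illustration only, a report query using the proposed syntax might look like the following. The {{WITH HEADER}} clause is the proposal in this issue and is not existing Hive syntax; the output path, the {{sales}} table and its columns are hypothetical.
{code}
-- Hypothetical usage of the proposed WITH HEADER clause (not existing Hive syntax).
-- The path, table and column names are made up for this example.
INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  WITH HEADER
  SELECT region, SUM(amount) AS total_amount
  FROM sales
  GROUP BY region;
{code}
Under the acceptance criteria above, the first line of the result would be expected to carry the column names of the {{SELECT}}, here {{region,total_amount}}, followed by the data rows.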

h4. Current workarounds
* It is usually suggested that users set the CLI option 
{{hive.cli.print.header=true}} and capture the result set from standard out. 
However, this does not work well in scheduled, headless environments such as 
the Oozie Hive action. This can also push the file handling into shell scripts 
and complicate the process of getting the report into HDFS.
* To keep report processing entirely within the domain of Hive, some users 
{{UNION}} the result of their query with a tiny table of a single row 
containing the header names. A synthesised rank column is used with an {{ORDER 
BY}} to ensure that the header is written to the very start of the file. See 
[this example on Stack 
Overflow|http://stackoverflow.com/questions/15139561/adding-column-headers-to-hive-result-set/25214480#25214480].
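
A minimal, untested sketch of the {{UNION}} workaround is shown below. It assumes a hypothetical single-row table {{report_header}} whose string columns hold the header names, and a hypothetical {{sales}} table; the synthesised {{row_rank}} column exists only to sort the header first. Whether {{ORDER BY}} may reference a column that is not in the outer select list varies between Hive versions, so this may need adjusting.
{code}
-- Workaround sketch: prepend a header row with UNION ALL and a synthesised rank.
-- report_header, sales and all column names are hypothetical.
INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  SELECT region, total_amount
  FROM (
    -- Header row: rank 0 so that it sorts before every data row
    SELECT 0 AS row_rank, h.region, h.total_amount
    FROM report_header h
    UNION ALL
    -- Data rows: rank 1, values cast to STRING to match the header column types
    SELECT 1 AS row_rank, s.region, CAST(SUM(s.amount) AS STRING) AS total_amount
    FROM sales s
    GROUP BY s.region
  ) report
  ORDER BY row_rank;
{code}
The global {{ORDER BY}} is what places the header at the start of the output, at the cost of an extra sort and of casting every column to a string, which illustrates why this approach is awkward for routine use.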

h4. References
* HIVE-138: Original request for header functionality.
* [Hive Wiki: writing data into the file system from 
queries|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
