[ 
https://issues.apache.org/jira/browse/HIVE-22337?focusedWorklogId=442365&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-442365
 ]

ASF GitHub Bot logged work on HIVE-22337:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Jun/20 00:26
            Start Date: 07/Jun/20 00:26
    Worklog Time Spent: 10m 
      Work Description: github-actions[bot] commented on pull request #815:
URL: https://github.com/apache/hive/pull/815#issuecomment-640136438


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 442365)
    Time Spent: 0.5h  (was: 20m)

> Improve and Expand Text-Based SerDes
> ------------------------------------
>
>                 Key: HIVE-22337
>                 URL: https://issues.apache.org/jira/browse/HIVE-22337
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 4.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: HIVE-22337.1.patch, HIVE-22337.2.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> * Add new SerDe package just for text-based formats: 
> org.apache.hadoop.hive.serde2.text.*
> * Add new SerDe package just for text-based log formats: 
> org.apache.hadoop.hive.serde2.text.log.*
> * Create a coherent hierarchy for processing delimited data: AbstractSerDe -> 
> TextSerDe -> EncodingAwareTextSerde -> DelimitedSerDe -> CsvTextSerDe
> * Create a coherent hierarchy for processing regex'ed data: AbstractSerDe -> 
> TextSerDe -> EncodingAwareTextSerde -> RegexSerDe -> CommonFormatLogSerDe
> * Create some standard text processors for super-quick out-of-the-box 
> processing: TSV SerDe and CSV SerDe
> * Create some standard log processors for super-quick out-of-the-box 
> processing: Apache Common Log Format and Apache Combined Log Format (Apache 
> HTTP Server Log Parsers)
> * Better default behaviors for processing text
> The default behavior should allow users to quickly query data without any 
> failures.
> # When a blank line is encountered, insert a 'null' value for each column
> # When there are fewer fields in the data than defined in the table schema, 
> shift all available fields left, and fill in 'null' values for all remaining 
> fields
> # When there are too many fields in the data, the last field in the results 
> will contain all remaining values.  Currently, the data is silently swallowed 
> and a warning is issued in the YARN logs.  A normal user will never see this 
> warning, especially if the job completes successfully.  Better to (by 
> default) provide them all the data than to hide anything.
> {code:none|title=CSV SerDe}
> "1,2,3"    = ["1","2","3"]
> "1,2,"     = ["1","2",null]
> ""         = [null,null,null]
> "1,2,3,4"  = ["1","2","3,4"]
> {code}
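
The three default behaviors listed in the description can be sketched in plain Java as follows. This is only an illustration of the rules under stated assumptions, not the code from the HIVE-22337 patch: the class name LenientCsvSplit and the method splitLenient are hypothetical, quoting and escaping are ignored, and treating every empty field (not only a trailing one) as null is an assumption drawn from the "1,2," example.

```java
import java.util.Arrays;

public class LenientCsvSplit {

    /**
     * Hypothetical helper (not part of Hive): splits a line into exactly
     * columnCount fields using the lenient rules from the issue description.
     * Assumption: no quoting or escaping; empty fields become null.
     */
    public static String[] splitLenient(String line, char delimiter, int columnCount) {
        String[] row = new String[columnCount]; // every slot starts out as null
        if (line == null || line.isEmpty()) {
            return row; // rule 1: a blank line yields null for each column
        }
        int start = 0;
        for (int col = 0; col < columnCount; col++) {
            boolean lastColumn = (col == columnCount - 1);
            // The last column never looks for a delimiter, so it keeps
            // everything that is left on the line (rule 3).
            int end = lastColumn ? -1 : line.indexOf(delimiter, start);
            if (end < 0) {
                String rest = line.substring(start);
                row[col] = rest.isEmpty() ? null : rest;
                break; // rule 2: any remaining columns stay null
            }
            String field = line.substring(start, end);
            row[col] = field.isEmpty() ? null : field;
            start = end + 1;
        }
        return row;
    }

    public static void main(String[] args) {
        // The four inputs from the CSV SerDe example in the description.
        for (String line : new String[] {"1,2,3", "1,2,", "", "1,2,3,4"}) {
            System.out.println("\"" + line + "\" = "
                    + Arrays.toString(splitLenient(line, ',', 3)));
        }
    }
}
```

Stopping the delimiter search at the last column is what keeps surplus values visible in the final field instead of dropping them, which is the behavior change the description argues for.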



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
