[ https://issues.apache.org/jira/browse/HIVE-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000321#comment-13000321 ]

Tim Robertson commented on HIVE-680:
------------------------------------

I have just run into a frustrating issue caused by a few bad rows.

Using Oozie, we Sqoop from MySQL to HDFS and then process with Hive plus custom 
UDFs, UDAFs, and UDTFs. It is nearly working, and it is a very clean solution.

In my case the source DB contained 70,000+ bad rows with tab and newline 
characters embedded in fields, which became invalid rows with missing (NULL) 
IDs by the time they reached Hive. The processing then joined 4 tables, each 
keyed on that ID, so the single reducer handling the NULL key faced a 
potential 70k x 70k x 70k x 70k blow-up.
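
For anyone hitting the same thing, a defensive workaround is to filter out 
NULL join keys before the multi-way join, so the bad rows never enter the 
shuffle at all. A sketch only (table and column names here are invented):

    -- Hypothetical tables t1..t4, each keyed on `id`. Filtering the
    -- NULL ids inside each subquery keeps the bad rows out of the
    -- join entirely, so no reducer gets stuck with the NULL key.
    SELECT a.id, a.val, b.val, c.val, d.val
    FROM (SELECT id, val FROM t1 WHERE id IS NOT NULL) a
    JOIN (SELECT id, val FROM t2 WHERE id IS NOT NULL) b ON (a.id = b.id)
    JOIN (SELECT id, val FROM t3 WHERE id IS NOT NULL) c ON (a.id = c.id)
    JOIN (SELECT id, val FROM t4 WHERE id IS NOT NULL) d ON (a.id = d.id);

(For the root cause, Sqoop's --hive-drop-import-delims option, where the 
Sqoop version in use has it, strips the offending characters at import time.)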

Perhaps it would be nice to allow basic constraints to be declared on a 
table, and then provide a generic sanitize() method that warns of potential 
issues?
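
To illustrate, the declaration side might look something like this (entirely 
hypothetical syntax, just to show the idea):

    ALTER TABLE raw_import ADD CONSTRAINT id_not_null CHECK (id IS NOT NULL);

Until something like that exists, the closest equivalent is a manual 
pre-flight query run from the Oozie workflow before the real processing 
(table name invented):

    -- Count the rows that would violate the intended NOT NULL
    -- constraint; the workflow can abort or warn when the count
    -- exceeds some threshold.
    SELECT COUNT(*) AS bad_rows
    FROM raw_import
    WHERE id IS NULL OR id = '';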




> add user constraints 
> ---------------------
>
>                 Key: HIVE-680
>                 URL: https://issues.apache.org/jira/browse/HIVE-680
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Many times, because of a few bad input rows, the whole job fails and it takes 
> a long time to debug those.
> It might be very useful to add some constraints, which can be checked while 
> reading the data.
> An option can be added to ignore a configurable number of bad rows.
