[ https://issues.apache.org/jira/browse/HIVE-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000321#comment-13000321 ]
Tim Robertson commented on HIVE-680:
------------------------------------

I have just experienced a frustrating issue caused by a few bad rows. Using Oozie, we Sqoop from MySQL to HDFS and then process with Hive and custom UDFs, UDAFs, and UDTFs - nearly working, and a super clean solution. In my case, 70,000+ bad rows had tab and newline characters in fields in the source DB, resulting in invalid rows with missing IDs by the time they reached the Hive work. During the processing we ended up with a join across 4 tables, each keyed on the ID, meaning 70k x 70k x 70k x 70k - quite some work for the reducer dealing with the NULL id. Perhaps it would be nice to allow basic constraints to be declared on a table, and then provide some generic sanitize() method to warn of potential issues?

> add user constraints
> ---------------------
>
>                 Key: HIVE-680
>                 URL: https://issues.apache.org/jira/browse/HIVE-680
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Many times, because of a few bad input rows, the whole job fails, and it takes a long time to debug those.
> It might be very useful to add some constraints, which can be checked while reading the data.
> An option can be added to ignore a configurable number of bad rows.
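To make the suggestion concrete, here is a minimal HiveQL sketch. The NULL-key filter in the first statement is the workaround available today for the join blow-up described in the comment; the table names t1-t4, the column names, and everything after the second comment block (the NOT NULL declaration, the hive.constraints.max.bad.rows property, and the SANITIZE TABLE statement) are hypothetical mock-ups of the requested feature, not anything Hive actually supports.

{code:sql}
-- Workaround available today: filter out rows with a NULL join key
-- before the join, so the single reducer that receives the NULL key
-- never has to process the 70k x 70k x 70k x 70k blow-up.
SELECT t1.id, t2.v, t3.v, t4.v
FROM (SELECT * FROM t1 WHERE id IS NOT NULL) t1
JOIN t2 ON (t1.id = t2.id)
JOIN t3 ON (t1.id = t3.id)
JOIN t4 ON (t1.id = t4.id);

-- Hypothetical sketch of the requested feature (not valid HiveQL):
-- declare the constraint on the table so the reader can warn about,
-- or skip, violating rows while the data is read.
-- CREATE TABLE t1 (id BIGINT NOT NULL, v STRING);
-- SET hive.constraints.max.bad.rows=1000;   -- ignore up to N bad rows
-- SANITIZE TABLE t1;                        -- report constraint violations
{code}

Only the first statement would run on Hive as it exists today; the commented-out part is just one possible shape for the constraint and sanitize() proposal above.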