Hi,

I have to import > 400 million rows from a MySQL table (having a composite primary key) into a partitioned Hive table via Sqoop. The table holds two years of data, with a departure date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data based on the departure date.
The versions:
Apache Hadoop - 1.0.4
Apache Hive - 0.9.0
Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0

As per my knowledge, there are 3 approaches:

1. MySQL -> non-partitioned Hive table -> INSERT from the non-partitioned Hive table into the partitioned Hive table. This is the painful one that I am currently following (a rough sketch of the second step is below).

2. MySQL -> partitioned Hive table directly. I read that support for this was added in later(?) versions of Hive and Sqoop, but I was unable to find an example (my guess at the command is also sketched below).

3. MySQL -> non-partitioned Hive table -> ALTER the non-partitioned Hive table to add PARTITIONs. The syntax dictates that the partitions be specified as key-value pairs, which is not feasible with millions of records where one cannot enumerate all the partition key-value pairs by hand (see the last sketch below).

Can anyone provide inputs for approaches 2 and 3?
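To make approach 1 concrete, the second step is essentially an INSERT like the one below (table and column names are placeholders; this is the dynamic-partitioning variant, which I believe Hive 0.9.0 supports):

    -- copy from the non-partitioned staging table into the partitioned table,
    -- letting Hive create one partition per departure date
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    -- two years of dates is roughly 730 partitions, so raise the limits a bit
    SET hive.exec.max.dynamic.partitions = 2000;
    SET hive.exec.max.dynamic.partitions.pernode = 1000;

    INSERT OVERWRITE TABLE flights_partitioned PARTITION (departure_dt)
    SELECT f.col1, f.col2, f.departure_dt   -- partition column must come last
    FROM flights_staging f;

Even with this, the full 400+ million rows end up being written twice (once by Sqoop, once by the INSERT).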
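For approach 2, the closest I could find is Sqoop's --hive-partition-key / --hive-partition-value options. If I have understood them correctly, they load a single static partition value per invocation, something like this (connection details, table and column names are placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost/flightdb \
      --username dbuser -P \
      --table flights \
      --where "departure_dt = 20120605" \
      --hive-import \
      --hive-table flights_partitioned \
      --hive-partition-key departure_dt \
      --hive-partition-value 20120605

That would mean one Sqoop run per departure date (roughly 730 runs for two years). Is there a way, in these versions, to have a single import populate all the partitions?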
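For approach 3, the ADD PARTITION syntax I was referring to is along these lines (again placeholder names, and the LOCATION is made up):

    ALTER TABLE flights_partitioned
      ADD PARTITION (departure_dt = '20120605')
      LOCATION '/user/hive/warehouse/flights_staging/departure_dt=20120605';

As far as I can tell, this still needs one statement (and one pre-split HDFS directory) per date, so unless the statements are generated by a script it does not look any easier than approach 1.

Regards,
Omkar Joshi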