If you are using Sqoop 1.4.4, you can use the hcatalog-table option to bring the data into Hive. That keeps the import agnostic of the Hive table format (you can use ORC, for example, or RCFile) and it handles partitioning easily, including dynamic partitions. You can also export directly from a Hive table using the same hcatalog-table option.
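A minimal sketch of what such an import could look like, assuming ORC storage; the connection URL, database, table and split column below are just placeholders:

  # import CUSTOMER from Oracle into an ORC-backed Hive table via HCatalog
  sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username scott -P \
    --table CUSTOMER \
    --split-by CUSTOMER_ID \
    --num-mappers 8 \
    --hcatalog-database default \
    --hcatalog-table customer \
    --create-hcatalog-table \
    --hcatalog-storage-stanza "stored as orc"

--create-hcatalog-table creates the Hive/HCatalog table if it does not already exist, and the storage stanza controls the file format without the import job having to know anything about it.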
Please see https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_sqoop_hcatalog_integration. There is also a brief presentation at https://cwiki.apache.org/confluence/download/attachments/27361435/SqoopHCatIntegration-HadoopWorld2013.pptx

Thanks
Venkat

On Sun, Nov 3, 2013 at 8:57 AM, Raj Hadoop <hadoop...@yahoo.com> wrote:
> Manish,
>
> Thanks for the reply.
>
> 1. Load to HDFS, but beware of Sqoop error handling; since it is a MapReduce-based
> framework, if one mapper fails you may end up with partial data.
> So are you saying that, as long as I handle errors in Sqoop, going with 100 HDFS
> folders/files is OK?
>
> 2. Create partitions based on date and hour, if the customer table has a
> date or timestamp column.
> I cannot rely on a date or timestamp column. Can I go with Customer ID instead?
>
> 3. Think about the file format as well, as it will affect load and query
> time.
> Can you please suggest a file format I should use?
>
> 4. Think about compression beforehand too, as it will govern the data
> splits and the performance of your queries.
> Does compression increase or reduce performance? Isn't the advantage of
> compression the savings in storage?
>
> - Raj
>
>
> On Sunday, November 3, 2013 11:03 AM, manish.hadoop.work <
> manish.hadoop.w...@gmail.com> wrote:
> 1. Load to HDFS, but beware of Sqoop error handling; since it is a MapReduce-based
> framework, if one mapper fails you may end up with partial data.
>
> 2. Create partitions based on date and hour, if the customer table has a
> date or timestamp column.
>
> 3. Think about the file format as well, as it will affect load and query
> time.
>
> 4. Think about compression beforehand too, as it will govern the data
> splits and the performance of your queries.
>
> Regards,
> Manish
>
>
> Sent from my T-Mobile 4G LTE Device
>
>
> -------- Original message --------
> From: Raj Hadoop <hadoop...@yahoo.com>
> Date: 11/03/2013 7:39 AM (GMT-08:00)
> To: Hive <user@hive.apache.org>, Sqoop <u...@sqoop.apache.org>, User <
> u...@hadoop.apache.org>
> Subject: Oracle to HDFS through Sqoop and a Hive External Table
>
>
> Hi,
>
> I am sending this to the three dist-lists of Hadoop, Hive and Sqoop as
> this question is closely related to all three areas.
>
> I have the following requirement.
>
> I have a big table in Oracle (about 60 million rows, primary key Customer
> Id). I want to bring this to HDFS and then create
> a Hive external table over it. My requirement is to run queries on this Hive
> table (at this time I do not know what queries I would be running).
>
> Is the following a good design for the above problem? What are the pros and cons?
>
> 1) Load the table into HDFS using Sqoop into multiple folders (divide
> Customer Ids into 100 segments).
> 2) Create a Hive external partitioned table based on the above 100 HDFS
> directories.
>
>
> Thanks,
> Raj
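For reference, the external partitioned table in step 2 could be declared along these lines; the column names, delimiter and paths are illustrative and assume each Sqoop run writes one Customer Id segment, as comma-delimited text, under /data/customer/segment=NN:

  -- hypothetical layout: one directory per customer-id segment
  CREATE EXTERNAL TABLE customer_ext (
    customer_id BIGINT,
    name        STRING,
    created_dt  STRING
  )
  PARTITIONED BY (segment INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/customer';

  -- register one partition per Sqoop target directory (100 in total)
  ALTER TABLE customer_ext ADD PARTITION (segment=0)
  LOCATION '/data/customer/segment=0';

Each ADD PARTITION only points Hive at an existing directory, so no data is moved, and queries that filter on the partition column will scan just the matching directories.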