My problem is in eliminating the duplicates and keeping only the correct data. Any advice, please?

On Dec 24, 2012 9:13 PM, "Dean Wampler" <dean.wamp...@thinkbiganalytics.com> wrote:
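(Editor's note: a minimal sketch of one common workaround for the duplicate problem -- periodically rebuild the reporting table, keeping only the newest copy of each key. The table and column names below, orders_raw, orders_final, order_id and updated_at, are hypothetical placeholders, not taken from this thread.)

  #!/bin/bash
  # Rebuild the deduplicated reporting table from the raw, append-only imports.
  # Assumes every row carries its primary key (order_id) and a modification
  # timestamp (updated_at); the newest timestamp per key wins.
  hive -e "
  INSERT OVERWRITE TABLE orders_final
  SELECT o.*
  FROM orders_raw o
  JOIN (
    SELECT order_id, MAX(updated_at) AS max_updated
    FROM orders_raw
    GROUP BY order_id
  ) latest
  ON o.order_id = latest.order_id
     AND o.updated_at = latest.max_updated;
  "

(If two versions of a row can share the same updated_at value, this still leaves duplicates; add a tie-breaker column to the join condition in that case.)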
> Looks good, but a few suggestions. If you can eliminate duplicates, etc. as you ingest the data into HDFS, that would eliminate a cleansing step. Note that if the target directory in HDFS IS the specified location for an external Hive table/partition, then there will be no separate step to "load in Hive as External Table". It's already there!
>
> Your "transform data..." is a common pattern; stage "raw" data into a location, then use Hive (or Pig) to transform it into the final form and INSERT INTO the final Hive table.
>
> dean
>
> On Mon, Dec 24, 2012 at 9:34 AM, Ibrahim Yakti <iya...@souq.com> wrote:
>
>> Thanks Dean for the great reply. Setting up the incremental import should be easy, but if I partition my data, how will Hive get me only the updated rows, considering that a row may have multiple fields that are updated over time? And how will I manage the tables that are based on multiple sources? Do you recommend importing the data into HDFS instead of into Hive directly? Won't we have a lot of duplicated records then?
>>
>> Regarding automation, we were thinking of using the sqoop-job command or crons, as you suggested.
>>
>> So, the suggested flow is as follows:
>>
>> MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in Hive as External Table ---(Transform Data & Join Tables)--> Save it in Hive tables for reporting.
>>
>> Correct?
>>
>> Appreciated.
>>
>> --
>> Ibrahim
>>
>> On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler <dean.wamp...@thinkbiganalytics.com> wrote:
>>
>>> This is not as hard as it sounds. The hardest part is setting up the incremental query against your MySQL database. Then you can write the results to new files in the HDFS directory for the table and Hive will see them immediately. Yes, even though Hive doesn't support updates, it doesn't care how many files are in the directory. The trick is to avoid lots of little files.
>>>
>>> As others have suggested, you should consider partitioning the data, perhaps by time. Say you import a few HDFS blocks' worth of data each day; then use year/month/day partitioning to speed up your Hive queries. You'll need to add the partitions to the table as you go, but you can add those once a month, for example, for all partitions. Hive doesn't care if the partition directories don't exist yet or are empty. I also recommend using an external table, which gives you more flexibility on directory layout, etc.
>>>
>>> Sqoop might be the easiest tool for importing the data, as it will even generate a Hive table schema from the original MySQL table. However, that feature may not be useful in this case, as you already have the table.
>>>
>>> I think Oozie is horribly complex to use and overkill for this purpose. A simple bash script triggered periodically by cron is all you need. If you aren't using a partitioned table, you have a single sqoop command to run. If you have partitioned data, you'll also need a hive statement in the script to create the partition, unless you do those in batch once a month, etc.
>>>
>>> Hope this helps,
>>> dean
>>>
>>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <iya...@souq.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We are new to Hadoop and Hive. We are trying to use Hive to run analytical queries, and we are using Sqoop to import data into Hive. In our RDBMS the data is updated very frequently, and this needs to be reflected in Hive.
>>>> Hive does not support update/delete, but there are many workarounds to do this task.
>>>>
>>>> What's in our mind is importing all the tables into Hive as is, then building the required tables for reporting.
>>>>
>>>> My questions are:
>>>>
>>>> 1. What is the best way to reflect MySQL updates into Hive with minimal resources?
>>>> 2. Is Sqoop the right tool to do the ETL?
>>>> 3. Is Hive the right tool for this kind of query, or should we search for alternatives?
>>>>
>>>> Any hint will be useful. Thanks in advance.
>>>>
>>>> --
>>>> Ibrahim
>>>
>>>
>>> --
>>> *Dean Wampler, Ph.D.*
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
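(Editor's note: a rough sketch of the cron-driven flow Dean describes above -- an incremental Sqoop import into a dated HDFS directory backing an external, date-partitioned Hive table, followed by registering the new partition. Every name, path, and credential below is a placeholder, not something from this thread.)

  #!/bin/bash
  # Nightly incremental pull from MySQL into HDFS, then register the new
  # Hive partition. Intended to be triggered by cron once per day.

  YEAR=$(date +%Y); MONTH=$(date +%m); DAY=$(date +%d)
  TARGET="/data/raw/orders/year=$YEAR/month=$MONTH/day=$DAY"
  STATE=/var/lib/etl/orders.lastvalue   # high-water mark from the previous run

  # Import only rows modified since the last run ("lastmodified" mode needs a
  # timestamp column such as updated_at in the MySQL table).
  # DB_PASS is assumed to be supplied via the environment.
  sqoop import \
    --connect jdbc:mysql://dbhost/shop \
    --username etl --password "$DB_PASS" \
    --table orders \
    --check-column updated_at \
    --incremental lastmodified \
    --last-value "$(cat "$STATE")" \
    --target-dir "$TARGET"

  # Point the external, date-partitioned table at the new directory.
  hive -e "ALTER TABLE orders_raw ADD IF NOT EXISTS PARTITION
             (year='$YEAR', month='$MONTH', day='$DAY')
           LOCATION '$TARGET';"

  # Record the new high-water mark for the next run.
  date '+%Y-%m-%d %H:%M:%S' > "$STATE"

(Alternatively, a saved Sqoop job -- sqoop job --create ... -- import --incremental lastmodified ... -- tracks --last-value in Sqoop's own metastore, which removes the need for the state file.)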