Looks good, but a few suggestions. If you can eliminate duplicates, etc., as
you ingest the data into HDFS, that removes a cleansing step later on. Note
that if the target directory in HDFS IS the specified location for an
external Hive table/partition, then there is no separate step to "load
in Hive as External Table". It's already there!
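
For example, a minimal sketch of such a table (the name, columns, and path
below are just placeholders, not your actual schema):

  hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
      id          BIGINT,
      customer_id BIGINT,
      status      STRING,
      updated_at  STRING
    )
    PARTITIONED BY (year INT, month INT, day INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/orders';"

Point Sqoop's --target-dir at (a subdirectory of) that location and the data
is queryable as soon as the files land and the partition is added.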

Your "transform data..." is a common pattern; stage "raw" data into a
location, then use Hive (or Pig) to transform it into the final form and
INSERT INTO the final Hive table.
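
As a sketch of that step, with made-up table and column names:

  hive -e "
    INSERT INTO TABLE orders_reporting
    SELECT o.id, o.status, c.country, o.updated_at
    FROM orders_raw o
    JOIN customers_raw c ON (o.customer_id = c.id)
    WHERE o.year = 2012 AND o.month = 12 AND o.day = 24;"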

dean

On Mon, Dec 24, 2012 at 9:34 AM, Ibrahim Yakti <iya...@souq.com> wrote:

> Thanks Dean for the great reply. Setting up the incremental import should be
> easy, but if I partition my data, how will Hive get me only the updated rows,
> considering that a row may have multiple fields that are updated over time?
> How will I manage the tables that are based on multiple sources? And do you
> recommend importing the data into HDFS instead of directly into Hive? Won't
> we have a lot of duplicated records then?
>
> Regarding automation, we were thinking of using the sqoop-job command or
> cron jobs, as you suggested.
>
> So, the suggested flow is as follows:
>
> MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in
> Hive as External Table ---(Transform Data & Join Tables)--> Save it in Hive
> tables for reporting.
>
>
> Correct?
>
> Appreciated.
>
>
> --
> Ibrahim
>
>
> On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> This is not as hard as it sounds. The hardest part is setting up the
>> incremental query against your MySQL database. Then you can write the
>> results to new files in the HDFS directory for the table and Hive will see
>> them immediately. Yes, even though Hive doesn't support updates, it doesn't
>> care how many files are in the directory. The trick is to avoid lots of
>> little files.
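>>
>> A rough sketch of the incremental import (the connect string, table, and
>> column names are placeholders for whatever you actually have):
>>
>>   sqoop import \
>>     --connect jdbc:mysql://dbhost/shop \
>>     --username etl -P \
>>     --table orders \
>>     --incremental lastmodified \
>>     --check-column updated_at \
>>     --last-value "2012-12-23 00:00:00" \
>>     --append \
>>     --target-dir /data/raw/orders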
>>
>> As others have suggested, you should consider partitioning the data,
>> perhaps by time. Say you import a few HDFS blocks' worth of data each
>> day; then year/month/day partitioning will speed up your Hive queries.
>> You'll need to add the partitions to the table as you go, but you can add
>> them in batches, e.g., once a month for all of that month's partitions.
>> Hive doesn't care if the partition directories don't exist yet or are
>> empty. I also recommend using an external table, which gives you more
>> flexibility over directory layout, etc.
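>>
>> Adding a day's partition then looks roughly like this (table name and
>> paths are placeholders):
>>
>>   hive -e "ALTER TABLE orders_raw ADD IF NOT EXISTS
>>            PARTITION (year=2012, month=12, day=24)
>>            LOCATION '/data/raw/orders/2012/12/24';"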
>>
>> Sqoop might be the easiest tool for importing the data, as it will even
>> generate a Hive table schema from the original MySQL table. However, that
>> feature may not be useful in this case, as you already have the table.
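>>
>> For reference, Sqoop's create-hive-table tool (or the --hive-import flag
>> on a regular import) handles that schema generation; something like:
>>
>>   sqoop create-hive-table \
>>     --connect jdbc:mysql://dbhost/shop \
>>     --username etl -P \
>>     --table orders \
>>     --hive-table orders_raw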
>>
>> I think Oozie is horribly complex to use and overkill for this purpose. A
>> simple bash script triggered periodically by cron is all you need. If you
>> aren't using a partitioned table, you have a single sqoop command to run.
>> If you have partitioned data, you'll also need a Hive statement in the
>> script to create the partition, unless you create the partitions in batch
>> once a month, as above.
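>>
>> A minimal sketch of such a script (paths, names, and the password handling
>> are placeholders; a saved "sqoop job" can track --last-value for you
>> instead of computing it by hand):
>>
>>   #!/bin/bash
>>   # Partition values for yesterday's data.
>>   year=$(date -d yesterday +%Y)
>>   month=$(date -d yesterday +%m)
>>   day=$(date -d yesterday +%d)
>>
>>   # Pull rows modified since yesterday into the day's directory.
>>   sqoop import \
>>     --connect jdbc:mysql://dbhost/shop \
>>     --username etl --password "$MYSQL_PASSWORD" \
>>     --table orders \
>>     --incremental lastmodified \
>>     --check-column updated_at \
>>     --last-value "$year-$month-$day 00:00:00" \
>>     --target-dir /data/raw/orders/$year/$month/$day
>>
>>   # Register the new partition so Hive queries see the data.
>>   hive -e "ALTER TABLE orders_raw ADD IF NOT EXISTS
>>            PARTITION (year=$year, month=$month, day=$day)
>>            LOCATION '/data/raw/orders/$year/$month/$day';"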
>>
>> Hope this helps,
>> dean
>>
>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <iya...@souq.com> wrote:
>>
>>> Hi All,
>>>
>>> We are new to Hadoop and Hive. We are trying to use Hive to run analytical
>>> queries, and we are using Sqoop to import data into Hive. In our RDBMS the
>>> data is updated very frequently, and this needs to be reflected in Hive.
>>> Hive does not support update/delete, but there are many workarounds for
>>> this task.
>>>
>>> What we have in mind is importing all the tables into Hive as is, then
>>> building the required tables for reporting.
>>>
>>> My questions are:
>>>
>>>    1. What is the best way to reflect MySQL updates in Hive with
>>>    minimal resources?
>>>    2. Is Sqoop the right tool to do the ETL?
>>>    3. Is Hive the right tool for this kind of query, or should we
>>>    look for alternatives?
>>>
>>> Any hint will be useful, thanks in advance.
>>>
>>> --
>>> Ibrahim
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
