This is not as hard as it sounds. The hardest part is setting up the
incremental query against your MySQL database. Then you can write the
results to new files in the HDFS directory for the table and Hive will see
them immediately. Yes, even though Hive doesn't support updates, it doesn't
care how many files are in the directory. The trick is to avoid lots of
little files.

As others have suggested, you should consider partitioning the data,
perhaps by time. Say you import about a few HDFS blocks-worth of data each
day, then use year/month/day partitioning to speed up your Hive queries.
You'll need to add the partitions to the table as you go, but actually, you
can add those once a month, for example, for all partitions. Hive doesn't
care if the partition directories don't exist yet or the directories are
empty. I also recommend using an external table, which gives you more
flexibility on directory layout, etc.

Sqoop might be the easiest tool for importing the data, as it will even
generate a Hive table schema from the original MySQL table. However, that
feature may not be useful in this case, as you already have the table.

I think Oozie is horribly complex to use and overkill for this purpose. A
simple bash script triggered periodically by cron is all you need. If you
aren't using a partitioned table, you have a single sqoop command to run.
If you have partitioned data, you'll also need a hive statement in the
script to create the partition, unless you do those in batch once a month,
etc., etc.

Hope this helps,
dean

On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti <iya...@souq.com> wrote:

> Hi All,
>
> We are new to hadoop and hive, we are trying to use hive to
> run analytical queries and we are using sqoop to import data into hive, in
> our RDBMS the data updated very frequently and this needs to be reflected
> to hive. Hive does not support update/delete but there are many workarounds
> to do this task.
>
> What's in our mind is importing all the tables into hive as is, then we
> build the required tables for reporting.
>
> My questions are:
>
>    1. What is the best way to reflect MySQL updates into Hive with
>    minimal resources?
>    2. Is sqoop the right tool to do the ETL?
>    3. Is Hive the right tool to do this kind of queries or we should
>    search for alternatives?
>
> Any hint will be useful, thanks in advanced.
>
> --
> Ibrahim
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Reply via email to