Hi Ruslan,

The solution Esteban pointed out was:

1. Import the lookup data from the RDBMS into HDFS/Hive (you can fire any ad hoc query here). If the data is just a few MBs, one or two mappers/connections are enough.

2. The lookup on this smaller data can then be achieved by joining it with the larger table. Since the lookup table is small, enable map joins so that the lookup table goes into the distributed cache and is used by the map tasks for the join.

The two sequential steps mentioned above can be scheduled using a workflow manager such as Oozie. In simple terms, you can place these steps in order in a shell script and just execute the script.
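Something along these lines (a rough sketch only; the connection string, credentials, and table/column names below are placeholders you would replace with your own):

#!/bin/bash
# Sketch: import the small lookup table with Sqoop, then run the map join in Hive.
# All connection details and table/column names are placeholders.
set -e

# Step 1: pull the lookup table out of the RDBMS into Hive
# (one mapper is enough for a few MBs of data).
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username etl --password 'changeme' \
  --table lookup_table \
  --hive-import --hive-overwrite --hive-table lookup_table \
  -m 1

# Step 2: join against the big table. With these settings Hive converts the
# join to a map join and ships the small table to every map task via the
# distributed cache; the MAPJOIN hint makes the intent explicit on older Hive versions.
hive -e "
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;
INSERT OVERWRITE TABLE joined_result
SELECT /*+ MAPJOIN(l) */ b.*, l.val
FROM big_table b
JOIN lookup_table l ON (b.lookup_id = l.id);
"

With that in place the RDBMS is hit only once per run, during the Sqoop step, and the map tasks read the lookup data locally from the distributed cache.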
Regards,
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Ruslan Al-Fakikh <metarus...@gmail.com>
Date: Sat, 16 Jun 2012 04:40:36
To: <user@hive.apache.org>
Reply-To: user@hive.apache.org
Subject: Re: Quering RDBMS table in a Hive query

Hi Esteban,

Your solution is what I am trying to avoid: having to keep the HDFS data up to date. I know I can easily schedule a dependency between the Sqoop import job and the Hive query job, and we currently have a scheduling tool (Opswise) for such things. But what if I just want to run an ad hoc query and forget to re-import the lookup data? Maybe there is a way to attach the Sqoop import as a hook to a particular Hive table, making it run before every query?

I do understand the problem of having too many connections, though. I would like to query the database only once and distribute the result to all the mappers via the distributed cache or something like it. Isn't there a way to do that?

Ruslan

On Fri, Jun 15, 2012 at 9:43 PM, Esteban Gutierrez <este...@cloudera.com> wrote:
> Hi Ruslan,
>
> Jan's approach sounds like a good workaround only if you can use the output
> in a mapjoin, but I don't think it will scale nicely if you have a very
> large number of tasks, since that will translate into DB connections to
> MySQL. I think a more scalable and reliable way is to schedule an Oozie
> workflow to transfer the data from MySQL to HDFS using Sqoop and trigger the
> Hive query once the transfer is done.
>
> cheers!
> esteban.
>
> --
> Cloudera, Inc.
>
>
> On Fri, Jun 15, 2012 at 10:28 AM, Ruslan Al-Fakikh <metarus...@gmail.com>
> wrote:
>>
>> Thanks Jan
>>
>> On Fri, Jun 15, 2012 at 4:35 PM, Jan Dolinár <dolik....@gmail.com> wrote:
>> > On 6/15/12, Ruslan Al-Fakikh <ruslan.al-fak...@jalent.ru> wrote:
>> >> I didn't know InputFormat and LineReader could help, though I didn't
>> >> look at them closely. I was thinking about implementing a
>> >> table-generating function (UDTF) if there is no already implemented
>> >> solution.
>> >
>> > Both are possible, an InputFormat and/or a UD(T)F. It all depends on
>> > what you need. I actually use both: in the InputFormat I load lists of
>> > allowed values to validate the data, and in a UDF I query another
>> > database for values needed only in some queries. Generally, I'd use an
>> > InputFormat in situations where every job over a given table needs the
>> > additional data from the RDBMS. Conversely, in situations where only a
>> > few jobs out of many need the RDBMS connection, I would use a UDF.
>> >
>> > I think the difference in performance between the two is rather small,
>> > if any. Also, a UDF is easier to write, so it might be the "weapon of
>> > choice", at least if you don't already use a custom InputFormat.
>> >
>> > Jan
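A minimal sketch of the UDF approach Jan describes, for a MySQL lookup (the JDBC URL, credentials, driver class, and table/column names are illustrative; it assumes the MySQL driver jar is available to the tasks along with the UDF jar):

package com.example.hive.udf;  // illustrative package name

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * Looks up a value in a small MySQL table for a given key.
 * One JDBC connection per task JVM and a per-key result cache,
 * so the database is queried at most once per distinct key per task.
 */
public class RdbmsLookupUDF extends UDF {

  private Connection conn;
  private PreparedStatement stmt;
  private final Map<String, String> cache = new HashMap<String, String>();

  public Text evaluate(Text key) {
    if (key == null) {
      return null;
    }
    String k = key.toString();
    String v = cache.get(k);
    if (v == null) {
      try {
        if (conn == null) {
          // Lazily open a single connection per task.
          Class.forName("com.mysql.jdbc.Driver");
          conn = DriverManager.getConnection(
              "jdbc:mysql://dbhost:3306/lookupdb", "user", "password");
          stmt = conn.prepareStatement(
              "SELECT val FROM lookup_table WHERE id = ?");
        }
        stmt.setString(1, k);
        ResultSet rs = stmt.executeQuery();
        v = rs.next() ? rs.getString(1) : "";
        rs.close();
        cache.put(k, v);
      } catch (Exception e) {
        throw new RuntimeException("Lookup failed for key " + k, e);
      }
    }
    return new Text(v);
  }
}

The UDF jar and the JDBC driver would be registered with ADD JAR and the function with CREATE TEMPORARY FUNCTION, after which it can be called like any built-in function. The per-key cache is what keeps the connection count at one per task rather than one per row, which was the concern raised earlier in the thread.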