Hi All,

In order to support approximate queries in Hive and BlinkDB (
http://blinkdb.org/), I am working towards implementing the bootstrap
primitive (http://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in Hive
that can help us quantify the "error" incurred by a query Q when it
operates on a small sample S of data. This method essentially requires
launching the query Q simultaneously on a large number of samples of
original data (typically >=100) .

The downside to this is of course that we have to launch the same query 100
times but the upside is that each of this query would be so small that it
can be executed on a single machine. So, in order to do this efficiently,
we would ideally like to execute 100 instances of the query simultaneously
on the master and all available worker nodes. Furthermore, in order to
avoid generating the query plan 100 times on the master, we can do either
of the two things:

   1. Generate the query plan once on the master, serialize it and ship it
   to the worker nodes.
   2. Enable the worker nodes to access the Metastore so that they can
   generate the query plan on their own in parallel.

Given that making the query plan serializable (1) would require a lot of
refactoring of the current code, is (2) a viable option? Moreover, since
(2) will increase the load on the existing Metastore by 100x, is there any
other option?

Thanks,
Sameer

-- 
Sameer Agarwal
Computer Science | AMP Lab | UC Berkeley
http://cs.berkeley.edu/~sameerag

Reply via email to