Hi all,

In order to support approximate queries in Hive and BlinkDB (http://blinkdb.org/), I am working on implementing the bootstrap primitive (http://en.wikipedia.org/wiki/Bootstrapping_(statistics)) in Hive, which can help us quantify the "error" incurred by a query Q when it operates on a small sample S of the data. This method essentially requires launching the query Q simultaneously on a large number of samples of the original data (typically >= 100).
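To make the procedure concrete, here is a minimal sketch of the bootstrap in plain Python, not Hive code. The names bootstrap_error and avg_query, the synthetic sample, and the fixed seed are all assumptions made for illustration; the query Q is modeled as a simple function over in-memory rows.

```python
import random
import statistics


def bootstrap_error(sample, query, num_resamples=100, seed=0):
    """Estimate the standard error of query(sample) via the bootstrap.

    `sample` stands in for the small sample S, `query` for the query Q.
    Both names are hypothetical and used only for this sketch.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        # Draw n rows with replacement from S to form one resample.
        resample = rng.choices(sample, k=n)
        # Run the same query on the resample.
        estimates.append(query(resample))
    # The spread of the per-resample answers approximates the error of Q on S.
    return statistics.stdev(estimates)


def avg_query(rows):
    # Stand-in for an aggregate such as SELECT AVG(x) FROM S.
    return sum(rows) / len(rows)


# Synthetic sample S, assumed purely for illustration.
sample = [float(x) for x in range(1, 1001)]
error = bootstrap_error(sample, avg_query)
```

Each of the 100 iterations is independent of the others, which is why they can in principle run in parallel.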

The downside of this is, of course, that we have to launch the same query 100 times. The upside is that each of these queries would be so small that it can be executed on a single machine. So, in order to do this efficiently, we would ideally like to execute 100 instances of the query simultaneously on the master and all available worker nodes.

Furthermore, in order to avoid generating the query plan 100 times on the master, we can do either of two things:

1. Generate the query plan once on the master, serialize it, and ship it to the worker nodes.
2. Enable the worker nodes to access the Metastore so that they can generate the query plan on their own in parallel.

Given that making the query plan serializable (1) would require a lot of refactoring of the current code, is (2) a viable option? Moreover, since (2) will increase the load on the existing Metastore by 100x, is there any other option?

Thanks,
Sameer

--
Sameer Agarwal
Computer Science | AMP Lab | UC Berkeley
http://cs.berkeley.edu/~sameerag