On Oct 3, 1:16 pm, Ross Gayler <r.gay...@gmail.com> wrote:
> Thanks Michael.
>
> > This sounds very similar to NoSQL and Map/Reduce?
>
> I'm not so sure about that (which may be mostly due to my ignorance of NoSQL and Map/Reduce). The amount of data involved in my problem is quite small, and any infrastructure aimed at massive scaling may bring a load of conceptual and implementation baggage that is unnecessary/unhelpful.
>
> Let me restate my problem:
>
> I have a bunch of statistician colleagues with minimal programming skills. (I am also a statistician, but with slightly better programming skills.) As part of our analytical workflow we take data sets and preprocess them by adding new variables that are typically aggregate functions of other values. We source the data from a database/file, add the new variables, and store the augmented data in a database/file for subsequent, extensive and extended (a couple of months) analysis with other tools (off-the-shelf statistical packages such as SAS and R). After the analyses are complete, some subset of the preprocessing calculations needs to be implemented in an operational environment. This is currently done by completely re-implementing them in yet another fairly basic imperative language.
>
> The preprocessing in our analytical environment is usually written in a combination of SQL and the SAS data manipulation language (think of it as a very basic imperative language with macros but no user-defined functions). The statisticians take a long time to get their preprocessing right (they're not good at nested queries in SQL and make all the usual errors iterating over arrays of values with imperative code). So my primary goal is to find/build a query language that minimises the cognitive impedance mismatch with the statisticians and minimises their opportunity for error.
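As a minimal sketch of the idea above - not anything from an existing library - a case could be a Clojure map, and each derived variable a pure function of that map, so there is no explicit iteration for the statistician to get wrong. All names here (`case-data`, `derived-vars`, the `:bal-*`/`:lim-*` keys) are invented for illustration:

```clojure
;; Hypothetical example: a case is a map of primitive values.
(def case-data {:bal-1 100 :bal-2 250 :lim-1 500 :lim-2 500})

;; Each derived variable is a pure function of the case map.
(def derived-vars
  {:total-bal (fn [c] (+ (:bal-1 c) (:bal-2 c)))
   :max-lim   (fn [c] (max (:lim-1 c) (:lim-2 c)))
   :util      (fn [c] (/ (+ (:bal-1 c) (:bal-2 c))
                         (+ (:lim-1 c) (:lim-2 c))))})

;; Augment a case with all derived values; the iteration lives in
;; one place instead of in every preprocessing script.
(defn augment [c]
  (reduce-kv (fn [acc k f] (assoc acc k (f c))) c derived-vars))

(augment case-data)
;; => {:bal-1 100 :bal-2 250 :lim-1 500 :lim-2 500
;;     :total-bal 350 :max-lim 500 :util 7/20}
```

The declarative spec (`derived-vars`) is data, so it could in principle be compiled to SQL, SAS, or Java bytecode for the different deployment environments.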
> Another goal is that the same mechanism should be applicable in our statistical analytical environment and the corporate deployment environment(s). The most different operational environment is online and realtime. The data describing one case gets thrown at some code that (among other things) implements the preprocessing with some embedded imperative code. So, linking in some Java bytecode to do the preprocessing on a single case sounds feasible, whereas replacing/augmenting the current corporate infrastructure with NoSQL and a CPU farm is more aggravation with corporate IT than I am paid for.
>
> The final goal is that the preprocessing mechanism should be no slower than the current methods in each of the deployment environments. The hardest one is probably our statistical analysis environment, but there we do have the option of farming the work across multiple CPUs if needed.
>
> Let me describe the computational scale of the problem - it is really quite small.
>
> Data is organised as completely independent cases. One case might contain 500 primitive values for a total size of ~1kb. Preprocessing might calculate another 500 values, each of those being an aggregate function of some subset (say, 20 values) of the original 500 values. Currently, all these new values are calculated independently of each other, but there is a lot of overlap of intermediate results and, therefore, potential for optimisation of the computational effort required to calculate the entire set of results within a single case.
>
> In our statistical analytical environment the preprocessing is carried out in batch mode. A large dataset might contain 1M cases (~1GB of data). We can churn through the preprocessing at ~300 cases/second on a modest PC. Higher throughput in our analytical environment would be a bonus, but not essential.
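One cheap way to exploit the overlap of intermediate results described above is simply to name the shared intermediates once per case. A sketch (all variable names invented, not from the thread):

```clojure
;; Hypothetical: several aggregates share the same subset of
;; primitives, so compute the shared intermediate once in a `let`
;; instead of once per derived variable.
(defn preprocess [case-vals]
  (let [bals  (map case-vals [:bal-1 :bal-2 :bal-3]) ; shared subset
        total (reduce + bals)]                       ; shared intermediate
    (assoc case-vals
           :total-bal total
           :mean-bal  (/ total (count bals))
           :max-bal   (apply max bals))))

(preprocess {:bal-1 10 :bal-2 20 :bal-3 60})
;; => {:bal-1 10 :bal-2 20 :bal-3 60
;;     :total-bal 90 :mean-bal 30 :max-bal 60}
```

At ~1k values per case this fits comfortably in memory, and cases are independent, so batch throughput could be scaled with `pmap` over cases if the ~300 cases/second ever became a bottleneck.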
> So I see the problem as primarily about the conceptual design of the query language, with some side constraints about implementation compatibility across a range of deployment environments and adequate throughput performance.
>
> As I mentioned in an earlier post, I'll probably assemble a collection of representative queries, express them in a variety of query languages, and try to assess how compatible the different query languages are with the way my colleagues want to think about the problem.
Seeing examples (perhaps quite a few of them) will certainly be useful. (Due to my non-stats background) I may not have understood your use cases correctly, but are these helpful for you?

http://github.com/MrHus/rql
http://bitbucket.org/kumarshantanu/sqlrat/wiki/Clause

The SQLRat clause API will be part of the 0.2 release (expected very soon).

Regards,
Shantanu

> Ross
>
> On Oct 3, 11:31 am, Michael Ossareh <ossa...@gmail.com> wrote:
>
> > On Fri, Oct 1, 2010 at 17:55, Ross Gayler <r.gay...@gmail.com> wrote:
> > > Hi,
> > >
> > > This is probably an abuse of the Clojure forum, but it is a bit Clojure-related and strikes me as the sort of thing that a bright, eclectic bunch of Clojure users might know about. (Plus I'm not really a software person, so I need all the help I can get.)
> > >
> > > I am looking at the possibility of finding/building a declarative data aggregation language operating on a small relational representation. Each query identifies a set of rows satisfying some relational predicate and calculates some aggregate function of a set of values (e.g. min, max, sum). There might be ~20 input tables of up to ~1k rows. The data is immutable - it gets loaded and never changed. The results of the queries get loaded as new rows in other tables and are eventually used as input to other computations. There might be ~1k queries. There is no requirement for transaction management or any inherent concurrency (there is only one consumer of the results). There is no requirement for persistent storage - the aggregation is the only thing of interest. I would like the query language to map as directly as possible to the task (SQL is powerful enough, but can get very contorted and opaque for some of the queries).
> > > There is considerable scope for optimisation of the calculations over the total set of queries, as partial results are common across many of the queries.
> > >
> > > I would like to be able to do this in Clojure (which I have not yet used), partly for some very practical reasons to do with Java interop and partly because Clojure looks very cool.
> > >
> > > * Is there any existing Clojure functionality which looks like a good fit to this problem?
> > >
> > > I have looked at Clojure-Datalog. It looks like a pretty good fit except that it lacks the aggregation operators. Apart from that, the deductive power is probably greater than I need (although that doesn't necessarily cost me anything). I know that there are other (non-Clojure) Datalog implementations that have been extended with aggregation operators (e.g. DLV http://www.mat.unical.it/dlv-complex/dlv-complex).
> > >
> > > Tutorial D (what SQL should have been, http://en.wikipedia.org/wiki/D_%28data_language_specification%29#Tuto...) might be a good fit, although once again, there is probably a lot of conceptual and implementation baggage (e.g. Rel http://dbappbuilder.sourceforge.net/Rel.php) that I don't need.
> > >
> > > * Is there a Clojure implementation of something like Tutorial D?
> > >
> > > If there is no implementation of anything that meets my requirements then I would be willing to look at the possibility of creating a domain-specific language. However, I am wary of launching straight into that because of the probability that anything I dreamed up would be an ad hoc kludge rather than a semantically complete and consistent language. Optimised execution would be a whole other can of worms.
> > >
> > > * Does anyone know of any DSLs/formalisms for declaratively specifying relational data aggregations?
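For the "rows satisfying a relational predicate, then an aggregate" shape described in the quoted post, `clojure.set` on in-memory relations (sets of maps) may already be close enough at ~20 tables of ~1k rows. A sketch with invented table and column names:

```clojure
;; Hypothetical relation: a set of maps, one map per row.
(require '[clojure.set :as set])

(def accounts
  #{{:id 1 :type :loan :balance 100}
    {:id 2 :type :card :balance 250}
    {:id 3 :type :loan :balance 40}})

;; Select the rows satisfying `pred`, project column `col`,
;; and apply aggregate `agg` to the resulting values.
(defn aggregate [rel pred col agg]
  (agg (map col (set/select pred rel))))

;; "max balance over rows where type = :loan"
(aggregate accounts #(= (:type %) :loan) :balance #(apply max %))
;; => 100
```

Since relations and queries are plain data/functions here, common sub-selections could be bound to names and shared across the ~1k queries, which is exactly the partial-result overlap mentioned above.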
> > > Thanks
> > >
> > > Ross
> >
> > This sounds very similar to NoSQL and Map/Reduce? http://www.basho.com/Riak.html
> >
> > Where your predicate is a reduce fn?

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en