Thanks Ross, that gives me a better insight into your environment.

> In the online environment single cases are fetched from a
> database with no aggregation capability and fired at the service that
> contains the aggregation functionality.

What does a single case consist of? Is it just a result set (i.e. the
output of running an SQL query)? Maybe an example will help.
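For instance (purely hypothetical, since I don't know your actual
schema), do you mean a "case" is essentially a realized result set -
in Clojure terms something like this?

    ;; Hypothetical shape of one "case": a seq of column-name/value
    ;; maps, one row per primitive value, ~500 rows per case.
    (def example-case
      [{:case-id 42, :var-name "txn_amount_01", :value 103.50}
       {:case-id 42, :var-name "txn_amount_02", :value 12.00}
       ;; ... and so on, ~500 primitive values ...
       ])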
> This would make sense if the IMDB query language supported the
> aggregations we want in a way our statisticians can use, and the IMDB
> is sufficiently lightweight that we can link it into our service as a
> function library (rather than a separate server connected by some
> communication protocol).

I know that Apache Derby[1], HSQLDB[2] (now called HyperSQL) and H2[3]
can readily be used from Clojure as in-memory databases without needing
to start a server. You can find examples here:

http://bitbucket.org/kumarshantanu/clj-dbcp/src

(though Clj-DBCP 0.1 is not actually released yet; it is expected soon,
along with SQLRat 0.2)

[1] http://db.apache.org/derby/
[2] http://hsqldb.org/
[3] http://www.h2database.com/html/main.html

There are likely more in-memory databases that I am not aware of at the
moment. As long as they have JDBC drivers, using them should be easy.
Since your main criterion is SQL features/functions, I guess you will
need to find out which IMDB suits you best.
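To make the load/aggregate/clear cycle concrete, here is a rough,
untested sketch against an unnamed in-memory H2 database using plain
JDBC interop (the table and column names are made up for illustration):

    ;; One connection = one private in-memory H2 database; it vanishes
    ;; when the connection closes, so each case starts from a clean
    ;; slate with no explicit clearing step needed.
    (import '(java.sql DriverManager))

    (Class/forName "org.h2.Driver")

    (defn aggregate-case [case-rows]
      (with-open [conn (DriverManager/getConnection "jdbc:h2:mem:" "sa" "")]
        ;; create a scratch table for this case
        (with-open [st (.createStatement conn)]
          (.execute st "CREATE TABLE case_vals (var_name VARCHAR(64), val DOUBLE)"))
        ;; load the single case
        (with-open [ps (.prepareStatement
                         conn "INSERT INTO case_vals (var_name, val) VALUES (?, ?)")]
          (doseq [{:keys [var-name value]} case-rows]
            (.setString ps 1 var-name)
            (.setDouble ps 2 (double value))
            (.executeUpdate ps)))
        ;; run one (of potentially ~500) aggregations, expressed in SQL
        (with-open [st (.createStatement conn)
                    rs (.executeQuery
                         st "SELECT SUM(val) FROM case_vals WHERE var_name LIKE 'txn_amount%'")]
          (.next rs)
          (.getDouble rs 1))))

Whether something like this is fast enough for your ~300 cases/second,
and whether the IMDB's aggregate functions are expressive enough for
your statisticians, is exactly what you would need to measure.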
I will be happy to add any missing bits to Clj-DBCP and SQLRat (I am
the author), if you let me know what is missing. Please feel free to
ping me off the list.

Regards,
Shantanu

On Oct 4, 4:57 pm, Ross Gayler <r.gay...@gmail.com> wrote:
> Thanks for the two posts Shantanu.
>
> The rql and Clause examples are useful, both as potential parts of a
> solution and as examples of how query/aggregation stuff may be done
> in Clojure style. It is conceivable that I may end up deciding all I
> need is a DSL that covers the kinds of aggregations of interest to
> us and translates them to SQL via SQLRat.
>
> With respect to your three suggestions in your second post - things
> get a bit more interesting. A major part of the problem (which I
> failed to emphasize/mention in my first post) is that I really want
> this aggregation stuff to work in two deployment environments: a
> batch-oriented statistical development environment that we control,
> and an online, realtime transactional environment that corporate IT
> controls. In the online environment single cases are fetched from a
> database with no aggregation capability and fired at the service
> that contains the aggregation functionality. We control what happens
> inside that service but have close to zero chance of changing
> anything outside it - so in that online environment we have no
> possibility of putting aggregation into the datasource DB that feeds
> our service. However, it *might* be reasonable to put an in-memory
> database inside our service, purely to take advantage of the
> aggregation facilities provided by that IMDB. A single case would
> get loaded into the IMDB, the aggregation would be carried out in
> that IMDB, the results exported, and the IMDB cleared ready for the
> next case. This would make sense if the IMDB query language
> supported the aggregations we want in a way our statisticians can
> use, and the IMDB is sufficiently lightweight that we can link it
> into our service as a function library (rather than a separate
> server connected by some communication protocol).
>
> In our statistical development environment things are different. The
> source data happens to live in a database, and we query that to get
> the subset of cases we are interested in (say, 1M of them). In that
> subset, each case can be treated completely in isolation and our
> aggregations will use 100% of the component data in each case. An
> individual aggregation might touch 20% of the data in one case, but
> we might have ~500 different aggregations from the same case, so
> every value gets used in lots of different aggregations. So although
> I am interested in DB query languages as a way of specifying
> aggregations, I am not so convinced that I would actually use a
> full-blown DB to implement those aggregations.
>
> Cheers
>
> Ross
>
> On Oct 4, 3:24 am, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
> > I looked at Tutorial D - it's pretty interesting. Here are a few
> > top-of-my-head observations:
> >
> > * Which RDBMS do you use? If you are free to choose a new RDBMS,
> > probably you can pick one that provides most of the computational
> > functionality (as SQL constructs/functions) out of the box, for
> > example Oracle, MS SQL Server, PostgreSQL etc. The reason is
> > performance - the more you can compute within the database, the
> > less data you will need to fetch in order to process it.
> >
> > * The kinds of computations you need look like a superset of what
> > SQL can provide. So I think you will have to re-state the problem
> > in terms of computations/iterations over SQL result sets, which is
> > probably what you are currently doing in the imperative language.
> > If you can split every problem into (a) the computation you need
> > versus (b) the SQL queries you need to fire, then you can easily
> > do it using Clojure itself without needing any DSL.
> >
> > * If you want a DSL for this, I suppose it should make maximum use
> > of the database's inbuilt query functions/constructs to maximize
> > performance. This also means the DSL implementation needs to be
> > database-aware. Secondly, it is possible to write functions in
> > Clojure that would emit appropriate SQL clauses (as long as that
> > is doable) to compute certain pieces of information. Looking at
> > multiple use cases (covering various aspects - fuzzy vs
> > deterministic) will be helpful.
> >
> > Regards,
> > Shantanu
> >
> > On Oct 3, 5:10 pm, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
> > > On Oct 3, 1:16 pm, Ross Gayler <r.gay...@gmail.com> wrote:
> > > > Thanks Michael.
> > > >
> > > > > This sounds very similar to NoSQL and Map/Reduce?
> > > >
> > > > I'm not so sure about that (which may be mostly due to my
> > > > ignorance of NoSQL and Map/Reduce). The amount of data involved
> > > > in my problem is quite small, and any infrastructure aimed at
> > > > massive scaling may bring a load of conceptual and
> > > > implementation baggage that is unnecessary/unhelpful.
> > > >
> > > > Let me restate my problem:
> > > >
> > > > I have a bunch of statistician colleagues with minimal
> > > > programming skills. (I am also a statistician, but with
> > > > slightly better programming skills.) As part of our analytical
> > > > workflow we take data sets and preprocess them by adding new
> > > > variables that are typically aggregate functions of other
> > > > values. We source the data from a database/file, add the new
> > > > variables, and store the augmented data in a database/file for
> > > > subsequent, extensive and extended (a couple of months)
> > > > analysis with other tools (off-the-shelf statistical packages
> > > > such as SAS and R). After the analyses are complete, some
> > > > subset of the preprocessing calculations needs to be
> > > > implemented in an operational environment. This is currently
> > > > done by completely re-implementing them in yet another fairly
> > > > basic imperative language.
> > > > The preprocessing in our analytical environment is usually
> > > > written in a combination of SQL and the SAS data manipulation
> > > > language (think of it as a very basic imperative language with
> > > > macros but no user-defined functions). The statisticians take
> > > > a long time to get their preprocessing right (they're not good
> > > > at nested queries in SQL, and they make all the usual errors
> > > > when iterating over arrays of values with imperative code). So
> > > > my primary goal is to find/build a query language that
> > > > minimises the cognitive impedance mismatch with the
> > > > statisticians and minimises their opportunity for error.
> > > >
> > > > Another goal is that the same mechanism should be applicable
> > > > in our statistical analytical environment and the corporate
> > > > deployment environment(s). The most different operational
> > > > environment is online and realtime. The data describing one
> > > > case gets thrown at some code that (among other things)
> > > > implements the preprocessing with some embedded imperative
> > > > code. So, linking in some Java bytecode to do the
> > > > preprocessing on a single case sounds feasible, whereas
> > > > replacing/augmenting the current corporate infrastructure with
> > > > NoSQL and a CPU farm is more aggravation with corporate IT
> > > > than I am paid for.
> > > >
> > > > The final goal is that the preprocessing mechanism should be
> > > > no slower than the current methods in each of the deployment
> > > > environments. The hardest one is probably our statistical
> > > > analysis environment, but there we do have the option of
> > > > farming the work across multiple CPUs if needed.
> > > >
> > > > Let me describe the computational scale of the problem - it is
> > > > really quite small.
> > > >
> > > > Data is organised as completely independent cases. One case
> > > > might contain 500 primitive values, for a total size of ~1kb.
> > > > Preprocessing might calculate another 500 values, each of
> > > > those being an aggregate function of some subset (say, 20
> > > > values) of the original 500 values. Currently, all these new
> > > > values are calculated independently of each other, but there
> > > > is a lot of overlap of intermediate results and, therefore,
> > > > potential for optimisation of the computational effort
> > > > required to calculate the entire set of results within a
> > > > single case.
> > > >
> > > > In our statistical analytical environment the preprocessing is
> > > > carried out in batch mode. A large dataset might contain 1M
> > > > cases (~1GB of data). We can churn through the preprocessing
> > > > at ~300 cases/second on a modest PC. Higher throughput in our
> > > > analytical environment would be a bonus, but not essential.
> > > >
> > > > So I see the problem as primarily about the conceptual design
> > > > of the query language, with some side constraints about
> > > > implementation compatibility across a range of deployment
> > > > environments and adequate throughput performance.
> > > >
> > > > As I mentioned in an earlier post, I'll probably assemble a
> > > > collection of representative queries, express them in a
> > > > variety of query languages, and try to assess how compatible
> > > > the different query languages are with the way my colleagues
> > > > want to think about the problem.
> > >
> > > Seeing examples (perhaps quite a few of them) will certainly be
> > > useful. (Due to my non-stats background) I may not have
> > > understood your use-cases correctly, but are these helpful for
> > > you?
> > >
> > > http://github.com/MrHus/rql
> > > http://bitbucket.org/kumarshantanu/sqlrat/wiki/Clause
> > >
> > > The SQLRat Clause API will be part of the 0.2 release (expected
> > > very soon).
> > >
> > > Regards,
> > > Shantanu
> > >
> > > > Ross
> > > >
> > > > On Oct 3, 11:31 am, Michael Ossareh <ossa...@gmail.com> wrote:
> > > > > On Fri, Oct 1, 2010 at 17:55, Ross Gayler <r.gay...@gmail.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This is probably an abuse of the Clojure forum, but it is
> > > > > > a bit Clojure-related and strikes me as the sort of thing
> > > > > > that a bright, eclectic bunch of Clojure users might know
> > > > > > about. (Plus I'm not really a software person, so I need
> > > > > > all the help I can get.)
> > > > > >
> > > > > > I am looking at the possibility of finding/building a
> > > > > > declarative data aggregation language operating on a small
> > > > > > relational representation. Each query...
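P.S. To illustrate the point from my Oct 4 post (quoted above) about
writing Clojure functions that emit appropriate SQL clauses, the kind
of thing I have in mind is roughly this (a hypothetical sketch, not the
actual SQLRat Clause API; the case_vals table and column names are
made up):

    ;; Tiny clause-emitting helpers (hypothetical, for flavour only).
    (defn where-like [col pattern]
      (str (name col) " LIKE '" pattern "'"))

    (defn sum-of [col pred]
      (str "SELECT SUM(" (name col) ") FROM case_vals WHERE " pred))

    ;; usage:
    (sum-of :val (where-like :var_name "txn_amount%"))
    ;;=> "SELECT SUM(val) FROM case_vals WHERE var_name LIKE 'txn_amount%'"

A statistician-friendly DSL could sit on top of a handful of such
emitters, with the generated SQL fired at whichever IMDB you settle on.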