Thanks Shantanu. (Sorry for the slow reply.)

> What does a single case consist of? Is it just a result-set (as a
> consequence of running an SQL query)? Maybe an example will help.
I can't be too specific, but a single case can be thought of as a tiny
relational database with maybe 20 tables. One table will have one row with
some admin and unique identifier information. Another few tables will have
one to a few rows each with identifying information like names and
addresses. The remaining tables will each correspond to a different type of
event that may have occurred, and there will be zero to 100 rows (but
typically only up to 10) in each event table, with 3 or 4 columns describing
the event.

> I know that Apache Derby[1], HSQLDB[2] (now called HyperSQL) and H2[3]
> can be readily used in Clojure as in-memory databases without needing
> to start a server.

Thanks for the links to the in-memory databases.

I think my brain is full now. It'll probably take me a couple of months to
try out some ideas (this isn't on anybody's to-do list but mine).

Thanks to all for your help.

Ross

On Oct 5, 12:53 am, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
> Thanks Ross, that gives me a better insight into your environment.
>
> > In the online environment single cases are fetched from a
> > database with no aggregation capability and fired at the service that
> > contains the aggregation functionality.
>
> What does a single case consist of? Is it just a result-set (as a
> consequence of running an SQL query)? Maybe an example will help.
>
> > This would make sense if the IMDB query language supported the
> > aggregations we want in a way our statisticians can use, and the IMDB
> > is sufficiently lightweight that we can link it into our service as a
> > function library (rather than a separate server connected by some
> > communication protocol).
>
> I know that Apache Derby[1], HSQLDB[2] (now called HyperSQL) and H2[3]
> can be readily used in Clojure as in-memory databases without needing
> to start a server. You can find the examples here:
> http://bitbucket.org/kumarshantanu/clj-dbcp/src (though Clj-DBCP 0.1
> is not actually released yet, expected soon with SQLRat 0.2)
>
> [1] http://db.apache.org/derby/
> [2] http://hsqldb.org/
> [3] http://www.h2database.com/html/main.html
>
> There are likely more in-memory databases I am not aware of at the
> moment. As long as they have JDBC drivers, using them should be easy.
> Since your main criterion is SQL features/functions, I guess you would
> need to find out which IMDB suits better.
>
> I will be happy to add any missing bits to Clj-DBCP and SQLRat (I am
> the author) if you can let me know. Please feel free to ping me off
> the list.
>
> Regards,
> Shantanu
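To make the embedded in-memory database idea concrete, here is a rough,
untested sketch of loading one case's event rows into an in-memory H2
database from Clojure, running an aggregate query, and clearing it for the
next case. It assumes the H2 driver and contrib's clojure.contrib.sql are on
the classpath; the table and column names (payments, days_ago, amount) are
invented for illustration.

;; Sketch only: one "case" loaded into an embedded in-memory H2 database,
;; aggregated with SQL, then dropped so the next case starts clean.
;; Assumes the H2 driver and clojure.contrib.sql are on the classpath;
;; table/column names are invented for illustration.
(require '[clojure.contrib.sql :as sql])

(def case-db {:classname   "org.h2.Driver"
              :subprotocol "h2"
              :subname     "mem:casedb"})   ; in-memory, no server process

(defn aggregate-one-case
  "payment-rows is a seq of [days-ago amount] vectors for a single case."
  [payment-rows]
  (sql/with-connection case-db
    (sql/create-table :payments
                      [:days_ago "INTEGER"]
                      [:amount   "INTEGER"])
    (apply sql/insert-values :payments [:days_ago :amount] payment-rows)
    (let [totals (sql/with-query-results rs
                   ["SELECT COUNT(*) AS n, SUM(amount) AS total
                     FROM payments WHERE days_ago <= 30"]
                   (doall rs))]
      (sql/drop-table :payments)             ; clear, ready for the next case
      totals)))

;; (aggregate-one-case [[3 100] [12 250] [45 80]])
;; => e.g. ({:n 2, :total 350})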

> On Oct 4, 4:57 pm, Ross Gayler <r.gay...@gmail.com> wrote:
>
> > Thanks for the two posts Shantanu.
>
> > The rql and Clause examples are useful, both as potential parts of a
> > solution and as examples of how query/aggregation stuff may be done in
> > Clojure style. It is conceivable that I may end up deciding all I
> > need is a DSL that covers the kinds of aggregations of interest to us
> > and translates them to SQL via SQLrat.
>
> > With respect to your three suggestions in your second post, things
> > get a bit more interesting. A major part of the problem (which I
> > failed to emphasize/mention in my first post) is that I really want
> > this aggregation stuff to work in two deployment environments: a
> > batch-oriented statistical development environment that we control,
> > and an online, realtime transactional environment that corporate IT
> > controls. In the online environment single cases are fetched from a
> > database with no aggregation capability and fired at the service that
> > contains the aggregation functionality. We control what happens inside
> > that service but have close to zero chance of changing anything
> > outside that service - so in that online environment we have no
> > possibility of putting aggregation into the datasource DB that feeds
> > our service. However, it *might* be reasonable to put an in-memory
> > database inside our service, purely to take advantage of the
> > aggregation facilities provided by that IMDB. A single case would get
> > loaded into the IMDB, the aggregation would be carried out in that
> > IMDB, the results exported, and the IMDB cleared ready for the next
> > case. This would make sense if the IMDB query language supported the
> > aggregations we want in a way our statisticians can use, and the IMDB
> > is sufficiently lightweight that we can link it into our service as a
> > function library (rather than a separate server connected by some
> > communication protocol).
>
> > In our statistical development environment things are different. The
> > source data happens to live in a database, and we query that to get
> > the subset of cases we are interested in (say, 1M of them). In that
> > subset, each case can be treated completely in isolation and our
> > aggregations will use 100% of the component data in each case. An
> > individual aggregation might touch 20% of the data in one case, but we
> > might have ~500 different aggregations from the same case, so every
> > value gets used in lots of different aggregations. So although I am
> > interested in DB query languages as a way of specifying aggregations,
> > I am not so convinced that I would actually use a full-blown DB to
> > implement those aggregations.
>
> > Cheers
>
> > Ross
>
> > On Oct 4, 3:24 am, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
>
> > > I looked at Tutorial D - it's pretty interesting. Here are a few
> > > top-of-my-head observations:
>
> > > * Which RDBMS do you use? If you are free to choose a new RDBMS,
> > > probably you can pick one that provides most of the computational
> > > functionality (as SQL constructs/functions) out of the box, for
> > > example Oracle, MS SQL Server, PostgreSQL etc. The reason is
> > > performance: the more you can compute within the database, the less
> > > data you will need to fetch in order to process.
>
> > > * The kinds of computations you need to solve look like a superset of
> > > what SQL can provide. So I think you will have to re-state the
> > > problem in terms of computations/iterations over SQL result-sets,
> > > which is probably what you are currently doing using the imperative
> > > language. If you can split every problem into (a) the computation
> > > you need versus (b) the SQL queries you need to fire, then you can
> > > easily do it using Clojure itself without needing any DSL.
>
> > > * If you want a DSL for this, I suppose it should make maximum use of
> > > the database's inbuilt query functions/constructs to maximize
> > > performance. This also means the DSL implementation needs to be
> > > database-aware. Secondly, it is possible to write functions in Clojure
> > > that would emit appropriate SQL clauses (as long as it is doable) to
> > > compute certain pieces of information. Looking at multiple use cases
> > > (covering various aspects - fuzzy vs deterministic) will be helpful.
>
> > > Regards,
> > > Shantanu
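As an illustration of the "split into computation versus SQL, then do it in
plain Clojure without a DSL" point above, a single case could be held as
ordinary Clojure data and each aggregation written as an ordinary function.
A rough sketch; the field names are invented for illustration.

;; Sketch only: a case as plain Clojure data, each derived variable an
;; ordinary function of the case. Field names are invented.
(def example-case
  {:id       "case-0001"
   :payments [{:days-ago 3  :amount 100}
              {:days-ago 12 :amount 250}
              {:days-ago 45 :amount 80}]
   :contacts [{:days-ago 7  :channel :phone}]})

(defn recent [days rows]
  (filter #(<= (:days-ago %) days) rows))

(defn sum-of [k rows]
  (reduce + 0 (map k rows)))

;; Each of the ~500 derived variables is just a named function of the case.
(defn payments-30d-total [c] (sum-of :amount (recent 30 (:payments c))))
(defn payments-30d-count [c] (count (recent 30 (:payments c))))
(defn contacts-30d-count [c] (count (recent 30 (:contacts c))))

;; (payments-30d-total example-case) ;=> 350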

> > > On Oct 3, 5:10 pm, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
>
> > > > On Oct 3, 1:16 pm, Ross Gayler <r.gay...@gmail.com> wrote:
>
> > > > > Thanks Michael.
>
> > > > > > This sounds very similar to NoSQL and Map/Reduce?
>
> > > > > I'm not so sure about that (which may be mostly due to my ignorance
> > > > > of NoSQL and Map/Reduce). The amount of data involved in my problem
> > > > > is quite small and any infrastructure aimed at massive scaling may
> > > > > bring a load of conceptual and implementation baggage that is
> > > > > unnecessary/unhelpful.
> > > > >
> > > > > Let me restate my problem:
> > > > >
> > > > > I have a bunch of statistician colleagues with minimal programming
> > > > > skills. (I am also a statistician, but with slightly better
> > > > > programming skills.) As part of our analytical workflow we take
> > > > > data sets and preprocess them by adding new variables that are
> > > > > typically aggregate functions of other values. We source the data
> > > > > from a database/file, add the new variables, and store the
> > > > > augmented data in a database/file for subsequent, extensive and
> > > > > extended (a couple of months) analysis with other tools
> > > > > (off-the-shelf statistical packages such as SAS and R). After the
> > > > > analyses are complete, some subset of the preprocessing
> > > > > calculations needs to be implemented in an operational environment.
> > > > > This is currently done by completely re-implementing them in yet
> > > > > another fairly basic imperative language.
> > > > >
> > > > > The preprocessing in our analytical environment is usually written
> > > > > in a combination of SQL and the SAS data manipulation language
> > > > > (think of it as a very basic imperative language with macros but no
> > > > > user-defined functions). The statisticians take a long time to get
> > > > > their preprocessing right (they're not good at nested queries in
> > > > > SQL and make all the usual errors iterating over arrays of values
> > > > > with imperative code). So my primary goal is to find/build a query
> > > > > language that minimises the cognitive impedance mismatch with the
> > > > > statisticians and minimises their opportunity for error.
> > > > >
> > > > > Another goal is that the same mechanism should be applicable in our
> > > > > statistical analytical environment and the corporate deployment
> > > > > environment(s). The most different operational environment is
> > > > > online and realtime. The data describing one case gets thrown at
> > > > > some code that (among other things) implements the preprocessing
> > > > > with some embedded imperative code. So, linking in some Java byte
> > > > > code to do the preprocessing on a single case sounds feasible,
> > > > > whereas replacing/augmenting the current corporate infrastructure
> > > > > with NoSQL and a CPU farm is more aggravation with corporate IT
> > > > > than I am paid for.
> > > > >
> > > > > The final goal is that the preprocessing mechanism should be no
> > > > > slower than the current methods in each of the deployment
> > > > > environments. The hardest one is probably our statistical analysis
> > > > > environment, but there we do have the option of farming the work
> > > > > across multiple CPUs if needed.
> > > > >
> > > > > Let me describe the computational scale of the problem - it is
> > > > > really quite small.
> > > > >
> > > > > Data is organised as completely independent cases. One case might
> > > > > contain 500 primitive values for a total size of ~1kb.
> > > > > Preprocessing might calculate another 500 values, each of those
> > > > > being an aggregate function of some subset (say, 20 values) of the
> > > > > original 500 values. Currently, all these new values are calculated
> > > > > independently of each other, but there is a lot of overlap of
> > > > > intermediate results and, therefore, potential for optimisation of
> > > > > the computational effort required to calculate the entire set of
> > > > > results within a single case.
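The overlap of intermediate results described above is straightforward to
exploit once the derivations live in ordinary code: compute each shared
subset once per case and derive several values from it. A rough sketch along
those lines, again with invented field names:

;; Sketch only: compute the shared intermediate once, then derive several
;; values from it, instead of recomputing the same subset per variable.
(defn derive-payment-vars
  "Add 30-day payment variables to one case, sharing the filtered subset."
  [c]
  (let [recent (filter #(<= (:days-ago %) 30) (:payments c)) ; shared intermediate
        total  (reduce + 0 (map :amount recent))
        n      (count recent)]
    (assoc c
      :payments-30d-total total
      :payments-30d-count n
      :payments-30d-mean  (if (zero? n) 0 (/ total n)))))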
> > > > >
> > > > > In our statistical analytical environment the preprocessing is
> > > > > carried out in batch mode. A large dataset
> > ...
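And because the cases are completely independent, the batch run over a large
dataset parallelises trivially. A tiny sketch, with derive-fn standing in for
whatever per-case derivation function is used:

;; Sketch only: cases are independent, so the batch run can be spread
;; across cores with pmap; swap in map if a single core is enough.
(defn derive-all
  "Apply derive-fn (a function of one case) to every case in the batch."
  [derive-fn cases]
  (doall (pmap derive-fn cases)))

;; e.g. (derive-all derive-payment-vars cases)  ; cases = the fetched batch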