Thanks Shantanu. (Sorry for the slow reply.)

> What does a single case consist of? Is it just a result-set (as a
> consequence of running an SQL query)? Maybe an example will help.
I can't be too specific, but a single case can be thought of as a tiny
relational database with maybe 20 tables. One table will have one row with
some admin and unique identifier information. Another few tables will have
one to a few rows each with identifying information like names and
addresses. The remaining tables will each correspond to a different type of
event that may have occurred, and there will be zero to 100 rows (but
typically only up to 10) in each event table, with 3 or 4 columns describing
the event.

> I know that Apache Derby[1], HSQLDB[2] (now called HyperSQL) and H2[3]
> can be readily used in Clojure as in-memory databases without needing
> to start a server.

Thanks for the links to the in-memory databases.

I think my brain is full now. It'll probably take me a couple of months to
try out some ideas (this isn't on anybody's to-do list but mine).

Thanks to all for your help.

Ross

On Oct 5, 12:53 am, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
> Thanks Ross, that gives me a better insight into your environment.
>
> > In the online environment single cases are fetched from a
> > database with no aggregation capability and fired at the service that
> > contains the aggregation functionality.
>
> What does a single case consist of? Is it just a result-set (as a
> consequence of running an SQL query)? Maybe an example will help.
>
> > This would make sense if the IMDB query language supported the
> > aggregations we want in a way our statisticians can use, and the IMDB
> > is sufficiently lightweight that we can link it into our service as a
> > function library (rather than a separate server connected by some
> > communication protocol).
>
> I know that Apache Derby[1], HSQLDB[2] (now called HyperSQL) and H2[3]
> can be readily used in Clojure as in-memory databases without needing
> to start a server. You can find the examples here:
> http://bitbucket.org/kumarshantanu/clj-dbcp/src (though Clj-DBCP 0.1
> is not actually released yet, expected soon with SQLRat 0.2)
>
> [1] http://db.apache.org/derby/
> [2] http://hsqldb.org/
> [3] http://www.h2database.com/html/main.html
>
> There are likely more in-memory databases I am not aware of at the
> moment. As long as they have JDBC drivers, using them should be easy.
> Since your main criterion is SQL features/functions, I guess you would
> need to find out which IMDB suits better.
>
> I will be happy to add any missing bits to Clj-DBCP and SQLRat (I am
> the author) if you can let me know. Please feel free to ping me off
> the list.
>
> Regards,
> Shantanu
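To make the embedded in-memory database idea concrete, here is a rough,
untested sketch of loading one case's event rows into an in-memory H2
database from Clojure, running an aggregate query, and clearing it for the
next case. It assumes the H2 driver and contrib's clojure.contrib.sql are on
the classpath; the table and column names (payments, days_ago, amount) are
invented for illustration.

;; Sketch only: one "case" loaded into an embedded in-memory H2 database,
;; aggregated with SQL, then dropped so the next case starts clean.
;; Assumes the H2 driver and clojure.contrib.sql are on the classpath;
;; table/column names are invented for illustration.
(require '[clojure.contrib.sql :as sql])

(def case-db {:classname   "org.h2.Driver"
              :subprotocol "h2"
              :subname     "mem:casedb"})   ; in-memory, no server process

(defn aggregate-one-case
  "payment-rows is a seq of [days-ago amount] vectors for a single case."
  [payment-rows]
  (sql/with-connection case-db
    (sql/create-table :payments
                      [:days_ago "INTEGER"]
                      [:amount   "INTEGER"])
    (apply sql/insert-values :payments [:days_ago :amount] payment-rows)
    (let [totals (sql/with-query-results rs
                   ["SELECT COUNT(*) AS n, SUM(amount) AS total
                     FROM payments WHERE days_ago <= 30"]
                   (doall rs))]
      (sql/drop-table :payments)             ; clear, ready for the next case
      totals)))

;; (aggregate-one-case [[3 100] [12 250] [45 80]])
;; => e.g. ({:n 2, :total 350})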

> On Oct 4, 4:57 pm, Ross Gayler <r.gay...@gmail.com> wrote:
>
> > Thanks for the two posts Shantanu.
>
> > The rql and Clause examples are useful, both as potential parts of a
> > solution and as examples of how query/aggregation stuff may be done in
> > Clojure style. It is conceivable that I may end up deciding all I
> > need is a DSL that covers the kinds of aggregations of interest to us
> > and translates them to SQL via SQLrat.
>
> > With respect to your three suggestions in your second post, things
> > get a bit more interesting. A major part of the problem (which I
> > failed to emphasize/mention in my first post) is that I really want
> > this aggregation stuff to work in two deployment environments: a
> > batch-oriented statistical development environment that we control,
> > and an online, realtime transactional environment that corporate IT
> > controls. In the online environment single cases are fetched from a
> > database with no aggregation capability and fired at the service that
> > contains the aggregation functionality. We control what happens inside
> > that service but have close to zero chance of changing anything
> > outside that service - so in that online environment we have no
> > possibility of putting aggregation into the datasource DB that feeds
> > our service. However, it *might* be reasonable to put an in-memory
> > database inside our service, purely to take advantage of the
> > aggregation facilities provided by that IMDB. A single case would get
> > loaded into the IMDB, the aggregation would be carried out in that
> > IMDB, the results exported, and the IMDB cleared ready for the next
> > case. This would make sense if the IMDB query language supported the
> > aggregations we want in a way our statisticians can use, and the IMDB
> > is sufficiently lightweight that we can link it into our service as a
> > function library (rather than a separate server connected by some
> > communication protocol).
>
> > In our statistical development environment things are different. The
> > source data happens to live in a database, and we query that to get
> > the subset of cases we are interested in (say, 1M of them). In that
> > subset, each case can be treated completely in isolation and our
> > aggregations will use 100% of the component data in each case. An
> > individual aggregation might touch 20% of the data in one case, but we
> > might have ~500 different aggregations from the same case, so every
> > value gets used in lots of different aggregations. So although I am
> > interested in DB query languages as a way of specifying aggregations,
> > I am not so convinced that I would actually use a full-blown DB to
> > implement those aggregations.
>
> > Cheers
>
> > Ross
>
> > On Oct 4, 3:24 am, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
>
> > > I looked at Tutorial D - it's pretty interesting. Here are a few
> > > top-of-my-head observations:
>
> > > * Which RDBMS do you use? If you are free to choose a new RDBMS,
> > > probably you can pick one that provides most of the computational
> > > functionality (as SQL constructs/functions) out of the box, for
> > > example Oracle, MS SQL Server, PostgreSQL etc. The reason is
> > > performance: the more you can compute within the database, the less
> > > data you will need to fetch in order to process.
>
> > > * The kinds of computations you need to solve look like a superset of
> > > what SQL can provide. So I think you will have to re-state the
> > > problem in terms of computations/iterations over SQL result-sets,
> > > which is probably what you are currently doing using the imperative
> > > language. If you can split every problem into (a) the computation
> > > you need versus (b) the SQL queries you need to fire, then you can
> > > easily do it using Clojure itself without needing any DSL.
>
> > > * If you want a DSL for this, I suppose it should make maximum use of
> > > the database's inbuilt query functions/constructs to maximize
> > > performance. This also means the DSL implementation needs to be
> > > database-aware. Secondly, it is possible to write functions in Clojure
> > > that would emit appropriate SQL clauses (as long as it is doable) to
> > > compute certain pieces of information. Looking at multiple use cases
> > > (covering various aspects - fuzzy vs deterministic) will be helpful.
>
> > > Regards,
> > > Shantanu
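As an illustration of the "split into computation versus SQL, then do it in
plain Clojure without a DSL" point above, a single case could be held as
ordinary Clojure data and each aggregation written as an ordinary function.
A rough sketch; the field names are invented for illustration.

;; Sketch only: a case as plain Clojure data, each derived variable an
;; ordinary function of the case. Field names are invented.
(def example-case
  {:id       "case-0001"
   :payments [{:days-ago 3  :amount 100}
              {:days-ago 12 :amount 250}
              {:days-ago 45 :amount 80}]
   :contacts [{:days-ago 7  :channel :phone}]})

(defn recent [days rows]
  (filter #(<= (:days-ago %) days) rows))

(defn sum-of [k rows]
  (reduce + 0 (map k rows)))

;; Each of the ~500 derived variables is just a named function of the case.
(defn payments-30d-total [c] (sum-of :amount (recent 30 (:payments c))))
(defn payments-30d-count [c] (count (recent 30 (:payments c))))
(defn contacts-30d-count [c] (count (recent 30 (:contacts c))))

;; (payments-30d-total example-case) ;=> 350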

> > > On Oct 3, 5:10 pm, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
>
> > > > On Oct 3, 1:16 pm, Ross Gayler <r.gay...@gmail.com> wrote:
>
> > > > > Thanks Michael.
>
> > > > > > This sounds very similar to NoSQL and Map/Reduce?
>
> > > > > I'm not so sure about that (which may be mostly due to my ignorance
> > > > > of NoSQL and Map/Reduce). The amount of data involved in my problem
> > > > > is quite small and any infrastructure aimed at massive scaling may
> > > > > bring a load of conceptual and implementation baggage that is
> > > > > unnecessary/unhelpful.
> > > > >
> > > > > Let me restate my problem:
> > > > >
> > > > > I have a bunch of statistician colleagues with minimal programming
> > > > > skills. (I am also a statistician, but with slightly better
> > > > > programming skills.) As part of our analytical workflow we take
> > > > > data sets and preprocess them by adding new variables that are
> > > > > typically aggregate functions of other values. We source the data
> > > > > from a database/file, add the new variables, and store the
> > > > > augmented data in a database/file for subsequent, extensive and
> > > > > extended (a couple of months) analysis with other tools
> > > > > (off-the-shelf statistical packages such as SAS and R). After the
> > > > > analyses are complete, some subset of the preprocessing
> > > > > calculations needs to be implemented in an operational environment.
> > > > > This is currently done by completely re-implementing them in yet
> > > > > another fairly basic imperative language.
> > > > >
> > > > > The preprocessing in our analytical environment is usually written
> > > > > in a combination of SQL and the SAS data manipulation language
> > > > > (think of it as a very basic imperative language with macros but no
> > > > > user-defined functions). The statisticians take a long time to get
> > > > > their preprocessing right (they're not good at nested queries in
> > > > > SQL and make all the usual errors iterating over arrays of values
> > > > > with imperative code). So my primary goal is to find/build a query
> > > > > language that minimises the cognitive impedance mismatch with the
> > > > > statisticians and minimises their opportunity for error.
> > > > >
> > > > > Another goal is that the same mechanism should be applicable in our
> > > > > statistical analytical environment and the corporate deployment
> > > > > environment(s). The most different operational environment is
> > > > > online and realtime. The data describing one case gets thrown at
> > > > > some code that (among other things) implements the preprocessing
> > > > > with some embedded imperative code. So, linking in some Java byte
> > > > > code to do the preprocessing on a single case sounds feasible,
> > > > > whereas replacing/augmenting the current corporate infrastructure
> > > > > with NoSQL and a CPU farm is more aggravation with corporate IT
> > > > > than I am paid for.
> > > > >
> > > > > The final goal is that the preprocessing mechanism should be no
> > > > > slower than the current methods in each of the deployment
> > > > > environments. The hardest one is probably our statistical analysis
> > > > > environment, but there we do have the option of farming the work
> > > > > across multiple CPUs if needed.
> > > > >
> > > > > Let me describe the computational scale of the problem - it is
> > > > > really quite small.
> > > > >
> > > > > Data is organised as completely independent cases. One case might
> > > > > contain 500 primitive values for a total size of ~1kb.
> > > > > Preprocessing might calculate another 500 values, each of those
> > > > > being an aggregate function of some subset (say, 20 values) of the
> > > > > original 500 values. Currently, all these new values are calculated
> > > > > independently of each other, but there is a lot of overlap of
> > > > > intermediate results and, therefore, potential for optimisation of
> > > > > the computational effort required to calculate the entire set of
> > > > > results within a single case.
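The overlap of intermediate results described above is straightforward to
exploit once the derivations live in ordinary code: compute each shared
subset once per case and derive several values from it. A rough sketch along
those lines, again with invented field names:

;; Sketch only: compute the shared intermediate once, then derive several
;; values from it, instead of recomputing the same subset per variable.
(defn derive-payment-vars
  "Add 30-day payment variables to one case, sharing the filtered subset."
  [c]
  (let [recent (filter #(<= (:days-ago %) 30) (:payments c)) ; shared intermediate
        total  (reduce + 0 (map :amount recent))
        n      (count recent)]
    (assoc c
      :payments-30d-total total
      :payments-30d-count n
      :payments-30d-mean  (if (zero? n) 0 (/ total n)))))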
> > > > >
> > > > > In our statistical analytical environment the preprocessing is
> > > > > carried out in batch mode. A large dataset
> > ...
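And because the cases are completely independent, the batch run over a large
dataset parallelises trivially. A tiny sketch, with derive-fn standing in for
whatever per-case derivation function is used:

;; Sketch only: cases are independent, so the batch run can be spread
;; across cores with pmap; swap in map if a single core is enough.
(defn derive-all
  "Apply derive-fn (a function of one case) to every case in the batch."
  [derive-fn cases]
  (doall (pmap derive-fn cases)))

;; e.g. (derive-all derive-payment-vars cases)  ; cases = the fetched batch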