Thanks Ross, that gives me a better insight into your environment.

> In the online environment single cases are fetched from a
> database with no aggregation capability and fired at the service that
> contains the aggregation functionality.

What does a single case consist of? Is it just a result set (i.e. the
output of running an SQL query)? Maybe an example will help.
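For instance (purely hypothetical, since I don't know your actual
schema), do you mean a "case" is essentially a realized result set -
in Clojure terms something like this?

    ;; Hypothetical shape of one "case": a seq of column-name/value
    ;; maps, one row per primitive value, ~500 rows per case.
    (def example-case
      [{:case-id 42, :var-name "txn_amount_01", :value 103.50}
       {:case-id 42, :var-name "txn_amount_02", :value 12.00}
       ;; ... and so on, ~500 primitive values ...
       ])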
> This would make sense if the IMDB query language supported the
> aggregations we want in a way our statisticians can use, and the IMDB
> is sufficiently lightweight that we can link it into our service as a
> function library (rather than a separate server connected by some
> communication protocol).

I know that Apache Derby[1], HSQLDB[2] (now called HyperSQL) and H2[3]
can readily be used from Clojure as in-memory databases without needing
to start a server. You can find examples here:

http://bitbucket.org/kumarshantanu/clj-dbcp/src

(though Clj-DBCP 0.1 is not actually released yet; it is expected soon,
along with SQLRat 0.2)

[1] http://db.apache.org/derby/
[2] http://hsqldb.org/
[3] http://www.h2database.com/html/main.html

There are likely more in-memory databases that I am not aware of at the
moment. As long as they have JDBC drivers, using them should be easy.
Since your main criterion is SQL features/functions, I guess you will
need to find out which IMDB suits you best.
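To make the load/aggregate/clear cycle concrete, here is a rough,
untested sketch against an unnamed in-memory H2 database using plain
JDBC interop (the table and column names are made up for illustration):

    ;; One connection = one private in-memory H2 database; it vanishes
    ;; when the connection closes, so each case starts from a clean
    ;; slate with no explicit clearing step needed.
    (import '(java.sql DriverManager))

    (Class/forName "org.h2.Driver")

    (defn aggregate-case [case-rows]
      (with-open [conn (DriverManager/getConnection "jdbc:h2:mem:" "sa" "")]
        ;; create a scratch table for this case
        (with-open [st (.createStatement conn)]
          (.execute st "CREATE TABLE case_vals (var_name VARCHAR(64), val DOUBLE)"))
        ;; load the single case
        (with-open [ps (.prepareStatement
                         conn "INSERT INTO case_vals (var_name, val) VALUES (?, ?)")]
          (doseq [{:keys [var-name value]} case-rows]
            (.setString ps 1 var-name)
            (.setDouble ps 2 (double value))
            (.executeUpdate ps)))
        ;; run one (of potentially ~500) aggregations, expressed in SQL
        (with-open [st (.createStatement conn)
                    rs (.executeQuery
                         st "SELECT SUM(val) FROM case_vals WHERE var_name LIKE 'txn_amount%'")]
          (.next rs)
          (.getDouble rs 1))))

Whether something like this is fast enough for your ~300 cases/second,
and whether the IMDB's aggregate functions are expressive enough for
your statisticians, is exactly what you would need to measure.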
I will be happy to add any missing bits to Clj-DBCP and SQLRat (I am
the author), if you let me know what is missing. Please feel free to
ping me off the list.

Regards,
Shantanu

On Oct 4, 4:57 pm, Ross Gayler <r.gay...@gmail.com> wrote:
> Thanks for the two posts Shantanu.
>
> The rql and Clause examples are useful, both as potential parts of a
> solution and as examples of how query/aggregation stuff may be done
> in Clojure style. It is conceivable that I may end up deciding all I
> need is a DSL that covers the kinds of aggregations of interest to
> us and translates them to SQL via SQLRat.
>
> With respect to your three suggestions in your second post - things
> get a bit more interesting. A major part of the problem (which I
> failed to emphasize/mention in my first post) is that I really want
> this aggregation stuff to work in two deployment environments: a
> batch-oriented statistical development environment that we control,
> and an online, realtime transactional environment that corporate IT
> controls. In the online environment single cases are fetched from a
> database with no aggregation capability and fired at the service
> that contains the aggregation functionality. We control what happens
> inside that service but have close to zero chance of changing
> anything outside it - so in that online environment we have no
> possibility of putting aggregation into the datasource DB that feeds
> our service. However, it *might* be reasonable to put an in-memory
> database inside our service, purely to take advantage of the
> aggregation facilities provided by that IMDB. A single case would
> get loaded into the IMDB, the aggregation would be carried out in
> that IMDB, the results exported, and the IMDB cleared ready for the
> next case. This would make sense if the IMDB query language
> supported the aggregations we want in a way our statisticians can
> use, and the IMDB is sufficiently lightweight that we can link it
> into our service as a function library (rather than a separate
> server connected by some communication protocol).
>
> In our statistical development environment things are different. The
> source data happens to live in a database, and we query that to get
> the subset of cases we are interested in (say, 1M of them). In that
> subset, each case can be treated completely in isolation and our
> aggregations will use 100% of the component data in each case. An
> individual aggregation might touch 20% of the data in one case, but
> we might have ~500 different aggregations from the same case, so
> every value gets used in lots of different aggregations. So although
> I am interested in DB query languages as a way of specifying
> aggregations, I am not so convinced that I would actually use a
> full-blown DB to implement those aggregations.
>
> Cheers
>
> Ross
>
> On Oct 4, 3:24 am, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
> > I looked at Tutorial D - it's pretty interesting. Here are a few
> > top-of-my-head observations:
> >
> > * Which RDBMS do you use? If you are free to choose a new RDBMS,
> > probably you can pick one that provides most of the computational
> > functionality (as SQL constructs/functions) out of the box, for
> > example Oracle, MS SQL Server, PostgreSQL etc. The reason is
> > performance - the more you can compute within the database, the
> > less data you will need to fetch in order to process it.
> >
> > * The kinds of computations you need look like a superset of what
> > SQL can provide. So I think you will have to re-state the problem
> > in terms of computations/iterations over SQL result sets, which is
> > probably what you are currently doing in the imperative language.
> > If you can split every problem into (a) the computation you need
> > versus (b) the SQL queries you need to fire, then you can easily
> > do it using Clojure itself without needing any DSL.
> >
> > * If you want a DSL for this, I suppose it should make maximum use
> > of the database's inbuilt query functions/constructs to maximize
> > performance. This also means the DSL implementation needs to be
> > database-aware. Secondly, it is possible to write functions in
> > Clojure that would emit appropriate SQL clauses (as long as that
> > is doable) to compute certain pieces of information. Looking at
> > multiple use cases (covering various aspects - fuzzy vs
> > deterministic) will be helpful.
> >
> > Regards,
> > Shantanu
> >
> > On Oct 3, 5:10 pm, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
> > > On Oct 3, 1:16 pm, Ross Gayler <r.gay...@gmail.com> wrote:
> > > > Thanks Michael.
> > > >
> > > > > This sounds very similar to NoSQL and Map/Reduce?
> > > >
> > > > I'm not so sure about that (which may be mostly due to my
> > > > ignorance of NoSQL and Map/Reduce). The amount of data involved
> > > > in my problem is quite small, and any infrastructure aimed at
> > > > massive scaling may bring a load of conceptual and
> > > > implementation baggage that is unnecessary/unhelpful.
> > > >
> > > > Let me restate my problem:
> > > >
> > > > I have a bunch of statistician colleagues with minimal
> > > > programming skills. (I am also a statistician, but with
> > > > slightly better programming skills.) As part of our analytical
> > > > workflow we take data sets and preprocess them by adding new
> > > > variables that are typically aggregate functions of other
> > > > values. We source the data from a database/file, add the new
> > > > variables, and store the augmented data in a database/file for
> > > > subsequent, extensive and extended (a couple of months)
> > > > analysis with other tools (off-the-shelf statistical packages
> > > > such as SAS and R). After the analyses are complete, some
> > > > subset of the preprocessing calculations needs to be
> > > > implemented in an operational environment. This is currently
> > > > done by completely re-implementing them in yet another fairly
> > > > basic imperative language.
> > > > The preprocessing in our analytical environment is usually
> > > > written in a combination of SQL and the SAS data manipulation
> > > > language (think of it as a very basic imperative language with
> > > > macros but no user-defined functions). The statisticians take
> > > > a long time to get their preprocessing right (they're not good
> > > > at nested queries in SQL, and they make all the usual errors
> > > > when iterating over arrays of values with imperative code). So
> > > > my primary goal is to find/build a query language that
> > > > minimises the cognitive impedance mismatch with the
> > > > statisticians and minimises their opportunity for error.
> > > >
> > > > Another goal is that the same mechanism should be applicable
> > > > in our statistical analytical environment and the corporate
> > > > deployment environment(s). The most different operational
> > > > environment is online and realtime. The data describing one
> > > > case gets thrown at some code that (among other things)
> > > > implements the preprocessing with some embedded imperative
> > > > code. So, linking in some Java bytecode to do the
> > > > preprocessing on a single case sounds feasible, whereas
> > > > replacing/augmenting the current corporate infrastructure with
> > > > NoSQL and a CPU farm is more aggravation with corporate IT
> > > > than I am paid for.
> > > >
> > > > The final goal is that the preprocessing mechanism should be
> > > > no slower than the current methods in each of the deployment
> > > > environments. The hardest one is probably our statistical
> > > > analysis environment, but there we do have the option of
> > > > farming the work across multiple CPUs if needed.
> > > >
> > > > Let me describe the computational scale of the problem - it is
> > > > really quite small.
> > > >
> > > > Data is organised as completely independent cases. One case
> > > > might contain 500 primitive values, for a total size of ~1kb.
> > > > Preprocessing might calculate another 500 values, each of
> > > > those being an aggregate function of some subset (say, 20
> > > > values) of the original 500 values. Currently, all these new
> > > > values are calculated independently of each other, but there
> > > > is a lot of overlap of intermediate results and, therefore,
> > > > potential for optimisation of the computational effort
> > > > required to calculate the entire set of results within a
> > > > single case.
> > > >
> > > > In our statistical analytical environment the preprocessing is
> > > > carried out in batch mode. A large dataset might contain 1M
> > > > cases (~1GB of data). We can churn through the preprocessing
> > > > at ~300 cases/second on a modest PC. Higher throughput in our
> > > > analytical environment would be a bonus, but not essential.
> > > >
> > > > So I see the problem as primarily about the conceptual design
> > > > of the query language, with some side constraints about
> > > > implementation compatibility across a range of deployment
> > > > environments and adequate throughput performance.
> > > >
> > > > As I mentioned in an earlier post, I'll probably assemble a
> > > > collection of representative queries, express them in a
> > > > variety of query languages, and try to assess how compatible
> > > > the different query languages are with the way my colleagues
> > > > want to think about the problem.
> > >
> > > Seeing examples (perhaps quite a few of them) will certainly be
> > > useful. (Due to my non-stats background) I may not have
> > > understood your use-cases correctly, but are these helpful for
> > > you?
> > >
> > > http://github.com/MrHus/rql
> > > http://bitbucket.org/kumarshantanu/sqlrat/wiki/Clause
> > >
> > > The SQLRat Clause API will be part of the 0.2 release (expected
> > > very soon).
> > >
> > > Regards,
> > > Shantanu
> > >
> > > > Ross
> > > >
> > > > On Oct 3, 11:31 am, Michael Ossareh <ossa...@gmail.com> wrote:
> > > > > On Fri, Oct 1, 2010 at 17:55, Ross Gayler <r.gay...@gmail.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This is probably an abuse of the Clojure forum, but it is
> > > > > > a bit Clojure-related and strikes me as the sort of thing
> > > > > > that a bright, eclectic bunch of Clojure users might know
> > > > > > about. (Plus I'm not really a software person, so I need
> > > > > > all the help I can get.)
> > > > > >
> > > > > > I am looking at the possibility of finding/building a
> > > > > > declarative data aggregation language operating on a small
> > > > > > relational representation. Each query...
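P.S. To illustrate the point from my Oct 4 post (quoted above) about
writing Clojure functions that emit appropriate SQL clauses, the kind
of thing I have in mind is roughly this (a hypothetical sketch, not the
actual SQLRat Clause API; the case_vals table and column names are
made up):

    ;; Tiny clause-emitting helpers (hypothetical, for flavour only).
    (defn where-like [col pattern]
      (str (name col) " LIKE '" pattern "'"))

    (defn sum-of [col pred]
      (str "SELECT SUM(" (name col) ") FROM case_vals WHERE " pred))

    ;; usage:
    (sum-of :val (where-like :var_name "txn_amount%"))
    ;;=> "SELECT SUM(val) FROM case_vals WHERE var_name LIKE 'txn_amount%'"

A statistician-friendly DSL could sit on top of a handful of such
emitters, with the generated SQL fired at whichever IMDB you settle on.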