On Fri, Oct 22, 2010 at 2:32 PM, Bob Jolliffe <bobjolli...@gmail.com> wrote:
> hi
>
> 2010/10/22 Ola Hodne Titlestad <ol...@ifi.uio.no>:
> > 2010/10/22 Lars Helge Øverland <larshe...@gmail.com>
> >>
> >> Hi Bob
> >> Sorry for not returning to this earlier, been battling with Windows 64-bit ODBC drivers.
> >>
> >> 2010/10/22 Bob Jolliffe <bobjolli...@gmail.com>
> >>>
> >>> Hi Lars, Ola and all
> >>>
> >>> Thinking about exporting pivot tables, we need to pen down some requirements which we can get into a blueprint. While we are considering that, I've given some thoughts below.
> >>>
> >>> My initial thought was that we would construct a pivot table with a JDBC connection like we would normally do manually, and then simply populate the pivot cache so that the spreadsheet would be "free-standing". On reflection that is probably neither the best nor the simplest thing to do. Playing around with Excel (in Wine :-) I figured it is actually much more straightforward to have two sheets: one for the data and one for the pivot table. The pivot table would operate on the data in the other worksheet. This is more explicit than simply hiding the data in the pivot cache, though I need to test this a bit more. If the pivot cache is an obligatory data store then it might not make sense to carry the data twice. The pivot table constructs are fairly complex, but not outlandishly so - once I'm done unpicking them, I will write up a short description of the anatomy of a pivot table so we can see what needs to be done.
> >>
> >> This sounds just fine to me.
> >
> > This might work, just have in mind that we will then need several pairs of data+pivot worksheets, as we want to have multiple pivot tables in an Excel file.
> > Also have in mind that we can be talking about up to around 1,500,000 rows of data here (that's when my laptop starts to complain, at least), so these data sheets will be large. Typically the number of rows of data will be somewhere around 200-600K.
> >
> >>> One possibility would be that the pivot table could be generated as part of the export of a report table, i.e. optionally export to Excel or export to Excel-with-pivot.
> >>>
> >>> Given that there is no advance indication of the columns in a report table, a challenge will be how and when to define the pivot model - i.e. the pivot fields - and how to persist that model (we don't want users to have to recreate the pivot model each time). This will probably require an extra object in our data model (PivotModel) which defines the row and column pivot fields and the data field, where a ReportTable can have a relationship with 0 or more PivotModels. A possible beneficial side effect of this could be that we also leave open the ability to generate the pivot table rendition with things other than Excel. I don't see this as an immediate requirement, but it is useful to have the PivotModel abstraction anyway. In reality the design of the PivotModel will be based on what Excel requires.
> >>>
> >>> Generating the Excel spreadsheet off the ReportTable+PivotModel will produce an OOXML xlsx file - actually a bundle of XML streams which need to be zipped. It's a bit unfortunate that M$ defines a flat single-file XML for other OPC office documents, but sadly not for Excel. Dealing with a single flat file is much easier than zip containers with multiple streams, but so be it. At least they end up smaller.
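Just to make the PivotModel idea above concrete, a rough sketch - class and field names are placeholders only, not an agreed design:

    // Placeholder sketch of the PivotModel object described above.
    // It records which report table columns act as pivot row fields, column fields
    // and the data field, so a pivot definition can be persisted and reused on export.
    import java.util.ArrayList;
    import java.util.List;

    public class PivotModel
    {
        private String name;

        private List<String> rowFields = new ArrayList<String>();    // pivot row fields

        private List<String> columnFields = new ArrayList<String>(); // pivot column fields

        private String dataField;                                     // the value field to aggregate

        // Plus a reference back to the owning ReportTable (a ReportTable can have
        // 0 or more PivotModels), and getters/setters - omitted here.
    }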
> >>> Though given that these report tables can be very large, I don't think an XSLT approach is really the most efficient, at least for producing the data worksheet. It might make sense if the source data was XML, but even then probably not. So simple iteration through the table rows with a StAX writer will work best. The second sheet (the pivot sheet) would just be a serialization of the pivot model.
> >>>
> >>> We will probably have to implement the Excel optimization of shared strings (where literal strings are not written into the spreadsheet - just index values into a table). This adds complexity, but I think we are talking pretty large files here. If we don't do the shared strings optimization, Excel will do it anyway the first time it opens and saves the file, but I suspect we will have to do it up front.
> >>>
> >>> Regarding UI and PivotModel, I want to take a look at our existing work on web pivot tables to see whether we can't leverage some of it. If we follow roughly the outline above, the user would have the option to define 0 or more pivot table definitions at the time of creating, or editing, a report table. What do you think? Is the above close to what you guys have in mind? Let's hammer out some requirements and create the blueprint.
> >>
> >> We won't base this on report tables, as (at least I) can't see any huge benefits. The Excel pivot tables themselves are more capable of having dimensions on columns, rows and filters than our report tables. In fact we usually use a SQL view as the datasource for the pivot tables - which has a fixed number of columns. The view is based on the aggregateddatavalue table and joins in the dataelement, orgunit, period and periodtype tables. In addition we join in a series of resource tables to get information about the orgunit structure, dataelement groupset structure and orgunit groupset structure (additional dimensional information!) to make the pivot table more valuable for analysis. (This also goes for indicators.)
> >> Since this should be more of an offline analysis tool, I think exporting all data elements/indicators will be most appropriate. The user interface could simply request a start and end date (and/or relative periods), indicators vs data elements, parent organisation unit and organisation unit level. Ola will maybe have some views here...
> >
> > Agree with Lars, I don't think report tables should be used as the source for pivot tables, at least not for the standard "ALL data" pivot tables.
>
> Yes I understand that now
>
> > For more ad-hoc use the pivots generated off a report table might be useful as an export option, but let's start with the most common use cases: the big pivot tables with all the data for browsing and data analysis.
> > The pivot source views (SQL) that we have used, e.g. in the newly distributed pivots for the demo database, can be found in an email to the list on Sep 21, together with the pivots and the database dumps: http://www.mail-archive.com/dhis2-devs@lists.launchpad.net/msg07341.html
> > These views have been stable for a long time and describe the fields of the basic pivot tables that are needed for DHIS data analysis.
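Coming back to Bob's point above about streaming the data sheet with a StAX writer and doing the shared strings optimization up front, a minimal sketch of that approach - not working DHIS2 code, just the shape of it; the other workbook parts are omitted and every cell is written as a shared string for brevity:

    // Minimal sketch only: iterate report table rows with a StAX writer, streaming a
    // worksheet part into the xlsx zip container, with strings written as indexes into
    // a shared strings table. The other parts ([Content_Types].xml, rels, workbook.xml,
    // styles, sharedStrings.xml, pivot parts) are omitted, so this does not produce a
    // complete workbook on its own.
    import java.io.FileOutputStream;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class DataSheetWriterSketch
    {
        private final Map<String, Integer> sharedStrings = new LinkedHashMap<String, Integer>();

        public void writeDataSheet( String file, Iterable<String[]> rows ) throws Exception
        {
            ZipOutputStream zip = new ZipOutputStream( new FileOutputStream( file ) );
            zip.putNextEntry( new ZipEntry( "xl/worksheets/sheet1.xml" ) );

            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter( zip, "UTF-8" );
            xml.writeStartDocument( "UTF-8", "1.0" );
            xml.writeStartElement( "worksheet" );
            xml.writeDefaultNamespace( "http://schemas.openxmlformats.org/spreadsheetml/2006/main" );
            xml.writeStartElement( "sheetData" );

            for ( String[] row : rows )
            {
                xml.writeStartElement( "row" );

                for ( String value : row )
                {
                    xml.writeStartElement( "c" );
                    xml.writeAttribute( "t", "s" ); // "s" = shared string; numeric cells would omit this
                    xml.writeStartElement( "v" );
                    xml.writeCharacters( String.valueOf( sharedStringIndex( value ) ) );
                    xml.writeEndElement(); // v
                    xml.writeEndElement(); // c
                }

                xml.writeEndElement(); // row
            }

            xml.writeEndElement(); // sheetData
            xml.writeEndElement(); // worksheet
            xml.writeEndDocument();
            xml.flush();
            zip.closeEntry();

            // ... write xl/sharedStrings.xml from the sharedStrings map in the same streaming way ...

            zip.close();
        }

        private int sharedStringIndex( String value )
        {
            Integer index = sharedStrings.get( value );

            if ( index == null )
            {
                index = sharedStrings.size();
                sharedStrings.put( value, index );
            }

            return index;
        }
    }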
> > This set of fields goes back as far as DHIS 1.3, so the basic pivot tables are stable and we know their structure - basically the core data dimensions of DHIS2 (see chapter 2 in the user manual).
> > What will always be variable are the orgunit level of aggregation for the data or indicator values and the period type. You will see these two are specified at the end of every where clause in the pivot views. They are used as filters when fetching data from the data mart.
> >
> > If you open the pivot table for download in that email you can follow my reasoning below:
> > Typically we have two tables for monthly data (raw data + indicator data) for the lowest level, plus at least two more for a higher level. E.g. in the demo based on SL we use the health facility level (level 4) and the chiefdom level (3). The reason for using the chiefdom level as well is that these tables are much faster to manipulate, since there is a lot less data when shaving away the lowest level. How many levels and which levels to put in the pivot tables will vary from place to place. In addition to these at least four tables, there is often at least one table for population data, which has a yearly period type and therefore needs a separate view and table. So for the demo database we have 5 basic tables which cover all the data for data elements and indicators. This set of pivot tables is what I would define as a minimum pivot table setup for a DHIS database. Over time (depending on the amount of data collected) it might be necessary to split up these basic tables by year, since they can get too big for Excel to handle. Such a split can be facilitated by a simple from and to filter on the start date in the period field in the SQL.
> >
> > Then there can be many more customised pivot tables which make use of additional dimensions to the data, like the data element categories and the data element and orgunit group sets. The simplest approach there is (still talking SQL) to do a select * from the various resource tables. Then you will have a data source for all other customised tables (using the same period type and aggregation level). When working with data element categories in pivot tables it doesn't make sense to use too many at the same time; rather work on data elements that share the same categories (within or across datasets). You can see in the demo pivot table file that we have set up multiple tables making use of data element categories (EPI, RCH, HIV etc.). These all use the same source data (chiefdom level monthly raw data), but use different pivot fields (columns in the result set). The total number of available fields is the same for all these custom chiefdom tables, but we are only using a small subset of these fields in each table. This means that the same data source could also be used for the basic tables (the two chiefdom tables at least), just selecting even fewer fields in the table. The users can also easily add more fields to a table in Excel, using drag and drop.
> >
> > It seems to me that there is a need to separate the source data from the pivot structure, so the source data can be reused in multiple tables. I think that is how Excel thinks about it too, but it doesn't always seem to work that way, e.g. when creating a copy of an existing worksheet with a pivot table.
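The actual view definitions are in the Sep 21 mail linked above; purely to illustrate the shape being described (aggregated values joined with the dimension tables, with the aggregation level, period type and a from/to filter on the period start date in the where clause), something along these lines - table and column names are indicative only, not the real view:

    // Illustrative only - not the real pivot source view (see the Sep 21 mail for those).
    // Shows the shape: aggregateddatavalue joined with dataelement, period, periodtype and
    // organisationunit, filtered on aggregation level, period type and a one-year date range
    // so a basic table can be split by year when it grows too big for Excel.
    public class PivotSourceQuerySketch
    {
        public static final String CHIEFDOM_MONTHLY_RAW_DATA_2010 =
              "select de.name as dataelement, p.startdate as startdate, ou.name as orgunit, adv.value "
            + "from aggregateddatavalue adv "
            + "join dataelement de on adv.dataelementid = de.dataelementid "
            + "join period p on adv.periodid = p.periodid "
            + "join periodtype pt on p.periodtypeid = pt.periodtypeid "
            + "join organisationunit ou on adv.organisationunitid = ou.organisationunitid "
            + "where adv.level = 3 "                                            // chiefdom level
            + "and pt.name = 'Monthly' "                                        // period type filter
            + "and p.startdate >= '2010-01-01' and p.startdate < '2011-01-01'"; // split by year
    }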
> Yes, we can ask Excel pivot tables to use a common pivot table cache.
>
> > I am not sure how much of this customisation we need to support inside DHIS2 and how much we can leave to Excel (which has nice tools for copying tables, dragging and dropping pivot fields etc.). First we need to come up with something that works for the basic tables, and then look at these more custom use cases.
> >
> > When it comes to pivot tables using period types other than monthly or yearly (both covered in the basic tables listed above), like six-monthly, quarterly or weekly, these are also important to support. The pivot fields are the same as in the basic tables, it is just a different periodtype filter on the data source, so these might be easier to accommodate.
> >
> > Looking at the next steps, the key new features that I would like to see with regards to pivot tables are:
> > 1) a pivot table update service, where users that are on the online server (no local install) can update their local pivot tables against the online DHIS server. Since a pivot table can contain months and years of data, we do not want to download the full tables every month, but simply append the latest month of data to the existing local pivot tables.
> > 2) a pivot generator, where users can download a new pivot table file directly off a DHIS2 server without having to do any local configuration etc., and then later use the update service to update their tables.
> > 1) is the most critical and urgent, but I realise that we might need 2) first, in order to deal with a standard set of tables and pivot data.
>
> Yes, I think 1) can only really work in concert with 2). Since we can't ask DHIS to update any old pivot table, it will need to be a pivot table which it is intimately familiar with - either a fixed-format one as you have discussed or one that is created through DHIS. Naturally we start with the former. BTW, looking at the size of the tables, I have dropped the idea of holding the data in a worksheet. It does in fact produce duplication of data which is also held in the cache, and both unfortunately are also held in memory. So we work only with the hidden pivot table cache as the local store.
>
> Thinking about how and where data is being moved, it might not make sense to pass the (massive) Excel file back and forwards to DHIS in order to get updates injected. That might defeat the purpose. Which raises some questions. Either we have a standalone tool which grabs the incremental update from DHIS and appends it to the Excel pivot table cache, or we have some sort of VBA macro defined within Excel. I think the latter might prove problematic - like performing circumcision on oneself whilst awake, it might cause Excel to break down. So probably a standalone tool ..
>
> I haven't worked through the samples yet. I think I will buy Excel on the weekend and invoice Jorn.

Definitely, let's start with 2). We can think of 1) later; maybe even a script will do the job.

Bob, I will put up Excel with a key in a separate mail.

Did some quick investigation on size and it seems Excel is using around 30 KB / 1000 data values. The file seems to be compressed already, btw. If a typical pivot file has 600 000 values that means 18 MB.

Lars
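PS: the size arithmetic above as a small helper, in case it is useful - 30 KB per 1000 data values is just the rough figure observed here, not a measured constant:

    // Back-of-envelope only: ~30 KB per 1000 data values in the (already compressed) xlsx,
    // so 600 000 values comes to roughly 18 MB.
    public class PivotFileSizeEstimate
    {
        private static final double KB_PER_1000_VALUES = 30.0; // rough observation from this thread

        public static double estimateMegabytes( long dataValues )
        {
            return ( dataValues / 1000.0 ) * KB_PER_1000_VALUES / 1000.0;
        }

        public static void main( String[] args )
        {
            System.out.println( estimateMegabytes( 600000 ) + " MB" ); // prints 18.0 MB
        }
    }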