I think Jason also pointed out that this could be achieved from the API, but the question is whether it needs to be more user-friendly, i.e. customisable using the web application as opposed to requiring a custom script triggered by a cron job.
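As an illustration of the cron-script route, here is a minimal sketch that POSTs to the maintenance endpoints Dan quotes below. The base URL and credentials are placeholders, and the endpoint paths are taken verbatim from the thread rather than verified against any particular DHIS2 version; `DRY_RUN=1` (the default) only prints the commands so the script can be inspected without a live server.

```shell
#!/bin/sh
# Sketch of a cron-triggered DHIS2 maintenance script.
# BASE_URL and AUTH are placeholders -- point them at your own instance.
BASE_URL="${BASE_URL:-https://dhis2.example.org}"
AUTH="${AUTH:-admin:district}"

# With DRY_RUN=1 (the default) the curl commands are only printed,
# so the script can be checked without touching a live server.
DRY_RUN="${DRY_RUN:-1}"

call() {
  url="$BASE_URL$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -X POST -u \$AUTH $url"
  else
    curl -s -X POST -u "$AUTH" "$url"
  fi
}

# Endpoint paths as quoted in Dan's reply below.
call "/api/24/maintenance/analyticsTablesClear"
call "/api/24/maintenance/periodPruning"
call "/api/resourceTables/analytics?lastYears=2"
```

A crontab entry such as `0 2 * * * DRY_RUN=0 /usr/local/bin/dhis2-maintenance.sh` would then run it nightly, which is the "custom script triggered by a cron job" option discussed above.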
Cheers,
-doh

On Sun, Sep 11, 2016 at 8:36 PM, Dan Cocos <dco...@gmail.com> wrote:

> Hi All,
>
> You could run this
> /api/24/maintenance/analyticsTablesClear
> and possibly this
> /api/24/maintenance/periodPruning
>
> I don't see it in the documentation, but we call this
> /api/resourceTables/analytics?lastYears=2 quite often for clients with a
> lot of historical data.
>
> Good luck,
> Dan
>
> *Dan Cocos*
> Principal, BAO Systems
> dco...@baosystems.com <nho...@baosystems.com> | http://www.baosystems.com
> | 2900 K Street, Suite 404, Washington D.C. 20007
>
> On Sep 11, 2016, at 10:05 AM, Calle Hedberg <calle.hedb...@gmail.com> wrote:
>
> Hi,
>
> It's not only analytics that would benefit from segmented/staggered
> processing: I exported around 100 mill data values yesterday from a number
> of instances, and found that the export process was (seemingly)
> exponentially slower with an increasing number of records exported. Most of
> the export files contained well under 10 mill records, which was pretty
> fast. In comparison, the largest export file, with around 30 mill data
> values, probably took 20 times as long as an 8 mill value export. Based
> on just keeping an eye on the "progress bar", it seemed like some kind of
> cache staggering was taking place - the amount exported would increase
> quickly by 2-3 MB, then "hang" for a good while, then increase quickly by
> 2-3 MB again.
>
> Note also that there are several fundamental strategies one could use to
> reduce heavy work processes like analytics, exports (and thus imports),
> etc.:
> - to be able to specify a sub-period, as Jason suggests
> - to be able to specify the "dirty" part of the instance by using e.g.
> LastUpdated >= xxxxx
> - to be able to specify a sub-OrgUnit area
>
> These partial strategies are of course mostly relevant for very large
> instances, but such large instances are also the ones where you typically
> only have changes made to a small segment of the total - like if you have
> data for 30 years, 27 of those might be locked down and no longer available
> for updates.
>
> Regards
> Calle
>
> On 11 September 2016 at 15:47, David Siang Fong Oh <d...@thoughtworks.com> wrote:
>
>> +1 to Calle's idea of staggering analytics year by year
>>
>> I also like Jason's suggestion of being able to configure the time period
>> for which analytics is regenerated. If the general use case has data being
>> entered only for the current year, then is it perhaps unnecessary to
>> regenerate data for previous years?
>>
>> Cheers,
>>
>> -doh
>>
>> On Tue, Jul 26, 2016 at 2:36 PM, Calle Hedberg <calle.hedb...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> One (presumably) simple solution is to stagger analytics on a
>>> year-by-year basis - i.e. run and complete 2009 before processing 2010.
>>> That would reduce temp disk space requirements significantly while
>>> (presumably) not changing the general design.
>>>
>>> Regards
>>> Calle
>>>
>>> On 26 July 2016 at 10:24, Jason Pickering <jason.p.picker...@gmail.com> wrote:
>>>
>>>> Hi Devs,
>>>> I am seeking some advice on how to try and decrease the amount of disk
>>>> usage with DHIS2.
>>>>
>>>> Here is a list of the biggest tables in the system.
>>>>
>>>> public.datavalue                | 2316 MB
>>>> public.datavalue_pkey           | 1230 MB
>>>> public.in_datavalue_lastupdated |  680 MB
>>>>
>>>> There are a lot more tables, and all in all, the database occupies
>>>> about 5.4 GB without analytics.
>>>>
>>>> This represents about 30 million data rows, so not that big of a
>>>> database really. This server is being run off of a Digital Ocean virtual
>>>> server with 60 GB of disk space.
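For reference, a per-relation size listing like the one Jason quotes can be produced with a query against the PostgreSQL catalog. This is a generic sketch (the database name `dhis2` is a placeholder); `pg_relation_size` reports tables and their indexes as separate rows, which matches the listing above where `datavalue` and `datavalue_pkey` appear separately:

```shell
#!/bin/sh
# Sketch: list the ten biggest relations in the public schema.
# DBNAME is a placeholder for the DHIS2 database name.
DBNAME="${DBNAME:-dhis2}"

# pg_relation_size counts each table and each index on its own,
# matching the per-relation listing quoted in the thread.
SQL="SELECT n.nspname || '.' || c.relname AS relation,
            pg_size_pretty(pg_relation_size(c.oid)) AS size
     FROM pg_class c
     JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE n.nspname = 'public'
     ORDER BY pg_relation_size(c.oid) DESC
     LIMIT 10;"

# Print the psql invocation; drop the echo to actually run it.
echo psql -d "$DBNAME" -c "$SQL"
```

Using `pg_total_relation_size` instead would fold each table's indexes and TOAST data into a single row, which is handy when estimating the total footprint before an analytics run.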
>>>> The only thing on the server really is
>>>> Linux, PostgreSQL and Tomcat. Nothing else. Without analytics and
>>>> everything installed for the system, we have about 23% of that 60 GB free.
>>>>
>>>> When analytics runs, it maintains a copy of the main analytics tables
>>>> (analytics_XXXX) and creates temp tables like analytics_temp_2004. When
>>>> things are finished and the indexes are built, the tables are swapped. This
>>>> ensures that analytics resources are available while analytics are being
>>>> built, but the downside is that A LOT more disk space is required,
>>>> as now we effectively have two copies of the tables along with all their
>>>> indexes, which are quite large themselves (up to 60% of the size of the
>>>> table itself). Here's what happens when analytics is run:
>>>>
>>>> public.analytics_temp_2015 | 1017 MB
>>>> public.analytics_temp_2014 |  985 MB
>>>> public.analytics_temp_2011 |  952 MB
>>>> public.analytics_temp_2010 |  918 MB
>>>> public.analytics_temp_2013 |  885 MB
>>>> public.analytics_temp_2012 |  835 MB
>>>> public.analytics_temp_2009 |  804 MB
>>>>
>>>> Now each analytics table is taking about 1 GB of space. In the end, it
>>>> adds up to more than 60 GB and analytics fails to complete.
>>>>
>>>> So, while I understand the need for this functionality, I am wondering
>>>> if we need a system option to allow the analytics tables to be dropped
>>>> prior to regenerating them, or to have more control over the order in which
>>>> they are generated (for instance, to generate specific periods). I realize
>>>> this can be done from the API or the scheduler, but only for the past three
>>>> relative years.
>>>>
>>>> The reason I am asking for this is because it's a bit of a pain (at the
>>>> moment) when using Digital Ocean as a service provider, since their stock
>>>> disk storage is 60 GB. With other VPS providers (Amazon, Linode), it's a bit
>>>> easier, but DigitalOcean only supports block storage in two regions at the
>>>> moment.
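Pending a built-in option, the "drop before regenerating" behaviour Jason asks for can be approximated with the maintenance endpoints Dan mentions earlier in the thread: clear the old analytics tables first, then rebuild, trading analytics availability during the rebuild for roughly half the peak disk usage. A hedged sketch, with placeholder URL and credentials and endpoint paths as quoted in the thread:

```shell
#!/bin/sh
# Sketch: drop analytics tables before regenerating, so the server
# never holds two full copies (live + temp) of them at once.
# Trade-off: analytics are unavailable while the rebuild runs.
BASE_URL="${BASE_URL:-https://dhis2.example.org}"   # placeholder
AUTH="${AUTH:-admin:district}"                      # placeholder
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print only

post() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl -X POST -u \$AUTH $BASE_URL$1"
  else
    curl -s -X POST -u "$AUTH" "$BASE_URL$1"
  fi
}

# 1. Drop the existing analytics tables (frees roughly half the peak space).
post "/api/24/maintenance/analyticsTablesClear"
# 2. Rebuild, restricted to recent years to bound the temp-table size.
post "/api/resourceTables/analytics?lastYears=2"
```

As noted above, the `lastYears` parameter only covers the last few relative years, so this bounds disk usage rather than giving the per-period control being requested.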
>>>> Regardless, it would seem somewhat wasteful to have to provision such a
>>>> large amount of disk space for such a relatively small database.
>>>>
>>>> Is this something we just need to plan for and maybe provide better
>>>> documentation on, or should we think about trying to offer better
>>>> functionality for people running smaller servers?
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>> _______________________________________________
>>>> Mailing list: https://launchpad.net/~dhis2-devs
>>>> Post to     : dhis2-devs@lists.launchpad.net
>>>> Unsubscribe : https://launchpad.net/~dhis2-devs
>>>> More help   : https://help.launchpad.net/ListHelp
>>>
>>> --
>>>
>>> *******************************************
>>>
>>> Calle Hedberg
>>>
>>> 46D Alma Road, 7700 Rosebank, SOUTH AFRICA
>>>
>>> Tel/fax (home): +27-21-685-6472
>>>
>>> Cell: +27-82-853-5352
>>>
>>> Iridium SatPhone: +8816-315-19119
>>>
>>> Email: calle.hedb...@gmail.com
>>>
>>> Skype: calle_hedberg
>>>
>>> *******************************************