Today at 5:20, Owen Taylor wrote:
> Something that takes 4 hours of CPU time on window (a day?) probably
> isn't a huge deal ... window isn't terribly CPU-bound currently... and
> the process could be niced down.
Stats are usually done multiple times a day. I am not quite sure of the current schedule, but I think Carlos is running them at least three times a day (7am, 3pm, 11pm). If you ask translators, they'd prefer to have them updated as often as possible, though with the current code that's not realistic. Carlos' new code should provide that with much lower CPU usage, by watching CVS directly. Now, if only someone found time to finish the code if Carlos doesn't make it (it's available in his svn repo somewhere on carlos.pemas.net, I think :).

> But if it's doing significant disk work - so ejecting stuff out of
> cache, then it's going to impact all bugzilla users, all anoncvs users,
> all people accessing www.gnome.org etc. window isn't really a good place
> to run intensive jobs, because so much is going on there.

It does do significant disk work: it basically checks out the entire GNOME CVS repository, runs "intltool-update -p" and then "msgmerge" on every single PO file (sometimes for multiple branches), and creates hundreds of static .html files containing the statistics.

> Things to do:
>
> - Try running it on window, see if it really takes 4 hours, or 2 hours.
>   (30-45 minutes might be an OK time for an intensive task to churn.)
>
> - Get someone to look at optimizing it. You can do an incredible amount
>   of work in 4 hours these days ... if this task is taking 4 hours,
>   it's being done inefficiently. (Not volunteering)

It is somewhat inefficient, and Carlos acknowledges it: he has new status pages in the works which would provide more features and should be better suited to running on window. :)

The big CPU-bound task is running "msgmerge" with fuzzy matching: it uses a slow string-distance algorithm to find the "most similar" strings in existing translations, so they can be reused. Now imagine that running 50 times over a set of 5000 strings (e.g. Evolution with 50 translations). My estimate is that most of the time is spent there.
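To make the cost concrete, here is a minimal sketch of what per-language fuzzy matching amounts to. It uses Python's difflib as a stand-in for msgmerge's real string-distance routine (which is fstrcmp-based), but the quadratic per-pair cost is comparable; the function name and sample strings are my own, purely illustrative:

```python
import difflib

def fuzzy_match(msgid, old_msgids):
    """Find the old msgid most similar to a new one, roughly what
    msgmerge's fuzzy matching does. difflib is a stand-in here:
    each comparison costs on the order of len(a) * len(b)."""
    matches = difflib.get_close_matches(msgid, old_msgids, n=1, cutoff=0.6)
    return matches[0] if matches else None

# Every language repeats the same expensive scan over its own PO file,
# so ~5000 msgids times 50 languages means 50 full similarity passes.
old = ["Open a file", "Save the file", "Close the window"]
print(fuzzy_match("Open file", old))
```

The point is not the exact algorithm but the shape of the work: the scan over candidate strings is repeated once per language, even though the English msgids being matched are largely the same every time.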
I just ran a test without fuzzy matching, as a check of my assumption, and the runtime dropped from 4.5 hours to 1.8 hours on i18n-status.gnome.org (a different machine, hosted by Keld). Note that fuzzy matching is very important for translators. So, most of the process is CPU bound (part of those 1.8 hours is also CPU bound, and the 60% difference is definitely CPU- and memory-related).

I have an idea for optimising the msgmerge step that should do a significantly better job: first concatenate all the PO files to get a list of *all* English strings, then run a msgmerge-style step once against that list and the POT file, producing a table of similarity matches. The problem is that this requires a bit more memory, but it should speed the process up O(n) times, where "n" is the number of PO files/languages, provided we don't run into excessive page faults and swapping :).

As for disk optimisations, storing statistics in a database (which is what Carlos' new code does) is probably much better than generating hundreds of .html files.

Another disk optimisation is the handling of CVS checkouts. For various reasons, all checkouts are currently done in full, i.e. the checkout is first removed and only then "cvs co"ed again. If there were no hand tuning of CVS repositories in GNOME, maybe "cvs up -Pd" would be sufficient? I don't know enough details of CVS hacking to answer this, but the basic requirement is that we ensure a pristine CVS tree before running "intltool-update -p" and msgmerge.

> - If it really is that intensive, it's not optimizable, we need it on a
>   gnome.org server, than container is probably the most appropriate
>   home:
>
>   window: 2 gig ram, 72gig (raid 1) disk, load avg ~1
>   container: 6 gig ram, 500gig (raid 5) disk, load avg ~0.2

Any machine with sufficient CPU power and low load will do.
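The batched-similarity idea above could be sketched like this. This is only my own illustration of the proposal, not Carlos' actual code; the function names are hypothetical, and difflib again stands in for the real string-distance step. The expensive scan happens once, and the per-language work collapses to dictionary lookups:

```python
import difflib

def build_similarity_table(pot_msgids, all_known_msgids):
    """One msgmerge-style pass: for each msgid in the POT file, find
    its closest match among *all* msgids seen in any PO file. This is
    the only expensive step, and it runs once, not once per language.
    The memory cost is holding every known English string at once."""
    table = {}
    for msgid in pot_msgids:
        close = difflib.get_close_matches(msgid, all_known_msgids,
                                          n=1, cutoff=0.6)
        if close:
            table[msgid] = close[0]
    return table

def merge_language(table, po_translations):
    """Per-language step: a cheap lookup instead of a similarity scan.
    Reuse a translation if this language has one for the matched
    old msgid."""
    merged = {}
    for new_msgid, old_msgid in table.items():
        if old_msgid in po_translations:
            merged[new_msgid] = po_translations[old_msgid]  # fuzzy reuse
    return merged
```

With n languages, the similarity scan drops from n passes to one, which is where the O(n) speedup estimate comes from; the trade-off is the memory for the concatenated string list and the match table.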
But some amount of disk-bound work is still necessary, because we are in any case talking about parsing the full CVS tree to find extractable strings, and then working on each PO file in turn.

Cheers,
Danilo

_______________________________________________
gnome-i18n mailing list
gnome-i18n@gnome.org
http://mail.gnome.org/mailman/listinfo/gnome-i18n