Hi,

On Fri, 3 May 2013, Bruce Dubbs wrote:
> I'm going to write a program to automatically identify out of date
> packages for LFS. Has anyone already done such a beast?

I've kind of been doing that for a couple of years now (including some
BLFS and even Windows stuff as well ;-]).

I started with a bunch of bash scripts that basically parsed certain
maintainer websites with certain regexps. That was hard to read,
neither fast nor flexible, and always out of date.

Current solution (that I've been happy with for quite some years):

All parsing is done by a single, simple C(++?) program now. It
basically follows _all_ links and handles general stuff like stripping
common extensions (*.tgz etc.) or an appended "/download", and
replacing "/from/a/mirror" with "/from/this/mirror".

As basic input it gets a list of simple rules to look for:
$packagename $starturl $pattern, e.g.:

  mpc    http://www.multiprecision.org/?prog=mpc&page=download  tar.gz
  check  http://sourceforge.net/projects/check/files/check/     /tar.gz/download

In most cases $pattern only specifies the (sub/parent) directory depth
to search in (the number of leading slashes) and the extension (or
better: the end) of the links to look for there. It usually does not
filter on any particular naming or versioning scheme.

As a result I get a list of directories/websites searched and a list
of URLs to potentially download. This would include following
uninteresting links (such as parent dirs, adverts, subdirs of outdated
versions, or subdirs of packages I'm not interested in). Therefore I
keep a list of fully qualified directories/websites that the C program
should not search again, e.g.:

  ftp://ftp.funet.fi:21/pub/mirrors/ftp.easysw.com/pub/cups/1.1.19/
  ftp://ftp.funet.fi:21/pub/mirrors/ftp.easysw.com/pub/cups/1.1.20/
  http://apache.osuosl.org/
  http://creativecommons.org/licenses/by-sa/3.0/
  http://jobs.sourceforge.net/

This gives me a list of package URLs, but it still includes stuff I'm
not interested in (which just happens to come from the same
directory/site) or stuff I already have. Therefore I keep a list of
such done packages with certain extensions stripped (to avoid getting
a tar.gz again as a tar.xz), e.g.:

  autoconf-2.52
  autoconf-2.53
  autoconf-2.54
  linux-2.6.16.18-utf8_input-1.patch
  linux-2.6.16.19
  linux-2.6.16.19-utf8_input-1.patch

The C program holds those 3 lists (currently 24KB of commented rules,
120KB of dirs done, 230KB of packages done) in memory and can
therefore filter results rapidly.

[You can add further sanity checks, like remembering when a certain
rule last yielded any package URLs at all, or last yielded new ones,
as a hint to check whether the maintainer changed the website,
extension or subdir structure.]

So I automatically get a list of subdirs currently being searched (and
may exclude older versions, new uninteresting packages or new adverts
from further search), and I automatically get a list of new package
URLs that I may either want to download or just mark as done (to skip
missed intermediate versions or by-catch of packages I'm not
interested in).
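To give an idea how little the rule test has to do, here is a rough
C++ sketch (illustrative only, not the actual program: the struct and
helper names are invented and the depth accounting is approximate):

  #include <cstddef>
  #include <string>

  // A rule is "$packagename $starturl $pattern": the pattern's leading
  // slashes give the subdirectory depth to search below $starturl, the
  // rest is the required end of the link.
  struct Rule {
      std::string name;     // e.g. "check"
      std::string start;    // e.g. "http://sourceforge.net/projects/check/files/check/"
      std::string pattern;  // e.g. "/tar.gz/download"
  };

  static std::size_t count_slashes(const std::string& s) {
      std::size_t n = 0;
      for (char c : s)
          if (c == '/') ++n;
      return n;
  }

  static bool ends_with(const std::string& s, const std::string& tail) {
      return s.size() >= tail.size()
          && s.compare(s.size() - tail.size(), tail.size(), tail) == 0;
  }

  // Does a (normalized) link found while crawling below r.start match?
  bool matches(const Rule& r, const std::string& link) {
      std::size_t depth = 0;
      while (depth < r.pattern.size() && r.pattern[depth] == '/')
          ++depth;                                 // leading slashes
      std::string tail = r.pattern.substr(depth);  // e.g. "tar.gz/download"

      // don't accept links deeper than $starturl plus the allowed depth
      // (the tail may contribute its own slash, as in ".../download")
      if (count_slashes(link) >
          count_slashes(r.start) + depth + count_slashes(tail))
          return false;
      return ends_with(link, tail);
  }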
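The filtering against the done-packages list is then just a set lookup
on canonicalized names, roughly like this (again only a sketch; the
extension list and function names are made up):

  #include <string>
  #include <unordered_set>
  #include <vector>

  // The key is the basename with "/download" and a known archive
  // extension stripped, so autoconf-2.54.tar.gz and
  // autoconf-2.54.tar.xz collapse to the same entry "autoconf-2.54".
  static const std::vector<std::string> kExts = {
      ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip"
  };

  static bool strip_suffix(std::string& s, const std::string& tail) {
      if (s.size() > tail.size()
          && s.compare(s.size() - tail.size(), tail.size(), tail) == 0) {
          s.erase(s.size() - tail.size());
          return true;
      }
      return false;
  }

  std::string done_key(std::string url) {
      strip_suffix(url, "/download");        // SourceForge-style links
      std::size_t pos = url.rfind('/');      // keep only the basename
      if (pos != std::string::npos)
          url.erase(0, pos + 1);
      for (const std::string& e : kExts)     // drop the archive extension
          if (strip_suffix(url, e))
              break;
      return url;                            // e.g. "autoconf-2.54"
  }

  bool is_new(const std::string& url,
              const std::unordered_set<std::string>& done_pkgs) {
      return done_pkgs.find(done_key(url)) == done_pkgs.end();
  }

With only a couple hundred KB of names, a hash set like this keeps
every lookup cheap, which is why holding all three lists in memory
makes the filtering so fast.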
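And the sanity check mentioned above can be as simple as two
timestamps per rule (just a sketch, field names invented):

  #include <ctime>
  #include <map>
  #include <string>

  // Per rule, remember when it last yielded any URL and when it last
  // yielded a *new* one. A rule that has been silent for a long time
  // probably means the maintainer moved or renamed things.
  struct RuleStats {
      std::time_t last_any_hit = 0;
      std::time_t last_new_hit = 0;
  };

  std::map<std::string, RuleStats> stats;  // keyed by package name

  void record_hit(const std::string& rule, bool was_new) {
      std::time_t now = std::time(nullptr);
      stats[rule].last_any_hit = now;
      if (was_new)
          stats[rule].last_new_hit = now;
  }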
Example: current list of new package URLs that I might potentially be
interested in downloading:

  http://ftp.gnome.org/pub/gnome/sources/gtk+/3.9/gtk+-3.9.0.tar.xz
  http://icedtea.wildebeest.org/download/source/icedtea-2.1.8.tar.gz
  http://sourceforge.net/projects/libpng/files/libpng15/1.5.16beta02/libpng-1.5.16beta04.tar.xz/download
  http://www.linuxfromscratch.org/blfs/downloads/svn/blfs-book-svn-html-2013-05-03.tar.bz2
  http://www.linuxfromscratch.org/lfs/downloads/development/LFS-BOOK-SVN-20130501.tar.bz2

Surely not perfect, but it is easy to maintain and does the job for me...

Uwe