This is interesting. Is this related to FOSSology
(http://www.fossology.org/projects/fossology) in any way?

mike

On Tue, Apr 17, 2012 at 6:35 AM, Silvio Cesare <silvio.ces...@gmail.com> wrote:

> The Debian Package clonewise-core (currently in the mentors archive)
> http://mentors.debian.net/package/clonewise-core
> http://www.foocodechu.com/downloads/clonewise
> --
>
> Clonewise is a tool for detecting code reuse in Debian packages. This is
> also known as detecting embedded code copies. Debian maintains a database
> of packages that embed code in the security tracker. Clonewise is a tool
> to automate and supplement the manual tracking of packages.
>
> The primary use of it is for the security team who may identify a
> vulnerability in a library and want to know if that library is reused and
> embedded in any other Debian packages.
>
> -- QUICK GUIDE
>
> You might want to install the Clonewise database instead of generating it
> (which can take several days when you first run Clonewise).
>
> Download it from http://www.foocodechu.com/downloads/clonewise/
>
> Example usage to discover if the source package libpng is reused in other
> Debian packages is as follows:
>
> $ Clonewise -vv libpng
> libpng CLONED_IN_SOURCE afterstep (18.457640)
>     MATCH png.c (5.605583) (33.000000)
>     MATCH pngtrans.c (6.409078) (57.000000)
>     MATCH pngwtran.c (6.442979) (80.000000)
> libpng CLONED_IN_PACKAGE libafterimage-dev
> libpng CLONED_IN_PACKAGE afterstep
> libpng CLONED_IN_PACKAGE afterstep-data
> libpng CLONED_IN_PACKAGE libafterimage0
> libpng CLONED_IN_PACKAGE afterstep-dbg
> libpng CLONED_IN_PACKAGE libafterstep1
> libpng CLONED_IN_SOURCE fltk1.1 (44.336105)
>     MATCH png.c (5.605583) (58.000000)
>     MATCH pngerror.c (6.442979) (57.000000)
>     MATCH pngmem.c (6.442979) (85.000000)
>     MATCH pngpread.c (6.514438) (52.000000)
>     MATCH pngrio.c (6.478071) (77.000000)
>     MATCH pngtrans.c (6.409078) (63.000000)
>     MATCH pngwtran.c (6.442979) (80.000000)
> libpng CLONED_IN_PACKAGE fltk1.1-doc
> libpng CLONED_IN_PACKAGE fltk1.1-games
> libpng CLONED_IN_PACKAGE libfltk1.1
> libpng CLONED_IN_PACKAGE libfltk1.1-dbg
> libpng CLONED_IN_PACKAGE libfltk1.1-dev
> [ snip ]
>
> So libpng is embedded in the source packages afterstep and fltk1.1.
> Looking at my version of the embedded-code-copies file on the security
> tracker, I can see that fltk1.1 is actually referenced as libfltk1.1 and
> was fixed a while ago. The security tracker is meant to report the source
> package name, so this should probably be fixed. Clonewise otherwise
> ignores embedded code copies that have been fixed (according to the
> security tracker). I can't see afterstep in the tracker, so again, we might
> need to make an update. We don't know if afterstep has been patched
> to use a system library, so we need to investigate more, for example by
> checking whether libpng is a dependency of the afterstep package. In real
> usage, if libpng is buggy, it's probably important to do this and check
> the afterstep package to see if it is vulnerable to a libpng bug.
>
> Each matching file has a weight and a score, which represent the
> significance of the file in the repository and the similarity of the file
> between the two packages, respectively.
>
> CLONED_IN_SOURCE are the source packages.
> CLONED_IN_PACKAGE are the binary packages built from the source package.
>
> -- BUILDING THE DATABASE
>
> If you don't install clonewise-database, then the database of the package
> repository will probably need to be built the first time you run Clonewise.
> You will need to be the superuser to do this and in all likelihood it will
> take several days to complete.
>
> Clonewise will run Clonewise-BuildDatabase when the database has not been
> built. It will download the entire Debian source repository, unpack the
> packages and generate signatures for each package.
>
> -- CONFIGURATION FILES
>
> There are a number of configuration files in Clonewise.
>
> /var/lib/Clonewise/extensions - contains a list of filename extensions that
> are used to identify source code.
> Clonewise ignores all reuse of non-program code in package contents and
> this is how it knows this.
>
> /var/lib/Clonewise/threshold - is the default threshold of the amount of
> code reuse that needs to occur before Clonewise reports it. If you get too
> many false positives, then increase this number. You can also override this
> threshold on the command line with Clonewise -C <threshold>.
>
> /var/lib/Clonewise/ignore-these-fixed - is a list of package pairs from
> the embedded-code-copies file maintained in the Debian security tracker
> where it has been reported that the packages in question have been modified
> so system wide libraries are being used and there is no embedded code in
> the build.
>
> /var/lib/Clonewise/ignore-these-false-positives - is a list of package
> pairs that should not be reported as having code reuse. This file is
> intended to contain known false positives.
>
> -- HELPER UTILITIES
>
> Clonewise-ParseDatabase is a program to parse Debian's embedded-code-copies
> file maintained in the security tracker. Probably the main use of it is to
> generate the content for the ignore-these-fixed configuration file.
>
> To list the package pairs of embedded code that are reported to have been
> "fixed", run this command:
>
> $ Clonewise-ParseDatabase -f <embedded-code-copies-file>
>
> The output of that command can go directly into the ignore-these-fixed
> configuration file. For example:
>
> # Clonewise-ParseDatabase -f <embedded-code-copies> > /var/lib/Clonewise/ignore-these-fixed
>
> You might want to run that command whenever the upstream version of the
> embedded-code-copies file is changed to reflect that a package has been
> fixed to avoid an embedded code copy.
>
> The -u option is for identifying unfixed embedded code copies. The command
> run without any options prints all embedded code copies in the Clonewise
> native format.
>
> Another utility which is probably only useful for developers is:
>
> $ Clonewise-RunTests
>
> This is useful for comparing Clonewise's results against Debian's manually
> created embedded-code-copies file maintained in the security tracker.
>
> -- COMMAND LINE OPTIONS
>
> The command line options for Clonewise are:
>
> -e              Report all internal errors.
>
> -o xml          Output in XML.
>
> -C <threshold>  Override threshold configuration on how much code reuse
>                 needs to occur before reporting.
>
> -v              Verbose - show more information.
>
> -vv             Really verbose - show why packages are reported as reusing
>                 code. This is the option most people want.
>
> -vvv            Show scores for all packages. Not really useful for non
>                 developers.
>
> -a              Run analysis over entire database and show all embedded
>                 code copies. When using this option, no package name
>                 argument is required on the command line.
>
> -s              Don't use ssdeep to do a fuzzy check of similar content.
>                 This will increase the false positive rate, but can also
>                 increase the true positive rate. Probably not useful for
>                 non developers.
>
> -t              Don't use filename extensions when comparing packages. This
>                 is useful if you are looking for reuse of a package's
>                 contents that is not based on program code.
>
> -- EXTENDED DESCRIPTION OF THE NUMBERS IN THE OUTPUT
>
> What are the numbers in the output of Clonewise? They represent weights and
> scores.
>
> $ Clonewise -vv libpng
> libpng CLONED_IN_SOURCE afterstep (18.457640)
>     MATCH png.c (5.605583) (33.000000)
>     MATCH pngtrans.c (6.409078) (57.000000)
>     MATCH pngwtran.c (6.442979) (80.000000)
> [ snip ]
>
> png.c has a weight of 5.605583. The more frequently png.c occurs across
> packages in the Debian source repository, the lower the weight. For
> example, if extensions were not used and README was matched, then the
> weight would be very low because the filename README occurs in almost
> every package.
>
> png.c has a similarity of 33.000000.
> This means that ssdeep identified a similarity of 33% between png.c in the
> afterstep and libpng packages. Because it is greater than 0, it probably
> means that they derive from the same source in some earlier version of
> libpng.
>
> The score of 18.457640 is an accumulation of the weights in the matching
> files. This score is what the Clonewise threshold is compared against. If
> this score is greater than the threshold, Clonewise reports code reuse to
> have occurred. The higher this number, the more believable it is that code
> reuse has occurred.
>
> -- HOW DOES IT WORK?
>
> It's a simple idea really. If two packages' source trees share the same
> filenames, and the content looks similar according to a fuzzy hash, then
> they share code.
>
> Each filename has a weight based on the inverse document frequency. This
> is a fancy way of saying if the same filename is common to lots of packages
> then it has a lower weight.
>
> Each matching file is counted and the weights all add up. If the sum weight
> exceeds a threshold, Clonewise will report it.
>
> --
> Silvio Cesare <silvio.ces...@gmail.com>
>
>
> --
> To UNSUBSCRIBE, email to debian-mentors-requ...@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
> Archive: http://lists.debian.org/ca+ygn1ja3dpdnjfyzy_bzje2iurvhuhmy9rxshy3kfbe3p...@mail.gmail.com

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
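
For anyone reading the archive: the scoring described under HOW DOES IT WORK? in the quoted message can be sketched roughly as below. This is only an illustration, not Clonewise's actual code. The IDF formula log(N / df), the toy package data, the THRESHOLD value and all function names are assumptions made for the example, and difflib's SequenceMatcher stands in for ssdeep, which is not in the Python standard library.

```python
import math
from difflib import SequenceMatcher


def filename_weights(packages):
    """Weight each filename by inverse document frequency across packages.

    Assumes the common log(N / df) form; the message does not state the
    exact formula Clonewise uses.
    """
    total = len(packages)
    counts = {}
    for files in packages.values():
        for name in files:
            counts[name] = counts.get(name, 0) + 1
    return {name: math.log(total / seen) for name, seen in counts.items()}


def similarity(a, b):
    """Stand-in for the ssdeep fuzzy comparison, on a 0 to 100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def reuse_score(library, other, weights):
    """Sum the weights of shared filenames whose contents look similar."""
    score = 0.0
    matches = []
    for name in sorted(set(library) & set(other)):
        sim = similarity(library[name], other[name])
        if sim > 0:  # any similarity counts as a match in this sketch
            score += weights[name]
            matches.append((name, weights[name], sim))
    return score, matches


# Hypothetical miniature "repository": package -> {filename: contents}.
packages = {
    "libpng": {
        "README": "libpng readme",
        "png.c": "int png_init(void) { return 0; }",
        "pngtrans.c": "void png_swap(int *a, int *b) { }",
    },
    "afterstep": {
        "README": "afterstep readme",
        "png.c": "int png_init(void) { return 1; }",
        "pngtrans.c": "void png_swap(int *a, int *b) { /* patched */ }",
    },
    "zlib": {"README": "zlib readme", "inflate.c": "int inflate(void);"},
    "hello": {"README": "hello readme", "hello.c": "int main(void);"},
}

THRESHOLD = 1.0  # hypothetical; the real default lives in /var/lib/Clonewise/threshold

weights = filename_weights(packages)
score, matches = reuse_score(packages["libpng"], packages["afterstep"], weights)
unrelated_score, _ = reuse_score(packages["libpng"], packages["hello"], weights)

for name, weight, sim in matches:
    print(f"MATCH {name} ({weight:f}) ({sim:f})")
print("reported" if score > THRESHOLD else "not reported", f"({score:f})")
```

README appears in every toy package, so its weight is zero and it adds nothing to the score, which mirrors the README remark in the quoted description. Only the rarer png*.c names push the libpng/afterstep pair over the threshold.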