On Sat, Aug 3, 2024 at 7:44 AM Werner LEMBERG <w...@gnu.org> wrote:
> Han-Wen wrote:
>
> > The idea is that we'd want to trigger less on diffuse lines of
> > difference (shifted staffline), but more on concentrated blobs
> > (disappearing symbol). Here is a suggestion
> >
> > * generate images without anti-aliasing
> > * generate the image diff
> > * for each changed pixel, count the number of changed neighbor pixels.
> > * for each pixel, take max(changed - THRESHOLD, 0)
> > * then do MAE or some other metric on that image.
> >
> > By setting THRESHOLD = 2, you could make single line differences
> > disappear. (We'd need to make sure to make the resolution such that
> > stafflinethickness is at least 2 pixels, or otherwise dramatic
> > changes in staff positioning would not trigger anything.)
>
> Sounds good!
>
> > Alternatively, on the diff image, you could segment the diff into
> > connected components, and weight the errors by the minimum of
> > {x-size, y-size}. Thus, a line has a diameter of only 1 pixel, but
> > missing notehead is much larger.
>
> Sounds good, too. However, my knowledge of such algorithms is far too
> limited to even have an opinion.

Over the last few days I did not have enough time to write code for
this, but I had a bit of time to look into various approaches I could
use.

These sub-pixel shifts, I must admit, worry me a bit. Especially
without antialiasing, I think they will mean that (for example) the
staff lines will be jumpy as to whether they end up 1 pixel or 2 pixels
wide; how many pixels are covered by a notehead, same story. It seems
to me we should strive to increase the rendering resolution as much as
we can (within reason, but if we could work at, say, 600 or 1200 dpi,
this would reduce the effects above).

Either way, I got it into my mind that instead of doing a direct
pixel-to-pixel analysis of the images, we should employ a strategy
similar to tracking: this is what you do to video when you want to
remove camera motion, for example, or digitally add some writing on a
wall while the camera moves around and have it appear to stick to the
wall itself. In that scenario the idea is that you use a feature
detector to identify "features" in your image; from that you get a set
of locations on each frame separately, and then you use some approach
to reconcile the previous frame's locations with the current frame's.

The "feature" is a key notion in computer vision: granted that a
feature is "whatever you want"(TM), typical features would be line
segments, corners, crossings and blobs, all of them things that happen
in a small region around a pixel (a few pixels across, say 3x3, 5x5 or
8x8, these kinds of sizes).

Because our images have so much contrast and no texture (white is dead
flat, and so is black), feature detectors will on the one hand be a bit
out of their element (they're developed to work well on photographs and
CG renders), but on the other hand most of them should end up having a
really easy time, giving the _same_ features on both images. (That is
under the assumption the two images are that "shifted by 0.37593
pixels" case; for "no shift" images they will clearly give you the
exact same features, up to honestly missing pieces: these are all
deterministic algorithms, or rather, for our use case it wouldn't make
much sense to use non-deterministic ones.)

Once you have your list of features you can realign them however you
please (bbox matching for bulk alignment, then spatial sorting to match
each feature with its twin in the other set, say); a rough sketch of
what this could look like is below.
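To make the detect-and-match step a bit more concrete, here is a rough
sketch of what it could look like in Python, using OpenCV's ORB
detector and a brute-force matcher purely as stand-ins for "some
detector" and "some matcher" (the library, the detector and the
function names are illustrative placeholders, not a proposal to settle
on these specific ones):

    # Illustrative sketch: ORB + brute-force matching stand in for
    # "some feature detector" and "some matcher"; not a commitment
    # to OpenCV or to this particular detector.
    import cv2

    def detect_and_match(path_old, path_new):
        img_old = cv2.imread(path_old, cv2.IMREAD_GRAYSCALE)
        img_new = cv2.imread(path_new, cv2.IMREAD_GRAYSCALE)

        orb = cv2.ORB_create(nfeatures=2000)
        kp_old, des_old = orb.detectAndCompute(img_old, None)
        kp_new, des_new = orb.detectAndCompute(img_new, None)
        if des_old is None or des_new is None:
            # One of the images produced no features at all.
            return [], (len(kp_old), len(kp_new))

        # Cross-checked brute-force matching plays the role of the
        # "match each feature with its twin" step described above.
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des_old, des_new)

        # For each matched pair, how far did the feature move?
        shifts = []
        for m in matches:
            x0, y0 = kp_old[m.queryIdx].pt
            x1, y1 = kp_new[m.trainIdx].pt
            shifts.append(((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5)

        # Features that found no twin in the other image.
        unmatched = (len(kp_old) - len(matches),
                     len(kp_new) - len(matches))
        return shifts, unmatched

The shifts and unmatched counts it returns are exactly the kind of
higher-level information I'm arguing for: a cloud of tiny shifts points
at a sub-pixel drift, a pile of unmatched features points at a symbol
that appeared or disappeared.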
The reconciliation step is called matching. Once the feature sets are
matched you can do whatever you like: attempt a very accurate alignment
(lots of points give you more information about the alignment, which is
how you'd try to recover the subpixel shifts), or simply say that if
"more than 5" features have disappeared, some major disaster happened
(a notehead's gone?). Things like that.

So the above is the idea of using computer vision to honestly "look" at
the image, infer somewhat higher-level information from it, and take
decisions based on that information rather than on raw pixels.

One thing I find appealing about the above is that in a way it moves
beyond the pixels themselves, which I think is good here: the output of
lilypond is a PDF, not pixels. In an ideal world, if the renderer
changed, our tests would return identical results; we don't even
deliver the pixels, so why would we be so picky about them?

So this is the testing strategy I have in mind. It's easy to put
together, and it should be fast enough for us (people run this stuff in
real time on continuous-tone images that are 2k pixels across). Once we
confirm this works out for our needs, there will be some fine-print
choices to make wrt licensing of the libraries and other legal stuff,
but let's not rush into that just yet.

Keen to hear your thoughts on the above,
L

PS: Having said all that, I must admit I remain fascinated by the idea
of analyzing the output stream directly. I know you don't like the
idea, but it seems to me it'd be a lot easier to reason about "glyph 37
moved from xy location to xy2 location" or "document font 3 changed
from emmentaler to sebastiano" than about "you had a point cloud of 34
points, these are now 37, the median shift is 0.274 pixels, and 85%
moved less than 0.1% of the median". I do wonder if scraping the PDF or
something along those lines (convert PDF to PS and use PS programming
to extract what you want, for example) wouldn't be a fundamentally more
robust way to regtest; a rough sketch of that kind of scraping follows
below.

--
Luca Fascione
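To give a flavor of what scraping could look like without going through
PS at all, here is a small sketch that pulls glyph names, fonts and
positions straight out of the PDF with pdfminer.six (again only an
illustration, and under the assumption that lilypond's music glyphs
actually surface through a generic layout extractor like this; that
would need checking):

    # Sketch only: pdfminer.six as one possible way to list glyphs,
    # fonts and positions from the PDF.  Whether LilyPond's music
    # glyphs all show up this way is an assumption to verify.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    def iter_chars(layout_obj):
        # Recursively walk the layout tree down to individual glyphs.
        if isinstance(layout_obj, LTChar):
            yield layout_obj
        elif hasattr(layout_obj, "__iter__"):
            for child in layout_obj:
                yield from iter_chars(child)

    def glyph_table(pdf_path):
        table = []
        for page_no, page in enumerate(extract_pages(pdf_path), start=1):
            for ch in iter_chars(page):
                table.append({
                    "page": page_no,
                    "glyph": ch.get_text(),
                    "font": ch.fontname,
                    "size": round(ch.size, 3),
                    "bbox": tuple(round(v, 3) for v in ch.bbox),
                })
        return table

Diffing two such tables would let the regtest talk directly in terms of
"this glyph moved" or "this font changed", which is exactly the kind of
statement I'd find easier to reason about.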