On Sat, Aug 3, 2024 at 7:44 AM Werner LEMBERG <w...@gnu.org> wrote:
> Han-Wen wrote:
>
> > The idea is that we'd want to trigger less on diffuse lines of
> > difference (shifted staffline), but more on concentrated blobs
> > (disappearing symbol). Here is a suggestion
> >
> > * generate images without anti-aliasing
> > * generate the image diff
> > * for each changed pixel, count the number of changed neighbor pixels.
> > * for each pixel, take max(changed - THRESHOLD, 0)
> > * then do MAE or some other metric on that image.
> >
> > By setting THRESHOLD = 2, you could make single line differences
> > disappear. (We'd need to make sure to make the resolution such that
> > stafflinethickness is at least 2 pixels, or otherwise dramatic
> > changes in staff positioning would not trigger anything.)
>
> Sounds good!
>
> > Alternatively, on the diff image, you could segment the diff into
> > connected components, and weight the errors by the minimum of
> > {x-size, y-size}. Thus, a line has a diameter of only 1 pixel, but
> > missing notehead is much larger.
>
> Sounds good, too. However, my knowledge of such algorithms is far too
> limited to even have an opinion.

Over the last few days I did not have enough time to write code for
this, but I had a bit of time to look into various approaches I could
use.

These sub-pixel shifts, I must admit, worry me a bit. Especially
without antialiasing, I think they will mean that (for example) the
staff lines will be jumpy as to whether they end up 1 pixel or 2 pixels
wide; how many pixels are covered by a notehead, same story. It seems
to me we should strive to increase the rendering resolution as much as
we can (within reason, but if we could work at, say, 600 or 1200 dpi,
this would reduce the effects above).

Either way, I got it into my mind that instead of doing a direct
pixel-to-pixel analysis of the images, we should employ a strategy
similar to tracking: this is what you do to video when you want to
remove camera motion, for example, or digitally add some writing on a
wall while the camera moves around and have it appear to stick to the
wall itself. In that scenario the idea is that you use a feature
detector to identify "features" in your image; from that you get a set
of locations on each frame separately, and then you use some approach
to reconcile the previous frame's locations with the current frame's.

The "feature" is a key notion in computer vision: granted that a
feature is "whatever you want"(TM), typical features would be line
segments, corners, crossings and blobs, all of them things that happen
in a small region around a pixel (a few pixels across, say 3x3, 5x5 or
8x8, these kinds of sizes).

Because our images have so much contrast and no texture (white is dead
flat, and so is black), feature detectors will on the one hand be a bit
out of their element (they're developed to work well on photographs and
CG renders), but on the other hand most of them should end up having a
really easy time, giving the _same_ features on both images. (That is
under the assumption the two images are that "shifted by 0.37593
pixels" case; for "no shift" images they will clearly give you the
exact same features, up to honestly missing pieces: these are all
deterministic algorithms, or rather, for our use case it wouldn't make
much sense to use non-deterministic ones.)

Once you have your list of features you can realign them however you
please (bbox matching for bulk alignment, then spatial sorting to match
each feature with its twin in the other set, say); a rough sketch of
what this could look like is below.
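To make the detect-and-match step a bit more concrete, here is a rough
sketch of what it could look like in Python, using OpenCV's ORB
detector and a brute-force matcher purely as stand-ins for "some
detector" and "some matcher" (the library, the detector and the
function names are illustrative placeholders, not a proposal to settle
on these specific ones):

    # Illustrative sketch: ORB + brute-force matching stand in for
    # "some feature detector" and "some matcher"; not a commitment
    # to OpenCV or to this particular detector.
    import cv2

    def detect_and_match(path_old, path_new):
        img_old = cv2.imread(path_old, cv2.IMREAD_GRAYSCALE)
        img_new = cv2.imread(path_new, cv2.IMREAD_GRAYSCALE)

        orb = cv2.ORB_create(nfeatures=2000)
        kp_old, des_old = orb.detectAndCompute(img_old, None)
        kp_new, des_new = orb.detectAndCompute(img_new, None)
        if des_old is None or des_new is None:
            # One of the images produced no features at all.
            return [], (len(kp_old), len(kp_new))

        # Cross-checked brute-force matching plays the role of the
        # "match each feature with its twin" step described above.
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des_old, des_new)

        # For each matched pair, how far did the feature move?
        shifts = []
        for m in matches:
            x0, y0 = kp_old[m.queryIdx].pt
            x1, y1 = kp_new[m.trainIdx].pt
            shifts.append(((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5)

        # Features that found no twin in the other image.
        unmatched = (len(kp_old) - len(matches),
                     len(kp_new) - len(matches))
        return shifts, unmatched

The shifts and unmatched counts it returns are exactly the kind of
higher-level information I'm arguing for: a cloud of tiny shifts points
at a sub-pixel drift, a pile of unmatched features points at a symbol
that appeared or disappeared.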
The reconciliation step is called matching. Once the feature sets are
matched you can do whatever you like: attempt a very accurate alignment
(lots of points give you more information about the alignment, which is
how you'd try to recover the subpixel shifts), or simply say that if
"more than 5" features have disappeared, some major disaster happened
(a notehead's gone?). Things like that.

So the above is the idea of using computer vision to honestly "look" at
the image, infer somewhat higher-level information from it, and take
decisions based on that information rather than on raw pixels.

One thing I find appealing about the above is that in a way it moves
beyond the pixels themselves, which I think is good here: the output of
lilypond is a PDF, not pixels. In an ideal world, if the renderer
changed, our tests would return identical results; we don't even
deliver the pixels, so why would we be so picky about them?

So this is the testing strategy I have in mind. It's easy to put
together, and it should be fast enough for us (people run this stuff in
real time on continuous-tone images that are 2k pixels across). Once we
confirm this works out for our needs, there will be some fine-print
choices to make wrt licensing of the libraries and other legal stuff,
but let's not rush into that just yet.

Keen to hear your thoughts on the above,
L

PS: Having said all that, I must admit I remain fascinated by the idea
of analyzing the output stream directly. I know you don't like the
idea, but it seems to me it'd be a lot easier to reason about "glyph 37
moved from xy location to xy2 location" or "document font 3 changed
from emmentaler to sebastiano" than about "you had a point cloud of 34
points, these are now 37, the median shift is 0.274 pixels, and 85%
moved less than 0.1% of the median". I do wonder if scraping the PDF or
something along those lines (convert PDF to PS and use PS programming
to extract what you want, for example) wouldn't be a fundamentally more
robust way to regtest; a rough sketch of that kind of scraping follows
below.

--
Luca Fascione
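To give a flavor of what scraping could look like without going through
PS at all, here is a small sketch that pulls glyph names, fonts and
positions straight out of the PDF with pdfminer.six (again only an
illustration, and under the assumption that lilypond's music glyphs
actually surface through a generic layout extractor like this; that
would need checking):

    # Sketch only: pdfminer.six as one possible way to list glyphs,
    # fonts and positions from the PDF.  Whether LilyPond's music
    # glyphs all show up this way is an assumption to verify.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    def iter_chars(layout_obj):
        # Recursively walk the layout tree down to individual glyphs.
        if isinstance(layout_obj, LTChar):
            yield layout_obj
        elif hasattr(layout_obj, "__iter__"):
            for child in layout_obj:
                yield from iter_chars(child)

    def glyph_table(pdf_path):
        table = []
        for page_no, page in enumerate(extract_pages(pdf_path), start=1):
            for ch in iter_chars(page):
                table.append({
                    "page": page_no,
                    "glyph": ch.get_text(),
                    "font": ch.fontname,
                    "size": round(ch.size, 3),
                    "bbox": tuple(round(v, 3) for v in ch.bbox),
                })
        return table

Diffing two such tables would let the regtest talk directly in terms of
"this glyph moved" or "this font changed", which is exactly the kind of
statement I'd find easier to reason about.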