On 28/04/2023 15:06, Nicolas George wrote:
Max Nikulin (12023-04-28):
So URI comparison is not a trivial task.
It is an impossible task unless you have specific information about the
workings of the website.
However some steps toward URL normalization should still be tried.
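
To make "some steps" concrete, here is a minimal sketch using only the
Python standard library. The function name normalize_url and the choice
of steps (lowercase scheme and host, drop the default port, use "/" for
an empty path, uppercase percent-escapes, drop the fragment) are my
assumptions for illustration, not a complete or authoritative
normalizer. It does not decide whether two URLs are equivalent on a
particular server.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Apply a few safe, syntax-only normalization steps (hypothetical helper)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if ":" in host:
        # urlsplit strips the brackets from IPv6 literals; put them back.
        host = f"[{host}]"
    netloc = host
    # Keep the port only when it differs from the scheme default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # Preserve userinfo (user:password@) unchanged if it is present.
    userinfo, sep, _ = parts.netloc.rpartition("@")
    if sep:
        netloc = f"{userinfo}@{netloc}"
    # An empty path and "/" refer to the same resource for http(s).
    path = parts.path or "/"
    # Percent-escapes are case-insensitive; use uppercase hex digits.
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)
    # The query is left as is: parameter order may matter to the server.
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

The query string is deliberately untouched, since reordering or removing
parameters requires knowledge of the specific site.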
And you will quickly face servers that send an incorrect Content-Type or
intentionally set application/octet-stream with an
X-Content-Type-Options: nosniff header to force the browser to save the
file instead of opening it, e.g. in the built-in PDF reader.
So what?
Usually I would trust libmagic/file(1) more than the Content-Type
header. An HTTP server may choose the header based on the file
extension. Of course, there are cases when the info provided by libmagic
may be refined by Content-Type or by the file suffix (in the URI path or
in the download file name hint in HTTP headers): XPI browser extensions
are ZIP files. A plain text file may contain Markdown or
reStructuredText markup. You regret the absence of a standard way to
store the file type, but an incorrect value may be intentionally
specified there. I consider heuristics unavoidable, with or without a
standardized place.
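
A rough sketch of the kind of heuristic I mean, in Python. The tiny
SIGNATURES table is only a stand-in for libmagic (the real database is
far larger), and the names sniff, guess_type and REFINEMENTS, as well as
the precedence order, are my assumptions for illustration: content is
checked first, then the suffix may refine it, and the header is used
only when the content says nothing specific.

```python
import codecs
import os
from urllib.parse import urlsplit

# Toy stand-in for libmagic: a few well-known leading byte signatures.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]

# Suffixes that refine a container or plain-text type found by sniffing.
REFINEMENTS = {
    ("application/zip", ".xpi"): "application/x-xpinstall",
    ("text/plain", ".md"): "text/markdown",
    ("text/plain", ".rst"): "text/x-rst",
}

UNKNOWN = "application/octet-stream"


def sniff(head: bytes) -> str:
    """Guess a type from the first bytes of the content."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    if not head or b"\x00" in head:
        return UNKNOWN
    try:
        # Incremental decoding tolerates a multi-byte character that is
        # cut off at the end of the sampled bytes.
        codecs.getincrementaldecoder("utf-8")().decode(head)
    except UnicodeDecodeError:
        return UNKNOWN
    return "text/plain"


def guess_type(head: bytes, url: str, content_type: str = "") -> str:
    """Combine content sniffing, URL suffix and the declared header."""
    sniffed = sniff(head)
    suffix = os.path.splitext(urlsplit(url).path)[1].lower()
    refined = REFINEMENTS.get((sniffed, suffix))
    if refined:
        return refined
    declared = content_type.split(";")[0].strip().lower()
    # The header may narrow plain text down to a specific text format.
    if sniffed == "text/plain" and declared.startswith("text/"):
        return declared
    # Fall back to the header only when the content is inconclusive.
    if sniffed == UNKNOWN:
        return declared or UNKNOWN
    return sniffed
```

With this order a PDF served as application/octet-stream is still
recognized as application/pdf, while a ZIP file whose URL ends in .xpi
is reported as a browser extension.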