Coincidentally, we are also using pagefind ;) and *LITERALLY* 10 minutes ago I tested it and submitted feedback to pagefind, so that they produce `pip check` compliant wheels (the 1.4.0 version publishes its metadata wrongly; they fixed it in 1.5.0a3, but I suggested that they use a more accurate version specification) [1].

> As part of the site build, it generates a small HTML page for each entry
> and then uses pagefind [1] to index those files; it takes < 0.5 seconds.

I think one way of doing it would be to pull the entire cwiki content as
the "description" and index it. I think it will take a little longer, but
pagefind is designed to index even huge sites efficiently (no problem with
indexing 1000s of content pages in Airflow), and you can even parallelise
downloads of the cwiki pages with multiprocessing or even threads to get it
to way under a few seconds. And you do not have to do it manually: there
are ready-to-use packages that do just that [2].

This really seems to be the way pagefind works:

> Pagefind runs after Hugo, Eleventy, Jekyll, Next, Astro, SvelteKit, or
> any other website framework. The installation process is always the same:
> Pagefind only requires a folder containing the built static files of your
> website, so in most cases no configuration is needed to get started.

Also, Pagefind is an amazing piece of software: it allows all kinds of
filtering, it has a built-in ranking algorithm that can be easily
customized [3], and it even has a playground where you can explore your
custom ranking algorithms [4] to fine-tune things.

So I think the "noise" issue should not happen. Pagefind is optimised for
exactly this kind of search on full page content, so I guess that even the
standard ranking will give very good results (it works VERY well for
Airflow).

Happy to help with a PR if that seems a viable direction.

[1] https://github.com/Pagefind/pagefind/pull/989#issuecomment-3692725260
[2] Pywebcopy - threaded copying of websites/subset of those to local folders: https://pypi.org/project/pywebcopy/
[3] Customizing pagefind ranking: https://pagefind.app/docs/ranking/
[4] Pagefind playground, where you can try different kinds of rankings: https://pagefind.app/docs/playground/

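P.S. To make the "parallelise downloads with threads" idea a bit more concrete, here is a rough sketch using only the Python standard library. It is not what the incubator site build does today. The `(slug, title, url)` entry shape, the function names, and the crude tag-stripping are all my assumptions for illustration, not the real `resources.yml` schema or tooling.

```python
from concurrent.futures import ThreadPoolExecutor
from html import escape
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to a single line of plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_pages(entries, out_dir, fetch_page=fetch, workers=8):
    """Write one small HTML file per entry, using the full page text.

    `entries` is a list of (slug, title, url) tuples (hypothetical shape).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Downloads are I/O bound, so threads are enough to overlap them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(fetch_page, [url for _, _, url in entries]))
    written = []
    for (slug, title, _), body in zip(entries, bodies):
        path = out / f"{slug}.html"
        path.write_text(
            f"<html><head><title>{escape(title)}</title></head>"
            f"<body><h1>{escape(title)}</h1>"
            f"<p>{escape(html_to_text(body))}</p></body></html>",
            encoding="utf-8",
        )
        written.append(path)
    return written
```

The output folder would then be handed to pagefind the same way the generated per-entry pages are today (pagefind's `--site` option points it at such a folder). Whether Confluence is happy with being fetched like this is something I have not checked.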
On Fri, Dec 26, 2025 at 9:51 AM Justin Mclean <[email protected]> wrote:

> Hi,
>
> > Second suggestion: It looks like the search does not index all the
> > content of the wiki pages, only headings ? Or at least it looks like -
> > when I started to search for "signa" (tures) - it did find "signals" from
> > https://cwiki.apache.org/confluence/display/INCUBATOR/Graduation+Readiness
> > but did not find signatures anywhere, despite many mentions in the docs.
>
> It searches the heading and a short description of each page, this is also
> how the topic filters work. You can find the data here:
> https://github.com/apache/incubator/blob/master/tools/seealso/resources.yml
>
> As part of the site build, it generates a small HTML page for each entry
> and then uses pagefind [1] to index those files; it takes < 0.5 seconds.
>
> > Possibly indexing 'everything" would be much better - currently a few
> > things I searched for issues "not found" - when I clearly used some terms
> > that apparently **are** mentioned many times.
>
> Indexing everything might be more difficult and, as you say, introduce
> noise. I’m also not sure it can crawl the Wiki pages.
>
> Kind Regards,
> Justin
>
> 1. https://pagefind.app/
