I am trying to understand how to use the full-text search parser for URLS and hostnames to filter results from a text field containing URLS based on domain, and also how to index text columns for fast lookup/matching based on domain.
I have a PostgreSQL database containing documents and links downloaded by a web crawler, with the following tables: pages ---------- id: Integer (primary key) url: String (unique) title: String text: String html: String last_visit: DateTime word_pos: TSVECTOR links ---------- id Integer (primary key) source: String target: String link_text: String UNIQUE(source,target) crawls --------- id: Integer (primary key) query: String crawl_results ------------- id: Integer (primary key) score: Integer (constraint 0<=score<=1) crawl_id: Integer (foreign key, crawls.id) page_id: Integer (foreign key, pages.id) The `source` and `target` fields in the `links` table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet: WITH top_results AS (SELECT page_id, score FROM crawl_results WHERE crawl_id=$1 ORDER BY score LIMIT 100) SELECT top_results.score, l.target FROM top_results JOIN pages p ON top_results.page_id=p.id JOIN links l on p.url=l.source WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url) However, *I would like to filter these results so that only one row is returned for a given domain (the one with the lowest score)*. So for instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I only want the first because it has same domain `foo.com` and has the lower score. I was able to find documentation for the builtin full text search parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>, which can parse URLS and extract the hostname. For instance, I can extract the hostname from a URL as follows: SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6; token ------------- www.foo.com (1 row) However, I can't figure out how I would integrate this into the above query to filter out duplicate domains from the results. And because this is the docs for "testing and debugging text search <https://www.postgresql.org/docs/11/textsearch-debugging.html#TEXTSEARCH-PARSER-TESTING>", I don't know if this use of `ts_parse()` is even related to how the URL parser is intended to be used in practice. How would I use the "host" parser in my query above to return one row per domain? Also, how would I appropriately index the "links" table for "host" and "url" token lookup? Thanks!