On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <m...@goessmann.io> wrote:
> Hi everybody,
>
> I am trying to get all the lexemes for a text using to_tsvector(). But I
> want only words that english_stem -- the integrated snowball dictionary --
> is able to handle to show up in the final tsvector. Since snowball
> dictionaries only remove stop words, but keep the words that they cannot
> stem, I don't see an easy option to do this. Do you have any ideas?
>
> I went ahead with creating a new configuration:
>
> -- add new configuration english_led
> CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY =
> pg_catalog.english);
>
> -- dropping any words that contain numbers already in the parser
> ALTER TEXT SEARCH CONFIGURATION english_led
> DROP MAPPING FOR numword;
>
> EXAMPLE:
>
> SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt
> somejnk words');
> to_tsvector
> --------------------------------------------------
> 'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
>
> In this tsvector, I would like 'somejnk' and 'tt' not to be included.
>

I don't think the question is well defined. It will happily stem
'somejnking' to 'somejnk', doesn't that mean that it **can** handle it? The
fact that 'somejnk' itself wasn't altered during stemming doesn't mean it
wasn't handled, just like 'test' wasn't altered during stemming.

Cheers,

Jeff
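
The point about stemming can be checked directly against the dictionary, bypassing the parser and the configuration. This is only a sketch, assuming a stock installation where the built-in english_stem dictionary is available; ts_lexize() is the standard function for testing a single dictionary on a single token.

```sql
-- ts_lexize(dictionary, token) returns the lexemes the dictionary produces:
--   NULL        -> the dictionary does not recognize the token
--   empty array -> the token is a stop word
--   otherwise   -> an array of normalized lexemes
SELECT ts_lexize('english_stem', 'somejnking');  -- inflected form of the made-up word
SELECT ts_lexize('english_stem', 'somejnk');     -- the made-up word itself
SELECT ts_lexize('english_stem', 'test');        -- a real word the stemmer leaves unchanged
SELECT ts_lexize('english_stem', 'the');         -- an English stop word
```

If the first three queries all return a non-NULL array, the dictionary has "handled" each token in the only sense it knows, so its output does not distinguish a real word from a made-up one.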