Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index
Hello. As I know, 'lhs % rhs' is equivalent to 'similarity(lhs, rhs) >= show_limit()'. So your query should look like this:

SELECT * FROM restaurants WHERE city % 'warsw';

And it should use the index.

On 03.06.2016 13:35, Greg Navis wrote:
> Hey!
>
> I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_ equivalent to
> `similarity(lhs, rhs) < show_limit()`. The difference that I noticed is that
> `%` uses a GIN index while `similarity` does not.
>
> ```
> grn=# \d restaurants
>        Table "public.restaurants"
>  Column |          Type          | Modifiers
> --------+------------------------+-----------
>  city   | character varying(255) | not null
> Indexes:
>     "restaurants_city_trgm_idx" gin (city gin_trgm_ops)
>
> grn=# SELECT COUNT(*) FROM restaurants;
>  count
> --------
>  515475
> (1 row)
>
> Time: 45.964 ms
>
> grn=# EXPLAIN ANALYZE SELECT * FROM restaurants
>       WHERE similarity(city, 'warsw') > show_limit();
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10) (actual time=16.436..665.062 rows=360 loops=1)
>    Filter: (similarity((city)::text, 'warsw'::text) > show_limit())
>    Rows Removed by Filter: 515115
>  Planning time: 0.139 ms
>  Execution time: 665.105 ms
> (5 rows)
>
> Time: 665.758 ms
> ```
>
> My question is: is it possible to make `similarity` use the index? If not,
> is there a way to speed up the query above?
>
> Best regards
> --
> Greg Navis
> I help tech companies to scale Heroku-hosted Rails apps.
> Free, biweekly scalability newsletter for SaaS CEOs <http://www.gregnavis.com/newsletter/>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
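For intuition about what `%` and `similarity()` compute here: pg_trgm's similarity is the number of trigrams two strings share, divided by the number of distinct trigrams in both. The following Python sketch approximates the algorithm for single words (the real C implementation also normalizes characters and handles multi-word strings, so treat this as an illustration, not a reimplementation):

```python
def trigrams(s):
    # pg_trgm pads a word with two blanks in front and one behind,
    # lowercases it, then takes every 3-character window.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    # Shared trigrams divided by the number of distinct trigrams overall.
    ta, tb = trigrams(a), trigrams(b)
    if not (ta or tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(similarity("warsaw", "warsaw"))           # 1.0 for identical strings
print(round(similarity("warsaw", "warsw"), 3))  # above the 0.3 default limit
```

This is why the misspelled 'warsw' still finds rows: it shares enough trigrams with 'warsaw' to clear the default 0.3 threshold.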
Re: [GENERAL] [pg_trgm] Per-query set_limit()
Unfortunately, there is no better way. You have to set it manually if you want a different limit.

On 03.06.2016 13:43, Greg Navis wrote:
> Hey!
>
> I'd like to call `set_limit` on a per-query basis, i.e. searching for
> restaurants in a given city should use 0.4 and searching for cinemas by name
> should use 0.2. I can accomplish this with:
>
> ```
> SELECT set_limit(0.2);
> SELECT * FROM cinemas WHERE name % ?;
> ```
>
> (similarly for restaurants). My question is: is there a better way?
>
> Best regards
> --
> Greg Navis
> I help tech companies to scale Heroku-hosted Rails apps.
> Free, biweekly scalability newsletter for SaaS CEOs <http://www.gregnavis.com/newsletter/>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index
Oh, I understand. It is because you want different limits for restaurants and cinemas? I see only one solution: a custom extension, which creates an operator class similar to gin_trgm_ops and depends on pg_trgm. In its gin_trgm_consistent() you can use your own limit variable. As far as I know, plain function calls do not use indexes. Of course, I may be wrong, and somebody may know a better solution.

On 03.06.2016 14:24, Greg Navis wrote:
> Artur, thanks for your reply. That's right, `%` does use the index. The goal
> of using `similarity(lhs, rhs) >= show_limit()` was to replace `show_limit()`
> with a custom, per-query limit. I noticed that the latter approach does _not_
> use the index, hence my question:
>
> ```
> grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE city % 'warsw';
>                                                               QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on restaurants  (cost=24.28..1319.36 rows=515 width=10) (actual time=96.081..96.456 rows=400 loops=1)
>    Recheck Cond: ((city)::text % 'warsw'::text)
>    Heap Blocks: exact=359
>    ->  Bitmap Index Scan on restaurants_city_gist_trgm_idx  (cost=0.00..24.15 rows=515 width=0) (actual time=96.030..96.030 rows=400 loops=1)
>          Index Cond: ((city)::text % 'warsw'::text)
>  Planning time: 0.211 ms
>  Execution time: 96.528 ms
> (7 rows)
>
> grn=# EXPLAIN ANALYZE SELECT * FROM restaurants
>       WHERE similarity(city, 'warsw') >= show_limit();
>                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------
>  Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10) (actual time=14.520..692.520 rows=400 loops=1)
>    Filter: (similarity((city)::text, 'warsw'::text) >= show_limit())
>    Rows Removed by Filter: 515075
>  Planning time: 0.109 ms
>  Execution time: 692.560 ms
> (5 rows)
> ```
>
> If this functionality isn't supported then it might be a good idea for a
> contribution.
>
> Best regards
>
> On Fri, Jun 3, 2016 at 12:51 PM, Artur Zakirov <a.zaki...@postgrespro.ru> wrote:
> > Hello. As I know, 'lhs % rhs' is equivalent to 'similarity(lhs, rhs) >=
> > show_limit()'. So your query should look like this:
> >
> > SELECT * FROM restaurants WHERE city % 'warsw';
> >
> > And it should use the index.
> > On 03.06.2016 13:35, Greg Navis wrote:
> > > [...]
>
> --
> Greg Navis
> I help tech companies to scale Heroku-hosted Rails apps.
> Free, biweekly scalability newsletter for SaaS CEOs <http://www.gregnavis.com/newsletter/>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index
On 08.06.2016 12:16, Greg Navis wrote:
> Would this be a better plan then:
>
> 1. Add support for trigram operators.
> 2. Implement `issimilar(lhs, rhs, threshold)`.
> 3. Add `issimilar` to the trigram operator classes.

I think Tom's proposal with a composite type is an excellent option. If I understand correctly, it introduces a new function similarity_rhs(). You could use it as follows:

SELECT * FROM restaurants WHERE city % similarity_rhs('warsw', 0.4);
SELECT * FROM cinemas WHERE name % similarity_rhs('warsw', 0.2);

Is this what you need? If I understand correctly, your plan differs from Tom's proposal, and I am afraid you would be doing wasted work.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index
2016-06-11 13:47 GMT+03:00 Greg Navis:
> I made some progress but I'm stuck. I'm focused on GiST for now. Please
> ignore sloppy naming for now.
>
> I made the following changes to pg_trgm--1.2.sql:
>
> CREATE TYPE pg_trgm_match AS (match TEXT, threshold REAL);
>
> CREATE OR REPLACE FUNCTION trgm_check_match(string TEXT, match pg_trgm_match)
> RETURNS bool AS $$
> BEGIN
>     RETURN match.match <-> string <= 1 - match.threshold;
> END;
> $$ LANGUAGE plpgsql;
>
> CREATE OPERATOR %% (leftarg = text, rightarg = pg_trgm_match,
>     procedure = trgm_check_match);
>
> ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
>     OPERATOR 9 %% (text, pg_trgm_match);

You can overload the existing % operator:

ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
    OPERATOR 9 % (text, pg_trgm_match);

> It does indeed make PostgreSQL complain about undefined strategy 9. I added
> the following define to trgm.h:
>
> #define ThresholdStrategyNumber 9
>
> It seems StrategyNumber is used in gtrgm_consistent and gtrgm_distance.
>
> In gtrgm_consistent, I need to change the way `nlimit` is obtained:
>
> nlimit = (strategy == SimilarityStrategyNumber) ?
>     similarity_threshold : word_similarity_threshold;
>
> I need to add a case for ThresholdStrategyNumber and extract `nlimit` from
> the argument of `pg_trgm_match`. I'm not sure what to do in `gtrgm_distance`.
>
> My questions:
>
> 1a. Is it possible to make `gtrgm_consistent` accept `text` or
> `pg_trgm_match` as the second argument?

I think you can change the definition of gtrgm_consistent() in the .sql file, in the CREATE FUNCTION and CREATE OPERATOR CLASS commands, to:

gtrgm_consistent(internal, anynonarray, smallint, oid, internal)

But I am not sure that anynonarray is a good choice here.

> 1b. What's the equivalent of `match.match` and `match.threshold` (where
> `match` is a `pg_trgm_match`) in C?

After changing the definition you can extract the values from the composite type in gtrgm_consistent().
I think the code at the beginning of the function may look like this:

if (strategy == SimilarityStrategyNumber ||
    strategy == WordSimilarityStrategyNumber)
{
    query = PG_GETARG_TEXT_P(1);
    nlimit = (strategy == SimilarityStrategyNumber) ?
        similarity_threshold : word_similarity_threshold;
}
else if (strategy == ThresholdStrategyNumber)
{
    HeapTupleHeader query_match = PG_GETARG_HEAPTUPLEHEADER(1);
    Oid             tupType = HeapTupleHeaderGetTypeId(query_match);
    int32           tupTypmod = HeapTupleHeaderGetTypMod(query_match);
    TupleDesc       tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
    HeapTupleData   tuple;
    bool            isnull;

    tuple.t_len = HeapTupleHeaderGetDatumLength(query_match);
    ItemPointerSetInvalid(&(tuple.t_self));
    tuple.t_tableOid = InvalidOid;
    tuple.t_data = query_match;

    query = DatumGetTextP(fastgetattr(&tuple, 1, tupdesc, &isnull));
    nlimit = DatumGetFloat4(fastgetattr(&tuple, 2, tupdesc, &isnull));

    ReleaseTupleDesc(tupdesc);
}
else
    query = PG_GETARG_TEXT_P(1);

After this code you should be able to execute a query that uses the index:

SELECT t, similarity(t, 'qwertyu0988') AS sml FROM test_trgm
    WHERE t % row('qwertyu0988', 0.6)::pg_trgm_match;

I got the query from the regression test. And of course the code needs to be checked for bugs.

> 2. What to do with `gtrgm_distance`?

You do not need to change gtrgm_distance(). It is used only in the ORDER BY clause to calculate distances, and calculating a distance does not need a threshold.

> Thanks for the help.
> --
> Greg Navis
> I help tech companies to scale Heroku-hosted Rails apps.
> Free, biweekly scalability newsletter for SaaS CEOs <http://www.gregnavis.com/newsletter/>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] FTS with more than one language in body and with unknown query language?
Hi,

On 14.07.2016 01:16, Stefan Keller wrote:
> Hi,
>
> I have a text corpus which contains either German or English docs and I
> expect queries where I don't know if it's German or English. So I'd like
> e.g. that a query "forest" matches "forest" in body_en but also "Wald" in
> body_de.
>
> I created a table with attributes body_en and body_de (type "text"). I will
> use ts_vector/ts_query on the fly (don't need an index yet).
>
> * Can FTS handle this multilingual situation?

In my opinion, PostgreSQL can't handle it. It can't translate words from one language to another; it just stems a word from its original form to a basic form. First you would need to translate the word from English to German, then search for the word in the body_de attribute. And the issue is complicated by the fact that a word can have a different meaning in the other language.

> * How to set up a text search configuration which e.g. stems en and de words?
> * Should I create a synonym dictionary which contains word translations
>   en-de instead of synonyms en-en?

This synonym dictionary would contain thousands of entries, so it would require a great effort to build.

> * Any hints to related work where FTS has been used in a multilingual context?
>
> :Stefan

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] FTS with more than one language in body and with unknown query language?
Hello, Stefan!

On 15.07.2016 01:54, Stefan Keller wrote:
> Hi, Artur!
>
> Thanks for your explanations.
>
> 2016-07-14 17:20 GMT+02:00 Artur Zakirov:
> > On 14.07.2016 01:16, Stefan Keller wrote:
> > > ...
> > > * Should I create a synonym dictionary which contains word translations
> > >   en-de instead of synonyms en-en?
> >
> > This synonym dictionary would contain thousands of entries, so it would
> > require a great effort to build.
>
> It's a domain-specific corpus of max. 1000 records of descriptive text
> (metadata) about geographic data, like topographic maps, land use planning,
> etc.

Oh, I understand. If you need word stemming you can use a thesaurus dictionary. You should do the following.

1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example of an entry:

wald : forest

2 - PostgreSQL already has the english_stem dictionary; in this solution you do not need a german_stem dictionary. Create the dictionary and the configuration in PostgreSQL:

=# CREATE TEXT SEARCH DICTIONARY geo_ths (
       Template = thesaurus,
       DictFile = geo,
       Dictionary = pg_catalog.english_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo_ths (COPY = 'simple');
=# ALTER TEXT SEARCH CONFIGURATION geo_ths
       ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                         word, hword, hword_part
       WITH geo_ths, english_stem;

For example, we have the following data:

=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');

Then you can execute the following queries:

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ to_tsquery('geo_ths', 'forest');
 id | body_en | body_de
----+---------+---------
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ to_tsquery('geo_ths', 'forest');
 id | body_en | body_de
----+---------+---------
  2 | (null)  | wald
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ to_tsquery('geo_ths', 'wald');
 id | body_en | body_de
----+---------+---------
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ to_tsquery('geo_ths', 'wald');
 id | body_en | body_de
----+---------+---------
  2 | (null)  | wald
(1 row)

This solution does not take into account the fact that queries in German can come in various forms (for example, plural forms); it works well for English. As a remedy, you can put the various forms of a word into geo.ths. But maybe the solution above is what you need. Otherwise it can serve as a concept for a solution.

> > > * How to set up a text search configuration which e.g. stems en and de
> > >   words?
>
> I still would like to give FTS a try with a synonym dictionary (en-de).
> Now I'm wondering how to set up the configuration. I've seen examples that
> process English, German or Russian alone, but I did not find any
> documentation on how to set up a text search configuration where a corpus
> contains two (or more) languages at the same time in a table (body_en and
> body_de).
>
> :Stefan

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
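The thesaurus trick can be pictured outside the database too. This Python sketch (with a hypothetical THESAURUS table mirroring the single geo.ths entry above, and none of PostgreSQL's actual stemming) shows why both 'wald' and 'forest' end up matching either column: document and query tokens are rewritten to the same target lexeme before comparison:

```python
# Hypothetical one-entry translation table, mirroring "wald : forest" in geo.ths.
THESAURUS = {"wald": "forest"}

def normalize(token):
    # The thesaurus dictionary rewrites a matching token to its target
    # lexeme; other tokens fall through (here unchanged; in PostgreSQL
    # they would continue on to english_stem).
    return THESAURUS.get(token.lower(), token.lower())

def matches(document_tokens, query_token):
    # Both sides of @@ pass through the same configuration, so the
    # comparison happens on normalized lexemes.
    return normalize(query_token) in {normalize(t) for t in document_tokens}

print(matches(["wald"], "forest"))   # German document found by an English query
print(matches(["forest"], "wald"))   # English document found by a German query
```

The point is that translation happens at indexing/parsing time on both sides, so no join between languages is needed at query time.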
Re: [GENERAL] FTS with more than one language in body and with unknown query language?
On 15.07.2016 21:34, Stefan Keller wrote:
> I actually expect that stemming takes place for English and German. And we
> will in fact have queries in English and in German as well. So I think we
> still have some issues to resolve...?

I did the following things:

- a patch for PostgreSQL: https://github.com/select-artur/postgres/tree/join_tsconfig
  It adds a new option for FTS dictionary mappings (JOIN). I want to propose
  this patch to -hackers.
- a dict_translate dictionary based on the dict_xsyn contrib:
  https://github.com/select-artur/dict_translate

These things were made for multilingual purposes and are interesting for us. Maybe they will be helpful for you too. Example:

1 - Create the files:

$SHAREDIR/tsearch_data/geo_en.trn:

forest wald forst holz

$SHAREDIR/tsearch_data/geo_de.trn:

wald forest wood
forst forest wood
holz forest wood

2 - Execute the queries:

=# CREATE TEXT SEARCH DICTIONARY geo_en (
       Template = translate,
       DictFile = geo_en,
       InputDict = pg_catalog.english_stem);
=# CREATE TEXT SEARCH DICTIONARY geo_de (
       Template = translate,
       DictFile = geo_de,
       InputDict = pg_catalog.german_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo (COPY = 'simple');
=# ALTER TEXT SEARCH CONFIGURATION geo
       ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                         word, hword, hword_part
       WITH geo_en (JOIN), english_stem (JOIN), geo_de (JOIN), german_stem (JOIN);
=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald'), (3, NULL, 'forst');

3 - Sample queries:

=# SELECT * FROM geo WHERE to_tsvector('geo', body_en) @@ to_tsquery('geo', 'forests');
 id | body_en | body_de
----+---------+---------
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo', body_de) @@ to_tsquery('geo', 'forests');
 id | body_en | body_de
----+---------+---------
  2 | (null)  | wald
  3 | (null)  | forst
(2 rows)

=# SELECT * FROM geo WHERE to_tsvector('geo', body_en) @@ to_tsquery('geo', 'walde');
 id | body_en | body_de
----+---------+---------
  1 | forest  | (null)
(1 row)

I will be glad of your comments.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] Updated RUM-index and support for bigint as part of index
Hello,

2016-08-02 21:08 GMT+03:00 Andreas Joseph Krogh:
> The ORDER BY part seems strange; it seems one has to find a value "lower
> than any other value" to use as a kind of base. Why is this necessary? It
> also seems that in order to be able to sort DESC one has to provide a
> timestamp value "higher than any other value"; is this correct?
>
> It would be great if the docs explained this.

We will write more detailed documentation for RUM.

> I really miss the opportunity to include a BIGINT as part of the index, so
> that the WHERE clause could be like this:
>
> WHERE del.fts_all @@ to_tsquery('simple', 'andreas&kr') AND del.folder_id IN (1,2,3)
>
> Having this would be perfect for my use-case searching in email in
> folders, sorted by received_date, and having it use ONE index.
>
> Will this be supported?

We have a plan to use generic types to be able to include bigint, timestamp and other types as part of the index. But I can't give a date for it.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] Updated RUM-index and support for bigint as part of index
=# SELECT id, received, received <=> '2013-01-01' AS rank
   FROM test
   WHERE fts @@ to_tsquery('simple', 'andreas')
   ORDER BY received <=> '2013-01-01' DESC
   LIMIT 8;
 id |        received         |     rank
----+-------------------------+--------------
  6 | 2015-09-29 00:26:56     |     86488016
  3 | 2015-08-18 03:49:02     |     82871342
  2 | 2015-08-18 03:07:55     |     82868875
  1 | 2015-08-17 23:53:26     |     82857206
  8 | 2015-05-09 08:39:14.128 | 74162354.128
  5 | 2012-11-20 15:41:04.936 |  3572335.064
  7 | 2012-11-21 14:16:26.448 |  3491013.552
  4 | 2012-12-03 14:14:05.488 |  2454354.512
(8 rows)

I hope this is what you want.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
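The rank values in this result set are just distances in seconds between received and the reference timestamp, which is easy to check by hand. A small sketch (assuming, as the output suggests, that RUM's <=> for timestamps yields the absolute difference in seconds):

```python
from datetime import datetime

def timestamp_distance(ts, ref):
    # Distance between two timestamps in seconds, matching the rank
    # column produced by "received <=> '2013-01-01'" above.
    return abs((ts - ref).total_seconds())

ref = datetime(2013, 1, 1)
# Row id 6: received 2015-09-29 00:26:56, rank 86488016
print(timestamp_distance(datetime(2015, 9, 29, 0, 26, 56), ref))  # 86488016.0
# Row id 5: received 2012-11-20 15:41:04.936 (before the reference),
# rank 3572335.064 -- the distance is symmetric around the reference point
print(timestamp_distance(datetime(2012, 11, 20, 15, 41, 4, 936000), ref))
```

Note that rows before and after 2013-01-01 are interleaved by distance, not by received itself, which is why a reference value below (or above) all stored values is needed to get a plain chronological order.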
Re: [GENERAL] FTS query, statistics and planner estimations…
Hello,

On 09.11.2016 12:22, Pierre Ducroquet wrote:
> Hello
>
> I recently stumbled on a slow query in my database that showed an odd
> behaviour related to the statistics of FTS queries. The query does a few
> joins "after" running a FTS query on a main table. The FTS query returns a
> few thousand rows, but the estimations are wrong, leading the optimizer to
> terrible plans compared to what should happen, and thus a far higher
> execution time. I managed to isolate the odd behaviour in a single query,
> and I would like your opinion about it. I have modified the table name,
> columns and query to hide sensitive values, but the issue remains the same.
> The table contains about 295,000 documents, and all is running under
> PostgreSQL 9.5.
>
> EXPLAIN ANALYZE SELECT COUNT(*) FROM documents
>     WHERE to_tsvector('french', subject || ' ' || body) @@ plainto_tsquery('XXX');
>
> Of course, there is an index on to_tsvector('french', subject || ' ' || body).

Did you try specifying the FTS configuration in the query:

WHERE to_tsvector('french', subject || ' ' || body) @@ plainto_tsquery('french', 'XXX');

plainto_tsquery() with a specified configuration should be faster, and plainto_tsquery(regconfig, text) is an immutable function.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] Incorrect information about GIN-index in RUM's docs
On 15.11.2016 11:12, Andreas Joseph Krogh wrote:
> On Tuesday, 15 November 2016 at 08:48:37, Oleg Bartunov wrote:
> > On Tue, Nov 15, 2016 at 10:37 AM, Andreas Joseph Krogh wrote:
> > > This item isn't valid, is it?
> > >
> > >   * There isn't phrase search with GIN index. This problem relates with
> > >     previous problem. It is need position information to perform phrase
> > >     search.
> > >
> > > RUM being an extension, and having index access methods as extensions
> > > being new in 9.6, where phrase search was introduced, this isn't really
> > > true, or am I missing something?
> >
> > RUM is very good for phrase search, since it has access to coordinates in
> > the index; check the slides from PGConf.EU
> > (http://www.sai.msu.su/~megera/postgres/talks/pgconfeu-fts-2016.pdf).
> > Where did you find the citation?
>
> https://github.com/postgrespro/rum#user-content-introduction
>
> I assume I read the item wrong, though. As it's phrased, it (for me at
> least, but I'm Norwegian) seems like phrase search is not possible with GIN,
> which it definitely is. So I assume the real issue that this item is trying
> to point out is that phrase search isn't really _optimized_ with GIN, just
> syntactic sugar, and with RUM it is optimized, not just syntactic sugar. Am
> I correct assuming this?

Yes. I suppose the original explanation about phrase search in GIN was confusing and ambiguous. GIN supports it, but it requires an additional bitmap heap scan and so it is slower. I fixed it. Thank you.

> --
> Andreas Joseph Krogh
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andr...@visena.com
> www.visena.com

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] Full text search tsv column aproach vs concat confusion
Hi,

On 16.11.2016 11:54, cen wrote:
> Hi
>
> I am seeking some clarification regarding full text search across multiple
> tables and what the best approach is. The documentation talks about two
> approaches to building a document: an on-the-fly concat of columns and a
> dedicated tsv column approach. Let's say I want to perform a search for
> "txt1 & txt2 & txt3" on columns table1.col1, table1.col2 and table2.col1.
> I see the following solutions:
>
> 1. Concat all three into a document and perform a FTS.
>
> SELECT * FROM (
>   SELECT to_tsvector(table1.col1) || to_tsvector(table1.col2) ||
>          to_tsvector(table2.col1) AS document
>   FROM table1
>   LEFT JOIN table2 ON table1.table2_id = table2.id
> ) subquery
> WHERE subquery.document @@ to_tsquery(unaccent(?));
>
> 2. Create a tsv column in each table, concat the tsv columns and perform
>    FTS on that.
>
> SELECT * FROM table1
> LEFT JOIN table2 ON table1.table2_id = table2.id
> WHERE table1.tsv || table2.tsv @@ to_tsquery(unaccent(?));
>
> 3. Have a tsv column only in table1 and insert table2.col1 into the tsv
>    via triggers. Works but seems very hacky.
>
> It seems to me that option #2 is fast and easy to implement, but I am not
> sure what the concat of tsvs really means from an index usage and
> performance standpoint. Option #1 is the most flexible, and I'd use it all
> the time if it were not THAT much slower than the tsv column approach. The
> documentation on tsv columns states: "Another advantage is that searches
> will be faster, since it will not be necessary to redo the to_tsvector
> calls to verify index matches."
>
> The question is: how much faster are tsv columns really? Are there any
> benchmarks about this? If the performance difference is negligible, I'd
> argue that using tsv columns is a waste of time and space in most general
> cases. But since there is no information on how much faster, it's hard to
> decide.

I don't have any such benchmarks. But if you have a real database, you can run tests on it for your solutions, because it depends on your task and what you need.

By the way, I suppose it is better to use the COALESCE() function if your columns can have NULL values:

SELECT * FROM (
  SELECT to_tsvector(coalesce(table1.col1, '')) ||
         to_tsvector(coalesce(table1.col2, '')) ||
         to_tsvector(coalesce(table2.col1, '')) AS document
  FROM table1
  LEFT JOIN table2 ON table1.table2_id = table2.id
) subquery
WHERE subquery.document @@ to_tsquery(unaccent(?));

And specifying a text search configuration makes queries a little faster:

... to_tsvector('english', coalesce(table1.col1, '')) ...

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
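The reason for the COALESCE() hint is that SQL concatenation propagates NULL: if any column is NULL, the whole concatenated document is NULL, and the row silently never matches. A small Python sketch of that rule (build_document is a hypothetical helper for illustration, not an FTS API):

```python
def build_document(*cols):
    # Naive concatenation behaves like SQL's ||: one NULL swallows everything.
    if any(c is None for c in cols):
        naive = None
    else:
        naive = " ".join(cols)
    # The COALESCE version substitutes '' for NULL and keeps the rest.
    safe = " ".join(c if c is not None else "" for c in cols).strip()
    return naive, safe

naive, safe = build_document("Pizzeria Roma", None, "Warsaw")
print(naive)  # None: the NULL column nullifies the naive document
print(safe)   # the coalesced document still contains the other columns
```

In the LEFT JOIN above this matters even more, since table2.col1 is NULL for every table1 row without a match.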
Re: [GENERAL] Install pgAudit extension
On 06.12.2016 08:44, Dylan Luong wrote:
> Hi
>
> I need some advice on installing the pgAudit extension as I am new to
> PostgreSQL extensions.

You can use PGXS. With PGXS you don't need Postgres's source code, only the source code of pgaudit. But you do need an installed PostgreSQL. Here is an example:

1 - unzip pgaudit into a pgaudit directory
2 - go to the pgaudit directory:

    cd pgaudit

3 - compile and install:

    make USE_PGXS=1
    make USE_PGXS=1 install

It is assumed that pg_config is in your $PATH variable. If it is not, you can run:

    make USE_PGXS=1 PG_CONFIG=/path/to/pg_config
    make USE_PGXS=1 PG_CONFIG=/path/to/pg_config install

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] Using a german affix file for compound words
On 28.01.2016 18:57, Oleg Bartunov wrote:
> On Thu, Jan 28, 2016 at 6:04 PM, Wolfgang Winkler wrote:
> > Hi!
> >
> > We have a problem with importing a compound dictionary file for German. I
> > downloaded the files here:
> >
> > http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
> >
> > and converted them to utf-8 with iconv. The affix file seems ok when
> > opened with an editor. When I try to create or alter a dictionary to use
> > this affix file, I get the following error:
> >
> > alter TEXT SEARCH DICTIONARY german_ispell (
> >     DictFile = german,
> >     AffFile = german,
> >     StopWords = german
> > );
> > ERROR:  syntax error
> > CONTEXT:  line 224 of configuration file "/usr/local/pgsql/share/tsearch_data/german.affix": " ABE > -ABE,äBIN "
> >
> > This is the first occurrence of an umlaut character in the file. I've
> > found a few postings where the same file is used, e.g.:
> >
> > http://www.postgresql.org/message-id/flat/556c1411.4010...@tbz-pariv.de#556c1411.4010...@tbz-pariv.de
> >
> > This user has been able to import the file. Am I missing something obvious?

What version of PostgreSQL do you use? I tested this dictionary on PostgreSQL 9.4.5. I downloaded the files from the link and executed the commands:

iconv -f ISO-8859-1 -t UTF-8 german.aff -o german2.affix
iconv -f ISO-8859-1 -t UTF-8 german.dict -o german2.dict

I renamed them to german.affix and german.dict and moved them to the tsearch_data directory. The commands then executed without errors:

-> create text search dictionary german_ispell (
       Template = ispell,
       DictFile = german,
       AffFile = german,
       Stopwords = german
   );
CREATE TEXT SEARCH DICTIONARY

-> select ts_lexize('german_ispell', 'test');
 ts_lexize
-----------
 {test}
(1 row)

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [GENERAL] Using a german affix file for compound words
On 28.01.2016 20:36, Wolfgang Winkler wrote: I'm using 9.4.5 as well and I used exactly the same iconv lines as you postes below. Are there any encoding options that have to be set right? The database encoding is set to UTF8. ww What output does the command show: -> SHOW LC_CTYPE; ? Did you try a dictionary from http://extensions.openoffice.org/en/project/german-de-de-frami-dictionaries ? You need extract from a downloaded archive de_DE_frami.aff and de_DE_frami.dic files, rename them and convert them to UTF-8. Am 2016-01-28 um 17:34 schrieb Artur Zakirov: On 28.01.2016 18:57, Oleg Bartunov wrote: On Thu, Jan 28, 2016 at 6:04 PM, Wolfgang Winkler mailto:wolfgang.wink...@digital-concepts.com>> wrote: Hi! We have a problem with importing a compound dictionary file for german. I downloaded the files here: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz and converted them to utf-8 with iconv. The affix file seems ok when opened with an editor. When I try to create or alter a dictionary to use this affix file, I get the following error: alter TEXT SEARCH DICTIONARY german_ispell ( DictFile = german, AffFile = german, StopWords = german ); ERROR: syntax error CONTEXT: line 224 of configuration file "/usr/local/pgsql/share/tsearch_data/german.affix": " ABE > -ABE,äBIN " This is the first occurrence of an umlaut character in the file. I've found a view postings where the same file is used, e.g.: http://www.postgresql.org/message-id/flat/556c1411.4010...@tbz-pariv.de#556c1411.4010...@tbz-pariv.de This users has been able to import the file. Am I missing something obvious? What version of PostgreSQL do you use? I tested this dictionary on PostgreSQL 9.4.5. Downloaded from the link files and executed commands: iconv -f ISO-8859-1 -t UTF-8 german.aff -o german2.affix iconv -f ISO-8859-1 -t UTF-8 german.dict -o german2.dict I renamed them to german.affix and german.dict and moved to the tsearch_data directory. 
The commands then executed without errors:

```
-> CREATE TEXT SEARCH DICTIONARY german_ispell (
       Template = ispell,
       DictFile = german,
       AffFile = german,
       Stopwords = german
   );
CREATE TEXT SEARCH DICTIONARY

-> SELECT ts_lexize('german_ispell', 'test');
 ts_lexize
-----------
 {test}
(1 row)
```

-- *Wolfgang Winkler* Managing Director wolfgang.wink...@digital-concepts.com mobile +43.699.19971172 dc:*büro* digital concepts Novak Winkler OG Software & Design Landstraße 68, 5th floor, 4020 Linz www.digital-concepts.com <http://www.digital-concepts.com> tel +43.732.997117.72 tel +43.699.1997117.2 Company register number: 192003h Commercial register court: Landesgericht Linz -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] commit time in logical decoding
Hello Weiping, it seems that this is a bug. Thank you for the report. I expect it will be fixed soon.

On 01.03.2016 17:36, Weiping Qu wrote: Dear postgresql general mailing list, I am currently using the logical decoding feature (version 9.6, I think, as far as I can tell from the source; wal_level: logical, max_replication_slots > 1, track_commit_timestamp: on, though I am not sure whether that last one matters). Following the online documentation, everything works fine until I run:

```
SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL, 'include-timestamp', 'on');
```

I always get 1999-12-31 16:00 as the commit time for arbitrary transactions with DML statements. After several tries, I realized that the txn->commit_time returned was always 0. Could you help me by indicating what could be wrong in my case? Any missing parameters? Thank you in advance. Kind regards, Weiping
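The report above can be reproduced with a short sketch like the following (table and slot names are placeholders; it assumes wal_level = logical, max_replication_slots > 0, and the test_decoding plugin that ships with PostgreSQL):

```
SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE demo (id int);
INSERT INTO demo VALUES (1);
-- 'include-timestamp' makes test_decoding print each transaction's commit time
SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL,
                                           'include-timestamp', 'on');
SELECT pg_drop_replication_slot('regression_slot');
```

In the bug described above, the timestamp printed by the peek is an epoch-like value rather than the actual commit time.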
Re: [GENERAL] Shrinking TSvectors
On 05.04.2016 14:37, Howard News wrote: Hi, does anyone have any pointers for shrinking tsvectors? I have looked at the contents of some of these fields and they contain many details that are not needed. For example...

```
'+1':935,942 '-0500':72 '-0578':932 '-0667':938 '-266':937 '-873':944 '-9972':945 '/partners/application.html':222 '/partners/program/program-agreement.pdf':271 '/partners/reseller.html':181,1073 '01756':50,1083 '07767':54,1087 '1':753,771 '12':366 '14':66 (...)
```

I am not interested in keeping the numbers or URLs in the indexes. Thanks, Howard.

Hello, you need to create a new text search configuration. Here is an example:

```
CREATE TEXT SEARCH CONFIGURATION public.english_cfg ( PARSER = default );

ALTER TEXT SEARCH CONFIGURATION public.english_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH pg_catalog.english_stem;
```

Instead of "pg_catalog.english_stem" you can use your own dictionary. Let's compare the new configuration with the built-in configuration "pg_catalog.english":

```
postgres=# SELECT to_tsvector('english_cfg', 'home -9972 /partners/application.html /partners/program/program-agreement.pdf');
 to_tsvector
-------------
 'home':1
(1 row)

postgres=# SELECT to_tsvector('english', 'home -9972 /partners/application.html /partners/program/program-agreement.pdf');
                                          to_tsvector
------------------------------------------------------------------------------------------------
 '-9972':2 '/partners/application.html':3 '/partners/program/program-agreement.pdf':4 'home':1
(1 row)
```

You can get additional information about configurations using \dF+:

```
postgres=# \dF+ english
Text search configuration "pg_catalog.english"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem
```
```
postgres=# \dF+ english_cfg
Text search configuration "public.english_cfg"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 hword           | english_stem
 hword_asciipart | english_stem
 hword_part      | english_stem
 word            | english_stem
```
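To put the leaner configuration to use, pass it explicitly to to_tsvector() when populating the column and building the index, or make it the default. A sketch with placeholder table and column names (not from the original thread):

```
-- rebuild the stored tsvectors with the new configuration
UPDATE documents SET tsv = to_tsvector('public.english_cfg', body);

-- the index expression must name the configuration explicitly to stay immutable
CREATE INDEX documents_tsv_idx
    ON documents USING GIN (to_tsvector('public.english_cfg', body));

-- or set it as the session/database default
SET default_text_search_config = 'public.english_cfg';
```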
Re: [GENERAL] How to build docs from master?
On 11.04.2016 15:27, Andreas Joseph Krogh wrote: Hi all. I have installed PG (master) from git, but it seems the docs are not built by "make install".

Yes. You need to run "make world" to build everything, or "make docs" to build only the docs. Likewise, run "make install-world" to install everything, or "make install-docs" to install only the docs.

I've tried "make docs" but it fails with:

```
Makefile:127: recipe for target 'HTML.index' failed
make[3]: *** [HTML.index] Error 1
make[3]: Leaving directory '/home/andreak/dev/postgresql/doc/src/sgml'
Makefile:8: recipe for target 'all' failed
make[2]: *** [all] Error 2
make[2]: Leaving directory '/home/andreak/dev/postgresql/doc/src'
Makefile:16: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/home/andreak/dev/postgresql/doc'
GNUmakefile:17: recipe for target 'docs' failed
make: *** [docs] Error 2
```

Any hints?

Do you also see the following error?

```
*** ERROR: `jade' is missing on your system.
```

Did you install the necessary tools? See http://www.postgresql.org/docs/devel/static/docguide-toolsets.html

Thanks. -- *Andreas Joseph Krogh* CTO / Partner - Visena AS Mobile: +47 909 56 963 andr...@visena.com <mailto:andr...@visena.com> www.visena.com <https://www.visena.com>
Re: [GENERAL] Question about ranking & full text
On 15.04.2016 15:47, Nicolas Paris wrote: Hi, I wonder why the third query returns 0. To me, it should return 0.1, because there is no 'baz' in the text. Thanks! (pg 9.4)

```
SELECT ts_rank_cd(apod.t, query, 4) AS rank
FROM (SELECT to_tsvector('foo baz') AS t) AS apod,
     to_tsquery('foo & baz') query
WHERE query @@ apod.t;
   rank
-----------
 0.10001
(1 row)

SELECT ts_rank_cd(apod.t, query, 4) AS rank
FROM (SELECT to_tsvector('foo ') AS t) AS apod,
     to_tsquery('foo & ! baz') query
WHERE query @@ apod.t;
   rank
-----------
 0.10001
(1 row)

SELECT ts_rank_cd(apod.t, query, 4) AS rank
FROM (SELECT to_tsvector('foo ') AS t) AS apod,
     to_tsquery(' ! baz') query
WHERE query @@ apod.t;
 rank
------
  0.0
(1 row)
```

Hello, I think this is the correct behaviour of the function, because it computes how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. In the third example the set of matched terms is empty, and the function cannot compute anything from empty terms, so it returns 0.0.
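A possible workaround (a sketch, not suggested in the original thread): rank only the positive terms and keep the negation in the WHERE clause, so a document matching '! baz' still gets a non-zero rank. With a pure-negation query such as '! baz' there is simply nothing to rank, so a constant would have to be used instead:

```
SELECT ts_rank_cd(apod.t, to_tsquery('foo'), 4) AS rank
FROM (SELECT to_tsvector('foo') AS t) AS apod
WHERE to_tsquery('foo & ! baz') @@ apod.t;
```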
Re: [GENERAL] Use full text to rank results higher if they are "closer hit"
On 14.02.2017 18:57, Artur Zakirov wrote: Hello, try the query:

```
SELECT s, ts_rank(vector, query) AS rank
FROM t, to_tsvector(s) vector, to_tsquery('hello') query
WHERE query @@ vector;
      s      |   rank
-------------+-----------
 hello       | 0.0607927
 hello world | 0.0303964
(2 rows)
```

Sorry, the query should be:

```
SELECT s, ts_rank(vector, query, 2) AS rank
FROM t, to_tsvector(s) vector, to_tsquery('hello') query
WHERE query @@ vector;
```
Re: [GENERAL] Use full text to rank results higher if they are "closer hit"
On 14.02.2017 18:35, Thomas Nyberg wrote: Here both 'hello' and 'hello world' are ranked equally highly when searching for 'hello'. What I'm wondering is: is there a way within Postgres to rank a plain 'hello' higher than 'hello world'? I.e. something that slightly down-weights extraneous terms? Of course, in general I don't know the query or the field strings ahead of time. Thanks for any help! Cheers, Thomas

Hello, try the query:

```
SELECT s, ts_rank(vector, query) AS rank
FROM t, to_tsvector(s) vector, to_tsquery('hello') query
WHERE query @@ vector;
      s      |   rank
-------------+-----------
 hello       | 0.0607927
 hello world | 0.0303964
(2 rows)
```

And read about normalization in [1].

[1] https://www.postgresql.org/docs/current/static/textsearch-controls.html
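For reference, the example above can be reproduced with a minimal table (names assumed, not given in the original message). Normalization flag 2 divides the rank by the document length, so the shorter document 'hello' ends up ranked higher than 'hello world':

```
CREATE TABLE t (s text);
INSERT INTO t VALUES ('hello'), ('hello world');

SELECT s, ts_rank(to_tsvector(s), to_tsquery('hello'), 2) AS rank
FROM t
WHERE to_tsquery('hello') @@ to_tsvector(s)
ORDER BY rank DESC;
```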
Re: [GENERAL] Full Text Search combined with Fuzzy
On 03.03.2017 15:49, Nicolas Paris wrote: Hi Oleg, Thanks. I thought pg_trgm was not able to index my long texts because of the limitation of 8191 bytes per index row for btree. Then I found out it is possible to use pg_trgm with a GIN/GiST index. My final use case is phrase mining in texts: I want my application to return texts that approximately contain the user's entry. E.g. the user searches for "Hello Word" and a text containing "blah blah blah hello world blah blah blah" would be returned. Test:

```
postgres=# CREATE TABLE test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN (texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                    | similarity
--------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)
```

Conclusion: if I set 0.4 as my threshold, would this methodology meet my requirements? Thanks for the help!

Hello, if you use PostgreSQL 9.6, then word_similarity() can help you [1]. For example:

```
postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm WHERE 'hello word' <% texts;
                   texts                    | word_similarity
--------------------------------------------+-----------------
 blah blah blah hello world blah blah blah |        0.818182
 blah blah blah hello word blah blah blah  |               1
(2 rows)
```

1.
https://www.postgresql.org/docs/9.6/static/pgtrgm.html
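Like % with show_limit(), the <% operator compares against a configurable threshold and can use the same trigram GIN index. A sketch (the GUC name is taken from the 9.6 pg_trgm documentation):

```
SET pg_trgm.word_similarity_threshold = 0.4;
EXPLAIN SELECT texts FROM test_trgm WHERE 'hello word' <% texts;
```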
Re: [GENERAL] Full Text Search combined with Fuzzy
On 03.03.2017 16:17, Nicolas Paris wrote: Nice! I do have version 9.6. Could this kind of index handle more than 20M large texts? The recheck condition looks resource-consuming.

You are right. I think pg_trgm will not be good for such large texts, unfortunately.

Is the full text index + phrase search + synonym dictionary the only other alternative for typo-tolerant phrase mining?

I suppose there are no other options now, although prefix search may help you [1].

Is there any possibility of adding typo tolerance to the full text search roadmap in the future?

As far as I know, there are no plans in the near future to add similarity-based full text search.

1. https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html
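The prefix search mentioned above is the :* modifier in tsquery; a minimal illustration:

```
SELECT to_tsvector('blah hello world') @@ to_tsquery('hel:*');
-- true: 'hel:*' matches any lexeme starting with 'hel', including 'hello'
```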