[GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-03 Thread Greg Navis
Hey!

I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_ equivalent
to `similarity(lhs, rhs) < show_limit()`. The difference that I noticed is
that `%` uses a GIN index while `similarity` does not.

```
grn=# \d restaurants
 Table "public.restaurants"
 Column |  Type  | Modifiers
++---
 city   | character varying(255) | not null
Indexes:
"restaurants_city_trgm_idx" gin (city gin_trgm_ops)

grn=# SELECT COUNT(*) FROM restaurants;
 count

 515475
(1 row)

Time: 45.964 ms
grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city,
'warsw') > show_limit();
 QUERY PLAN


 Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10)
(actual time=16.436..665.062 rows=360 loops=1)
   Filter: (similarity((city)::text, 'warsw'::text) > show_limit())
   Rows Removed by Filter: 515115
 Planning time: 0.139 ms
 Execution time: 665.105 ms
(5 rows)

Time: 665.758 ms
```

My question is: is it possible to make `similarity` use the index? If not,
is there a way to speed up the query above?

Best regards
-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


[GENERAL] [pg_trgm] Per-query set_limit()

2016-06-03 Thread Greg Navis
Hey!

I'd like to call `set_limit` on a per-query basis, i.e. searching for
restaurants in a given city should use 0.4 and searching for cinemas by
name should use 0.2. I can accomplish this with:

```
SELECT set_limit(0.2);
SELECT * FROM cinemas WHERE name % ?;
```

(similarly for restaurants).

My question is: is there a better way?

Best regards
-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-03 Thread Greg Navis
Artur, thanks for your reply. That's right, `%` does use the index. The
goal of using `similarity(lhs, rhs) >= show_limit()` was to replace
`show_limit()` with a custom, per-query limit. I noticed that the latter
approach does _not_ use the index, hence my question:

grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE city % 'warsw';
  QUERY
PLAN
--
 Bitmap Heap Scan on restaurants  (cost=24.28..1319.36 rows=515 width=10)
(actual time=96.081..96.456 rows=400 loops=1)
   Recheck Cond: ((city)::text % 'warsw'::text)
   Heap Blocks: exact=359
   ->  Bitmap Index Scan on restaurants_city_gist_trgm_idx
 (cost=0.00..24.15 rows=515 width=0) (actual time=96.030..96.030 rows=400
loops=1)
 Index Cond: ((city)::text % 'warsw'::text)
 Planning time: 0.211 ms
 Execution time: 96.528 ms
(7 rows)

grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city,
'warsw') >= show_limit();
 QUERY PLAN


 Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10)
(actual time=14.520..692.520 rows=400 loops=1)
   Filter: (similarity((city)::text, 'warsw'::text) >= show_limit())
   Rows Removed by Filter: 515075
 Planning time: 0.109 ms
 Execution time: 692.560 ms
(5 rows)

If this functionality isn't supported then it might be a good idea for a
contribution.

Best regards

On Fri, Jun 3, 2016 at 12:51 PM, Artur Zakirov 
wrote:

> Hello.
>
> As I know 'lhs % rhs' is equivalent to 'similarity(lhs, rhs) >=
> show_limit()'.
>
> And so your query should looks like this:
>
> SELECT * FROM restaurants WHERE city % 'warsw';
>
> And it should use index.
>
>
> On 03.06.2016 13:35, Greg Navis wrote:
>
>> Hey!
>>
>> I'm playing with pg_trgm. It seems that `lhs % rhs` is _almost_
>> equivalent to `similarity(lhs, rhs) < show_limit()`. The difference that
>> I noticed is that `%` uses a GIN index while `similarity` does not.
>>
>> ```
>> grn=# \d restaurants
>>  Table "public.restaurants"
>>  Column |  Type  | Modifiers
>> ++---
>>  city   | character varying(255) | not null
>> Indexes:
>> "restaurants_city_trgm_idx" gin (city gin_trgm_ops)
>>
>> grn=# SELECT COUNT(*) FROM restaurants;
>>  count
>> 
>>  515475
>> (1 row)
>>
>> Time: 45.964 ms
>> grn=# EXPLAIN ANALYZE SELECT * FROM restaurants WHERE similarity(city,
>> 'warsw') > show_limit();
>>  QUERY PLAN
>>
>>
>> 
>>  Seq Scan on restaurants  (cost=0.00..11692.81 rows=171825 width=10)
>> (actual time=16.436..665.062 rows=360 loops=1)
>>Filter: (similarity((city)::text, 'warsw'::text) > show_limit())
>>Rows Removed by Filter: 515115
>>  Planning time: 0.139 ms
>>  Execution time: 665.105 ms
>> (5 rows)
>>
>> Time: 665.758 ms
>> ```
>>
>> My question is: is it possible to make `similarity` use the index? If
>> not, is there a way to speed up the query above?
>>
>> Best regards
>> --
>> Greg Navis
>> I help tech companies to scale Heroku-hosted Rails apps.
>> Free, biweekly scalability newsletter for SaaS CEOs
>> <http://www.gregnavis.com/newsletter/>
>>
>>
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>



-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-03 Thread Greg Navis
Thanks for answers and sorry for not searching hard enough.

I'm curious ... would it be difficult to modify PostgreSQL so that it'd use
the index for `similarity(lhs, rhs) >= show_limit()` too? Or even add
`is_similar(lhs, rhs, threshold)` that'd allow to change the threshold on a
per-query basis. I might be able to block some time to contribute.

Best regards
-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-04 Thread Greg Navis
Thanks for your replies.

Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
which of course is completely equivalent to `lhs % rhs`, I wanted to write
`similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
with ternary operators is quite a bit of work. I might have a simpler idea:

pg_trgm also provides `<->` but it seems this operator doesn't use indexes
either. It seems the shortest path to per-query thresholds, without
compromising the design, is making this operator use the index. Please help
me understand whether my reasoning is correct. If it is, I'd appreciate a
high-level overview of what needs to be done. I can block a few hours to
work on this in the upcoming weeks.

Best regards
-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-08 Thread Greg Navis
Thanks for the replies.

On Sat, Jun 4, 2016 at 8:48 PM, Jeff Janes  wrote:

> On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis  wrote:
> > Thanks for your replies.
> >
> > Sorry for confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> > which of course is completely equivalent to `lhs % rhs`, I wanted to
> write
> > `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the approach
> > with ternary operators is quite a bit of work. I might have a simpler
> idea:
>
> The problem is that you would have to build your index on the
> expression (similarity(lhs,rhs)), and the then one of the lhs or rhs
> would have to be a column, and the other would have to be a constant
> string at the time you define the index, which is the same constant
> string as specified when you want to use the index.  This could work
> as long as there are only a few strings you ever care about checking
> similarity to, and you would build an index for each one.  These would
> be ordinary btree indexes, not gin indexes.   I've done this before,
> only using substring rather than similarity.


>
> > pg_trgm also provides `<->` but it seems this operator doesn't use
> indexes
> > either. It seems the shortest path to per-query thresholds, without
> > compromising the design, is making this operator use the index.
>
> To be efficient, the index has to know where to start and stop its
> scan.  Such a thing doesn't exists for "a<->b", there is no stopping
> point.  It would have to be "a <-> b < c", at which point you are back
> to ternary again.
>
Sorry, that was sloppy thinking on my end. I was fixated on looking for
something simple and saw things that aren't there. ;-) You're absolutely
right that we're back to ternary with `a <-> b < c`.

>
> > Please help
> > me understand whether my reasoning is correct. If it is, I'd appreciate a
> > high-level overview of what needs to be done. I can block a few hours to
> > work on this in the upcoming weeks.
>
> I think that you are proposing rewriting the indexing architecture
> from scratch.  Doing that in a way that is robust and tested and
> backward compatible would  probably take years, not hours, for a
> single individual who is not already extensively experienced in the
> internals of the code.  And, it is not really clear even what you want
> to rewrite it into.
>
No rewrites, please. ;-) I was just confused. I'm looking for a simple,
incremental contribution, not a rewrite.

>
> Tom's proposal is to do it in a way that works with the existing
> architecture, rather than trying to re-write that architecture.
> (Which could probably not be done in a few hours, either, especially
> not once you write documentation and upgrade scripts and all the other
> stuff that goes into production-quality code.)
>
Thanks for the info. I blocked a few hours for open source work every week
so whether it takes 10 or 40 hours shouldn't be a problem.

>
> I don't know if this would even be appropriate as an addition to
> pg_trgm.  We might want to fork that code instead.  That would be a
> shame, because the underlying c code would be the fundamentally the
> same, but the alternative would be to force people who like % and
> set_limit() to carry around the baggage of new operators and types
> they have no interest in using, and vice versa.  True, we did just add
> several new functions and operators to pg_trgm that many people will
> have no interest in, so maybe that is not a big deal.
>
> Cheers,
>
> Jeff
>

Would this be a better plan then:

1. Add support for trigram operators.
2. Implement `issimilar(lhs, rhs, threshold)`.
3. Add `issimilar` to the trigram operator classes.

PS Should we move this discussion to pgsql-hackers?
-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-09 Thread Greg Navis
Artur, no worries, I'm not writing any code ;-)

I did the following:

CREATE TYPE trgm_match AS (match TEXT, threshold NUMERIC);
CREATE OR REPLACE FUNCTION trgm_check_match (string TEXT, match trgm_match)
  RETURNS bool
  AS 'SELECT match.match <-> string <= 1 - match.threshold'
  LANGUAGE SQL;
CREATE OPERATOR %(leftarg = text, rightarg = trgm_match,
procedure=trgm_check_match);

This allows me to write:

SELECT ('Warsaw' % row('Warsw', 0.3)::trgm_match);

I'm not sure how to make this operator use an index. It seems I need to
create an operator class but I'm not sure how. This is how pg_trgm creates
its operator class:

-- create the operator class for gist
CREATE OPERATOR CLASS gist_trgm_ops
FOR TYPE text USING gist
AS
OPERATOR1   % (text, text),
FUNCTION1   gtrgm_consistent (internal, text, smallint,
oid, internal),
FUNCTION2   gtrgm_union (internal, internal),
FUNCTION3   gtrgm_compress (internal),
FUNCTION4   gtrgm_decompress (internal),
FUNCTION5   gtrgm_penalty (internal, internal,
internal),
FUNCTION6   gtrgm_picksplit (internal, internal),
FUNCTION7   gtrgm_same (gtrgm, gtrgm, internal),
STORAGE gtrgm;

Should my operator class mimic the one above?
-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-11 Thread Greg Navis
I made some progress but I'm stuck. I'm focused on GiST for now. Please
ignore sloppy naming for now.

I made the following changes to pg_trgm--1.2.sql:

CREATE TYPE pg_trgm_match AS (match TEXT, threshold REAL);

CREATE OR REPLACE FUNCTION trgm_check_match(string TEXT, match
pg_trgm_match) RETURNS bool AS $$
BEGIN
RETURN match.match <-> string <= 1 - match.threshold;
END;
$$ LANGUAGE plpgsql;

CREATE OPERATOR %%(leftarg = text, rightarg = pg_trgm_match,
procedure=trgm_check_match);

ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
   OPERATOR9   %% (text,
pg_trgm_match);

It does indeed make PostgreSQL complain about undefined strategy 9. I added
the following define to trgm.h:

#define ThresholdStrategyNumber 9

It seems StrategyNumber is used in gtrgm_consistent and gtrgm_distance.

In gtrgm_consistent, I need change the way `nlimit` is obtained:

nlimit = (strategy == SimilarityStrategyNumber) ?
similarity_threshold : word_similarity_threshold;

I need to add a case for ThresholdStrategyNumber and extract `nlimit` from
the argument of `pg_trgm_match`. I'm not sure what to do in
`gtrgm_distance`.

My questions:

1a. Is it possible to make `gtrgm_consistent` accept `text` or
`pg_trgm_match` as the second argument?
1b. What's the equivalent of `match.match` and `match.threshold` (where
`match` is a `pg_trgm_match`) in C?
2. What to do with `gtrgm_distance`?

Thanks for help.
-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>


Re: [GENERAL] [pg_trgm] Making similarity(?, ?) < ? use an index

2016-06-20 Thread Greg Navis
Artur, thanks for help. I managed to add the new strategy to the index.
Hurray! I also discovered a bug in the process that I reported via the form.

I still have a few questions:

1. Naming - pg_trgm_match, match, threshold, trgm_check_match,
ThresholdStrategyNumber - are these good names?
2. I made trgm_check_match IMMUTABLE. Are there any other modifies that
should be there?
3. I defined % (text, pg_trgm_match) but didn't provide a commutator and
other helper procedures. Which of them should I implement?
4. Can I obtain query and nlimit with less code?
5. The attached patch replaced "res = (*(int *) &tmpsml == *(int *) &nlimit
|| tmpsml > nlimit);" with "res = (tmpsml >= nlimit);" to fix the bug on my
machine. I'm not sure whether that's the long-term fix we want to have.
It's just there to help me make progress with trigrams.

Thanks for help.

Cheers
Greg
ᐧ
diff --git a/contrib/pg_trgm/pg_trgm--1.3.sql b/contrib/pg_trgm/pg_trgm--1.3.sql
index b279f7d..faa1fce 100644
--- a/contrib/pg_trgm/pg_trgm--1.3.sql
+++ b/contrib/pg_trgm/pg_trgm--1.3.sql
@@ -3,6 +3,8 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
 
+CREATE TYPE pg_trgm_match AS (match TEXT, threshold REAL);
+
 -- Deprecated function
 CREATE FUNCTION set_limit(float4)
 RETURNS float4
@@ -108,6 +110,18 @@ CREATE OPERATOR <->> (
 COMMUTATOR = '<<->'
 );
 
+CREATE OR REPLACE FUNCTION trgm_check_match(string TEXT, match pg_trgm_match) RETURNS bool AS $$
+BEGIN
+RETURN similarity(match.match, string) >= match.threshold;
+END;
+$$ LANGUAGE plpgsql IMMUTABLE;
+
+CREATE OPERATOR %(
+	leftarg = text,
+	rightarg = pg_trgm_match,
+	procedure = trgm_check_match
+);
+
 -- gist key
 CREATE FUNCTION gtrgm_in(cstring)
 RETURNS gtrgm
@@ -126,7 +140,7 @@ CREATE TYPE gtrgm (
 );
 
 -- support functions for gist
-CREATE FUNCTION gtrgm_consistent(internal,text,smallint,oid,internal)
+CREATE FUNCTION gtrgm_consistent(internal,anynonarray,smallint,oid,internal)
 RETURNS bool
 AS 'MODULE_PATHNAME'
 LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
@@ -171,7 +185,7 @@ CREATE OPERATOR CLASS gist_trgm_ops
 FOR TYPE text USING gist
 AS
 OPERATOR1   % (text, text),
-FUNCTION1   gtrgm_consistent (internal, text, smallint, oid, internal),
+FUNCTION1   gtrgm_consistent (internal, anynonarray, smallint, oid, internal),
 FUNCTION2   gtrgm_union (internal, internal),
 FUNCTION3   gtrgm_compress (internal),
 FUNCTION4   gtrgm_decompress (internal),
@@ -252,3 +266,6 @@ LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
 ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
 OPERATOR7   %> (text, text),
 FUNCTION6  (text,text) gin_trgm_triconsistent (internal, int2, text, int4, internal, internal, internal);
+
+ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
+OPERATOR9   % (text, pg_trgm_match);
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index 8cd88e7..f2b6008 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -34,6 +34,7 @@
 #define RegExpICaseStrategyNumber		6
 #define WordSimilarityStrategyNumber	7
 #define WordDistanceStrategyNumber		8
+#define ThresholdStrategyNumber			9
 
 typedef char trgm[3];
 
diff --git a/contrib/pg_trgm/trgm_gist.c b/contrib/pg_trgm/trgm_gist.c
index 3a5aff9..3884b13 100644
--- a/contrib/pg_trgm/trgm_gist.c
+++ b/contrib/pg_trgm/trgm_gist.c
@@ -5,7 +5,10 @@
 
 #include "trgm.h"
 
+#include "access/htup.h"
+#include "access/htup_details.h"
 #include "access/stratnum.h"
+#include "utils/typcache.h"
 #include "fmgr.h"
 
 
@@ -181,7 +184,7 @@ Datum
 gtrgm_consistent(PG_FUNCTION_ARGS)
 {
 	GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
-	text	   *query = PG_GETARG_TEXT_P(1);
+	text	   *query;
 	StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
 
 	/* Oid		subtype = PG_GETARG_OID(3); */
@@ -189,10 +192,43 @@ gtrgm_consistent(PG_FUNCTION_ARGS)
 	TRGM	   *key = (TRGM *) DatumGetPointer(entry->key);
 	TRGM	   *qtrg;
 	bool		res;
-	Size		querysize = VARSIZE(query);
+	Size		querysize;
 	gtrgm_consistent_cache *cache;
 	double		nlimit;
 
+	HeapTupleHeader	query_match;
+	OidtupType;
+	int32			tupTypmod;
+	TupleDesc		tupdesc;
+	HeapTupleData	tuple;
+	bool			isnull;
+
+	if (strategy == ThresholdStrategyNumber)
+	{
+		query_match = PG_GETARG_HEAPTUPLEHEADER(1);
+		tupType = HeapTupleHeaderGetTypeId(query_match);
+		tupTypmod = HeapTupleHeaderGetTypMod(query_match);
+		tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
+
+		tuple.t_len = HeapTupleHeaderGetDatumLength(query_match);
+		ItemPointerSetInvalid(&(tuple.t_self));
+		tuple.t_tableOid = InvalidOid;
+		tuple.t_data = query_match;
+
+		query = DatumGetTextP(fastgetattr(&tuple, 1, tupdesc, &isnull));
+		querysize = VARSIZE(query);
+		nlimit = DatumGetFloat4(fastgetattr(&tuple, 2, tupdesc, &

Re: [GENERAL] Generating sample data

2016-12-27 Thread Greg Navis
In the Ruby land there's a gem called faker
 that allows you to generate fake data.
However, I'm not sure it can generate data based on a schema so a little
bit of scripting my be necessary. Would this approach work for you?

Yours
Greg