Re: [GENERAL] grep -f keyword data query

Hiroyuki Sato Mon, 04 Jan 2016 00:46:18 -0800

Hello Arjen

Thank you for replying.


2016年1月4日(月) 16:49 Arjen Nienhuis <a.g.nienh...@gmail.com>:

>
> On Dec 28, 2015 00:55, "Hiroyuki Sato" <hiroys...@gmail.com> wrote:
> >
> > Hello Andreas and Tom
> >
> > Thank you for replying.
> >
> > Sorry, I re-created my questions. I was mis-pasted query log on previous
> question.
> > (@~ operator is PGroonga extension (http://pgroonga.github.io))
> > Please ignore it.
> >
> > Best regards.
> >
> > 1, Problem.
> >   (1) Following query is exteme slow. (478sec)
> >     SELECT
> >       u.url
> >     FROM
> >       url_lists4 u,
> >       keywords4 k
> >     WHERE
> >       u.url like k.url
> >     AND
> >       k.name = 'esc_url';
> >
> >
> >   (2) grep -f kwd.txt sample.txt (exec time under 1sec)
> >
>
> These are not the same 'query'. Grep will match any of the patterns for
> each url while postgres finds all matching combinations. You need to use
> '... OR ... OR ...' or 'ANY()' in postgres to get the same result.
>
I thought the following query are same meaning.
Could you point me same examples about two differences?

  (1) u.url like k.url
  (2) u.url like 'k.url 1', or u.url like 'k.url2' ...

> 2, Questions
> >
> >   (1) Is it possible to improve this query like the command ``grep -f
> keyword data``?
>
> I get the best results by using OR of all the different patterns:
>
> SELECT url FROM url_lists4
> WHERE
>     url LIKE 'http://ak.yahoo.co.jp/xwv/%'
>     OR url LIKE 'http://ao.yahoo.co.jp/wdl/%'
>     OR ...
>

I'll try it.

BTW Do you know how many OR can I use ?
I have 5000 URLs.

In theory you could use:
>
> ... WHERE url LIKE ANY(ARRAY(SELECT url FROM keywords4 k WHERE k.name =
> 'esc_url'));
>
> but that's very slow.
>
It's very interesting. It seems same query.
I'll check EXPLAIN



> >   (2) What kind of Index should I create on url_lists table?
>
> Both btree text_pattern_ops and gin trigram on the URL.
>
> >
> > 3, Environment
> >   OS: CentOS7
> >   PostgreSQL 9.4
> >
> > 4, sample source
> >   https://github.com/hiroyuki-sato/postgres_like_test
> >
> >
> > 5, Create table
> >
> >     drop table if exists url_lists4;
> >     create table url_lists4 (
> >       id int not null primary key,
> >       url text not null
> >     );
> >     create index ix_url_url_lists4 on url_lists4(url);
> >
> >     drop table if exists keywords4;
> >     create table keywords4 (
> >       id int not null primary key,
> >       name varchar(40) not null,
> >       url text not null
> >     );
> >
> >     create index ix_url_keywords4 on keywords4(url);
> >     create index ix_name_keywords4 on keywords4(name);
> >
> >
> >     \copy url_lists4(id,url) from 'sample.txt' with delimiter ',';
> >     \copy keywords4(id,name,url) from 'keyword.txt' with delimiter ',';
> >
> >     vacuum url_lists4;
> >     vacuum keywords4;
> >     analyze url_lists4;
> >     analyze keywords4;
> >
> > 6, Query
> >
> >     EXPLAIN SELECT
> >       u.url
> >     FROM
> >       url_lists4 u,
> >       keywords4 k
> >     WHERE
> >       u.url like k.url
> >     AND
> >       k.name = 'esc_url';
> >
> >     EXPLAIN ANALYZE SELECT
> >       u.url
> >     FROM
> >       url_lists4 u,
> >       keywords4 k
> >     WHERE
> >       u.url like k.url
> >     AND
> >       k.name = 'esc_url';
> >
> >
> >     SELECT
> >       u.url
> >     FROM
> >       url_lists4 u,
> >       keywords4 k
> >     WHERE
> >       u.url like k.url
> >     AND
> >       k.name = 'esc_url';
> >
> > 7, EXPLAIN
> >
> >                                      QUERY PLAN
>
> >
> -----------------------------------------------------------------------------
> >      Nested Loop  (cost=0.00..37510799.00 rows=12500000 width=57)
> >        Join Filter: (u.url ~~ k.url)
> >        ->  Seq Scan on url_lists4 u  (cost=0.00..10682.00 rows=500000
> width=57)
> >        ->  Materialize  (cost=0.00..129.50 rows=5000 width=28)
> >              ->  Seq Scan on keywords4 k  (cost=0.00..104.50 rows=5000
> width=28)
> >                    Filter: ((name)::text = 'esc_url'::text)
> >     (6 rows)
> >
> > 8, EXPLAIN ANALYZE
> >                                                              QUERY PLAN
>
> >
> -----------------------------------------------------------------------------------------------------------------------------
> >      Nested Loop  (cost=0.00..37510799.00 rows=12500000 width=57)
> (actual time=6011.642..478011.117 rows=4850 loops=1)
> >        Join Filter: (u.url ~~ k.url)
> >        Rows Removed by Join Filter: 2499995150
> >        ->  Seq Scan on url_lists4 u  (cost=0.00..10682.00 rows=500000
> width=57) (actual time=0.034..192.646 rows=500000 loops=1)
> >        ->  Materialize  (cost=0.00..129.50 rows=5000 width=28) (actual
> time=0.000..0.261 rows=5000 loops=500000)
> >              ->  Seq Scan on keywords4 k  (cost=0.00..104.50 rows=5000
> width=28) (actual time=0.021..1.705 rows=5000 loops=1)
> >                    Filter: ((name)::text = 'esc_url'::text)
> >      Planning time: 0.061 ms
> >      Execution time: 478011.773 ms
> >     (9 rows)
> >
> >
> > 2015年12月28日(月) 3:39 Tom Lane <t...@sss.pgh.pa.us>:
> >>
> >> Andreas Kretschmer <andr...@a-kretschmer.de> writes:
> >> >> Tom Lane <t...@sss.pgh.pa.us> hat am 27. Dezember 2015 um 19:11
> geschrieben:
> >> >> What in the world is this @~ operator?  And what sort of index are
> >> >> you using now, that can accept it?  Are the rowcount estimates in
> >> >> the EXPLAIN output accurate?  (If they are, it's hardly surprising
> >> >> that the query takes a long time.)
> >>
> >> > in a privat mail he called an other operator: ~~. I think, the @~ is
> an error.
> >>
> >> Well, ~~ isn't directly indexable by btree indexes either, so there's
> >> still something wrong with either the EXPLAIN output or the claimed
> >> index definitions.
> >>
> >>                         regards, tom lane
>

Best regards.

Re: [GENERAL] grep -f keyword data query

Reply via email to