Re: Counting the number of repeated phrases in a column

2022-01-27 Thread Merlin Moncure
On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure  wrote:
>
> with s as (select 'Hello World Hello World' as sentence)
> select
>   phrase,
>   array_upper(string_to_array((select sentence from s), phrase), 1) -
> 1 as occurrances
> from
> (
>   select array_to_string(x, ' ') as phrase
>   from
>   (
> select distinct v[a:b]  x
> from regexp_split_to_array((select sentence from s), ' ') v
> cross join lateral generate_series(1, array_upper(v, 1)) a
> cross join lateral generate_series(a + 1, array_upper(v, 1)) b
>   ) q
> ) q;

Simplified to:
select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from regexp_split_to_array('Hello World Hello World', ' ') v
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1;

 phrase  │ occurances
─┼
 World Hello │  1
 Hello World Hello   │  1
 Hello World │  2
 Hello World Hello World │  1
 World Hello World   │  1

merlin




Re: Counting the number of repeated phrases in a column

2022-01-27 Thread Rob Sargent

On 1/27/22 10:03, Merlin Moncure wrote:

On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure  wrote:

with s as (select 'Hello World Hello World' as sentence)
select
   phrase,
   array_upper(string_to_array((select sentence from s), phrase), 1) -
1 as occurrances
from
(
   select array_to_string(x, ' ') as phrase
   from
   (
 select distinct v[a:b]  x
 from regexp_split_to_array((select sentence from s), ' ') v
 cross join lateral generate_series(1, array_upper(v, 1)) a
 cross join lateral generate_series(a + 1, array_upper(v, 1)) b
   ) q
) q;

Simplified to:
select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from regexp_split_to_array('Hello World Hello World', ' ') v
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1;

  phrase  │ occurances
─┼
  World Hello │  1
  Hello World Hello   │  1
  Hello World │  2
  Hello World Hello World │  1
  World Hello World   │  1

merlin



And since we're looking for repeated phrases maybe add

   having count(*) > 1



Re: Counting the number of repeated phrases in a column

2022-01-27 Thread Merlin Moncure
On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent  wrote:
>
> On 1/27/22 10:03, Merlin Moncure wrote:
>
> On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure  wrote:
>
> with s as (select 'Hello World Hello World' as sentence)
> select
>   phrase,
>   array_upper(string_to_array((select sentence from s), phrase), 1) -
> 1 as occurrances
> from
> (
>   select array_to_string(x, ' ') as phrase
>   from
>   (
> select distinct v[a:b]  x
> from regexp_split_to_array((select sentence from s), ' ') v
> cross join lateral generate_series(1, array_upper(v, 1)) a
> cross join lateral generate_series(a + 1, array_upper(v, 1)) b
>   ) q
> ) q;
>
> Simplified to:
> select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
> from regexp_split_to_array('Hello World Hello World', ' ') v
> cross join lateral generate_series(1, array_upper(v, 1)) a
> cross join lateral generate_series(a + 1, array_upper(v, 1)) b
> group by 1;
>
>  phrase  │ occurances
> ─┼
>  World Hello │  1
>  Hello World Hello   │  1
>  Hello World │  2
>  Hello World Hello World │  1
>  World Hello World   │  1
>
> merlin
>
>
> And since we're looking for repeated phrases maybe add
>
> having count(*) > 1

thanks.  also, testing on actual data, I noticed that a couple other
things are mandatory, mainly doing a bit of massaging before
tokenizing:

select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from
(
  select array_agg(t) v
  from
  (
select trim(replace(unnest(v), E'\n', '')) t
from regexp_split_to_array(, ' ') v
  ) q
  where length(t) > 1
) q
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1
having count(*) > 1;

We are definitely in N^2 space here, so look for things to start
breaking down for sentences > 1000 words.

merlin




Re: Counting the number of repeated phrases in a column

2022-01-27 Thread benj . dev



Le 27/01/2022 à 18:35, Merlin Moncure a écrit :

On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent  wrote:


On 1/27/22 10:03, Merlin Moncure wrote:

On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure  wrote:

with s as (select 'Hello World Hello World' as sentence)
select
   phrase,
   array_upper(string_to_array((select sentence from s), phrase), 1) -
1 as occurrances
from
(
   select array_to_string(x, ' ') as phrase
   from
   (
 select distinct v[a:b]  x
 from regexp_split_to_array((select sentence from s), ' ') v
 cross join lateral generate_series(1, array_upper(v, 1)) a
 cross join lateral generate_series(a + 1, array_upper(v, 1)) b
   ) q
) q;

Simplified to:
select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from regexp_split_to_array('Hello World Hello World', ' ') v
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1;

  phrase  │ occurances
─┼
  World Hello │  1
  Hello World Hello   │  1
  Hello World │  2
  Hello World Hello World │  1
  World Hello World   │  1

merlin


And since we're looking for repeated phrases maybe add

having count(*) > 1


thanks.  also, testing on actual data, I noticed that a couple other
things are mandatory, mainly doing a bit of massaging before
tokenizing:

select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
from
(
   select array_agg(t) v
   from
   (
 select trim(replace(unnest(v), E'\n', '')) t
 from regexp_split_to_array(, ' ') v
   ) q
   where length(t) > 1
) q
cross join lateral generate_series(1, array_upper(v, 1)) a
cross join lateral generate_series(a + 1, array_upper(v, 1)) b
group by 1
having count(*) > 1;

We are definitely in N^2 space here, so look for things to start
breaking down for sentences > 1000 words.

merlin



(for better complexity) you may search about "Ukkonen suffix tree"
Similar problem as yours : 
https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp





OpenPGP_signature
Description: OpenPGP digital signature


Re: could not open relation with OID

2022-01-27 Thread Ben Chobot

Michael Paquier wrote on 1/26/22 9:14 PM:

On Wed, Jan 26, 2022 at 05:30:01PM -0800, Ben Chobot wrote:
Other things we've considered:

     - we run pg_repack, which certainly seems like it could make an error
like this, but we see this error in places and times that pg_repack isn't
currently running

It could also take time for the issue to show up, depending on the
state of the relcache.


So tell me more about stale relcaches? It turns out I was totally 
wrong and this is being driven by pg_repack. I can even make it happen 
pretty easily:


1. Put some data in a table with a single btree index on a primary db.
2. Set up streaming replication to a secondary db.
3. In a loop on the primary, have pg_repack repack the indices of that 
table. (the -x flag)
4. In a loop on the secondary, have psql query the secondary db for an 
indexed value of that table.


When I do this with replication, I can get the OID error consistently 
within 30 minutes. Without replication, I've been unable to get it to 
happen after 2 hours.


Given that this fails much faster on the secondary than the primary 
(where it has yet to fail at all) I'm leaning towards a postgres bug, 
but I'm happy to do more research to point the blame at something 
pg_repack is doing, if you could point me at a thing to research.


Re: Counting the number of repeated phrases in a column

2022-01-27 Thread Merlin Moncure
On Thu, Jan 27, 2022 at 11:56 AM  wrote:
> Le 27/01/2022 à 18:35, Merlin Moncure a écrit :
> > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
> > from
> > (
> >select array_agg(t) v
> >from
> >(
> >  select trim(replace(unnest(v), E'\n', '')) t
> >  from regexp_split_to_array(, ' ') v
> >) q
> >where length(t) > 1
> > ) q
> > cross join lateral generate_series(1, array_upper(v, 1)) a
> > cross join lateral generate_series(a + 1, array_upper(v, 1)) b
> > group by 1
> > having count(*) > 1;
> >
> > We are definitely in N^2 space here, so look for things to start
> > breaking down for sentences > 1000 words.
> >
> > merlin
> >
>
> (for better complexity) you may search about "Ukkonen suffix tree"
> Similar problem as yours :
> https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp

Yep.  Many problems like this are well solved in imperative languages
and will fit poorly into SQL quase-functional space.  That
implementation could probably be converted to pl/pgsql pretty easily,
or a 'sql + tables' variant as a fun challenge.  It also slightly
exploits the fact that only the most repeated needle is returned,
rather than all of them.

Having the need to have single statement stateless SQL solutions to
interesting problems comes up all the time in common development
practice though for simplicity's sake even if there are better
approaches out there.  It's also fun.

merlin




Re: Undetected Deadlock

2022-01-27 Thread Michael Harris
> but I would expect drop concurrently to resolve your issue with the two 
> processes conflicting

I guess you mean ALTER TABLE DETACH PARTITION ... CONCURRENTLY?
DROP TABLE does not seem to have a concurrent option.
Still that does seem like a good option to try, thanks for drawing it
to my attention.

Cheers
Mike

On Thu, 27 Jan 2022 at 14:20, Michael Lewis  wrote:
>
> There may be a bug so perhaps still pursue reproducing the issue, but I would 
> expect drop concurrently to resolve your issue with the two processes 
> conflicting. Also, perhaps trying"insert, on conflict do update" could be 
> more efficient than the copy but obviously there are too many unknowns and 
> variables for either to be a definite winner.