Re: Counting the number of repeated phrases in a column
On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure wrote: > > with s as (select 'Hello World Hello World' as sentence) > select > phrase, > array_upper(string_to_array((select sentence from s), phrase), 1) - > 1 as occurrances > from > ( > select array_to_string(x, ' ') as phrase > from > ( > select distinct v[a:b] x > from regexp_split_to_array((select sentence from s), ' ') v > cross join lateral generate_series(1, array_upper(v, 1)) a > cross join lateral generate_series(a + 1, array_upper(v, 1)) b > ) q > ) q; Simplified to: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from regexp_split_to_array('Hello World Hello World', ' ') v cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1; phrase │ occurances ─┼ World Hello │ 1 Hello World Hello │ 1 Hello World │ 2 Hello World Hello World │ 1 World Hello World │ 1 merlin
Re: Counting the number of repeated phrases in a column
On 1/27/22 10:03, Merlin Moncure wrote: On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure wrote: with s as (select 'Hello World Hello World' as sentence) select phrase, array_upper(string_to_array((select sentence from s), phrase), 1) - 1 as occurrances from ( select array_to_string(x, ' ') as phrase from ( select distinct v[a:b] x from regexp_split_to_array((select sentence from s), ' ') v cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b ) q ) q; Simplified to: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from regexp_split_to_array('Hello World Hello World', ' ') v cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1; phrase │ occurances ─┼ World Hello │ 1 Hello World Hello │ 1 Hello World │ 2 Hello World Hello World │ 1 World Hello World │ 1 merlin And since we're looking for repeated phrases maybe add having count(*) > 1
Re: Counting the number of repeated phrases in a column
On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent wrote: > > On 1/27/22 10:03, Merlin Moncure wrote: > > On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure wrote: > > with s as (select 'Hello World Hello World' as sentence) > select > phrase, > array_upper(string_to_array((select sentence from s), phrase), 1) - > 1 as occurrances > from > ( > select array_to_string(x, ' ') as phrase > from > ( > select distinct v[a:b] x > from regexp_split_to_array((select sentence from s), ' ') v > cross join lateral generate_series(1, array_upper(v, 1)) a > cross join lateral generate_series(a + 1, array_upper(v, 1)) b > ) q > ) q; > > Simplified to: > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences > from regexp_split_to_array('Hello World Hello World', ' ') v > cross join lateral generate_series(1, array_upper(v, 1)) a > cross join lateral generate_series(a + 1, array_upper(v, 1)) b > group by 1; > > phrase │ occurances > ─┼ > World Hello │ 1 > Hello World Hello │ 1 > Hello World │ 2 > Hello World Hello World │ 1 > World Hello World │ 1 > > merlin > > > And since we're looking for repeated phrases maybe add > > having count(*) > 1 thanks. also, testing on actual data, I noticed that a couple other things are mandatory, mainly doing a bit of massaging before tokenizing: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from ( select array_agg(t) v from ( select trim(replace(unnest(v), E'\n', '')) t from regexp_split_to_array(, ' ') v ) q where length(t) > 1 ) q cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1 having count(*) > 1; We are definitely in N^2 space here, so look for things to start breaking down for sentences > 1000 words. merlin
Re: Counting the number of repeated phrases in a column
Le 27/01/2022 à 18:35, Merlin Moncure a écrit : On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent wrote: On 1/27/22 10:03, Merlin Moncure wrote: On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure wrote: with s as (select 'Hello World Hello World' as sentence) select phrase, array_upper(string_to_array((select sentence from s), phrase), 1) - 1 as occurrances from ( select array_to_string(x, ' ') as phrase from ( select distinct v[a:b] x from regexp_split_to_array((select sentence from s), ' ') v cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b ) q ) q; Simplified to: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from regexp_split_to_array('Hello World Hello World', ' ') v cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1; phrase │ occurances ─┼ World Hello │ 1 Hello World Hello │ 1 Hello World │ 2 Hello World Hello World │ 1 World Hello World │ 1 merlin And since we're looking for repeated phrases maybe add having count(*) > 1 thanks. also, testing on actual data, I noticed that a couple other things are mandatory, mainly doing a bit of massaging before tokenizing: select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences from ( select array_agg(t) v from ( select trim(replace(unnest(v), E'\n', '')) t from regexp_split_to_array(, ' ') v ) q where length(t) > 1 ) q cross join lateral generate_series(1, array_upper(v, 1)) a cross join lateral generate_series(a + 1, array_upper(v, 1)) b group by 1 having count(*) > 1; We are definitely in N^2 space here, so look for things to start breaking down for sentences > 1000 words. merlin (for better complexity) you may search about "Ukkonen suffix tree" Similar problem as yours : https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp OpenPGP_signature Description: OpenPGP digital signature
Re: could not open relation with OID
Michael Paquier wrote on 1/26/22 9:14 PM: On Wed, Jan 26, 2022 at 05:30:01PM -0800, Ben Chobot wrote: Other things we've considered: - we run pg_repack, which certainly seems like it could make an error like this, but we see this error in places and times that pg_repack isn't currently running It could also take time for the issue to show up, depending on the state of the relcache. So tell me more about stale relcaches? It turns out I was totally wrong and this is being driven by pg_repack. I can even make it happen pretty easily: 1. Put some data in a table with a single btree index on a primary db. 2. Set up streaming replication to a secondary db. 3. In a loop on the primary, have pg_repack repack the indices of that table. (the -x flag) 4. In a loop on the secondary, have psql query the secondary db for an indexed value of that table. When I do this with replication, I can get the OID error consistently within 30 minutes. Without replication, I've been unable to get it to happen after 2 hours. Given that this fails much faster on the secondary than the primary (where it has yet to fail at all) I'm leaning towards a postgres bug, but I'm happy to do more research to point the blame at something pg_repack is doing, if you could point me at a thing to research.
Re: Counting the number of repeated phrases in a column
On Thu, Jan 27, 2022 at 11:56 AM wrote: > Le 27/01/2022 à 18:35, Merlin Moncure a écrit : > > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences > > from > > ( > >select array_agg(t) v > >from > >( > > select trim(replace(unnest(v), E'\n', '')) t > > from regexp_split_to_array(, ' ') v > >) q > >where length(t) > 1 > > ) q > > cross join lateral generate_series(1, array_upper(v, 1)) a > > cross join lateral generate_series(a + 1, array_upper(v, 1)) b > > group by 1 > > having count(*) > 1; > > > > We are definitely in N^2 space here, so look for things to start > > breaking down for sentences > 1000 words. > > > > merlin > > > > (for better complexity) you may search about "Ukkonen suffix tree" > Similar problem as yours : > https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp Yep. Many problems like this are well solved in imperative languages and will fit poorly into SQL quase-functional space. That implementation could probably be converted to pl/pgsql pretty easily, or a 'sql + tables' variant as a fun challenge. It also slightly exploits the fact that only the most repeated needle is returned, rather than all of them. Having the need to have single statement stateless SQL solutions to interesting problems comes up all the time in common development practice though for simplicity's sake even if there are better approaches out there. It's also fun. merlin
Re: Undetected Deadlock
> but I would expect drop concurrently to resolve your issue with the two > processes conflicting I guess you mean ALTER TABLE DETACH PARTITION ... CONCURRENTLY? DROP TABLE does not seem to have a concurrent option. Still that does seem like a good option to try, thanks for drawing it to my attention. Cheers Mike On Thu, 27 Jan 2022 at 14:20, Michael Lewis wrote: > > There may be a bug so perhaps still pursue reproducing the issue, but I would > expect drop concurrently to resolve your issue with the two processes > conflicting. Also, perhaps trying"insert, on conflict do update" could be > more efficient than the copy but obviously there are too many unknowns and > variables for either to be a definite winner.