Hi, On Sun, Feb 28, 2021, at 03:13, David Fetter wrote: > Maybe an int4multirange, which would fit unless I'm misunderstanding > g's meaning with respect to non-overlapping patterns, but that might > be a little too cute and not easy ever to extend. > > Come to that, would a row structure that looked like > > (match, start, end) > > be useful?
Nice, didn't know about the new multirange. Its data structure seems like a perfect fit for this. Hmm, I cannot find any catalog function to extract the ranges from the data structure though? As a caller, I might need the exact start/end values, not just wanting to know if a certain values overlaps any of the ranges. Is there such a function? Here is a PoC that just returns the start_pos and end_pos for all the matches. It would be simple to modify it to instead return multirange. CREATE OR REPLACE FUNCTION regexp_ranges(string text, pattern text, OUT start_pos integer, OUT end_pos integer) RETURNS SETOF RECORD LANGUAGE plpgsql AS $$ DECLARE match text; remainder text := string; len integer; BEGIN end_pos := 0; -- -- Ignore possible capture groups, instead just wrap the entire regex -- in an enclosing capture group, which is then extracted as the first array element. -- FOR match IN SELECT (regexp_matches(string,format('(%s)',pattern),'g'))[1] LOOP len := length(match); start_pos := position(match in remainder) + end_pos; end_pos := start_pos + len - 1; RETURN NEXT; remainder := right(remainder, -len); END LOOP; RETURN; END $$; This works fine for small strings: SELECT * FROM regexp_ranges('aaaa aa aaa','a+'); start_pos | end_pos -----------+--------- 1 | 4 6 | 7 10 | 12 (3 rows) Time: 0.209 ms But quickly gets slow for longer strings: SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',10000),'a+'); 20001 Time: 98.663 ms SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',20000),'a+'); 40001 Time: 348.027 ms SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',30000),'a+'); 60001 Time: 817.412 ms SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',40000),'a+'); 80001 Time: 1478.438 ms (00:01.478) Compared to the much nicer observed O-notation for regexp_matches(): SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',10000),'(a+)','g'); 20001 Time: 12.602 ms SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',20000),'(a+)','g'); 40001 Time: 25.161 ms SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',30000),'(a+)','g'); 60001 Time: 44.795 ms SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',40000),'(a+)','g'); 80001 Time: 57.292 ms /Joel