[Python-ideas] str.substring_after and str.substring_before as in Kotlin

janfrederik . konopka Sat, 03 Apr 2021 07:10:50 -0700

Dear pythonistas,


I would really like to have functions like String.substringAfter(sep) and 
String.substringBefore(sep) as in Kotlin.

fun String.substringAfter(
    delimiter: String,
    missingDelimiterValue: String = this
): String

substringAfter takes a delimiter and keeps everything after the first 
occurrence of that delimiter; substringBefore keeps everything before it. 
This could be useful for simple HTML parsing, for example. Both also take a 
second argument to use as the value when the separator is not found (defaults 
to the whole input String).
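
For illustration, the Kotlin behaviour described above could be sketched in 
plain Python roughly as follows. The snake_case names and the 'missing' 
parameter are hypothetical (they mirror missingDelimiterValue and are not part 
of any existing Python API), and the sketch assumes that an empty separator 
should raise ValueError, as str.partition does.

```python
from typing import Optional


def substring_after(s: str, sep: str, missing: Optional[str] = None) -> str:
    """Return the text after the first occurrence of sep (hypothetical helper)."""
    _, found, after = s.partition(sep)
    if found:
        return after
    # Separator not found: fall back to the whole input unless told otherwise.
    return s if missing is None else missing


def substring_before(s: str, sep: str, missing: Optional[str] = None) -> str:
    """Return the text before the first occurrence of sep (hypothetical helper)."""
    before, found, _ = s.partition(sep)
    if found:
        return before
    return s if missing is None else missing


print(substring_after('key=value=x', '='))    # value=x
print(substring_before('key=value=x', '='))   # key
print(substring_after('novalue', '=', ''))    # prints an empty line
```

Using None as the sentinel keeps the Kotlin default (the whole input) while 
still allowing an explicit empty string as the fallback.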

----

A lot of code in the wild currently uses str.split(sep) or manual indexing, as 
exemplified by the two top answers here: 
https://stackoverflow.com/q/12572362/2111778

- str.split(sep) runs into issues if the separator occurs repeatedly:
        substringBefore = lambda s, sep: s.split(sep)[0]    # Don't use this.
        substringAfter  = lambda s, sep: s.split(sep)[1]    # Don't use this.
        # "Fixes" the IndexError, but still don't use this:
        substringAfter  = lambda s, sep: s.split(sep)[-1]
    This will bite users who don't know about the second argument, which 
    limits the number of splits:
        substringBefore = lambda s, sep: s.split(sep, 1)[0]
        # IndexError if nonexistent:
        substringAfter  = lambda s, sep: s.split(sep, 1)[1]
        # Original string if nonexistent:
        substringAfter  = lambda s, sep: s.split(sep, 1)[-1]
    I have regrettably even written code like this before:
        substringAfter  = lambda s, sep: sep.join(s.split(sep)[1:])

- Another approach uses indexing:
        substringAfter = lambda s, sep: s[s.index(sep) + len(sep):]
    This works okay as a separate function (although it raises a ValueError 
    if the separator is missing) but cannot be inlined very well due to 'sep' 
    and 's' both being used twice:
        after = s[s.index('some separator string')
                  + len('some separator string'):]
    Plus it's too much cognitive load for a simple "substring after" operation.

- Regexes can be used, but these are too powerful and can introduce subtle 
bugs. Java has this problem with their String#split method which takes a regex 
String. So while in Python '8.8..'.split('.') == ['8', '8', '', ''], in Java 
you have to doubly escape the dot using "8.8..".split("\\."), EXCEPT that will 
result in a String[] of length 2 rather than 4, so you ACTUALLY have to do 
"8.8..".split("\\.", -1). So regexes have huge potential for confusing users. 
Another example with a substringAfter/substringBefore use case:
        >>> import re
        >>> re.findall(r'\(.*\)', 'whitespace bad (haha) (not really)')
        ['(haha) (not really)']
    The fix, making the quantifier non-greedy, is not obvious:
        >>> re.findall(r'\(.*?\)', 'whitespace bad (haha) (not really)')
        ['(haha)', '(not really)']

- The best alternative currently is str.partition(sep), but such code is not 
very readable, and most users do not know about it, as suggested by the 
StackOverflow link. Note that if the separator is not found, this yields an 
empty string for substringAfter and the original str for substringBefore:
        substringAfter  = lambda s, sep: s.partition(sep)[2]
        substringBefore = lambda s, sep: s.partition(sep)[0]
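
To make the differences between these approaches concrete, here is a small 
sketch that runs each of them against one input with a repeated separator and 
one input without any separator. The sample strings and the error_name helper 
are made up for this illustration.

```python
repeated = 'a=b=c'
missing = 'abc'

# Repeated separator: an unbounded split loses part of the tail.
after_unbounded = repeated.split('=')[1]                  # 'b'
after_last = repeated.split('=')[-1]                      # 'c'
after_bounded = repeated.split('=', 1)[1]                 # 'b=c'
after_index = repeated[repeated.index('=') + len('='):]   # 'b=c'
after_partition = repeated.partition('=')[2]              # 'b=c'

# Missing separator: each approach reacts differently.
split_fallback = missing.split('=', 1)[-1]                # 'abc'
partition_result = missing.partition('=')                 # ('abc', '', '')


def error_name(func):
    """Return the name of the exception raised by func(), or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


split_error = error_name(lambda: missing.split('=', 1)[1])   # 'IndexError'
index_error = error_name(lambda: missing.index('='))         # 'ValueError'
```

Only the bounded split, the index arithmetic and str.partition keep the full 
tail, and of those only str.partition never raises.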

----

If added to the <str> class, a typical use case would be something like this:
        bracketed = ('whitespace bad (haha) (not really)'
                     .substringAfter('(').substringBefore(')'))
        assert bracketed == 'haha'

I find this code highly readable. Currently the best alternatives would be:
        bracketed = ('whitespace bad (haha) (not really)'
                     .partition('(')[2].partition(')')[0])
        bracketed = ('whitespace bad (haha) (not really)'
                     .split('(', 1)[-1].split(')', 1)[0])
        bracketed = ('whitespace bad (haha) (not really)'
                     .split('(', 1)[1].split(')', 1)[0])

None of these is very readable, and the latter two even have four 
seemingly random integer constants each. Even while writing this I got them 
wrong multiple times. Plus they differ in behavior: the first one returns an 
empty string if the opening separator is not found, the second one falls back 
to the original string instead, and the third one raises an IndexError (but 
only if the first separator is missing, not if the second one is, yikes).
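
The three one-liners can be compared side by side with a short sketch. The 
function names and the extra sample inputs are invented for this comparison; 
only the expressions themselves come from the examples above.

```python
def via_partition(s):
    return s.partition('(')[2].partition(')')[0]


def via_split_last(s):
    return s.split('(', 1)[-1].split(')', 1)[0]


def via_split_index(s):
    return s.split('(', 1)[1].split(')', 1)[0]


full = 'whitespace bad (haha) (not really)'
no_brackets = 'no brackets'
open_only = 'only (open'

for name, func in [('partition', via_partition),
                   ('split[-1]', via_split_last),
                   ('split[1]', via_split_index)]:
    for text in (full, no_brackets, open_only):
        try:
            result = repr(func(text))
        except IndexError:
            result = 'IndexError'
        print(f'{name:10} {text!r:40} -> {result}')
```

All three agree on the full input and on the input that only lacks the closing 
bracket; they diverge only when the opening bracket is missing.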

Monkey-patching the <str> class to add str.substringAfter(sep) on the user side 
is also not possible as it is a C type.
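
A quick sketch of what happens when one tries, assuming CPython. The 
substring_after function and the Text subclass are hypothetical names used 
only for this demonstration.

```python
def substring_after(self, sep):
    """Hypothetical method: text after the first sep, or '' if sep is absent."""
    return self.partition(sep)[2]


patch_error = None
try:
    # Built-in types do not accept new attributes.
    str.substring_after = substring_after
except TypeError as exc:
    patch_error = exc

print(type(patch_error).__name__)   # TypeError


# A subclass works, but only for objects created through it explicitly.
class Text(str):
    substring_after = substring_after


print(Text('key=value').substring_after('='))   # value
print(type(Text('key=value').upper()).__name__)  # str: the subclass is lost
```

The subclass route is of limited use because string literals and the results of 
ordinary str methods are plain str objects again.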

I think this would fit well in Python (apart from the camelCase ;)) because it 
would complement removeprefix/removesuffix, which were already added in 3.9. 
Plus I do use substringAfter/substringBefore in Kotlin all the time.

One might even think about a substringBetween to be honest:
        bracketed = ('whitespace bad (haha) (not really)'
                     .substringBetween('(', ')'))
        assert bracketed == 'haha'
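
As a rough sketch of what such a function might do, here is a free-standing 
version built on str.partition. The name substring_between, the 'missing' 
parameter, and the choice to fall back to the whole input when either 
delimiter is absent are all assumptions made for this illustration.

```python
from typing import Optional


def substring_between(s: str, start: str, end: str,
                      missing: Optional[str] = None) -> str:
    """Return the text between the first start and the next end (hypothetical)."""
    fallback = s if missing is None else missing
    _, found_start, rest = s.partition(start)
    if not found_start:
        return fallback
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return fallback
    return middle


print(substring_between('whitespace bad (haha) (not really)', '(', ')'))  # haha
print(substring_between('no brackets', '(', ')'))  # no brackets
```

Other fallback rules are equally defensible, for example returning everything 
after the opening delimiter when the closing one is missing.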

As an alternative, I think str.partition(sep) could be changed to return a 
NamedTuple rather than a simple tuple. This should be interoperable with 
pre-existing code. This could enable calls as follows:
        substringAfter  = lambda s, sep: s.partition(sep).after
        substringBefore = lambda s, sep: s.partition(sep).before
        bracketed = ('whitespace bad (haha) (not really)'
                     .partition('(').after.partition(')').before)
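
The idea can be approximated today with a wrapper function. In this sketch the 
Partition class, the partition wrapper, and the middle field name 'sep' are 
assumptions; only 'before' and 'after' are taken from the calls above.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    before: str
    sep: str
    after: str


def partition(s: str, sep: str) -> Partition:
    """Wrap str.partition so that the result has named fields (hypothetical)."""
    return Partition(*s.partition(sep))


text = 'whitespace bad (haha) (not really)'
bracketed = partition(partition(text, '(').after, ')').before
print(bracketed)   # haha

# Existing tuple-style code keeps working on the named result.
head, found, tail = partition('key=value', '=')
print(partition('key=value', '=')[2])   # value
```

Because a named tuple is still a tuple, indexing, unpacking and equality 
comparisons against plain tuples behave as before.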

What do you think?


Greetings Jan
