Dear pythonistas,
I would really like to have methods like String.substringAfter(sep) and
String.substringBefore(sep), as in Kotlin:
fun String.substringAfter(
delimiter: String,
missingDelimiterValue: String = this
): String
substringAfter takes a delimiter and keeps everything after the first
occurrence of the delimiter; substringBefore keeps everything before it. This
could be useful for simple HTML parsing, for example. Both also take a second
argument to use as the value when the separator is not found (defaults to the
whole input String).
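For readers who do not know Kotlin, the semantics described above could be sketched in Python roughly as follows. The names substring_after, substring_before and the keyword argument missing are placeholders of mine, not an existing API, and the sketch simply builds on str.partition:

```python
def substring_after(s, sep, missing=None):
    """Return the part of s after the first occurrence of sep.

    If sep does not occur, return `missing`, or s itself when `missing`
    is None (mirroring the Kotlin default of the whole input string).
    """
    _, found, after = s.partition(sep)
    if not found:
        return s if missing is None else missing
    return after


def substring_before(s, sep, missing=None):
    """Return the part of s before the first occurrence of sep.

    Uses the same fallback rule as substring_after when sep is absent.
    """
    before, found, _ = s.partition(sep)
    if not found:
        return s if missing is None else missing
    return before
```

With these, substring_after('a=b=c', '=') gives 'b=c', and substring_after('abc', '=') falls back to 'abc'.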
----
A lot of code in the wild currently uses str.split(sep) or manual indexing, as
exemplified by the two top answers here:
https://stackoverflow.com/q/12572362/2111778
- str.split(sep) runs into issues if the separator occurs repeatedly:
substringBefore = lambda s, sep: s.split(sep)[0] # Don't use this.
substringAfter = lambda s, sep: s.split(sep)[1] # Don't use this.
substringAfter = lambda s, sep: s.split(sep)[-1] # "Fixes" IndexError, but still don't use this.
This will bite users who don't know about the second argument, which limits
the number of splits:
substringBefore = lambda s, sep: s.split(sep, 1)[0]
substringAfter = lambda s, sep: s.split(sep, 1)[1] # IndexError if nonexistent
substringAfter = lambda s, sep: s.split(sep, 1)[-1] # original string if nonexistent
I have regrettably even written code like this before:
substringAfter = lambda s, sep: sep.join(s.split(sep)[1:])
- Another approach uses indexing:
substringAfter = lambda s, sep: s[s.index(sep) + len(sep):]
This works okay as a separate function but cannot be inlined very well due
to 'sep' and 's' both being used twice:
after = s[s.index('some separator string') + len('some separator string'):]
Plus it's too much cognitive load for a simple "substring after" operation.
- Regexes can be used, but these are too powerful and can introduce subtle
bugs. Java has this problem with their String#split method which takes a regex
String. So while in Python '8.8..'.split('.') == ['8', '8', '', ''], in Java
you have to doubly escape the dot using "8.8..".split("\\."), EXCEPT that will
result in a String[] of length 2 rather than 4, so you ACTUALLY have to do
"8.8..".split("\\.", -1). So regexes have huge potential for confusing users.
Another example with a substringAfter/substringBefore use case:
>>> import re
>>> re.findall(r'\(.*\)', 'whitespace bad (haha) (not really)')
['(haha) (not really)']
It is not obvious that the fix is a non-greedy quantifier:
>>> re.findall(r'\(.*?\)', 'whitespace bad (haha) (not really)')
['(haha)', '(not really)']
- The best alternative currently is str.partition(sep), but such code is not
very readable, and most users do not know about it, as shown by the
StackOverflow link. Note that if the separator is not found, this yields an
empty string for substringAfter and the original str for substringBefore:
substringAfter = lambda s, sep: s.partition(sep)[2]
substringBefore = lambda s, sep: s.partition(sep)[0]
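To make the differences in the list above concrete, here is a small comparison on one string with a repeated separator and one without the separator at all. The sample strings are made up for illustration:

```python
repeated = 'key=value=more'
absent = 'novalue'

# An unlimited split cuts at every separator, so "after" gets truncated.
assert repeated.split('=')[1] == 'value'
assert repeated.split('=')[-1] == 'more'

# Limiting to one split keeps everything after the first separator.
assert repeated.split('=', 1)[1] == 'value=more'

# partition always returns exactly three items.
assert repeated.partition('=') == ('key', '=', 'value=more')

# Without the separator, split returns a one-element list ...
assert absent.split('=', 1) == ['novalue']
# ... while partition pads with empty strings after the original.
assert absent.partition('=') == ('novalue', '', '')
```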
----
If added to the <str> class, a typical use case would be something like this:
bracketed = 'whitespace bad (haha) (not really)'.substringAfter('(').substringBefore(')')
assert bracketed == 'haha'
I find this code highly readable. Currently the best alternatives would be:
bracketed = 'whitespace bad (haha) (not really)'.partition('(')[2].partition(')')[0]
bracketed = 'whitespace bad (haha) (not really)'.split('(', 1)[-1].split(')', 1)[0]
bracketed = 'whitespace bad (haha) (not really)'.split('(', 1)[1].split(')', 1)[0]
None of these is very readable, and the latter two even contain 4
seemingly random integer constants. Even while writing this I got them wrong
multiple times. Plus they differ in behavior: the first one returns an empty
string if neither separator is found, the second one returns the original
string instead, and the third one throws an IndexError (but only if the first
separator is missing, not if the second one is, yikes).
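A quick check of the three variants side by side, wrapped in throwaway helper functions (the function names are mine, purely for illustration):

```python
def via_partition(s):
    return s.partition('(')[2].partition(')')[0]


def via_split_last(s):
    return s.split('(', 1)[-1].split(')', 1)[0]


def via_split_index(s):
    return s.split('(', 1)[1].split(')', 1)[0]


# All three agree when both separators are present.
text = 'whitespace bad (haha) (not really)'
assert via_partition(text) == via_split_last(text) == via_split_index(text) == 'haha'

# Neither separator present: three different outcomes.
plain = 'no brackets'
assert via_partition(plain) == ''
assert via_split_last(plain) == 'no brackets'
try:
    via_split_index(plain)
except IndexError:
    index_error_raised = True
else:
    index_error_raised = False
assert index_error_raised

# Only the closing separator missing: no error, all three return the rest.
unclosed = 'open (only'
assert via_partition(unclosed) == via_split_last(unclosed) == via_split_index(unclosed) == 'only'
```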
Monkey-patching the <str> class to add str.substringAfter(sep) on the user side
is also not possible as it is a C type.
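For reference, this is roughly what an attempt looks like. The sketch assumes CPython, where assigning an attribute on a built-in type raises TypeError; the exact error message differs between versions, so it is only printed here:

```python
def substring_after(s, sep):
    # Plain-function stand-in for the proposed method.
    return s.partition(sep)[2]


try:
    # Try to attach the function to the built-in str type.
    str.substringAfter = substring_after
except TypeError as exc:
    patched = False
    print(f'cannot patch str: {exc}')
else:
    patched = True
```

A subclass of str can carry such a method, but string literals and the results of ordinary str methods are still plain str objects, so the calls cannot be chained on them.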
I think this would fit well in Python (apart from the camelCase ;)) because it
would complement removeprefix/removesuffix which are being added in 3.9
already. Plus I do use substringAfter/substringBefore in Kotlin all the time.
One might even think about a substringBetween to be honest:
bracketed = 'whitespace bad (haha) (not really)'.substringBetween('(', ')')
assert bracketed == 'haha'
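As a free function, such a substringBetween could be sketched as below. The name substring_between is a placeholder, and returning an empty string when either delimiter is missing is just an assumption of this sketch; the behavior for that case would still need to be decided:

```python
def substring_between(s, start, end):
    """Return the text between the first `start` and the next `end` after it.

    Returns '' if either delimiter is missing (an assumption of this sketch).
    """
    _, found_start, rest = s.partition(start)
    if not found_start:
        return ''
    inner, found_end, _ = rest.partition(end)
    return inner if found_end else ''


bracketed = substring_between('whitespace bad (haha) (not really)', '(', ')')
assert bracketed == 'haha'
```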
As an alternative, I think str.partition(sep) could be changed to return a
NamedTuple rather than a simple tuple. This should be interoperable with
pre-existing code. This could enable calls as follows:
substringAfter = lambda s, sep: s.partition(sep).after
substringBefore = lambda s, sep: s.partition(sep).before
bracketed = 'whitespace bad (haha) (not really)'.partition('(').after.partition(')').before
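The NamedTuple idea can be tried out today with a helper function. In this sketch the class name Partition, the helper partition_named and the middle field name sep are all my own choices; only before and after come from the proposal above:

```python
from typing import NamedTuple


class Partition(NamedTuple):
    before: str
    sep: str
    after: str


def partition_named(s, sep):
    # Wrap the plain 3-tuple from str.partition in the named tuple.
    return Partition(*s.partition(sep))


p = partition_named('whitespace bad (haha) (not really)', '(')
assert p.after == 'haha) (not really)'

# Existing tuple-style code keeps working: indexing and unpacking.
assert p[2] == p.after
before, sep, after = p
assert isinstance(p, tuple)

# Chaining needs the helper again, since p.after is a plain str.
bracketed = partition_named(p.after, ')').before
assert bracketed == 'haha'
```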
What do you think?
Greetings Jan
_______________________________________________
Python-ideas mailing list -- [email protected]
Message archived at
https://mail.python.org/archives/list/[email protected]/message/TOUHY2LV4PAHEVH2BHGK5CXHKOPAOQNN/