On 2020-03-23 06:00:41 +1100, Chris Angelico wrote:
> Second point, and related to the above. The regex that defines break
> points, as found in the source code, is:
>
>     wordsep_re = re.compile(r'''
>         ( # any whitespace
>           %(ws)s+
>         | # em-dash between words
>           (?<=%(wp)s) -{2,} (?=\w)
>         | # word, possibly hyphenated
>           %(nws)s+? (?:
>             # hyphenated word
>               -(?: (?<=%(lt)s{2}-) | (?<=%(lt)s-%(lt)s-))
>               (?= %(lt)s -? %(lt)s)
>             | # end of word
>               (?=%(ws)s|\Z)
>             | # em-dash
>               (?<=%(wp)s) (?=-{2,}\w)
>             )
>         )''' % {'wp': word_punct, 'lt': letter,
>                 'ws': whitespace, 'nws': nowhitespace},
>
> It's built primarily out of small matches with long assertions, eg
> "match a hyphen, as long as it's preceded by two letters or a letter
> and a hyphen".

Do you need that fancy logic? Could you break only on whitespace
instead?

It won't wrap "tetrabromo-phenolsulfonephthalein" in that case, but since
you mentioned it's for a Twitter client, most users probably won't mind
(and those who do mind will probably insist that the algorithm should be
able to split it into tetrabromo-phenolsulfone-
phthalein, if that's where the line end is, as it was here purely by
lucky accident).

A regexp for whitespace is pretty simple.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
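
For what it's worth, here is a minimal sketch of the whitespace-only
approach suggested above. The helper name wrap_on_whitespace, the sample
text and the width of 30 are made up for illustration; they are not from
the Twitter client under discussion. The last lines show that the stdlib
can do roughly the same thing via textwrap.wrap() with
break_on_hyphens=False and break_long_words=False.

```python
import re
import textwrap

# The only break points: runs of whitespace.
WHITESPACE_RE = re.compile(r"\s+")


def wrap_on_whitespace(text, width=70):
    """Greedy wrap that breaks only at whitespace (hypothetical helper).

    A word longer than `width` is never split; it simply gets a line
    of its own and overflows.
    """
    words = [w for w in WHITESPACE_RE.split(text) if w]
    lines = []
    current = ""
    for word in words:
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= width:
            current += " " + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


sample = "It won't wrap tetrabromo-phenolsulfonephthalein in that case"
lines = wrap_on_whitespace(sample, width=30)

# Similar behaviour from the standard library, without the hyphen logic.
stdlib_lines = textwrap.wrap(
    sample, width=30, break_on_hyphens=False, break_long_words=False
)

for line in lines:
    print(line)
```

The hyphenated word stays in one piece in both variants, which is the
trade-off mentioned above: simpler break logic, but long compound words
can overflow the line.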
-- https://mail.python.org/mailman/listinfo/python-list