Re: Code improvement question

MRAB via Python-list Fri, 17 Nov 2023 11:00:22 -0800

On 2023-11-17 09:38, jak via Python-list wrote:

Mike Dewhirst wrote:

On 15/11/2023 10:25 am, MRAB via Python-list wrote:

On 2023-11-14 23:14, Mike Dewhirst via Python-list wrote:

I'd like to improve the code below, which works. It feels clunky to me.


I need to clean up user-uploaded files the size of which I don't know in
advance.

After cleaning they might be as big as 1 MB but that would be super rare.
Perhaps only for testing.

I'm extracting CAS numbers, and the pattern runs from xx-xx-x up to
xxxxxxx-xx-x, e.g. 1012300-77-4

import re

def remove_alpha(txt):
    r"""  r'[^0-9\- ]':
    [^...]: Match any character that is not in the specified set.
    0-9: Match any digit.
    \: Escape character.
    -: Match a hyphen.
    Space: Match a space.
    """
    cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
    bits = cleaned_txt.split()
    pieces = []
    for bit in bits:
        # minimum size of a CAS number is 7 so drop smaller clumps of digits
        pieces.append(bit if len(bit) > 6 else "")
    return " ".join(pieces)


Many thanks for any hints

Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{1}\b', txt)
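
For anyone reading along, here is a minimal sketch of how a findall call like this behaves on a made-up string. The sample text and the names CAS_PATTERN and find_cas_numbers are assumptions for illustration, not from the original post. The final group is a single digit, following the xx-xx-x shape given in the question.

```python
import re

# Hypothetical name for the pattern: 2-7 digits, 2 digits, 1 digit,
# with \b on each side so longer digit runs are not partially matched.
CAS_PATTERN = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')

def find_cas_numbers(txt):
    """Return every CAS-shaped token found in txt, in order of appearance."""
    return CAS_PATTERN.findall(txt)

# Made-up sample: two CAS-shaped tokens and one phone-like distractor.
sample = "Water 7732-18-5, mystery compound 1012300-77-4, phone 555-1234."
print(find_cas_numbers(sample))  # ['7732-18-5', '1012300-77-4']
```

Unlike the re.sub() approach, this returns the matches as a list, so no splitting or length check is needed afterwards.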

I think I can see what you did there but it won't make sense to me - or whoever looks at the code - in future.

That answers your specific question. However, I am in awe of people who can just "do" regular expressions and I thank you very much for what would have been a monumental effort had I tried it.

That little re.sub() came from ChatGPT and I can understand it without too much effort because it came documented.


I suppose ChatGPT is the answer to this thread. Or everything. Or will be.

Thanks

Mike


I respect your opinion, but from the point of view of many Usenet users,
asking ChatGPT to solve your problem is truly overkill. The computer
world overflows with people who know regex. If you had not already
received an answer using 're', I would have sent you my suggestion,
which, as you can see, is practically identical. I am quite sure the
same solution came to mind for many people in this newsgroup.

with open(file) as fp:
      try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
      except: ret = []

The only difference is '\d' instead of '[0-9]' but they are equivalent.
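
One caveat that may be worth checking: for str patterns in Python 3, \d matches any Unicode decimal digit unless re.ASCII is set, so the two spellings only coincide on ASCII input. A small sketch of the difference follows. The sample string is made up, and the variable names are only for illustration.

```python
import re

# Assumed test data: one ASCII CAS-shaped token and one written
# with Arabic-Indic digits (U+0660..U+0669).
sample = "50-00-0 and \u0665\u0660-\u0660\u0660-\u0660"

# [0-9] only ever matches the ten ASCII digits.
ascii_class = re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b', sample)

# \d on a str pattern also matches non-ASCII decimal digits.
unicode_d = re.findall(r'\b\d{2,7}-\d{2}-\d\b', sample)

# With re.ASCII, \d is restricted to [0-9] again.
ascii_d = re.findall(r'\b\d{2,7}-\d{2}-\d\b', sample, flags=re.ASCII)

print(len(ascii_class), len(unicode_d), len(ascii_d))  # 1 2 1
```

For user-uploaded files that difference could matter, so [0-9] or re.ASCII is arguably the safer choice here.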

Bare excepts are a very bad idea.
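
To make that concrete: a bare except also catches KeyboardInterrupt and SystemExit, and it hides mistakes such as a misspelled name. Here is a hedged sketch of the same read-and-search with only the expected failure caught. The names CAS_RE and cas_numbers_in_file, the UTF-8 encoding, and the choice to return an empty list on OSError are assumptions, not part of the earlier post.

```python
import re

# Hypothetical compiled pattern, same shape as the one quoted above.
CAS_RE = re.compile(r'\b\d{2,7}-\d{2}-\d\b')

def cas_numbers_in_file(path):
    """Return CAS-shaped tokens from the file at path, or [] if it cannot be read."""
    try:
        # Encoding is an assumption; adjust to whatever the uploads use.
        with open(path, encoding="utf-8") as fp:
            text = fp.read()
    except OSError:
        # Missing file, permission problem and similar I/O failures.
        # Decoding errors and programming mistakes still propagate.
        return []
    return CAS_RE.findall(text)
```

Returning [] mirrors the fallback in the snippet above. Logging or re-raising may suit real code better.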
--
https://mail.python.org/mailman/listinfo/python-list
