On Wed, Dec 11, 2019 at 1:31 PM Ben Bacarisse <ben.use...@bsb.me.uk> wrote:
>
> A S <aishan0...@gmail.com> writes:
>
> > I would like to extract all words within specific keywords in a .txt
> > file.  For the keywords, there is a starting keyword of "PROC SQL;" (I
> > need this to be case insensitive) and the ending keyword could be
> > either "RUN;", "quit;" or "QUIT;".  This is my sample .txt file.
> >
> > Thus far, this is my code:
> >
> > with open('lan sample text file1.txt') as file:
> >     text = file.read()
> > regex = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')
> > k = regex.findall(text)
> > print(k)
>
> Try
>
>   re.compile(r'(?si)(PROC SQL;.*(?:QUIT|RUN);)')
>
> Read up on what (?si) means and what (?:...) means.  You can do the
> same by passing flags to the compile method.
>
> > Output:
> >
> > [('quit;', ''), ('quit;', ''), ('PROC SQL;', '')]
>
> Your main issue is that | binds weakly.  Your whole pattern tries to
> match any one of just four short sub-patterns:
>
>   PROC SQL;
>   proc sql;(.*?)RUN;
>   quit;
>   QUIT;
>
> --
> Ben.
> --
> https://mail.python.org/mailman/listinfo/python-list
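To see the precedence point in action, here is a small sketch.  The
sample text is made up for illustration (it is not the original
poster's file), and the non-greedy .*? in the second pattern is my own
variation on Ben's suggestion so that each PROC SQL block comes out
separately; his pattern as written uses a greedy .* instead.

```python
import re

# Hypothetical sample text, standing in for the poster's .txt file.
text = (
    "PROC SQL;\n"
    "select * from t1;\n"
    "quit;\n"
    "proc sql;\n"
    "select * from t2;\n"
    "RUN;\n"
)

# Original pattern: | splits the whole group into four alternatives,
# so only the short keywords themselves are matched.
loose = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')
loose_result = loose.findall(text)

# Grouped pattern: (?s) lets . match newlines, (?i) ignores case, and
# (?:...) groups the ending keywords without capturing them.
grouped = re.compile(r'(?si)PROC SQL;(.*?)(?:QUIT|RUN);')
blocks = [body.strip() for body in grouped.findall(text)]

print(loose_result)
print(blocks)
```

With this sample, the second pattern returns only the text between the
keywords, one entry per block.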
Consider using Python string functions.

1. Read your file into a string; let's call it s.
2. start = s.find("PROC SQL;")
   This finds the starting point and returns its index (or -1 if the
   keyword is not there).  find is case sensitive, so search a copy made
   with s.upper() if you need to ignore case.
3. Do the same for each of the possible ending strings.  Use if/else to
   pick whichever one is found.
4. This will give you your ending index.
5. Slice out the included string.  The slice starts at
   start + len("PROC SQL;") and stops at the ending index, since find
   returns the position where the ending keyword begins.

Regular expressions are powerful, but not so easy to read unless you are
really into them.

--
Joel Goldstick
http://joelgoldstick.com/blog
http://cc-baseballstats.info/stats/birthdays
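The string-function steps above can be sketched roughly as follows.
The function name extract_blocks and its parameters are made up for
illustration, and the sketch assumes plain ASCII text, so that
upper-casing does not change the length of the string and the indexes
found in the upper-cased copy still line up with the original.

```python
def extract_blocks(s, start_kw="PROC SQL;", end_kws=("RUN;", "QUIT;")):
    """Return the text between start_kw and the nearest ending keyword.

    Hypothetical helper; keywords are matched case-insensitively by
    searching an upper-cased copy of s (assumes ASCII input).
    """
    upper = s.upper()
    blocks = []
    pos = 0
    while True:
        start = upper.find(start_kw, pos)
        if start == -1:
            break  # no more starting keywords
        body_start = start + len(start_kw)
        # Look for every ending keyword and keep the ones that were found.
        found = [upper.find(kw, body_start) for kw in end_kws]
        found = [i for i in found if i != -1]
        if not found:
            break  # starting keyword without a matching end
        end = min(found)  # the nearest ending keyword wins
        blocks.append(s[body_start:end].strip())
        pos = end
    return blocks


sample = "PROC SQL;\nselect * from t1;\nquit;\nproc sql;\nselect * from t2;\nRUN;\n"
result = extract_blocks(sample)
print(result)
```

Taking the smallest index among the ending keywords plays the role of
the if/else in step 3: whichever ending string comes first after the
start closes the block.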