On Fri, Jan 19, 2024 at 03:30:17PM +0000, fxkl4...@protonmail.com wrote:
>
>> But at this point, we have to wonder what the *actual* goal is.
>
> to exclude phrases with commas for separate examination

Parsing natural language text is going to be tricky. I can only talk
about English, and not about whatever language your text is actually
written in.

Let's look at a few example English sentences:

    Good morning, John.
    I went to the store with Mary, Paul, Susan and Ralph.
    I won, and you lost.
    The bear, who was hungry, looked for food.
    Oh, that's interesting.

These are five different examples of comma usage in English. Do you
happen to know in advance that your text will *only* contain samples
that use the fourth style above?

Let's assume this. Let's then form a template:

    STUFF, ASIDE, MORE STUFF, ASIDE, STILL MORE STUFF.

I.e. given a sentence which conforms to expectation, we should see an
even number of commas (is *THIS* why you were counting them??) and we
should extract the ASIDEs from in between the first and second, then
the third and fourth, and so on.

So... uh, I guess my next question is: are you *pre-filtering* the
sentences and keeping only the ones which have an even number of commas?
Or have you already *done* that, and now you're asking how to extract
the ASIDEs?

I really don't think I'd try this with shell scripts. The tools just
aren't designed for this. You really want tools that are custom built
for natural language processing, or, if you're trying to build your
tools from the ground up, a language that lets you run through a large
string character by character in a fast, efficient way (C comes to
mind).

The "obvious" algorithm for extracting the ASIDEs would be to use a
simple finite state machine, and march through the sentence character
by character. When you encounter a comma, change state. Otherwise, if
you're in the "ASIDE" state, copy the character to your output buffer.
When you leave the "ASIDE" state, terminate the current output buffer
and move to the next one. That's how I'd do it in C. Add whitespace
trimming and so on.

Also note that breaking a piece of natural language text *into*
sentences in the first place is extraordinarily difficult. If you
haven't already got a way to do that, you're probably screwed.
Seriously, asking the debian-user list how to count the number of
commas in a text file is *not* a good sign if you're dealing with a
master's-degree-level problem in natural language analysis.
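
For what it's worth, here is a rough sketch of the comma-toggling state
machine described above, in C. It assumes the input is already a single
sentence that follows the STUFF, ASIDE, MORE STUFF template. The names
count_commas, extract_asides, MAX_ASIDES and MAX_LEN are made up for
this illustration, and the fixed-size buffers are just a simplification.
Treat it as a starting point, not a finished tool.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_ASIDES 16   /* hypothetical limit on asides per sentence */
#define MAX_LEN    256  /* hypothetical limit on the length of one aside */

/* Count the commas in a sentence; the template expects an even number. */
size_t count_commas(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (const char *p = strchr(s, ','); p != NULL; p = strchr(p + 1, ','))
        n++;
    return n;
}

/*
 * March through the sentence one character at a time. Every comma
 * toggles between the STUFF and ASIDE states; characters seen while in
 * the ASIDE state are copied to the current output buffer, with leading
 * and trailing whitespace trimmed.
 *
 * Returns the number of asides stored in out[], or -1 if the sentence
 * ends inside an aside (odd comma count) or a buffer limit is exceeded.
 */
int extract_asides(const char *s, char out[][MAX_LEN], int max_asides)
{
    enum { STUFF, ASIDE } state = STUFF;
    int n = 0;        /* number of completed asides */
    size_t len = 0;   /* length of the aside being collected */

    assert(s != NULL && out != NULL);
    for (; *s != '\0'; s++) {
        if (*s == ',') {
            if (state == STUFF) {
                if (n >= max_asides)
                    return -1;
                state = ASIDE;
                len = 0;
            } else {
                /* Leaving ASIDE: trim trailing whitespace, terminate. */
                while (len > 0 && isspace((unsigned char)out[n][len - 1]))
                    len--;
                out[n][len] = '\0';
                n++;
                state = STUFF;
            }
        } else if (state == ASIDE) {
            if (len == 0 && isspace((unsigned char)*s))
                continue;           /* skip leading whitespace */
            if (len + 1 >= MAX_LEN)
                return -1;          /* aside too long for the buffer */
            out[n][len++] = *s;
        }
    }
    return state == STUFF ? n : -1;
}
```

Given "The bear, who was hungry, looked for food." this stores the single
aside "who was hungry". Given "I won, and you lost." it returns -1,
because the sentence ends while still in the ASIDE state. Note that the
list sentence about Mary, Paul, Susan and Ralph also has an even number
of commas, so this sketch would happily report "Paul" as an aside. The
comma count alone cannot tell the five styles apart.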