On Fri, 19 Jan 2024, Greg Wooledge wrote:

> On Fri, Jan 19, 2024 at 03:30:17PM +0000, fxkl4...@protonmail.com wrote:
>>>> But at this point, we have to wonder what the *actual* goal is.
>>
>> to exclude phrases with commas for separate examination
>
> Parsing natural language text is going to be tricky. I can only talk
> about English, and not about whatever language your text is actually
> written in.
>
> Let's look at a few example English sentences:
>
> Good morning, John.
>
> I went to the store with Mary, Paul, Susan and Ralph.
>
> I won, and you lost.
>
> The bear, who was hungry, looked for food.
>
> Oh, that's interesting.
>
> These are five different examples of comma usage in English. Do you
> happen to know in advance that your text will *only* contain samples
> that use the fourth style above? Let's assume this. Let's then form
> a template:
>
> STUFF, ASIDE, MORE STUFF, ASIDE, STILL MORE STUFF.
>
> I.e. given a sentence which conforms to expectation, we should see
> an even number of commas (is *THIS* why you were counting them??) and
> we should extract the ASIDEs from in between the first and second, then
> the third and fourth, and so on.
>
> So... uh, I guess my next question is: are you *pre-filtering* the
> sentences and keeping only the ones which have an even number of
> commas? Or have you already *done* that, and now you're asking how
> to extract the ASIDEs?
>
> I really don't think I'd try this with shell scripts. The tools just
> aren't designed for this. You really want tools that are custom built
> for natural language processing, or a language that lets you run
> through a large string character by character in a fast, efficient
> way (C comes to mind) if you're trying to build your tools from the
> ground up.
>
> The "obvious" algorithm for extracting the ASIDEs would be to use a
> simple finite state machine, and march through the sentence
> character by character. When you encounter a comma, change state.
> Otherwise, if you're in the "ASIDE" state, copy the character to your
> output buffer. When you leave the "ASIDE" state, terminate the current
> output buffer and move to the next one. That's how I'd do it in C.
> Add whitespace trimming and so on.
>
> Also note that breaking a piece of natural language text *into*
> sentences in the first place is extraordinarily difficult. If you
> haven't already got a way to do that, you're probably screwed.
> Seriously, asking the debian-user list how to count the number of
> commas in a text file is *not* a good sign if you're dealing with a
> masters-degree-level problem in natural language analysis.
>
Certain characters give the interpreter a hard time. I'll just process what is easy for the interpreter, then work on the rest.
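
For reference, the finite state machine described in the quoted message could be sketched in C roughly as follows. This is only a minimal sketch under the same assumption as the STUFF/ASIDE template (every aside sits between a pair of commas, and the input is already a single sentence). The names `extract_asides`, `trim` and `ASIDE_MAX` are made up for illustration and do not come from the thread.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define ASIDE_MAX 128   /* hypothetical per-aside buffer size */

/* Strip leading and trailing whitespace from s, in place. */
static void trim(char *s)
{
    size_t len = strlen(s);
    size_t start = 0;

    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;
    while (start < len && isspace((unsigned char)s[start]))
        start++;
    memmove(s, s + start, len - start);
    s[len - start] = '\0';
}

/*
 * March through the sentence character by character.  Every comma
 * toggles between the STUFF and ASIDE states; characters seen while in
 * ASIDE are copied to the current output buffer.
 *
 * Returns the number of asides stored in out[], or -1 if the sentence
 * ends inside an aside (odd number of commas, so it does not fit the
 * template).  Asides beyond max_out are dropped, and an aside longer
 * than ASIDE_MAX - 1 characters is truncated.
 */
int extract_asides(const char *sentence, char out[][ASIDE_MAX], int max_out)
{
    enum { STUFF, ASIDE } state = STUFF;
    int n = 0;        /* completed asides */
    size_t len = 0;   /* characters in the current buffer */

    for (const char *p = sentence; *p != '\0'; p++) {
        if (*p == ',') {
            if (state == STUFF) {
                state = ASIDE;          /* opening comma: start a new buffer */
                len = 0;
            } else {
                if (n < max_out) {      /* closing comma: finish the buffer */
                    out[n][len] = '\0';
                    trim(out[n]);
                    n++;
                }
                state = STUFF;
            }
        } else if (state == ASIDE && n < max_out && len < ASIDE_MAX - 1) {
            out[n][len++] = *p;
        }
    }
    return state == ASIDE ? -1 : n;
}
```

Given `char out[4][ASIDE_MAX];`, calling `extract_asides("The bear, who was hungry, looked for food.", out, 4)` returns 1 and leaves "who was hungry" in `out[0]`, while "Good morning, John." returns -1 because it has an odd number of commas. Note that "I went to the store with Mary, Paul, Susan and Ralph." also has two commas, so it returns 1 with "Paul" as a bogus aside, which is exactly the kind of false match the quoted message warns about.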