On Fri, 19 Jan 2024, Greg Wooledge wrote:

> On Fri, Jan 19, 2024 at 03:30:17PM +0000, fxkl4...@protonmail.com wrote:
>>>> But at this point, we have to wonder what the *actual* goal is.
>>
>> to exclude phrases with commas for separate examination
>
> Parsing natural language text is going to be tricky.  I can only talk
> about English, and not about whatever language your text is actually
> written in.
>
> Let's look at a few example English sentences:
>
>    Good morning, John.
>
>    I went to the store with Mary, Paul, Susan and Ralph.
>
>    I won, and you lost.
>
>    The bear, who was hungry, looked for food.
>
>    Oh, that's interesting.
>
> These are five different examples of comma usage in English.  Do you
> happen to know in advance that your text will *only* contain samples
> that use the fourth style above?  Let's assume this.  Let's then form
> a template:
>
>    STUFF, ASIDE, MORE STUFF, ASIDE, STILL MORE STUFF.
>
> I.e. given a sentence which conforms to expectation, we should see
> an even number of commas (is *THIS* why you were counting them??) and
> we should extract the ASIDEs from in between the first and second, then
> the third and fourth, and so on.
>
> So... uh, I guess my next question is: are you *pre-filtering* the
> sentences and keeping only the ones which have an even number of
> commas?  Or have you already *done* that, and now you're asking how
> to extract the ASIDEs?
>
> I really don't think I'd try this with shell scripts.  The tools just
> aren't designed for this.  You really want tools that are custom built
> for natural language processing, or a language that lets you run
> through a large string character by character in a fast, efficient
> way (C comes to mind) if you're trying to build your tools from the
> ground up.
>
> The "obvious" algorithm for extracting the ASIDEs would be to use a
> simple finite state machine, and march through the sentence
> character by character.  When you encounter a comma, change state.
> Otherwise, if you're in the "ASIDE" state, copy the character to your
> output buffer.  When you leave the "ASIDE" state, terminate the current
> output buffer and move to the next one.  That's how I'd do it in C.
> Add whitespace trimming and so on.
>
> Also note that breaking a piece of natural language text *into*
> sentences in the first place is extraordinarily difficult.  If you
> haven't already got a way to do that, you're probably screwed.
> Seriously, asking the debian-user list how to count the number of
> commas in a text file is *not* a good sign if you're dealing with a
> masters-degree-level problem in natural language analysis.
>
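
For reference, the state machine described in the quoted text could be
sketched roughly as follows. This is only an illustration in Python rather
than C, and it assumes every sentence follows the
"STUFF, ASIDE, MORE STUFF" template. The function names extract_asides and
has_even_commas are made up for this sketch and do not come from the thread.

```python
def has_even_commas(sentence):
    """Pre-filter: True if the sentence has an even number of commas."""
    return sentence.count(",") % 2 == 0


def extract_asides(sentence):
    """Collect the text between the 1st and 2nd comma, the 3rd and 4th, etc."""
    asides = []
    buffer = []
    in_aside = False  # the two states: inside an ASIDE, or outside one
    for ch in sentence:
        if ch == ",":
            if in_aside:
                # Leaving the ASIDE state: terminate the current buffer,
                # trimming surrounding whitespace.
                asides.append("".join(buffer).strip())
                buffer = []
            in_aside = not in_aside  # every comma flips the state
        elif in_aside:
            buffer.append(ch)
    # With an odd comma count the last aside is never closed; this sketch
    # silently drops it rather than guessing where it ends.
    return asides


if __name__ == "__main__":
    for s in ("The bear, who was hungry, looked for food.",
              "I went to the store with Mary, Paul, Susan and Ralph."):
        print(has_even_commas(s), extract_asides(s))
```

The second sample sentence passes the even-comma check but yields "Paul"
as an aside, which shows that counting commas alone cannot tell the fourth
comma style apart from a list.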


certain characters give the interpreter a hard time
i'll just process what is easy for the interpreter
then work on the rest
