Re: counting commas

Greg Wooledge Fri, 19 Jan 2024 08:47:12 -0800

On Fri, Jan 19, 2024 at 03:30:17PM +0000, fxkl4...@protonmail.com wrote:
> >> But at this point, we have to wonder what the *actual* goal is.
> 
> to exclude phrases with commas for separate examination


Parsing natural language text is going to be tricky.  I can only talk
about English, and not about whatever language your text is actually
written in.

Let's look at a few example English sentences:

    Good morning, John.

    I went to the store with Mary, Paul, Susan and Ralph.

    I won, and you lost.

    The bear, who was hungry, looked for food.

    Oh, that's interesting.

These are five different examples of comma usage in English.  Do you
happen to know in advance that your text will *only* contain samples
that use the fourth style above?  Let's assume this.  Let's then form
a template:

    STUFF, ASIDE, MORE STUFF, ASIDE, STILL MORE STUFF.

I.e. given a sentence which conforms to expectation, we should see
an even number of commas (is *THIS* why you were counting them??) and
we should extract the ASIDEs from in between the first and second, then
the third and fourth, and so on.
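
A minimal sketch of that even-comma check in C, assuming each sentence
has already been split out into its own NUL-terminated string.  The
names count_commas and fits_template are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the commas in a NUL-terminated sentence. */
size_t count_commas(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (*s == ',')
            n++;
    return n;
}

/* True if the sentence has a nonzero, even number of commas, i.e. it
   *could* match the STUFF, ASIDE, MORE STUFF template. */
bool fits_template(const char *s)
{
    size_t n = count_commas(s);

    return n > 0 && n % 2 == 0;
}
```

An even count is only a necessary condition.  The "Mary, Paul, Susan
and Ralph" sentence above also has two commas, so this check passes it
even though it contains a list and no ASIDE.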

So... uh, I guess my next question is: are you *pre-filtering* the
sentences and keeping only the ones which have an even number of
commas?  Or have you already *done* that, and now you're asking how
to extract the ASIDEs?

I really don't think I'd try this with shell scripts.  The tools just
aren't designed for this.  You really want tools that are custom built
for natural language processing, or a language that lets you run
through a large string character by character in a fast, efficient
way (C comes to mind) if you're trying to build your tools from the
ground up.

The "obvious" algorithm for extracting the ASIDEs would be to use a
simple finite state machine, and march through the sentence
character by character.  When you encounter a comma, change state.
Otherwise, if you're in the "ASIDE" state, copy the character to your
output buffer.  When you leave the "ASIDE" state, terminate the current
output buffer and move to the next one.  That's how I'd do it in C.
Add whitespace trimming and so on.
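
Here is a rough sketch of that state machine, under some assumptions
that are mine and not part of the problem statement: the input is a
single NUL-terminated sentence, each ASIDE fits in a fixed-size buffer
(longer ones are silently truncated), and a trailing unmatched comma
just means the last partial ASIDE is thrown away.  The names
extract_asides and MAX_ASIDE_LEN are hypothetical:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_ASIDE_LEN 256   /* arbitrary buffer size for one ASIDE */

enum state { OUTSIDE, ASIDE };

/* Walk the sentence one character at a time and copy the text between
   the 1st and 2nd comma, the 3rd and 4th, and so on, into out[].
   Returns the number of complete ASIDEs stored (at most max_out). */
size_t extract_asides(const char *sentence,
                      char out[][MAX_ASIDE_LEN], size_t max_out)
{
    enum state st = OUTSIDE;
    size_t count = 0;   /* complete ASIDEs stored so far */
    size_t len = 0;     /* length of the ASIDE being built */

    assert(sentence != NULL && out != NULL);

    for (const char *p = sentence; *p != '\0'; p++) {
        if (*p == ',') {
            if (st == OUTSIDE) {
                if (count == max_out)
                    break;              /* no room for another ASIDE */
                st = ASIDE;
                len = 0;
            } else {
                /* Leaving ASIDE: trim trailing whitespace, terminate
                   this buffer, and move on to the next one. */
                while (len > 0 && isspace((unsigned char)out[count][len - 1]))
                    len--;
                out[count][len] = '\0';
                count++;
                st = OUTSIDE;
            }
        } else if (st == ASIDE) {
            if (len == 0 && isspace((unsigned char)*p))
                continue;               /* skip leading whitespace */
            if (len < MAX_ASIDE_LEN - 1)
                out[count][len++] = *p; /* drop characters that don't fit */
        }
    }
    return count;   /* an unterminated ASIDE is not counted */
}
```

Called on "The bear, who was hungry, looked for food." with room for a
few results, this stores the single ASIDE "who was hungry" and returns
1.  Called on "Good morning, John." it returns 0, because the one comma
never gets a partner.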

Also note that breaking a piece of natural language text *into*
sentences in the first place is extraordinarily difficult.  If you
haven't already got a way to do that, you're probably screwed.
Seriously, asking the debian-user list how to count the number of
commas in a text file is *not* a good sign if you're dealing with a
master's-degree-level problem in natural language analysis.
