On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote:
> On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote:
> > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > > I see, the list of exceptions makes this a lot more complicated than I
> > > thought... Thanks a lot, Erik!
> >
> > I expect yo
On Jun 13, 2005, at 10:55 AM, Andy Roberts wrote:
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!
I expect you'll need to do some pre-processing. Read in your text into a
buffer
On Jun 13, 2005, at 6:18 AM, Markus Wiederkehr wrote:
I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!
There is a section about the problems that hyphens create in
"Foundations of Statistical Natural Language Processing". Not only
are t
On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote:
> On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> > I see, the list of exceptions makes this a lot more complicated than I
> > thought... Thanks a lot, Erik!
> >
>
> I expect you'll need to do some pre-processing. Read in your text into a
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote:
> I see, the list of exceptions makes this a lot more complicated than I
> thought... Thanks a lot, Erik!
>
I expect you'll need to do some pre-processing. Read in your text into a
buffer, line-by-line. If a given line ends with a hyphen, you
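
A rough sketch of the pre-processing idea above, in plain Java with no Lucene dependency. The message is cut off before it says what to do with a line that ends in a hyphen, so the joining rule here is an assumption: drop the hyphen and glue the fragment onto the start of the next line when both sides are letters. The class name Dehyphenator and the method mergeHyphenatedLines are made up for illustration. Note that this naive rule also merges genuine compounds split at the hyphen (for example "well-" / "known" becomes "wellknown"), which is the kind of exception mentioned earlier in the thread.

```java
final class Dehyphenator {

    /**
     * Walks the text line by line and appends it to a buffer. When a line
     * ends with a letter followed by a hyphen, and the next line starts with
     * a letter, the hyphen and the line break are dropped so the two
     * fragments form one word. This joining rule is an assumption.
     */
    static String mergeHyphenatedLines(String text) {
        String[] lines = text.split("\n", -1);
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            boolean isLast = (i == lines.length - 1);
            String trimmed = line.replaceAll("\\s+$", "");
            if (!isLast && endsWithHyphenatedFragment(trimmed)) {
                String next = lines[i + 1].replaceAll("^\\s+", "");
                if (!next.isEmpty() && Character.isLetter(next.charAt(0))) {
                    // Keep everything except the trailing hyphen, emit no newline.
                    buffer.append(trimmed, 0, trimmed.length() - 1);
                    lines[i + 1] = next;
                    continue;
                }
            }
            buffer.append(line);
            if (!isLast) {
                buffer.append('\n');
            }
        }
        return buffer.toString();
    }

    /** True if the line ends with a letter immediately followed by '-'. */
    private static boolean endsWithHyphenatedFragment(String s) {
        int n = s.length();
        return n >= 2 && s.charAt(n - 1) == '-' && Character.isLetter(s.charAt(n - 2));
    }

    public static void main(String[] args) {
        String ocr = "Natural-\nly there occur many hy-\nphenated words.";
        System.out.println(mergeHyphenatedLines(ocr));
    }
}
```

The cleaned text could then be handed to whatever Analyzer is already in use. A dictionary lookup or an exception list would be needed to decide when the hyphen should be kept.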
I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!
Markus
On 6/13/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
> > I work on an application that has to index OCR texts of scanned books.
>
On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
I work on an application that has to index OCR texts of scanned books.
Naturally, many words are hyphenated across lines.
I wonder if there is already an Analyzer or maybe a TokenFilter that
can merge those syllables back int
Hello,
I work on an application that has to index OCR texts of scanned books.
Naturally, many words are hyphenated across lines.
I wonder if there is already an Analyzer or maybe a TokenFilter that
can merge those syllables back into whole words? It looks like Erik
Hatcher uses so