Cédrick

In principle, what you are asking for is to identify 'islands' of structured 
information in a 'sea' of otherwise unstructured material, which is now a 
standard pattern in PetitParser. You could imagine a parser spec of the form:

(sea optional, (email/phone/address/....), sea optional) plus

where email etc. are parsers for the individual structures. As a parser this 
would probably lead to lots of backtracking and be hideously inefficient, but 
for a short text like an e-mail it could be usable. This also assumes that the 
items of interest are really structured; there could be many ways of writing 
phone numbers, for instance.
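
To make the idea concrete, here is a minimal sketch in Python rather than PetitParser, so nothing in it is PetitParser API. It assumes two deliberately naive island patterns (the name ISLANDS, the function find_islands and both regular expressions are made up for illustration) and lets the regex engine skip over the 'sea' between matches instead of backtracking through it:

```python
import re

# Hypothetical, deliberately simple island patterns. Real e-mail addresses and
# phone numbers come in many more shapes than these two expressions cover.
ISLANDS = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"   # e.g. jane.doe@example.org
    r"|(?P<phone>\+?\d[\d ./-]{6,}\d)"           # e.g. +33 5 62 44 27 00
)

def find_islands(text):
    """Return (kind, value) pairs for every island, in order of appearance.

    Everything between two matches is the 'sea' and is simply skipped.
    """
    return [(m.lastgroup, m.group()) for m in ISLANDS.finditer(text)]

sample = "Call me on +33 5 62 44 27 00 or write to jane.doe@example.org. Thanks!"
for kind, value in find_islands(sample):
    print(kind, value)
```

Each alternative plays the role of one island parser; adding address or event would mean adding another named group, and that is where the "many ways of writing" problem shows up in practice.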

HTH

Peter Kenny

-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Cédrick 
Béler
Sent: 07 March 2019 09:52
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Cc: Tudor Girba <tu...@tudorgirba.com>
Subject: [Pharo-users] Parsing text to discover general data of interest 
(phone, email, address, ...)

Hi all,

I often need to analyse random unstructured text (an email, for instance) to 
discover structured information in it and extract:
- emails
- telephone numbers
- addresses
- events
- person names (according to a list of known persons),
- etc… 

Apple do it in email for instance (strangely, this is not generalized).


So my questions are :
- do we have something equivalent in Smalltalk/Pharo? (I didn’t find anything)
- if not, what strategy would you use?
=> So far I do really crude text analysis (substrings, looking for @, …, parsing 
according to the text structure when there is one… a kind of Soup parsing…). I feel 
this is a job for PetitParser? It would also be a nice fit for the new GToolkit.

All ideas or suggestions are welcome ;-)


TIA,

Cédrick 



