Re: [Pharo-users] Parsing text to discover general data of interest (phone, email, address, ...)

Cédrick Béler Thu, 07 Mar 2019 23:54:27 -0800

> 
> 
> You say that named entity recognition is not generalised beyond Mail,
> but the support library is there for anyone to use.  See for
> example 
> https://developer.apple.com/documentation/foundation/nslinguistictagger/identifying_people_places_and_organizations
Yes true.

> 
> In Python, you can use NLTK to do roughly the same.
> 
> There's no real point in reimplementing this stuff in Pharo.
> Just set up a separate process, send text to it, and receive
> results back.

I agree, that is an excellent option.
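
To make the "separate process" idea concrete, here is a minimal sketch of what the Python side could look like. Everything in it is an assumption for illustration: the one-JSON-object-per-line protocol, the function names (extract_entities, handle_line, serve) and the placeholder extraction, which only looks for tokens containing "@". A real worker would call NLTK at that point, and the Pharo side that starts the process and writes to its stdin is not shown.

```python
import json
import sys


def extract_entities(text):
    # Placeholder extraction (assumption): a real worker would call NLTK here.
    # Any whitespace-separated token containing "@" is treated as an email.
    emails = [tok.strip(".,;:<>()") for tok in text.split() if "@" in tok]
    return {"emails": emails}


def handle_line(line):
    # One request per line: {"text": "..."}; one JSON reply per line.
    request = json.loads(line)
    return json.dumps(extract_entities(request["text"]))


def serve(inp, out):
    # Loop until the caller closes our input stream.
    for line in inp:
        if line.strip():
            out.write(handle_line(line) + "\n")
            out.flush()  # flush so the caller sees each reply immediately


# To run as a worker process, call: serve(sys.stdin, sys.stdout)
```

Keeping the protocol line-based means the Pharo image only has to write a line and read a line, so the worker can later be swapped for something heavier without changing the caller.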

"NLTK is a leading platform for building Python programs to work with human 
language data. It provides easy-to-use interfaces to over 50 corpora and 
lexical resources <http://nltk.org/nltk_data/> such as WordNet, along with a 
suite of text processing libraries for classification, tokenization, stemming, 
tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP 
libraries, and an active discussion forum 
<http://groups.google.com/group/nltk-users>." 

Thanks for pointing out NLTK. Great tool for sure. I agree that there is no point in 
reimplementing, but in the case of NLP this might be worth it, as I think we 
already have all the good foundations (basic string manipulation, PP, Zn, XML, etc…). 
The hard part is probably the integration of the 100 corpora and lexical 
resources (http://www.nltk.org/nltk_data/)...

The thing is I need a lighter version, based on my own (growing) corpora (my 
experience). 
So Hernan's solution, PP, or straight string processing would do the job.
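
As a rough illustration of the "straight string processing" route, here is a small sketch, written in Python only to stay close to the NLTK discussion. The two patterns are deliberately loose assumptions rather than validated formats, and the names EMAIL_RE, PHONE_RE and extract are made up for the example. Person names are matched against a caller-supplied list of known persons; addresses and events are left out because they need more than a regular expression.

```python
import re

# Loose email pattern (assumption): local part, "@", domain with a dotted suffix.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Loose phone pattern (assumption): optional "+", then at least eight
# characters made of digits, spaces, dots or hyphens, starting and ending
# with a digit.
PHONE_RE = re.compile(r"\+?\d[\d .-]{6,}\d")


def extract(text, known_persons=()):
    # Case-insensitive substring match against the list of known persons.
    lowered = text.lower()
    persons = [name for name in known_persons if name.lower() in lowered]
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "persons": persons,
    }


sample = "Call Alice Martin on +33 5 62 44 27 00 or mail alice@example.org."
result = extract(sample, known_persons=["Alice Martin", "Bob Durand"])
```

The same patterns could be expressed as PetitParser rules on the Pharo side; the point of the sketch is only that a few patterns plus a personal list of names already cover the simple cases.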

But thanks, this is something I will explore too (especially the huge 
resources).

Cheers,

Cédrick

> 
> 
> On Thu, 7 Mar 2019 at 22:53, Cédrick Béler <cdric...@gmail.com 
> <mailto:cdric...@gmail.com>> wrote:
> Hi all,
> 
> I often need to analyse some random unstructured text to discover 
> (structured) information (in emails, for instance), extracting:
> - emails
> - telephone numbers
> - addresses
> - events
> - person names (according to a list of known persons), 
> - etc… 
> 
> Apple does this in Mail, for instance (strangely, it is not generalized).
> 
> 
> So my questions are :
> - do we have something equivalent in Smalltalk/Pharo? (I didn’t find anything) 
> - if not, what strategy would you use?
> => I do really stupid text analysis (substrings, finding @, …, parsing 
> according to the text structure when there is one… kind of Soup parsing…)
> => I feel this is a job for PetitParser? And it would be a nice fit for the new 
> GToolkit.
> 
> All ideas or suggestions are welcome ;-)
> 
> 
> TIA,
> 
> Cédrick 
> 
> 
>
