> You say that named entity recognition is not generalised beyond Mail,
> but the support library is there for anyone to use. See for example
> https://developer.apple.com/documentation/foundation/nslinguistictagger/identifying_people_places_and_organizations

Yes true.
> In Python, you can use NLTK to do roughly the same.
>
> There's no real point in reimplementing this stuff in Pharo.
> Just set up a separate process, send text to it, and receive
> results back.

I agree, that is an excellent option.

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources <http://nltk.org/nltk_data/> such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum <http://groups.google.com/group/nltk-users>."

Thanks for pointing out NLTK. Great tool for sure.

I agree that there is no point in reimplementing, but in the case of NLP it might be worth it, as I think we already have all the good foundations (basic string manipulation, PP, Zn, XML, etc.). The hard part is probably the integration of 100 corpora and lexical resources (http://www.nltk.org/nltk_data/)...

The thing is, I need a lighter version, based on my own (growing) corpora (my experience). So Hernan's solution, PP, or straight string processing would do the job.

But thanks, this is something I will explore too (especially the huge resources).

Cheers,

Cédrick

> On Thu, 7 Mar 2019 at 22:53, Cédrick Béler <cdric...@gmail.com> wrote:
> Hi all,
>
> I often need to analyse random unstructured text (in emails, for instance)
> to discover (structured) information, and to extract:
> - emails
> - telephone numbers
> - addresses
> - events
> - person names (according to a list of known persons),
> - etc…
>
> Apple does it in Mail, for instance (strangely, this is not generalized).
>
> So my questions are:
> - do we have something equivalent in Smalltalk/Pharo? (I didn't find anything)
> - if not, what strategy would you use?
> => I do really stupid text analysis (substrings, finding @, …, parsing
> according to the text structure when there is one… a kind of Soup parsing…)
> => I feel this is a job for PetitParser? And it would be a nice fit for the new
> GToolkit.
>
> All ideas or suggestions are welcome ;-)
>
> TIA,
>
> Cédrick
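
For reference, the "straight string processing" approach from the thread can be sketched in a few lines of Python (the language of NLTK, used here only for illustration). This is a minimal sketch under stated assumptions, not anyone's actual implementation: the function name `extract_entities`, the patterns `EMAIL_RE` and `PHONE_RE`, and the `known_persons` parameter are all hypothetical, and the regular expressions are deliberately loose rather than standards-compliant.

```python
import re

# Assumption: loose illustrative patterns, not RFC-grade validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", a digit, at least six digits or separators, then a final digit.
PHONE_RE = re.compile(r"\+?\d[\d .()-]{6,}\d")


def extract_entities(text, known_persons=()):
    """Return emails, phone-like strings, and known person names found in text.

    known_persons is a hypothetical caller-supplied list of names; matching
    is a plain case-insensitive substring test, nothing smarter.
    """
    lowered = text.lower()
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "persons": [name for name in known_persons if name.lower() in lowered],
    }


if __name__ == "__main__":
    sample = "Call Alice Martin on +33 6 12 34 56 78 or write to alice@example.org."
    print(extract_entities(sample, ["Alice Martin", "Bob Durand"]))
```

Addresses and events are much harder to catch with patterns like these, which is where a grammar-based tool such as PetitParser or a trained tagger would come in.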