On 26May2018 04:02, Subhabrata Banerjee <subhabangal...@gmail.com> wrote:
On Saturday, May 26, 2018 at 3:54:37 AM UTC+5:30, Cameron Simpson wrote:
It sounds like you want a more general purpose parser, and that depends upon
your purposes. If you're coding to learn the basics of breaking up text, what
you're doing is fine and I'd stick with it. But if you're just after the
outcome (tags), you could use other libraries to break up the text.

For example, the Natural Language ToolKit (NLTK) will do structured parsing of
text and return you a syntax tree, and it has many other facilities. Doco:

  http://www.nltk.org/

PyPI module:

  https://pypi.org/project/nltk/

which you can install with the command:

  pip install --user nltk

That would get you a tree structure of the corpus, which you could process more
meaningfully. For example, you could traverse the tree and tag higher level
nodes as you came across them, possibly then _not_ traversing their inner
nodes. The effect of that would be that if you hit the grammatical node:

  government of Mexico

you might tag that node with "ORGANISATION", and choose not to descend inside
it, thus avoiding tagging "government" and "of" and so forth because you have
a higher level tag. Nodes not specially recognised you would keep descending
into, tagging smaller things.
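
For example, here is a rough sketch of that traversal idea using NLTK's
named entity chunker. The chunk boundaries depend on NLTK's bundled model
(and ne_chunk is itself statistical under the hood), so treat this purely
as an illustration of the tag-and-skip traversal, not a finished tagger:

  import nltk

  # One-off data downloads, needed on first run:
  # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
  # nltk.download('maxent_ne_chunker'); nltk.download('words')

  def tag_nodes(text):
      tokens = nltk.word_tokenize(text)
      tree = nltk.ne_chunk(nltk.pos_tag(tokens))
      tags = []
      for node in tree:
          if isinstance(node, nltk.Tree):
              # A recognised chunk: tag the whole phrase and do not
              # descend into its individual words.
              phrase = " ".join(word for word, pos in node.leaves())
              tags.append((phrase, node.label()))
          else:
              # An ordinary (word, part-of-speech) pair.
              tags.append(node)
      return tags

  print(tag_nodes("The government of Mexico announced new tariffs."))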

Cheers,
Cameron Simpson

Dear Sir,

Thank you for your kind and valuable suggestions, and for your kind time.
I know NLTK and machine learning. I believe that if we use language
properly, we need machine learning the least.

I have similar beliefs: not that machine learning is not useful, but that it
tends to produce black boxes, because its categorisation rules are not overt;
rather, they tend to be side effects of weights in a graph.

So one might end up with a useful tool, but not understand how or why it works.

So, I am trying to design a tagger without the help of machine learning, by
simple Python coding. I have thus set aside the standard Parts of Speech
(PoS) and Named Entity (NE) tagging schemes.
I am trying to design a basic model which, if required, may be applied to
any one of these problems.
Detecting longer phrases is slightly a problem now; I am thinking of
employing re.search(pattern, text). If this part is done I do not need
machine learning. Maintaining so much data is a cumbersome issue in machine
learning.
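
For what it is worth, a minimal sketch of that re.search() idea (the helper
name and the whitespace handling here are assumptions, not from the
original post):

  import re

  def find_phrase(phrase, text):
      # Escape each word, join with \s+ so runs of whitespace (including
      # line breaks) still match, and anchor with \b word boundaries.
      words = (re.escape(w) for w in phrase.split())
      pattern = r'\b' + r'\s+'.join(words) + r'\b'
      return re.search(pattern, text, re.IGNORECASE)

  m = find_phrase("government of Mexico", "The Government of  Mexico said...")
  print(m.group(0) if m else "no match")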

NLTK is not machine learning (I believe). It can parse the corpus for you, emitting grammatical structures. So that would aid you in recognising words, phrases, nouns, verbs and so forth. With that structure you can then make better decisions about what to tag and how.
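
For example (the one-line chunk grammar below is a hand-written rule of my
own devising, not something built into NLTK):

  import nltk

  # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

  sentence = "The government of Mexico announced new tariffs."
  tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
  print(tagged)    # [('The', 'DT'), ('government', 'NN'), ...]

  # Group determiner/adjective/noun runs into noun phrases with a
  # hand-written rule - no training step involved.
  grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
  tree = nltk.RegexpParser(grammar).parse(tagged)
  tree.pprint()    # NP subtrees for "The government", "Mexico", ...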

Using the re module is a very hazard-prone way of parsing text. It can be
useful for finding fairly fixed text, particularly in machine-generated
text, but it is terrible for prose.
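
A quick illustration of the hazard: even a whitespace- and case-tolerant
pattern misses ordinary prose variation that a structural parse copes with
far better (the example strings are made up):

  import re

  pattern = re.compile(r'\bgovernment\s+of\s+Mexico\b', re.IGNORECASE)
  for text in ("The Government of  Mexico said...",            # matches
               "The governments of Mexico and Chile said...",  # plural: missed
               "Mexico's government said...",                  # reordered: missed
               "the government of the Republic of Mexico"):    # words between: missed
      print(bool(pattern.search(text)), text)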

Cheers,
Cameron Simpson <c...@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
