Re: [R] How do I use R to build a dictionary of proper nouns?

Boris Steipe Mon, 08 May 2017 02:10:20 -0700

Your workflow is not clear to me, so I can't give any specific advice.

1: I don't understand what you need. Do you need the column names changed? They 
correspond to the matched words.


2: How was the vector dictionary_word created? These are (mostly) stemmed 
nouns, but some of them are two or even three words. Did you do this by hand? 
But this also contains "cmp" which is not a stemmed word, or "particl", or 
"recoveri" which is not correctly stemmed. This doesn't look promising. I think 
you will at least need to place hyphens between the words, but since you are 
using stemmed words this will be difficult. 

3: Since the default tokenizer is "words", I think the two-word and three-word 
elements of the dictionary_word vector will not be found. They don't exist as 
tokens.

4: Don't use "list" as a variable name.
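
To illustrate point 3, here is a minimal base-R sketch. It does not call tm; 
it only mimics a single-word tokenizer by splitting on whitespace, and the 
text in "doc" is made up for the example. The shortened dictionary_word vector 
is likewise just for illustration.

```r
# Toy stemmed text; the real corpus_tm is not available here (assumption).
doc <- "polish pad condit system with slurri flow rate control and pad materi"

# Split on whitespace, mimicking a tokenizer that produces single words.
tokens <- strsplit(doc, "[[:space:]]+")[[1]]

# A shortened dictionary with one-word and multi-word entries.
dictionary_word <- c("pad", "slurri", "pad materi", "slurri flow rate")

# One-word entries can occur as tokens; multi-word entries never can,
# because no single token contains a space.
found <- dictionary_word %in% tokens
names(found) <- dictionary_word
found
```

Both multi-word entries come back FALSE even though the phrases are present 
in the text, which is what I would expect to happen inside 
DocumentTermMatrix() when it tokenizes by words.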

In summary - I think your problems have to do with stemming and tokenizing and 
not with formatting the output of DocumentTermMatrix(). I don't think tm has 
functions to produce stemmed multi-word tokens like the elements in your 
dictionary_word vector. You may need to do the analysis with your own 
functions, using regular expressions.
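
A rough sketch of what such a function could look like, in base R only. The 
documents in "docs", the helper count_term() and the result "freq" are 
hypothetical names made up for this example, and the sketch assumes the 
dictionary terms contain no regular-expression metacharacters.

```r
# Toy stemmed documents standing in for the patent corpus (hypothetical data).
docs <- c(doc1 = "pad condit system monitor slurri flow rate and pad materi",
          doc2 = "cmp slurri ph valu measur with ph sensor")

# Shortened dictionary with one-word and multi-word terms.
dictionary_word <- c("pad", "pad materi", "slurri", "slurri flow rate",
                     "ph sensor", "cmp")

# Count the matches of one term in each document.
# \\b anchors the term at word boundaries, so "pad" does not match "padding".
count_term <- function(term, texts) {
  pattern <- paste0("\\b", term, "\\b")
  hits <- gregexpr(pattern, texts, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positions > 0.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One row per document, one column per dictionary term.
freq <- sapply(dictionary_word, count_term, texts = docs)
rownames(freq) <- names(docs)
freq
```

Note that a short term is also counted where it occurs inside a longer one: 
"pad" is counted twice in doc1, once on its own and once as part of 
"pad materi". Whether that is what you want depends on your analysis.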


B.


> On May 8, 2017, at 3:56 AM, θ ＂ <yarmi1...@hotmail.com> wrote:
> 
> Hi Steipe:
> Thanks for your recommendation.
> I tried the DocumentTermMatrix function of the tm package. 
> But I would prefer the resulting matrix to show the frequency of each 
> dictionary word.
> Is there any way to do that?
> Here are my code and result:
> 
> dictionary_word <- c("neutral", "abras particl", "acid", "apparatus",
>   "back film", "basic", "carrier", "chemic", "chromat confoc",
>   "clean system", "cmp", "compens type", "compress", "comsum",
>   "control system", "down pressur", "dresser condition", "detect system",
>   "flow rate control", "fractal type", "groov", "hard", "improv type",
>   "infrar", "laser confoc", "layer", "measur system", "micro stuctur",
>   "monitor system", "multi layer", "none-por", "nonwoven pad", "pad",
>   "pad applic", "pad condit system", "pad materi", "pad properti",
>   "pad structur", "ph sensor", "planet type", "plate", "plat",
>   "poisson ratio", "polish head", "polish system", "polym pad",
>   "polyurethan pad", "porous", "process paramet", "process path",
>   "process time", "recoveri", "rotat speed", "rough", "scatter",
>   "semiconductor cmp", "sensor", "signal acceptor", "singl layer", "slurri",
>   "slurri flow rate", "slurri ph valu", "slurri stirrer",
>   "slurri suppli system", "slurri temperatur", "slurri weight percentag",
>   "storag cmp", "stylus profil", "substrat cmp", "thick", "transfer robot",
>   "ultrason", "urethan pad", "wafer cassett", "wafer transfer system",
>   "white light interferomet", "young modulus")
> 
> list<-inspect(DocumentTermMatrix(corpus_tm,
>                                  list(weighting =weightTf,
>                                       dictionary = dictionary_word)))
> 
> <keywords of dictionary.PNG>
> 
> 
> From: Boris Steipe <boris.ste...@utoronto.ca>
> Sent: May 5, 2017, 4:39 PM
> To: θ ＂
> Cc: r-help@r-project.org
> Subject: Re: [R] How do I use R to build a dictionary of proper nouns?
>  
> Did you try using the table() function, possibly in combination with sort() 
> or rank()?
> 
> 
> Consider:
> 
> myNouns <- c("proper", "nouns", "domain", "ontology", "dictionary",
>              "dictionary", "corpus", "patent", "files", "proper", "nouns",
>              "word", "frequency", "file", "preprocess", "corpus", "proper",
>              "nouns", "domain", "ontology", "idea", "nouns", "dictionary",
>              "dictionary", "corpus", "attachments", "texts", "corpus",
>              "preprocesses", "proper", "nouns")
> 
> myNounFrequencies <- table(myNouns)
> myNounFrequencies
> 
> myNounFrequencies <- sort(myNounFrequencies, decreasing = TRUE)
> myNounFrequencies
> 
> which(names(myNounFrequencies) == "corpus")
> 
> 
> 
> 
> 
> > On May 5, 2017, at 1:58 AM, θ ＂ <yarmi1...@hotmail.com> wrote:
> > 
> > θ ＂ has shared OneDrive files with you. To view the files, click the links below.
> > 
> > 2.corpus_patent text.PNG <https://1drv.ms/u/s!Aq27nOPOP5izgVRRxXomVBv0YV0j>
> > 3ontology_proper nouns keywords.PNG <https://1drv.ms/u/s!Aq27nOPOP5izgVURiS7MbYH6hJzo>
> > 1.patents.PNG <https://1drv.ms/u/s!Aq27nOPOP5izgVYuRVxM1OyzIPzF>
> > 
> > Hi:
> > 
> > I want to do text mining of patents in R.
> > I need to use the proper nouns of a domain ontology to build a dictionary,
> > and then use the dictionary to analyze my corpus of patent files.
> > I want to count the proper nouns and get the frequency with which each one 
> > appears in each file.
> > 
> > So far I have preprocessed the corpus and extracted the proper nouns 
> > from the domain ontology.
> > But I have no idea how to build a proper-noun dictionary and use the 
> > dictionary to analyze my corpus.
> > 
> > The attachments are my texts, corpus preprocessing steps, and proper nouns.
> > 
> > Thanks.
> > 
> > 
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> 
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.

