Re: [R] Question on Stopword Removal from a Cyrillic (Bulgarian)Text

2013-04-10 Thread Ventseslav Kozarev
I just wanted to confirm that Milan's suggestion about adding (*UCP), as in the example below:
gsub(sprintf("(*UCP)\\b(%s)\\b", "който"), "", "който", perl=TRUE)
solved all problems (under openSUSE Linux 12.3 64-bit, R 2.15.2). I re-encoded the input files and the stop word list in UTF-8, and now stop
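For readers following along, here is a minimal sketch of the same (*UCP) pattern applied to a whole sentence rather than to the single word. It assumes an R session running in a UTF-8 locale; the variable names (stop_word, text, pattern, cleaned) are made up for illustration, and the sentence is the one from the reproducible example posted earlier in the thread. It is not the code the tm package runs internally.

```r
# Hypothetical names; assumes a UTF-8 session.
stop_word <- "който"
text <- "Днес е хубав и слънчев ден, в който всички искат да бъдат навън."

# (*UCP) asks PCRE to use Unicode properties for \b and \w,
# so word boundaries are also recognised around Cyrillic letters.
pattern <- sprintf("(*UCP)\\b(%s)\\b", stop_word)

# perl = TRUE is required; (*UCP) is a PCRE-only directive.
cleaned <- gsub(pattern, "", text, perl = TRUE)
print(cleaned)
```

Several stop words could be handled the same way by joining them with paste(words, collapse = "|") before passing them to sprintf(), which is again only an assumption about how one might extend it.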

Re: [R] Question on Stopword Removal from a Cyrillic (Bulgarian)Text

2013-04-10 Thread Milan Bouchet-Valat
On Wednesday, 10 April 2013 at 13:17 +0200, Ingo Feinerer wrote:
> On Wed, Apr 10, 2013 at 10:29:27AM +0200, Milan Bouchet-Valat wrote:
> > Thanks for the reproducible example. Indeed, it does not work here
> > either (Linux with UTF-8 locale). The problem seems to be in the call to
> > gsub() in r

Re: [R] Question on Stopword Removal from a Cyrillic (Bulgarian)Text

2013-04-10 Thread Ventseslav Kozarev, MPP
Thank you so much! You made it look (almost) so easy. I greatly appreciate it!
On 10.4.2013 at 11:29, Milan Bouchet-Valat wrote:
On Wednesday, 10 April 2013 at 10:50 +0300, Ventseslav Kozarev, MPP wrote:
Hi, Thanks for taking the time. Here is a more reproducible example of the entire proc

Re: [R] Question on Stopword Removal from a Cyrillic (Bulgarian)Text

2013-04-10 Thread Milan Bouchet-Valat
On Wednesday, 10 April 2013 at 10:50 +0300, Ventseslav Kozarev, MPP wrote:
> Hi,
>
> Thanks for taking the time. Here is a more reproducible example of the
> entire process:
>
> # Creating a vector source - stupid text in the Bulgarian language
> bg<-c('Днес е хубав и слънчев ден, в който всички

Re: [R] Question on Stopword Removal from a Cyrillic (Bulgarian)Text

2013-04-10 Thread Ventseslav Kozarev, MPP
Hi,
Thanks for taking the time. Here is a more reproducible example of the entire process:
# Creating a vector source - stupid text in the Bulgarian language
bg<-c('Днес е хубав и слънчев ден, в който всички искат да бъдат навън.','Утре ще бъде още по-хубав ден.')
# Converting strings from

Re: [R] Question on Stopword Removal from a Cyrillic (Bulgarian)Text

2013-04-09 Thread Milan Bouchet-Valat
On Tuesday, 09 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> Hi,
>
> I bumped into a serious issue while trying to analyse some texts in
> Bulgarian language (with the tm package). I import a tab-separated csv
> file, which holds a total of 22 variables, most of which are text cells

[R] Question on Stopword Removal from a Cyrillic (Bulgarian)Text

2013-04-09 Thread Ventseslav Kozarev, MPP
Hi, I bumped into a serious issue while trying to analyse some texts in Bulgarian language (with the tm package). I import a tab-separated csv file, which holds a total of 22 variables, most of which are text cells (not factors), using the read.delim function: data<-read.delim("bigcompanies_