Hi again
I've now had the chance to try this out, and using scan() doesn't seem
to work either.
This is what I used:
1) I generated a plain text file called stopDict.txt. This file is of
the format: "a, bunch, of, words, to, use"
2) I invoked scan(), like this:
> userStopList <- scan(text = '~/path/to/stopDict.txt', what = " ", sep = ",")
3) Then I used the externally generated list as stop words:
> docs <- tm_map(docs, removeWords, userStopList)
4) When I inspect the document, at least two of the user-defined
stop words are still in the text.
Is there a further argument I should be passing to scan(), or is the
stopDict.txt file not set up the correct way? I tried putting each term
in single quotes, separated by commas (e.g. 'all', 'the', 'text'), but
that didn't work. Nor does it work when the whole list is enclosed
within double quotes (e.g. "all, the, text").
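
One possible explanation, offered as a sketch rather than a confirmed
diagnosis: scan() treats its 'text' argument as the data itself, so a
path passed that way is read as a single item, whereas 'file' reads the
file's contents. With a ", " separated file, strip.white = TRUE also
drops the space after each comma. The temporary file below is a
hypothetical stand-in for stopDict.txt.

```r
# Hypothetical stand-in for stopDict.txt, written to a temporary file
stop_file <- tempfile(fileext = ".txt")
writeLines("a, bunch, of, words, to, use", stop_file)

# text = treats the string itself as the data, so the path comes back as one item
wrong <- scan(text = stop_file, what = "", sep = ",", quiet = TRUE)

# file = reads the file contents; strip.white removes the blank after each comma
userStopList <- scan(file = stop_file, what = "", sep = ",",
                     strip.white = TRUE, quiet = TRUE)

print(wrong)
print(userStopList)
```

Printing both objects shows whether the words or just the path ended up
in the vector that is handed to removeWords.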
While it's not critical to be able to read in an externally
generated list, it sure would be helpful.
Thanks.
Sun
On 02/03/15 07:36, Sun Shine wrote:
Thanks Jim.
I thought that I was passing a vector, not realising I had converted
this to a list object.
I haven't come across the scan() function so far, so this is good to
know.
Good explanation - I'll give this a go when I can get back to that
piece of work later today.
Thanks again.
Regards,
Sun
On 01/03/15 21:13, jim holtman wrote:
The 'read.table' call was creating a data.frame (not a vector), and applying
'c' to it converted it to a list. You should always look at the object
you are creating. You probably want to use 'scan'.
======================
testFile <-
"Although,this,query,applies,specifically,to,the,tm,package"
# reading with read.table creates a data.frame
df_words <- read.table(text = testFile, sep = ',')
df_words # not a vector
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 Although this query applies specifically to the tm package
c(df_words) # this results in a list
$V1
[1] Although
Levels: Although
$V2
[1] this
Levels: this
$V3
[1] query
Levels: query
$V4
[1] applies
Levels: applies
$V5
[1] specifically
Levels: specifically
$V6
[1] to
Levels: to
$V7
[1] the
Levels: the
$V8
[1] tm
Levels: tm
$V9
[1] package
Levels: package
# now read with 'scan'
scan_words <- scan(text = testFile, what = '', sep = ',')
Read 9 items
scan_words
[1] "Although"     "this"         "query"        "applies"      "specifically" "to"
[7] "the"          "tm"           "package"
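
If the words have already been read with read.table, a minimal sketch
of flattening that one-row data.frame into a plain character vector
(base R only; the name 'word_vec' is just illustrative) could look like
this:

```r
testFile <- "Although,this,query,applies,specifically,to,the,tm,package"

# one-row data.frame, one column per word
df_words <- read.table(text = testFile, sep = ',')

# convert each column to character (works whether the columns are
# factors or character), then drop the V1..V9 names
word_vec <- unname(vapply(df_words, as.character, character(1)))

is.character(word_vec)   # a character vector, not a list
length(word_vec)         # one element per word
```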
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Sat, Feb 28, 2015 at 8:46 AM, Sun Shine <phaedr...@gmail.com> wrote:
Hi list
Although this query applies specifically to the tm package, perhaps it's
something that others might be able to lend a thought to.
Using tm to do some initial text mining, I want to include an external
(to R) generated dictionary of words that I want removed from the corpus.
I have created a comma separated list of terms in " " marks in a
stopList.txt plain UTF-8 file. I want to read this into R, so do:
stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')
When I want to load it as part of the removeWords function in tm, I do:
docs <- tm_map(docs, removeWords, stopDict)
which has no effect. Neither does:
docs <- tm_map(docs, removeWords, c(stopDict))
What am I not seeing or doing?
How do I pass a text file with pre-defined terms to the removeWords
transform of tm?
Thanks for any ideas.
Cheers
Sun
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.