Here is one way to parse the data. I used only the lines from your email
to show the approach; you can apply the same steps to your complete vector
of file names:
> x <- readLines(textConnection("C:\\Program Files\\R\\20news18828/talk.politics.guns/54215
+ C:\\Program Files\\R\\20news18828/talk.politics.guns/54216
+ C:\\Program Files\\R\\20news18828/talk.politics.guns/54217
+ C:\\Program Files\\R\\20news18828/talk.politics.misc/178341
+ C:\\Program Files\\R\\20news18828/talk.politics.misc/178342
+ C:\\Program Files\\R\\20news18828/talk.politics.misc/178343
+ C:\\Program Files\\R\\20news18828/talk.politics.mideast/75964
+ C:\\Program Files\\R\\20news18828/talk.politics.mideast/75965"))
> # parse the data with 'strsplit' (split at the '/' character)
> x.parsed <- strsplit(x, '/')
> # now create a matrix: the first column is the directory, the second the file
> # each element of 'x.parsed' has 3 parts and we only want the last two, 'c(2,3)'
> x.names <- t(sapply(x.parsed, '[', c(2,3)))
>
> x.names
     [,1]                    [,2]
[1,] "talk.politics.guns"    "54215"
[2,] "talk.politics.guns"    "54216"
[3,] "talk.politics.guns"    "54217"
[4,] "talk.politics.misc"    "178341"
[5,] "talk.politics.misc"    "178342"
[6,] "talk.politics.misc"    "178343"
[7,] "talk.politics.mideast" "75964"
[8,] "talk.politics.mideast" "75965"
>
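
If you would rather skip 'strsplit', the same idea can be sketched with
'dirname' and 'basename' applied to the output of 'list.files'. This is
only an illustration under assumptions: the folder below is a small
stand-in built in a temporary directory, and the names 'root', 'groups'
and 'classes' are made up for the example, not taken from your data.

```r
# Build a tiny stand-in for the newsgroup folder in a temporary directory
# (hypothetical layout: <root>/<class folder>/<document file>).
root <- file.path(tempdir(), "20news-demo")
groups <- c("talk.politics.guns", "talk.politics.misc")
for (g in groups) {
  dir.create(file.path(root, g), recursive = TRUE, showWarnings = FALSE)
  writeLines("some text", file.path(root, g, "1001"))
  writeLines("more text", file.path(root, g, "1002"))
}

# List every file below 'root'; full.names = TRUE keeps the folder in the path.
files <- list.files(root, recursive = TRUE, full.names = TRUE)

# The class is the name of the folder that directly contains each file.
doc.class <- basename(dirname(files))
doc.id <- basename(files)

# One row per document. The CLASS column can be attached to the dtm only
# if the dtm rows are in the same order as 'files' (an assumption to check).
classes <- data.frame(doc = doc.id, CLASS = doc.class,
                      stringsAsFactors = FALSE)
print(classes)
```

The 'pattern' argument of 'list.files' filters on the file names
themselves, so it should not be needed just to pull out the folder part.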
2010/4/4 MeLiS MeLiS <[email protected]>
> Hello again,
>
> I tried what you sent me and I get:
> ...
>
> [15742] "C:\\Program Files\\R\\20news18828/talk.politics.guns/54215"
> [15743] "C:\\Program Files\\R\\20news18828/talk.politics.guns/54216"
> [15744] "C:\\Program Files\\R\\20news18828/talk.politics.guns/54217"
> ...
> [17608] "C:\\Program Files\\R\\20news18828/talk.politics.misc/178341"
> [17609] "C:\\Program Files\\R\\20news18828/talk.politics.misc/178342"
> [17610] "C:\\Program Files\\R\\20news18828/talk.politics.misc/178343"
> ...
> [16602] "C:\\Program Files\\R\\20news18828/talk.politics.mideast/75964"
> [16603] "C:\\Program Files\\R\\20news18828/talk.politics.mideast/75965"
> ...
> This is the closest thing to what I need. I only need to take the
> "talk.politics.guns", "talk.politics.misc" and "talk.politics.mideast" parts
> into the list for the example above.
> This help document (
> http://127.0.0.1:29974/library/base/html/list.files.html) mentions
> "pattern". Do I need to use this to achieve what I want? I really did
> not understand how to use it.
>
> ------------------------------
> Date: Sun, 4 Apr 2010 12:43:58 -0400
>
> Subject: Re: [R] How to add a column to dtm showing a part from directory
> source?
> From: [email protected]
> To: [email protected]
>
> You can use 'list.files(startPath, recursive=TRUE)' to get a list of all
> the file names and then strip off the paths to create the data that you
> need. Is this what you want to do?
>
> 2010/4/4 MeLiS MeLiS <[email protected]>
>
>
>        word1  word2  word3  ...  CLASS
> doc1                             comp.graphics
> doc2                             rec.autos
> doc3                             rec.motorcycles
> ...                              ...
> This is basically my dtm. I will apply a classification algorithm later to
> categorize newly arriving txt documents, so many of the existing ones will be
> used for machine learning. I have a folder called 20news-18828, and this
> folder includes 20 subfolders, some of which are comp.graphics, rec.autos,
> rec.motorcycles, etc. These subfolders include thousands of txt files.
> After running some algorithms I created the dtm showing the most used words
> in the txt files, as you may guess. Now I have to add a column called
> "CLASS". The class column should tell me which subfolder doc1 is in.
> I hope this helps you understand.
> ------------------------------
> Date: Sun, 4 Apr 2010 12:07:56 -0400
> Subject: Re: [R] How to add a column to dtm showing a part from directory
> source?
> From: [email protected]
> To: [email protected]
>
>
> I would like to help, but it is not clear what you are asking for since
> there is no example of what you might want in the "dtm" (whatever that is
> supposed to be). What do you mean by the "class" information? An example
> would be helpful. You can recursively go down the subfolders extracting
> information, you just need to tell us what the information is.
>
> On Sun, Apr 4, 2010 at 11:04 AM, Melis Mete <[email protected]> wrote:
>
>
> Hello Experts,
>
> I'm new to R and having trouble with my graduation project. I have 20
> subfolders containing almost 20000 txt files. What I need to do is create a
> dtm and add a column to it showing the "class" information of the txt files.
> My directory source is like "C:\\R\\20news-18828\\comp.graphics" for the
> comp.graphics subfolder. I need only the "comp.graphics" part to appear in
> the CLASS column. Please help...
>
> --
> View this message in context:
> http://n4.nabble.com/How-to-add-a-column-to-dtm-showing-a-part-from-directory-source-tp1750923p1750923.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?