Re: [R] splitting First 10 words in a string

Matevž Pavlič Tue, 02 Nov 2010 13:48:46 -0700

Hi Steven,


Thank you for the help. I get an error, though, when I do this:

 

>lit<-read.csv("litologija.csv", sep=";", dec=".")

>sent <-data.frame(sentence=lit$Opis,stringsAsFactors=FALSE)

>str(sent)

>sentV<-rep(sent,10)

>str(sentV)

 

>first=second=third=fourth=fifth=sixth=seventh=eighth=ninth=tenth<-vector(length=10)

>DF <-data.frame(Sentence=sent,first,second,third,fourth,fifth,sixth,seventh,eighth,ninth,tenth,stringsAsFactors=FALSE)

 

»Error in data.frame(Sentence = sent, first, second, third, fourth, fifth,  :
  arguments imply differing number of rows: 22928, 10«

 

What am I doing wrong?

 

Thanks, m

 

 

 

From: steven mosher [mailto:mosherste...@gmail.com] 
Sent: Tuesday, November 02, 2010 8:45 PM
To: David Winsemius
Cc: Matevž Pavlič; Gaj Vidmar; r-h...@stat.math.ethz.ch
Subject: Re: [R] spliting first 10 words in a string

 

Thanks, David.

 

Matevz, maybe I can help explain by taking a very simple, brute-force approach, as opposed to the way David did it. But you should learn his methods.

I will just do a subset of your problem; if you understand how it works, you should be able to get something done and then make it more elegant.

 

First, I simplify the problem by separating out the "sentence" column.

 

You can do this with your data frame by simply doing this:

 

MySentence <-data.frame(sentence=yourbigDF$Opis,stringsAsFactors=FALSE)

 

So I take your original data.frame (yourbigDF) and just create a copy of that one column, $Opis.

Later we can merge the two back together after I add 10 columns for the words.

 

 

Let's make some dummy data with just 10 rows:

 

 

 

 sentence<- "this is a sentence with ten words or maybe more than ten words"

 sentV<-rep(sentence,10)

# Now I have 10 rows of the same sentence.
# NEXT, because I am going to create 10 new columns of 10 rows, I create
# 10 vectors; each is named and each has 10 elements, one per row.
# They have NO DATA in them (vector(length=10) just gives logical FALSE values).

 

 
first=second=third=fourth=fifth=sixth=seventh=eighth=ninth=tenth<-vector(length=10)

 

# Next I create a data frame with Sentence in the first column and 10 blank columns.
# NOTE: I use stringsAsFactors=FALSE

 

DF <- data.frame(Sentence=sentV,first,second,third,fourth,fifth,sixth,seventh,eighth,ninth,tenth,stringsAsFactors=FALSE)

 

# This is what it would look like (the first row)

DF[1,]

 

                                                        Sentence first second third fourth fifth sixth seventh eighth ninth tenth
1 this is a sentence with ten words or maybe more than ten words FALSE  FALSE FALSE  FALSE FALSE FALSE   FALSE  FALSE FALSE FALSE

 

Next, I will show you how to assign the first ten words to the 10 blank columns:

 

DF[1,2:11]<-strsplit(DF[1,1]," ")[[1]][1:10]

 

# DF[1,2:11] selects columns 2-11 of the first row
# strsplit() splits the sentence on spaces; [[1]][1:10] takes the first 10 words, which are placed in columns 2-11

 

If you want to do this the slow way, you can just loop through your data frame row by row, or you can probably use apply.
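
A minimal sketch of how the same idea could be applied to every row at once instead of looping. The three-element `sentences` vector and the `word_cols` names are made up for illustration (the real column in this thread is lit$Opis). The word columns are built from the sentences themselves, so they always have one row per sentence; that is not the case when ten placeholder vectors of length 10 are combined with a 22928-row column.

```r
# Hypothetical stand-in for the real sentence column (lit$Opis in this thread)
sentences <- c("this is a sentence with ten words or maybe more than ten words",
               "a short one",
               "another sentence that has fewer than ten words")
DF <- data.frame(Sentence = sentences, stringsAsFactors = FALSE)

# Names for the ten new columns
word_cols <- c("first", "second", "third", "fourth", "fifth",
               "sixth", "seventh", "eighth", "ninth", "tenth")

# Split each sentence on spaces and keep words 1..10;
# sentences with fewer than ten words are padded with NA
first_ten <- t(sapply(strsplit(DF$Sentence, " "), function(w) w[1:10]))
colnames(first_ten) <- word_cols

# Attach the ten word columns; row counts match by construction
DF <- cbind(DF, as.data.frame(first_ten, stringsAsFactors = FALSE))
```

Afterwards DF[1, "first"] gives the first word of the first sentence, just as in the single-row version.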

 

Make more sense?

> DF[1,2:11]<-strsplit(DF[1,1]," ")[[1]][1:10]

> DF[1,]

                                                        Sentence first second third   fourth fifth sixth seventh eighth ninth tenth
1 this is a sentence with ten words or maybe more than ten words  this     is     a sentence  with   ten   words     or maybe  more

> DF[1,"first"]

[1] "this"

 

On Tue, Nov 2, 2010 at 12:22 PM, David Winsemius <dwinsem...@comcast.net> wrote:


On Nov 2, 2010, at 3:01 PM, Matevž Pavlič wrote:

Hi all,

Thanks for all the help. I managed to do it with what Gaj suggested (Excel :().

The last solution from David is also great; I just don't understand why R put the words in 14 columns and three rows?

 

Because the maximum number of words was 14 and the fill argument was TRUE. 
There were three rows because there were three items in the supplied character 
vector.

         

        I would like it to put just the first 10 words in the source field into 10 different destination fields, but in the same row. And so on... is that possible?

 

I don't know what a destination field might be. Those are not R data types.

This would trim the extra columns (in this example set to those greater than 8) 
by adding a lot of "NULL"'s to the end of a colClasses specification .... at 
the expense of a warning message which can be ignored:

> read.table(textConnection(words), fill=T, colClasses = c(rep("character", 8), 
> rep("NULL", 30) ) , stringsAsFactors=FALSE )


  V1    V2    V3      V4    V5    V6    V7      V8

1   I  have     a columnn  with  text  that     has

2   I would  like      to split these words      in

3 but  just first     ten words    in   the string.

Warning message:
In read.table(textConnection(words), fill = T, colClasses = c(rep("character",  :
  cols = 14 != length(data) = 38


If you want to assign the first column to a variable then just:
> first8 <- read.table(textConnection(words), fill=T, colClasses = 
> c(rep("character", 8), rep("NULL", 30) ) , stringsAsFactors=FALSE)
> var1 <- first8[[1]]
> var1
[1] "I"   "I"   "but"
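
A possible variation, sketched here on the assumption that the warning is unwanted: read all the columns with fill=TRUE and then keep only the first eight by indexing. The `text=` argument of read.table is used in place of textConnection(), and `all_words` is a name made up for this example. read.table determines the number of columns from the first five lines of input, so treat this as a sketch for small inputs like the one above.

```r
# Same three-element character vector as in the example above
words <- c("I have a columnn with text that has quite a few words in it.",
           "I would like to split these words in separate columns",
           "but just first ten words in the string. Is that possible in R?")

# Read every whitespace-separated word into its own column;
# fill = TRUE pads the shorter rows with empty strings
all_words <- read.table(text = words, fill = TRUE, stringsAsFactors = FALSE)

# Keep only the first eight columns; no colClasses padding is needed
first8 <- all_words[, 1:8]

# The first column as a plain character vector
var1 <- first8[[1]]
```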

-- 
David.

         

        
        Thank you, m
        -----Original Message-----
        From: r-help-boun...@r-project.org 
[mailto:r-help-boun...@r-project.org] On Behalf Of David Winsemius
        Sent: Tuesday, November 02, 2010 3:47 PM
        To: Gaj Vidmar
        Cc: r-h...@stat.math.ethz.ch
        Subject: Re: [R] spliting first 10 words in a string
        
        
        On Nov 2, 2010, at 6:24 AM, Gaj Vidmar wrote:

        Though <forbidden> in this list, in Excel it's just (literally!)
        five clicks
        away!
        (with the column in question selected)
        Data -> Text to Columns -> Delimited -> tick Space -> Finish
        Pa je! (~Voila in Slovenian)
        (then import back to R, keeping only the first 10 columns if so
        desired)

        
        You could do the same thing without needing to leave R. Just
        read.table( textConnection(..), header=FALSE, fill=TRUE)

        read.table(textConnection(words), fill=T)

          V1    V2    V3      V4    V5    V6    V7      V8       V9
        V10      V11   V12 V13 V14
        1   I  have     a columnn  with  text  that     has    quite
        a      few words  in it.
        2   I would  like      to split these words      in separate columns
        3 but  just first     ten words    in   the string.       Is    that
        possible    in  R?

        
        Regards,
        Assist. Prof. Gaj Vidmar, PhD
        University Rehabilitation Institute, Republic of Slovenia
        
        Irrelevant P.S. Long ago, before embarking on what eventually ended
        mainly
        in statistics,
        I did two years of geology, so (and also because of knowing what the
        poster's institute does)
        I even kinda imagine what these data are.
        
        "Matevž Pavlič" <matevz.pav...@gi-zrmk.si> wrote in message
        news:ad5ca6183570b54f92aa45ce2619f9b9d96...@gi-zrmk.si...

        Hi,
        
        I am sorry, will try to be more exact from now on...
        
        I have a data.frame with a field called Opis. It contains sentences that
        I would like to split into words, one per field in the data.frame. When I say
        columns I mean as in an Excel table. I would like to split "Opis" into ten
        fields holding the first ten words of the Opis field.
        Here is an example of my data.frame.
        
        'data.frame':   22928 obs. of  12 variables:
        $ VrtinaID        : int  1 1 1 1 2 2 2 2 2 2 ...
        $ ZapStev         : int  1 2 3 4 1 2 3 4 5 6 ...
        $ GlobinaOd       : num  0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
        $ GlobinaDo       : num  0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
        $ Opis            : Factor w/ 12754 levels "","(MIVKA) DROBEN MELJAST PESEK, GOST, SIVORJAV",..: 2060 11588 2477 11660 7539 3182 7884 9123 2500 4756 ...
        $ ACklasifikacija : Factor w/ 290 levels "","(CL)","(CL)/(SC)",..: 154 125 101 101 NA 106 125 80 106 101 ...
        $ GeolNastOd      : num  0 0.8 9.2 10.1 0 0.9 2.6 4.9 6.8 7.3 ...
        $ GeolNastDo      : num  0.8 9.2 10.1 11 0.9 2.6 4.9 6.8 7.3 8.2 ...
        $ GeolNastOpis    : Factor w/ 113 levels "","B. M. S.",..: 56 53 53 53 56 53 53 53 53 53 ...
        $ NacinVrtanjaOd  : num  0e+00 1e+09 1e+09 1e+09 0e+00 ...
        $ NacinVrtanjaDo  : num  1.1e+01 1.0e+09 1.0e+09 1.0e+09 1.0e+01 ...
        $ NacinVrtanjaOpis: Factor w/ 43 levels "","H. N.","IZKOP",..: 26 1 1 1 26 1 1 1 1 1 ...
        
        Hope that explains better...
        Thank you, m
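
One detail in the str() output above is worth a sketch: Opis is stored as a factor, and in current versions of R strsplit() rejects non-character input, so the column would need as.character() first. The two-level `opis` factor below is a made-up stand-in for the real column, and only the first three words are kept to keep the example short.

```r
# Hypothetical stand-in for the factor column lit$Opis
opis <- factor(c("DROBEN MELJAST PESEK, GOST, SIVORJAV", ""))

# Convert the factor to character before splitting
opis_chr <- as.character(opis)

# One row per description, one column per word (missing words become NA)
first_three <- t(sapply(strsplit(opis_chr, " "), "[", 1:3))
```

Reading the file with stringsAsFactors=FALSE, as done elsewhere in this thread for the data.frame() calls, would avoid the conversion step.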
        
        -----Original Message-----
        From: David Winsemius [mailto:dwinsem...@comcast.net]
        Sent: Monday, November 01, 2010 10:13 PM
        To: Matevž Pavlič
        Cc: r-help@r-project.org
        Subject: Re: [R] spliting first 10 words in a string
        
        
        On Nov 1, 2010, at 4:39 PM, Matevž Pavlič wrote:

        Hi all,
        
        
        
        I have a columnn with text that has quite a few words in it. I would
        like to split these words in separate columns, but just first ten
        words in the string. Is that possible in R?
        
        

        
        Not sure what a column means to you. It's not a precisely defined R
        type or class. (And you are requested to offer a concrete example
        rather than making us guess.)

        words <- "I have a columnn with text that has quite a few words in it. I would like to split these words in separate columns, but just first ten words in the string. Is that possible in R?"

        strsplit(words, " ")[[1]][1:10]

        [1] "I"       "have"    "a"       "columnn" "with"    "text"    "that"    "has"     "quite"   "a"
        
        
        Or if in a dataframe:

        words <- c("I have a columnn with text that has quite a few words in it.",
                   "I would like to split these words in separate columns",
                   "but just first ten words in the string. Is that possible in R?")

        worddf <- data.frame(words=words, stringsAsFactors=FALSE)

        t(sapply(strsplit(worddf$words, " "), "[", 1:10) )

             [,1]  [,2]    [,3]    [,4]      [,5]    [,6]    [,7]    [,8]      [,9]       [,10]
        [1,] "I"   "have"  "a"     "columnn" "with"  "text"  "that"  "has"     "quite"    "a"
        [2,] "I"   "would" "like"  "to"      "split" "these" "words" "in"      "separate" "columns"
        [3,] "but" "just"  "first" "ten"     "words" "in"    "the"   "string." "Is"       "that"
        
        
        -- 
        David Winsemius, MD
        West Hartford, CT
        
        ______________________________________________
        R-help@r-project.org mailing list
        https://stat.ethz.ch/mailman/listinfo/r-help
        PLEASE do read the posting guide
        http://www.R-project.org/posting-guide.html
        and provide commented, minimal, self-contained, reproducible code.

        