On Jul 12, 2010, at 6:46 PM, David Winsemius wrote:
On Jul 12, 2010, at 6:03 PM, harsh yadav wrote:
Hi,
I have a function in R that compares two very long strings for about 1
million records.
The strings are long URLs like:
http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8
...
or of even greater length.
The data-frame looks like:
id  url
1   http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8
2   http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse
3   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
4   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
5   http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt
and so on for about 1 million records.
Here is the function that I am using to compare the two strings:-
stringCompare <- function(currentURL, currentId){
  j <- currentId - 1
  while(j >= 1){
    # (this assignment was truncated in the post; assuming it fetches the URL
    #  at row j, e.g. from a character vector 'urls' holding the url column)
    previousURL <- urls[j]
    previousURLLength <- nchar(previousURL)
    # Compare the smaller string with the bigger one, up to the smaller length
    if(nchar(currentURL) <= previousURLLength){
      matchPhrase <- substr(previousURL, 1, nchar(currentURL))
      if(matchPhrase == currentURL){
        return(TRUE)
      }
    } else {
      matchPhrase <- substr(currentURL, 1, previousURLLength)
      if(matchPhrase == previousURL){
        return(TRUE)
      }
    }
    j <- j - 1
  }
  return(FALSE)
}
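The function is driven one row at a time over the whole data-frame (the
calling code wasn't included above); a sketch of that driver, with 'dat'
standing in for the actual data-frame and assuming the id column is simply
the row number as in the example, would be:

# Hypothetical driver (not in the original post): check every URL against
# all earlier URLs, one row at a time -- up to ~n^2/2 string comparisons in
# the worst case, which is why this gets slow near 1 million records.
urls <- dat$url                      # 'dat' is a stand-in for the real data-frame
hasEarlierPrefixMatch <- sapply(seq_along(urls),
                                function(i) stringCompare(urls[i], i))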
Couldn't you just store the "url" vector after running through
nchar and then do the comparison in a vectorized manner?
test <- rd.txt('id url
1 "http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8"
2 "http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse"
3 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
4 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
5 "http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt"',
  stringsAsFactors=FALSE)
copyUrls <- test[,"url"]
sizeUrls <- nchar(copyUrls)
lengU <- length(sizeUrls)
sizidx <- pmax(sizeUrls[1:(lengU-1)], sizeUrls[2:(lengU)])
substr(copyUrls[2:lengU], 1, sizidx) == substr(copyUrls[1:(lengU-1)], 1, sizidx)
#[1] FALSE FALSE TRUE FALSE
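(rd.txt is a local convenience helper, essentially read.table on a
textConnection, not part of base R; if you don't have something similar, a
plain base-R sketch of the same read -- same data and column names -- is:)

# Base-R stand-in for the rd.txt call above; the URLs are quoted so each
# one is read as a single character field.
test <- read.table(textConnection('id url
1 "http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8"
2 "http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse"
3 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
4 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
5 "http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt"'),
  header=TRUE, stringsAsFactors=FALSE)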
Let me hasten to admit that when I tried to fix what I thought was an
error in that program, I got the same result. It seemed as though I
should have been getting errors by choosing the maximum string
length. Changing the pmax to pmin did not alter the results ... to my
puzzlement ... until I further noticed that urls #3 and #4 were of the
same length. When I extend the lengths, then only the version using
pmin works properly.
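For completeness, the pmin version -- comparing each adjacent pair only up to
the length of the shorter of the two strings -- looks like this (a sketch,
reusing copyUrls/sizeUrls/lengU from above):

# Compare each URL with its predecessor only up to the shorter of the two
# lengths, so a longer string can never "match" beyond the shorter one's end.
sizidx <- pmin(sizeUrls[1:(lengU-1)], sizeUrls[2:lengU])
substr(copyUrls[2:lengU], 1, sizidx) == substr(copyUrls[1:(lengU-1)], 1, sizidx)
#[1] FALSE FALSE TRUE FALSE   # same result on this small example, as noted above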
--
David.
Here, I compare the URL at a given row with all the previous URLs in the
data-frame. I compare the smaller of the two given URLs with the larger one
(up to the length of the smaller).
When I run the above function for about 1 million records, the execution
becomes really slow, which otherwise is fast if I remove the
string-comparison step.
Any ideas how it can be implemented in a fast and efficient way?
Thanks and Regards,
Harsh Yadav
David Winsemius, MD
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.