Hi,

I have a function in R that compares pairs of very long strings across
about 1 million records.

The strings are very long URLs, such as:


http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8...

or even longer.

The data frame looks like this:

id  url
1   http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8...
2   http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse
3   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N...
4   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
5   http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt

and so on for about 1 million records.
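
To make the layout concrete, here is a minimal sketch of a toy data frame with the same two columns. The example.com URLs are made up for illustration and are not the real data; only the column names `id` and `url` and the name `urlDataFrame` come from my code.

```r
# Toy stand-in for urlDataFrame; the URLs below are invented placeholders.
urlDataFrame <- data.frame(
  id  = 1:5,
  url = c("http://example.com/search?query=visa",
          "http://example.com/search?query=visa&page=2",
          "http://example.com/news",
          "http://example.org/",
          "http://example.com/search"),
  stringsAsFactors = FALSE  # keep url as character so nchar() and substr() apply
)

str(urlDataFrame)
```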

Here is the function I am using to compare the strings:

stringCompare <- function(currentURL, currentId) {
  # Walk backwards over every row that precedes the current one
  j <- currentId - 1
  while (j >= 1) {
    previousURL <- urlDataFrame[j, "url"]
    previousURLLength <- nchar(previousURL)
    # Compare the shorter URL with the start of the longer one
    if (nchar(currentURL) <= previousURLLength) {
      matchPhrase <- substr(previousURL, 1, nchar(currentURL))
      if (matchPhrase == currentURL) {
        return(TRUE)
      }
    } else {
      matchPhrase <- substr(currentURL, 1, previousURLLength)
      if (matchPhrase == previousURL) {
        return(TRUE)
      }
    }
    j <- j - 1
  }
  return(FALSE)
}

Here, I compare the URL at a given row with all the previous URLs in the
data frame. For each pair, I compare the shorter of the two URLs with the
longer one (up to the length of the shorter).
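
The rule for a single pair can be sketched on its own as follows. The helper name `isPrefixPair` and the example.com URLs are hypothetical, introduced only for illustration; the logic is meant to mirror the two branches of the function above.

```r
# isPrefixPair is a hypothetical helper name, not part of my actual code.
# It returns TRUE when the shorter string equals the start of the longer one.
isPrefixPair <- function(a, b) {
  n <- min(nchar(a), nchar(b))          # length of the shorter string
  substr(a, 1, n) == substr(b, 1, n)    # compare only the first n characters
}

isPrefixPair("http://example.com/search",
             "http://example.com/search?query=visa")   # TRUE
isPrefixPair("http://example.com/news",
             "http://example.com/search")              # FALSE
```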

When I run the above function on about 1 million records, execution
becomes really slow; it is fast if I remove the string comparison step.
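
As a rough count of the work involved, assuming the function is called once for every row and no match is found early (so each row is compared with every row before it), the number of string comparisons grows quadratically. The variable names below are only for this illustration.

```r
# Worst-case number of pairwise comparisons for n rows,
# assuming one call per row and no early return.
n <- 1e6
worstCase <- n * (n - 1) / 2
format(worstCase, big.mark = ",", scientific = FALSE)  # "499,999,500,000"
```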

Any ideas on how this can be implemented in a fast and efficient way?

Thanks and Regards,
Harsh Yadav


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
