Guys,
I used Random Forest with a couple of data sets I had to predict a binary
response. In all the cases, the AUC on the training set comes out to be 1.
Is this always the case with random forests? Can someone please clarify
this?
I have given a simple example, first using logistic regression.
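A minimal sketch of the distinction, with made-up data (the randomForest
and pROC packages are assumed): predictions on the training rows look
nearly perfect, while the out-of-bag (OOB) votes give the honest estimate.

library(randomForest)
library(pROC)

set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
y <- factor(rbinom(200, 1, plogis(x[, 1])))
fit <- randomForest(x, y)

## AUC on the training rows: typically ~1, since every tree has seen
## most of these rows
auc(y, predict(fit, x, type = "prob")[, 2])

## AUC from the OOB votes: the honest estimate, usually well below 1
auc(y, fit$votes[, 2])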
Thanks Max and Andy. If the Random Forest is always giving an AUC of 1,
isn't it overfitting? If not, how do you differentiate this from
overfitting? I believe random forests are claimed to never overfit (per
the following link).
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home
I am trying to use the which function to obtain the index of a value in a
dataframe. Depending on whether the value is present in the dataframe or
not, I am performing further operations on the dataframe.
However, if the value is not present in the dataframe, I get
integer(0).
How do I check for this?
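A minimal sketch, with made-up data: when the value is absent, which()
returns a zero-length vector, so test its length instead of comparing the
result to a value.

df <- data.frame(a = c(1, 2, 3))
idx <- which(df$a == 5)
if (length(idx) == 0) {
  ## value not present in the dataframe
} else {
  ## value present; idx holds its position(s)
}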
Thank you. It works fine.
I am trying to do binning on three variables (3D binning). The bin
boundaries are specified by the user separately for each variable. I used
the bin2 function in the 'ash' package for 2D binning, which involves only
two variables, but didn't find any package for similar binning with three
variables.
Are there any packages that can do this?
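A hedged sketch (not from the thread) of one base-R approach: cut() each
variable with its own user-specified breaks and cross-tabulate; the data
and breaks below are made up.

x <- runif(1000); y <- runif(1000); z <- runif(1000)
bx <- c(0, 0.3, 0.7, 1)
by <- c(0, 0.5, 1)
bz <- c(0, 0.25, 0.5, 1)

counts3d <- table(cut(x, bx), cut(y, by), cut(z, bz))
counts3d   # a 3 x 2 x 3 array of bin counts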
I am trying to fit gamma and exponential distributions using the fitdist
function in the "fitdistrplus" package to the data I have and obtain the
parameters along with the AIC values of the fit. However, I am getting
errors with both distributions. I have given a reproducible example with
the errors I get.
There was a small error in the data creation step and I have fixed it as
below:
test <- c(895.1358,2915.7447,335.5472,1470.4022,194.5461,1814.2328,
1056.3067,3110.0783,11441.8656,142.1714,2136.0964,1958.9022,
891.89,352.6939,1341.7042,167.4883,2502.0528,1742.1306,
837.1481,867.8533,3590.4308,1125
Joshua, thanks for your reply.
I have tried out the following scaling and it seems to work fine:
scaledVariable <- (test-min(test)+0.001)/(max(test)-min(test)+0.002)
The gamma distribution parameters are obtained using the scaled variable,
and samples obtained from this distribution are scaled back to the
original range.
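A sketch of the full scale-fit-rescale flow described above, assuming
'fitdistrplus' is loaded and 'test' is the data vector from earlier in the
thread:

library(fitdistrplus)

scaled <- (test - min(test) + 0.001) / (max(test) - min(test) + 0.002)
fitGamma <- fitdist(scaled, "gamma")

## sample on the scaled scale, then map back to the original units
sim <- rgamma(1000, shape = fitGamma$estimate["shape"],
              rate = fitGamma$estimate["rate"])
orig <- sim * (max(test) - min(test) + 0.002) + min(test) - 0.001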
I tried using JMP on the same data and got two distinct recommendations
from the scaled and the unscaled values.
When using the unscaled values, Log Normal appears to be the best fit;
fitdist in R is unable to provide a fit in this case.
Compare Distributions
Show  Distribution  Number of Parameters  -2
I have a FORTRAN DLL obtained from Compaq Visual Fortran, and when I try
to load the DLL into the R environment I get an error:
> dyn.load("my_function.dll")
"This application has failed to start because MSCVRTD.dll was not found.
Re-installing this application may fix the problem."
When I
This should work:
for(i in 1:12){
  xLabel <- paste("Graph", i)
  plotTitle <- paste("Graph", i, ".jpg")
  jpeg(plotTitle)
  hist(zNort1[, i], freq = FALSE, xlab = xLabel, col = "blue",
       main = "Standardized Residuals Histogram",
       ylim = c(0, 1), xlim = c(-3.0, 3.0), axes = FALSE)
  axis(1, col = "blue", col.axis = "blue")
  dev.off()  # close the jpeg device so the file is written
}
I am trying to call a FORTRAN subroutine from R. is.loaded is returning
TRUE. However, when I run my .Fortran command I get the following error:
Error in .Fortran("VALUEAHROPTIMIZE", as.double(ahrArray),
as.double(kwArray), :
Fortran symbol name "valueahroptimize" not in load table
I used the DLL export viewer to see what symbol name is being exported.
It shows VALUEAHROPTIMIZE_, which is the name of the function we used
plus an underscore.
Is there any other reason for the function not getting recognized? Thanks.
I am trying to optimize a function similar to the following:
Minimize x1^2 - x2^2 - x3^2
subject to x1 < x2
           x2 < x3
The constraint is that the variables should be monotonically increasing.
Is there any package that implements Sequential Quadratic Programming with
the ability to include these constraints?
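A hedged sketch using base R's constrOptim(), which handles linear
inequality constraints of the form ui %*% x >= ci. (The objective above is
unbounded below under these constraints, so a made-up bounded objective is
used here just to show the encoding of x1 <= x2 <= x3.)

fn <- function(x) (x[1] - 1)^2 + (x[2] - 0.5)^2 + (x[3] - 2)^2

ui <- rbind(c(-1,  1, 0),   # x2 - x1 >= 0
            c( 0, -1, 1))   # x3 - x2 >= 0
ci <- c(0, 0)

## the starting point must be strictly inside the feasible region
constrOptim(theta = c(0, 1, 2), f = fn, grad = NULL, ui = ui, ci = ci)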
This is what I believe is referred to as "suppression" in regression,
where the correlation between the independent and the dependent variable
turns out to be of one sign whereas the regression coefficient turns out
to be of the opposite sign.
Read here about suppression:
http://www.uv
I am looking to build simple GUIs based on the R code I have. The main
objective is to hide the scary R code from non-programming people and make
it easier for them to try out different inputs.
For example:
1. The GUI will have a means to upload a csv file, which will be read by
the R code.
2.
Thanks everyone. I will try out the packages you have mentioned.
Ravi
I have R code that loads a FORTRAN DLL to do some calculations. The code
works fine when I use it in R, but when I try it in Spotfire it throws an
error that it is unable to load the shared library and the specified DLL
cannot be found. I have used "setwd" to point to the location in the
Spotfire environment.
I am using the read.xls command from the gdata package. I get the
following error when I try to read a worksheet from an Excel file:
Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, :
Intermediate file 'C:\Tmp\RtmpYvLnAu\file7f06650f.csv' missing!
In addition: Warning messages:
Is there a text mining / NLP package in R that can do text summarization?
For example, take a huge text as input and provide a summary of the text.
In package tm, summarization is defined more as high-frequency terms,
which is not what I want. I actually want a summary of what is present in
the huge text.
I am trying to implement some expert rules based on the presence or absence
of words in a sentence. I have given a reproducible example below. In this,
every time I come across the words lunch and bag in the same sentence, the
outcome would be 1. If lunch and pack are in the same sentence, then the
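A hedged sketch of one way to encode such rules with base R's grepl() (the
example above is cut off, so the data and the handling of the second rule
are made up):

sentences <- c("I packed my lunch in a bag",
               "Please pack my lunch",
               "Nothing relevant here")

has <- function(word) grepl(paste0("\\b", word, "\\b"), sentences)

outcome <- ifelse(has("lunch") & has("bag"), 1, 0)
outcome   # 1 0 0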
You could use the paste function to define the filename with the date
appended to it. See the example below:
currentDate <- Sys.Date()
csvFileName <- paste("C:/R/Remake/XPX",currentDate,".csv",sep="")
write.csv(S1X.sub, file=csvFileName)
Check this out:
http://www1.maths.lth.se/help/R/RCC/
I tried using the "Snowball" package for performing stemming in text
mining. But when I try to load the package, the following error is thrown:
Error : .onLoad failed in loadNamespace() for 'Snowball', details:
call: NULL
error: .onLoad failed in loadNamespace() for 'rJava', details:
call:
This worked fine. Thanks.
I am currently fitting the following distributions using JMP and looking for
ways to fit the same distributions in R:
Zero Inflated Lognormal
Zero Inflated Loglogistic
Zero Inflated Frechet
Zero Inflated Weibull
Threshold Frechet
Threshold Loglogistic
Threshold Lognormal
Log Generalized Gamma
Thre
Thanks, Thierry.
Has anyone used the "bayescount" package for estimating zero-inflated
distributions? It states that it is a "crude function". Does that mean the
estimates are only approximate?
The example given seems to work only with Gamma Poisson.
data <- rpois(100, rgamma(100, shape=1,
Any help on this would be appreciated. Thank you.
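A hedged sketch (not from the thread) of one common way to handle a
zero-inflated continuous distribution by hand: estimate the zero mass
separately and fit the positive part, here with MASS::fitdistr() on
made-up data.

library(MASS)

x <- c(rep(0, 30), rlnorm(70, meanlog = 1, sdlog = 0.5))

pZero <- mean(x == 0)                       # zero-inflation probability
fitPos <- fitdistr(x[x > 0], "lognormal")   # the lognormal part
pZero
fitPos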
I am having issues with interpreting the results of an STL decomposition.
The following is the data used, as well as the decomposed seasonal, trend
and remainder components. It is weekly data.
The original data doesn't appear to be seasonal, but there seems to be a
periodic peak in the seasonal component.
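A hedged sketch of an STL fit on made-up weekly data; note that stl()
always extracts a seasonal component, even from non-seasonal data, so some
periodic structure in that panel is expected:

x <- ts(rnorm(156, mean = 100), frequency = 52)   # 3 years of weekly data
fit <- stl(x, s.window = "periodic")
plot(fit)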
I am working on a similar problem. I have to add two columns: one
containing the US state to which the origin belongs and another containing
the state to which the destination belongs. All I have is the latitude and
longitude of the origin and destination. Are there any packages in R that
can do this?
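A hedged sketch of one possible approach (not from the thread): the 'maps'
package can report which state polygon contains a given point.

library(maps)

lon <- c(-98.49, -87.63)
lat <- c(29.42, 41.88)
map.where(database = "state", x = lon, y = lat)
## e.g. "texas" "illinois"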
I am trying to optimize a nested function using nlminb. This throws an
error that y is missing. Can someone help me with the correct syntax?
Thank you.
test1 <- function(x, y)
{
  sum <- x + y
  return(sum)
}

test2 <- function(x, y)
{
  sum <- test1(x, y)
  sumSq <- sum * sum
  return(sumSq)
}
n
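A hedged guess at the fix (the original call is cut off above): nlminb()
forwards extra named arguments through '...' to the objective function, so
y can be supplied alongside the starting value for x.

nlminb(start = 1, objective = test2, y = 2)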
This should work:
rmse <- function(x){
  x <- x[!is.na(x)]            # drop NAs so the count matches the sum
  sqrt(sum(x^2) / length(x))
}
I am trying to find the distance between a vector and each row of a
dataframe. I am using the function "distancevector" in the package "hopach"
as follows:
mydata<-as.data.frame(matrix(c(1,1,1,1,0,1,1,1,1,0),nrow=2))
  V1 V2 V3 V4 V5
1  1  1  0  1  1
2  1  1  1  1  0
vec <- c(1,1,1,1,1)
d2<-distan
Thank you both for your reply. I went with the cosine function for similarity
and used it with apply to get a measure of distance.
Ravi
I am using the 'tm' package for text mining and facing an issue with
finding frequently occurring terms. From the definition it appears that
findFreqTerms and minDocFreq are equivalent commands, and both try to
identify the documents with terms appearing more than a specified
threshold. However, I
Thanks, Bettina.
I am trying to perform Singular Value Decomposition (SVD) on a Term
Document Matrix I created using the 'tm' package. Eventually I want to do
Latent Semantic Analysis (LSA).
There are 5677 documents with 771 terms (the TDM is 771 x 5677). When I
try to do the SVD, it runs out of memory. I am us
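A hedged sketch of one workaround (not from the thread): keep the matrix
sparse and compute only a truncated SVD, assuming the 'Matrix' and 'irlba'
packages and a tm term-document matrix 'tdm' (a simple triplet matrix):

library(Matrix)
library(irlba)

tdmSparse <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                          dims = c(tdm$nrow, tdm$ncol))
svdK <- irlba(tdmSparse, nv = 100)   # first 100 singular triplets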
I am looking for some large datasets (10,000 rows & 100,000 columns, or
vice versa) to create some test sets. I am not concerned about the
individual elements since I will be converting them to binary (0/1) using
arbitrary thresholds.
Does any R package provide such big datasets?
Also, what
I have a text file with semi-colon separated values. The table is nearly
10,000 by 585. The file looks as follows:
***
First line: Skip this line
Second line: skip this line
Third line: skip this line
variable1 Variable2 Variable3 Variable4
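A hedged sketch, assuming the file is named "data.txt" and the three
header lines shown above are to be skipped:

dat <- read.table("data.txt", sep = ";", skip = 3, header = TRUE)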
Thanks a lot Rui and Arun.
The methods work fine with the data I gave, but when I tried the two
methods with the following semi-colon separated data using sep = ";", only
the first 3 columns are read properly; the rest of the columns are either
empty or NAs.
**
Thanks a lot for the guidance. I have another text file with a time stamp and
an empty column as given below:
First line: Skip this line
Second line: skip this line
Third line: skip this line
variable1
I have the following dataframe with the first column being of type datetime:
dateTime <- c("10/01/2005 0:00",
"10/01/2005 0:20",
"10/01/2005 0:40",
"10/01/2005 1:00",
"10/01/2005 1:20")
var1 <- c(1,2,3,4,5)
var2 <- c(10,20,30,40,50)
df <- data.frame(dateTime, var1, var2)
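A hedged guess at the likely next step (the post is cut off above):
convert the character column to POSIXct so that it behaves as a date-time.

df$dateTime <- as.POSIXct(as.character(df$dateTime),
                          format = "%m/%d/%Y %H:%M")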
I have used the "tm" package to import a set of text documents using the
following command:
text <- Corpus(DirSource("."),readerControl = list(language ="ansi"))
I would like to extract only a certain portion of the text in each document
using certain keywords. For example, I would like to include al
I have the following data:
x <- as.factor(c(1,1,1,2,2,2,3,3,3))
y <- as.factor(c(10,10,10,20,20,20,30,30,30))
z <- c(100,100,NA,200,200,200,300,300,300)
I could create the cross tab of x and y with Sum of z as its elements using
the xtabs function as follows:
# X Vs. Y with Sum Z
xtabs(z ~ x + y)
I am performing document clustering on a set of documents using R. I
performed hierarchical clustering using hclust and have identified the
cluster corresponding to each data point. I would like to label each
cluster automatically in order to identify the top keywords associated
with each cluster.
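A hedged sketch (not from the thread) of one way to label the clusters:
take the highest-frequency terms within each cluster from the
document-term matrix. Assumes 'dtm' is a tm DocumentTermMatrix whose rows
match the clustered documents and 'hc' is the hclust tree.

m <- as.matrix(dtm)
clusters <- cutree(hc, k = 5)   # the number of clusters is made up

topTerms <- lapply(split(seq_len(nrow(m)), clusters), function(rows) {
  freq <- colSums(m[rows, , drop = FALSE])
  names(sort(freq, decreasing = TRUE))[1:5]
})
topTerms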
I am looking for online courses for learning spatial statistics using R.
Statistics.com is offering an online course on this topic in December, but
that schedule doesn't suit mine. Are there any other similar modes for
learning spatial statistics using R? Can someone please advise?
Thank you.
Thanks, Raphael. Just checked their website. It appears that they currently
do not have any online courses planned.
Thanks a lot for the guidance. I will take a look at these options.
Ravi
I currently have an R function that reads a csv file, does some
computations, produces some plots, and writes a csv file as output. I
would like to use HTML forms to make a user interface for calling
appropriate parts of the function (reading the csv file, doing
computations, displaying plots, and writing the csv output).
I am trying to code the following Excel formula in R:
a  b   c    Result  Formula
1  10  0.1  #N/A    IF(B2<20,NA(),C2+IF(ISERROR(D1),0,D1))
2  20  0.2  0.2     IF(B3<20,NA(),C3+IF(ISERROR(D2),0,D2))
3  30
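A hedged translation of the Excel logic (the example above is cut off, so
the third value of c is made up): when b < 20 the result is NA; otherwise
it is c plus the previous row's result, with a previous NA treated as 0.

b <- c(10, 20, 30)
cc <- c(0.1, 0.2, 0.3)

result <- rep(NA_real_, length(b))
for (i in seq_along(b)) {
  if (b[i] >= 20) {
    prev <- if (i > 1 && !is.na(result[i - 1])) result[i - 1] else 0
    result[i] <- cc[i] + prev
  }
}
result   # NA 0.2 0.5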
I am trying to add a constant to the previous value of a variable based on
certain conditions. Maybe there is a simple way to do this that I am missing
completely. I have given an example below:
df <- data.frame(x = c(1,2,3,4,5), y = c(10,20,30,NA,NA))
> df
  x  y
1 1 10
2 2 20
3 3 30
4 4 NA
5 5 NA
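A hedged guess at the intended operation (the post is cut off above): fill
each NA in y by adding a constant to the previous, possibly already
filled, value. The constant, and the assumption that row 1 is never NA,
are made up.

step <- 10
for (i in seq_len(nrow(df))) {
  if (is.na(df$y[i])) df$y[i] <- df$y[i - 1] + step
}
df$y   # 10 20 30 40 50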
I am using the tm package to do text mining.
I have a huge list of stopwords (2000+) in a csv file. I read it as
follows:
stopwordlist <- read.csv("stopwords to be Removed 10042011.csv")
myStopwords <- as.character(stopwordlist$stopwords)
When I try removing the stopwords using
tr1 <- tm_map(tr1, removeWords, myStopwords)
The following for loop does the work, but it takes a good 30 minutes to
run:
for(i in 1:length(myStopwords))
{
  currentWord <- myStopwords[i]
  tr1 <- tm_map(tr1, removeWords, currentWord)
}
Are there any faster alternatives? Thank you.
Ravi
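A hedged sketch of one faster alternative: removeWords accepts a character
vector, so the stopwords can be removed in a few chunked passes instead of
2000+ single-word passes (chunking guards against overly long regular
expressions; the chunk size is made up):

chunkSize <- 500
chunks <- split(myStopwords, ceiling(seq_along(myStopwords) / chunkSize))
for (ch in chunks) {
  tr1 <- tm_map(tr1, removeWords, ch)
}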
I am using the 'tm' package for text mining. I use the function
findFreqTerms to obtain the frequent words based on their frequency in the
term-document matrix.
The following is the example given on the help page of this function:
library("tm")
data("crude")
tdm <- TermDocumentMatrix(crude)
findFreqTerms(tdm, 2, 3)
I am using gsub to remove numbers from each element of a list. The code is
given below:
testList <- list("this contains a number 1000", "this does not contain")
removeNumbers <- function(X)
{
  gsub("\\d", "", X)
}
outputList <- lapply(testList, removeNumbers)
However, when I try to find the
I am trying to use the rEMM package for Extensible Markov Models. I tried
the following sequence of code:
emmt = EMM(measure="euclidean", threshold=0.75, lambda=0.001)
emmt = build(emmt, data)
new_threshold = sum(cluster_counts(emmt)) * 0.002
emmt_new = prune(emmt, new_threshold)
However, I get the following error:
I have a sentence like the following:
sentence <- "Part 1 is working, Part 2 is not working and Part 3 is working"
I would like to get the total counts of working and not working as
Working = 2 and Not Working = 1.
Can someone help with how this can be done in R? Thank you.
Ravi
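A hedged sketch using base R's gregexpr(): count "not working" first, then
subtract it from the total count of "working", since the latter pattern
also matches inside the former.

countMatches <- function(pattern, x) {
  m <- gregexpr(pattern, x)[[1]]
  if (m[1] == -1) 0 else length(m)
}

notWorking <- countMatches("not working", sentence)
working <- countMatches("working", sentence) - notWorking
c(Working = working, NotWorking = notWorking)   # 2 and 1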
I am trying to parse a webpage using the htmlParse command in the XML
package as follows:
library(XML)
u = "http://en.wikipedia.org/wiki/World_population"
doc = htmlParse(u)
I get the following error:
Error in htmlParse(u) :
error in creating parser for http://en.wikipedia.org/wiki/World_populat