[R] Java Exception error while reading large data in R from DB using RJDBC.

2012-10-30 Thread aajit75
Dear List,

I get a Java exception while reading a large table from a DB (Vectorwise)
into R over an RJDBC connection.
I have tested the connection with small data and was able to fetch DB
tables using the same connection (conn in my code).

Please suggest where I am going wrong, or an alternative way to read a
large DB table.

drv <- JDBC(paste(db_driver,  sep = ""),
   paste(db_jar_file,  sep = ""),
   identifier.quote="`")

conn <<- dbConnect(drv, paste(db_server,  sep = ""),
paste(db_server_lgn,  sep = ""), 
paste(db_server_pwd,  sep = ""))
s <- sprintf("select * from  cypress_modeldev_account_info")
 temp <- dbGetQuery(conn, s) 
Error in .jcheck() : 
  Java Exception .jcall(rp, "I",
"fetch", stride)




--
View this message in context: 
http://r.789695.n4.nabble.com/Java-Exception-error-while-reading-large-data-in-R-from-DB-using-RJDBC-tp4647844.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Solving binary integer optimization problem

2012-08-10 Thread aajit75
Hi,

I am new to solving optimization problems in R. I have a set of communication
channels, each with limited capacity and two types of cost, fixed and
variable. Each channel has an expected gain for a single communication.
I want to determine the optimal number of communications for each channel,
maximizing ROI (return on investment) with the overall budget as a constraint. 6
is the budget allocated.

Channel  Fixed_Cost  Variable_Cost  Capacity  Expected_Gain
C1       400         2.5            5000      0.25
C2       1           0              3         0.3
C3       4000        0.15           2         0.15
C4       2000        2              1         0.36
C5       100         3              4000      0.09

Channel_Select <-data.frame(Channel=c('c1','c2','c3','c4','c5'),
  Fixed_Cost=c(400,5000,4000,2000,100), 
Variable_Cost=c(2.5,0,0.15,2,3),
Capacity=c(5000,3,2,1,4000),
Expected_gain=c(0.25,0.3,0.15,0.36,0.09))


Let x1,x2,x3,x4,x5 be the decision variables for channels c1,c2,c3,c4,c5,
and let z1,z2,z3,z4,z5 be binary indicator variables that are 1 if the channel
has any communication allocated.

max(((0.25*x1+0.30*x2+0.15*x3+0.36*x4+0.09*x5)-(2.5*x1+0*x2+0.15*x3+2*x4+3*x5+400*z1+1*z2+4000*z3+2000*z4+100*z5))/(
2.5*x1+0*x2+0.15*x3+2*x4+3*x5+400*z1+1*z2+4000*z3+2000*z4+100*z5))

Constraints:
(2.5*x1+0*x2+0.15*x3+2*x4+3*x5+400*z1+1*z2+4000*z3+2000*z4+100*z5) <=
6 ##Budget Constraint
x1-5000*z1<=0
x2-3*z2<=0
x3-2*z3<=0
x4-1*z4<=0
x5-4000*z5<=0
x1 >= 200
x2 >= 100
x3>=100
x4>=500
x5>=0

I tried the lp function from lpSolve but could not work out how to set
objective.in for this objective function. Any help or hint is welcome!




--
View this message in context: 
http://r.789695.n4.nabble.com/Solving-binary-integer-optimization-problem-tp4639891.html


[R] Putting directory path as a parameter

2011-11-15 Thread aajit75
Hi List,

I am new to R, so this may be simple.

I want to store a directory path as a parameter, to be used when reading and
writing data from CSV files.

How can I use dir, defined in the example below, when reading the
CSV file?

Example:
dir <- "C:/Users/Desktop" #location of file

temp_data <- read.csv("dir/bs_dev_segment_file.csv")

If I run this, it shows the following errors:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'dir/bs_dev_segment_file.csv': No such file or directory

Regards,
-Ajit
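
A minimal sketch of one way to do this, assuming the goal is simply to join the stored directory to a file name. Inside quotes, "dir/..." is literal text, so R never looks at the variable; file.path() builds the path from the variable instead. The temporary directory and the small demo file are stand-ins so the example runs anywhere; they are not from the original post. The variable is called data_dir here only because dir() is also the name of a base R function.

```r
# Stand-in for "C:/Users/Desktop" so the sketch is runnable anywhere.
data_dir <- tempdir()

# Create a small demo file (hypothetical contents).
csv_path <- file.path(data_dir, "bs_dev_segment_file.csv")
write.csv(data.frame(id = 1:3, segment = c("a", "b", "c")),
          csv_path, row.names = FALSE)

# Build the path from the variable, not from the literal text "dir".
temp_data <- read.csv(file.path(data_dir, "bs_dev_segment_file.csv"))
print(temp_data)
```

The same file.path() call can be reused for writing, for example as the file argument of write.csv().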

--
View this message in context: 
http://r.789695.n4.nabble.com/Putting-directory-path-as-a-parameter-tp4043092p4043092.html


[R] Similar function for Redun() from Hmisc ?

2011-11-22 Thread aajit75
Hi List,

I am working on a large data frame (35000 records and 160 variables),
using redun() to drop variables before they go into a model.

V <- redun(~., data = data.frame, r2 = 0.8)

It takes an enormously long time to execute. Is there anything wrong in the
script?
Please suggest any other similar function for dropping redundant
variables.

Thanks in advance!
~A


--
View this message in context: 
http://r.789695.n4.nabble.com/Similar-function-for-Redun-from-Hmisc-tp4095455p4095455.html


[R] Any function\method to use automatically Final Model after bootstrapping using boot.stepAIC()

2011-11-29 Thread aajit75
Hi List,
Being new to R, I am trying to apply boot.stepAIC() for model selection by
bootstrapping the stepAIC() procedure. I have gone through the discussion in
various threads on variable selection methods and understood the pros and
cons of the various methods. I am also going through the regression modelling
strategies in rms.
I want to read the final model, its formula, or its list of variables
automatically after boot.stepAIC().

library(bootStepAIC)  # provides boot.stepAIC()
n <- 200
x1 <- runif(n, -3, 3)
x2 <- runif(n, -3, 3)
x3 <- runif(n, -3, 3)
x4 <- runif(n, -3, 3)
x5 <- factor(sample(letters[1:2], n, rep = TRUE))
eta <- 0.1 + 1.6 * x1 - 2.5 * as.numeric(as.character(x5) == levels(x5)[1])
y1 <- rbinom(n, 1, plogis(eta))

data <- data.frame(y1,x1, x2, x3, x4, x5)
glmFit1 <- glm(y1 ~ x1 + x2 + x3 + x4 + x5, family = binomial, data = data)
bglmfit <- boot.stepAIC(glmFit1, data, B = 50)
bglmfit 
In the summary of bootstrapping the 'stepAIC()' procedure, the following
information is listed:
Initial Model:
y1 ~ x1 + x2 + x3 + x4 + x5
Final Model:
y1 ~ x1 + x5
Is there any function or method for using the Final Model from bootstrapping the
'stepAIC()' procedure, in the way the OrigStepAIC model is used below?
n <- 200
x1 <- runif(n, -3, 3)
x5 <- factor(sample(letters[1:2], n, rep = TRUE))
eta <- 0.1 + 1.6 * x1 - 2.5 * as.numeric(as.character(x5) == levels(x5)[1])

data1 <- data.frame(x1, x5)
data1$probscore <- predict(bglmfit$OrigStepAIC , data1)
Is there any way to read the variables or the formula of the Final Model?

Thanks in advance!
Regards,
~Ajit
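
A hedged sketch of how the formula and variable names can be read from a fitted model object. It assumes that bglmfit$OrigStepAIC is an ordinary fitted glm, as the predict() call above suggests; the plain glm below (final_fit, a name made up for this example) is a stand-in for it so the sketch needs only base R.

```r
set.seed(1)
n   <- 200
x1  <- runif(n, -3, 3)
x5  <- factor(sample(letters[1:2], n, replace = TRUE))
eta <- 0.1 + 1.6 * x1 - 2.5 * (x5 == levels(x5)[1])
y1  <- rbinom(n, 1, plogis(eta))
dat <- data.frame(y1, x1, x5)

# Stand-in for bglmfit$OrigStepAIC (assumed to be a fitted glm object).
final_fit <- glm(y1 ~ x1 + x5, family = binomial, data = dat)

final_formula <- formula(final_fit)                      # the model formula
final_vars    <- all.vars(final_formula)                 # response + predictors
final_terms   <- attr(terms(final_fit), "term.labels")   # predictors only

print(final_formula)
print(final_vars)
print(final_terms)

# The extracted formula can be reused, e.g. to refit on other data.
refit <- glm(final_formula, family = binomial, data = dat)
```

If the assumption holds, replacing final_fit with bglmfit$OrigStepAIC should give the same three pieces for the selected model.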

--
View this message in context: 
http://r.789695.n4.nabble.com/Any-function-method-to-use-automatically-Final-Model-after-bootstrapping-using-boot-stepAIC-tp4119050p4119050.html


[R] Calculating the probability of an event at time "t" from a Cox model fit

2011-12-19 Thread aajit75
Dear R-users,

I would like to determine the probability of an event at a specific time using
a Cox model fit. On the development sample I am able to get the
probability of an event at time point t.
I need the probability of an event at a specific time for a scoring
dataset, which will have only the covariates and not the response variables.

Here is the sample code:

n = 1000
beta1 = 2; beta2 = -1; 
lambdaT = .02 # baseline hazard
lambdaC = .4  # hazard of censoring
x1 = rnorm(n,0)
x2 = rnorm(n,0)

# true event time
T = rweibull(n, shape=1, scale=lambdaT*exp(-beta1*x1-beta2*x2)) 
C = rweibull(n, shape=1, scale=lambdaC)   #censoring time
time = pmin(T,C)  #observed time is min of censored and true
event = time==T   # set to 1 if event is observed
dataphr=data.frame(time,event,x1,x2)

library(survival)
fit_coxph <- coxph(Surv(time, event)~ x1 + x2 , method="breslow")

library(peperr)
predictProb.coxph(fit_coxph, Surv(dataphr$time, dataphr$event), dataphr,
0.003)

Using the predictProb.coxph function, the probability of an event at time t is
estimated for Cox model fits. I want to estimate this probability on the scoring
dataset score_data below, which has only the covariates x1 and x2.

Is there any way to get these probabilities? The predictProb.coxph function
requires the response, which is not present in the scoring sample.

n = 1
set.seed(1)
x1 = rnorm(n,0)
x2 = rnorm(n,0)
score_data <- data.frame(x1,x2)


Thanks in advance!!
~ Ajit
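
One possible approach, sketched under assumptions: the survival package's survfit() accepts a fitted coxph model plus a newdata frame of covariates and returns predicted survival curves, from which the event probability at time t can be taken as 1 - S(t). The simulated data, the time point t0 and the names dev_data and prob_event are illustrative choices, not from the post; the model is fitted with data = so that newdata can be matched to the covariates.

```r
library(survival)

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
event_time  <- rexp(n, rate = exp(0.8 * x1 - 0.5 * x2))  # hypothetical event times
censor_time <- rexp(n, rate = 0.3)                        # hypothetical censoring times
dev_data <- data.frame(
  time  = pmin(event_time, censor_time),
  event = as.integer(event_time <= censor_time),
  x1 = x1, x2 = x2
)

fit <- coxph(Surv(time, event) ~ x1 + x2, data = dev_data, method = "breslow")

# Scoring data: covariates only, no response.
score_data <- data.frame(x1 = c(-1, 0, 1), x2 = c(0.5, 0, -0.5))

# Predicted survival curves for the new rows, read off at time t0.
t0 <- 0.5
sf <- survfit(fit, newdata = score_data)
surv_t0 <- as.vector(summary(sf, times = t0)$surv)

# Probability of an event by t0 is 1 - S(t0 | x).
score_data$prob_event <- 1 - surv_t0
print(score_data)
```

The time t0 has to lie within the follow-up range of the development data for summary() to return a value at that point.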

--
View this message in context: 
http://r.789695.n4.nabble.com/Calculating-the-probability-of-an-event-at-time-t-from-a-Cox-model-fit-tp4213318p4213318.html


[R] Creating and assigning variable names in loop

2011-12-21 Thread aajit75
Hello List

I am trying to create and assign variable names in a loop, but I am not able
to get the expected variable names.

Here is the sample code

n = 10
set.seed(1)
x1 = rnorm(n,0)
x2 = rnorm(n,0)
samp_data <- data.frame(x1,x2)

for( i in 1:3) { 
label <- paste("score", i, sep="_") 
assign(label, value =x1+(x2*i) ) 
samp_data <- cbind(samp_data, get(label)) 
}

> head(samp_data)
          x1          x2 get(label) get(label) get(label)
1 -0.6264538  1.51178117  0.8853274  2.3971085  3.9088897
2  0.1836433  0.38984324  0.5734866  0.9633298  1.3531730
3 -0.8356286 -0.62124058 -1.4568692 -2.0781098 -2.6993504
4  1.5952808 -2.21469989 -0.6194191 -2.8341190 -5.0488189
5  0.3295078  1.12493092  1.4544387  2.5793696  3.7043005
6 -0.8204684 -0.04493361 -0.8654020 -0.9103356 -0.9552692

I am expecting the new variables created in samp_data to be named
score_1, score_2, score_3 instead of get(label), get(label), get(label).
Where am I going wrong?

Thanks in advance
~Ajit
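
A minimal sketch of one alternative: cbind() names the new column after the expression it was given, which is why every column ends up called get(label). Assigning through [[ ]] with the character name sets the column name directly and avoids assign()/get() altogether.

```r
n <- 10
set.seed(1)
x1 <- rnorm(n)
x2 <- rnorm(n)
samp_data <- data.frame(x1, x2)

for (i in 1:3) {
  label <- paste("score", i, sep = "_")
  # [[ ]] with a character name creates (or overwrites) that column.
  samp_data[[label]] <- samp_data$x1 + samp_data$x2 * i
}

print(names(samp_data))
```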

--
View this message in context: 
http://r.789695.n4.nabble.com/Creating-and-assigning-variable-names-in-loop-tp4221080p4221080.html


[R] Subsetting data by eliminating redundant variables

2011-10-19 Thread aajit75
Dear All,

I am new to R and have one question which might be easy.

I have a large dataset with more than 250 variables. I am reducing the number
of variables with the redun function, as in the example below:

n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
data1 <- cbind(x1,x2,x3,x4,x5,x6) 
data2 <- data.frame(data1)
outredun <- redun(~., data=data2, r2=.8,)
outredun
#outredun1 <- capture.output(redun(~., data=data2, r2=.8,))
#outredun1
#x25 <- outredun1[25]
#mydata12 <- data2[myvars] #myvars I need to pass to retain variables 

For this example the console output lists Redundant variables: x6 x4 x3 and
Predicted from variables: x1 x2 x5.

I want to subset my original data either by keeping the 'Predicted from
variables' or by dropping the 'Redundant variables'. I have tried the
capture.output function, as in the commented code above, but it
gives me a string like "x1 x2 x5 " which needs to be turned into "x1", "x2", "x5"
before it can be used to subset the data.

My data has more than 250 variables, and the data and the number of
variables change every time. How can this be achieved?

Thanks in advance for the help.

Regards,
-Ajit
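
A small base-R sketch of the string-handling step only, assuming the captured text is a space-separated list of column names as described above. The tiny data2 here is a stand-in with the same column names. The object returned by redun() may also expose the kept and dropped names directly (str(outredun) would show its components); that is not relied on here.

```r
# Stand-in data frame; column names mirror the example above.
data2 <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9,
                    x4 = 10:12, x5 = c("a", "b", "c"), x6 = c(1, 0, 1))

# The kind of string recovered via capture.output().
kept_string <- "x1 x2 x5 "

# Trim, then split on runs of whitespace to get a character vector.
myvars <- strsplit(trimws(kept_string), "\\s+")[[1]]
print(myvars)

# Keep only those columns ...
kept_data <- data2[, myvars, drop = FALSE]

# ... or, equivalently here, drop the redundant ones.
dropvars <- c("x6", "x4", "x3")
kept_data2 <- data2[, setdiff(names(data2), dropvars), drop = FALSE]
```

Because the names are computed at run time, the same lines work unchanged when the data and the number of variables change.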







--
View this message in context: 
http://r.789695.n4.nabble.com/Subsetting-data-by-eliminating-redundant-variables-tp3918199p3918199.html


[R] How to remove multiple outliers

2011-10-20 Thread aajit75
Hi All,

I am working on a dataset in which some of the variables have more than
one outlying observation.

I am using the sample script below:

library(outliers)
x1 <- c(10, 10, 11, 12, 13, 14, 14, 10, 11, 13, 12, 13, 10, 19, 18, 17,
10099, 10099, 10098)
outlier_tf1 = outlier(x1,logical=TRUE)
find_outlier1 = which(outlier_tf1==TRUE, arr.ind=TRUE)
beh_input_ro1 = x1[-find_outlier1]

It removes only the most extreme outliers and not all of them. In this example
it removes only 10099, 10099 and not 10098.

Thanks for the help in advance.
-Ajit
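
A sketch of a different technique from outlier(): a box-plot style rule that flags every value outside [Q1 - k*IQR, Q3 + k*IQR] in one pass, so values such as 10098 are caught along with 10099. The multiplier k = 1.5 is the conventional choice and an assumption here, and whether such values should be removed at all is a separate question.

```r
x1 <- c(10, 10, 11, 12, 13, 14, 14, 10, 11, 13, 12, 13, 10, 19, 18, 17,
        10099, 10099, 10098)

k   <- 1.5                                          # fence multiplier (assumption)
q   <- unname(quantile(x1, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
lower <- q[1] - k * iqr
upper <- q[2] + k * iqr

is_outlier <- x1 < lower | x1 > upper
x1_clean   <- x1[!is_outlier]

print(x1[is_outlier])
print(x1_clean)
```

Indexing with the logical vector !is_outlier also behaves sensibly when nothing is flagged, unlike negative indexing with an empty which() result.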


--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-remove-multiple-outliers-tp3921689p3921689.html


Re: [R] How to remove multiple outliers

2011-10-21 Thread aajit75
Hi Michael,

Thanks for the help.

Yes, I have gone through the documentation for ?outlier. As it removes one
outlier at a time, and being new to R, I was wondering whether there is any
function for removing multiple outliers without calling, say, rm.outlier
n times, because n is not known in advance here.

On the second point, I am using the piece of code below because I get
an error when rm.outlier with the fill = FALSE option is applied to the
same dataset.

outlier_tf1 = outlier(x1,logical=TRUE) 
find_outlier1 = which(outlier_tf1==TRUE, arr.ind=TRUE) 
beh_input_ro1 = x1[-find_outlier1] 

> library(outliers)
> beh_input_ro <- rm.outlier(beh_input_dr, fill = FALSE, median = FALSE,
> opposite = FALSE)
Error in data.frame(X1 = c(28.7812, 24.8923, 31.3987, 25.774, 27.1798,  : 
arguments imply differing number of rows: 2398, 2390, 2399

Regards,
-Ajit

--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-remove-multiple-outliers-tp3921689p3924904.html


[R] Data frame manipulation by eliminating rows containing extreme values

2011-10-22 Thread aajit75
Dear All, 

I have obtained limits for removing extreme values for each variable using
the following function.

f=function(x){quantile(x, c(0.25, 0.75),na.rm = TRUE) - matrix(IQR(x,na.rm =
TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)}

#Example:

n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
data1 <- cbind(x1,x2,x3,x4,x5,x6)
data2 <- data.frame(data1)
xyz <- lapply(data1, f)

#Now I can eliminate the rows (observations) that contain extreme values,
#one variable at a time, as below.

data2 <- subset (data2, x1<=xyz$x1[,1] &  x1>=xyz$x1[,2])
data2 <- subset (data2, x2<=xyz$x2[,1] &  x2>=xyz$x2[,2])

.
.
and so on..

But my data has many more variables (more than 120). Can anybody
suggest an efficient way of eliminating rows containing extreme values?

Thanks in advance!

Regards,
-Ajit
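
A hedged sketch of doing this for all numeric columns at once instead of one subset() call per variable. Note that it uses the conventional box-plot fences [Q1 - k*IQR, Q3 + k*IQR], which are not the same limits that f() above returns; the helper name inside_fence, the multiplier k and the planted extreme value are assumptions for illustration.

```r
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n) / 10
x5 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
dat <- data.frame(x1, x2, x3, x5)
dat$x1[1] <- 9999     # planted extreme value (hypothetical)

# TRUE where x lies inside [Q1 - k*IQR, Q3 + k*IQR]; NA is treated as inside.
inside_fence <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- k * (q[2] - q[1])
  ok <- x >= q[1] - spread & x <= q[2] + spread
  ok | is.na(x)
}

# Apply the rule to every numeric column; factors are skipped.
num_cols  <- names(dat)[sapply(dat, is.numeric)]
ok_matrix <- sapply(dat[num_cols], inside_fence)   # n x p logical matrix
keep      <- rowSums(!ok_matrix) == 0              # rows with no flagged value

dat_clean <- dat[keep, ]
print(nrow(dat) - nrow(dat_clean))   # number of rows dropped
```

Computing all limits on the original data and filtering once also avoids the order dependence of repeatedly overwriting data2, where each subset changes the quantiles seen by the next.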


--
View this message in context: 
http://r.789695.n4.nabble.com/Data-frame-manipulation-by-eliminating-rows-containing-extreme-values-tp3927941p3927941.html


Re: [R] Data frame manipulation by eliminating rows containing extreme values

2011-10-23 Thread aajit75
Hi David,

Thanks for the reply.


f=function(x){quantile(x, c(0.25, 0.75),na.rm = TRUE) - matrix(IQR(x,na.rm =
TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)} 

Here the multiplier 1.5 in the function above is only an example; after
analyzing the actual data it may be larger, perhaps 3.0. The aim is
to find cut-offs on both sides (high and low values) for each variable, as
in a box plot, and then eliminate observations based on
those cut-offs.

On the second point, I am extremely sorry. It was a typo:
in xyz <- lapply(data1, f) it should be data2.

n <- 100 
x1 <- runif(n) 
x2 <- runif(n) 
x3 <- x1 + x2 + runif(n)/10 
x4 <- x1 + x2 + x3 + runif(n)/10 
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE)) 
x6 <- 1*(x5=='a' | x5=='c') 
data1 <- cbind(x1,x2,x3,x4,x5,x6) 
data2 <- data.frame(data1) 
xyz <- lapply(data2, f) 
str (xyz)

Now it is a list of six only:
List of 6
 $ x1: num [1, 1:2] 0.7797 0.0613
 $ x2: num [1, 1:2] 0.9533 0.0194
 $ x3: num [1, 1:2] 1.438 0.532
 $ x4: num [1, 1:2] 2.85 1.03
 $ x5: num [1, 1:2] 4 0
 $ x6: num [1, 1:2] 1.5 -0.5

The third point you mentioned is the problem to be resolved: at present I am
overwriting data2, applying these cut-offs one variable at a time. Is there
an efficient way to do this?

 data2 <- subset (data2, x1<=xyz$x1[,1] &  x1>=xyz$x1[,2]) 
 data2 <- subset (data2, x2<=xyz$x2[,1] &  x2>=xyz$x2[,2]) 

On the last point, I agree that removing "extreme values" is
a serious distortion of the data. But in my data the values for some
observations are set to a very high number. This is also
not consistent across all variables in the data. So I can set a value higher
than 1.5 in the function, get cut-offs for each variable, and remove such
observations. As rm.outlier removes only one value, I am using the above
function.

Thanks for the help in advance.

Regards,
-Ajit




--
View this message in context: 
http://r.789695.n4.nabble.com/Data-frame-manipulation-by-eliminating-rows-containing-extreme-values-tp3927941p3929927.html


[R] How to get Quartiles when data contains both numeric variables and factors

2011-10-31 Thread aajit75
When the data contains both factor and numeric variables, how can I get
quantiles for all the numeric variables?
n <- 100 
x1 <- runif(n) 
x2 <- runif(n) 
x3 <- x1 + x2 + runif(n)/10 
x4 <- x1 + x2 + x3 + runif(n)/10 
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- factor(1*(x5=='a' | x5=='c')) 
data1 <- cbind(x1,x2,x3,x4,x5,x6) 
data <- data.frame(data1) 

data <- within(data,{x5 <- factor(x5)})
x <- data

qs <- sapply(x, function(x) quantile(x, c(0.01, 0.99))) 

I get an error:
Error in quantile.default(x, c(min_pct, max_pct)) : factors are not allowed

Thanks for the help.
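
A minimal sketch of one way around the error: pick out the numeric columns first with is.numeric() and apply quantile() only to those. The data frame is built directly with data.frame() so that x5 stays a factor; the 1% and 99% probabilities are the ones from the post.

```r
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n) / 10
x5 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
dat <- data.frame(x1, x2, x3, x5)

# Logical flag per column: TRUE for numeric, FALSE for factors.
is_num <- sapply(dat, is.numeric)

# One column of quantiles per numeric variable.
qs <- sapply(dat[is_num], quantile, probs = c(0.01, 0.99), na.rm = TRUE)
print(qs)
```

For quartiles rather than extreme percentiles, probs = c(0.25, 0.5, 0.75) can be passed in the same way.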


--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-get-Quartiles-when-data-contains-both-numeric-variables-and-factors-tp3955750p3955750.html


[R] Creating deciles on data using one variable

2011-11-02 Thread aajit75
I need to split data containing more than one variable into deciles based on
one of the variables. I am using the script below:

id <-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
tot <-c(1230, 1230, 2345, 3456, 456, 4356, 123, 124, 987, 785, 5646, 345,
2345, 3456, 456, 4356, 123, 124, 987, 785)  
data <- data.frame ( cbind(id , tot))
data$decile<-cut(data$tot,quantile(data$tot,(0:10)/10),include.lowest=TRUE,lable=TRUE)
data$decile

The new variable "decile" takes the values shown below, whereas I need it to
take the values 1, 2, ..., 10. Where am I going wrong?

data$decile
 [1] (987,1.23e+03]  (987,1.23e+03]  (1.23e+03,2.34e+03]
 [4] (2.34e+03,3.46e+03] (301,456]   (3.46e+03,4.36e+03]
 [7] [123,124]   (124,301]   (785,987]  
[10] (456,785]   (4.36e+03,5.65e+03] (301,456]  
[13] (1.23e+03,2.34e+03] (2.34e+03,3.46e+03] (301,456]  
[16] (3.46e+03,4.36e+03] [123,124]   (124,301]  
[19] (785,987]   (456,785]  

-Ajit
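
A sketch of one likely fix, assuming integer decile codes are what is wanted: the argument of cut() is spelled labels, and labels = FALSE asks for integer codes instead of interval labels. The misspelled lable = TRUE in the script above appears to be ignored without complaint, which would explain the interval labels in the output.

```r
id  <- 1:20
tot <- c(1230, 1230, 2345, 3456, 456, 4356, 123, 124, 987, 785, 5646, 345,
         2345, 3456, 456, 4356, 123, 124, 987, 785)
dat <- data.frame(id, tot)

# Decile break points of tot (11 breaks define 10 bins).
breaks <- quantile(dat$tot, probs = (0:10) / 10)

# labels = FALSE returns the bin number (1..10) for each row.
dat$decile <- cut(dat$tot, breaks = breaks,
                  include.lowest = TRUE, labels = FALSE)
print(dat)
```

With heavily tied data some quantile breaks can coincide, in which case cut() stops with an error about non-unique breaks; ranking the values first is one common workaround.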

--
View this message in context: 
http://r.789695.n4.nabble.com/Creating-deciles-on-data-using-one-variable-tp3973086p3973086.html


[R] Decision tree model using rpart ( classification

2011-11-04 Thread aajit75
Hi Experts,

I am new to R and am using a decision tree model to get segmentation rules.
A) Using behavioural data (attributes defining customer behaviour, for example
balances, number of accounts, etc.)
1. Clustering: cluster the behavioural data into a suitable number of clusters
2. Decision tree: use an rpart classification tree to generate segmentation
rules, with the cluster number (cluster id) as the target variable and
variables from the behavioural data as input variables.

B) Using profile data (customers' demographic data)
1. Clustering: cluster the profile data into a suitable number of clusters
2. Decision tree: use an rpart classification tree to generate segmentation
rules, with the cluster number (cluster id) as the target variable and
variables from the profile data as input variables.

C) Using profile data (customers' demographic data) and deciles created
based on behaviour
1. Deciles: split customers into 10 groups based on some behavioural data
2. Decision tree: use an rpart classification tree to generate segmentation
rules, with the decile as the target variable and variables from the profile
data as input variables.

In the first two cases, A and B, the rpart decision tree finishes in
a minute or two. In the third case (C) it seems to run indefinitely
(I monitored it and it was still running after 14 hours).
 fit <- rpart(decile ~., method="class",data=dtm_ip)
Is there anything wrong with my approach?

Thanks for the help in advance.
-Ajit


--
View this message in context: 
http://r.789695.n4.nabble.com/Decision-tree-model-using-rpart-classification-tp3989162p3989162.html


Re: [R] Decision tree model using rpart ( classification

2011-11-04 Thread aajit75
Hi,

Thanks for the response. The code for each case is as follows:

c_c_factor <- 0.001  
min_obs_split <- 80

A)

fit <- rpart(segment ~., method="class", 
   control=rpart.control(minsplit=min_obs_split, cp=c_c_factor), 
   data=Beh_cluster_out)

B)
fit <- rpart(segment ~., method="class", 
   control=rpart.control(minsplit=min_obs_split, cp=c_c_factor), 
   data=profile_cluster_out)

 C)
fit <- rpart(decile ~., method="class", 
   control=rpart.control(minsplit=min_obs_split, cp=c_c_factor), 
   data=dtm_ip)

In A and B the target variable 'segment' comes from clustering the same
set of input variables, while in C the target variable 'decile' is derived from
behavioural variables and the input variables are from the profile data. The
number of rows in the input table is the same in all three cases.

Regards,
-Ajit


--
View this message in context: 
http://r.789695.n4.nabble.com/Decision-tree-model-using-rpart-classification-tp3989162p3989320.html


[R] Assign value to new variable based on conditions on other variables

2012-04-10 Thread aajit75
Hi Experts,

This may be a simple question. I want to create a new variable "seg" and assign
values to it based on conditions satisfied by each observation.

Here is the example:
##Below are the conditions

##if variable x2 gt 0 and x3 gt 200 then seg should take value 1, 
##if variable x2 gt 100 and x3 gt 300 then seg should take value 2
##if variable x2 gt 200 and x3 gt 400 then seg should take value 3
##if variable x2 gt 300 and x3 gt 500 then seg should take value 4

id <- c(1,2,3,4,5)
x2 <- c(200,100,400,500,600)
x3 <- c(300,400,500,600,700)
dd <- data.frame(id,x2,x3)


dd$seg[dd$x2> 0 && dd$x3> 200] <-1
dd$seg[dd$x2> 100 && dd$x3> 300] <-2
dd$seg[dd$x2> 200 && dd$x3> 400] <-3
dd$seg[dd$x2> 300 && dd$x3> 500] <-4

I tried the above but it is not working for me. What is the correct and
efficient way to do this?

Thanks for the help in advance!!


--
View this message in context: 
http://r.789695.n4.nabble.com/Assign-value-to-new-variable-based-on-conditions-on-other-variables-tp4544753p4544753.html


Re: [R] Assign value to new variable based on conditions on other variables

2012-04-10 Thread aajit75
I have got a solution using the within function, as below:

dd$Seg <- 1
dd <- within(dd, Seg[x2> 0 & x3> 200] <- 1) 
dd <- within(dd, Seg[x2> 100 & x3> 300] <- 2) 
dd <- within(dd, Seg[x2> 200 & x3> 400] <- 3) 
dd <- within(dd, Seg[x2> 300 & x3> 500] <- 4)

Is there any better way of doing it?
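
One possible alternative, sketched with the thresholds from the first post held in two vectors (x2_cut and x3_cut are names made up for this example). The element-wise operator & is what makes the subsetting work; && is meant for single TRUE/FALSE values, which is why the first attempt did not behave as expected. Later conditions overwrite earlier ones, matching the order of the within() calls above, except that rows matching no condition stay NA instead of defaulting to 1.

```r
dd <- data.frame(id = 1:5,
                 x2 = c(200, 100, 400, 500, 600),
                 x3 = c(300, 400, 500, 600, 700))

# Thresholds for segments 1..4 (same values as the conditions in the post).
x2_cut <- c(0, 100, 200, 300)
x3_cut <- c(200, 300, 400, 500)

dd$seg <- NA_integer_
for (k in seq_along(x2_cut)) {
  # Element-wise test over all rows; rows that pass get segment k.
  hit <- dd$x2 > x2_cut[k] & dd$x3 > x3_cut[k]
  dd$seg[hit] <- k
}
print(dd)
```

Keeping the thresholds in vectors means adding a fifth segment only requires one more value in each vector.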

--
View this message in context: 
http://r.789695.n4.nabble.com/Assign-value-to-new-variable-based-on-conditions-on-other-variables-tp4544753p4544795.html


[R] Java heap space Error while reading table from postgres database using RJDBC

2012-02-09 Thread aajit75

Hi List,
I am reading a table from a postgres database into an R session using RJDBC. The
table contains 150 columns and 20 rows.
Sample code is below; it works fine with smaller tables.

db_driver <- mydir$db_driver
db_jar_file <- mydir$db_jar_file
db_server <- mydir$db_server
db_server_lgn <- mydir$db_server_lgn
db_server_pwd <- mydir$db_server_pwd

library(RJDBC)

drv <- JDBC(paste(db_driver,  sep = ""),
   paste(db_jar_file,  sep = ""),
   identifier.quote="`")

conn <- dbConnect(drv, paste(db_server,  sep = ""),
paste(db_server_lgn,  sep = ""), 
paste(db_server_pwd,  sep = ""))

cs_input_abt <- dbReadTable(conn, "cs_input_abt")

The following errors occur when the above script is executed; a different
one appears each time.
1. Error in .jcall(rp, "I", "fetch", stride) : 
  java.lang.OutOfMemoryError: Java heap space
2. Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for
",  : 
  Unable to retrieve JDBC result set for SELECT * FROM cs_input_abt (Could
not initialize class org.postgresql.util.PSQLException)
3. Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for
",  : 
  Unable to retrieve JDBC result set for SELECT * FROM bs_modelling_abt (GC
overhead limit exceeded)

Where am I going wrong? Is there any option that I have not used in the
RJDBC connection and need to add?
Ajit
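
A commonly suggested remedy for "Java heap space" errors with rJava-based packages, sketched under assumptions: the JVM heap limit is fixed when the JVM starts, so the java.parameters option has to be set before RJDBC (and rJava) are loaded in a fresh session. The 4 GB figure is an arbitrary example value, not a recommendation for this table.

```r
# Must run in a fresh R session, before rJava/RJDBC are loaded;
# the JVM reads this option only once, at start-up.
options(java.parameters = "-Xmx4g")   # 4 GB is an arbitrary example value

# library(RJDBC)   # load only after the option is set
# ... JDBC(), dbConnect() and dbReadTable() as in the script above ...

print(getOption("java.parameters"))
```

If the table is still too large for the heap, selecting only the needed columns in the query, or reading it in smaller pieces, would reduce how much has to be held in Java memory at once.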



--
View this message in context: 
http://r.789695.n4.nabble.com/Java-heap-space-Error-while-reading-table-from-postgres-database-using-RJDBC-tp4372816p4372816.html


[R] Passing date as parameter while retrieving data from database using dbGetQuery

2012-02-15 Thread aajit75

Hi All,
This might be a simple question. I need to retrieve data for modelling from the
database. The date values change every time, so I cannot fix the dates in the
code; they have to be passed as parameters.
When I pass the date as a parameter, it throws an error:
(ERROR: column "start_dt" does not exist  Position: 285)
My script is below; please guide me on where I am going wrong.
All other parameters are passed correctly. When start_dt and end_dt are replaced
by '2010-11-01' and '2011-01-31' respectively in the query, the code works fine
without any errors.
#
db_driver <- mydir$db_driver
db_jar_file <- mydir$db_jar_file
db_server <- mydir$db_server
db_server_lgn <- mydir$db_server_lgn
db_server_pwd <- mydir$db_server_pwd

library(RJDBC)
.jinit(classpath="myClasses.jar", parameters="-Xmx4096m")

drv <- JDBC(paste(db_driver,  sep = ""),
   paste(db_jar_file,  sep = ""),
   identifier.quote="`")

conn <- dbConnect(drv, paste(db_server,  sep = ""),
  paste(db_server_lgn,  sep = ""), 
  paste(db_server_pwd,  sep = ""))

start_dt <- as.Date('2010-11-01',format="%Y-%m-%d")
end_dt <- as.Date('2011-01-31',format="%Y-%m-%d")   

library(sqldf)
target_population <- dbGetQuery(conn,
"select distinct 
a.primary_customer_code as cust_id,
a.primary_product_code,
a.account_opening_date,
b.l4_product_hierarchy_code,
b.l5_product_hierarchy_code
from account_dim a,
product_dim b 
where a.primary_product_code=b.l5_product_hierarchy_code
and a.account_opening_date between start_dt and end_dt")


As it is not possible to reproduce the error with the above code, I am providing
a sample example below that uses the sqldf function on a data frame.

date_tm <- as.Date(c('2010-11-01', '2011-11-01','2010-12-01', '2011-01-01',
'2011-02-01'))
x1 <- c(1,2,3,4,5)
x2 <- c(100,200,300,400,500)

test_data <- data.frame(x1,x2,date_tm)

test_data

start_dt <- as.Date('2011-01-01',format="%Y-%m-%d") #Passing as parameter
end_dt <- as.Date('2011-02-28',format="%Y-%m-%d") #Passing as parameter

library(sqldf) 
new_data  <- 
sqldf("select *
from test_data
where date_tm  = start_dt")
It shows a similar error when the date is passed through the parameter start_dt:
(error in statement: no such column: start_dt)


~Ajit
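
A minimal sketch of one way to get the dates into the query text, assuming the database accepts quoted 'YYYY-MM-DD' literals as in the working hard-coded version. Names inside the SQL string are sent to the database as written, so start_dt is looked up as a column; the values have to be pasted into the string in R first. The shortened query is for illustration only.

```r
start_dt <- as.Date("2010-11-01")
end_dt   <- as.Date("2011-01-31")

# Substitute the formatted dates into the SQL text before it is sent.
query <- sprintf(
  "select distinct a.primary_customer_code as cust_id
     from account_dim a
    where a.account_opening_date between '%s' and '%s'",
  format(start_dt, "%Y-%m-%d"),
  format(end_dt, "%Y-%m-%d")
)
cat(query, "\n")

# target_population <- dbGetQuery(conn, query)   # needs a live connection
```

Building SQL by string substitution is reasonable only for trusted values such as dates created in the script itself; values that come from users call for parameterised queries.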

--
View this message in context: 
http://r.789695.n4.nabble.com/Passing-date-as-parameter-while-retrieving-data-from-database-using-dbGetQuery-tp4390216p4390216.html


[R] Issues while using “lift.chart” and “adjProbScore” function from ”BCA” library

2012-05-24 Thread aajit75
Dear List,

I have a couple of issues while using functions from the “BCA” library:

1. I am trying to use the “lift.chart” function from the “BCA” library, but it
fails for a model whose formula is passed to glm as a formula
object.

When the model formula is written out as text in the call, it works fine. In my
case the input variables and the target variable change dynamically, so I have
to use a formula object built at run time.

Below is sample code, taken from the package documentation, to illustrate the
issue.

library(BCA)
data(CCS)
CCS$Sample <- create.samples(CCS, est=0.4, val=0.4)
CCSEst <- CCS[CCS$Sample == "Estimation",]

#Fit glm model with formula written as text

CCS.glm <- glm(MonthGive ~ DonPerYear + LastDonAmt + Region + YearsGive,
family=binomial(logit), data=CCSEst)

CCSVal <- CCS[CCS$Sample == "Validation",]

lift.chart(c("CCS.glm"), data=CCSVal, targLevel="Yes",
trueResp=0.01, type="incremental", sub="Validation")


#Fit glm model with formula passed as formula object

fm <- as.formula("MonthGive ~ DonPerYear + LastDonAmt + Region + YearsGive")

CCS.glm12 <- glm(fm,family=binomial(logit), data=CCSEst)

lift.chart(c("CCS.glm12"), data=CCSVal, targLevel="Yes",
trueResp=0.01, type="incremental", sub="Validation")

The following error occurs:
Error in if (any(yvar1 != yvar1[1])) { : 
  missing value where TRUE/FALSE needed

Is there any way to use a formula object in the model and still use the
“lift.chart” function?

2. An issue using the “adjProbScore” function from the “BCA” library.

(adjProbScore(model="CCS.glm", data=CCSVal1, targLevel="Yes",
trueResp=0.01))

Error in parse(text = paste("as.character(", ActiveDataSet(), "$", yvar,  : 
  :1:16: unexpected '$'
1: as.character(  $
  ^
The above error is thrown. Am I doing anything wrong? Please correct me.
Also, as in case 1 above, can a model fitted with a formula object be used in the
“adjProbScore” function?

Thanks in advance!
Ajit


--
View this message in context: 
http://r.789695.n4.nabble.com/Issues-while-using-lift-chart-and-adjProbScore-function-from-BCA-library-tp4631158.html