Re: [R] [FORGED] Regression with factors ?

David Winsemius Wed, 13 Jul 2016 08:03:33 -0700

> On Jul 13, 2016, at 6:48 AM, stn021 <[email protected]> wrote:
> 
> Hello,
> 
> so here is a numerical example in R-code. Code is appended below.
> 
> The output should be
> 1) the numerical values of the abilities of the persons
> 2) the multiplier
> 
> 
> Please note that
> 
> 1) I have used non-linear optimization to solve this problem and got
> the expected result, though not with R but other software.
> 
> 2) I have applied lm() to this problem, even before I posted the
> question. I am well aware of the syntax of formulas. In my last posting
> I wrote the formula "freehand" so I made the previously mentioned
> errors. Sorry about that.
> 
> 
> 
> Unfortunately the formulas with I() as well as multiplying variables
> before running R do not work here. I() does not apply to factors (R
> tells me) and multiplying in advance also works only for continuous
> variables, not for factors, because there is no known numerical value
> to multiply.
> 
> The latter is actually what my question is about, along with the
> question on how to get R to treat two columns as two instances of the
> same factor.
> 
> 
> Just to be sure I used R to check if the data really counts as a
> factor according to R-terminology. It really is a factor, see code
> below.
> 
> 
> 
> This is the code for generating the example-data:
> 
> # --------------------------------------------------------------- #
> pnames    = c( "alice" , "bob" , "charlie" , "don" , "eve" , "freddy"
> , "grace" , "henry" )
> pcount    = length( pnames )
> 
> # abilities = runif( pcount )
> abilities = (1:pcount) / 10
> 
> persons = data.frame( name = pnames , ability = abilities ,
>                       stringsAsFactors = TRUE )  # default only in R < 4.0
> persons
> 
> # random subset of possible combinations and extra df
> combinations = combn( nrow( persons ) , 2 ) ;
> combinations = cbind( combinations,combinations,combinations,combinations )
> combinations = combinations[ , runif(ncol(combinations))<0.5 ]
> ccount = ncol( combinations )
> 
> observed_data = data.frame(
>  idx1 = combinations[1,]
> , idx2 = combinations[2,]
> , p1 = ( persons$name[    combinations[1,] ] )
> , p2 = ( persons$name[    combinations[2,] ] )
> )
> 
> abilities_data = data.frame(
>  a1 = persons$ability[ combinations[1,] ]
> , a2 = persons$ability[ combinations[2,] ]
> )
> 
> # y = result of cooperation of each pair
> multiplyer = runif(1) + 1
> offset     = 1
> cat( "multiplyer = " , multiplyer , "\n" )
> cat( "offset = " , offset , "\n" )
> 
> y0 = multiplyer * ( offset - ( abilities_data$a1 - abilities_data$a2 ) ^ 2 )
> noise = .05 * rnorm( ccount )
> 
> # check variables are really factors :
> str(  observed_data$p1 )
> dput( observed_data$p1 )
> 
> observed_data = data.frame( y = round( y0+noise,3 ) , observed_data )
> observed_data
> 
> # --------------------------------------------------------------- #


Is this what is intended?

> observed_data$p1ab <- persons$ability[ match(observed_data$p1, persons$name) ]
> observed_data$p2ab <- persons$ability[ match(observed_data$p2, persons$name) ]
> head(observed_data)
      y idx1 idx2    p1      p2 p1ab p2ab
1 1.149    1    6 alice  freddy  0.1  0.6
2 1.006    1    7 alice   grace  0.1  0.7
3 1.529    2    3   bob charlie  0.2  0.3
4 1.404    2    5   bob     eve  0.2  0.5
5 1.205    2    6   bob  freddy  0.2  0.6
6 1.187    2    7   bob   grace  0.2  0.7


> lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)

Call:
lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)

Coefficients:
       (Intercept)  I((p1ab - p2ab)^2)  
             1.506              -1.435  

>  separate_term <- lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)
> summary(separate_term)

Call:
lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.116249 -0.030996  0.002633  0.032765  0.136282 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.50589    0.01067  141.08   <2e-16 ***
I((p1ab - p2ab)^2) -1.43527    0.05863  -24.48   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05304 on 44 degrees of freedom
Multiple R-squared:  0.9316,    Adjusted R-squared:   0.93 
F-statistic: 599.2 on 1 and 44 DF,  p-value: < 2.2e-16
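
For what it is worth, these two coefficients map back onto the parameters of the generating model y = f * ( o - (p1 - p2)^2 ). Expanding gives y = f*o - f*(p1 - p2)^2, so the slope estimates -f and the intercept estimates f*o. A small sketch with the printed estimates typed in by hand (f_hat and o_hat are names made up here for illustration, not anything from the original post):

```r
# Coefficients copied from the summary() output above
intercept <- 1.50589    # estimates f * o
slope     <- -1.43527   # estimates -f

f_hat <- -slope              # estimated multiplier
o_hat <- intercept / f_hat   # estimated offset

cat("multiplier ~", round(f_hat, 3), "\n")
cat("offset     ~", round(o_hat, 3), "\n")
```

With the fitted object at hand, coef(separate_term) would supply the same two numbers without retyping them.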

You could also have compared 2 models differing only with respect to the
inclusion of an interaction term that was the squared difference in abilities:

> full <- lm( y ~ p1ab + p2ab + I( (p1ab -p2ab)^2 ), data=observed_data)
> reduced <- lm( y ~ p1ab + p2ab , data=observed_data)
> anova(full,reduced)
Analysis of Variance Table

Model 1: y ~ p1ab + p2ab + I((p1ab - p2ab)^2)
Model 2: y ~ p1ab + p2ab
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     42 0.11823                                  
2     43 0.17315 -1  -0.05492 19.509 6.892e-05 ***
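
The fits above use the known abilities from `persons`. If the abilities themselves have to be estimated from the names alone, one possible route is direct least squares with optim(). What follows is only a sketch under stated assumptions, not a tested recipe for the real data. Writing c0 = f*o and b = sqrt(f) * ability turns the model into y = c0 - (b[p1] - b[p2])^2, which shows that f cannot be separated from the scale of the abilities, and that b is only determined up to a common shift and a sign flip. The sketch therefore pins the first person's b at 0. The toy data reuse the four people from the original question (abilities 0.2, 0.4, 0.6, 0.8, with f = 1 and o = 1); the names `toy`, `sse`, `c0_hat` and `b_hat` are illustrative, and the start values are an arbitrary guess.

```r
# Toy data: abilities consistent with the four observations in the
# original question (Alice = 0.2, Bob = 0.4, ...), f = 1, o = 1, no noise.
lev <- c("Alice", "Bob", "Charlie", "Dave")
a   <- c(0.2, 0.4, 0.6, 0.8)
idx <- t(combn(length(lev), 2))          # all pairs, one row per pair
toy <- data.frame(
  p1 = factor(lev[idx[, 1]], levels = lev),  # same level set for
  p2 = factor(lev[idx[, 2]], levels = lev)   # both columns
)
toy$y <- 1 * (1 - (a[idx[, 1]] - a[idx[, 2]])^2)

# Residual sum of squares for y = c0 - (b[p1] - b[p2])^2.
# par = c(c0, b for levels 2..n); b for level 1 is pinned at 0.
sse <- function(par, y, i, j) {
  b <- c(0, par[-1])
  sum((y - (par[1] - (b[i] - b[j])^2))^2)
}

# Integer codes of both factors index ONE shared ability vector,
# which is how the two columns become instances of the same factor.
i <- as.integer(toy$p1)
j <- as.integer(toy$p2)

start <- c(0.5, 0.1, 0.2, 0.3)   # arbitrary guess; results may depend on it
fit   <- optim(start, sse, y = toy$y, i = i, j = j, method = "BFGS")

c0_hat <- fit$par[1]
b_hat  <- setNames(c(0, fit$par[-1]), lev)
print(fit$convergence)   # 0 means optim reports convergence
print(round(c0_hat, 3))
print(round(b_hat, 3))
```

This kind of objective can have local minima, so trying several start vectors and comparing fit$value seems prudent. Standard errors do not come for free here; bootstrapping, as suggested earlier in the thread, would be one option.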


-- 
David

> 
> 
> 2016-07-11 19:16 GMT+02:00 Jeff Newmiller <[email protected]>:
>> Your clarification is promising.  A reproducible example is always 
>> preferred, though never a guarantee. I expect to be somewhat preoccupied 
>> this week so responses may be rather delayed, but the less setup we have to
>> do, the more likely that someone on the list will tackle it.
>> 
>> Re an answer: If you can make the example simple enough that you can tell us 
>> what the right numerical result will be, we will have a better chance of 
>> understanding what you are after.  E.g. if you start with a solution and use 
>> it to create sample input data, then you don't need to actually solve it
>> to illustrate what you are after. [1]
>> 
>> Note that I am not aware of any package dedicated to this type of problem, 
>> so unless someone else responds otherwise then you will likely have to use 
>> bootstrapping or your own statistical analysis (Bayesian?) of the result.
>> 
>> [1] 
>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> --
>> Sent from my phone. Please excuse my brevity.
>> 
>> On July 11, 2016 7:28:41 AM PDT, stn021 <[email protected]> wrote:
>>> Hello,
>>> 
>>> thank you for the replies. Sorry about the html-email, I forgot.
>>> Should be OK with this email.
>>> 
>>> 
>>> Don't be fooled by the apparent simplicity of the problem. I have
>>> tried to reduce it to only a single relatively simple question.
>>> 
>>> The idea here is to model cooperation of two persons. The model is
>>> about one specific aspect of that cooperation, namely that two persons
>>> with similar abilities may be able to produce better results than two
>>> very different persons.
>>> 
>>> That is only one part of the model with other parts modeling for
>>> example the fact that of course two persons with a higher degree of
>>> ability will produce better results per se.
>>> 
>>> 
>>> It is not classic regression with factors. That can be easily done by
>>> something like lm( y ~ (p1-p2)^2 ).
>>> 
>>> This expands to lm( y ~ p1^2 - 2*p1*p2 + p2^2 ). This contains
>>> multiplications and for lm() this implies interactions between the
>>> factor-levels and produces one parameter for each combination of
>>> factor-levels that occurs in the data. That is not what the question
>>> is about.
>>> 
>>> Also p1 and p2 are different levels of the same factor, while for lm()
>>> it would be two different factors with different levels.
>>> 
>>> 
>>> As for the sensical part: this has a real world application therefore
>>> it makes sense.
>>> 
>>> Also it is not so difficult to solve with non-linear optimization. I
>>> was hoping to be able to use R for that purpose because then the
>>> results could easily be checked with statistical tests.
>>> 
>>> So my question is not "how to solve" but "how to solve with R".
>>> 
>>> 
>>> As for the excess degrees of freedom, in real observations there would
>>> of course be added noise due to either random variations or factors
>>> not included in the model. So to generate a more reality-conforming
>>> example I could add some random normal-distributed noise to the
>>> dependent variable y. I previously left that part out because to me it
>>> did not seem relevant.
>>> 
>>> 
>>> Would you like me to make a complete example dataset with more records
>>> and noise ?
>>> 
>>> 
>>> The answer I look for would be the numerical values of the
>>> factor-levels and numerical values for the multiplier (f) and the
>>> offset (o), with p1 and p2 given as names (here: persons) and y given
>>> as some level of achievement they reach by cooperating.
>>> 
>>> y = f * ( o - ( p1 - p2 )^2 )
>>> 
>>> Is that what you meant by "answer" ?
>>> 
>>> 
>>> THX
>>> stefan
>>> 
>>> 
>>> 
>>> 
>>> 2016-07-10 2:27 GMT+02:00 Jeff Newmiller <[email protected]>:
>>>> 
>>>> I have seen less sensical questions.
>>>> 
>>>> It would be nice if the example were a bit more complete (as in it
>>>> should have excess degrees of freedom and an answer) and less like a
>>>> homework problem (which are off topic here). It would of course also be
>>>> helpful if the OP were to conform to the Posting Guide, particularly in
>>>> respect to using plain text email.
>>>> 
>>>> It looks like the kind of nonlinear optimization problem that
>>>> evolutionary algorithms are often applied to. It doesn't look (to me)
>>>> like a typical problem that factors get applied to in formulas though,
>>>> because multiple instances of the same factor variable are present.
>>>> --
>>>> Sent from my phone. Please excuse my brevity.
>>>> 
>>>> On July 9, 2016 4:59:30 PM PDT, Rolf Turner <[email protected]> wrote:
>>>>> On 09/07/16 20:52, stn021 wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I would like to analyse a model like this:
>>>>>> 
>>>>>> y = 1 *  ( 1 - ( x1 - x2 )  ^ 2   )
>>>>>> 
>>>>>> x1 and x2 are not continuous variables but factors, so the
>>>>>> observations contain the level.
>>>>>> Its numerical value is unknown and is to be estimated with the model.
>>>>>> 
>>>>>> 
>>>>>> The observations look like this:
>>>>>> 
>>>>>> y        x1     x2
>>>>>> 0.96  Alice  Bob
>>>>>> 0.84  Alice  Charlie
>>>>>> 0.96  Bob   Charlie
>>>>>> 0.64  Dave Alice
>>>>>> etc.
>>>>>> 
>>>>>> Each person has a numerical value. Here for example Alice = 0.2 and
>>>>>> Bob = 0.4
>>>>>> 
>>>>>> Then y = 0.96 = 1* ( 1- ( 0.2-0.4 ) ^ 2 ) , see first observation.
>>>>>> 
>>>>>> How can this be done in R ?
>>>>> 
>>>>> 
>>>>> This question makes about as little sense as it is possible to imagine.
>>>>> 
>>>>> cheers,
>>>>> 
>>>>> Rolf Turner
>>>> 
>> 
> 
> ______________________________________________
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

