Hi John,

Not a problem; I just wanted to be sure that there was no additional confounding 
due to these issues.

You may be aware that a subsetting operation that removes records from a data 
frame does not, by default, remove the now-unused levels from the factor that 
was filtered on:

iris.new <- subset(iris, Species == "setosa")

> str(iris.new)
'data.frame':   50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> levels(iris.new$Species)
[1] "setosa"     "versicolor" "virginica" 

> table(iris.new$Species)

    setosa versicolor  virginica 
        50          0          0 


You can see that Species retains all 3 original levels, even though only one is 
actually present in the records in the new data frame.
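
If that is the situation with your data, droplevels() in base R will remove 
the unused levels. For example:

iris.new <- droplevels(iris.new)

> table(iris.new$Species)

setosa 
    50 

The same can be done for a single column with 
iris.new$Species <- factor(iris.new$Species).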

Thus, the str() output below, which shows 3 levels for 'know_fin', may very 
well have been generated after your filtering down to 2 levels; the unused 
level would still be listed.

Regards,

Marc


> On Jul 27, 2017, at 9:14 AM, john polo <jp...@mail.usf.edu> wrote:
> 
> Marc,
> 
> Sorry for the lack of info on my part. Yes, I did use 'family = binomial' and 
> I did drop the 3rd level before running the model. I think the str(<subset>) 
> that I wrote into my original email might not have been my final step before 
> using glm. Thank you for reminding me of the potential problem.
> 
> I think Michael Friendly's idea is probably the solution I need to consider. 
> I am simplifying my factors a little bit and revising which ones I will keep.
> 
> 
> best,
> John
> 
> On 7/27/2017 8:54 AM, Marc Schwartz wrote:
>> Hi,
>> 
>> Late to the thread here, but I noted that your dependent variable 'know_fin' 
>> has 3 levels in the str() output below.
>> 
>> Since you did not provide a full c&p of your glm() call, we can only presume 
>> that you did specify 'family = binomial' in the call.
>> 
>> Is the dataset 'knowf3' the result of a subsetting operation, such that 
>> there are only two of the three levels of 'know_fin' retained in the records 
>> used in the glm() call, or are there actually 3 levels in the dataset used 
>> in the glm() call?
>> 
>> If the latter, that will of course be problematic; from a quick check here, 
>> glm(..., family = binomial) does not issue a warning or error in the case 
>> where the dependent variable has >2 levels.
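>> 
>> For example (a quick sketch): with a 3-level factor response, glm() silently 
>> treats the first level as 'failure' and collapses all remaining levels to 
>> 'success':
>> 
>> y <- factor(rep(c("0", "1", "2"), each = 10))
>> x <- rnorm(30)
>> fit <- glm(y ~ x, family = binomial)  ## runs with no warning or error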
>> 
>> Regards,
>> 
>> Marc Schwartz
>> 
>> 
>>> On Jul 27, 2017, at 8:26 AM, john polo <jp...@mail.usf.edu> wrote:
>>> 
>>> Michael,
>>> 
>>> Thank you for the suggestion. I will take your advice and look more 
>>> critically at the covariates.
>>> 
>>> John
>>> 
>>> On 7/27/2017 8:08 AM, Michael Friendly wrote:
>>>> Rather than go to a penalized GLM, you might be better off investigating 
>>>> the sources of quasi-perfect separation and simplifying the model to avoid 
>>>> or reduce it. In your data set you have several factors with large numbers 
>>>> of levels, making the data sparse across their combinations.
>>>> 
>>>> Like multicollinearity, near-perfect separation is a data problem, and is 
>>>> often better solved by careful thought about the model than by wrapping the 
>>>> data in a computationally intensive band-aid.
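>>>> 
>>>> One way to find the problem cells is to cross-tabulate the outcome against 
>>>> the high-cardinality factors; zero cells are candidates for 
>>>> (quasi-)separation. A quick sketch, using the names from the str() output 
>>>> below:
>>>> 
>>>> with(knowf3, table(know_fin, comp_grp2))
>>>> with(knowf3, table(know_fin, county))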
>>>> 
>>>> -Michael
>>>> 
>>>> On 7/26/2017 10:14 AM, john polo wrote:
>>>>> UseRs,
>>>>> 
>>>>> I have a data frame with 2547 rows and several hundred columns in R 3.1.3. 
>>>>> I am trying to run a small logistic regression with a subset of the data.
>>>>> 
>>>>> know_fin ~ 
>>>>> comp_grp2+age+gender+education+employment+income+ideol+home_lot+home+county
>>>>> 
>>>>>     > str(knowf3)
>>>>>     'data.frame':   2033 obs. of  18 variables:
>>>>>     $ userid    : Factor w/ 2542 levels "FNCNM1639","FNCNM1642",..: 1857 157 965 1967 164 315 849 1017 699 189 ...
>>>>>     $ round_id  : Factor w/ 1 level "Round 11": 1 1 1 1 1 1 1 1 1 1 ...
>>>>>     $ age       : int  67 66 44 27 32 67 36 76 70 66 ...
>>>>>     $ county    : Factor w/ 80 levels "Adair","Alfalfa",..: 75 75 75 75 75 75 64 64 64 64 ...
>>>>>     $ gender    : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 2 ...
>>>>>     $ education : Factor w/ 8 levels "1","2","3","4",..: 6 7 6 8 2 4 2 4 2 6 ...
>>>>>     $ employment: Factor w/ 9 levels "1","2","3","4",..: 8 4 4 4 3 8 5 8 4 4 ...
>>>>>     $ income    : num  550000 80000 90000 19000 42000 30000 18000 50000 800000 10000 ...
>>>>>     $ home      : num  0 0 0 0 0 0 0 0 0 0 ...
>>>>>     $ ideol     : Factor w/ 7 levels "1","2","3","4",..: 2 7 4 3 2 4 2 3 2 6 ...
>>>>>     $ home_lot  : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 3 3 1 2 ...
>>>>>     $ hispanic  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
>>>>>     $ comp_grp2 : Factor w/ 16 levels "Cr_Gr","Cr_Ot",..: 13 13 13 13 13 13 10 10 10 10 ...
>>>>>     $ know_fin  : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
>>>>> 
>>>>> 
>>>>> With the regular glm() function, I get a warning about "perfect or 
>>>>> quasi-perfect separation"[1]. I looked for a method to deal with this, and 
>>>>> a penalized GLM is an accepted approach[2], implemented in logistf() from 
>>>>> the logistf package. I used the default settings for the function.
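>>>>> 
>>>>> The call was along these lines (a sketch; logistf() with all defaults):
>>>>> 
>>>>> library(logistf)
>>>>> fit <- logistf(know_fin ~ comp_grp2 + age + gender + education + employment +
>>>>>                    income + ideol + home_lot + home + county, data = knowf3)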
>>>>> 
>>>>> Just before I run the model, memory.size() for my session is ~4500 (MB). 
>>>>> memory.limit() is ~25500. When I start the model, R immediately becomes 
>>>>> non-responsive. This is in a Windows environment and in Task Manager, the 
>>>>> instance of R is, and has been, using ~13% of CPU and ~4997 MB of RAM. 
>>>>> It's been ~24 hours now in that state and I don't have any idea of how 
>>>>> long this should take. If I run the same model in the same setting with 
>>>>> the base glm(), the model runs in about 60 seconds. Is there a way to 
>>>>> know if the process is going to produce something useful after all this 
>>>>> time or if it's hanging on some kind of problem?
>>>>> 
>>>>> 
>>>>> [1]: https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression#68917
>>>>> [2]: https://academic.oup.com/biomet/article-abstract/80/1/27/228364/Bias-reduction-of-maximum-likelihood-estimates
>>>>> 
> 
> 
> -- 
> Men occasionally stumble
> over the truth, but most of them
> pick themselves up and hurry off
> as if nothing had happened.
> -- Winston Churchill
> 

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
