Hi,

Late to the thread here, but I noted that your dependent variable 'know_fin' 
has 3 levels in the str() output below.

Since you did not provide a full copy/paste of your glm() call, we can only 
presume that you specified 'family = binomial' in the call.

Is the dataset 'knowf3' the result of a subsetting operation, such that only 
two of the three levels of 'know_fin' are retained in the records used in the 
glm() call, or are all three levels actually present in the data used for the 
fit?
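
If the former, bear in mind that subsetting a data frame does not drop unused 
factor levels, so str() will still report 3 levels even if only two are present 
in the rows being fitted. A minimal sketch with toy data (not your dataset) 
showing the behaviour and the droplevels() fix:

    ## toy example: subsetting keeps empty factor levels
    df  <- data.frame(know_fin = factor(c("0", "1", "2", "1", "0")))
    sub <- df[df$know_fin != "2", , drop = FALSE]

    nlevels(sub$know_fin)    # still 3, although "2" has no rows left
    table(sub$know_fin)      # shows a zero count for level "2"

    sub$know_fin <- droplevels(sub$know_fin)
    nlevels(sub$know_fin)    # now 2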

If the latter, that will of course be problematic; from a quick check here, 
glm(..., family = binomial) does not issue a warning or error when the 
dependent variable has more than two levels.
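
For what it is worth, the kind of quick check I mean is below (toy data, not 
your model). With a factor response, the binomial family treats the first 
level as 'failure' and every other level as 'success', so a 3-level response 
is silently collapsed to two categories:

    ## toy example: glm(family = binomial) with a 3-level factor response
    set.seed(1)
    y <- factor(sample(c("0", "1", "2"), 100, replace = TRUE))
    x <- rnorm(100)

    fit  <- glm(y ~ x, family = binomial)            # no warning, no error
    fit2 <- glm(I(y != "0") ~ x, family = binomial)  # first level vs. the rest
    all.equal(coef(fit), coef(fit2))                 # TRUE: levels "1" and "2" pooled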

Regards,

Marc Schwartz


> On Jul 27, 2017, at 8:26 AM, john polo <jp...@mail.usf.edu> wrote:
> 
> Michael,
> 
> Thank you for the suggestion. I will take your advice and look more 
> critically at the covariates.
> 
> John
> 
> On 7/27/2017 8:08 AM, Michael Friendly wrote:
>> Rather than go to a penalized GLM, you might be better off investigating the 
>> sources of quasi-perfect separation and simplifying the model to avoid or 
>> reduce it.  In your data set you have several factors with large numbers of 
>> levels, making the data sparse for all their combinations.
>> 
>> Like multicollinearity, near-perfect separation is a data problem, and is 
>> often better solved by careful thought about the model, rather than wrapping 
>> the data in a computationally intensive band-aid.
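>> 
>> For instance (a sketch, assuming the 'knowf3' data and the factor names shown 
>> in the str() output below), cross-tabulating the response against the 
>> high-cardinality factors is one quick way to see where the empty cells are:
>> 
>>     ## zero cells in response-by-factor tables are candidate sources
>>     ## of (quasi-)separation
>>     xtabs(~ know_fin + comp_grp2, data = knowf3)
>>     xtabs(~ know_fin + county,    data = knowf3)
>> 
>>     ## count the empty cells for each candidate factor
>>     sapply(c("comp_grp2", "county", "education", "employment"),
>>            function(v) sum(table(knowf3$know_fin, knowf3[[v]]) == 0))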
>> 
>> -Michael
>> 
>> On 7/26/2017 10:14 AM, john polo wrote:
>>> UseRs,
>>> 
>>> I have a dataframe with 2547 rows and several hundred columns in R 3.1.3. I 
>>> am trying to run a small logistic regression with a subset of the data.
>>> 
>>> know_fin ~ 
>>> comp_grp2+age+gender+education+employment+income+ideol+home_lot+home+county
>>> 
>>>     > str(knowf3)
>>>     'data.frame':   2033 obs. of  18 variables:
>>>     $ userid    : Factor w/ 2542 levels "FNCNM1639","FNCNM1642",..: 1857 157 965 1967 164 315 849 1017 699 189 ...
>>>     $ round_id  : Factor w/ 1 level "Round 11": 1 1 1 1 1 1 1 1 1 1 ...
>>>     $ age       : int  67 66 44 27 32 67 36 76 70 66 ...
>>>     $ county    : Factor w/ 80 levels "Adair","Alfalfa",..: 75 75 75 75 75 75 64 64 64 64 ...
>>>     $ gender    : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 1 2 2 ...
>>>     $ education : Factor w/ 8 levels "1","2","3","4",..: 6 7 6 8 2 4 2 4 2 6 ...
>>>     $ employment: Factor w/ 9 levels "1","2","3","4",..: 8 4 4 4 3 8 5 8 4 4 ...
>>>     $ income    : num  550000 80000 90000 19000 42000 30000 18000 50000 800000 10000 ...
>>>     $ home      : num  0 0 0 0 0 0 0 0 0 0 ...
>>>     $ ideol     : Factor w/ 7 levels "1","2","3","4",..: 2 7 4 3 2 4 2 3 2 6 ...
>>>     $ home_lot  : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 3 3 1 2 ...
>>>     $ hispanic  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
>>>     $ comp_grp2 : Factor w/ 16 levels "Cr_Gr","Cr_Ot",..: 13 13 13 13 13 13 10 10 10 10 ...
>>>     $ know_fin  : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
>>> 
>>> 
>>> With the regular glm() function, I get a warning about "perfect or 
>>> quasi-perfect separation"[1]. I looked for a way to deal with this, and a 
>>> penalized GLM is an accepted method[2]. It is implemented in logistf(), and 
>>> I used the function's default settings.
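>>> 
>>> Roughly, the call was of this form (assuming the formula above, the knowf3 
>>> data, and the package defaults):
>>> 
>>>     library(logistf)   # Firth-type penalized logistic regression
>>>     fit <- logistf(know_fin ~ comp_grp2 + age + gender + education + employment +
>>>                        income + ideol + home_lot + home + county,
>>>                    data = knowf3)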
>>> 
>>> Just before I run the model, memory.size() for my session is ~4500 (MB). 
>>> memory.limit() is ~25500. When I start the model, R immediately becomes 
>>> non-responsive. This is in a Windows environment and in Task Manager, the 
>>> instance of R is, and has been, using ~13% of CPU and ~4997 MB of RAM. 
>>> It's been ~24 hours now in that state and I don't have any idea of how long 
>>> this should take. If I run the same model in the same setting with the base 
>>> glm(), the model runs in about 60 seconds. Is there a way to know if the 
>>> process is going to produce something useful after all this time or if it's 
>>> hanging on some kind of problem?
>>> 
>>> 
>>>   [1]: 
>>> https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression#68917
>>>  
>>>   [2]: 
>>> https://academic.oup.com/biomet/article-abstract/80/1/27/228364/Bias-reduction-of-maximum-likelihood-estimates
>>>  
>>> 
>>> 
>> 
