Dear R users,



I am dealing with a data set of approx. 5 million rows that contains data inconsistencies.



The data.frame has one observation per claim, with approximately 2 million unique IDs.



Furthermore, one individual can have one or more claims.



I have found that an individual may have all of his/her information in some but not all claims, as in example 1:



Id: 1
gender birthdate2
F               1994-01-28
<NA>
F               1994-01-28
F               1994-01-28
F               1994-01-28
F               1994-01-28

Or the individual may have all of his/her information, but with what appears to be a data entry mistake, as in the last row of the gender column in example 2:



Id: 2
gender birthdate2
F               2008-07-02
F               2008-07-02
F               2008-07-02
F               2008-07-02
F               2008-07-02
M               2008-07-02






These are two examples of the mixed situations that I have found.



I would like to fill in the missing information (example 1) or correct the information (example 2) by id.



I do not want to impute here; that will come later, for the values that are truly missing.
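
To make the request concrete, here is a minimal, reproducible sketch in base R of the kind of by-id fill/correction described above. It is only an illustration under stated assumptions: the names `claims`, `most_common`, `gender_fixed` and `birthdate_fixed` are made up for this example, the `<NA>` row of example 1 is assumed to be missing both gender and birthdate2, and "use the most frequent non-missing value within each id" is one possible rule, not necessarily the right one for the real data.

```r
# Toy data rebuilt from examples 1 and 2 (one row per claim).
# Assumption: in the <NA> row of id 1, both gender and birthdate2 are missing.
claims <- data.frame(
  id = rep(c(1L, 2L), each = 6),
  gender = c("F", NA, "F", "F", "F", "F",
             "F", "F", "F", "F", "F", "M"),
  birthdate2 = as.Date(c("1994-01-28", NA, rep("1994-01-28", 4),
                         rep("2008-07-02", 6))),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of x, returned as a character string.
# Returns NA if every value is missing, so truly missing ids stay missing.
# Ties are broken in favour of the value that sorts first.
most_common <- function(x) {
  counts <- table(as.character(x))  # table() drops NA by default
  if (length(counts) == 0L) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Apply the rule within each id; ave() recycles the single value per group.
claims$gender_fixed <- ave(claims$gender, claims$id, FUN = most_common)
claims$birthdate_fixed <- as.Date(
  ave(as.character(claims$birthdate2), claims$id, FUN = most_common)
)

print(claims)
```

The original columns are kept next to the "fixed" ones so that changed rows can be reviewed, e.g. with `subset(claims, is.na(gender) | gender != gender_fixed)`. I have not checked how well `ave()` performs with 5 million rows and 2 million ids.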



What would you recommend for working with this type of data management problem?



Thanks in advance,



Jose
                                          

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
