Here is another way. Have not tested for large scale efficiency, but if you convert dta to a data.table that might improve things.
library(dplyr) dta <- read.csv( text= "key_column,begin_date,end_date 123456,2013-01-01,2014-01-01 123456,2013-07-01,2014-07-01 789102,2012-03-01,2014-03-01 789102,2015-02-01,2016-02-01 789102,2015-02-06,2016-02-06 789102,2015-02-28,2015-03-31 789102,2015-04-30,2015-05-31 ", as.is=TRUE) ( dta %>% mutate( begin_date = as.Date( begin_date ), end_date = as.Date( end_date ) ) %>% arrange( key_column, begin_date ) ) -> dta mkgp <- function( begin_date, cend ) { ix <- c( TRUE, cend[ -length( begin_date ) ] < begin_date[ -1 ] ) cumsum( ix ) } result <- ( dta %>% group_by( key_column ) %>% mutate( cend = as.Date( cummax( as.numeric( end_date ) ) , origin="1970-01-01" ) , gp = mkgp( begin_date, cend ) ) %>% ungroup %>% group_by( key_column, gp ) %>% summarise( begin_date = begin_date[ 1 ] , end_date = cend[ length( cend ) ] ) %>% ungroup %>% select( -gp ) %>% as.data.frame ) --------------------------------------------------------------------------- Jeff Newmiller The ..... ..... Go Live... DCN:<jdnew...@dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/Batteries O.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --------------------------------------------------------------------------- Sent from my phone. Please excuse my brevity. On February 25, 2015 1:18:58 PM PST, Matt Gross <gro...@gmail.com> wrote: >Hi, > >I am trying to process a large dataset in R. The dataset contains the >following three columns: > >key_column - a unique key identifier >begin_date - the start date of the active period >end_date - the end date of the active period > > >Example data is here: > >key_column,begin_date,end_date >123456,2013-01-01,2014-01-01 >123456,2013-07-01,2014-07-01 >789102,2012-03-01,2014-03-01 >789102,2015-02-01,2016-02-01 >789102,2015-02-06,2016-02-06 > >I want to build a condensed table of key_column and begin_date's and >end_date's. As you can see in the example data above, some begin and >end >date periods overlap with begin_date and end_date pairs for the same >key_column. In situations where overlap exists I want to have one >record >for the key_column with the min(begin_date) and the max(end_date). > >Can anyone help me build the commands to process this data in R? > >Thanks, >Matt ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.