Dear Mustafa, I'd recommend the packages readxls to import the data, tidyr to transform the data into long format and dplyr to select the data.
1. read the data into R with read_excel() 2. transform sheet 1 into a long format with gather(). The result is one row for each patient / drug combination 3. select the relevant drugs in sheet 2 with filter() 4. join long sheet 1 and filtered sheet2 with inner_join() 5. summarise() the drug codes and names after group_by(patient_id) 6. count() the number of drug codes. Best regards, ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance Kliniekstraat 25 1070 Anderlecht Belgium To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey 2017-04-24 7:07 GMT+02:00 abo dalash <abo_d...@hotmail.com>: > Hi there > > I have data set with 500,000 patients (rows) and the first column is > Patient I'D Number, columns from 2 to 20 are Drug1,Drug 2,...,Drug 20 so > each row starts with the patient ID and the remaining of cells in the row > are codes for names of the treatments taken by the patient. Number of > treatments differ between patients. For example, there are patients with 3 > treatments only and patients with 20 drugs. The unique number of treatments > in the entire data set is about 6700 drugs. However, I'm interested in > studying only 128 drugs, these drugs are listed in a second sheet as code > numbers associated with their meanings (names of drugs representing the > code). I'm interested in identifying the most frequently used DRUG > COMBINATIONS between only the 128 drugs among the 6700 drugs. The structure > of the Excell file to be used in analysis is like this: > > -Sheet 1( the entire data set):- > 1 Patient ID Drug1 Drug2 .... Drug 20. > 2 1125 45 46 55 > 3 1126 60 55 45 > . > . > 500,000 > > -Sheet 2 (list of codes meanings for only the drugs of interest): > > 1 Drug code meaning > 2 45 Simvastatin > 3 55 Aspirin > 4 60 Paracetamol > . > 128 > > The desired output I'm looking for : > > Drug codes Meaning Frequency > 45+55 Simvastatin 2 > +Aspirin > 60+55 Aspirin+ 1 > Paracetamol > 60+45 Simvastatin+ 1 > Paracetamol > > Please note the the final output does not include any combination > containing drug 46 as this is not in the list of drugs preferred to be > studied which are mentioned in sheet 2. > > Could you please help me which R codes and packages should be used to run > this analyisis? > > Regards > Mustafa > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/ > posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.