Re: [R] Complicated analysis for huge databases

David Winsemius Sat, 18 Nov 2017 12:07:47 -0800

> On Nov 18, 2017, at 1:52 AM, Allaisone 1 <allaiso...@hotmail.com> wrote:
> 
> Although the loop seems to be formulated correctly I wonder why
> it gives me these errors :
> 
> -object 'i' not found
> - unexpected '}' in "}"


You probably did not copy the entire code offered. But we cannot know since you 
did not "show your code", nor did you post complete error messages. Both of 
these practices are strongly recommended by the Posting Guide. Please read it 
(again?).
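
Another possible cause is visible in the loop as quoted further down: the 
for() header is missing a closing parenthesis after 
length(SeparatedGroupsofmealsCombs). Both messages are consistent with that, 
since the header fails to parse and the body and the closing brace are then 
evaluated on their own. Even with the parenthesis restored, "i in length(x)" 
would visit only the last element, so seq_along() is the usual idiom. Below 
is a minimal sketch of what the loop was presumably meant to do. The 
maf_stub() function and the six-row toy data are my assumptions, standing in 
for the maf() and MyData that were never shown, so adapt rather than copy.

```r
# Hypothetical stand-in for the poster's maf(), which was never posted.
# It takes the counts of the values 0, 1 and 2 and returns the smaller of
# the two allele frequencies. Replace it with the real maf().
maf_stub <- function(counts) {
  n <- sum(counts)
  p <- (counts[2] + 2 * counts[3]) / (2 * n)
  min(p, 1 - p)
}

# Toy data copied from the first six rows shown in the original question.
MyData <- data.frame(
  MealA   = c(33, 33, 33, 44, 44, 44),
  MealB   = c(55, 55, 55, 66, 66, 66),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I       = c(0, 1, 2, 0, 1, 2),
  II      = c(1, 0, 1, 2, 1, 0),
  III     = c(2, 2, 1, 2, 0, 1),
  IV      = c(0, 2, 2, 2, 1, 2)
)

# Columns of interest (an assumption; in the real data this would come
# from the ColOfinterest sheet).
cols <- c("I", "II", "III", "IV")

# Split only the value columns, so that apply()/sapply() never sees the
# meal codes or the customer IDs.
groups <- split(MyData[cols], paste(MyData$MealA, MyData$MealB, sep = "-"))

AllMAFs <- vector("list", length(groups))
names(AllMAFs) <- names(groups)

# seq_along() iterates 1, 2, ..., length(groups); note the balanced
# parentheses in the for() header.
for (i in seq_along(groups)) {
  AllMAFs[[i]] <- sapply(
    groups[[i]],
    # nbins = 3 keeps the count vector at length 3 even when a column
    # contains no 2s.
    function(x) maf_stub(tabulate(x + 1, nbins = 3))
  )
}

# One row per meal combination, one column per value column.
result <- do.call(rbind, AllMAFs)
print(result)
```

With 150-200 groups and a few hundred columns the result is a modest numeric 
matrix, so storage should not be a concern.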

-- 
David.
> 
> 
> The desired output is expected to be very large, as for each dataframe in the 
> list of dataframes I expect to see a maf value for each of the 600 columns! And 
> this is only for one dataframe in the list .. I have around 150-200 
> dataframes.. not sure how R will store these results.. but first I need the 
> analysis to be done correctly. The final output has to be something like this :-
> 
> 
>> mafsforeachcolumns(I,II,...600)foreachcombination
> 
>      MealsCombinations   Cust.ID   I       II     III    IV      ...... 600
> 1    33-55               1         0.124   0.10   0.65   0.467
>                          3
>                          5
> 
> 2    44-66               7         0.134   0.43   0.64   0.479
>                          4
>                          9
> 
> .
> 
> .
> 
> ~180 dataframes
> 
> 
> ________________________________
> From: Boris Steipe <boris.ste...@utoronto.ca>
> Sent: 18 November 2017 00:35:16
> To: Allaisone 1; R-help
> Subject: Re: [R] Complicated analysis for huge databases
> 
> Something like the following?
> 
> AllMAFs <- list()
> 
> for (i in length(SeparatedGroupsofmealsCombs) {
>  AllMAFs[[i]] <- apply( SeparatedGroupsofmealsCombs[[i]], 2, function(x)maf( 
> tabulate( x+1) ))
> }
> 
> 
> (untested, of course)
> Also the solution is a bit generic since I don't know what the output of 
> maf() looks like in your case, and I don't understand why you use tabulate 
> because I would have assumed that's what maf() does - but that's not for me 
> to worry about :-)
> 
> 
> 
> B.
> 
> 
> 
>> On Nov 17, 2017, at 7:15 PM, Allaisone 1 <allaiso...@hotmail.com> wrote:
>> 
>> 
>> Thanks Boris , this was very helpful but I'm struggling with the last part.
>> 
>> 1) I combined the first 2 columns :-
>> 
>> 
>> library(tidyr)
>> SingleMealsCode <-unite(MyData, MealsCombinations, c(MealA, MealB), 
>> remove=FALSE)
>> SingleMealsCode <- SingleMealsCode[,-2]
>> 
>> 2) I separated this dataframe into different dataframes based on the 
>> "MealsCombinations" column so R will recognize each meal combination 
>> separately :
>> 
>> SeparatedGroupsofmealsCombs <- split(SingleMealsCode, SingleMealsCode$MealsCombinations)
>> 
>> After investigating the structure of "SeparatedGroupsofmealsCombs", I can 
>> see a list of different dataframes, each of which represents a different 
>> meal combination, which is great.
>> 
>> Now I'm struggling with the last part: how can I run the maf code for all 
>> dataframes?
>> 
>> when I run this code as before :-
>> 
>> maf <- apply(SeparatedGroupsofmealsCombs, 2, function(x)maf(tabulate(x+1)))
>> 
>> an error message says : dim(X) must have a positive length . I'm not sure 
>> which length
>> I need to specify.. any suggestions to correct this syntax ?
>> 
>> Regards
>> Allaisone
>> From: Boris Steipe <boris.ste...@utoronto.ca>
>> Sent: 17 November 2017 21:12:06
>> To: Allaisone 1
>> Cc: R-help
>> Subject: Re: [R] Complicated analysis for huge databases
>> 
>> Combine columns 1 and 2 into a column with a single ID like "33.55", "44.66" 
>> and use split() on these IDs to break up your dataset. Iterate over the list 
>> of data frames split() returns.
>> 
>> 
>> B.
>> 
>>> On Nov 17, 2017, at 12:59 PM, Allaisone 1 <allaiso...@hotmail.com> wrote:
>>> 
>>> 
>>> Hi all ..,
>>> 
>>> 
>>> I have a large dataset of around 600,000 rows and 600 columns. The first 
>>> column holds codes for Meal A, the second column holds codes for Meal B. 
>>> The third column is customer IDs, where each customer had a combination of 
>>> meals. Each of the remaining columns contains the values 0, 1, or 2. The 
>>> dataset is organised so that the first group of customers had the same 
>>> meal combination; this is followed by another group of customers with the 
>>> same meal combination, but different from the first group, and so on. The 
>>> dataset looks like this :-
>>> 
>>> 
>>>> MyData
>>> 
>>>      Meal A   Meal B   Cust.ID   I   II   III   IV   ...... 600
>>> 
>>> 1    33       55       1         0   1    2     0
>>> 2    33       55       3         1   0    2     2
>>> 3    33       55       5         2   1    1     2
>>> 4    44       66       7         0   2    2     2
>>> 5    44       66       4         1   1    0     1
>>> 6    44       66       9         2   0    1     2
>>> 
>>> .
>>> 
>>> .
>>> 
>>> 600,000
>>> 
>>> 
>>> 
>>> I wanted to find maf() for each column(from 4 to 600) after calculating the 
>>> frequency of the 3 values (0,1,2) but this should be done group by group 
>>> (i.e. group(33-55) : rows 1:3 then group(44-66) :rows 4:6 and so on).
>>> 
>>> 
>>> I can do the analysis  for the entire column but not group by group like 
>>> this :
>>> 
>>> 
>>> MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1)))
>>> 
>>> How can I modify this code to tell R to do the analysis group by group for 
>>> each column, so I get the maf value for the 33-55 group of column I, then 
>>> the maf value for group 44-66 in the same column I, then the rest of the 
>>> groups in this column, and do the same for the remaining columns?
>>> 
>>> In fact, I'm interested in doing this analysis for only 300 columns, not 
>>> all of the 600 columns.
>>> I have another sheet that contains the names of the columns of interest, 
>>> like this :
>>> 
>>>> ColOfinterest
>>> 
>>> Col
>>> I
>>> IV
>>> V
>>> .
>>> .
>>> 300
>>> 
>>> Could anyone help with the best combination of syntax to perform this 
>>> complex analysis?
>>> 
>>> Regards
>>> Allaisone
>>> 
>>> 
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
> 

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   
-Gehm's Corollary to Clarke's Third Law

