*I asked this question on StackOverflow, but I think I will get better
results posting it here. We should use that platform more, so that Julia gets
more exposure among R/Python/Matlab users who need something like it.*
I have some data (from an R course assignment, but that doesn't matter) to
which I want to apply the split-apply-combine strategy, but I'm having some
problems. The data is in a DataFrame called hospitals, and each row
represents a hospital. Each column holds a piece of information about that
hospital, such as name, location, rates, etc.
*My objective is to obtain the Hospital with the lowest "Mortality by Heart
Attack Rate" of each State.*
I was playing around with some strategies, and ran into a problem using the
by function:

    # sort is ascending, so the first row has the lowest mortality
    best_heart_rate(df) = sort(df, cols = :Mortality)[1,:]
    best_hospitals = by(hospitals, :State, best_heart_rate)

The idea was to split the hospitals DataFrame by State, sort each of the
SubDataFrames by Mortality, take the row with the lowest rate, and combine
those rows into a new DataFrame.
But when I used this strategy, I got:

    ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
     in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
     in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
     in f at none:1
     in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
     in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202

I suppose the nrow function is not implemented for SubDataFrames for a good
reason, so I gave up on this strategy and used some nastier code instead:

    # sortperm is ascending by default, so row 1 has the lowest mortality
    best_heart_rate(df) = (df[sortperm(df[:,:Mortality]), :])[1,:]
    best_hospitals = by(hospitals, :State, best_heart_rate)

This seems to work. But now there is an NA problem: how can I remove the
rows of each SubDataFrame that have NA in the Mortality column? Is there a
better strategy to accomplish my objective?
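
For the NA part, one thing I am considering (not yet verified as the idiomatic way) is to drop the NA rows from the whole DataFrame with isna before grouping, rather than inside each SubDataFrame. Below is a minimal sketch on a toy DataFrame. The hospital names and values are made up for illustration, and I am assuming the DataFrames/DataArrays versions that go with Julia 0.3, where isna, @data and the df[:col] indexing are available:

```julia
using DataFrames, DataArrays

# Toy stand-in for the real data (hypothetical names and values)
hospitals = DataFrame(
    State     = ["AL", "AL", "TX", "TX"],
    Name      = ["Hospital A", "Hospital B", "Hospital C", "Hospital D"],
    Mortality = @data([14.3, NA, 12.1, 15.0])
)

# Keep only the rows whose Mortality value is not NA
clean = hospitals[!isna(hospitals[:Mortality]), :]

# Ascending sortperm, so the first row of each group has the lowest rate
best_heart_rate(df) = (df[sortperm(df[:, :Mortality]), :])[1, :]

# Split by State, pick the best row per group, combine into one DataFrame
best_hospitals = by(clean, :State, best_heart_rate)
```

Filtering once up front keeps best_heart_rate free of NA handling, but it also means a State whose hospitals all have NA would disappear from the result, which may or may not be what is wanted.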