Hello Ross,

> On 30 May 2017, at 14:51, Ross Gayler <r.gay...@gmail.com> wrote:
> 
> I am after R package recommendations.
> 
> I have a data frame with ~5 million rows and ~50 columns. (I could do what
> I want with a sample of the rows, but ideally i would use all the rows.)

Very nice question for a mental gymnastics. First thing that comes in my mind 
is SPEED/PERFORMANCE when you mention ~5 million and _recursive_ words as you 
did. That’s why, the functions that have heavy work load are written in 
C/FORTRAN (i.e. partykit[1]). Most of the time, You are only limited to 
function arguments in R if you don’t want speed lost. For instance, lmtree 
function in partykit package decides the tree structure by its own partitioning 
model. So, you don’t have any authority (based on you are not a good 
programmer) to change the basic decision algorithm but only function arguments. 
Hence, the ready-to-go R functions only let you change function arguments but 
not the basic internal decision algorithm written in C/FORTRAN in backstage.

> 
> (1) I want to recursively partition the rows of the data frame in a way
> that I manually specify. That is, I want to generate a tree structure such
> that each node of the tree represents a subset of the rows of the data
> frame and the child nodes of any parent node represent a partition of the
> rows represented by the parent node. This is the sort of thing that tree
> induction algorithms like CART and ID3 do, but I want to manually specify
> the tree structure rather than have some algorithm decide it for me.

At this point, “manually specify” requires you write your own decision 
mechanism (seems as your only chance). Scaring part here is a matrix with 
~5milx50 cells and _recursive_ part. These are the first two of the things that 
R is not very good. Let’s create a very simple partitioning algorithm to see 
what we will encounter in the future.

part <- function(x, stop_rule = 1000) {
  set.seed(1)
  n <- round(runif(1, 1, nrow(x)))
  n1 <- 1:n; n2 <- (n + 1):nrow(x)
  part1 <- if (length(n1) < stop_rule)
    x[n1,] else part(x[n1,], stop_rule)
  part2 <- if (length(n2) < stop_rule)
    x[n2,] else part(x[n2,], stop_rule)
  return(list(p1 = part1, p2 = part2))
}

The _part_ function is a very simple partitioning algorithm by recursive 
function calling. It randomly parts a matrix till number of rows is less than 
1000. Only 10 lines of code. Very nice huh? 
Now let’s create a data to play.

x <- matrix(runif(5e6), nrow = 5e6, ncol = 50)

Wow. please check the size of x. 1.9 GB! (Anyway, I have 8 core + 16 GB huge 27 
inch desktop, I can handle it, I hope…) Now let’s apply the part function over 
x.

p <- part(x)

Please, note the execution time and watch the system memory! I used more than 
16GB RAM and elapsed time is 46 sec for a very simple partitioning. And check 
the size of _p_ (1.9 GB). This is not good, mate. _p_ is a list of parts 
containing other subnodes. Terminal node contains partitioned rows of x matrix. 
also note that I didn’t use any information hidden in x matrix in calculations. 
only random length. So, in real life, if you use pure R to part things, you 
will end up with very long running functions (I mean days).

I think I can increase the performance a bit. Let’s try!

node <- 0
partf <- function(x, stop_rule = 1000) {
  set.seed(1)
  n <- round(runif(1, 1, nrow(x)))
  n1 <- 1:n; n2 <- (n + 1):nrow(x)
  part1 <- if (length(n1) < stop_rule) {
    node <<- node + 1
    rep(node, length(n1))
  } else partf(x[n1,], stop_rule)
  part2 <- if (length(n2) < stop_rule) {
    node <<- node + 1
    rep(node, length(n2))
  } else partf(x[n2,], stop_rule)
  return(c(part1, part2))
}

_partf_ function is the same as _part_ except it returns only an index of 
partitions. So, I can get rid of 1.9 GB _p_ variable.  Run the code below:

pf <- partf(x)

> 
> (2) I want the means for specifying the tree structure to be as simple as
> possible, because the users will be trying out different tree structures.

Now, excecution was completed in 32 sec and size of _pf_ is only 38 Mb. I can 
use _pf_ to index x matrix and calculate required summary statistics that you 
mention in (2). But monitor your memory pressure at this stage. _partf_function 
use still huge amount of RAM to complete a simple calculation. Accept these 
examples as a start point and decide the future of the project in your mind. 

> 
> (3) Each node (internal or terminal) of the tree represents a row subset of
> the root data frame. I want to be able to specify a function to be applied
> to each node that takes the node data frame as input and calculates a set
> of summary statistics. I will probably write this node summary function as
> a dplyr pipeline. I will want to be able to associate the summaries with
> the nodes so that I keep track of the summaries in terms of the tree
> structure.
> 
> (4) I want to be able to print and plot the tree of summaries in a way that
> shows the summaries in the context of the tree structure. Inevitably, there
> will be fiddling with the formatting of the prints and plots, so I expect i
> will need user definable print/plot formatting functions that are applied
> to each node of the tree.


Item 3 and 4 will be up to your R programming skills. Because a ready-to-go 
package won’t help you much as you want to construct your own partitioning rule.

> 
> What I am looking for is an R package that provides the best starting point
> for me to implement this. I am not a particularly good programmer, so
> getting a package that minimises what I have to write is important to me.

If you still insist on, I suggest you start to learn C/C++ and Rcpp and 
friends. So, you can write your own partitioning algorithm, your own decision 
mechanism. And your functions will work super duper fast and will use 
incredibly less RAM. Also Rcpp has parallel support. So, you can run your code 
faster on multicore machines. I hope my explanations make things clear in your 
mind.

> 
> So far, the most likely packages appear to be:
> 
>   - partykit <http://partykit.r-forge.r-project.org/partykit/>
>   - data.tree <https://github.com/gluc/data.tree>
> 
> I would appreciate any recommendations for R packages that would serve as a
> good base; any comments on the relative merits of the packages for my
> purposes; and any pointers to example code of people doing similar things.
> 
> Thanks
> 
> Ross

1- https://r-forge.r-project.org/scm/viewvc.php/pkg/partykit/src/?root=partykit 
<https://r-forge.r-project.org/scm/viewvc.php/pkg/partykit/src/?root=partykit>



        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to