Do you have cycles in the objects you're trying to save (like A->B->A)? I'm
not sure JLD handles cycles. If it doesn't, breaking the cycle with a custom
serializer will also solve the problem. (More ambitiously, one could also
solve the general cycle problem.)
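
A rough sketch of what breaking a cycle could look like, not taken from any
package: the `Node` and `NodeListSerializer` types and the `flatten`/`rebuild`
helpers are hypothetical names made up for illustration, and the syntax is for
current Julia rather than the 0.4-era code quoted below. The idea is to replace
object references with integer indices, so the object written to disk contains
no cycles. With JLD one would presumably return such a flat object from
`JLD.writeas` and rebuild the original in `JLD.readas`, as in the quoted
example.

```julia
# Hypothetical cyclic structure: each node may point at another node.
mutable struct Node
    name::String
    next::Union{Node,Nothing}
end

# Flat, cycle-free stand-in: node names plus, for each node, the index of
# its successor (0 means "no successor").
struct NodeListSerializer
    names::Vector{String}
    next_index::Vector{Int}
end

# Walk the chain from `start`, giving each node an index the first time it
# is seen; stop at the end of the chain or when a node repeats (a cycle).
function flatten(start::Node)
    index = IdDict{Node,Int}()
    nodes = Node[]
    node = start
    while node !== nothing && !haskey(index, node)
        push!(nodes, node)
        index[node] = length(nodes)
        node = node.next
    end
    names = [n.name for n in nodes]
    next_index = [n.next === nothing ? 0 : index[n.next] for n in nodes]
    return NodeListSerializer(names, next_index)
end

# Recreate the nodes first, then restore the links (including any cycle).
function rebuild(s::NodeListSerializer)
    nodes = [Node(name, nothing) for name in s.names]
    for (i, j) in enumerate(s.next_index)
        j == 0 || (nodes[i].next = nodes[j])
    end
    return nodes[1]
end

# Usage: build A -> B -> A, flatten it, and rebuild an equivalent copy.
a = Node("A", nothing)
b = Node("B", a)
a.next = b
flat = flatten(a)
a2 = rebuild(flat)
```

Only `flat` would need to go into the file; it holds plain vectors, so whether
or not the saver copes with cycles no longer matters.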

Best,
--Tim

On Sunday, January 24, 2016 12:31:33 PM Pedro Silva wrote:
> I did see that other post, but I really thought that this could be a
> different problem. The save function has been running for the past 20 hours
> without terminating. I am inexperienced with serializers but I will see
> what I can make of the code you posted. Thank you very much.
> 
> On Sunday, January 24, 2016 at 10:29:25 AM UTC-8, Tim Holy wrote:
> > Similar question here, asked just a couple of days ago (please do search
> > the archives first):
> > https://groups.google.com/d/msg/julia-users/VInJ4M-yNUY/Z6N8wCCfAwAJ
> > 
> > Someone should just add a serializer to the relevant random
> > forest/decision tree packages. These aren't hard to write, and there's an
> > example in the linked docs.
> > 
> > For reference, here's a more complicated example: in my own lab's code, we
> > use "tile trees" to represent sums over little pieces of images. They
> > combine QuadTrees/OctTrees (depending on spatial dimensionality) with
> > spatio-temporal factorizations. The main point is that these might seem
> > like fairly complicated data structures, yet the serializer and
> > deserializer can each be written in ~10 lines of code, and they gave me an
> > orders-of-magnitude performance improvement when saving/loading.
> > 
> > For reference, I've pasted the code below: it's not self-contained, but it
> > should give you the idea.
> > 
> > Best,
> > --Tim
> > 
> > # This contains info needed to reconstruct the BoxTree, but does not store
> > # the BoxTree itself
> > type TileTreeSerializer{TT<:Tile}
> >     tiles::Vector{TT}
> >     ids::Vector{Int}
> >     ntiles::Int
> >     dims::Dims
> >     Ts::Type
> >     Tel::Type
> >     K::Int
> >     W::Tuple
> > end
> > TileTrees.tiletype{TT}(::Type{TileTreeSerializer{TT}}) = TT
> > TileTrees.tiletype{TT}(::TileTreeSerializer{TT}) = TT
> > 
> > function JLD.readas(serdata::TileTreeSerializer)
> >     bt = boxtree(serdata.Ts, serdata.Tel, serdata.K, serdata.W,
> >                  dimspans(serdata.dims[1:end-1]))
> >     TT = tiletype(serdata)
> >     tiles = Array(TT, serdata.ntiles)
> >     for i = 1:length(serdata.tiles)
> >         id = serdata.ids[i]
> >         tile = serdata.tiles[i]
> >         tiles[id] = tile
> >         roi = boxroi(tile.spans, id)
> >         push!(bt, roi)
> >     end
> >     ttree = TileTree(tiles, bt, serdata.dims)
> > end
> > 
> > function JLD.writeas(ttree::TileTree)
> >     tiles = Array(tiletype(ttree), 0)
> >     ids = Int[]
> >     for (id, tile) in ttree
> >         push!(tiles, tile)
> >         push!(ids, id)
> >     end
> >     BT = boxtreetype(ttree)
> >     ST = splittype(BT)
> >     TileTreeSerializer{tiletype(ttree)}(
> >         tiles,
> >         ids,
> >         length(ttree.tiles),
> >         ttree.dims,
> >         ST,
> >         eltype(BT),
> >         splitk(BT),
> >         (splitwidth(BT)...))
> > end
> > 
> > On Sunday, January 24, 2016 02:15:50 AM Pedro Silva wrote:
> > > I've been training a lot of random forests on a really big dataset, and
> > > while saving my transformations of the data in JLD files has been a
> > > breeze, saving the Models and their respective details is not going
> > > smoothly. I'm experimenting with different sizes of trees and different
> > > numbers of parameters per tree, so I have 10 forests total, and since
> > > they take about 1 hour each to train I'd like to save them every 7
> > > iterations in case I have to shut down a machine. My code for the
> > > process is the following:
> > > 
> > > using HDF5, JLD, DataFrames, Distributions, DecisionTree, MLBase, StatsBase
> > > ...
> > > 
> > > num_of_trees = collect(10:10:100);
> > > num_of_features = collect(20:5:50);
> > > Models = Array{DecisionTree.Ensemble}(length(num_of_trees),length(num_of_features));
> > > Predictions = Array{Array{Float64,1}}(length(num_of_trees),length(num_of_features));
> > > RMSEs = Array{Float64}(length(num_of_trees),length(num_of_features));
> > > train = rand(Bernoulli(0.8), size(Y)) .== 1;
> > > 
> > > for i in 1:length(num_of_trees)
> > >     for j in 1:length(num_of_features)
> > >         Models[i,j] = build_forest(Y[train],DataSTD[train,:],num_of_features[j],num_of_trees[i]);
> > >         Predictions[i,j] = apply_forest(Models[i,j], DataSTD[!train,:]);
> > >         RMSEs[i,j] = root_mean_squared_error(Y[!train], Predictions[i,j]);
> > >         println("\n", Models[i,j])
> > >         println("Features: ",num_of_features[j])
> > >         println("RMSE: ",RMSEs[i,j])
> > >         display(confusion_matrix_regression(Y[!train],Predictions[i,j],10))
> > >     end
> > >     save("Models_run1.jld", "Models", Models, "Features", num_of_features,
> > >          "Predictions", Predictions, "RMSEs", RMSEs, "Bernoulli", train);
> > > end
> > > 
> > > Finishing the inner for loop takes around 7 hours, which is not a
> > > surprise, but the save function runs for hours as well. The file keeps
> > > slowly increasing in size, so I think something is happening but I'm not
> > > sure what. I'm still unable to get to a second iteration of my outer
> > > loop 3 hours after the inner loop has finished. I plan to leave it
> > > running overnight to see whether it fails or finishes. Any idea why this
> > > is happening?
