On Wed, Dec 30, 2009 at 8:39 PM, Peter Green <[email protected]> wrote: > I'm a Haskell neophyte, so may be missing something obvious in the problem > outlined below. I'm fairly proficient in Python + have some limited > experience in OCaml and F#, so know just enough to be be dangerous, but not > nearly enough to really know what I'm doing here. > > OK, I have text files containing 'compressed lists' Compressed lists look > like this: > > 8+12,11,7+13,10 > 1+2+3,1+9,3+6,4 > . > . > > Sublists are comma-delimited, and sublist elements are separated by '+' > character. > > I parse these to look like so: > > [[[8,12],[11],[7,13],[10]], > [[1,2,3],[1,9],[3,6],[4]], > . > . > ] > > I need to explode these and produce a lex-sorted list of exploded lists: > > [[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4], > [8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]] > > I then output this data as comma-delimited lists: > > 1,1,3,4 > 1,1,6,4 > . > . > 12,11,13,10 > > I can do all this in my (no doubt fairly naive) program shown below. In real > life, one of my test data files contains compressed lists which all contain > 7 sublists. This data (correctly) explodes to a list containing ~3,700,000 > exploded lists. All good as far as correctly transforming input to output > goes. However, sorting the data uses a lot of memory: > > Partial output from ./explode +RTS -sstderr: > > 540 MB total memory in use (4 MB lost due to fragmentation) > > If I do not do any sorting on the exploded lists, i.e. modify the main > function to be > main = interact (unlines . toCSV . explode . fromCSV . lines) > > I then see this partial output from ./explode +RTS --stderr: > > 2 MB total memory in use (0 MB lost due to fragmentation) > > I can guess that there might be be less laziness and more instantiation when > sorting is introduced, but my questions are: > (1) Am I doing anything terribly stupid/naive here? > (2) If so, what can I do to improve space efficiency? > > TIA! > > > import Data.List (sort) > import Data.List.Split (splitOn) > > -- Cartesian Product over a List of Lists > -- cp [[1,2],[3],[4,5,6]] --> > [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]] > cp :: [[a]] -> [[a]] > cp [] = [[]] > cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss] > > -- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] --> > -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]] > fromCSV :: [String] -> [[[Int]]] > fromCSV = map parseOneLine > where parseOneLine = map parseGroup . splitOn "," > where parseGroup = map read . splitOn "+" > > -- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] --> > [[1,3,4],[1,3,5], > -- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],[2,15,16]] > explode :: [[[a]]] -> [[a]] > explode = concatMap cp > > -- toCSV [[8,11,7,10,12],[8,11,7,10,12],[8,11,7,10,12]] --> > -- ["8,11,7,10,12","8,11,7,10,12","8,11,7,10,12"] > toCSV :: (Show a) => [[a]] -> [String] > toCSV = map $ tail . init . show > --toCSV = map (intercalate "," . map show) > > main = interact (unlines . toCSV . sort . explode . fromCSV . lines) > > _______________________________________________ > Haskell-Cafe mailing list > [email protected] > http://www.haskell.org/mailman/listinfo/haskell-cafe >
I'm not sure about your problem specifically but the external-sort package on Hackage may be of interest. Alex _______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
