Thanks for the advice. After taking most of it it is faster. But it is still many times slower than it ought to be! This algorithm should be much faster than simply sorting the list, and yet it is more than twice as slow!
One note, you said: """" > Increment length. > >> modifySTRef >> lengthRef >> (+1) This will create a big thunk for the length, you should use oldLength <- readSTRef lengthRef writeSTRef lengthRef $! oldLength + 1 (I'm not sure if a strict modifySTRef exists somewhere...)"""" Actually, replacing modifySTRef with that code is just a hair slower... Not sure why. I'm attaching the super optimized version. Along with the profiler output. I just can't understand what is slow here :( Timothy ---------- Původní zpráva ---------- Od: Felipe Almeida Lessa <felipe.le...@gmail.com> Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm "On Mon, Sep 3, 2012 at 11:18 AM, Felipe Almeida Lessa <felipe.le...@gmail.com> wrote: > Ditto for oldLen here. Also, you can simplify this lambda a lot: > > import Control.Applicative ((<$>)) > > \(oldLen, oldVal) -> > let newLen = oldLen + 1 > newVal = (number:) <$> oldVal > in newLen `seq` newVal `seq` (newLen, newVal) Or, using BangPatterns, \(oldLen, oldVal) -> let !newLen = oldLen + 1 !newVal = (number:) <$> oldVal in (newLen, newVal) Cheers, -- Felipe."
Mon Sep 3 20:54 2012 Time and Allocation Profiling Report (Final) medians +RTS -p -K400m -RTS total time = 7.40 secs (7397 ticks @ 1000 us, 1 processor) total alloc = 2,805,885,192 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc whichBucket Main 46.8 49.5 buckets.\ Main 35.0 30.1 buckets Main 7.7 4.6 buckets.\.bucket Main 5.8 5.7 someList.\ Main 2.7 5.7 someList Main 1.7 4.0 individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 55 0 0.0 0.0 100.0 100.0 CAF Main 109 0 0.0 0.0 100.0 100.0 someListMin Main 123 1 0.0 0.0 0.0 0.0 someListMax Main 122 1 0.0 0.0 0.0 0.0 someList Main 121 1 1.7 4.0 4.4 9.7 someList.\ Main 124 1001 2.7 5.7 2.7 5.7 someListNumBuckets Main 119 1 0.0 0.0 0.0 0.0 someListGuessedMiddle Main 118 1 0.0 0.0 0.0 0.0 someListMedian Main 111 1 0.0 0.0 95.6 90.3 median Main 112 1 0.3 0.4 95.6 90.3 median.stubLen Main 134 1 0.0 0.0 0.0 0.0 median.length Main 133 1 0.0 0.0 0.0 0.0 median.halfLength Main 132 1 0.0 0.0 0.0 0.0 median.(...) Main 116 1 0.0 0.0 95.3 89.9 buckets Main 117 1 7.7 4.6 95.3 89.9 buckets.\ Main 129 100 0.0 0.0 0.0 0.0 buckets.bottom Main 128 1 0.0 0.0 0.0 0.0 buckets.\ Main 125 2003001 35.0 30.1 87.6 85.3 buckets.\.bucket Main 126 2003001 5.8 5.7 52.6 55.2 whichBucket Main 127 2003001 46.8 49.5 46.8 49.5 buckets.\ Main 120 100 0.0 0.0 0.0 0.0 median.myBuckets Main 115 1 0.0 0.0 0.0 0.0 median.(...) Main 114 1 0.0 0.0 0.0 0.0 median.(...).\ Main 130 100 0.0 0.0 0.0 0.0 median.(...).\.nextI Main 131 51 0.0 0.0 0.0 0.0 median.medianBucketVals Main 113 1 0.0 0.0 0.0 0.0 main Main 110 1 0.0 0.0 0.0 0.0 CAF GHC.Conc.Signal 95 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.FD 91 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding 86 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Encoding.Iconv 72 0 0.0 0.0 0.0 0.0
>{-# LANGUAGE BangPatterns #-} So I've been playing with the median problem today. Not sure why, but it stuck in my head. >import Data.List >import Control.Monad.ST >import Data.STRef >import Control.Monad >import Control.Applicative ((<$>)) >import qualified Data.Vector.Mutable as V >import Debug.Trace I've been using the hashing algorithm that I described last night, but it's quite slow. I must be missing something obvious. It really shouldn't be slow at all! So I have a median function: >median First I assume you know the range of your data. > (min,max) Then I ask you to figure out how many buckets you want to create, and which of those buckets you actually want to fill. You should only fill the buckets near the middle, based on an educated guess of the distribution of your data, otherwise, you will end up wasting memory. I you guessed the middle correctly the function will return more quickly than if you didn't. > numBuckets guessedMiddle Then you are to pass it your list. > list = > let First I seperate out the values into a list of buckets, each one holding values which are near to eachother. I can figure out the length of the list at the same time, since I have to go through the whole thing anyways. > (myBuckets,length) = > buckets > (min,max) > numBuckets > guessedMiddle > list > halfLength = length `div` 2 The buckets are set up, so that the first bucket has the lowest values, and the last bucket has the highest values. Each bucket, is a tuple, of it's length and it's contents. We fold across the list of buckets, accumulating the "index so far", until we find the bucket in which the median must reside. > Right ((_,medianBucketVals),stubLen,medianBucketIx) = foo :: Either (Int,Int) ((a,Maybe [b]),a,Int) foo = > foldr > (\thisBucket@(thisBucketLen,_) > eitheriOrMedianBucket -> > case eitheriOrMedianBucket of > Left (i,bucketIx) -> > let > nextI = i + thisBucketLen > in > if nextI > halfLength > then Right (thisBucket, thisBucketLen-(halfLength - i),(bucketIx-1)) > else (Left (nextI, (bucketIx + 1))) > _ -> eitheriOrMedianBucket) > (Left (0,0)) > myBuckets > in We then sort the bucket in which the median must reside, and we then find the median the normal way. This should be faster, since we only had to sort one bucket, rather than the entire list. It is not, so there must be something horrible going on. > case medianBucketVals of > Just medianBucket' -> > sort medianBucket' `genericIndex` stubLen > Nothing -> > median > (min,max) > numBuckets > (medianBucketIx,medianBucketIx) > list Here is my actual function for seperating the values out into buckets. >buckets > (min,max) > numBuckets > (guessedMiddleStart,guessedMiddleEnd) > list = > runST $ do > lengthRef <- newSTRef 0 First we create a list of empty buckets. Since it would be a waste of memory to actually fill the buckets near the edges of our distribution(where we are not likely to find our median), our buckets contain Maybe lists, and the buckets which are outside of our guessed bucket range will be filled with Nothing. > buckets' <- V.new numBuckets > mapM > (\n-> > V.write > buckets' > n > (0, > if n >= guessedMiddleStart && n <= guessedMiddleEnd > then Just [] > else Nothing)) > [0..numBuckets-1] Then we go through the buckets, figuring out which bucket to put a given value into. We calculate the length at the same time. > let > bottom = > (fromIntegral (max - min)) > / > (fromIntegral (numBuckets-1)) > forM_ list $ \number -> do > let Figure out which bucket to put this into. > bucket = > whichBucket > (min,max) > bottom > numBuckets > number Increment length. > modifySTRef > lengthRef > (+1) However tests show that's not true. Put the value into the appropriate bucket. > (oldLen,oldVal) <- V.read > buckets' > bucket > let !newLen = oldLen + 1 > !newVal = (number:) <$> oldVal > V.write > buckets' > bucket > (newLen, newVal) > filledBuckets <- > mapM > (\index->V.read buckets' index) > [0..numBuckets-1] > length <- readSTRef lengthRef > return (filledBuckets,length) >whichBucket (min,max) bottom numBuckets number = > floor((fromIntegral (number - min)) > / > bottom) So I created a little test scenario. >someListNumBuckets = 100 >someListGuessedMiddle = (45,55) >someListLength = genericLength someList And found that this: >realMedian = sort someList `genericIndex` (someListLength `div` 2) Is actually faster ^_^ :O *Main> realMedian 500 (7.18 secs, 2031543472 bytes) Than this: >someListMedian = > median > (someListMin,someListMax) > someListNumBuckets > someListGuessedMiddle > someList *Main> someListMedian Just 500 (37.77 secs, 15376209200 bytes) someListMin :: Integer >someListMin = 0 someListMax :: Integer >someListMax = 1000 someList :: [Integer] >someList = > concatMap > (\n->intersperse n [someListMin..someListMax]) > [someListMax,someListMax-1..someListMin] >main :: IO () >main = print someListMedian
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe