midfield: > hi folks -- > > a haskell newbie here, searching for comments and wisdom on my code. > > i had a project to try to implement "external sort" in haskell as a > learning exercise. (external sort is sorting a list that is too large > to fit in main memory, by sorting in chunks, spooling to disk, and > then merging. more properly there probably should be multiple stages, > but for simplicity i'm doing a one-stage external sort.) > > the trick is the number of files can quickly grow very large, so it is > best to use one large file and seek inside it around. however as one > can imagine the order-of-IO-operations becomes a bit tricky, if you're > seeking file handles around underneath Data.ByteString.Lazy's nose. > but late this night after not thinking about it for a while i had a > brainstorm: rewrite hGetContents to keep the handle position in the > right place! it's all about judicious use of unsafeInterleaveIO..... > > it seems to be rather fast, strangely faster than the usual "sort" at > times. it also seems to have nice memory characteristics, though not > perfect. it's hard to test because the normal "sort" function takes > too much RAM on large lists, making my computer swap like mad.
I have to agree with Mr. Apfelmus here. This is lovely code. It is exactly what the ByteString team hoped people would be able to write ByteStrings: "Zen of Haskell" code, where you win by working at a high level, rather than a low level. Thanks! I've inserted some small comments though the source: > >module ExternalSort where > > Sort a list of Ords "offline." We're doing this to be able to sort > things without taking up too much memory (for example sorting lists > too large to fit in RAM.) Laziness is imperative, as is the > order-of-operations. > > >import Control.Monad > >import Data.List > >import qualified Data.Binary as Bin > >import qualified Data.ByteString.Lazy as B > >import qualified Data.ByteString as P (hGetNonBlocking, null) > >import Data.ByteString.Base (LazyByteString(LPS)) > >import Foreign.Storable (sizeOf) > >import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput, > > Handle, IOMode(ReadMode, WriteMode), > > SeekMode(AbsoluteSeek)) > >import System.IO.Unsafe (unsafeInterleaveIO) > > > >import qualified Data.Edison.Seq.ListSeq as LS > >import qualified Data.Edison.Coll.SplayHeap as Splay > > Conceptually, we sort a list in blocks, spool blocks to disk, then > merge back. However for IO performance it is better to read off > chunks of elements off the sorted blocks from disk instead of > elements-at-a-time. > > It would be better if these were in KBytes instead of # of elements. > > >blocksize :: Int > >blocksize = 10000 > > Turn a list into a list of chunks. > > >slice :: Int -> [a] -> [[a]] > >slice _ [] = [] > >slice size l = (take size l) : (slice size $ drop size l) That's unnecessary parenthesis, and I'd probably use splitAt here: myslice :: Int -> [a] -> [[a]] myslice _ [] = [] myslice n xs = a : myslice n b where (a,b) = splitAt n xs And just to check: *M> :m + Test.QuickCheck *M Test.QuickCheck> quickCheck (\n (xs :: [Int]) -> n > 0 ==> slice n xs == myslice n xs) OK, passed 100 tests. > > Turn a list into a list of blocks, each of which is sorted. > > >blockify :: (Ord a) => Int -> [a] -> [[a]] > >blockify bsize l = map sort $ slice bsize l Possibly you could drop the 'l' parameter: blockify n = map sort . slice n > > Serialize a block, returning the (absolute) position of the start. > > >dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer > >dumpBlock h b = do > > start <- hTell h > > B.hPut h $ Bin.encode b > > return start > > The actual sorting function. We blockify the list, turning it into a > list of sorted blocks, and spool to disk, keeping track of offsets. > We then read back the blocks (lazily!), and merge them. > > >externalSort [] = do return [] > >externalSort l = do > > h <- openFile "ExternalSort.bin" WriteMode > > idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l) idx <- mapM (dumpBlock h) (blockify blocksize l) > > hClose h > > h <- openFile "ExternalSort.bin" ReadMode > > blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x; > > return $ Bin.decode bs}) idx Possibly forM idx $ \x -> decode `fmap` hGetContentsWithCursor h x > > return (kMerge $ blocks) > > Merging chunks. K-way merge (and in fact external sort in general) is > detailed in Knuth, where he recommends tournament trees. The easiest > thing is to probably use one of Okasaki's heaps. I'll use splay > heaps, because I don't know any better. > > It would be better if I changed Ord for blocks to only check the first > element. > > >kMerge :: (Ord a) => [[a]] -> [a] > >kMerge [] = [] > >kMerge l = > > let h = Splay.fromSeq l in > > kM (Splay.minElem h) (Splay.deleteMin h) > > where > > kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a] > > kM l h > > | h == Splay.empty = l > > | otherwise = > > let next = Splay.minElem h > > (f, b) = span (\x -> x <= head next) l > > in > > f ++ (kM next (if null b then Splay.deleteMin h > > else (Splay.insert b $ Splay.deleteMin h))) > > > >kMergeSort :: (Ord a) => [a] -> [a] > >kMergeSort l = kMerge $ blockify blocksize l > > This is a version of hGetContents which resets its handle position > between reads, so is safe to use with interleaved handle seeking. > > >hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString > >hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize > > > >hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString > >hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS > > where > > lazyRead start = unsafeInterleaveIO $ loop start > > > > loop start = do > > hSeek h AbsoluteSeek start > > ps <- P.hGetNonBlocking h k > > --TODO: I think this should distinguish EOF from no data available > > -- the otherlying POSIX call makes this distincion, returning > > either > > -- 0 or EAGAIN > > if P.null ps > > then do eof <- hIsEOF h > > if eof then return [] > > else hWaitForInput h (-1) > > >> (loop start) > > else do > > pos <- hTell h > > pss <- lazyRead pos > > return (ps : pss) Very nice! > > > >defaultChunkSize :: Int > >defaultChunkSize = 32 * k - overhead > > where k = 1024 > > overhead = 2 * sizeOf (undefined :: Int) We'll export this value in bytestring 1.0. I like this code. Would you consider cabalising it, and uploading it to hackage.haskell.org, so we don't lose it? Perhaps just call it hsort or something? Cheers, Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe