midfield:
> hi folks --
> 
> a haskell newbie here, searching for comments and wisdom on my code.
> 
> i had a project to try to implement "external sort" in haskell as a
> learning exercise.  (external sort is sorting a list that is too large
> to fit in main memory, by sorting in chunks, spooling to disk, and
> then merging.  more properly there probably should be multiple stages,
> but for simplicity i'm doing a one-stage external sort.)
> 
> the trick is the number of files can quickly grow very large, so it is
> best to use one large file and seek inside it around.  however as one
> can imagine the order-of-IO-operations becomes a bit tricky, if you're
> seeking file handles around underneath Data.ByteString.Lazy's nose.
> but late this night after not thinking about it for a while i had a
> brainstorm: rewrite hGetContents to keep the handle position in the
> right place!  it's all about judicious use of unsafeInterleaveIO.....
> 
> it seems to be rather fast, strangely faster than the usual "sort" at
> times.  it also seems to have nice memory characteristics, though not
> perfect.  it's hard to test because the normal "sort" function takes
> too much RAM on large lists, making my computer swap like mad.

I have to agree with Mr. Apfelmus here. This is lovely code. It is exactly
what the ByteString team hoped people would be able to write
ByteStrings: "Zen of Haskell" code, where you win by working at a high
level, rather than a low level. 

Thanks!

I've inserted some small comments though the  source:

> >module ExternalSort where
> 
> Sort a list of Ords "offline."  We're doing this to be able to sort
> things without taking up too much memory (for example sorting lists
> too large to fit in RAM.)  Laziness is imperative, as is the
> order-of-operations.
> 
> >import Control.Monad
> >import Data.List
> >import qualified Data.Binary as Bin
> >import qualified Data.ByteString.Lazy as B
> >import qualified Data.ByteString as P (hGetNonBlocking, null)
> >import Data.ByteString.Base (LazyByteString(LPS))
> >import Foreign.Storable (sizeOf)
> >import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput,
> >                  Handle, IOMode(ReadMode, WriteMode),
> >                  SeekMode(AbsoluteSeek))
> >import System.IO.Unsafe (unsafeInterleaveIO)
> >
> >import qualified Data.Edison.Seq.ListSeq as LS
> >import qualified Data.Edison.Coll.SplayHeap as Splay
> 
> Conceptually, we sort a list in blocks, spool blocks to disk, then
> merge back.  However for IO performance it is better to read off
> chunks of elements off the sorted blocks from disk instead of
> elements-at-a-time.
> 
> It would be better if these were in KBytes instead of # of elements.
> 
> >blocksize :: Int
> >blocksize = 10000
> 
> Turn a list into a list of chunks.
> 
> >slice :: Int -> [a] -> [[a]]
> >slice _ [] = []
> >slice size l = (take size l) : (slice size $ drop size l)

That's unnecessary parenthesis, and I'd probably use splitAt here:

    myslice :: Int -> [a] -> [[a]]
    myslice _ [] = []
    myslice n xs = a : myslice n b  where (a,b) = splitAt n xs

And just to check:

    *M> :m + Test.QuickCheck
    *M Test.QuickCheck> quickCheck (\n (xs :: [Int]) -> n > 0 ==> slice n xs == 
myslice n xs)
    OK, passed 100 tests.

> 
> Turn a list into a list of blocks, each of which is sorted.
> 
> >blockify :: (Ord a) => Int -> [a] -> [[a]]
> >blockify bsize l = map sort $ slice bsize l

Possibly you could drop the 'l' parameter:

    blockify n = map sort . slice n

> 
> Serialize a block, returning the (absolute) position of the start.
> 
> >dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer
> >dumpBlock h b = do
> >  start <- hTell h
> >  B.hPut h $ Bin.encode b
> >  return start
> 
> The actual sorting function.  We blockify the list, turning it into a
> list of sorted blocks, and spool to disk, keeping track of offsets.
> We then read back the blocks (lazily!), and merge them.
> 
> >externalSort [] = do return []
> >externalSort l = do
> >  h <- openFile "ExternalSort.bin" WriteMode
> >  idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l)

    idx <- mapM (dumpBlock h) (blockify blocksize l)

> >  hClose h
> >  h <- openFile "ExternalSort.bin" ReadMode
> >  blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x;
> >                            return $ Bin.decode bs}) idx

    Possibly

        forM idx $ \x -> decode `fmap` hGetContentsWithCursor h x


> >  return (kMerge $ blocks)
> 
> Merging chunks.  K-way merge (and in fact external sort in general) is
> detailed in Knuth, where he recommends tournament trees.  The easiest
> thing is to probably use one of Okasaki's heaps.  I'll use splay
> heaps, because I don't know any better.
> 
> It would be better if I changed Ord for blocks to only check the first
> element.
> 
> >kMerge :: (Ord a) => [[a]] -> [a]
> >kMerge [] = []
> >kMerge l =
> >    let h = Splay.fromSeq l in
> >    kM (Splay.minElem h) (Splay.deleteMin h)
> >    where
> >    kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a]
> >    kM l h
> >        | h == Splay.empty = l
> >        | otherwise =
> >            let next = Splay.minElem h
> >                (f, b) = span (\x -> x <= head next) l
> >            in
> >            f ++ (kM next (if null b then Splay.deleteMin h
> >                           else (Splay.insert b $ Splay.deleteMin h)))
> >
> >kMergeSort :: (Ord a) => [a] -> [a]
> >kMergeSort l = kMerge $ blockify blocksize l
> 
> This is a version of hGetContents which resets its handle position
> between reads, so is safe to use with interleaved handle seeking.
> 
> >hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString
> >hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize
> >
> >hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString
> >hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS
> >  where
> >    lazyRead start = unsafeInterleaveIO $ loop start
> >
> >    loop start = do
> >        hSeek h AbsoluteSeek start
> >        ps <- P.hGetNonBlocking h k
> >        --TODO: I think this should distinguish EOF from no data available
> >        -- the otherlying POSIX call makes this distincion, returning 
> >        either
> >        -- 0 or EAGAIN
> >        if P.null ps
> >          then do eof <- hIsEOF h
> >                  if eof then return []
> >                         else hWaitForInput h (-1)
> >                           >> (loop start)
> >           else do
> >              pos <- hTell h
> >              pss <- lazyRead pos
> >              return (ps : pss)

Very nice!

> >
> >defaultChunkSize :: Int
> >defaultChunkSize = 32 * k - overhead
> >   where k = 1024
> >         overhead = 2 * sizeOf (undefined :: Int)

We'll export this value in bytestring 1.0.


I like this code. Would you consider cabalising it, and uploading it to
hackage.haskell.org, so we don't lose it? Perhaps just call it hsort or
something?

Cheers,
  Don
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to