Monday, January 06, 2014

Optimising Haskell for a tight inner loop

Summary: I walk through optimising a Haskell string splitter to get a nice tight inner loop. We look at the Haskell code, the generated Core, C-- and assembly. We get down to 6 assembly instructions per input character.

Let's start with some simple code:

break (`elem` " \r\n$") src

This code scans a string looking for a space, newline or $ and returns the string before and the string after. Our goal is to make this code faster - by the end we'll get down to 6 assembly instructions per input character. Before making things go faster we should write test cases (so we don't break anything), profile (so we are optimising the right thing) and write benchmarks (to check our changes make things go faster). To write this post, I did all those steps, but the post is only going to look at the generated Core, C-- and assembly - and be guided by guesses about what should go faster. The complete code is available online, along with the Core/C--/assembly for each step as produced by GHC 7.6.3.

Version 1

To turn our example into a complete program, we write:

module InnerLoop(innerLoop) where

innerLoop :: FilePath -> IO (String, String)
innerLoop file = do
    src <- readFile file
    return $ break test src

test x = x `elem` " \r\n$"

We can save this code as InnerLoop.hs and compile it with:

ghc -c -O2 InnerLoop.hs -ddump-simpl -ddump-cmm -ddump-asm > log.txt

The full output of log.txt is available here. It contains the GHC Core (which looks a bit like Haskell), then the C-- (which looks a bit like C) and finally the assembly code (which looks exactly like assembly). When optimising we usually look at the Core, then at the C--, then at the assembly - stopping whenever our profiling says we are done. Let's take a look at the inner loop in Core (with some light editing):

innerLoop_3 = GHC.CString.unpackCString# " \r\n\$"

test_1 = \ (x :: GHC.Types.Char) ->
    GHC.List.elem @ GHC.Types.Char GHC.Classes.$fEqChar x innerLoop_3

innerLoop_2 =
    case GHC.List.$wbreak @ GHC.Types.Char test_1 x of _
        (# a, b #) -> (a, b)

The best way to read the Core is by looking for what you can understand, and ignoring the rest - it contains a lot of boring detail. We can see that a lot of things are fully qualified, e.g. GHC.List.elem. Some things have also been a bit mangled, e.g. $wbreak, which is roughly break. The interesting thing here is that break is being passed test_1. Looking at test_1 (which will be called on each character), we can see we are passing $fEqChar - a pair containing a function of how to perform equality on characters - to the elem function. For each character we are going to end up looping through a 4 element list (innerLoop_3) and each comparison will be going through a higher order function. Clearly we need to improve our test function.

Version 2

We can unroll the elem in test to give:

test x = x == ' ' || x == '\r' || x == '\n' || x == '$'

Compiling again and looking at the Core we see:

test_2 =
  \ (x :: GHC.Types.Char) ->
    case x of _ { GHC.Types.C# c ->
    case c of _ {
      __DEFAULT -> GHC.Types.False;
      '\n' -> GHC.Types.True;
      '\r' -> GHC.Types.True;
      ' ' -> GHC.Types.True;
      '$' -> GHC.Types.True

Now for each character we extract the raw character (pattern matching against C#) then test it against the possibilities. GHC has optimised our repeated ==/|| into a nice case expression. It looks quite nice. Now the bottleneck is the break function.

Version 3

The break function is working on a String, which is stored as a linked list of characters. To get better performance we can move to ByteString, writing:

innerLoop :: FilePath -> IO (ByteString, ByteString)
innerLoop file = do
    src <- BS.readFile file
    return $ BS.break test src

For many people this is the reasonable-performance version they should stick with. However, let's look at the Core once more:

go = \ (a :: Addr#) (i :: Int#) (w :: State# RealWorld) ->
    case i >=# len of _ {
      GHC.Types.False ->
        case readWord8OffAddr# @ GHC.Prim.RealWorld a 0 w
        of _ { (# w, c #) ->
        case chr# (word2Int# c) of _ {
          __DEFAULT -> go (plusAddr# a 1) (i +# 1) w;
          '\n' -> (# w, GHC.Types.I# i #);
          '\r' -> (# w, GHC.Types.I# i #);
          ' ' -> (# w, GHC.Types.I# i #);
          '$' -> (# w, GHC.Types.I# i #)
      GHC.Types.True -> (# w, l_a1J9 #)

The first thing that should strike you is the large number of # symbols. In Core, a # means you are doing strict primitive operations on unboxed values, so if the optimiser has managed to get down to # that is good. You'll also notice values of type State# RealWorld which I've renamed w - these are an encoding of the IO monad, but have zero runtime cost, and can be ignored. Looking at the rest of the code, we have a loop with a pointer to the current character (a :: Addr#) and an index of how far through the buffer we are (i :: Int#). At each character we first test if the index exceeds the length, and if it doesn't, read a character and match it against the options. If it doesn't match we continue by adding 1 to the address and 1 to the index. Of course, having to loop over two values is a bit unfortunate.

Version 4

A ByteString needs an explicit length so it knows when it has come to the end of the buffer, so needs to keep comparing against explicit lengths (and for efficiency reasons, also maintaining those lengths). Looking to C for inspiration, typically strings are terminated by a \0 character, which allows looping without comparing against a length (assuming the source file does not contain \0). We can define our own null-terminated ByteString type with a break operation:

newtype ByteString0 = BS0 ByteString

readFile0 :: FilePath -> IO ByteString0
readFile0 x = do
    src <- BS.readFile x
    return $ BS0 $ src `BS.snoc` '\0'

We define a newtype wrapper around ByteString so we gain some type safety. We also define a readFile0 that reads a file as a ByteString0, by explicitly calling snoc with \0. We can now define our own break0 function (this is the only big chunk of Haskell in this article):

break0 :: (Char -> Bool) -> ByteString0 -> (ByteString, ByteString0)
break0 f (BS0 bs) = (BS.unsafeTake i bs, BS0 $ BS.unsafeDrop i bs)
        i = Internal.inlinePerformIO $ BS.unsafeUseAsCString bs $ \ptr -> do
            let start = castPtr ptr :: Ptr Word8
            let end = go start
            return $! end `minusPtr` start

        go s | c == '\0' || f c = s
             | otherwise = go $ inc s
            where c = chr s

chr :: Ptr Word8 -> Char
chr x = Internal.w2c $ Internal.inlinePerformIO $ peek x

inc :: Ptr Word8 -> Ptr Word8
inc x = x `plusPtr` 1

We define break0 by finding the position at which the condition stops being true (i) and calling unsafeTake/unsafeDrop to slice out the relevant pieces. Because we know the second part is still null terminated we can rewrap in ByteString0. To find the index, we mostly use code copied from the bytestring library and modified. We convert the ByteString to a Ptr CChar using unsafeUseAsCString which just lets us look at the internals of the ByteString. We then loop over the pointer with go until we get to the first character that passes f and find how far we travelled. The function go looks at the current character using chr, and if it's \0 (the end) or the function f passes, returns the address at this point. Otherwise it increments the pointer. We use chr to peek at the pointer directly, and inlinePerformIO to do so purely and fast - since we know these buffers are never modified, the inlinePerformIO is morally defensible (we could have put chr in IO but that breaks a future optimisation we'll need to do).

Compiling to Core we see:

go = \ (x :: GHC.Prim.Addr#) ->
    case readWord8OffAddr# @ RealWorld x 0 realWorld#
    of _ { (# _, c #) ->
    case GHC.Prim.chr# (GHC.Prim.word2Int# c) of _ {
      __DEFAULT -> go (GHC.Prim.plusAddr# x 1);
      '\NUL' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      '\n' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      '\r' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      ' ' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;
      '$' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x

Now we have a Core inner loop to be proud of. We loop round with a single pointer, peek at a byte, and compare it to our options. Time to look onwards to the C--, where I've included just the inner loop:

        Hp = Hp + 8;
        if (Hp > HpLim) goto c1Tx;
        _s1RN::I32 = %MO_UU_Conv_W8_W32(I8[I32[Sp + 0]]);
        _s1T5::I32 = _s1RN::I32;
        _s1T6::I32 = _s1T5::I32;
        if (_s1T6::I32 < 13) goto c1TG;
        if (_s1T6::I32 < 32) goto c1TH;
        if (_s1T6::I32 < 36) goto c1TI;
        if (_s1T6::I32 != 36) goto c1TJ;
        _s1T4::I32 = I32[Sp + 0] + 1;
        I32[Sp + 0] = _s1T4::I32;
        Hp = Hp - 8;
        jump InnerLoop.$wgo_info; // []

Reading the code, we first mess around with Hp, then pull a value out of the array and into _s1RN, then do some comparisons, and if they don't match jump to c1TJ, mess around with Hp again and jump back to start again.

There are three obvious problems with the code: 1) we mess around with Hp; 2) we are doing too many tests to get to the default case; 3) there is a jump in the middle of the loop.

Version 5

Let's start with the Hp variable. Hp is the heap pointer, which says how much heap GHC is using - if the heap gets above a certain limit, it triggers a garbage collection. The Hp = Hp + 8 reserves 8 bytes of heap for this function, Hp > HpLim checks if we need to garbage collect, and Hp = Hp - 8 at the bottom of the loop gives back that heap space. Why do we allocate 8 bytes, only to give it back at the end? The reason is that in the return path after the loop we do allocation. It's a long standing performance issue that GHC doesn't push the heap test down to the exit path, but we can fix it ourselves. Looking at the Core, we saw:

case GHC.Prim.chr# (GHC.Prim.word2Int# c) of _ {
  __DEFAULT -> go (GHC.Prim.plusAddr# x 1);
  '\NUL' -> GHC.Ptr.Ptr @ GHC.Word.Word8 x;

The expression GHC.Ptr.Ptr @ GHC.Word8 x is allocating a constructor around the pointer to return. Looking at the Ptr type we discover:

data Ptr a = Ptr Addr#

So Ptr is simply a constructor wrapping our address. To avoid the Ptr in the inner loop, we can switch to returning Addr# from go:

i = Internal.inlinePerformIO $ BS.unsafeUseAsCString bs $ \ptr -> do
    let start = castPtr ptr :: Ptr Word8
    let end = go start
    return $! Ptr end `minusPtr` start

go s@(Ptr a) | c == '\0' || f c = a
             | otherwise = go $ inc s
    where c = chr s

We also add back the Ptr around end do call minusPtr. Looking at the Core we now see a very simple return path:

case GHC.Prim.chr# (GHC.Prim.word2Int# ipv1_a1D0) of _ {
  __DEFAULT -> InnerLoop.$wgo (GHC.Prim.plusAddr# ww_s1PR 1);
  '\NUL' -> ww_s1PR;

And dropping down to C-- we see:

     _s1Ry::I32 = %MO_UU_Conv_W8_W32(I8[I32[Sp + 0]]);
     _s1SP::I32 = _s1Ry::I32;
     _s1SQ::I32 = _s1SP::I32;
     if (_s1SQ::I32 < 13) goto c1Tn;
     if (_s1SQ::I32 < 32) goto c1To;
     if (_s1SQ::I32 < 36) goto c1Tp;
     if (_s1SQ::I32 != 36) goto c1Tq;
     R1 = I32[Sp + 0];
     Sp = Sp + 4;
     jump (I32[Sp + 0]); // [R1]
     _s1SO::I32 = I32[Sp + 0] + 1;
     I32[Sp + 0] = _s1SO::I32;
     jump InnerLoop.$wgo_info; // []

Not a single mention of Hp. We still have a lot more tests than we'd like though.

Version 6

The current code to check for our 5 terminating characters compares each character one by one. This entire example is based on lexing Ninja source files, so we know that most characters will be alphanumeric. Using this information, we can instead test if the character is less than or equal to $, if it is we can test for the different possibilities, otherwise continue on the fast path. We can write:

test x = x <= '$' && (x == ' ' || x == '\r' || x == '\n' || x == '$')

Now looking at the Core we see:

go = \ (ww_s1Qt :: GHC.Prim.Addr#) ->
    case GHC.Prim.readWord8OffAddr#
           @ GHC.Prim.RealWorld ww_s1Qt 0 GHC.Prim.realWorld#
    of _ { (# _, ipv1_a1Dr #) ->
    case GHC.Prim.chr# (GHC.Prim.word2Int# ipv1_a1Dr) of wild_XH {
      __DEFAULT ->
        case GHC.Prim.leChar# wild_XH '$' of _ {
          GHC.Types.False -> go (GHC.Prim.plusAddr# ww_s1Qt 1);
          GHC.Types.True ->
            case wild_XH of _ {
              __DEFAULT -> go (GHC.Prim.plusAddr# ww_s1Qt 1);
              '\n' -> ww_s1Qt;
              '\r' -> ww_s1Qt;
              ' ' -> ww_s1Qt;
              '$' -> ww_s1Qt
      '\NUL' -> ww_s1Qt

The code looks reasonable, but the final \NUL indicates that the code first checks if the character is \NUL (or \0) and only then does our fast < $ test.

Version 7

To perform our < $ test before checking for \0 we need to modify go. We require that the argument predicate must return False on \0 (otherwise we'll run off the end of the string) and can then write:

go s@(Ptr a) | f c = a
             | otherwise = go $ inc s
    where c = chr s

test x = x <= '$' &&
    (x == ' ' || x == '\r' || x == '\n' || x == '$' || x == '\0')

The Core reads:

InnerLoop.$wgo =
  \ (ww_s1Qq :: GHC.Prim.Addr#) ->
    case GHC.Prim.readWord8OffAddr#
           @ GHC.Prim.RealWorld ww_s1Qq 0 GHC.Prim.realWorld#
    of _ { (# _, ipv1_a1Dr #) ->
    let {
      c1_a1uU [Dmd=Just L] :: GHC.Prim.Char#
      [LclId, Str=DmdType]
      c1_a1uU = GHC.Prim.chr# (GHC.Prim.word2Int# ipv1_a1Dr) } in
    case GHC.Prim.leChar# c1_a1uU '$' of _ {
      GHC.Types.False -> InnerLoop.$wgo (GHC.Prim.plusAddr# ww_s1Qq 1);
      GHC.Types.True ->
        case c1_a1uU of _ {
          __DEFAULT -> InnerLoop.$wgo (GHC.Prim.plusAddr# ww_s1Qq 1);
          '\NUL' -> ww_s1Qq;
          '\n' -> ww_s1Qq;
          '\r' -> ww_s1Qq;
          ' ' -> ww_s1Qq;
          '$' -> ww_s1Qq

The C-- reads:

     _s1Se::I32 = %MO_UU_Conv_W8_W32(I8[I32[Sp + 0]]);
     _s1Sh::I32 = _s1Se::I32;
     _s1Sg::I32 = _s1Sh::I32;
     _c1TZ::I32 = _s1Sg::I32 <= 36;
     if (_c1TZ::I32 >= 1) goto c1Ui;
     _s1Ty::I32 = I32[Sp + 0] + 1;
     I32[Sp + 0] = _s1Ty::I32;
     jump InnerLoop.$wgo_info; // []

And the assembly reads:

    movl 0(%ebp),%eax
    movzbl (%eax),%eax
    cmpl $36,%eax
    jbe _c1Ui
    incl 0(%ebp)
    jmp InnerLoop.$wgo_info

We have ended up with a fairly small 6 instruction loop.

Version 8

We've now exhausted my Haskell bag of tricks, and have to stop. But the assembly code could still be improved. In each loop we read the contents of the memory at %ebp into %eax, and increment the contents of the memory at %ebp at the end - we're manipulating the value on the top of the stack (which is pointed to by %ebp). We could instead cache that value in %ebx, and write:

    movzbl (%ebx),%eax
    cmpl $36,%eax
    jbe _c1Ui
    incl %ebx
    jmp _c1Uf

One less instruction, two less memory accesses. I tried the LLVM backend (using LLVM 2.8 at -O3), but it generated significantly worse assembly code. I don't know how to optimise any further without dropping down to the C FFI, but I'm sure one day GHC/LLVM will automatically produce the shorter assembly code.

Update: It appears on 64bit x86 GHC already produces the minimal assembly.

Saturday, January 04, 2014

Running Makefiles with Shake

Summary: Shake can interpret some Makefiles, just like Make.

Since version 0.10 the Shake build system also installs a shake executable that can interpret a Makefile, just like you would use with make. To try it out, just type shake where you would normally type make (i.e. a directory containing a Makefile) and Shake will attempt to perform the build. I say attempt, as there are plenty of caveats, and it is unlikely that randomly chosen Makefile will work.

The current state

  • shake can interpret some Makefile examples. For example, the first four examples from this Makefile tutorial can be executed by shake. There are plenty of missing features, too many to list. It is likely that some tutorials or simple projects will work, but most "real" projects will not.
  • shake is sufficiently compatible with make (for the subset of features it supports) that I actually run the GNU Make testsuite using shake (see I currently pass 16 tests from 8 categories, out of 83 tests in 39 categories. One of the tests I pass is even failed by the MinGW version of make.
  • The command like arguments to shake are mostly a superset of those to make. In fact, all Shake programs that use shakeWithArgs support most of these options - including things like --jobs, --keep-going, --directory and --always-make.

The original plan

My original plan had 3 steps (spoiler alert: I never even finished step 1):

Step 1: I wanted to interpret many common Makefile patterns, so some simpler projects could use either make or shake interchangeably. I never expected it to work for complex projects like GHC or the Linux Kernel, but did hope to work for the average one-author open source project.

Step 2: I wanted to add need as a command that pulled in additional dependencies. One of the weaknesses of make (fixed by Shake) is that you have to specify all dependencies up front, before you have generated other dependencies. I wanted to extend the Makefile syntax to support:

foo.tar : foo.txt
    cat foo.txt | need
    tar -cf foo.tar `cat foo.txt`

Here cat foo.txt | need would add dependencies on all the files listed by foo.txt and build them if necessary, before continuing.

Step 3: I wanted to write a converter that took an extended Makefile and produced a Shakefile.hs Haskell program that used the standard Shake library, giving users a seamless migration path.

What now?

This feature has no active users, and I don't immediately see who would chose to use it. There are advantages of using shake over make, primarily profiling and progress prediction, but they aren't killer advantages. Make is a large and complex program, and while you could replicate it in Haskell on top of Shake, you probably don't want to. As an example, if a rule for a file does not update the file, Shake will not rerun that rule in subsequent builds but Make will. Arguably the Make behaviour is a bug, but like so many misfeatures, someone will have relied on it somewhere. Replicating Make bug-for-bug would be a lot of effort, end up very ugly, and if you don't have complete compatibility, big projects are unlikely to be able to switch.

Despite my pessimism around this feature, I don't intend to remove it. If it gains users I would be pleasantly surprised. If there is some small feature someone thinks it should support (ideally to gain additional users!) I'm happy to implement it. If people want to send pull requests, or even take over development of this part, I'd be very happy. Shake has gained a number of useful features while implementing Make compatibility, so even if the code is never used in anger, it was still of benefit.

A quick tour of the code

The code for the Make compatibility is all in the repo under Development/Make. It's relatively short and might serve as a nice example for intermediate level Haskell programmers. There are 5 files:

  • Type defines the type Makefile, which is the parsed representation of a Makefile (58 lines).
  • Parse takes a file and parses it, producing a Makefile value (71 lines).
  • Env provides a representation of the Make environment, describes which variables are assigned to which values, and evaluates expressions in the environment (40 lines).
  • Rules defines a new Shake rule type, mostly like the normal Shake file rule, but having slightly different semantics - if a rule fails to generate a file, that isn't an error (59 lines).
  • All puts all the pieces together - parsing a Makefile, evaluating the expressions within it while building up an environment, and then translating to Shake rules (135 lines).

The code is fairly well segregated and serves as a reasonable template for interpreting alternative input build specifications under Shake. The Development/Ninja source code follows much the same pattern.

Friday, January 03, 2014

Announcing Shake 0.11

Summary: Shake 0.11 is out, upgrading should be pretty easy.

Shake is a Haskell library for writing build tools, and I've just released version 0.11. There are a few incompatible changes, and a few new features.

Incompatible changes:

  • Everyone should move to the cmd and command functions for running external programs. I've deprecated all of the other functions (including sys, system' and systemCwd).
  • newCache is now keyed on anything, and takes an action. To obtain the same behaviour as before, make sure you need the file argument in the function passed to newCache.
  • If you are writing your own types of Rule (fairly rare) you need to import Development.Shake.Rule.
  • The storedValue function of the Rule class now gets given the current ShakeOptions.

The move to cmd and command will require the most effort to upgrade, but none of these functions have been removed, so you can upgrade and then convert over after. The other changes are likely to hit very few people, and can be adapted to very easily. If you have any difficulties, email the mailing list.

Major new features in 0.11:

  • There is tracking support, to check you have not missed any dependencies. There are support functions for tracking (e.g. trackUse, trackChange) and on Windows there is support for using tracker.exe to call these functions.
  • The Shake database is now more compact, about 40% smaller for some examples.
  • If you have a few rules that overlap (e.g. one for foo.txt and one for *.txt) you can use the alternatives function to combine them.
  • The Ninja parser is faster (4x faster than Ninja itself to parse the Chromium Ninja files on one machine).
  • The continuous integration tests include a test that Shake can build the Ninja binary from Ninja build files faster than Ninja can (it is typically about 10% faster to build from scratch).

The biggest features since 0.10:

  • Support for Unicode filenames.
  • Addition of newThrottle to support throttled resources, where you can only perform n operations per minute.
  • Progress prediction.
  • Shake installs an executable shake which can build either Ninja projects (almost all) or Makefile projects (barely any).

A complete list of changes since the beginning is available here.