Add a section on expressions in the Core language.

[matthijs/master-project/report.git] / Chapters / Prototype.tex
diff --git a/Chapters/Prototype.tex b/Chapters/Prototype.tex

index 1db9c6d73aff1d766dabffeed186d63bd43cebcc..6ef6a9785a2f9f38ed673722cb14522071ffda89 100644 (file)
--- a/Chapters/Prototype.tex
+++ b/Chapters/Prototype.tex
@@ -1,8 +1,459 @@
-\chapter{Prototype}
-        Choice of Haskell
-        Core - description of the language (appendix?)
-       Stages (-> Core, Normalization, -> VHDL)
+\chapter[chap:prototype]{Prototype}
+  An important step in this research is the creation of a prototype compiler.
+  Having this prototype allows us to apply the ideas from the previous chapter
+  to actual hardware descriptions and evaluate their usefulness. Having a
+  prototype also helps to find new techniques and test possible
+  interpretations.
+
+  Obviously the prototype was not created after all research
+  ideas were formed, but its implementation has been interleaved with the
+  research itself. Also, the prototype described here is the final version, it
+  has gone through a number of design iterations which we will not completely
+  describe here.
+
+  \section{Choice of language}
+    When implementing this prototype, the first question to ask is: What
+    (functional) language will we use to describe our hardware? (Note that
+    this does not concern the \emph{implementation language} of the compiler,
+    just the language \emph{translated by} the compiler).
+
+    On the highest level, we have two choices:
+
+    \startitemize
+      \item Create a new functional language from scratch. This has the
+      advantage of having a language that contains exactly those elements that
+      are convenient for describing hardware and can contain special
+      constructs that might.
+      \item Use an existing language and create a new backend for it. This has
+      the advantage that existing tools can be reused, which will speed up
+      development.
+    \stopitemize
+
+    Considering that we required a prototype which should be working quickly,
+    and that implementing parsers, semantic checkers and especially
+    typcheckers isn't exactly the core of this research (but it is lots and
+    lots of work!), using an existing language is the obvious choice. This
+    also has the advantage that a large set of language features is available
+    to experiment with and it is easy to find which features apply well and
+    which don't. A possible second prototype could use a custom language with
+    just the useful features (and possibly extra features that are specific to
+    the domain of hardware description as well).
+
+    The second choice is to pick one of the many existing languages. As
+    mentioned before, this language is Haskell.  This choice has not been the
+    result of a thorough comparison of languages, for the simple reason that
+    the requirements on the language were completely unclear at the start of
+    this language. The fact that Haskell is a language with a broad spectrum
+    of features, that it is commonly used in research projects and that the
+    primary compiler, GHC, provides a high level API to its internals, made
+    Haskell an obvious choice.
+
+    TODO: Was Haskell really a good choice? Perhaps say this somewhere else?
+
+  \section{Prototype design}
+    As stated above, we will use the Glasgow Haskell Compiler (\small{GHC}) to
+    implement our prototype compiler. To understand the design of the
+    compiler, we will first dive into the \small{GHC} compiler a bit. It's
+    compilation consists of the following steps (slightly simplified):
+
+    \startuseMPgraphic{ghc-pipeline}
+      % Create objects
+      save inp, front, desugar, simpl, back, out;
+      newEmptyBox.inp(0,0);
+      newBox.front(btex Parser etex);
+      newBox.desugar(btex Desugarer etex);
+      newBox.simpl(btex Simplifier etex);
+      newBox.back(btex Backend etex);
+      newEmptyBox.out(0,0);
+
+      % Space the boxes evenly
+      inp.c - front.c = front.c - desugar.c = desugar.c - simpl.c 
+        = simpl.c - back.c = back.c - out.c = (0, 1.5cm);
+      out.c = origin;
+
+      % Draw lines between the boxes. We make these lines "deferred" and give
+      % them a name, so we can use ObjLabel to draw a label beside them.
+      ncline.inp(inp)(front) "name(haskell)";
+      ncline.front(front)(desugar) "name(ast)";
+      ncline.desugar(desugar)(simpl) "name(core)";
+      ncline.simpl(simpl)(back) "name(simplcore)";
+      ncline.back(back)(out) "name(native)";
+      ObjLabel.inp(btex Haskell source etex) "labpathname(haskell)", "labdir(rt)";
+      ObjLabel.front(btex Haskell AST etex) "labpathname(ast)", "labdir(rt)";
+      ObjLabel.desugar(btex Core etex) "labpathname(core)", "labdir(rt)";
+      ObjLabel.simpl(btex Simplified core etex) "labpathname(simplcore)", "labdir(rt)";
+      ObjLabel.back(btex Native code etex) "labpathname(native)", "labdir(rt)";
+
+      % Draw the objects (and deferred labels)
+      drawObj (inp, front, desugar, simpl, back, out);
+    \stopuseMPgraphic
+    \placefigure[right]{GHC compiler pipeline}{\useMPgraphic{ghc-pipeline}}
+
+    \startdesc{Frontend}
+      This step takes the Haskell source files and parses them into an
+      abstract syntax tree (\small{AST}). This \small{AST} can express the
+      complete Haskell language and is thus a very complex one (in contrast
+      with the Core \small{AST}, later on). All identifiers in this
+      \small{AST} are resolved by the renamer and all types are checked by the
+      typechecker.
+    \stopdesc
+    \startdesc{Desugaring}
+      This steps takes the full \small{AST} and translates it to the
+      \emph{Core} language. Core is a very small functional language with lazy
+      semantics, that can still express everything Haskell can express. Its
+      simpleness makes Core very suitable for further simplification and
+      translation. Core is the language we will be working on as well.
+    \stopdesc
+    \startdesc{Simplification}
+      Through a number of simplification steps (such as inlining, common
+      subexpression elimination, etc.) the Core program is simplified to make
+      it faster or easier to process further.
+    \stopdesc
+    \startdesc{Backend}
+      This step takes the simplified Core program and generates an actual
+      runnable program for it. This is a big and complicated step we will not
+      discuss it any further, since it is not required for our prototype.
+    \stopdesc
+
+    In this process, there a number of places where we can start our work.
+    Assuming that we don't want to deal with (or modify) parsing, typechecking
+    and other frontend business and that native code isn't really a useful
+    format anymore, we are left with the choice between the full Haskell
+    \small{AST}, or the smaller (simplified) core representation.
+
+    The advantage of taking the full \small{AST} is that the exact structure
+    of the source program is preserved. We can see exactly what the hardware
+    descriiption looks like and which syntax constructs were used. However,
+    the full \small{AST} is a very complicated datastructure. If we are to
+    handle everything it offers, we will quickly get a big compiler.
+
+    Using the core representation gives us a much more compact datastructure
+    (a core expression only uses 9 constructors). Note that this does not mean
+    that the core representation itself is smaller, on the contrary. Since the
+    core language has less constructs, a lot of things will take a larger
+    expression to express.
+
+    However, the fact that the core language is so much smaller, means it is a
+    lot easier to analyze and translate it into something else. For the same
+    reason, \small{GHC} runs its simplifications and optimizations on the core
+    representation as well.
+
+    However, we will use the normal core representation, not the simplified
+    core. Reasons for this are detailed below.
+    
+    The final prototype roughly consists of three steps:
+    
+    \startuseMPgraphic{ghc-pipeline}
+      % Create objects
+      save inp, front, norm, vhdl, out;
+      newEmptyBox.inp(0,0);
+      newBox.front(btex \small{GHC} frontend + desugarer etex);
+      newBox.norm(btex Normalization etex);
+      newBox.vhdl(btex \small{VHDL} generation etex);
+      newEmptyBox.out(0,0);
+
+      % Space the boxes evenly
+      inp.c - front.c = front.c - norm.c = norm.c - vhdl.c 
+        = vhdl.c - out.c = (0, 1.5cm);
+      out.c = origin;
+
+      % Draw lines between the boxes. We make these lines "deferred" and give
+      % them a name, so we can use ObjLabel to draw a label beside them.
+      ncline.inp(inp)(front) "name(haskell)";
+      ncline.front(front)(norm) "name(core)";
+      ncline.norm(norm)(vhdl) "name(normal)";
+      ncline.vhdl(vhdl)(out) "name(vhdl)";
+      ObjLabel.inp(btex Haskell source etex) "labpathname(haskell)", "labdir(rt)";
+      ObjLabel.front(btex Core etex) "labpathname(core)", "labdir(rt)";
+      ObjLabel.norm(btex Normalized core etex) "labpathname(normal)", "labdir(rt)";
+      ObjLabel.vhdl(btex \small{VHDL} description etex) "labpathname(vhdl)", "labdir(rt)";
+
+      % Draw the objects (and deferred labels)
+      drawObj (inp, front, norm, vhdl, out);
+    \stopuseMPgraphic
+    \placefigure[right]{GHC compiler pipeline}{\useMPgraphic{ghc-pipeline}}
+
+    \startdesc{Frontend}
+      This is exactly the frontend and desugarer from the \small{GHC}
+      pipeline, that translates Haskell sources to a core representation.
+    \stopdesc
+    \startdesc{Normalization}
+      This is a step that transforms the core representation into a normal
+      form. This normal form is still expressed in the core language, but has
+      to adhere to an extra set of constraints. This normal form is less
+      expressive than the full core language (e.g., it can have limited higher
+      order expressions, has a specific structure, etc.), but is also very
+      close to directly describing hardware.
+    \stopdesc
+    \startdesc{\small{VHDL} generation}
+      The last step takes the normal formed core representation and generates
+      \small{VHDL} for it. Since the normal form has a specific, hardware-like
+      structure, this final step is very straightforward.
+    \stopdesc
+    
+    The most interesting step in this process is the normalization step. That
+    is where more complicated functional constructs, which have no direct
+    hardware interpretation, are removed and translated into hardware
+    constructs. This step is described in a lot of detail at
+    \in{chapter}[chap:normalization].
+    
+  \section{The Core language}
+    Most of the prototype deals with handling the program in the Core
+    language. In this section we will show what this language looks like and
+    how it works.
+
+    The Core language is a functional language that describes
+    \emph{expressions}. Every identifier used in Core is called a
+    \emph{binder}, since it is bound to a value somewhere. On the highest
+    level, a Core program is a collection of functions, each of which bind a
+    binder (the function name) to an expression (the function value, which has
+    a function type).
+
+    The Core language itself does not prescribe any program structure, only
+    expression structure. In the \small{GHC} compiler, the Haskell module
+    structure is used for the resulting Core code as well. Since this is not
+    so relevant for understanding the Core language or the Normalization
+    process, we'll only look at the Core expression language here.
+
+    Each Core expression consists of one of these possible expressions.
+
+    \startdesc{Variable reference}
+\startlambda
+a
+\stoplambda
+      This is a simple reference to a binder. It's written down as the
+      name of the binder that is being referred to, which should of course be
+      bound in a containing scope (including top level scope, so a reference
+      to a top level function is also a variable reference). Additionally,
+      constructors from algebraic datatypes also become variable references.
+
+      The value of this expression is the value bound to the given binder.
+
+      Each binder also carries around its type, but this is usually not shown
+      in the Core expressions. Occasionally, the type of an entire expression
+      or function is shown for clarity, but this is only informational. In
+      practice, the type of an expression is easily determined from the
+      structure of the expression and the types of the binders and occasional
+      cast expressions. This minimize the amount of bookkeeping needed to keep
+      the typing consistent.
+    \stopdesc
+    \startdesc{Literal}
+\startlambda
+10
+\stoplambda
+      This is a simple literal. Only primitive types are supported, like
+      chars, strings, ints and doubles. The types of these literals are the
+      \quote{primitive} versions, like \lam{Char\#} and \lam{Word\#}, not the
+      normal Haskell versions (but there are builtin conversion functions).
+    \stopdesc
+    \startdesc{Application}
+\startlambda
+func arg
+\stoplambda
+      This is simple function application. Each application consists of two
+      parts: The function part and the argument part. Applications are used
+      for normal function \quote{calls}, but also for applying type
+      abstractions and data constructors.
+      
+      The value of an application is the value of the function part, with the
+      first argument binder bound to the argument part.
+    \stopdesc
+    \startdesc{Lambda abstraction}
+\startlambda
+λbndr.body
+\stoplambda
+      This is the basic lambda abstraction, as it occurs in labmda calculus.
+      It consists of a binder part and a body part.  A lambda abstraction
+      creates a function, that can be applied to an argument. 
+     
+      Note that the body of a lambda abstraction extends all the way to the
+      end of the expression, or the closing bracket surrounding the lambda. In
+      other words, the lambda abstraction \quote{operator} has the lowest
+      priority of all.
+
+      The value of an application is the value of the body part, with the
+      binder bound to the value the entire lambda abstraction is applied to.
+    \stopdesc
+    \startdesc{Non-recursive let expression}
+\startlambda
+let bndr = value in body
+\stoplambda
+      A let expression allows you to bind a binder to some value, while
+      evaluating to some other value (where that binder is in scope). This
+      allows for sharing of subexpressions (you can use a binder twice) and
+      explicit \quote{naming} of arbitrary expressions. Note that the binder
+      is not in scope in the value bound to it, so it's not possible to make
+      recursive definitions with the normal form of the let expression (see
+      the recursive form below).
+
+      Even though this let expression is an extension on the basic lambda
+      calculus, it is easily translated to a lambda abstraction. The let
+      expression above would then become:
+
+\startlambda
+(λbndr.body) value
+\stoplambda
+
+      This notion might be useful for verifying certain properties on
+      transformations, since a lot of verification work has been done on
+      lambda calculus already.
+
+      The value of a let expression is the value of the body part, with the
+      binder bound to the value. 
+    \stopdesc
+    \startdesc{Recursive let expression}
+\startlambda
+let 
+  bndr1 = value1
+  \vdots
+  bndrn = valuen
+in 
+  body
+\stoplambda
+
+      This is the recursive version of the let expression. In \small{GHC}'s
+      Core implementation, non-recursive and recursive lets are not so
+      distinct as we present them here, but this provides a clearer overview.
+      
+      The main difference with the normal let expression is that each of the
+      binders is in scope in each of the values, in addition to the body. This
+      allows for self-recursive definitions or mutually recursive definitions.
+
+      It should also be possible to express a recursive let using normal
+      lambda calculus, if we use the \emph{least fixed-point operator},
+      \lam{Y}.
+    \stopdesc
+    \startdesc{Case expression}
+\startlambda
+  case scrut of bndr
+    DEFAULT -> defaultbody
+    C0 bndr0,0 ... bndr0,m -> body0
+    \vdots
+    Cn bndrn,0 ... bndrn,m -> bodyn
+\stoplambda
+
+TODO: Define WHNF
+
+    A case expression is the only way in Core to choose between values. A case
+    expression evaluates its scrutinee, which should have an algebraic
+    datatype, into weak head normal form (\small{WHNF}) and (optionally) binds
+    it to \lam{bndr}. It then chooses a body depending on the constructor of
+    its scrutinee. If none of the constructors match, the \lam{DEFAULT}
+    alternative is chosen. 
+    
+    Since we can only match the top level constructor, there can be no overlap
+    in the alternatives and thus order of alternatives is not relevant (though
+    the \lam{DEFAULT} alternative must appear first for implementation
+    efficiency).
+    
+    Any arguments to the constructor in the scrutinee are bound to each of the
+    binders after the constructor and are in scope only in the corresponding
+    body.
+
+    To support strictness, the scrutinee is always evaluated into WHNF, even
+    when there is only a \lam{DEFAULT} alternative. This allows a strict
+    function argument to be written like:
+
+\startlambda
+function (case argument of arg
+  DEFAULT -> arg)
+\stoplambda
+
+    This seems to be the only use for the extra binder to which the scrutinee
+    is bound. When not using strictness annotations (which is rather pointless
+    in hardware descriptions), \small{GHC} seems to never generate any code
+    making use of this binder. The current prototype does not handle it
+    either, which probably means that code using it would break.
+
+    Note that these case statements are less powerful than the full Haskell
+    case statements. In particular, they do not support complex patterns like
+    in Haskell. Only the constructor of an expression can be matched, complex
+    patterns are implemented using multiple nested case expressions.
+
+    Case statements are also used for unpacking of algebraic datatypes, even
+    when there is only a single constructor. For examples, to add the elements
+    of a tuple, the following Core is generated:
+
+\startlambda
+sum = λtuple.case tuple of
+  (,) a b -> a + b
+\stoplambda
+  
+    Here, there is only a single alternative (but no \lam{DEFAULT}
+    alternative, since the single alternative is already exhaustive). When
+    it's body is evaluated, the arguments to the tuple constructor \lam{(,)}
+    (\eg, the elements of the tuple) are bound to \lam{a} and \lam{b}.
+  \stopdesc
+  \startdesc{Cast expression}
+\startlambda
+body :: targettype
+\stoplambda
+    A cast expression allows you to change the type of an expression to an
+    equivalent type. Note that this is not meant to do any actual work, like
+    conversion of data from one format to another, or force a complete type
+    change. Instead, it is meant to change between different representations
+    of the same type, \eg switch between types that are provably equal (but
+    look different).
+    
+    In our hardware descriptions, we typically see casts to change between a
+    Haskell newtype and its contained type, since those are effectively
+    different representations of the same type.
+
+    More complex are types that are proven to be equal by the typechecker,
+    but look different at first glance. To ensure that, once the typechecker
+    has proven equality, this information sticks around, explicit casts are
+    added. In our notation we only write the target type, but in reality a
+    cast expressions carries around a \emph{coercion}, which can be seen as a
+    proof of equality. TODO: Example
+
+    The value of a cast is the value of its body, unchanged. The type of this
+    value is equal to the target type, not the type of its body.
+
+    Note that this syntax is also used sometimes to indicate that a particular
+    expression has a particular type, even when no cast expression is
+    involved. This is then purely informational, since the only elements that
+    are explicitely typed in the Core language are the binder references and
+    cast expressions, the types of all other elements are determined at
+    runtime.
+  \stopdesc
+  \startdesc{Note}
+
+    The Core language in \small{GHC} allows adding \emph{notes}, which serve
+    as hints to the inliner or add custom (string) annotations to a core
+    expression. These shouldn't be generated normally, so these are not
+    handled in any way in the prototype.
+  \stopdesc
+  \startdesc{Type}
+\startlambda
+@type
+\stoplambda
+    It is possibly to use a Core type as a Core expression. This is done to
+    allow for type abstractions and applications to be handled as normal
+    lambda abstractions and applications above. This means that a type
+    expression in Core can only ever occur in the argument position of an
+    application, and only if the type of the function that is applied to
+    expects a type as the first argument. This happens for all polymorphic
+    functions, for example, the \lam{fst} function:
+
+\startlambda
+fst :: \forall a. \forall b. (a, b) -> a
+fst = λtup.case tup of (,) a b -> a
+
+fstint :: (Int, Int) -> Int
+fstint = λa.λb.fst @Int @Int a b
+\stoplambda
+    
+    The type of \lam{fst} has two universally quantified type variables. When
+    \lam{fst} is applied in \lam{fstint}, it is first applied to two types.
+    (which are substitued for \lam{a} and \lam{b} in the type of \lam{fst}, so
+    the type of \lam{fst} actual type of arguments and result can be found:
+    \lam{fst @Int @Int :: (Int, Int) -> Int}).
+  \stopdesc
+
+  TODO: Core type system
+
          Implementation issues
+        Simplified core?
  
          Haskell language coverage / constraints
                  Recursion