X-Git-Url: https://git.stderr.nl/gitweb?p=matthijs%2Fmaster-project%2Freport.git;a=blobdiff_plain;f=Chapters%2FPrototype.tex;h=6ef6a9785a2f9f38ed673722cb14522071ffda89;hp=64fbbe761364ac5b491bc8fbf6aae0eda62584f9;hb=3fc7078e3369d9e916b3663f04c51bb587434d14;hpb=d13b2459ebafe1cdeef354cc4e4f28ca9d7c3a83 diff --git a/Chapters/Prototype.tex b/Chapters/Prototype.tex index 64fbbe7..6ef6a97 100644 --- a/Chapters/Prototype.tex +++ b/Chapters/Prototype.tex @@ -1,4 +1,4 @@ -\chapter{Prototype} +\chapter[chap:prototype]{Prototype} An important step in this research is the creation of a prototype compiler. Having this prototype allows us to apply the ideas from the previous chapter to actual hardware descriptions and evaluate their usefulness. Having a @@ -56,6 +56,39 @@ compiler, we will first dive into the \small{GHC} compiler a bit. It's compilation consists of the following steps (slightly simplified): + \startuseMPgraphic{ghc-pipeline} + % Create objects + save inp, front, desugar, simpl, back, out; + newEmptyBox.inp(0,0); + newBox.front(btex Parser etex); + newBox.desugar(btex Desugarer etex); + newBox.simpl(btex Simplifier etex); + newBox.back(btex Backend etex); + newEmptyBox.out(0,0); + + % Space the boxes evenly + inp.c - front.c = front.c - desugar.c = desugar.c - simpl.c + = simpl.c - back.c = back.c - out.c = (0, 1.5cm); + out.c = origin; + + % Draw lines between the boxes. We make these lines "deferred" and give + % them a name, so we can use ObjLabel to draw a label beside them. + ncline.inp(inp)(front) "name(haskell)"; + ncline.front(front)(desugar) "name(ast)"; + ncline.desugar(desugar)(simpl) "name(core)"; + ncline.simpl(simpl)(back) "name(simplcore)"; + ncline.back(back)(out) "name(native)"; + ObjLabel.inp(btex Haskell source etex) "labpathname(haskell)", "labdir(rt)"; + ObjLabel.front(btex Haskell AST etex) "labpathname(ast)", "labdir(rt)"; + ObjLabel.desugar(btex Core etex) "labpathname(core)", "labdir(rt)"; + ObjLabel.simpl(btex Simplified core etex) "labpathname(simplcore)", "labdir(rt)"; + ObjLabel.back(btex Native code etex) "labpathname(native)", "labdir(rt)"; + + % Draw the objects (and deferred labels) + drawObj (inp, front, desugar, simpl, back, out); + \stopuseMPgraphic + \placefigure[right]{GHC compiler pipeline}{\useMPgraphic{ghc-pipeline}} + \startdesc{Frontend} This step takes the Haskell source files and parses them into an abstract syntax tree (\small{AST}). This \small{AST} can express the @@ -82,9 +115,345 @@ discuss it any further, since it is not required for our prototype. \stopdesc - Core - description of the language (appendix?) - Stages (-> Core, Normalization, -> VHDL) + In this process, there a number of places where we can start our work. + Assuming that we don't want to deal with (or modify) parsing, typechecking + and other frontend business and that native code isn't really a useful + format anymore, we are left with the choice between the full Haskell + \small{AST}, or the smaller (simplified) core representation. + + The advantage of taking the full \small{AST} is that the exact structure + of the source program is preserved. We can see exactly what the hardware + descriiption looks like and which syntax constructs were used. However, + the full \small{AST} is a very complicated datastructure. If we are to + handle everything it offers, we will quickly get a big compiler. + + Using the core representation gives us a much more compact datastructure + (a core expression only uses 9 constructors). Note that this does not mean + that the core representation itself is smaller, on the contrary. Since the + core language has less constructs, a lot of things will take a larger + expression to express. + + However, the fact that the core language is so much smaller, means it is a + lot easier to analyze and translate it into something else. For the same + reason, \small{GHC} runs its simplifications and optimizations on the core + representation as well. + + However, we will use the normal core representation, not the simplified + core. Reasons for this are detailed below. + + The final prototype roughly consists of three steps: + + \startuseMPgraphic{ghc-pipeline} + % Create objects + save inp, front, norm, vhdl, out; + newEmptyBox.inp(0,0); + newBox.front(btex \small{GHC} frontend + desugarer etex); + newBox.norm(btex Normalization etex); + newBox.vhdl(btex \small{VHDL} generation etex); + newEmptyBox.out(0,0); + + % Space the boxes evenly + inp.c - front.c = front.c - norm.c = norm.c - vhdl.c + = vhdl.c - out.c = (0, 1.5cm); + out.c = origin; + + % Draw lines between the boxes. We make these lines "deferred" and give + % them a name, so we can use ObjLabel to draw a label beside them. + ncline.inp(inp)(front) "name(haskell)"; + ncline.front(front)(norm) "name(core)"; + ncline.norm(norm)(vhdl) "name(normal)"; + ncline.vhdl(vhdl)(out) "name(vhdl)"; + ObjLabel.inp(btex Haskell source etex) "labpathname(haskell)", "labdir(rt)"; + ObjLabel.front(btex Core etex) "labpathname(core)", "labdir(rt)"; + ObjLabel.norm(btex Normalized core etex) "labpathname(normal)", "labdir(rt)"; + ObjLabel.vhdl(btex \small{VHDL} description etex) "labpathname(vhdl)", "labdir(rt)"; + + % Draw the objects (and deferred labels) + drawObj (inp, front, norm, vhdl, out); + \stopuseMPgraphic + \placefigure[right]{GHC compiler pipeline}{\useMPgraphic{ghc-pipeline}} + + \startdesc{Frontend} + This is exactly the frontend and desugarer from the \small{GHC} + pipeline, that translates Haskell sources to a core representation. + \stopdesc + \startdesc{Normalization} + This is a step that transforms the core representation into a normal + form. This normal form is still expressed in the core language, but has + to adhere to an extra set of constraints. This normal form is less + expressive than the full core language (e.g., it can have limited higher + order expressions, has a specific structure, etc.), but is also very + close to directly describing hardware. + \stopdesc + \startdesc{\small{VHDL} generation} + The last step takes the normal formed core representation and generates + \small{VHDL} for it. Since the normal form has a specific, hardware-like + structure, this final step is very straightforward. + \stopdesc + + The most interesting step in this process is the normalization step. That + is where more complicated functional constructs, which have no direct + hardware interpretation, are removed and translated into hardware + constructs. This step is described in a lot of detail at + \in{chapter}[chap:normalization]. + + \section{The Core language} + Most of the prototype deals with handling the program in the Core + language. In this section we will show what this language looks like and + how it works. + + The Core language is a functional language that describes + \emph{expressions}. Every identifier used in Core is called a + \emph{binder}, since it is bound to a value somewhere. On the highest + level, a Core program is a collection of functions, each of which bind a + binder (the function name) to an expression (the function value, which has + a function type). + + The Core language itself does not prescribe any program structure, only + expression structure. In the \small{GHC} compiler, the Haskell module + structure is used for the resulting Core code as well. Since this is not + so relevant for understanding the Core language or the Normalization + process, we'll only look at the Core expression language here. + + Each Core expression consists of one of these possible expressions. + + \startdesc{Variable reference} +\startlambda +a +\stoplambda + This is a simple reference to a binder. It's written down as the + name of the binder that is being referred to, which should of course be + bound in a containing scope (including top level scope, so a reference + to a top level function is also a variable reference). Additionally, + constructors from algebraic datatypes also become variable references. + + The value of this expression is the value bound to the given binder. + + Each binder also carries around its type, but this is usually not shown + in the Core expressions. Occasionally, the type of an entire expression + or function is shown for clarity, but this is only informational. In + practice, the type of an expression is easily determined from the + structure of the expression and the types of the binders and occasional + cast expressions. This minimize the amount of bookkeeping needed to keep + the typing consistent. + \stopdesc + \startdesc{Literal} +\startlambda +10 +\stoplambda + This is a simple literal. Only primitive types are supported, like + chars, strings, ints and doubles. The types of these literals are the + \quote{primitive} versions, like \lam{Char\#} and \lam{Word\#}, not the + normal Haskell versions (but there are builtin conversion functions). + \stopdesc + \startdesc{Application} +\startlambda +func arg +\stoplambda + This is simple function application. Each application consists of two + parts: The function part and the argument part. Applications are used + for normal function \quote{calls}, but also for applying type + abstractions and data constructors. + + The value of an application is the value of the function part, with the + first argument binder bound to the argument part. + \stopdesc + \startdesc{Lambda abstraction} +\startlambda +λbndr.body +\stoplambda + This is the basic lambda abstraction, as it occurs in labmda calculus. + It consists of a binder part and a body part. A lambda abstraction + creates a function, that can be applied to an argument. + + Note that the body of a lambda abstraction extends all the way to the + end of the expression, or the closing bracket surrounding the lambda. In + other words, the lambda abstraction \quote{operator} has the lowest + priority of all. + + The value of an application is the value of the body part, with the + binder bound to the value the entire lambda abstraction is applied to. + \stopdesc + \startdesc{Non-recursive let expression} +\startlambda +let bndr = value in body +\stoplambda + A let expression allows you to bind a binder to some value, while + evaluating to some other value (where that binder is in scope). This + allows for sharing of subexpressions (you can use a binder twice) and + explicit \quote{naming} of arbitrary expressions. Note that the binder + is not in scope in the value bound to it, so it's not possible to make + recursive definitions with the normal form of the let expression (see + the recursive form below). + + Even though this let expression is an extension on the basic lambda + calculus, it is easily translated to a lambda abstraction. The let + expression above would then become: + +\startlambda +(λbndr.body) value +\stoplambda + + This notion might be useful for verifying certain properties on + transformations, since a lot of verification work has been done on + lambda calculus already. + + The value of a let expression is the value of the body part, with the + binder bound to the value. + \stopdesc + \startdesc{Recursive let expression} +\startlambda +let + bndr1 = value1 + \vdots + bndrn = valuen +in + body +\stoplambda + + This is the recursive version of the let expression. In \small{GHC}'s + Core implementation, non-recursive and recursive lets are not so + distinct as we present them here, but this provides a clearer overview. + + The main difference with the normal let expression is that each of the + binders is in scope in each of the values, in addition to the body. This + allows for self-recursive definitions or mutually recursive definitions. + + It should also be possible to express a recursive let using normal + lambda calculus, if we use the \emph{least fixed-point operator}, + \lam{Y}. + \stopdesc + \startdesc{Case expression} +\startlambda + case scrut of bndr + DEFAULT -> defaultbody + C0 bndr0,0 ... bndr0,m -> body0 + \vdots + Cn bndrn,0 ... bndrn,m -> bodyn +\stoplambda + +TODO: Define WHNF + + A case expression is the only way in Core to choose between values. A case + expression evaluates its scrutinee, which should have an algebraic + datatype, into weak head normal form (\small{WHNF}) and (optionally) binds + it to \lam{bndr}. It then chooses a body depending on the constructor of + its scrutinee. If none of the constructors match, the \lam{DEFAULT} + alternative is chosen. + + Since we can only match the top level constructor, there can be no overlap + in the alternatives and thus order of alternatives is not relevant (though + the \lam{DEFAULT} alternative must appear first for implementation + efficiency). + + Any arguments to the constructor in the scrutinee are bound to each of the + binders after the constructor and are in scope only in the corresponding + body. + + To support strictness, the scrutinee is always evaluated into WHNF, even + when there is only a \lam{DEFAULT} alternative. This allows a strict + function argument to be written like: + +\startlambda +function (case argument of arg + DEFAULT -> arg) +\stoplambda + + This seems to be the only use for the extra binder to which the scrutinee + is bound. When not using strictness annotations (which is rather pointless + in hardware descriptions), \small{GHC} seems to never generate any code + making use of this binder. The current prototype does not handle it + either, which probably means that code using it would break. + + Note that these case statements are less powerful than the full Haskell + case statements. In particular, they do not support complex patterns like + in Haskell. Only the constructor of an expression can be matched, complex + patterns are implemented using multiple nested case expressions. + + Case statements are also used for unpacking of algebraic datatypes, even + when there is only a single constructor. For examples, to add the elements + of a tuple, the following Core is generated: + +\startlambda +sum = λtuple.case tuple of + (,) a b -> a + b +\stoplambda + + Here, there is only a single alternative (but no \lam{DEFAULT} + alternative, since the single alternative is already exhaustive). When + it's body is evaluated, the arguments to the tuple constructor \lam{(,)} + (\eg, the elements of the tuple) are bound to \lam{a} and \lam{b}. + \stopdesc + \startdesc{Cast expression} +\startlambda +body :: targettype +\stoplambda + A cast expression allows you to change the type of an expression to an + equivalent type. Note that this is not meant to do any actual work, like + conversion of data from one format to another, or force a complete type + change. Instead, it is meant to change between different representations + of the same type, \eg switch between types that are provably equal (but + look different). + + In our hardware descriptions, we typically see casts to change between a + Haskell newtype and its contained type, since those are effectively + different representations of the same type. + + More complex are types that are proven to be equal by the typechecker, + but look different at first glance. To ensure that, once the typechecker + has proven equality, this information sticks around, explicit casts are + added. In our notation we only write the target type, but in reality a + cast expressions carries around a \emph{coercion}, which can be seen as a + proof of equality. TODO: Example + + The value of a cast is the value of its body, unchanged. The type of this + value is equal to the target type, not the type of its body. + + Note that this syntax is also used sometimes to indicate that a particular + expression has a particular type, even when no cast expression is + involved. This is then purely informational, since the only elements that + are explicitely typed in the Core language are the binder references and + cast expressions, the types of all other elements are determined at + runtime. + \stopdesc + \startdesc{Note} + + The Core language in \small{GHC} allows adding \emph{notes}, which serve + as hints to the inliner or add custom (string) annotations to a core + expression. These shouldn't be generated normally, so these are not + handled in any way in the prototype. + \stopdesc + \startdesc{Type} +\startlambda +@type +\stoplambda + It is possibly to use a Core type as a Core expression. This is done to + allow for type abstractions and applications to be handled as normal + lambda abstractions and applications above. This means that a type + expression in Core can only ever occur in the argument position of an + application, and only if the type of the function that is applied to + expects a type as the first argument. This happens for all polymorphic + functions, for example, the \lam{fst} function: + +\startlambda +fst :: \forall a. \forall b. (a, b) -> a +fst = λtup.case tup of (,) a b -> a + +fstint :: (Int, Int) -> Int +fstint = λa.λb.fst @Int @Int a b +\stoplambda + + The type of \lam{fst} has two universally quantified type variables. When + \lam{fst} is applied in \lam{fstint}, it is first applied to two types. + (which are substitued for \lam{a} and \lam{b} in the type of \lam{fst}, so + the type of \lam{fst} actual type of arguments and result can be found: + \lam{fst @Int @Int :: (Int, Int) -> Int}). + \stopdesc + + TODO: Core type system + Implementation issues + Simplified core? Haskell language coverage / constraints Recursion