Explicitely use "letrec" for recursive lets.

[matthijs/master-project/report.git] / Chapters / HardwareDescription.tex
diff --git a/Chapters/HardwareDescription.tex b/Chapters/HardwareDescription.tex

index c9005967fd128fadfed7f712227d3939010d37f7..508c11bef4a4dd3f5028ae08cf7befc99a7c4d56 100644 (file)
--- a/Chapters/HardwareDescription.tex
+++ b/Chapters/HardwareDescription.tex
@@ -18,7 +18,7 @@
  
    \section{Function application}
    The basic syntactic element of a functional program are functions and
-  function application. These have a single obvious VHDL translation: Each
+  function application. These have a single obvious \small{VHDL} translation: Each
    function becomes a hardware component, where each argument is an input port
    and the result value is the output port.
  
@@ -27,17 +27,54 @@
    to the corresponding input port. The output port of the function is also
    mapped to a signal, which is used as the result of the application.
  
-  An example of a simple program using only function application would be:
+  \in{Example}[ex:And3] shows a simple program using only function
+  application and the corresponding architecture.
  
-  \starthaskell
-  -- | A simple function that returns the and of three bits
-  and3 :: Bit -> Bit -> Bit -> Bit
-  and3 a b c = and (and a b) c
-  \stophaskell
+\startbuffer[And3]
+-- | A simple function that returns 
+--   the and of three bits
+and3 :: Bit -> Bit -> Bit -> Bit
+and3 a b c = and (and a b) c
+\stopbuffer
  
-  This results in the following hardware:
-  
-  TODO: Pretty picture
+  \startuseMPgraphic{And3}
+    save a, b, c, anda, andb, out;
+
+    % I/O ports
+    newCircle.a(btex $a$ etex) "framed(false)";
+    newCircle.b(btex $b$ etex) "framed(false)";
+    newCircle.c(btex $c$ etex) "framed(false)";
+    newCircle.out(btex $out$ etex) "framed(false)";
+
+    % Components
+    newCircle.anda(btex $and$ etex);
+    newCircle.andb(btex $and$ etex);
+
+    a.c    = origin;
+    b.c    = a.c + (0cm, 1cm);
+    c.c    = b.c + (0cm, 1cm);
+    anda.c = midpoint(a.c, b.c) + (2cm, 0cm);
+    andb.c = midpoint(b.c, c.c) + (4cm, 0cm);
+
+    out.c   = andb.c + (2cm, 0cm);
+
+    % Draw objects and lines
+    drawObj(a, b, c, anda, andb, out);
+
+    ncarc(a)(anda) "arcangle(-10)";
+    ncarc(b)(anda);
+    ncarc(anda)(andb);
+    ncarc(c)(andb);
+    ncline(andb)(out);
+  \stopuseMPgraphic
+
+  \placeexample[here][ex:And3]{Simple three port and.}
+    \startcombination[2*1]
+      {\typebufferhs{And3}}{Haskell description using function applications.}
+      {\boxedgraphic{And3}}{The architecture described by the Haskell description.}
+    \stopcombination
+
+  TODO: Define top level function and subfunctions/circuits.
  
    \subsection{Partial application}
    It should be obvious that we cannot generate hardware signals for all
@@ -47,19 +84,50 @@
    represented as a signal or i/o port to a component.
  
    From this, we can see that the above translation rules do not apply to a
-  partial application. Let's look at an example:
+  partial application. \in{Example}[ex:Quadruple] shows an example use of
+  partial application and the corresponding architecture.
  
-  \starthaskell
-  -- | Multiply the input word by four.
-  quadruple :: Word -> Word
-  quadruple n = mul (mul n)
-    where
-      mul = (*) 2
-  \stophaskell
+\startbuffer[Quadruple]
+-- | Multiply the input word by four.
+quadruple :: Word -> Word
+quadruple n = mul (mul n)
+  where
+    mul = (*) 2
+\stopbuffer
+
+  \startuseMPgraphic{Quadruple}
+    save in, two, mula, mulb, out;
+
+    % I/O ports
+    newCircle.in(btex $n$ etex) "framed(false)";
+    newCircle.two(btex $2$ etex) "framed(false)";
+    newCircle.out(btex $out$ etex) "framed(false)";
+
+    % Components
+    newCircle.mula(btex $\times$ etex);
+    newCircle.mulb(btex $\times$ etex);
  
-  It should be clear that the above code describes the following hardware:
+    two.c    = origin;
+    in.c     = two.c + (0cm, 1cm);
+    mula.c  = in.c + (2cm, 0cm);
+    mulb.c  = mula.c + (2cm, 0cm);
+    out.c   = mulb.c + (2cm, 0cm);
  
-  TODO: Pretty picture
+    % Draw objects and lines
+    drawObj(in, two, mula, mulb, out);
+
+    nccurve(two)(mula) "angleA(0)", "angleB(45)";
+    nccurve(two)(mulb) "angleA(0)", "angleB(45)";
+    ncline(in)(mula);
+    ncline(mula)(mulb);
+    ncline(mulb)(out);
+  \stopuseMPgraphic
+
+  \placeexample[here][ex:Quadruple]{Simple three port and.}
+    \startcombination[2*1]
+      {\typebufferhs{Quadruple}}{Haskell description using function applications.}
+      {\boxedgraphic{Quadruple}}{The architecture described by the Haskell description.}
+    \stopcombination
  
    Here, the definition of mul is a partial function application: It applies
    \hs{2 :: Word} to the function \hs{(*) :: Word -> Word -> Word} resulting in
@@ -76,7 +144,395 @@
    function boundaries), but eventually, the partial application will become
    completely applied.
  
-  \section{Recursion}
+  \section{Costless specialization}
+    Each (complete) function application in our description generates a
+    component instantiation, or a specific piece of hardware in the final
+    design. It is interesting to note that each application of a function
+    generates a \emph{separate} piece of hardware. In the final design, none
+    of the hardware is shared between applications, even when the applied
+    function is the same.
+
+    This is distinctly different from normal program compilation: Two separate
+    calls to the same function share the same machine code. Having more
+    machine code has implications for speed (due to less efficient caching)
+    and memory usage. For normal compilation, it is therefore important to
+    keep the amount of functions limited and maximize the code sharing.
+
+    When generating hardware, this is hardly an issue. Having more \quote{code
+    sharing} does reduce the amount of \small{VHDL} output (Since different
+    component instantiations still share the same component), but after
+    synthesis, the amount of hardware generated is not affected.
+
+    In particular, if we would duplicate all functions so that there is a
+    duplicate for every application in the program (\eg, each function is then
+    only applied exactly once), there would be no increase in hardware size
+    whatsoever.
+   
+   TODO: Perhaps these next two sections are a bit too
+   implementation-oriented?
+
+    \subsection{Specialization}
+      Because of this, a common optimization technique called
+      \emph{specialization} is as good as free for hardware generation.  Given
+      some function that has a \emph{domain} of $D$ (\eg, the set of all
+      possible arguments that could be applied), we create a specialized
+      function with exactly the same behaviour, but with an domain of $D'
+      \subset D$. This subset can be derived in all sort of ways, but commonly
+      this is done by limiting a polymorphic argument to a single type (\eg,
+      removing polymorphism) or by limiting an argument to just a single value
+      (\eg, cross-function constant propagation, effectively removing the
+      argument). 
+      
+      Since we limit the argument domain of the specialized function, its
+      definition can often be optimized further (since now more types or even
+      values of arguments are already know). By replacing any application of
+      the function that falls within the reduced domain by an application of
+      the specialized version, the code gets faster (but the code also gets
+      bigger, since we now have two versions instead of one!). If we apply
+      this technique often enough, we can often replace all applications of a
+      function by specialized versions, allowing the original function to be
+      removed (in some cases, this can even give a net reduction of the code
+      compared to the non-specialized version).
+
+      Specialization is useful for our hardware descriptions for functions
+      that contain arguments that cannot be translated to hardware directly
+      (polymorphic or higher order arguments, for example). If we can create
+      specialized functions that remove the argument, or make it translatable,
+      we can use specialization to make the original, untranslatable, function
+      obsolete.
+
+  \section{Higher order values}
+    What holds for partial application, can be easily generalized to any
+    higher order expression. This includes partial applications, plain
+    variables (e.g., a binder referring to a top level function), lambda
+    expressions and more complex expressions with a function type (a case
+    expression returning lambda's, for example).
+
+    Each of these values cannot be directly represented in hardware (just like
+    partial applications). Also, to make them representable, they need to be
+    applied: function variables and partial applications will then eventually
+    become complete applications, applied lambda expressions disappear by
+    applying β-reduction, etc.
+
+    So any higher order value will be \quote{pushed down} towards its
+    application just like partial applications. Whenever a function boundary
+    needs to be crossed, the called function can be specialized.
+  
+    TODO: This is section should be improved
+
+  \section{Polymorphism}
+    In Haskell, values can be polymorphic: They can have multiple types. For
+    example, the function \hs{fst :: (a, b) -> a} is an example of a
+    polymorphic function: It works for tuples with any element types. Haskell
+    typeclasses allow a function to work on a specific set of types, but the
+    general idea is the same.
+
+%    A type class is a collection of types for which some operations are
+%    defined. It is thus possible for a value to be polymorphic while having
+%    any number of \emph{class constraints}: The value is not defined for
+%    every type, but only for types in the type class. An example of this is
+%    the \hs{even :: (Integral a) => a -> Bool} function, which can map any
+%    value of a type that is member of the \hs{Integral} type class 
+
+    When generating hardware, polymorphism can't be easily translated. How
+    many wire will you lay down for a value that could have any type? When
+    type classes are involved, what hardware components will you lay down for
+    a class method (whose behaviour depends on the type of its arguments)?
+
+    Fortunately, we can again use the principle of specialization: Since every
+    function application generates separate pieces of hardware, we can know
+    the types of all arguments exactly. Provided that we don't use existential
+    typing, all of the polymorphic types in a function must depend on the
+    types of the arguments (In other words, the only way to introduce a type
+    variable is in a lambda abstraction). Our top level function must not have
+    a polymorphic type (otherwise we wouldn't know the hardware interface to
+    our top level function).
+
+    If a function is monomorphic, all values inside it are monomorphic as
+    well, so any function that is applied within the function can only be
+    applied to monomorphic values. The applied functions can then be
+    specialized to work just for these specific types, removing the
+    polymorphism from the applied functions as well.
+
+    By induction, this means that all functions that are (indirectly) called
+    by our top level function (meaning all functions that are translated in
+    the final hardware) become monomorphic.
+
+  \section{State}
+    A very important concept in hardware designs is \emph{state}. In a
+    stateless (or, \emph{combinatoric}) design, every output is a directly and solely dependent on the
+    inputs. In a stateful design, the outputs can depend on the history of
+    inputs, or the \emph{state}. State is usually stored in \emph{registers},
+    which retain their value during a clockcycle, and are typically updated at
+    the start of every clockcycle. Since the updating of the state is tightly
+    coupled (synchronized) to the clock signal, these state updates are often
+    called \emph{synchronous}.
+  
+    To make our hardware description language useful to describe more that
+    simple combinatoric designs, we'll need to be able to describe state in
+    some way.
+
+    \subsection{Approaches to state}
+      In Haskell, functions are always pure (except when using unsafe
+      functions like \hs{unsafePerformIO}, which should be prevented whenever
+      possible). This means that the output of a function solely depends on
+      its inputs. If you evaluate a given function with given inputs, it will
+      always provide the same output.
+
+      TODO: Define pure
+
+      This is a perfect match for a combinatoric circuit, where the output
+      also soley depend on the inputs. However, when state is involved, this
+      no longer holds. Since we're in charge of our own language, we could
+      remove this purity constraint and allow a function to return different
+      values depending on the cycle in which it is evaluated (or rather, the
+      current state). However, this means that all kinds of interesting
+      properties of our functional language get lost, and all kinds of
+      transformations and optimizations might no longer be meaning preserving.
+
+      Provided that we want to keep the function pure, the current state has
+      to be present in the function's arguments in some way. There seem to be
+      two obvious ways to do this: Adding the current state as an argument, or
+      including the full history of each argument.
+
+      \subsubsection{Stream arguments and results}
+        Including the entire history of each input (\eg, the value of that
+        input for each previous clockcycle) is an obvious way to make outputs
+        depend on all previous input. This is easily done by making every
+        input a list instead of a single value, containing all previous values
+        as well as the current value.
+
+        An obvious downside of this solution is that on each cycle, all the
+        previous cycles must be resimulated to obtain the current state. To do
+        this, it might be needed to have a recursive helper function as well,
+        wich might be hard to properly analyze by the compiler.
+
+        A slight variation on this approach is one taken by some of the other
+        functional \small{HDL}s in the field (TODO: References to Lava,
+        ForSyDe, ...): Make functions operate on complete streams. This means
+        that a function is no longer called on every cycle, but just once. It
+        takes stream as inputs instead of values, where each stream contains
+        all the values for every clockcycle since system start. This is easily
+        modeled using an (infinite) list, with one element for each clock
+        cycle. Since the funciton is only evaluated once, its output is also a
+        stream. Note that, since we are working with infinite lists and still
+        want to be able to simulate the system cycle-by-cycle, this relies
+        heavily on the lazy semantics of Haskell.
+
+        Since our inputs and outputs are streams, all other (intermediate)
+        values must be streams. All of our primitive operators (\eg, addition,
+        substraction, bitwise operations, etc.) must operate on streams as
+        well (note that changing a single-element operation to a stream
+        operation can done with \hs{map}, \hs{zipwith}, etc.).
+
+        Note that the concept of \emph{state} is no more than having some way
+        to communicate a value from one cycle to the next. By introducing a
+        \hs{delay} function, we can do exactly that: Delay (each value in) a
+        stream so that we can "look into" the past. This \hs{delay} function
+        simply outputs a stream where each value is the same as the input
+        value, but shifted one cycle. This causes a \quote{gap} at the
+        beginning of the stream: What is the value of the delay output in the
+        first cycle? For this, the \hs{delay} function has a second input
+        (which is a value, not a stream!).
+
+        \in{Example}[ex:DelayAcc] shows a simple accumulator expressed in this
+        style.
+
+\startbuffer[DelayAcc]
+acc :: Stream Word -> Stream Word
+acc in = out
+  where
+    out = (delay out 0) + in
+\stopbuffer
+
+\startuseMPgraphic{DelayAcc}
+  save in, out, add, reg;
+
+  % I/O ports
+  newCircle.in(btex $in$ etex) "framed(false)";
+  newCircle.out(btex $out$ etex) "framed(false)";
+
+  % Components
+  newReg.reg("") "dx(4mm)", "dy(6mm)", "reflect(true)";
+  newCircle.add(btex + etex);
+  
+  in.c    = origin;
+  add.c   = in.c + (2cm, 0cm);
+  out.c   = add.c + (2cm, 0cm);
+  reg.c   = add.c + (0cm, 2cm);
+
+  % Draw objects and lines
+  drawObj(in, out, add, reg);
+
+  nccurve(add)(reg) "angleA(0)", "angleB(180)", "posB(d)";
+  nccurve(reg)(add) "angleA(180)", "angleB(-45)", "posA(out)";
+  ncline(in)(add);
+  ncline(add)(out);
+\stopuseMPgraphic
+
+
+        \placeexample[here][ex:DelayAcc]{Simple accumulator architecture.}
+          \startcombination[2*1]
+            {\typebufferhs{DelayAcc}}{Haskell description using streams.}
+            {\boxedgraphic{DelayAcc}}{The architecture described by the Haskell description.}
+          \stopcombination
+
+
+        This notation can be confusing (especially due to the loop in the
+        definition of out), but is essentially easy to interpret. There is a
+        single call to delay, resulting in a circuit with a single register,
+        whose input is connected to \hs{outl (which is the output of the
+        adder)}, and it's output is the \hs{delay out 0} (which is connected
+        to one of the adder inputs).
+
+        This notation has a number of downsides, amongst which are limited
+        readability and ambiguity in the interpretation. TODO: Reference
+        Christiaan.
+        
+      \subsubsection{Explicit state arguments and results}
+        A more explicit way to model state, is to simply add an extra argument
+        containing the current state value. This allows an output to depend on
+        both the inputs as well as the current state while keeping the
+        function pure (letting the result depend only on the arguments), since
+        the current state is now an argument.
+
+        In Haskell, this would look like \in{example}[ex:ExplicitAcc].
+
+\startbuffer[ExplicitAcc]
+-- input -> current state -> (new state, output)
+acc :: Word -> Word -> (Word, Word)
+acc in (State s) = (State s', out)
+  where
+    out = s + in
+    s'  = out
+\stopbuffer
+
+        \placeexample[here][ex:ExplicitAcc]{Simple accumulator architecture.}
+          \startcombination[2*1]
+            {\typebufferhs{ExplicitAcc}}{Haskell description using explicit state arguments.}
+            % Picture is identical to the one we had just now.
+            {\boxedgraphic{DelayAcc}}{The architecture described by the Haskell description.}
+          \stopcombination
+
+        This approach makes a function's state very explicit, which state
+        variables are used by a function can be completely determined from its
+        type signature (as opposed to the stream approach, where a function
+        looks the same from the outside, regardless of what state variables it
+        uses (or wether it's stateful at all).
+
+        This approach is the one chosen for Cλash and will be examined more
+        closely below.
+
+    \subsection{Explicit state specification}
+      We've seen the concept of explicit state in a simple example below, but
+      what are the implications of this approach?
+
+      \subsubsection{Substates}
+        Since a function's state is reflected directly in its type signature,
+        if a function calls other stateful functions (\eg, has subcircuits) it
+        has to somehow know the current state for these called functions. The
+        only way to do this, is to put these \emph{substates} inside the
+        caller's state. This means that a function's state is the sum of the
+        states of all functions it calls, and its own state.
+
+        This also means that the type of a function (at least the "state"
+        part) is dependent on its implementation and the functions it calls.
+        This is the major downside of this approach: The separation between
+        interface and implementation is limited. However, since Cλash is not
+        very suitable for separate compilation (see
+        \in{section}[sec:prototype:separate]) this is not a big problem in
+        practice. Additionally, when using a type synonym for the state type
+        of each function, we can still provide explicit type signatures
+        while keeping the state specification for a function near its
+        definition only.
+    
+      \subsubsection{...}
+        We need some way to know which arguments should become input ports and
+        which argument(s?) should become the current state (\eg, be bound to
+        the register outputs). This does not hold holds not just for the top
+        level function, but also for any subfunctions. Or could we perhaps
+        deduce the statefulness of subfunctions by analyzing the flow of data
+        in the calling functions?
+
+        To explore this matter, we make an interesting observation: We get
+        completely correct behaviour when we put all state registers in the
+        top level entity (or even outside of it). All of the state arguments
+        and results on subfunctions are treated as normal input and output
+        ports. Effectively, a stateful function results in a stateless
+        hardware component that has one of its input ports connected to the
+        output of a register and one of its output ports connected to the
+        input of the same register.
+
+        TODO: Example?
+
+        Of course, even though the hardware described like this has the
+        correct behaviour, unless the layout tool does smart optimizations,
+        there will be a lot of extra wire in the design (since registers will
+        not be close to the component that uses them). Also, when working with
+        the generated \small{VHDL} code, there will be a lot of extra ports
+        just to pass one state values, which can get quite confusing.
+
+        To fix this, we can simply \quote{push} the registers down into the
+        subcircuits. When we see a register that is connected directly to a
+        subcircuit, we remove the corresponding input and output port and put
+        the register inside the subcircuit instead. This is slightly less
+        trivial when looking at the Haskell code instead of the resulting
+        circuit, but the idea is still the same.
+
+        TODO: Example?
+
+        However, when applying this technique, we might push registers down
+        too far. When you intend to store a result of a stateless subfunction
+        in the caller's state and pass the current value of that state
+        variable to that same function, the register might get pushed down too
+        far. It is impossible to distinguish this case from similar code where
+        the called function is in fact stateful. From this we can conclude
+        that we have to either:
+
+        \startitemize
+        \item accept that the generated hardware might not be exactly what we
+        intended, in some specific cases. In most cases, the hardware will be
+        what we intended.
+        \item explicitely annotate state arguments and results in the input
+        description.
+        \stopitemize
+
+        The first option causes (non-obvious) exceptions in the language
+        intepretation. Also, automatically determining where registers should
+        end up is easier to implement correctly with explicit annotations, so
+        for these reasons we will look at how this annotations could work.
+
+
+      TODO: Note about conditions on state variables and checking them.
+
+    \subsection{Explicit state annotation}
+      To make our stateful descriptions unambigious and easier to translate,
+      we need some way for the developer to describe which arguments and
+      results are intended to become stateful.
+    
+      Roughly, we have two ways to achieve this:
+      \startitemize[KR]
+        \item Use some kind of annotation method or syntactic construction in
+        the language to indicate exactly which argument and (part of the)
+        result is stateful. This means that the annotation lives
+        \quote{outside} of the function, it is completely invisible when
+        looking at the function body.
+        \item Use some kind of annotation on the type level, \eg give stateful
+        arguments and (part of) results a different type. This has the
+        potential to make this annotation visible inside the function as well,
+        such that when looking at a value inside the function body you can
+        tell if it's stateful by looking at its type. This could possibly make
+        the translation process a lot easier, since less analysis of the
+        program flow might be required.
+      \stopitemize
+
+      From these approaches, the type level \quote{annotations} have been
+      implemented in Cλash. \in{Section}[sec:prototype:statetype] expands on
+      the possible ways this could have been implemented.
+
+  TODO: Say something about dependent types and fixed size vectors
+
+  \section[sec:recursion]{Recursion}
    An import concept in functional languages is recursion. In it's most basic
    form, recursion is a function that is defined in terms of itself. This
    usually requires multiple evaluations of this function, with changing
@@ -177,3 +633,6 @@
  
    Due to these complications, we leave other forms of recursion as
    future work as well.
+  
+  \section{Supported types}
+    TODO