Add the prototype to the research goals.

[matthijs/master-project/report.git] / Chapters / HardwareDescription.tex
diff --git a/Chapters/HardwareDescription.tex b/Chapters/HardwareDescription.tex

index 90007ee02eee550c6a73d7a354229f6fe3209122..ddce85dedbbc147263b481f42eff8408a924caf7 100644 (file)
--- a/Chapters/HardwareDescription.tex
+++ b/Chapters/HardwareDescription.tex
@@ -27,17 +27,329 @@
    to the corresponding input port. The output port of the function is also
    mapped to a signal, which is used as the result of the application.
  
-  An example of a simple program using only function application would be:
+  \in{Example}[ex:And3] shows a simple program using only function
+  application and the corresponding architecture.
+
+\startbuffer[And3]
+-- | A simple function that returns 
+--   the and of three bits
+and3 :: Bit -> Bit -> Bit -> Bit
+and3 a b c = and (and a b) c
+\stopbuffer
  
-  \starthaskell
-  -- | A simple function that returns the and of three bits
-  and3 :: Bit -> Bit -> Bit -> Bit
-  and3 a b c = and (and a b) c
-  \stophaskell
+  \startuseMPgraphic{And3}
+    save a, b, c, anda, andb, out;
+
+    % I/O ports
+    newCircle.a(btex $a$ etex) "framed(false)";
+    newCircle.b(btex $b$ etex) "framed(false)";
+    newCircle.c(btex $c$ etex) "framed(false)";
+    newCircle.out(btex $out$ etex) "framed(false)";
+
+    % Components
+    newCircle.anda(btex $and$ etex);
+    newCircle.andb(btex $and$ etex);
+
+    a.c    = origin;
+    b.c    = a.c + (0cm, 1cm);
+    c.c    = b.c + (0cm, 1cm);
+    anda.c = midpoint(a.c, b.c) + (2cm, 0cm);
+    andb.c = midpoint(b.c, c.c) + (4cm, 0cm);
+
+    out.c   = andb.c + (2cm, 0cm);
+
+    % Draw objects and lines
+    drawObj(a, b, c, anda, andb, out);
+
+    ncarc(a)(anda) "arcangle(-10)";
+    ncarc(b)(anda);
+    ncarc(anda)(andb);
+    ncarc(c)(andb);
+    ncline(andb)(out);
+  \stopuseMPgraphic
+
+  \placeexample[here][ex:And3]{Simple three port and.}
+    \startcombination[2*1]
+      {\typebufferhs{And3}}{Haskell description using function applications.}
+      {\boxedgraphic{And3}}{The architecture described by the Haskell description.}
+    \stopcombination
+
+  TODO: Define top level function and subfunctions/circuits.
+
+  \section{Choice}
+    Since describing components and connections allows us to describe a lot of
+    hardware designs already, there is an obvious thing missing: Choice. We
+    need some way to be able to choose between values based on another value.
+    In Haskell, choice is achieved by case expressions, if expressions or
+    pattern matching (all of which can be reduced to case expressions).
+
+    An obvious way to add choice to our language without having to recognize
+    any of Haskell's syntax, would be to add a primivite \quote{\hs{if}}
+    function. This function would take three arguments: The condition, the
+    value to return when the condition is true and the value to return when
+    the condition is false.
+
+    This \hs{if} function would then essentially describe a multiplexer and
+    allows us to describe any architecture that uses multiplexers. (TODO: Are
+    there other mechanisms of choice in hardware?)
+
+    However, to be able to describe our hardware in a more convenient way, we
+    also need to translate Haskell's choice mechanisms. The easiest of these
+    are of course case expressions (and if expressions, which can be very
+    directly translated to case expressions). A case expression can in turn
+    simply be translated to a conditional assignment, where the conditions use
+    equality comparisons against the constructors in the case expressions.
+
+    In \in{example}[ex:CaseInv] a simple case statement is shown,
+    scrutinizing a boolean value. If we would translate a Boolean to a bit
+    value, we could of course remove the comparator and directly feed 'in'
+    into the multiplex (or even use an inverter instead of a multiplexer).
+    However, we will try to make a general translation, which works for all
+    possible case statements. Optimizations such as these are left for the
+    \VHDL synthesizer, which handles them very well.
+
+    \startbuffer[CaseInv]
+    inv :: Bool -> Bool
+    inv in = case in of
+      True -> False
+      False -> True
+    \stopbuffer
+
+    \startuseMPgraphic{CaseInv}
+      save in, truecmp, falseout, trueout, out, cmp, mux;
+
+      % I/O ports
+      newCircle.in(btex $in$ etex) "framed(false)";
+      newCircle.out(btex $out$ etex) "framed(false)";
+      % Constants
+      newBox.truecmp(btex $True$ etex) "framed(false)";
+      newBox.trueout(btex $True$ etex) "framed(false)";
+      newBox.falseout(btex $False$ etex) "framed(false)";
+
+      % Components
+      newCircle.cmp(btex $==$ etex);
+      newMux.mux;
+
+      in.c       = origin;
+      cmp.c      = in.c + (3cm, 0cm);
+      truecmp.c  = cmp.c + (-1cm, 1cm);
+      mux.sel    = cmp.e + (1cm, -1cm);
+      falseout.c = mux.inpa - (2cm, 0cm);
+      trueout.c  = mux.inpb - (2cm, 0cm);
+      out.c      = mux.out + (2cm, 0cm);
+
+      % Draw objects and lines
+      drawObj(in, out, truecmp, trueout, falseout, cmp, mux);
+
+      ncline(in)(cmp);
+      ncarc(truecmp)(cmp);
+      nccurve(cmp.e)(mux.sel) "angleA(0)", "angleB(-90)";
+      ncline(falseout)(mux) "posB(inpa)";
+      ncline(trueout)(mux) "posB(inpb)";
+      ncline(mux)(out) "posA(out)";
+    \stopuseMPgraphic
+
+    \placeexample[here][ex:CaseInv]{Simple inverter.}
+      \startcombination[2*1]
+        {\typebufferhs{CaseInv}}{Haskell description using a Case expression.}
+        {\boxedgraphic{CaseInv}}{The architecture described by the Haskell description.}
+      \stopcombination
+
+    Slightly more complex (but very powerful) forms of choice are pattern
+    matches. A function can be defined in multiple clauses, where each clause
+    specifies a pattern. When the arguments match the pattern, the
+    corresponding clause will be used.
+
+    \startbuffer[PatternInv]
+    inv :: Bool -> Bool
+    inv True = False
+    inv False = True
+    \stopbuffer
+
+    \placeexample[here][ex:PatternInv]{Simple inverter using pattern matching.}
+        {\typebufferhs{CaseInv}}
+
+    The architecture described in \in{example}[ex:PatternInv] is of course the
+    same one as the one in \in{example}[ex:CaseInv]. The general interpration
+    of pattern matching is also similar to that of case expressions: Generate
+    hardware for each of the clause (like each of the clauses of a case
+    expression) and connect them to the function output through (a number of
+    nested) multiplexers. These multiplexers are driven by comparators and
+    other logic, that check each pattern in turn.
+
+  \section{Types}
+    Apart from giving structure to our hardware, we'll also need to provide
+    some way to translate the values used to hardware equivalents. In
+    particular, this means having to come up with a hardware equivalent for
+    every \emph{type} used in our program.
+
+    Since most functional languages have a lot of standard types that are
+    hard to translate (integers without a fixed size, lists without a static
+    length, etc.), we will start out by defining a number of \quote{builtin}
+    types ourselves. These types are builtin in the sense that our compiler
+    will have a fixed VHDL type for these. User defined types, on the other
+    hand, will have their hardware type derived directly from their Haskell
+    declaration automatically, according to the rules we sketch here.
+
+    \subsection{Builtin types}
+      \startdesc{Bit}
+        This is the most basic type available. It is mapped directly onto
+        the \type{std_logic} \small{VHDL} type. Mapping this to the
+        \type{bit} type might make more sense (since the Haskell version
+        only has two values), but using \type{std_logic} is more standard
+        (and allowed for some experimentation with don't care values)
+      \stopdesc
+      \startdesc{Bool}
+        This is a builtin Haskell type, but is translated exactly like the
+        Bit type. Supporting the Bool type is particularly useful to support
+        \hs{if ... then ... else ...} expressions, which always have a
+        \hs{Bool} value for the condition.
  
-  This results in the following hardware:
-  
-  TODO: Pretty picture
+        A \hs{Bool} is translated to a \type{std_logic}, just like \hs{Bit}.
+      \stopdesc
+      \startdesc{SizedWord, SizedInt}
+        These are types to represent integers. \hs{SizedWord} is unsigned,
+        while \hs{SizedInt} is signed. These types are parameterized by a
+        length type, so you can define an unsigned word of 32 bits wide as
+        follows:
+        
+        \starthaskell
+          type Word32 = SizedWord D32
+        \stophaskell
+
+        Here, \hs{D32} is the \emph{type level representation} of the number
+        32. TODO: Introduce dependent types somewhere.
+
+        These types are translated to the \small{VHDL} \type{unsigned} and
+        \type{signed} respectively.
+      \stopdesc
+      \startdesc{RangedWord}
+        This is another type to describe integers, but unlike the previous
+        two it has no specific width, but an upper bound. This means that
+        its range is not limited to powers of two, but can be any number.
+        A \hs{RangedWord} only has an upper bound, its lower bound is
+        implicitly zero. There is a lot of added implementation complexity
+        when adding a lower bound and having just an upper bound was enough
+        for the primary purpose of this type: Typesafely indexing vectors.
+
+        This type is translated to the \type{unsigned} \small{VHDL} type.
+      \stopdesc
+      \startdesc{TFVec}
+        This is a vector type, with a fixed length. It has two type
+        parameters: Its length and the type of the elements contained in it.
+
+        A fixed size vector is translated to a \small{VHDL} array type.
+       
+        The \hs{TF} in its name is a reference to the implementation used in
+        the prototype, which uses \emph{type families}.
+      \stopdesc
+
+      TODO: Reference Christiaan
+    \subsection{User-defined types}
+      There are three ways to define new types in Haskell: Algebraic
+      datatypes with the \hs{data} keyword, type synonyms with the \hs{type}
+      keyword and type renamings with the \hs{newtype} keyword. This
+      expclitely excludes more advanced type creation from \GHC extensions
+      such as type families, existential typing, \small{GADT}s, etc.
+
+      The first of these actually introduces a new type, for which we provide
+      the \VHDL translation below. The latter two only define new names for
+      existing types (where synonyms are completely interchangeable and
+      renamings need explicit conversion). Therefore, these don't need any
+      particular \VHDL translation, a synonym or renamed type will just use
+      the same representation as the equivalent type.
+
+      For algebraic types, we can make the following distinction:
+
+      \startdesc{Product types}
+        A product type is an algebraic datatype with a single constructor with
+        two or more fields. This is essentially a way to pack a few values
+        together in a record-like structure. In fact, the builtin tuple types
+        are just product types (and are thus supported in exactly the same
+        way).
+
+        The "product" in its name refers to the collection of values belonging
+        to this type. The collection for a product type is the cartesic
+        product of the collections for the types of its fields.
+
+        These types are translated to \VHDL record types, with one field for
+        every field in the constructor. This translation applies to all single
+        constructor algebraic datatypes, including those with no fields (unit
+        types) and just one field (which are technically not a product).
+      \stopdesc
+      \startdesc{Enumerated types}
+        An enumerated type is an algebraic datatype with multiple constructors, but
+        none of them have fields. This is essentially a way to get an
+        enum-like type containing alternatives.
+
+        Note that Haskell's \hs{Bool} type is also defined as an
+        enumeration type, but we have a fixed translation for that.
+        
+        These types are translated to \VHDL enumerations, with one value for
+        each constructor. This allows references to these constructors to be
+        translated to the corresponding enumeration value.
+      \stopdesc
+      \startdesc{Sum types}
+        A sum type is an algebraic datatype with multiple constructors, where
+        the constructors have one or more fields. Technically, a type with
+        more than one field per constructor is a sum of products type, but
+        for our purposes this distinction does not really make a difference,
+        so we'll leave it out.
+
+        Sum types are currently not supported by the prototype, since its
+        translation is not so obvious as before. The can easily be emulated,
+        as we will see from an example:
+
+        \starthaskell
+        data Sum = A Bit Word | B Word 
+        \stophaskell
+
+        An obvious way to translate this would be to create an enumeration to
+        distinguish the constructors and then create a big record that
+        contains all the fields of all the constructors. This is the same
+        translation that would result from the following enumeration and
+        product type (using a tuple for clarity):
+
+        \starthaskell
+        data SumC = A | B
+        type Sum = (SumC, Bit, Word, Word)
+        \stophaskell
+
+        Here, the \hs{SumC} type effectively signals which of the fields of
+        the \hs{Sum} type are valid, all the other ones have no useful value.
+
+        An obvious problem with this naive approach is the space usage: The
+        example above generates a fairly big \VHDL type. However, we can be
+        sure that the two \hs{Word}s in the \hs{Sum} type will never be valid
+        at the same time, so this is a waste of space.
+
+        Obviously, we could do some duplication detection here to reuse a
+        particular field for another constructor, but this would only
+        partially solve the problem. If I would, for example, have an array of
+        8 bits and a 8 bit unsiged word, these are different types and could
+        not be shared. However, in the final hardware, both of these types
+        would simply be 8 bit connections, so we have a 100\% size increase by
+        not sharing these!
+      \stopdesc
+
+      Another interesting case is that of recursive types. In Haskell, an
+      algebraic datatype can be recursive: Any of its field types can be (or
+      contain) the type being defined. The most well-known recursive type is
+      probably the list type, which is defined is:
+      
+      \starthaskell
+      data List a = Empty | Cons a (List a)
+      \stophaskell
+
+      Where \hs{Empty} is usually written as \hs{[]} and \hs{Cons} as \hs{:}.
+      This immediately shows the problem with recursive types: What hardware
+      type to allocate here? There is no way to statically determine how long
+      the list will be
+      
+      In general, we can say we can never properly translated recursive types:
+      All recursive types have a potentially infinite value (even though in
+      practice they will have a bounded value, there is no way for the
+      compiler to determine an upper bound on its size.
  
    \subsection{Partial application}
    It should be obvious that we cannot generate hardware signals for all
@@ -47,19 +359,50 @@
    represented as a signal or i/o port to a component.
  
    From this, we can see that the above translation rules do not apply to a
-  partial application. Let's look at an example:
+  partial application. \in{Example}[ex:Quadruple] shows an example use of
+  partial application and the corresponding architecture.
  
-  \starthaskell
-  -- | Multiply the input word by four.
-  quadruple :: Word -> Word
-  quadruple n = mul (mul n)
-    where
-      mul = (*) 2
-  \stophaskell
+\startbuffer[Quadruple]
+-- | Multiply the input word by four.
+quadruple :: Word -> Word
+quadruple n = mul (mul n)
+  where
+    mul = (*) 2
+\stopbuffer
+
+  \startuseMPgraphic{Quadruple}
+    save in, two, mula, mulb, out;
+
+    % I/O ports
+    newCircle.in(btex $n$ etex) "framed(false)";
+    newCircle.two(btex $2$ etex) "framed(false)";
+    newCircle.out(btex $out$ etex) "framed(false)";
+
+    % Components
+    newCircle.mula(btex $\times$ etex);
+    newCircle.mulb(btex $\times$ etex);
  
-  It should be clear that the above code describes the following hardware:
+    two.c    = origin;
+    in.c     = two.c + (0cm, 1cm);
+    mula.c  = in.c + (2cm, 0cm);
+    mulb.c  = mula.c + (2cm, 0cm);
+    out.c   = mulb.c + (2cm, 0cm);
  
-  TODO: Pretty picture
+    % Draw objects and lines
+    drawObj(in, two, mula, mulb, out);
+
+    nccurve(two)(mula) "angleA(0)", "angleB(45)";
+    nccurve(two)(mulb) "angleA(0)", "angleB(45)";
+    ncline(in)(mula);
+    ncline(mula)(mulb);
+    ncline(mulb)(out);
+  \stopuseMPgraphic
+
+  \placeexample[here][ex:Quadruple]{Simple three port and.}
+    \startcombination[2*1]
+      {\typebufferhs{Quadruple}}{Haskell description using function applications.}
+      {\boxedgraphic{Quadruple}}{The architecture described by the Haskell description.}
+    \stopcombination
  
    Here, the definition of mul is a partial function application: It applies
    \hs{2 :: Word} to the function \hs{(*) :: Word -> Word -> Word} resulting in
@@ -76,6 +419,120 @@
    function boundaries), but eventually, the partial application will become
    completely applied.
  
+  \section{Costless specialization}
+    Each (complete) function application in our description generates a
+    component instantiation, or a specific piece of hardware in the final
+    design. It is interesting to note that each application of a function
+    generates a \emph{separate} piece of hardware. In the final design, none
+    of the hardware is shared between applications, even when the applied
+    function is the same.
+
+    This is distinctly different from normal program compilation: Two separate
+    calls to the same function share the same machine code. Having more
+    machine code has implications for speed (due to less efficient caching)
+    and memory usage. For normal compilation, it is therefore important to
+    keep the amount of functions limited and maximize the code sharing.
+
+    When generating hardware, this is hardly an issue. Having more \quote{code
+    sharing} does reduce the amount of \small{VHDL} output (Since different
+    component instantiations still share the same component), but after
+    synthesis, the amount of hardware generated is not affected.
+
+    In particular, if we would duplicate all functions so that there is a
+    duplicate for every application in the program (\eg, each function is then
+    only applied exactly once), there would be no increase in hardware size
+    whatsoever.
+   
+   TODO: Perhaps these next two sections are a bit too
+   implementation-oriented?
+
+    \subsection{Specialization}
+      Because of this, a common optimization technique called
+      \emph{specialization} is as good as free for hardware generation.  Given
+      some function that has a \emph{domain} of $D$ (\eg, the set of all
+      possible arguments that could be applied), we create a specialized
+      function with exactly the same behaviour, but with an domain of $D'
+      \subset D$. This subset can be derived in all sort of ways, but commonly
+      this is done by limiting a polymorphic argument to a single type (\eg,
+      removing polymorphism) or by limiting an argument to just a single value
+      (\eg, cross-function constant propagation, effectively removing the
+      argument). 
+      
+      Since we limit the argument domain of the specialized function, its
+      definition can often be optimized further (since now more types or even
+      values of arguments are already know). By replacing any application of
+      the function that falls within the reduced domain by an application of
+      the specialized version, the code gets faster (but the code also gets
+      bigger, since we now have two versions instead of one!). If we apply
+      this technique often enough, we can often replace all applications of a
+      function by specialized versions, allowing the original function to be
+      removed (in some cases, this can even give a net reduction of the code
+      compared to the non-specialized version).
+
+      Specialization is useful for our hardware descriptions for functions
+      that contain arguments that cannot be translated to hardware directly
+      (polymorphic or higher order arguments, for example). If we can create
+      specialized functions that remove the argument, or make it translatable,
+      we can use specialization to make the original, untranslatable, function
+      obsolete.
+
+  \section{Higher order values}
+    What holds for partial application, can be easily generalized to any
+    higher order expression. This includes partial applications, plain
+    variables (e.g., a binder referring to a top level function), lambda
+    expressions and more complex expressions with a function type (a case
+    expression returning lambda's, for example).
+
+    Each of these values cannot be directly represented in hardware (just like
+    partial applications). Also, to make them representable, they need to be
+    applied: function variables and partial applications will then eventually
+    become complete applications, applied lambda expressions disappear by
+    applying β-reduction, etc.
+
+    So any higher order value will be \quote{pushed down} towards its
+    application just like partial applications. Whenever a function boundary
+    needs to be crossed, the called function can be specialized.
+  
+    TODO: This is section should be improved
+
+  \section{Polymorphism}
+    In Haskell, values can be polymorphic: They can have multiple types. For
+    example, the function \hs{fst :: (a, b) -> a} is an example of a
+    polymorphic function: It works for tuples with any element types. Haskell
+    typeclasses allow a function to work on a specific set of types, but the
+    general idea is the same.
+
+%    A type class is a collection of types for which some operations are
+%    defined. It is thus possible for a value to be polymorphic while having
+%    any number of \emph{class constraints}: The value is not defined for
+%    every type, but only for types in the type class. An example of this is
+%    the \hs{even :: (Integral a) => a -> Bool} function, which can map any
+%    value of a type that is member of the \hs{Integral} type class 
+
+    When generating hardware, polymorphism can't be easily translated. How
+    many wire will you lay down for a value that could have any type? When
+    type classes are involved, what hardware components will you lay down for
+    a class method (whose behaviour depends on the type of its arguments)?
+
+    Fortunately, we can again use the principle of specialization: Since every
+    function application generates separate pieces of hardware, we can know
+    the types of all arguments exactly. Provided that we don't use existential
+    typing, all of the polymorphic types in a function must depend on the
+    types of the arguments (In other words, the only way to introduce a type
+    variable is in a lambda abstraction). Our top level function must not have
+    a polymorphic type (otherwise we wouldn't know the hardware interface to
+    our top level function).
+
+    If a function is monomorphic, all values inside it are monomorphic as
+    well, so any function that is applied within the function can only be
+    applied to monomorphic values. The applied functions can then be
+    specialized to work just for these specific types, removing the
+    polymorphism from the applied functions as well.
+
+    By induction, this means that all functions that are (indirectly) called
+    by our top level function (meaning all functions that are translated in
+    the final hardware) become monomorphic.
+
    \section{State}
      A very important concept in hardware designs is \emph{state}. In a
      stateless (or, \emph{combinatoric}) design, every output is a directly and solely dependent on the
@@ -217,7 +674,8 @@ acc in = out
          In Haskell, this would look like \in{example}[ex:ExplicitAcc].
  
  \startbuffer[ExplicitAcc]
-acc :: Word -> (State Word) -> (State Word, Word)
+-- input -> current state -> (new state, output)
+acc :: Word -> Word -> (Word, Word)
  acc in (State s) = (State s', out)
    where
      out = s + in
@@ -237,282 +695,117 @@ acc in (State s) = (State s', out)
          looks the same from the outside, regardless of what state variables it
          uses (or wether it's stateful at all).
  
-        A direct consequence of this, is that if a function calls other
-        stateful functions (\eg, has subcircuits), it has to somehow know the
-        current state for these called functions. The only way to do this, is
-        to put these \emph{substates} inside the caller's state. This means
-        that a function's state is the sum of the states of all functions it
-        calls, and its own state.
-
          This approach is the one chosen for Cλash and will be examined more
          closely below.
  
      \subsection{Explicit state specification}
-      Note about semantic correctness of top level state.
-
-      Note about automatic ``down-pushing'' of state.
-
-      Note about explicit state specification as the best solution.
-
-      Note about substates
-
-      Note about conditions on state variables and checking them.
-
-    \subsection{Explicit state implementation}
-      Recording state variables at the type level.
-
-      Ideal: Type synonyms, since there is no additional code overhead for
-      packing and unpacking. Downside: there is no explicit conversion in Core
-      either, so type synonyms tend to get lost in expressions (they can be
-      preserved in binders, but this makes implementation harder, since that
-      statefulness of a value must be manually tracked).
-
-      Less ideal: Newtype. Requires explicit packing and unpacking of function
-      arguments. If you don't unpack substates, there is no overhead for
-      (un)packing substates. This will result in many nested State constructors
-      in a nested state type. \eg: 
-
-  \starttyping
-  State (State Bit, State (State Word, Bit), Word)
-  \stoptyping
-
-      Alternative: Provide different newtypes for input and output state. This
-      makes the code even more explicit, and typechecking can find even more
-      errors. However, this requires defining two type synomyms for each
-      stateful function instead of just one. \eg:
-  \starttyping
-  type AccumStateIn = StateIn Bit
-  type AccumStateOut = StateOut Bit
-  \stoptyping
-      This also increases the possibility of having different input and output
-      states. Checking for identical input and output state types is also
-      harder, since each element in the state must be unpacked and compared
-      separately.
-
-      Alternative: Provide a type for the entire result type of a stateful
-      function, not just the state part. \eg:
-
-  \starttyping
-  newtype Result state result = Result (state, result)
-  \stoptyping
-      
-      This makes it easy to say "Any stateful function must return a
-      \type{Result} type, without having to sort out result from state. However,
-      this either requires a second type for input state (similar to
-      \type{StateIn} / \type{StateOut} above), or requires the compiler to
-      select the right argument for input state by looking at types (which works
-      for complex states, but when that state has the same type as an argument,
-      things get ambiguous) or by selecting a fixed (\eg, the last) argument,
-      which might be limiting.
-
-      \subsubsection{Example}
-      As an example of the used approach, a simple averaging circuit, that lets
-      the accumulation of the inputs be done by a subcomponent.
-
-      \starttyping
-        newtype State s = State s
-
-        type AccumState = State Bit
-        accum :: Word -> AccumState -> (AccumState, Word)
-        accum i (State s) = (State (s + i), s + i)
-
-        type AvgState = (AccumState, Word)
-        avg :: Word -> AvgState -> (AvgState, Word)
-        avg i (State s) = (State s', o)
-          where
-            (accums, count) = s
-            -- Pass our input through the accumulator, which outputs a sum
-            (accums', sum) = accum i accums
-            -- Increment the count (which will be our new state)
-            count' = count + 1
-            -- Compute the average
-            o = sum / count'
-            s' = (accums', count')
-      \stoptyping
-
-      And the normalized, core-like versions:
-
-      \starttyping
-        accum i spacked = res
-          where
-            s = case spacked of (State s) -> s
-            s' = s + i
-            spacked' = State s'
-            o = s + i
-            res = (spacked', o)
-
-        avg i spacked = res
-          where
-            s = case spacked of (State s) -> s
-            accums = case s of (accums, \_) -> accums
-            count = case s of (\_, count) -> count
-            accumres = accum i accums
-            accums' = case accumres of (accums', \_) -> accums'
-            sum = case accumres of (\_, sum) -> sum
-            count' = count + 1
-            o = sum / count'
-            s' = (accums', count')
-            spacked' = State s'
-            res = (spacked', o)
-      \stoptyping
-
-
-
-      As noted above, any component of a function's state that is a substate,
-      \eg passed on as the state of another function, should have no influence
-      on the hardware generated for the calling function. Any state-specific
-      \small{VHDL} for this component can be generated entirely within the called
-      function. So,we can completely leave out substates from any function.
-      
-      From this observation, we might think to remove the substates from a
-      function's states alltogether, and leave only the state components which
-      are actual states of the current function. While doing this would not
-      remove any information needed to generate \small{VHDL} from the function, it would
-      cause the function definition to become invalid (since we won't have any
-      substate to pass to the functions anymore). We could solve the syntactic
-      problems by passing \type{undefined} for state variables, but that would
-      still break the code on the semantic level (\ie, the function would no
-      longer be semantically equivalent to the original input).
-
-      To keep the function definition correct until the very end of the process,
-      we will not deal with (sub)states until we get to the \small{VHDL} generation.
-      Here, we are translating from Core to \small{VHDL}, and we can simply not generate
-      \small{VHDL} for substates, effectively removing the substate components
-      alltogether.
-
-      There are a few important points when ignore substates.
-
-      First, we have to have some definition of "substate". Since any state
-      argument or return value that represents state must be of the \type{State}
-      type, we can simply look at its type. However, we must be careful to
-      ignore only {\em substates}, and not a function's own state.
-
-      In the example above, this means we should remove \type{accums'} from
-      \type{s'}, but not throw away \type{s'} entirely. We should, however,
-      remove \type{s'} from the output port of the function, since the state
-      will be handled by a \small{VHDL} procedure within the function.
-
-      When looking at substates, these can appear in two places: As part of an
-      argument and as part of a return value. As noted above, these substates
-      can only be used in very specific ways.
-
-      \desc{State variables can appear as an argument.} When generating \small{VHDL}, we
-      completely ignore the argument and generate no input port for it.
-
-      \desc{State variables can be extracted from other state variables.} When
-      extracting a state variable from another state variable, this always means
-      we're extracting a substate, which we can ignore. So, we simply generate no
-      \small{VHDL} for any extraction operation that has a state variable as a result.
-
-      \desc{State variables can be passed to functions.} When passing a
-      state variable to a function, this always means we're passing a substate
-      to a subcomponent. The entire argument can simply be ingored in the
-      resulting port map.
-
-      \desc{State variables can be returned from functions.} When returning a
-      state variable from a function (probably as a part of an algebraic
-      datatype), this always mean we're returning a substate from a
-      subcomponent. The entire state variable should be ignored in the resulting
-      port map. The type binder of the binder that the function call is bound
-      to should not include the state type either.
-
-      \startdesc{State variables can be inserted into other variables.} When inserting
-      a state variable into another variable (usually by constructing that new
-      variable using its constructor), we can identify two cases: 
-
-      \startitemize
-        \item The state is inserted into another state variable. In this case,
-        the inserted state is a substate, and can be safely left out of the
-        constructed variable.
-        \item The state is inserted into a non-state variable. This happens when
-        building up the return value of a function, where you put state and
-        retsult variables together in an algebraic type (usually a tuple). In
-        this case, we should leave the state variable out as well, since we
-        don't want it to be included as an output port.
-      \stopitemize
-
-      So, in both cases, we can simply leave out the state variable from the
-      resulting value. In the latter case, however, we should generate a state
-      proc instead, which assigns the state variable to the input state variable
-      at each clock tick.
-      \stopdesc
-      
-      \desc{State variables can appear as (part of) a function result.} When
-      generating \small{VHDL}, we can completely ignore any part of a function result
-      that has a state type. If the entire result is a state type, this will
-      mean the entity will not have an output port. Otherwise, the state
-      elements will be removed from the type of the output port.
-
-
-      Now, we know how to handle each use of a state variable separately. If we
-      look at the whole, we can conclude the following:
-
-      \startitemize
-      \item A state unpack operation should not generate any \small{VHDL}. The binder
-      to which the unpacked state is bound should still be declared, this signal
-      will become the register and will hold the current state.
-      \item A state pack operation should not generate any \small{VHDL}. The binder th
-      which the packed state is bound should not be declared. The binder that is
-      packed is the signal that will hold the new state.
-      \item Any values of a State type should not be translated to \small{VHDL}. In
-      particular, State elements should be removed from tuples (and other
-      datatypes) and arguments with a state type should not generate ports.
-      \item To make the state actually work, a simple \small{VHDL} proc should be
-      generated. This proc updates the state at every clockcycle, by assigning
-      the new state to the current state. This will be recognized by synthesis
-      tools as a register specification.
-      \stopitemize
-
-
-      When applying these rules to the example program (in normal form), we will
-      get the following result. All the parts that don't generate any value are
-      crossed out, leaving some very boring assignments here and there.
-      
+      We've seen the concept of explicit state in a simple example below, but
+      what are the implications of this approach?
+
+      \subsubsection{Substates}
+        Since a function's state is reflected directly in its type signature,
+        if a function calls other stateful functions (\eg, has subcircuits) it
+        has to somehow know the current state for these called functions. The
+        only way to do this, is to put these \emph{substates} inside the
+        caller's state. This means that a function's state is the sum of the
+        states of all functions it calls, and its own state.
+
+        This also means that the type of a function (at least the "state"
+        part) is dependent on its implementation and the functions it calls.
+        This is the major downside of this approach: The separation between
+        interface and implementation is limited. However, since Cλash is not
+        very suitable for separate compilation (see
+        \in{section}[sec:prototype:separate]) this is not a big problem in
+        practice. Additionally, when using a type synonym for the state type
+        of each function, we can still provide explicit type signatures
+        while keeping the state specification for a function near its
+        definition only.
      
-  \starthaskell
-    avg i --spacked-- = res
-      where
-        s = --case spacked of (State s) -> s--
-        --accums = case s of (accums, \_) -> accums--
-        count = case s of (--\_,-- count) -> count
-        accumres = accum i --accums--
-        --accums' = case accumres of (accums', \_) -> accums'--
-        sum = case accumres of (--\_,-- sum) -> sum
-        count' = count + 1
-        o = sum / count'
-        s' = (--accums',-- count')
-        --spacked' = State s'--
-        res = (--spacked',-- o)
-  \stophaskell
-          
-      When we would really leave out the crossed out parts, we get a slightly
-      weird program: There is a variable \type{s} which has no value, and there
-      is a variable \type{s'} that is never used. Together, these two will form
-      the state proc of the function. \type{s} contains the "current" state,
-      \type{s'} is assigned the "next" state. So, at the end of each clock
-      cycle, \type{s'} should be assigned to \type{s}.
-
-      Note that the definition of \type{s'} is not removed, even though one
-      might think it as having a state type. Since the state type has a single
-      argument constructor \type{State}, some type that should be the resulting
-      state should always be explicitly packed with the State constructor,
-      allowing us to remove the packed version, but still generate \small{VHDL} for the
-      unpacked version (of course with any substates removed).
-      
-      As you can see, the definition of \type{s'} is still present, since it
-      does not have a state type (The State constructor. The \type{accums'} substate has been removed,
-      leaving us just with the state of \type{avg} itself.
-    \subsection{Initial state}
-      How to specify the initial state? Cannot be done inside a hardware
-      function, since the initial state is its own state argument for the first
-      call (unless you add an explicit, synchronous reset port).
-
-      External init state is natural for simulation.
+      \subsubsection{...}
+        We need some way to know which arguments should become input ports and
+        which argument(s?) should become the current state (\eg, be bound to
+        the register outputs). This does not hold holds not just for the top
+        level function, but also for any subfunctions. Or could we perhaps
+        deduce the statefulness of subfunctions by analyzing the flow of data
+        in the calling functions?
+
+        To explore this matter, we make an interesting observation: We get
+        completely correct behaviour when we put all state registers in the
+        top level entity (or even outside of it). All of the state arguments
+        and results on subfunctions are treated as normal input and output
+        ports. Effectively, a stateful function results in a stateless
+        hardware component that has one of its input ports connected to the
+        output of a register and one of its output ports connected to the
+        input of the same register.
+
+        TODO: Example?
+
+        Of course, even though the hardware described like this has the
+        correct behaviour, unless the layout tool does smart optimizations,
+        there will be a lot of extra wire in the design (since registers will
+        not be close to the component that uses them). Also, when working with
+        the generated \small{VHDL} code, there will be a lot of extra ports
+        just to pass one state values, which can get quite confusing.
+
+        To fix this, we can simply \quote{push} the registers down into the
+        subcircuits. When we see a register that is connected directly to a
+        subcircuit, we remove the corresponding input and output port and put
+        the register inside the subcircuit instead. This is slightly less
+        trivial when looking at the Haskell code instead of the resulting
+        circuit, but the idea is still the same.
+
+        TODO: Example?
+
+        However, when applying this technique, we might push registers down
+        too far. When you intend to store a result of a stateless subfunction
+        in the caller's state and pass the current value of that state
+        variable to that same function, the register might get pushed down too
+        far. It is impossible to distinguish this case from similar code where
+        the called function is in fact stateful. From this we can conclude
+        that we have to either:
+
+        \startitemize
+        \item accept that the generated hardware might not be exactly what we
+        intended, in some specific cases. In most cases, the hardware will be
+        what we intended.
+        \item explicitely annotate state arguments and results in the input
+        description.
+        \stopitemize
+
+        The first option causes (non-obvious) exceptions in the language
+        intepretation. Also, automatically determining where registers should
+        end up is easier to implement correctly with explicit annotations, so
+        for these reasons we will look at how this annotations could work.
+
+
+      TODO: Note about conditions on state variables and checking them.
+
+    \subsection{Explicit state annotation}
+      To make our stateful descriptions unambigious and easier to translate,
+      we need some way for the developer to describe which arguments and
+      results are intended to become stateful.
+    
+      Roughly, we have two ways to achieve this:
+      \startitemize[KR]
+        \item Use some kind of annotation method or syntactic construction in
+        the language to indicate exactly which argument and (part of the)
+        result is stateful. This means that the annotation lives
+        \quote{outside} of the function, it is completely invisible when
+        looking at the function body.
+        \item Use some kind of annotation on the type level, \eg give stateful
+        arguments and (part of) results a different type. This has the
+        potential to make this annotation visible inside the function as well,
+        such that when looking at a value inside the function body you can
+        tell if it's stateful by looking at its type. This could possibly make
+        the translation process a lot easier, since less analysis of the
+        program flow might be required.
+      \stopitemize
  
-      External init state works for hardware generation as well.
+      From these approaches, the type level \quote{annotations} have been
+      implemented in Cλash. \in{Section}[sec:prototype:statetype] expands on
+      the possible ways this could have been implemented.
  
-      Implementation issues: state splitting, linking input to output state,
-      checking usage constraints on state variables.
+  TODO: Say something about dependent types and fixed size vectors
  
    \section[sec:recursion]{Recursion}
    An import concept in functional languages is recursion. In it's most basic