Add future work chapter.

[matthijs/projects/internship.git] / Report / Main / Problems / Challenges.tex
diff --git a/Report/Main/Problems/Challenges.tex b/Report/Main/Problems/Challenges.tex

index 927e6e9d74e055318a3b35a9196825c03715f674..06bd72c81ea6b46321b91c6efc8f8b41ed856b16 100644 (file)
--- a/Report/Main/Problems/Challenges.tex
+++ b/Report/Main/Problems/Challenges.tex
@@ -26,15 +26,16 @@ set a baseline for the compiler: A point to start improving the transformations
  from. Apart from that, this initial specification is also useful in reviewing
  existing MontiumC code and verifying existing transformations.
  
  from. Apart from that, this initial specification is also useful in reviewing
  existing MontiumC code and verifying existing transformations.
  
-Existing MontiumC code is supported by the compiler (since it has been tweaked
-until it worked). However, the existing code might not fully work within the
-produced specification. This can happen in particular when existing code makes
-use of corner cases in the compiler that have been missed in (or left out of)
-the specification, because they will not work reliably in all cases.
+Existing MontiumC code is of course supported by the compiler (since it has been
+tweaked until it worked). However, the existing code might not fully work within
+the produced specification. This can happen in particular when existing code
+makes use of corner cases in the compiler that have been missed in (or left out
+of) the specification, because they will not work reliably in all cases.
  
  The best way to detect these cases is making the compiler check its input using
  an the specification. This way, any code operating outside of the specification
  
  The best way to detect these cases is making the compiler check its input using
  an the specification. This way, any code operating outside of the specification
-can be detected automatically.
+can be detected automatically. Writing such checks has not happened yet, mainly
+because the impact of the new hardware on MontiumC is not quite clear yet.
  
  Existing transformations, on the other hand, might miss a few corner cases. When
  writing the specification, a particular feature might appear supported by the
  
  Existing transformations, on the other hand, might miss a few corner cases. When
  writing the specification, a particular feature might appear supported by the
@@ -44,7 +45,8 @@ cases, which either need to be fixed or put into the specification.
  The best way to detect these cases is writing a lot of structured testing code,
  which uses (combinations of) the features that are supported according to the
  specification. This way, corner cases exposed by the testing code can be
  The best way to detect these cases is writing a lot of structured testing code,
  which uses (combinations of) the features that are supported according to the
  specification. This way, corner cases exposed by the testing code can be
-detected automatically.
+detected automatically. A framework for this testing has been set up and
+partially filled with small tests.
  
  Building this initial specification did pose a number of challenges. Since
  simply trying all possible C features to see if they are accepted by the
  
  Building this initial specification did pose a number of challenges. Since
  simply trying all possible C features to see if they are accepted by the
@@ -75,9 +77,9 @@ Not only that, it is simply not possible to map all valid C programs to the
  Montium hardware. "Regular" architectures don't suffer from this problem (as
  much), since most instruction sets are not fundamentally different from the
  features supported by C (or other imperative languages). This means that
  Montium hardware. "Regular" architectures don't suffer from this problem (as
  much), since most instruction sets are not fundamentally different from the
  features supported by C (or other imperative languages). This means that
-anything is mappable, but with a simple compiler will not result in the most
-efficient code. In the Montium case, a lot of things simply cannot be mapped on
-the hardware at all.
+anything is mappable, but with a simple compiler this will not result in the
+most efficient code. In the Montium case, a lot of things simply cannot be
+mapped on the hardware at all.
  
  Considering that our ideal is not reachable (by far), every feature in our
  wanted MontiumC should be evaluated thoroughly for feasibility, both in hardware
  
  Considering that our ideal is not reachable (by far), every feature in our
  wanted MontiumC should be evaluated thoroughly for feasibility, both in hardware
@@ -87,8 +89,8 @@ being succesfully implemented. This is conforming to the incremental development
  of MontiumC that was envisioned at the outset of its development.
  
  \subsection{Familiarizing with LLVM}
  of MontiumC that was envisioned at the outset of its development.
  
  \subsection{Familiarizing with LLVM}
-Since during my internship I have been working mainly with LLVM (code), one of
-the first challenges was to get myself familiar with LLVM. There are two main
+Since the frontend heavily relies on the LLVM project for doing it's work, one
+of the first challenges was to get myself familiar with LLVM. There are two main
  aspects to this: Getting to know my way around the LLVM codebase and getting to
  know the LLVM community.
  
  aspects to this: Getting to know my way around the LLVM codebase and getting to
  know the LLVM community.
  
@@ -107,13 +109,13 @@ the Montium architecture. Eventually, this code didn't turn out to be
  immediately useful (it's still not included currently), but it proved very
  insightful as to how the LLVM framework is built and what its possibilities are.
  
  immediately useful (it's still not included currently), but it proved very
  insightful as to how the LLVM framework is built and what its possibilities are.
  
-Additionally, during my working with the code in this stage I also produced a
-number of patches for LLVM, containing bugfixes, some cleanup and documentation
-improvements. Since the best way to integrate with any open source project seems
-to be contributing code, I was giving commit access to the LLVM tree not long
-thereafter. This access has proved very useful during the rest of the
-internship, since it was now a a lot easier to make (simple) changes to the LLVM
-framework to better suit the needs of Recore.
+Additionally, during my working with the code in this internship I also produced
+a number of patches for LLVM, containing bugfixes, some cleanup and
+documentation improvements. Since the best way to integrate with any open source
+project seems to be contributing code, I was giving commit access to the LLVM
+tree not long thereafter. This access has proved very useful during the rest of
+the internship, since it was now a a lot easier to make (simple) changes to the
+LLVM framework to better suit the needs of Recore.
  
  A major challenge during my internship was to find the balance between doing
  work specifically for Recore, and doing work that is useful for LLVM in general.
  
  A major challenge during my internship was to find the balance between doing
  work specifically for Recore, and doing work that is useful for LLVM in general.
@@ -134,7 +136,8 @@ project.
  
  \subsection{Coding}
  This section says something about the challenges encountered while actually
  
  \subsection{Coding}
  This section says something about the challenges encountered while actually
-writing code, if I can think of enough things to say here.
+writing code, if I can think of enough things to say here. So far, haven't
+thought of anything particularly interesting yet.
  
  \subsection{Working together}
  Since the compiler plays a fairly central role in the development process at
  
  \subsection{Working together}
  Since the compiler plays a fairly central role in the development process at
@@ -145,7 +148,7 @@ often providing bug reports and other feedback, which ensures regular contact.
  
  On the other end, I have been in contact with the developer of the backend very
  intensively, since most changes made to either component needed changes in the
  
  On the other end, I have been in contact with the developer of the backend very
  intensively, since most changes made to either component needed changes in the
-other one as well. So compiler changes also require hardware support, so working
+other one as well. Compiler changes also require hardware support, so working
  with the hardware developers was not uncommon either. In practice, most of this
  communication went through the backend developer, except for the design
  discussion concerning the new Montium hardware design (also see section
  with the hardware developers was not uncommon either. In practice, most of this
  communication went through the backend developer, except for the design
  discussion concerning the new Montium hardware design (also see section
@@ -161,9 +164,9 @@ in an efficient way.
  \label{StayingGeneric}
  The toughest challenge by far was to fit the features and changes required by
  Recore into code that is maintained by the LLVM project. The main problem here
  \label{StayingGeneric}
  The toughest challenge by far was to fit the features and changes required by
  Recore into code that is maintained by the LLVM project. The main problem here
-is clash of goals: The LLVM project aims for code that performs well on their
+is clash of goals: The LLVM project aims for code that performs well for their
  supported architectures, while Recore aims for code that performs well (or more
  supported architectures, while Recore aims for code that performs well (or more
-often, can compile at all) on the Montium architecture.
+often, can compile MontiumC at all) for the Montium architecture.
  
  In general, these problems mainly occur because MontiumC and in particular
  MontiumIR poses a number of constraints that are not present in other
  
  In general, these problems mainly occur because MontiumC and in particular
  MontiumIR poses a number of constraints that are not present in other
@@ -176,49 +179,56 @@ that the LLVM code can handle architectures with or without the constraint.
  In a lot of cases, however, the constraint is so Montium specific that changing
  LLVM to support it is not feasible. In a few cases, this meant that the backend
  was changed to support more features of the LLVM IR. In other cases, a
  In a lot of cases, however, the constraint is so Montium specific that changing
  LLVM to support it is not feasible. In a few cases, this meant that the backend
  was changed to support more features of the LLVM IR. In other cases, a
-Recore-specific pass was added to solve these problems. In a few more cases, the
-problems are still unresolved, effectively resulting in additional constraints
-on the MontiumC language.
+Recore-specific transformations was added to solve these problems. In a few more
+cases, the problems are still unresolved, effectively resulting in additional
+constraints on the MontiumC language.
  
  
-\subsection{Pipelined scheduling}
+\subsection{New hardware design}
  \label{Pipelining}
  I've also been involved for a bit with the instruction scheduling algorithm
  \label{Pipelining}
  I've also been involved for a bit with the instruction scheduling algorithm
-required for the new (pipelined) hardware design. Even though this is completely
-outside of the area of my assignment, the initial prototype of that scheduler
-was created using LLVM by someone else, so I have been assisting him with that.
-Initially mostly helping out with hints on LLVM coding, but later also
-with thinking about the scheduler and hardware design.
+required for the new (pipelined) hardware design and the hardware design itself.
+Even though this is completely outside of the area of my assignment, the initial
+prototype of that scheduler was created by someone else using LLVM. Because of
+my experience with LLVM, I have been assisting him with that.  Initially mostly
+helping out with hints on LLVM coding, but later also with thinking about the
+scheduler and hardware design.
  
  I will not go into much detail about the new hardware and its scheduler here,
  but I will highlight the most important challenges and tradeoffs.
  
  
  I will not go into much detail about the new hardware and its scheduler here,
  but I will highlight the most important challenges and tradeoffs.
  
+\subsubsection{Tradeoffs}
  In general, the most important tradeoff seems to be between flexibility and
  In general, the most important tradeoff seems to be between flexibility and
-everything else (code size, performance, complexity). This flexibility is
-expressed in which instructions are possible, which connections are present, how
-big register files are, etc.
+everything else (code size, performance, complexity, hardware area). This
+flexibility is expressed in which instructions are possible, which connections
+are present, how big register files are, etc.
  
  An important reason to be flexible is for programmability. If the hardware is
  regular, making a compiler that produces optimal code gets a lot easier.
  On the other hand, the compiler also limits flexibility. If the hardware
  has flexibility that the compiler will never use, it's better to save
  
  An important reason to be flexible is for programmability. If the hardware is
  regular, making a compiler that produces optimal code gets a lot easier.
  On the other hand, the compiler also limits flexibility. If the hardware
  has flexibility that the compiler will never use, it's better to save
-area and complexity by making the hardware less flexible.
+area and complexity by making the hardware less flexible. Exactly for this
+reason, it is important to develop hardware and supporting software in parallel,
+instead of using the hardware first, software later approach used with the
+initial Montium (TODO: Is this true?). This allows for a much better balanced
+and usable design, without any unused extras.
  
  
+\subsubsection{Inner loop pipelining}
  When trying to improve runtime performance, the main focus is on
  optimizing loops, and inner loops (loops that contain no other loops) in
  When trying to improve runtime performance, the main focus is on
  optimizing loops, and inner loops (loops that contain no other loops) in
-particular. Since the inner loop is executed the most, it is the most
-efficient to optimize the inner loop. Also, the inner loop is also the
-piece of code that can most optimally use the parellel processing power
-of the Montium, because it can be software pipelined. 
-
-This means that the compiler will emit code that performs operations
-that belong into different iterations of the original loop in the same
-cycle. Since data dependencies within a loop body usually severely limit
-the amount of operations that can be done in parallel, pipelining allows
-the second (and more) iteration to start well before the first iteration
-is done. This is done by dividing the loop body in a number of stages,
-that would normally be executed sequentially. These stages are then
-executed in parallel, but for different iterations (ie, run stage 2 of
-iteration i, while running stage 1 of iteration i+1).
+particular. Since the inner loop is executed the most often, it is the most
+efficient to optimize the inner loop. Also, the inner loop is also the piece of
+code that can most optimally use the parellel processing power of the Montium,
+because it can be software pipelined. 
+
+Software pipelining means that the compiler will emit code that performs
+operations that belong in different iterations of the original loop during the
+same cycle. Since data dependencies within a loop body usually severely limit
+the amount of operations that can be done in parallel, pipelining allows the
+second (and further) iteration to start well before the first iteration is done.
+This is done by dividing the loop body in a number of stages, that would
+normally be executed sequentially. These stages are then executed in parallel,
+but for different iterations (ie, run stage 2 of iteration i, while running
+stage 1 of iteration i+1).
  
  This approach allows a loop to be ran a lot faster than executing a
  single iteration at a time. However, since the instructions for the
  
  This approach allows a loop to be ran a lot faster than executing a
  single iteration at a time. However, since the instructions for the
@@ -232,12 +242,41 @@ same time, similar for the epilogue). Also, every instruction in the
  prologue and epilogue are a strict subset of the instructions in the
  kernel. By adding some hardware support for exactly this structure, the
  code size increase for the prologue and epilogue can be effectively
  prologue and epilogue are a strict subset of the instructions in the
  kernel. By adding some hardware support for exactly this structure, the
  code size increase for the prologue and epilogue can be effectively
-reduced to a fixed number of instructions.
-
-Performance - inner loops
-Code compression.
-
-I'll also
-spend a few words on the observed merits of hardware/software codesign.
-
-
+reduced to a fixed number of instructions (which take the number of stages as a
+parameter and uses preloaded instructions with explicit stage annotation).
+
+The tradeoff here is that this hardware is only usable specifically for these
+inner loops, any other code will leave this extra hardware unused. However,
+since pipelined inner loops appear to be very common, this should not be a
+problem at all.
+
+\subsubsection{Code compression}
+Another very important tradeoff concerns codesize. In the old hardware, a lot of
+flexibility in the original code was achieved by using inline functions (since
+the hardware has only very limited support for code reuse through function
+calls). This resulted in a lot of code duplication, which was compensated for by
+using two level configuration registers to be able to reuse (partial)
+instructions nonetheless (which will still need a lot of sequencer instructions,
+but those will be a lot smaller than the full instruction).
+
+On the new hardware, however, function calls are more powerful, which should
+lead to a lot less code duplication. For this reason, putting every instruction
+in configuration registers might actually take more space instead of less. It
+should be noted that, the configuration registers of the old Montium are
+effectively a compiler controlled cache that is mandatory and static
+(instructions must be in the cache and the cache cannot be modified at runtime).
+By lifting these limitations, we get a cache that is a lot more flexible.
+However, since it is compiler controlled (as opposed to letting the hardware
+decide what ends up in the cache), the performance of the code remains very
+predictable (unlike traditional caches). Additionally, this has the advantage
+that the size of the cache is no longer an upper bound on the program size, but
+only the instruction memory is (which can be a lot bigger).
+
+The tradeoff here is that the sequencer instructions will get a lot bigger,
+since they need to contain a full instruction word (which would be preloaded
+into the CRs in the old design) which can be up to a few hundred bits long.
+Since not all sequencer instructions need to be this long (executing out of the
+cache should be a lot shorter), different instruction lengths should be
+supported. Also, some smart compression techniques should also be applied to
+those long instruction words, though that can only be done once there is some
+more application code to gather statistics about.