From: Matthijs Kooijman <matthijs@stdin.nl>
Date: Mon, 28 Jul 2008 08:05:11 +0000 (+0200)
Subject: Add something about tradeoffs and pipelining in the new hardware.
X-Git-Tag: Report-final~26
X-Git-Url: https://git.stderr.nl/gitweb?p=matthijs%2Fprojects%2Finternship.git;a=commitdiff_plain;h=d416e9f0ac0aafb798efa56f797d80cb3fb3f5a5

Add something about tradeoffs and pipelining in the new hardware.
---

diff --git a/Report/Main/Problems/Challenges.tex b/Report/Main/Problems/Challenges.tex
index 531e580..927e6e9 100644
--- a/Report/Main/Problems/Challenges.tex
+++ b/Report/Main/Problems/Challenges.tex
@@ -199,7 +199,42 @@ big register files are, etc.
 
 An important reason to be flexible is for programmability. If the hardware is
 regular, making a compiler that produces optimal code gets a lot easier.
-
+On the other hand, the compiler also limits flexibility. If the hardware
+has flexibility that the compiler will never use, it's better to save
+area and complexity by making the hardware less flexible.
+
+When trying to improve runtime performance, the main focus is on
+optimizing loops, and inner loops (loops that contain no other loops) in
+particular. Since the inner loop is executed the most, it is the most
+efficient to optimize the inner loop. Also, the inner loop is also the
+piece of code that can most optimally use the parellel processing power
+of the Montium, because it can be software pipelined. 
+
+This means that the compiler will emit code that performs operations
+that belong into different iterations of the original loop in the same
+cycle. Since data dependencies within a loop body usually severely limit
+the amount of operations that can be done in parallel, pipelining allows
+the second (and more) iteration to start well before the first iteration
+is done. This is done by dividing the loop body in a number of stages,
+that would normally be executed sequentially. These stages are then
+executed in parallel, but for different iterations (ie, run stage 2 of
+iteration i, while running stage 1 of iteration i+1).
+
+This approach allows a loop to be ran a lot faster than executing a
+single iteration at a time. However, since the instructions for the
+first and last few iterations (the prologue and epilogue) are distinctly
+different from the loop "kernel", the number of instructions needed for
+a pipelined loop can easily increase a lot.
+
+However, all pipelined loops share a very distinct structure (first
+stage 1, then stage 1+2, then stage 1+2+3, etc, then all stages at the
+same time, similar for the epilogue). Also, every instruction in the
+prologue and epilogue are a strict subset of the instructions in the
+kernel. By adding some hardware support for exactly this structure, the
+code size increase for the prologue and epilogue can be effectively
+reduced to a fixed number of instructions.
+
+Performance - inner loops
 Code compression.
 
 I'll also