From: Matthijs Kooijman
Date: Mon, 28 Jul 2008 08:05:11 +0000 (+0200)
Subject: Add something about tradeoffs and pipelining in the new hardware.
X-Git-Tag: Report-final~26
X-Git-Url: https://git.stderr.nl/gitweb?p=matthijs%2Fprojects%2Finternship.git;a=commitdiff_plain;h=d416e9f0ac0aafb798efa56f797d80cb3fb3f5a5

Add something about tradeoffs and pipelining in the new hardware.
---

diff --git a/Report/Main/Problems/Challenges.tex b/Report/Main/Problems/Challenges.tex
index 531e580..927e6e9 100644
--- a/Report/Main/Problems/Challenges.tex
+++ b/Report/Main/Problems/Challenges.tex
@@ -199,7 +199,42 @@ big register files are, etc.
 An important reason to be flexible is for programmability. If the
 hardware is regular, making a compiler that produces optimal code gets a
 lot easier.
-
+On the other hand, the compiler also limits flexibility. If the hardware
+has flexibility that the compiler will never use, it's better to save
+area and complexity by making the hardware less flexible.
+
+When trying to improve runtime performance, the main focus is on
+optimizing loops, and inner loops (loops that contain no other loops) in
+particular. Since the inner loop is executed the most, optimizing it
+gives the largest gain. The inner loop is also the piece of code that
+can make the best use of the parallel processing power of the Montium,
+because it can be software pipelined.
+
+This means that the compiler will emit code that performs operations
+that belong to different iterations of the original loop in the same
+cycle. Since data dependencies within a loop body usually severely limit
+the number of operations that can be done in parallel, pipelining allows
+the second (and later) iterations to start well before the first
+iteration is done. This is done by dividing the loop body into a number
+of stages that would normally be executed sequentially. These stages are
+then executed in parallel, but for different iterations (i.e., run stage
+2 of iteration i while running stage 1 of iteration i+1).
+
+This approach allows a loop to be run a lot faster than executing a
+single iteration at a time. However, since the instructions for the
+first and last few iterations (the prologue and epilogue) are distinctly
+different from the loop "kernel", the number of instructions needed for
+a pipelined loop can easily increase a lot.
+
+Fortunately, all pipelined loops share a very distinct structure (first
+stage 1, then stage 1+2, then stage 1+2+3, etc., then all stages at the
+same time, and similarly for the epilogue). Also, the instructions in
+the prologue and epilogue are a strict subset of the instructions in the
+kernel. By adding some hardware support for exactly this structure, the
+code size increase for the prologue and epilogue can be effectively
+reduced to a fixed number of instructions.
+
+Performance - inner loops
 Code compression.
 I'll also