X-Git-Url: https://git.stderr.nl/gitweb?a=blobdiff_plain;f=Report%2FMain%2FProblems%2FChallenges.tex;h=b6b9f5a7ba29fbfc770631ad4ec356b9c370f410;hb=3f443477bdf1f422b2ec86cd03b7577bac06b1f8;hp=531e580284f9136cd75de713f0a0c8794bd28cb3;hpb=52b588775e8f16a4a29c71762497ae2eef946e01;p=matthijs%2Fprojects%2Finternship.git
diff --git a/Report/Main/Problems/Challenges.tex b/Report/Main/Problems/Challenges.tex
index 531e580..b6b9f5a 100644
--- a/Report/Main/Problems/Challenges.tex
+++ b/Report/Main/Problems/Challenges.tex
@@ -180,7 +180,7 @@ Recore-specific pass was added to solve these problems. In a few more cases, the
 problems are still unresolved, effectively resulting in additional constraints
 on the MontiumC language.
 
-\subsection{Pipelined scheduling}
+\subsection{New hardware design}
 \label{Pipelining}
 I've also been involved for a bit with the instruction scheduling algorithm
 required for the new (pipelined) hardware design. Even though this is completely
@@ -192,6 +192,7 @@ with thinking about the scheduler and hardware design.
 I will not go into much detail about the new hardware and its scheduler here,
 but I will highlight the most important challenges and tradeoffs.
 
+\subsubsection{Tradeoffs}
 In general, the most important tradeoff seems to be between flexibility and
 everything else (code size, performance, complexity). This flexibility is
 expressed in which instructions are possible, which connections are present, how
@@ -199,10 +200,75 @@ big register files are, etc.
 
 An important reason to be flexible is for programmability. If the hardware is
 regular, making a compiler that produces optimal code gets a lot easier.
-
-Code compression.
-
-I'll also
-spend a few words on the observed merits of hardware/software codesign.
-
-
+On the other hand, the compiler also limits flexibility. If the hardware
+has flexibility that the compiler will never use, it's better to save
+area and complexity by making the hardware less flexible. 
Exactly for this
+reason, it is important to develop hardware and supporting software in
+parallel, instead of the hardware-first, software-later approach used
+with the initial Montium. This allows for a much better balanced and
+more usable design, without any unused extras.
+
+\subsubsection{Inner loop pipelining}
+When trying to improve runtime performance, the main focus is on
+optimizing loops, and inner loops (loops that contain no other loops) in
+particular. Since the inner loop is executed most often, optimizing it
+yields the largest gain. The inner loop is also the piece of code that
+can best exploit the parallel processing power of the Montium, because
+it can be software pipelined.
+
+This means that the compiler will emit code that performs operations
+belonging to different iterations of the original loop in the same
+cycle. Since data dependencies within a loop body usually severely limit
+the number of operations that can be done in parallel, pipelining allows
+the second (and subsequent) iterations to start well before the first
+iteration is done. This is done by dividing the loop body into a number
+of stages that would normally be executed sequentially. These stages are
+then executed in parallel, but for different iterations (i.e., stage 2
+of iteration i runs while stage 1 of iteration i+1 runs).
+
+This approach allows a loop to be run a lot faster than executing a
+single iteration at a time. However, since the instructions for the
+first and last few iterations (the prologue and epilogue) are distinctly
+different from the loop ``kernel'', the number of instructions needed
+for a pipelined loop can easily increase a lot.
+
+However, all pipelined loops share a very distinct structure (first
+stage 1, then stages 1+2, then stages 1+2+3, etc., until all stages run
+at the same time, and similarly for the epilogue). Also, the
+instructions in the prologue and epilogue are a strict subset of those
+in the
+kernel. 
By adding some hardware support for exactly this structure, the
+code size increase for the prologue and epilogue can be effectively
+reduced to a fixed number of instructions.
+
+The tradeoff here is that this hardware is only usable for exactly these
+inner loops; any other code will leave this extra hardware unused. However,
+since pipelined inner loops appear to be very common, this should not be a
+problem at all.
+
+\subsubsection{Code compression}
+Another very important tradeoff concerns code size. In the old hardware, a lot
+of flexibility in the original code was achieved by using inline functions
+(since the hardware has only very limited support for function calls and thus
+code reuse). This resulted in a lot of code duplication, which was compensated
+for by using two-level configuration registers (which still need a lot of
+sequencer instructions, but those are a lot smaller than the full
+instruction).
+
+On the new hardware, however, function calls are more powerful, which should
+lead to a lot less code duplication. For this reason, putting every instruction
+in configuration registers might actually take more space instead of less. It
+should be noted that the configuration registers are essentially a
+compiler-controlled cache that is mandatory and static (instructions must be
+in the cache and the cache cannot be modified at runtime). By lifting both of
+these limitations, we get a cache that is a lot more flexible. Additionally,
+this has the advantage that the size of the cache is no longer an upper bound
+on the program size; only the instruction memory, which can be a lot bigger,
+is.
+
+The tradeoff here is that the sequencer instructions will get a lot bigger,
+since they need to contain a full instruction word (which, in the old design,
+would be preloaded into the CRs) that can be up to a few hundred bits long. 
+Since not all sequencer instructions need to be this long (instructions
+that execute out of the cache can be a lot shorter), different
+instruction lengths should be supported. In addition, smart compression
+techniques should be applied to those long instruction words, though
+that can only be done once there is more application code to gather
+statistics on.