X-Git-Url: https://git.stderr.nl/gitweb?a=blobdiff_plain;f=Report%2FMain%2FProblems%2FChallenges.tex;h=927e6e9d74e055318a3b35a9196825c03715f674;hb=d416e9f0ac0aafb798efa56f797d80cb3fb3f5a5;hp=13a8dd84a696842ef859e02f7e9667b5e5b6447d;hpb=a91774c5d265106f961c06ea192b2f585852fd85;p=matthijs%2Fprojects%2Finternship.git

diff --git a/Report/Main/Problems/Challenges.tex b/Report/Main/Problems/Challenges.tex
index 13a8dd8..927e6e9 100644
--- a/Report/Main/Problems/Challenges.tex
+++ b/Report/Main/Problems/Challenges.tex
@@ -190,5 +190,54 @@
 Initially mostly helping out with hints on LLVM coding, but later also with
 thinking about the scheduler and hardware design.
 
 I will not go into much detail about the new hardware and its scheduler here,
-but I will highlight the most important challenges and tradeoffs. I'll also
+but I will highlight the most important challenges and tradeoffs.
+
+In general, the most important tradeoff seems to be between flexibility and
+everything else (code size, performance, complexity). This flexibility is
+expressed in which instructions are possible, which connections are present,
+how big the register files are, etc.
+
+An important reason to be flexible is programmability. If the hardware is
+regular, building a compiler that produces optimal code becomes a lot easier.
+On the other hand, the compiler also limits how much flexibility is useful: if
+the hardware has flexibility that the compiler will never use, it is better to
+save area and complexity by making the hardware less flexible.
+
+When trying to improve runtime performance, the main focus is on optimizing
+loops, and inner loops (loops that contain no other loops) in particular.
+Since the inner loop is executed most often, optimizing it yields the largest
+gain. The inner loop is also the piece of code that can best exploit the
+parallel processing power of the Montium, because it can be software
+pipelined.
+
+This means that the compiler will emit code that performs operations
+belonging to different iterations of the original loop in the same
+cycle. Since data dependencies within a loop body usually severely limit
+the number of operations that can be done in parallel, pipelining allows
+the second (and later) iterations to start well before the first
+iteration is done. This is done by dividing the loop body into a number
+of stages that would normally be executed sequentially. These stages are
+then executed in parallel, but for different iterations (i.e., running
+stage 2 of iteration $i$ while running stage 1 of iteration $i+1$).
+
+This approach allows a loop to run a lot faster than executing a single
+iteration at a time. However, since the instructions for the first and
+last few iterations (the prologue and epilogue) are distinctly different
+from the loop ``kernel'', the number of instructions needed for a
+pipelined loop can easily increase a lot.
+
+However, all pipelined loops share a very distinct structure (first
+stage 1, then stages 1+2, then stages 1+2+3, etc., until all stages run
+at the same time; the epilogue mirrors this in reverse). Also, the
+instructions in each prologue and epilogue step are a strict subset of
+the instructions in the kernel. By adding hardware support for exactly
+this structure, the code size increase for the prologue and epilogue can
+effectively be reduced to a fixed number of instructions.
+
+In short, the two main concerns are performance (fast inner loops) and
+code size (compressing the prologue and epilogue).
+
+I'll also spend a few words on the observed merits of hardware/software
+codesign.