Report/Main/Problems/Challenges.tex

   1 \section{Challenges and Solutions}
   2 This section will describe the challenges faced during each of the tasks and the
   3 solutions found for both the task itself and the challenges.
   4
   5 \subsection{What is MontiumC?}
   6 A critical question popped up during at the beginning of my internship: What is
   7 MontiumC? Previously, there was no real specification of MontiumC. There was
   8 documentation about the functions that could be used, some examples and a lot of
   9 personal knowledge in the heads of the Recore employees, but ultimately MontiumC
  10 was "whatever the compiler eats".
  11
  12 To be able to create a proper set of transformations, the constraints on the
  13 input and output of that transformation process should be properly specified.
  14 This entails two parts: Specifying the MontiumC language, and specifying the
  15 Montium IR constraints, which is is the input to the backend.
  16
  17 Specifying Montium IR was relatively easy, since it is defined directly by the
  18 backend. The MontiumC specification is slightly more complex. There are two
  19 different angles to it: What does the compiler support, and what do we want the
  20 compiler to support.
  21
  22 \subsubsection{What is supported?}
  23 One angle for looking at MontiumC is seeing what the compiler can currently
  24 compile and turn that into a formal specification. This approach works fine to
  25 set a baseline for the compiler: A point to start improving the transformations
  26 from. Apart from that, this initial specification is also useful in reviewing
  27 existing MontiumC code and verifying existing transformations.
  28
  29 Existing MontiumC code is supported by the compiler (since it has been tweaked
  30 until it worked). However, the existing code might not fully work within the
  31 produced specification. This can happen in particular when existing code makes
  32 use of corner cases in the compiler that have been missed in (or left out of)
  33 the specification, because they will not work reliably in all cases.
  34
  35 The best way to detect these cases is making the compiler check its input using
  36 an the specification. This way, any code operating outside of the specification
  37 can be detected automatically.
  38
  39 Existing transformations, on the other hand, might miss a few corner cases. When
  40 writing the specification, a particular feature might appear supported by the
  41 compiler. On closer inspection, the transformation passes might miss some corner
  42 cases, which either need to be fixed or put into the specification.
  43
  44 The best way to detect these cases is writing a lot of structured testing code,
  45 which uses (combinations of) the features that are supported according to the
  46 specification. This way, corner cases exposed by the testing code can be
  47 detected automatically.
  48
  49 Building this initial specification did pose a number of challenges. Since
  50 simply trying all possible C features to see if they are accepted by the
  51 MontiumC compiler and thus valid MontiumC is a lengthy process and only useful
  52 in a limited way. A more constructive way would be to examine the compiler
  53 components to see what transformations are applied and from that derive the
  54 specification for valid MontiumC. However, most of these components are not very
  55 transparent. The Clang frontend supports a lot of C constructs and is not a
  56 trivial piece of code. It also has support for Objective C and partially C++,
  57 which makes it harder to see which code paths are actually used for compiling
  58 MontiumC. This issue is only partially solved, meaning that there might still be
  59 incorrect or missing cases in the specification.
  60
  61 Another problem is the complexity of the C language. Although individual
  62 features can be described and supported with relative ease, combining features
  63 can easily lead to complex constructs which are hard to transform into supported
  64 Montium IR. This became more of an issue after adding features to the base
  65 specification of MontiumC, though.
  66
  67 \subsubsection{What is wanted?}
  68 A completely different angle of looking at this, is from the requirements point
  69 of view. What do we want MontiumC to support? This angle is even harder than the
  70 previous one, since there are a lot of levels of requirements. Ideally, MontiumC
  71 would not exist and our compiler would support the C language fully. However,
  72 this would require a very complicated compiler (both frontend and backend).
  73
  74 Not only that, it is simply not possible to map all valid C programs to the
  75 Montium hardware. "Regular" architectures don't suffer from this problem (as
  76 much), since most instruction sets are not fundamentally different from the
  77 features supported by C (or other imperative languages). This means that
  78 anything is mappable, but with a simple compiler will not result in the most
  79 efficient code. In the Montium case, a lot of things simply cannot be mapped on
  80 the hardware at all.
  81
  82 Considering that our ideal is not reachable (by far), every feature in our
  83 wanted MontiumC should be evaluated thoroughly for feasibility, both in hardware
  84 and in the compiler. In practice, this meant that new language features would be
  85 informally expressed and discussed, and only added to the specification after
  86 being succesfully implemented. This is conforming to the incremental development
  87 of MontiumC that was envisioned at the outset of its development.
  88
  89 \subsection{Familiarizing with LLVM}
  90 Since during my internship I have been working mainly with LLVM (code), one of
  91 the first challenges was to get myself familiar with LLVM. There are two main
  92 aspects to this: Getting to know my way around the LLVM codebase and getting to
  93 know the LLVM community.
  94
  95 Since LLVM has a pretty large amount of documentation, I spent most of my first
  96 weeks with reading tutorials and documents. Since there was already a (very
  97 preliminary) version of the clang-based frontend, I also had some code to play
  98 with.
  99
 100 During this period, it was not completely clear what the frontend should
 101 be doing and what transformations should be written. To prevent circular
 102 dependencies in my tasks (I can't write any code before I know what needs to be
 103 written, but I can't really tell what's needed until I know how the code works,
 104 which I can't effectively learn without actively working with it, etc.) I
 105 started out with adapting the loop unrolling pass in LLVM to be better suited to
 106 the Montium architecture. Eventually, this code didn't turn out to be
 107 immediately useful (it's still not included currently), but it proved very
 108 insightful as to how the LLVM framework is built and what its possibilities are.
 109
 110 Additionally, during my working with the code in this stage I also produced a
 111 number of patches for LLVM, containing bugfixes, some cleanup and documentation
 112 improvements. Since the best way to integrate with any open source project seems
 113 to be contributing code, I was giving commit access to the LLVM tree not long
 114 thereafter. This access has proved very useful during the rest of the
 115 internship, since it was now a a lot easier to make (simple) changes to the LLVM
 116 framework to better suit the needs of Recore.
 117
 118 A major challenge during my internship was to find the balance between doing
 119 work specifically for Recore, and doing work that is useful for LLVM in general.
 120 Any tasks that are required by the Montium frontend and the LLVM changes they
 121 directly need are obvious. However, usually when making changes to the main LLVM
 122 code, just changing enough for Recore is not engough for LLVM. Since the LLVM
 123 code must work on any program, not just MontiumC programs, extra changes are
 124 required (see also parapgrah \ref{StayingGeneric}. This is also an issue of
 125 building up credit within the LLVM community: The more you contribute to LLVM,
 126 the more influence you have when things need changing.
 127
 128 Lastly, this is also a matter of efficiency: If I have been working
 129 with a particular piece of code intensively, it is more efficient for me to fix
 130 a bug in that code than most others, even though the particular bug does not
 131 interfere with the MontiumC frontend. In the end, I think I managed to find a
 132 decent balance, though it might have been tipped slighly in favour of the LLVM
 133 project.
 134
 135 \subsection{Coding}
 136 This section says something about the challenges encountered while actually
 137 writing code, if I can think of enough things to say here.
 138
 139 \subsection{Working together}
 140 Since the compiler plays a fairly central role in the development process at
 141 Recore, I have been cooperating with a number of different people, in different
 142 areas. On one end, the compiler is directly used by the DSP engineers, so a lot
 143 of the requirements and wishes for the compiler come from them. Also, they are
 144 often providing bug reports and other feedback, which ensures regular contact.
 145
 146 On the other end, I have been in contact with the developer of the backend very
 147 intensively, since most changes made to either component needed changes in the
 148 other one as well. So compiler changes also require hardware support, so working
 149 with the hardware developers was not uncommon either. In practice, most of this
 150 communication went through the backend developer, except for the design
 151 discussion concerning the new Montium hardware design (also see section
 152 \ref{Pipelining} below).
 153
 154 In addition, discussions regarding design issues at various levels often happen
 155 out in the open, which easily invites people with an opinion about something to
 156 chime in, without having to gather people around for every discussion that
 157 happens. This allows for very quick feedback on ideas from people in all areas
 158 in an efficient way.
 159
 160 \subsection{Staying generic}
 161 \label{StayingGeneric}
 162 The toughest challenge by far was to fit the features and changes required by
 163 Recore into code that is maintained by the LLVM project. The main problem here
 164 is clash of goals: The LLVM project aims for code that performs well on their
 165 supported architectures, while Recore aims for code that performs well (or more
 166 often, can compile at all) on the Montium architecture.
 167
 168 In general, these problems mainly occur because MontiumC and in particular
 169 MontiumIR poses a number of constraints that are not present in other
 170 architectures. This means that some transformations present in LLVM will violate
 171 these constraints (which would result in better code on most other
 172 architectures) resulting in miscompiled or uncompilable code.
 173
 174 In some cases, these extra constraints can be formulated in a generic way, such
 175 that the LLVM code can handle architectures with or without the constraint.
 176 In a lot of cases, however, the constraint is so Montium specific that changing
 177 LLVM to support it is not feasible. In a few cases, this meant that the backend
 178 was changed to support more features of the LLVM IR. In other cases, a
 179 Recore-specific pass was added to solve these problems. In a few more cases, the
 180 problems are still unresolved, effectively resulting in additional constraints
 181 on the MontiumC language.
 182
 183 \subsection{New hardware design}
 184 \label{Pipelining}
 185 I've also been involved for a bit with the instruction scheduling algorithm
 186 required for the new (pipelined) hardware design. Even though this is completely
 187 outside of the area of my assignment, the initial prototype of that scheduler
 188 was created using LLVM by someone else, so I have been assisting him with that.
 189 Initially mostly helping out with hints on LLVM coding, but later also
 190 with thinking about the scheduler and hardware design.
 191
 192 I will not go into much detail about the new hardware and its scheduler here,
 193 but I will highlight the most important challenges and tradeoffs.
 194
 195 \subsubsection{Tradeoffs}
 196 In general, the most important tradeoff seems to be between flexibility and
 197 everything else (code size, performance, complexity). This flexibility is
 198 expressed in which instructions are possible, which connections are present, how
 199 big register files are, etc.
 200
 201 An important reason to be flexible is for programmability. If the hardware is
 202 regular, making a compiler that produces optimal code gets a lot easier.
 203 On the other hand, the compiler also limits flexibility. If the hardware
 204 has flexibility that the compiler will never use, it's better to save
 205 area and complexity by making the hardware less flexible. Exactly for this
 206 reason, it is important to develop hardware and supporting software in parallel,
 207 instead of using the hardware first, software later approach used with the
 208 initial Montium. This allows for a much better balanced and usable design,
 209 without any unused extras.
 210
 211 \subsubsection{Inner loop pipelining}
 212 When trying to improve runtime performance, the main focus is on
 213 optimizing loops, and inner loops (loops that contain no other loops) in
 214 particular. Since the inner loop is executed the most, it is the most
 215 efficient to optimize the inner loop. Also, the inner loop is also the
 216 piece of code that can most optimally use the parellel processing power
 217 of the Montium, because it can be software pipelined.
 218
 219 This means that the compiler will emit code that performs operations
 220 that belong into different iterations of the original loop in the same
 221 cycle. Since data dependencies within a loop body usually severely limit
 222 the amount of operations that can be done in parallel, pipelining allows
 223 the second (and more) iteration to start well before the first iteration
 224 is done. This is done by dividing the loop body in a number of stages,
 225 that would normally be executed sequentially. These stages are then
 226 executed in parallel, but for different iterations (ie, run stage 2 of
 227 iteration i, while running stage 1 of iteration i+1).
 228
 229 This approach allows a loop to be ran a lot faster than executing a
 230 single iteration at a time. However, since the instructions for the
 231 first and last few iterations (the prologue and epilogue) are distinctly
 232 different from the loop "kernel", the number of instructions needed for
 233 a pipelined loop can easily increase a lot.
 234
 235 However, all pipelined loops share a very distinct structure (first
 236 stage 1, then stage 1+2, then stage 1+2+3, etc, then all stages at the
 237 same time, similar for the epilogue). Also, every instruction in the
 238 prologue and epilogue are a strict subset of the instructions in the
 239 kernel. By adding some hardware support for exactly this structure, the
 240 code size increase for the prologue and epilogue can be effectively
 241 reduced to a fixed number of instructions.
 242
 243 The tradeoff here is that this hardware is only usable specifically for these
 244 inner loops, any other code will leave this extra hardware unused. However,
 245 since pipelined inner loops appear to be very common, this should not be a
 246 problem at all.
 247
 248 \subsubsection{Code compression}
 249 Another very important tradeoff concerns codesize. In the old hardware, a lot of
 250 flexibility in the original code was achieved by using inline functions (since
 251 the hardware has only very limited support for function calls and thus code
 252 reuse). This resulted in a lot of code duplication, which was compensated for by
 253 using two level configuration registers (which will still need a lot of
 254 sequencer instructions, but those will be a lot smaller than the full
 255 instruction).
 256
 257 On the new hardware, however, function calls are more powerful, which should
 258 lead to a lot less code duplication. For this reason, putting every instruction
 259 in configuration registers might actually take more space instead of less. It
 260 should be noted that, the configuration registers are actually just a compiler
 261 controlled cache that is mandatory and static (instructions must be in the cache
 262 and the cache cannot be modified at runtime). By lifting these both limitations,
 263 we get a cache that is a lot more flexible. Additionally, this has the advantage
 264 that the size of the cache is no longer an upper bound on the program size, but
 265 only the instruction memory is (which can be a lot bigger).
 266
 267 The tradeoff here is that the sequencer instructions will get a lot bigger,
 268 since they need to contain a full instruction word (which would be preloaded
 269 into the CRs in the old design) which can be up to a few hundred bits long.
 270 Since not all sequencer instructions need to be this long (executing out of the
 271 cache should be a lot shorter), different instruction lengths should be
 272 supported. Also, some smart compression techniques should also be applied to
 273 those long instruction words, though that can only be done once there is some
 274 more application code to gather statistics on.