+
+\subsection{Familiarizing with LLVM}
+Since the frontend heavily relies on the LLVM project for doing it's work, one
+of the first challenges was to get myself familiar with LLVM. There are two main
+aspects to this: Getting to know my way around the LLVM codebase and getting to
+know the LLVM community.
+
+Since LLVM has a pretty large amount of documentation, I spent most of my first
+weeks with reading tutorials and documents. Since there was already a (very
+preliminary) version of the clang-based frontend, I also had some code to play
+with.
+
+During this period, it was not completely clear what the frontend should
+be doing and what transformations should be written. To prevent circular
+dependencies in my tasks (I can't write any code before I know what needs to be
+written, but I can't really tell what's needed until I know how the code works,
+which I can't effectively learn without actively working with it, etc.) I
+started out with adapting the loop unrolling pass in LLVM to be better suited to
+the Montium architecture. Eventually, this code didn't turn out to be
+immediately useful (it's still not included currently), but it proved very
+insightful as to how the LLVM framework is built and what its possibilities are.
+
+Additionally, during my working with the code in this internship I also produced
+a number of patches for LLVM, containing bugfixes, some cleanup and
+documentation improvements. Since the best way to integrate with any open source
+project seems to be contributing code, I was giving commit access to the LLVM
+tree not long thereafter. This access has proved very useful during the rest of
+the internship, since it was now a a lot easier to make (simple) changes to the
+LLVM framework to better suit the needs of Recore.
+
+A major challenge during my internship was to find the balance between doing
+work specifically for Recore, and doing work that is useful for LLVM in general.
+Any tasks that are required by the Montium frontend and the LLVM changes they
+directly need are obvious. However, usually when making changes to the main LLVM
+code, just changing enough for Recore is not engough for LLVM. Since the LLVM
+code must work on any program, not just MontiumC programs, extra changes are
+required (see also parapgrah \ref{StayingGeneric}. This is also an issue of
+building up credit within the LLVM community: The more you contribute to LLVM,
+the more influence you have when things need changing.
+
+Lastly, this is also a matter of efficiency: If I have been working
+with a particular piece of code intensively, it is more efficient for me to fix
+a bug in that code than most others, even though the particular bug does not
+interfere with the MontiumC frontend. In the end, I think I managed to find a
+decent balance, though it might have been tipped slighly in favour of the LLVM
+project.
+
+\subsection{Coding}
+This section says something about the challenges encountered while actually
+writing code, if I can think of enough things to say here. So far, haven't
+thought of anything particularly interesting yet.
+
+\subsection{Working together}
+Since the compiler plays a fairly central role in the development process at
+Recore, I have been cooperating with a number of different people, in different
+areas. On one end, the compiler is directly used by the DSP engineers, so a lot
+of the requirements and wishes for the compiler come from them. Also, they are
+often providing bug reports and other feedback, which ensures regular contact.
+
+On the other end, I have been in contact with the developer of the backend very
+intensively, since most changes made to either component needed changes in the
+other one as well. Compiler changes also require hardware support, so working
+with the hardware developers was not uncommon either. In practice, most of this
+communication went through the backend developer, except for the design
+discussion concerning the new Montium hardware design (also see section
+\ref{Pipelining} below).
+
+In addition, discussions regarding design issues at various levels often happen
+out in the open, which easily invites people with an opinion about something to
+chime in, without having to gather people around for every discussion that
+happens. This allows for very quick feedback on ideas from people in all areas
+in an efficient way.
+
+\subsection{Staying generic}
+\label{StayingGeneric}
+The toughest challenge by far was to fit the features and changes required by
+Recore into code that is maintained by the LLVM project. The main problem here
+is clash of goals: The LLVM project aims for code that performs well for their
+supported architectures, while Recore aims for code that performs well (or more
+often, can compile MontiumC at all) for the Montium architecture.
+
+In general, these problems mainly occur because MontiumC and in particular
+MontiumIR poses a number of constraints that are not present in other
+architectures. This means that some transformations present in LLVM will violate
+these constraints (which would result in better code on most other
+architectures) resulting in miscompiled or uncompilable code.
+
+In some cases, these extra constraints can be formulated in a generic way, such
+that the LLVM code can handle architectures with or without the constraint.
+In a lot of cases, however, the constraint is so Montium specific that changing
+LLVM to support it is not feasible. In a few cases, this meant that the backend
+was changed to support more features of the LLVM IR. In other cases, a
+Recore-specific transformations was added to solve these problems. In a few more
+cases, the problems are still unresolved, effectively resulting in additional
+constraints on the MontiumC language.
+
+\subsection{New hardware design}
+\label{Pipelining}
+I've also been involved for a bit with the instruction scheduling algorithm
+required for the new (pipelined) hardware design and the hardware design itself.
+Even though this is completely outside of the area of my assignment, the initial
+prototype of that scheduler was created by someone else using LLVM. Because of
+my experience with LLVM, I have been assisting him with that. Initially mostly
+helping out with hints on LLVM coding, but later also with thinking about the
+scheduler and hardware design.
+
+I will not go into much detail about the new hardware and its scheduler here,
+but I will highlight the most important challenges and tradeoffs.
+
+\subsubsection{Tradeoffs}
+In general, the most important tradeoff seems to be between flexibility and
+everything else (code size, performance, complexity, hardware area). This
+flexibility is expressed in which instructions are possible, which connections
+are present, how big register files are, etc.
+
+An important reason to be flexible is for programmability. If the hardware is
+regular, making a compiler that produces optimal code gets a lot easier.
+On the other hand, the compiler also limits flexibility. If the hardware
+has flexibility that the compiler will never use, it's better to save
+area and complexity by making the hardware less flexible. Exactly for this
+reason, it is important to develop hardware and supporting software in parallel,
+instead of using the hardware first, software later approach used with the
+initial Montium (TODO: Is this true?). This allows for a much better balanced
+and usable design, without any unused extras.
+
+\subsubsection{Inner loop pipelining}
+When trying to improve runtime performance, the main focus is on
+optimizing loops, and inner loops (loops that contain no other loops) in
+particular. Since the inner loop is executed the most often, it is the most
+efficient to optimize the inner loop. Also, the inner loop is also the piece of
+code that can most optimally use the parellel processing power of the Montium,
+because it can be software pipelined.
+
+Software pipelining means that the compiler will emit code that performs
+operations that belong in different iterations of the original loop during the
+same cycle. Since data dependencies within a loop body usually severely limit
+the amount of operations that can be done in parallel, pipelining allows the
+second (and further) iteration to start well before the first iteration is done.
+This is done by dividing the loop body in a number of stages, that would
+normally be executed sequentially. These stages are then executed in parallel,
+but for different iterations (ie, run stage 2 of iteration i, while running
+stage 1 of iteration i+1).
+
+This approach allows a loop to be ran a lot faster than executing a
+single iteration at a time. However, since the instructions for the
+first and last few iterations (the prologue and epilogue) are distinctly
+different from the loop "kernel", the number of instructions needed for
+a pipelined loop can easily increase a lot.
+
+However, all pipelined loops share a very distinct structure (first
+stage 1, then stage 1+2, then stage 1+2+3, etc, then all stages at the
+same time, similar for the epilogue). Also, every instruction in the
+prologue and epilogue are a strict subset of the instructions in the
+kernel. By adding some hardware support for exactly this structure, the
+code size increase for the prologue and epilogue can be effectively
+reduced to a fixed number of instructions (which take the number of stages as a
+parameter and uses preloaded instructions with explicit stage annotation).
+
+The tradeoff here is that this hardware is only usable specifically for these
+inner loops, any other code will leave this extra hardware unused. However,
+since pipelined inner loops appear to be very common, this should not be a
+problem at all.
+
+\subsubsection{Code compression}
+Another very important tradeoff concerns codesize. In the old hardware, a lot of
+flexibility in the original code was achieved by using inline functions (since
+the hardware has only very limited support for code reuse through function
+calls). This resulted in a lot of code duplication, which was compensated for by
+using two level configuration registers to be able to reuse (partial)
+instructions nonetheless (which will still need a lot of sequencer instructions,
+but those will be a lot smaller than the full instruction).
+
+On the new hardware, however, function calls are more powerful, which should
+lead to a lot less code duplication. For this reason, putting every instruction
+in configuration registers might actually take more space instead of less. It
+should be noted that, the configuration registers of the old Montium are
+effectively a compiler controlled cache that is mandatory and static
+(instructions must be in the cache and the cache cannot be modified at runtime).
+By lifting these limitations, we get a cache that is a lot more flexible.
+However, since it is compiler controlled (as opposed to letting the hardware
+decide what ends up in the cache), the performance of the code remains very
+predictable (unlike traditional caches). Additionally, this has the advantage
+that the size of the cache is no longer an upper bound on the program size, but
+only the instruction memory is (which can be a lot bigger).
+
+The tradeoff here is that the sequencer instructions will get a lot bigger,
+since they need to contain a full instruction word (which would be preloaded
+into the CRs in the old design) which can be up to a few hundred bits long.
+Since not all sequencer instructions need to be this long (executing out of the
+cache should be a lot shorter), different instruction lengths should be
+supported. Also, some smart compression techniques should also be applied to
+those long instruction words, though that can only be done once there is some
+more application code to gather statistics about.