+
+To get a feeling for what MontiumC looks like, consider the fragment in
+figure \ref{ExampleLow}. This is a piece of code that reads values from
+one memory, multiplies them by two, and writes them back to another
+memory. As you can see, this is a lot of code for such a simple task,
+for two main reasons. First, looping and memory indexing are very
+explicit and take a lot of instructions. Second, the code is
+effectively software pipelined to make it run more efficiently.
+
+In figure \ref{ExampleHigh} the same code is shown, but this time
+using higher level C features (for loops, array indexing). This is the
+level of code we are trying to achieve, but we are not there yet. It
+should be noted that this is still not ``normal'' C, since we use the
+``\verb imul '' function instead of the normal \verb * operator. However,
+since the Montium defines a lot of different operations, most of which
+have a number of options (saturation, truncation, post shifting, etc.),
+these cannot all be mapped onto normal C operators. By using a specific
+function call for each operation, we can still distinguish between all
+the different operations and add extra arguments where needed.
+
+\subsubsection{What do we have now?}
+The result of this work is a usable, but conservative, specification. It
+defines the subset of features that should certainly be supported. In practice,
+some other features will also work, but not reliably. Therefore, these are left
+out of the specification.
+
+It is not unlikely that the specification is still incorrect in a few places (or
+rather, that the code does not implement the specification properly). Since
+so far there has not been any automated checking of programs against the
+specification, these errors have not been uncovered. Once the new hardware is
+more clearly defined and the MontiumC specification is updated for it, this
+checking should be added so the specification and compiler can be better
+matched.
+
+\begin{figure}
+ \caption{Low level MontiumC example}
+\label{ExampleLow}
+\begin{verbatim}
+mem input;
+mem output;
+word factor;
+
+void run(void) {
+ factor = from_int(2);
+ input = alloc_mem(P0M0);
+ output = alloc_mem(P0M1);
+ set_base(input, 0);
+ set_offset(input, 0);
+ set_base(output, -1);
+ set_offset(output, -1);
+
+ next_cycle();
+ word in = read_mem(input);
+  word out = p0o0(imul(ra1(in), rc1(factor)));
+ add_offset(input, 1);
+ add_offset(output, 1);
+ init_loop(LC1, 8);
+ do {
+ write_mem(output, out);
+ in = read_mem(input);
+    out = p0o0(imul(ra1(in), rc1(factor)));
+ add_offset(input, 1);
+ add_offset(output, 1);
+ } while(loop_next(LC1));
+
+  write_mem(output, out);
+}
+\end{verbatim}
+\end{figure}
+
+\begin{figure}
+ \caption{High level MontiumC example}
+ \label{ExampleHigh}
+\begin{verbatim}
+P0M0 int input[10];
+P0M1 int output[10];
+
+void run(void) {
+ for (int i=0; i<10; ++i)
+ output[i] = mul(input[i], 2);
+}
+\end{verbatim}
+\end{figure}
+
+\subsection{Familiarizing with LLVM}
+Since the frontend heavily relies on the LLVM project for doing its work, one
+of the first challenges was to familiarize myself with LLVM. There were two
+main aspects to this: getting to know my way around the LLVM codebase and
+getting to know the LLVM community.
+
+LLVM has a fairly large amount of documentation, and I spent most of my first
+weeks reading tutorials and documents. Since there was already a (very
+preliminary) version of the Clang-based frontend, I also had some code to play
+with.
+
+During this period, it was not completely clear what the frontend should
+be doing and what transformations should be written. To prevent circular
+dependencies in my tasks (I can't write any code before I know what needs to be
+written, but I can't really tell what's needed until I know how the code works,
+which I can't effectively learn without actively working with it, etc.) I
+started out with adapting the loop unrolling pass in LLVM to be better suited to
+the Montium architecture. Eventually, this code turned out not to be
+immediately useful, because deciding when (not) to unroll a loop proved
+rather hard (it is still not included currently). Working with this pass
+was very insightful, however, into how the LLVM framework is built and
+what its possibilities are.
+
+Additionally, while working with the code during this internship I produced
+a number of patches for LLVM, containing bugfixes, some cleanup and
+documentation improvements. Since the best way to integrate with any open source
+project seems to be contributing code, I was given commit access to the LLVM
+tree not long thereafter. This access proved very useful during the rest of
+the internship, since it made it a lot easier to make (simple) changes to the
+LLVM framework to better suit the needs of Recore.
+
+A major challenge during my internship was to find the balance between doing
+work specifically for Recore, and doing work that is useful for LLVM in general.
+The changes that are directly required by the Montium frontend are obvious.
+However, when making changes to the main LLVM tree, changing just enough for
+Recore is usually not enough for LLVM: since the LLVM code must work on any
+program, not just MontiumC programs, extra changes are
+required (see also section \ref{StayingGeneric}). Additionally, this is an issue of
+building up credit within the LLVM community: The more you contribute to LLVM,
+the more influence you have when things need changing.
+
+Lastly, this is also a matter of efficiency: If I have been working
+with a particular piece of code intensively, it is more efficient for me to fix
+a bug in that code than most others, even though the particular bug does not
+interfere with the MontiumC frontend. In the end, I think I managed to find a
+decent balance, though it might have been tipped slightly in favour of the LLVM
+project.
+
+\subsection{Working together}
+Since the compiler plays a fairly central role in the development process at
+Recore, I have been cooperating with a number of different people, in different
+areas. On one end, the compiler is directly used by the DSP engineers, so a lot
+of the requirements and wishes for the compiler come from them. They also
+often provide bug reports and other feedback, which ensures regular contact.
+
+On the other end, I have been in very intensive contact with the developer of
+the backend, since most changes made to either component needed changes in the
+other one as well. Compiler changes also require hardware support, so working
+with the hardware developers was not uncommon either. In practice, most
+communication with the hardware developers went through the backend
+developer, except for the design discussion concerning the new Montium
+hardware design (also see section \ref{Pipelining}).
+
+In addition, discussions regarding design issues at various levels often happen
+out in the open, which invites people with an opinion about something to
+chime in, without the overhead of planning a separate meeting for each
+topic. This allows for very quick feedback on ideas from people in all areas
+in an efficient way.
+
+\subsection{Staying generic}
+\label{StayingGeneric}
+The toughest challenge by far was to fit the features and changes required by
+Recore into code that is maintained by the LLVM project. The main problem here
+is a clash of goals: The LLVM project aims for code that performs well for their
+supported architectures, while Recore aims for code that performs well (or more
+often, can compile at all) for the Montium architecture.
+
+In general, these problems mainly occur because MontiumC, and in particular
+MontiumIR, poses a number of constraints that are not present in other
+architectures. This means that some transformations present in LLVM (which
+would result in better code on most other architectures) violate these
+constraints, resulting in miscompiled or uncompilable code.
+
+In some cases, these extra constraints can be formulated in a generic way, such
+that the LLVM code can handle architectures with or without the constraint.
+Examples include providing Montium specific parameters through LLVM's
+``\verb TargetData '' class and modifications to the loop unroller to support
+custom policies.
+
+In a lot of cases, however, the constraint is so Montium specific that changing
+LLVM to support it is not feasible.
+
+In a few of these cases, this meant that the backend was changed to support more
+features of the LLVM IR. An example of this is finding datapath constants
+(which are the result of a function call in MontiumC, and thus hard for LLVM
+to track).
+
+In other cases, Recore-specific transformations were added to solve these
+problems. Examples of these are a transformation that removes all global
+variables and passes them as arguments and return values instead, a
+transformation that forcibly inlines all functions marked as ``\verb inline ''
+and a transformation that limits variable lifetimes by introducing extra phi
+nodes.
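As a hand-written sketch of what the first of these transformations does (the actual pass rewrites LLVM IR, not C source; all names here are invented for illustration): a function that updates a global variable is turned into a function that takes the value as an argument and returns the updated value.

```c
#include <assert.h>

/* Before the transformation: state lives in a global variable. */
static int counter;
static void step_before(void) {
    counter += 1;
}

/* After the transformation: the global is threaded through as an
 * argument and a return value, so no global state remains. */
static int step_after(int counter_arg) {
    return counter_arg + 1;
}

/* Both versions compute the same result. */
static int run_comparison(void) {
    counter = 0;
    step_before();
    step_before();
    int c = 0;
    c = step_after(c);
    c = step_after(c);
    return (counter == c) ? c : -1; /* -1 would signal a mismatch */
}
```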
+
+In a few more cases, the problems are still unresolved, effectively resulting in
+additional constraints on the MontiumC language. Examples of these are
+preventing instructions from being moved out of if/else blocks (which is
+perfectly fine from an LLVM IR point of view, but does not take into account the
+extra meaning that an if statement has in MontiumIR) and removal of unused bits
+from a constant (which could introduce more distinct constants than the Montium
+has registers for).
+
+
+\subsection{New hardware design}
+\label{Pipelining}
+I have also been somewhat involved with the instruction scheduling algorithm
+required for the new (pipelined) hardware design, and with the hardware design
+itself. Even though this is completely outside the area of my assignment, the
+initial prototype of that scheduler was created by someone else using LLVM.
+Because of my experience with LLVM, I have been assisting him with it.
+Initially I helped him by giving hints on LLVM coding, but later also by
+thinking about the scheduler and hardware design.
+
+I will not go into much detail about the new hardware and its scheduler here,
+but I will highlight the most important challenges and tradeoffs.
+
+\subsubsection{Tradeoffs}
+In general, the most important tradeoff seems to be between flexibility and
+everything else (code size, performance, complexity, hardware area). This
+flexibility is defined in terms of possible instructions, present
+connections, register file sizes, etc.
+
+An important reason to be flexible is programmability: if the hardware is
+regular, making a compiler that produces optimal code gets a lot easier.
+On the other hand, the compiler also limits flexibility: if the hardware
+has flexibility that the compiler will never use, it is better to save
+area and complexity by making the hardware less flexible. Exactly for this
+reason, it is important to develop hardware and supporting software in parallel,
+instead of the hardware-first, software-later approach used for the
+initial Montium. This allows for a much better balanced and more usable
+design, without any unused extras.
+
+\subsubsection{Inner loop pipelining}
+When trying to improve runtime performance, the main focus is on
+optimizing loops, and inner loops (loops that contain no other loops) in
+particular. Since inner loops are executed most often (compared to
+other loops and other code), optimizing them is the most
+efficient. Inner loops can also make the best use of
+the parallel processing power of the Montium, because they can be
+software pipelined.
+
+Software pipelining means that the compiler will emit code that performs
+operations that belong in different iterations of the original loop during the
+same cycle. Since data dependencies within a loop body usually severely limit
+the amount of operations that can be done in parallel, pipelining allows the
+second (and further) iteration to start well before the first iteration is done.
+This is done by dividing the loop body in a number of stages, that would
+normally be executed sequentially. These stages are then executed in parallel,
+but for different iterations (ie, run stage 2 of iteration i, while running
+stage 1 of iteration i+1).
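The idea can be sketched in plain C, using an invented three-stage split (load, compute, store) of the loop from figure \ref{ExampleHigh}; this only illustrates the schedule, not the code the compiler actually emits.

```c
#include <assert.h>

#define N 8

/* Reference: one iteration at a time. */
static void run_sequential(const int *in, int *out) {
    for (int i = 0; i < N; ++i)
        out[i] = in[i] * 2;
}

/* Software pipelined: in each cycle t, three stages run "in parallel"
 * on three different iterations:
 *   stage 3 stores the result of iteration t-2,
 *   stage 2 computes for iteration t-1,
 *   stage 1 loads for iteration t.
 * The stages are written store-first so each reads last cycle's values. */
static void run_pipelined(const int *in, int *out) {
    int loaded = 0, computed = 0;
    for (int t = 0; t < N + 2; ++t) {
        if (t >= 2)  out[t - 2] = computed; /* stage 3: iteration t-2 */
        if (t >= 1)  computed = loaded * 2; /* stage 2: iteration t-1 */
        if (t < N)   loaded = in[t];        /* stage 1: iteration t   */
    }
}
```

Cycles 0 and 1 form the prologue (not all stages are active yet), cycles 2 through $N-1$ the kernel, and the last two cycles the epilogue; both functions produce identical output.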
+
+This approach allows a loop to run a lot faster than executing a
+single iteration at a time. However, since the instructions for the
+first and last few iterations (the prologue and epilogue) are distinctly
+different from the loop ``kernel'', the number of instructions needed for
+a pipelined loop can easily increase a lot.
+
+However, all pipelined loops share a very distinct structure: first
+stage 1, then stages 1+2, then stages 1+2+3, etc., then all stages at the
+same time, and similarly (in reverse) for the epilogue. Also, the
+instructions in the prologue and epilogue are a strict subset of the
+instructions in the kernel. By adding some hardware support for exactly this
+structure, the
+code size increase for the prologue and epilogue can be effectively
+reduced to a fixed number of instructions (which take the number of stages as a
+parameter and use preloaded instructions with explicit stage annotation).
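This structure can be made precise with a small helper (plain C, written purely for illustration): with a given number of stages and iterations, stage $s$ (0-based) works on iteration $t - s$ in global cycle $t$, and is active only when that iteration exists. Counting the active stages per cycle shows the ramp-up, the steady kernel, and the ramp-down.

```c
#include <assert.h>

/* Number of pipeline stages active in global cycle t, for a loop of
 * 'iters' iterations split into 'stages' stages. Stage s (0-based)
 * handles iteration t - s, which must lie in [0, iters). */
static int active_stages(int t, int stages, int iters) {
    int n = 0;
    for (int s = 0; s < stages; ++s) {
        int iter = t - s;
        if (iter >= 0 && iter < iters)
            ++n; /* this stage has a valid iteration in cycle t */
    }
    return n;
}
```

With 3 stages and 8 iterations the counts per cycle are 1, 2, 3, 3, \ldots, 3, 2, 1: the prologue and epilogue each execute a fixed, stage-parameterized subset of the kernel instructions, which is exactly what the hardware support exploits.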
+
+The tradeoff here is that this hardware is only usable specifically for these
+inner loops; any other code will leave this extra hardware unused. However,
+since pipelined inner loops appear to be very common, this should not be a
+problem at all.
+
+\subsubsection{Code compression}
+Another very important tradeoff concerns code size. In the old hardware, a lot of
+flexibility in the original code was achieved by using inline functions (since
+the hardware has only very limited support for code reuse through function
+calls). This resulted in a lot of code duplication, which was compensated for by
+using two-level configuration registers so that (partial) instructions could
+still be reused (this still needs a lot of sequencer instructions,
+but those are a lot smaller than the full instruction).
+
+On the new hardware, however, function calls are more powerful, which should
+lead to a lot less code duplication. For this reason, putting every instruction
+in configuration registers might actually take more space instead of less. It
+should be noted that the configuration registers of the old Montium are
+effectively a compiler controlled cache that is mandatory and static
+(instructions must be in the cache and the cache cannot be modified at runtime).
+By lifting these limitations, we get a cache that is a lot more flexible.
+However, since it is compiler controlled (as opposed to letting the hardware
+decide what ends up in the cache), the performance of the code remains very
+predictable (unlike traditional caches). Additionally, this has the advantage
+that the size of the cache is no longer an upper bound on the program size, but
+only the instruction memory is (which can be a lot bigger).
+
+The tradeoff here is that the sequencer instructions will get a lot bigger,
+since they need to contain a full instruction word (which would be preloaded
+into the CRs in the old design) which can be up to a few hundred bits long.
+Since not all sequencer instructions need to be this long (executing out of the
+cache should be a lot shorter), different instruction lengths should be
+supported. Some smart compression techniques should also be applied to
+those long instruction words, though that can only be done once there is
+more application code to gather statistics from.