X-Git-Url: https://git.stderr.nl/gitweb?a=blobdiff_plain;f=Report%2FMain%2FContext%2FMontium.tex;h=973d21e317cb00636e67d80075d811e8181d8ffa;hb=refs%2Ftags%2FReport-final;hp=18272ca7553da01ebe4d3f1f80800a28efffcac1;hpb=99713a971023a195e42cf9e63a6b30e3e87d9880;p=matthijs%2Fprojects%2Finternship.git
diff --git a/Report/Main/Context/Montium.tex b/Report/Main/Context/Montium.tex
index 18272ca..973d21e 100644
--- a/Report/Main/Context/Montium.tex
+++ b/Report/Main/Context/Montium.tex
@@ -1,4 +1,115 @@
 \section{Montium Tile Processor}
-This section describes the Montium Tile Processor (Montium) in moderate detail.
-It is not meant to be a full spec, but it provides the context necessary for
-understanding the next sections and getting a feel for the challenges involved.
+\label{Montium}
+The Montium Tile Processor (Montium) is the main product of Recore Systems. It
+is a reconfigurable processor that is designed for inclusion in a tiled,
+heterogeneous multi- or manycore system-on-chip (SoC), connected to other tiles
+and the outside world through a network-on-chip (NoC).
+
+The Montium has a number of fundamental differences from ``regular'' processors
+and DSP engines that make it both interesting and challenging to program, for
+both application programmers and compiler designers.
+
+\begin{figure}
+ %\epsfig{file=Img/MontiumOverview.eps, width=.5\textwidth}
+ \caption{Overview of the Montium design}
+\end{figure}
+
+\subsection{Overall design}
+The Montium is built from a few parts. The central part is the interconnect,
+which ties memories, arithmetic and logic units (ALU) and the communication
+and configuration unit (CCU) together. The memories store data locally, the
+ALUs process data and the CCU moves data and configuration on and off the
+Montium.
+Furthermore, there is a sequencer, which is the closest thing to a normal
+processor in the Montium: it accepts and executes instructions one by one, can
+perform (conditional) jumps and supports some other limited control flow.
+
+\subsubsection{Sequencer}
+The sequencer executes its instructions one by one and controls all other
+elements through the configuration registers (CR). To keep the size of sequencer
+instructions limited, while not limiting the flexibility of the other elements,
+two levels of configuration registers are introduced. These registers are wide
+and contain multiple sets of input signals to the various multiplexers, function
+units, etc.
+
+The sequencer instructions in turn contain indices into these configuration
+registers. This way, every sequencer instruction can select a configuration for
+the entire Montium for the cycle during which the instruction is executed. This
+also means that the Montium is reconfigured on every cycle, for maximum
+flexibility and performance.
+
+Using a two-level configuration register scheme ensures that when (part of) a
+particular configuration is reused in more than one sequencer instruction, it
+does not have to be duplicated entirely. Only the index pointing to the right
+configuration register (which is a lot smaller) is duplicated in multiple
+sequencer instructions. This does of course limit the number of different
+configurations that a single program can use and thus limit the size of a
+Montium program.
+
+\subsubsection{Memories}
+The Montium contains ten memories (two for each ALU). Each of these memories has
+its own Address Generation Unit (AGU), which can generate different memory
+address patterns. This means that the instructions or CRs never contain direct
+memory addresses, only modifications to the current address.
+Each memory simply reads from its current address and offers the value to the
+interconnect, which distributes it to wherever it is needed. Writing works the
+same way, though a memory can be either read or written in a cycle, not both.
+
+\subsubsection{Arithmetic and logic units}
+The main processing elements of the Montium are its 5 arithmetic and
+logic units (ALU). Each of them has four (16-bit) inputs, each with a
+number of input registers. Each ALU contains a number of function units,
+a multiplier, a few adders and some miscellaneous logic. Each of the
+elements in the ALU can be controlled separately and data can be routed
+in different ways by configuration of multiplexers inside the ALU. The
+ALU has two output ports, without registers. Additionally, there is a
+connection from each ALU to its neighbour.
+
+The ALU has no internal registers, so data travels through the entire ALU
+in a single cycle, to arrive at the outputs before the end of the cycle. This
+means that the ALU can perform a lot of computation in a single clock cycle. For
+example, using four of the five ALUs, an FFT butterfly operation (two complex
+multiplications and four complex additions) can be executed in a
+single clock cycle. The downside of this approach is that the data will have a
+long path to travel, which limits the clock speed of the design.
+
+\subsubsection{Communication and Configuration Unit}
+The communication and configuration unit (CCU) controls communication
+with the external world, usually a network-on-chip. During normal operation, the
+CCU can take values from the
+interconnect and stream them out onto the NoC, or vice versa. Additionally, the
+CCU can be used from outside the Montium to start and stop execution and
+move configuration registers, sequencer instructions and memory contents into
+and out of the Montium.
+
+\subsubsection{Interconnect}
+The central part of the Montium is the interconnect, which is a crossbar
+of lines, of which most are connected. There are a total of 10 global
+busses in the interconnect, to which every input and output port of the
+various components can be connected. This way, every output of the
+memories, ALUs and CCU can be routed to every input, provided that
+there are enough global busses. Additionally, each pair of memories
+belonging to a specific ALU can be routed directly to the inputs and
+outputs of that ALU, without requiring a global bus.
+
+\subsection{Design changes}
+Currently, the Montium design is undergoing a major overhaul. During work with
+the original design, a number of flaws or suboptimal constructs have been found.
+In particular, the ALUs are capable of performing a large number of operations
+in a single cycle, but since they operate sequentially, this severely limits
+clock speeds. In the new design, the number of ALUs is reduced, but each ALU is
+subdivided into multiple parallel function units. Also, the Montium has
+only very limited support for control flow, making it hard to program it for
+data-dependent control and synchronization, which calls for improvement.
+
+The new ALU design requires computations to be properly pipelined to use all
+those function units in parallel efficiently, but since data travels through
+only a single function unit in each cycle, this allows for much higher clock
+speeds than the old design.
+
+During my internship I have mainly been working with the old Montium
+design, and unless otherwise stated, that is what is meant when
+referring to the ``Montium''. Some of the work has been done with the new
+design in mind, but I have actually been working with the new design
+only during the final weeks of my internship. See section~\ref{Pipelining}
+for more details.