\section{Montium Tile Processor}
\label{Montium}
The Montium Tile Processor (Montium) is the main product of Recore Systems. It
is a reconfigurable processor designed for inclusion in a tiled, heterogeneous
multi- or manycore system-on-chip (SoC), where it is connected to other tiles
and the outside world through a network-on-chip (NoC).

The Montium differs fundamentally from ``regular'' processors and DSP engines
in a number of ways, which makes it both interesting and challenging to work
with, for application programmers and compiler designers alike.

\begin{figure}
  %\epsfig{file=Img/MontiumOverview.eps, width=.5\textwidth}
  \caption{Overview of the Montium design}
\end{figure}

\subsection{Overall design}
The Montium is built from a few parts. The central part is the interconnect,
which ties the memories, the arithmetic and logic units (ALUs) and the
communication and configuration unit (CCU) together. The memories store data
locally, the ALUs process data and the CCU moves data and configuration on and
off the Montium. Furthermore, there is a sequencer, which is the closest thing
to a normal processor in the Montium: it accepts and executes instructions one
by one, can perform (conditional) jumps and supports some other limited forms
of control flow.

\subsubsection{Sequencer}
The sequencer executes its instructions one by one and controls all other
elements through the configuration registers (CRs). To keep sequencer
instructions small without limiting the flexibility of the other elements, two
levels of configuration registers are used. These registers are wide and
contain multiple sets of input signals for the various multiplexers, function
units, etc.

The sequencer instructions in turn contain indices into these configuration
registers. This way, every sequencer instruction selects a configuration for
the entire Montium for the cycle in which the instruction is executed. This
also means that the Montium is reconfigured on every cycle, for maximum
flexibility and performance.

The two-level configuration register scheme ensures that when (part of) a
particular configuration is reused in more than one sequencer instruction, it
does not have to be duplicated entirely. Only the index pointing to the right
configuration register (which is a lot smaller) is duplicated across
sequencer instructions. This does of course limit the number of different
configurations that a single program can use, and thus the size of a Montium
program.
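
The indirection described above can be illustrated with a small sketch. It is
not the actual Montium encoding: the \texttt{Configuration} fields, the
register contents and the bit widths are made up for illustration, and only a
single level of indirection is modelled. It shows how a program that reuses
configurations stores just the small indices, and what that saves compared to
repeating the wide configuration in every instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """Hypothetical wide configuration word (fields are illustrative)."""
    alu_function: str
    mux_select: int


# Assumed contents of a (tiny) bank of configuration registers.
CONFIG_REGISTERS = [
    Configuration("add", 0),
    Configuration("multiply", 1),
]

# Each sequencer instruction is modelled as just an index into the bank.
# Reusing a configuration repeats the small index, not the wide register.
PROGRAM = [0, 1, 1, 0, 1]


def run(program, registers):
    """Return the configuration that is active in each cycle."""
    return [registers[index] for index in program]


def storage_bits(program, registers, register_width, index_width):
    """Return (bits with indirection, bits without indirection)."""
    indirect = len(registers) * register_width + len(program) * index_width
    direct = len(program) * register_width
    return indirect, direct


active = run(PROGRAM, CONFIG_REGISTERS)
# Assumed widths: 64-bit configuration registers, 1-bit index.
indirect_bits, direct_bits = storage_bits(PROGRAM, CONFIG_REGISTERS, 64, 1)
```

With these assumed widths, the indirect scheme needs 133 bits against 320 bits
for full duplication. The price is the one mentioned above: a 1-bit index can
never address more than two distinct configurations.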

\subsubsection{Memories}
The Montium contains ten memories (two for each ALU). Each of these memories has
its own address generation unit (AGU), which can generate different memory
address patterns. This means that instructions and CRs never contain direct
memory addresses, only modifications to the current address. Each memory simply
reads from its current address and offers the value read to the interconnect,
which can then distribute it to wherever it is needed. Writing works in the
same way, though in any single cycle a memory can be either read from or
written to, not both.
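
The idea of addressing by modification only can be sketched as follows. This
is a simplified model rather than the real AGU: the class names, the
wrap-around at the end of the memory and the choice to modify the address
after each access are all assumptions made for the example.

```python
class AddressGenerationUnit:
    """Hypothetical AGU: holds a current address that is only ever
    modified relatively, never set per access by an instruction."""

    def __init__(self, size, start=0):
        self.size = size
        self.address = start % size

    def step(self, delta):
        """Apply a relative modification (assumed to wrap around)."""
        self.address = (self.address + delta) % self.size
        return self.address


class Memory:
    """Memory that always accesses the address its AGU currently holds."""

    def __init__(self, size):
        self.cells = [0] * size
        self.agu = AddressGenerationUnit(size)

    def cycle(self, delta=0, write_value=None):
        """One cycle: either a read or a write (never both), after which
        the address is modified by `delta`."""
        if write_value is None:
            result = self.cells[self.agu.address]
        else:
            self.cells[self.agu.address] = write_value
            result = None
        self.agu.step(delta)
        return result


mem = Memory(8)
for value in (10, 20, 30):
    mem.cycle(delta=1, write_value=value)  # writes to addresses 0, 1, 2
mem.agu.step(-3)                           # move back to address 0
values = [mem.cycle(delta=2) for _ in range(2)]  # reads addresses 0 and 2
```

Different address patterns (linear, strided, and so on) then correspond to
different sequences of \texttt{delta} values, not to different absolute
addresses in the instructions.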

\subsubsection{Arithmetic and logic units}
The main processing elements of the Montium are its five arithmetic and
logic units (ALUs). Each of them has four 16-bit inputs, each with a
number of input registers. Each ALU contains a number of function units,
a multiplier, a few adders and some miscellaneous logic. Each of the
elements in the ALU can be controlled separately and data can be routed
in different ways by configuring multiplexers inside the ALU. The
ALU has two output ports, without registers. Additionally, there is a
connection from each ALU to its neighbour.

Apart from its input registers, the ALU has no internal registers, so data
travels through the entire ALU in a single cycle, to arrive at the outputs
before the end of the cycle. This means that the ALU can perform a lot of
computation in a single clock cycle. For example, using four of the five ALUs,
an FFT butterfly operation (two complex multiplications and four complex
additions) can be executed in a single clock cycle. The downside of this
approach is that the data has a long path to travel, which limits the clock
speed of the design.
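
The effect of such a long path on the clock speed can be illustrated with a
small timing model. The delays below are invented for the example and do not
describe the real Montium, and register overhead is ignored. Without registers
between the units, the cycle time is the sum of all delays on the path. A
hypothetical variant with a register after every unit is included for
contrast. In that variant only the slowest unit determines the cycle time.

```python
def max_clock_mhz(delays_ns, registered):
    """Maximum clock frequency in MHz for a chain of units.

    Without registers between the units, a value has to pass through
    all of them within one cycle, so the cycle time is the sum of the
    delays. With a register after every unit, only the slowest unit
    limits the cycle time. Register overhead is ignored.
    """
    cycle_ns = max(delays_ns) if registered else sum(delays_ns)
    return 1000.0 / cycle_ns


# Made-up delays (in ns) for four chained units inside one ALU.
DELAYS_NS = [2.0, 4.0, 2.0, 2.0]

single_cycle_mhz = max_clock_mhz(DELAYS_NS, registered=False)
registered_mhz = max_clock_mhz(DELAYS_NS, registered=True)
```

With these numbers the unregistered path allows 100 MHz, while the registered
variant would allow 250 MHz, at the cost of a result needing several cycles to
pass through all units.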

\subsubsection{Communication and Configuration Unit}
The communication and configuration unit (CCU) controls communication
with the external world, usually a network-on-chip. During normal operation,
the CCU can take values from the interconnect and stream them out onto the
NoC, or vice versa. Additionally, the CCU can be used from outside the Montium
to start and stop execution and to move configuration registers, sequencer
instructions and memory contents into and out of the Montium.

\subsubsection{Interconnect}
The central part of the Montium is the interconnect, which is a crossbar
of lines, most of whose crossings can be connected. There are a total of ten
global buses in the interconnect, to which every input and output port of the
various components can be connected. This way, every output of the
memories, ALUs and CCU can be routed to every input, provided that
enough global buses are available. Additionally, each pair of memories
belonging to a specific ALU can be routed directly to the inputs and
outputs of that ALU, without requiring a global bus.
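
The constraint imposed by the global buses can be sketched as a simple
allocation problem for one cycle. The port naming and the function names are
hypothetical, and the sketch assumes that one bus carries the value of one
source port and can feed any number of inputs. Only the number of buses and
the direct memory-to-ALU connections are taken from the description above.

```python
NUM_GLOBAL_BUSES = 10  # from the text

# A port is modelled as (kind, alu_index, name), for example
# ("mem", 0, "left") for a memory that belongs to ALU 0.


def is_local(source, destination):
    """A memory and the ALU it belongs to can be connected directly."""
    kinds = {source[0], destination[0]}
    return kinds == {"mem", "alu"} and source[1] == destination[1]


def allocate_buses(routes, num_buses=NUM_GLOBAL_BUSES):
    """Assign a global bus to every source port that needs one.

    `routes` lists the (source, destination) pairs for one cycle.
    Local routes need no bus, and routes that share a source share a
    bus. Raises ValueError when the cycle needs too many buses.
    """
    assignment = {}
    for source, destination in routes:
        if is_local(source, destination) or source in assignment:
            continue
        if len(assignment) == num_buses:
            raise ValueError("not enough global buses for this cycle")
        assignment[source] = len(assignment)
    return assignment


routes = [
    (("mem", 0, "left"), ("alu", 0, "a")),     # local, no bus needed
    (("mem", 0, "left"), ("alu", 1, "a")),     # needs a global bus
    (("alu", 1, "out1"), ("mem", 2, "left")),  # needs a second bus
    (("mem", 0, "left"), ("alu", 3, "b")),     # shares the first bus
]
buses = allocate_buses(routes)
```

In this example four routes need only two global buses. A cycle that tries to
route more than ten distinct sources over global buses cannot be realised.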

\subsection{Design changes}
Currently, the Montium design is undergoing a major overhaul. During work with
the original design, a number of flaws and suboptimal constructs were found.
In particular, the ALUs are capable of performing a large number of operations
in a single cycle, but since these operations are performed sequentially
within that cycle, this severely limits clock speeds. In the new design, the
number of ALUs is reduced, but each ALU is subdivided into multiple function
units that operate in parallel.

This approach requires computations to be properly pipelined to use all those
function units efficiently in parallel, but since data travels through only a
single function unit in each cycle, it allows for much higher clock speeds
than the old design.

Also, the old Montium has only very limited support for control flow, which
makes data-dependent control and synchronization hard to program and calls
for improvement.

During my internship I have mainly been working with the old Montium
design, and unless otherwise stated, that is what is meant when
referring to the ``Montium''. Some of the work has been done with the new
design in mind, but I have actually worked with the new design
only during the final weeks of my internship. See section
\ref{Pipelining} for more details.