Mitch Burnett · 4842ebf7
--- a/6-fw-sw-design/6.1.md
+++ b/6-fw-sw-design/6.1.md
+[<< Home](/home#6-firmware-and-software-design-panel-charge-5)
+
+[<< Section 5.4](/5-dbe/5.4)
+
+## 6.1 Beamformer F-Engine Firmware
+An overview of the beamformer digital back end and its operating modes was given
+in [Section 5.1](../5-dbe/5.1). The capability to provide different modes with
+either coarse or narrow-band spectral data products is realized by a two-stage
+channelizer architecture. First-stage digital processing will be done in the
+[RFSoC](../5-dbe/5.2). The following figure shows a high-level block diagram of
+the signal path through the RFSoC for one antenna element. The input signal is
+received over the RFoF link, and the output is routed to the second-stage
+processor through the 100 GbE network and switch.
+
+<div align="center">
+  <img src="../img/dbe/f-engine-blk-diagram.png" width="800">
+
+  Figure 1: RFSoC single antenna signal flow diagram
+</div>
+
+
+The first-stage digital processing (F-engine) includes sampling antenna voltages,
+frequency channelization, and data "packetizing" for network transport.
+The following subsections describe the functionality and implementation of the IP
+used in the F-engine after the ADC. The RFSoC's sampling capabilities,
+configuration, and operation in the context of ALPACA were addressed in
+[Section 5.2](../5-dbe/5.2).
+
+### 6.1.1 Oversampled Polyphase Filter Bank
+A channelizer is a filter bank used to decompose an input signal into bins by
+frequency. In high-performance real-time systems, channelization is commonly
+done with a polyphase filter bank (PFB) rather than a plain fast Fourier
+transform (FFT): the PFB remains computationally efficient while reducing
+spectral leakage and the signal attenuation near frequency bin edges (called
+scalloping loss).
+
+Single-stage PFB implementations follow a conventional design approach where the
+frequency response of the prototype low-pass filter (LPF) has low sidelobes,
+narrow transition bands, and the attenuation specification at the crossover
+point between adjacent channels is -3 dB. This results in a uniform power spread
+for spectra across the full bandwidth of the instrument. The PFB which
+accomplishes this is called a critically sampled (or maximally decimated) PFB
+because the channelizer output sample rate per channel, in samples per second,
+is equal to the effective channel spacing in Hertz [^harris].
+
+When a two-stage channelizer architecture follows this same approach and the
+output products are subsequently processed by a second-stage "zoom" PFB, two
+significant processing artifacts appear in the fine channelized spectrum in
+regions corresponding to crossovers between adjacent coarse channels. These
+undesirable artifacts are scalloping between adjacent fine channels and
+spectral aliasing between fine channels. An example of this behavior is shown
+in the following figure:
+
+<div align="center">
+  <img src="../img/dbe/second_stage_alias.png" width="600">
+
+  Figure 2: System-degrading processing artifacts are present when a critically
+  sampled PFB is followed by a second-stage channelizer. Note the aliased
+  frequency tone (red curve) and the scalloping of the white noise floor, which
+  should be flat (black curve).
+</div>
+
+Although the LPF in the first stage is designed correctly for a single-stage
+channelizer, the scalloping shown is the expected result because the second
+channelizer samples the filter's transition-band frequency response at a finer
+frequency resolution. The spectral images produced by signals near a coarse
+channel boundary are the more severe artifact; they occur because the filter
+was not designed to attenuate aliases to the level of a conventional
+anti-aliasing filter.
+
+To avoid these spectral corruptions when processing in fine "zoom" spectrometer mode,
+the channelizer in the ALPACA F-engine is not the conventional critically
+sampled PFB, but an oversampled PFB (OSPFB). Here, the decimation rate of the
+first-stage channelizer is decreased and the channel passband shape is designed
+to allow for a slight overlap between adjacent channels in their crossover
+region. Following the output of the second-stage critically sampled PFB, the
+fine channels in the overlapped region are discarded, eliminating the unwanted
+processing artifacts. With proper prototype filter design, only a few channels
+of overlap are required. The OSPFB does increase the channelizer output
+sampling rate (compared to the critically sampled case), and this must be
+accounted for in the allocated I/O budget.
+
+The following figure shows a software simulation result comparing the output of
+a second-stage PFB for fine spectrometer mode when the first-stage PFB is either
+critically sampled or oversampled. A signal of interest is placed between
+adjacent channels within the passband. When the first-stage PFB is critically
+sampled, we again see the scalloping and the aliased image of the signal of
+interest. The OSPFB removes these unwanted artifacts, producing a uniform power
+spectrum.
+
+<div align="center">
+  <img src="../img/dbe/os_pfb_mat.png" width="600">
+
+  Figure 3: Improved second-stage spectrum with an OSPFB first stage.
+</div>
+
+
+The architecture for the implementation of an OSPFB can be derived by starting
+with that of a critically sampled PFB. As shown in the following figure, a PFB
+channelizer producing $`M`$ frequency bin outputs can be considered an
+$`M`$-port device where samples are delivered to the $`M`$ branches of a
+polyphase LPF with filter outputs subsequently processed by an $`M`$-point FFT.
+
+<div align="center">
+  <img src="../img/dbe/cspfb-blk-diagram.png" width="600">
+
+Figure 4: Critically sampled PFB block diagram.
+</div>
+
+In the critically sampled case, $`M`$ samples are delivered to the core per
+computation of the $`M`$ branch filter outputs and $`M`$-point FFT. The OSPFB
+reduces the decimation rate to some $`D`$ below the critical rate $`M`$
+($`D < M`$), increasing the sampling rate at each output port by the ratio
+$`M/D`$. In practice this is done by shifting only $`D`$ samples into the core
+per computation of branch filter and FFT outputs.
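+
+As a rough sketch of the rate relationship above, let $`f_s`$ denote the sample
+rate at the channelizer input and $`\Delta f`$ the channel spacing. Both symbols
+are introduced here for illustration and assume the $`M`$ channels uniformly
+span $`f_s`$:
+
+```math
+\begin{aligned}
+\Delta f &= \frac{f_s}{M} && \text{(channel spacing)} \\
+f_{\text{out}}^{\text{CS}} &= \frac{f_s}{M} = \Delta f && \text{(critically sampled, } D = M\text{)} \\
+f_{\text{out}}^{\text{OS}} &= \frac{f_s}{D} = \frac{M}{D}\,\Delta f && \text{(oversampled, } D < M\text{)}
+\end{aligned}
+```
+
+Each output port therefore covers a band $`M/D`$ times wider than the channel
+spacing, which is the source of the overlap between adjacent channels.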
+
+Shifting by $`D`$ samples rather than $`M`$ introduces a frequency-dependent
+phase offset not accounted for by the $`M`$-point FFT kernel. This phase offset
+is compensated by adding a barrel sample rotator (a circular shift) that
+re-aligns the $`M`$-path filter outputs with their respective transform
+inputs. The following figure shows the modified block diagram for the OSPFB
+implementation with the addition of the phase compensation buffer.
+
+<div align="center">
+  <img src="../img/dbe/ospfb-concept-blk-diagram.png" width="600">
+
+Figure 5: Oversampled PFB block diagram.
+</div>
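+
+One way to see why the rotator is needed is sketched below under the usual DFT
+conventions. The symbols $`n`$ (output index), $`k`$ (frequency bin), and
+$`r_n`$ (rotator shift) are illustrative names, and the sign of the phase
+depends on the conventions of a particular implementation. After $`n`$ output
+computations the input has advanced by $`nD`$ samples, so:
+
+```math
+\begin{aligned}
+\phi_k[n] &= e^{\,j 2\pi k n D / M} && \text{(uncompensated phase of bin } k \text{ at output } n\text{)} \\
+r_n &= (nD) \bmod M && \text{(circular shift applied before the FFT)} \\
+N_{\text{states}} &= \frac{M}{\gcd(M, D)} && \text{(number of distinct shifts)}
+\end{aligned}
+```
+
+When $`D = M`$ the phase factor is 1 for every bin and no rotation is needed.
+For an oversampling ratio of $`M/D = 4/3`$, $`r_n`$ cycles through
+$`0,\ 3M/4,\ M/2,\ M/4`$, so only four rotator states are required.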
+
+
+The ALPACA F-engine OSPFB is a custom-developed IP core that accounts for the
+trade-offs between the number of parallel antenna signals and the available
+FPGA resources, resulting in a flexible and efficient implementation. Design
+and implementation of this custom hardware OSPFB IP for the RFSoC have been
+completed for a single antenna input. The following figure shows a complete
+post-synthesis hardware simulation (bit and cycle accurate) of the first-stage
+ALPACA-specified OSPFB (2048 channels, oversample ratio 4/3, 8 polyphase taps)
+followed by a second-stage software 32-point critically sampled PFB. The core
+is functional and working as expected.
+
+<div align="center">
+  <img src="../img/dbe/ospfb-hw-sim-output.png" width="600">
+
+Figure 6: Fine spectrum plot of the ALPACA OSPFB output with a single tone input. Note the lack of scalloping or aliasing.
+</div>
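+
+For the parameters quoted above, the numbers work out as follows. This is a
+back-of-the-envelope sketch: $`N_2`$, $`N_{\text{keep}}`$, and
+$`N_{\text{discard}}`$ are names introduced here, and it assumes that exactly
+the fine channels outside the nominal coarse channel spacing are discarded,
+which the actual design may refine.
+
+```math
+\begin{aligned}
+D &= M \cdot \frac{3}{4} = 2048 \cdot \frac{3}{4} = 1536 \\
+N_{\text{keep}} &= N_2 \cdot \frac{D}{M} = 32 \cdot \frac{3}{4} = 24 \\
+N_{\text{discard}} &= N_2 - N_{\text{keep}} = 32 - 24 = 8
+\end{aligned}
+```
+
+Under that assumption, each coarse channel contributes 24 usable fine channels,
+and 8 fine channels in the overlap region are dropped.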
+
+### 6.1.2 Packetizer
+The document linked below specifies the detailed Ethernet jumbo packet format
+for data transfer from the RFSoC F-engine (digitizer and frequency channelizer)
+to the GPU XB-engine digital beamformer. The data transfer is handled by a
+60-port 100 GbE switch, which performs a large "corner turn" operation to
+reorder data from being sequenced by antenna index to being sequenced by
+frequency channel index. Each F-engine RFSoC handles 12 PAF antennas across all
+frequency channels. After the corner turn, these jumbo packets are re-routed so
+that each GPU processes 25 (out of 1300) frequency channels for all 138 (+6
+spares) antenna signal streams.
+
+Another important aspect of the packetizer format design shown in the linked
+document below is the way frequency channels from each F-engine (each with a
+unique FID index number as shown in the table) are distributed across the 50 GPU
+XB-engines (each with a unique XID index). The processing load for some
+XB-engine processing modes, such as HI observations using a "zoom"
+fine-resolution spectrometer, is so high that the digital back end cannot
+process the full 305.1 MHz bandwidth. Observers in these modes usually have no
+need for the full bandwidth, so reduced-width subband processing is used
+instead. However, if channels were assigned to GPUs (XIDs) sequentially,
+filling up one XID with channels before moving on to the next, a contiguous
+subband would land on only a few XIDs, and the system would fail in these
+computationally demanding modes even with reduced bandwidth. The packet format
+avoids this by "dealing out like playing cards": one channel per XID until all
+50 have one, then starting over for the next 50 channels, and so on. When the
+processing bandwidth is reduced, the load is then still evenly distributed
+across all XIDs rather than concentrated on a few.
+
+[Ethernet Packet Specifications](../uploads/7666d16ef1f7fb6c19a746e2dbf23508/Packet_Format_2.0.pdf)
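+
+The two assignment strategies described above can be written compactly. This
+is a sketch, not the mapping from the packet specification: $`c`$ is a
+zero-based channel index within the processed band, $`K`$ is the number of
+channels per XID, and the 200-channel subband used below is a made-up example.
+
+```math
+\begin{aligned}
+\text{XID}_{\text{seq}}(c) &= \left\lfloor \frac{c}{K} \right\rfloor && \text{(sequential fill)} \\
+\text{XID}_{\text{deal}}(c) &= c \bmod 50 && \text{(one channel per XID, then repeat)}
+\end{aligned}
+```
+
+With $`K = 25`$ and a contiguous subband $`c = 0, \dots, 199`$, sequential fill
+places all 200 channels on XIDs 0 through 7 (25 channels each) and leaves the
+other 42 XIDs idle. Dealing the same subband gives every one of the 50 XIDs
+$`200 / 50 = 4`$ channels.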
+
+### 6.1.3 UDP Framer and 100 GbE
+The UDP framer was developed by the Electronic Systems Design Group of
+Rutherford Appleton Laboratory. This core converts AXI4-Stream data frames
+from the F-engine packetizer into IEEE 802.3 Ethernet and IPv4 packets. The core
+is very flexible, with a receive path, an AXI4-Lite memory-mapped control
+interface, and optional PING and other IPv4 protocol functions. ALPACA will use
+the UDP core only to transmit packets, along with its ARP capabilities for
+destination IP address lookup. The outputs of the UDP core are then sent to our
+custom wrapper IP for the integrated 100G CMAC PHY of the RFSoC. This core
+implements CAUI-4 100G using RS-FEC (Reed-Solomon forward error correction) for
+use on a 100GBASE-SR4 link.
+
+The output data rate from each of the 12 RFSoCs will be 81.8 Gbps. After being
+distributed to the 25 HPCs (50 GPUs), the rate drops to 39.3 Gbps per HPC,
+carried over two 100 GbE NICs per HPC.
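+
+The two rates quoted above are consistent with each other, as a quick check
+shows. The check assumes traffic is spread evenly across the HPCs and across
+the two NICs in each HPC, which the text does not state explicitly:
+
+```math
+\begin{aligned}
+12 \times 81.8\ \text{Gbps} &= 981.6\ \text{Gbps} && \text{(aggregate F-engine output)} \\
+\frac{981.6\ \text{Gbps}}{25} &\approx 39.3\ \text{Gbps} && \text{(per HPC)} \\
+\frac{39.3\ \text{Gbps}}{2} &\approx 19.6\ \text{Gbps} && \text{(per NIC, if split evenly)}
+\end{aligned}
+```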
+
+[Section 6.2 >>](./6.2)
+
+### Footnotes
+[^harris]: F. J. Harris, Multirate Signal Processing for Communication Systems.
+Upper Saddle River, NJ, USA: Prentice Hall PTR, 2004. 
\ No newline at end of file