1. Introduction
High-energy physics experiments generate vast volumes of data. It is not possible to archive all those data for later offline processing. Two approaches are used to solve that issue. In triggered data acquisition systems, the incoming data are buffered until the “level one (L1) trigger” decides whether they are useful. The L1 trigger decision must be elaborated quickly to keep the necessary capacity of the data buffers reasonable, which usually requires high-performance FPGA chips. While the local trigger decision is being elaborated and the L1 trigger decision is awaited, the data may be prepared for sending to the data acquisition system (DAQ).
However, using the triggered approach is not always possible. The papers [
1,
2] predicted the increasing role of triggerless DAQs eight years ago. Some experiments already use or plan to use this approach: LHCb [
3], AMBER [
4], and CBM [
5]. In that approach, the raw data from the detector are submitted only to elementary processing (e.g., zero suppression) and then delivered to the system responsible for identifying the interesting events—the event selector. The event selector may be built based on ATCA architecture [
6] or as a tightly connected computer network [
7,
8]. The second solution may be more scalable and upgradable, as it uses standard servers as the main component. In that solution, it is still necessary to deliver the data from the front-end electronics (FEE) via optical links to the computers serving as the entry nodes of the event-building network. That may be accomplished with specialized FPGA-based data concentrator boards connected to the PCIe bus in the server [
9,
10] and equipped with multiple optical transceivers.
The PCIe blocks in modern FPGA chips use a wide datapath to fully utilize the bandwidth offered by the PCIe interface at a reasonable clock frequency. For example, the PCIe Gen3 block working with 8 lanes requires delivering 256 bits of data at a frequency of 250 MHz. For 16 lanes, the data should be delivered as 512-bit words at 250 MHz [
11].
The overall volume of data should be limited. Therefore, the data about the physics events recorded by the detectors (so-called “hit messages”) are transmitted as data words that are as short as possible. For example, in the CBM experiment [
12], the STS detector readout chip SMX2 delivers the hit data as 24-bit words [
13], which, after preprocessing and adding the source ID, will be extended to 32-bit words. Similarly, the detector data link (DDL) protocol, used in the ALICE experiment at CERN LHC [
14], is oriented toward transmitting 32-bit words.
Concatenating the shorter words received from links into the wider word used by PCIe seems simple. In the case of eight 32-bit links concentrated to a 256-bit PCIe word, a constant group of bits could be allocated for each link. However, such a solution would be inefficient. The links do not transmit the hit data all the time. The data stream may contain additional words with time markers, control, status, and diagnostic information. If there are no data to be sent now, the “empty” words are transmitted. Generally, the data may be divided into “DAQ data”, which should be transmitted to DAQ, and “non-DAQ data”, which should be discarded. If the PCIe word consists of bit groups permanently assigned to particular links, such discarded data would create “holes” in the data transmitted via PCIe and then stored in the DMA memory buffers in the event selector computer. That results in wasting the PCIe bandwidth and memory and may reduce the performance of data processing.
The “non-DAQ” input data words should be removed before packing the data into the wide output words. In that case, however, the output word must be assembled from the received “DAQ” words in the process of data concentration. The “DAQ” words should be packed in the same time order as they are received from the links.
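The desired behavior can be illustrated with a minimal Python sketch (the function name and data representation are illustrative, not taken from any firmware): non-DAQ words are dropped, and the surviving DAQ words fill fixed-size output records in arrival order.

```python
def concentrate(cycles, record_len=8):
    """Pack DAQ words from successive link cycles into fixed-size records.

    `cycles` is a list of per-cycle link outputs; each element is a list of
    (word, is_daq) tuples, one per link, in link order. Non-DAQ words are
    dropped; DAQ words keep their arrival order (cycle by cycle, link 0..N-1).
    """
    records, current = [], []
    for links in cycles:
        for word, is_daq in links:
            if not is_daq:
                continue          # non-DAQ words are discarded, leaving no holes
            current.append(word)
            if len(current) == record_len:
                records.append(current)
                current = []      # further words go to the next record
    return records, current       # complete records + the partially filled one

# Two cycles on four links; only DAQ words survive, in time order:
cycles = [
    [(10, True), (11, False), (12, True), (13, True)],
    [(14, True), (15, True), (1, False), (2, True)],
]
full, partial = concentrate(cycles, record_len=4)
# full == [[10, 12, 13, 14]], partial == [15, 2]
```

Note that the record boundary falls in the middle of the second cycle: the leftover words (15 and 2) are exactly what the hardware must hold in an auxiliary record until the next concentration cycle.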
Before introducing the method proposed in [
15] and further improved in this article, it is worth quickly reviewing alternative solutions. Their number is limited because, despite the high importance of efficient data concentration, it is very sparsely described in most papers about DAQ solutions.
Two alternative approaches are presented in [
15]. The first one uses the scanning of
N input links at a frequency
N times higher than the data rate. All data may be checked and written to the appropriate location in the output word, or discarded before the new dataset arrives. That approach has been successfully used for the initial concentration of data in the CBM STS detector [
12]. However, it cannot be used for high data rates or a large number of concentrated inputs.
The second method independently assembles complete output records from the received “DAQ” data in each input channel. The complete records are then transmitted to the output with round-robin priority. In this method, data from low-rate channels may be significantly delayed (until their output record is filled) compared to data from high-rate channels. The practical usage of this method is not explicitly described in the literature. However, as shown in [
15], it may be found in publicly available firmware sources for the FELIX board used by the ATLAS experiment at CERN [
16].
An interesting solution for concentrating eight 32-bit data streams into a 256-bit output has been developed for the LHCb Silicon Pixel Detector [
17]. It uses a three-level binary tree of blocks called “2-to-1 encoders”, creating the final “8-to-1 encoder” (see
Figure 1). The data are concentrated first into four 64-bit words, then into two 128-bit words, and finally to a single 256-bit word. Each “2-to-1 encoder” has a relatively complex internal structure, as shown in
Figure 2.
Such relatively complex logic was feasible because the concentration worked at a low clock frequency of 30 MHz.
The third method has been introduced in [
15] and assumes concentration by properly routing the data to the desired position in the output record. In that method, neither clock-frequency multiplication nor prolonged buffering of data is required. Such routing may be performed with an interconnection network, a very mature technology [
18], widely used for data concentration in telecommunications and data processing systems [
19,
20]. However, the data concentration in those applications aims at transmitting the data via a reduced number of channels [
21] and differs from the data concentration in DAQ described earlier. Therefore, the problem studied in [
15] and this paper is not a proposal of a new interconnection network but the adaptation of a well-known network architecture to that specific task and the proof of its correctness.
1.1. Data Concentration for DAQ with Interconnection Network
An example of data concentration with an interconnection network is shown in
Figure 3.
The data from the links are fed into the concentrating network, which removes the non-DAQ words and delivers the DAQ words to the consecutive locations in the output record. The output record may be partially filled at the end of the concentration cycle. Therefore, the location of the first DAQ word must be selectable. Suppose that, after filling the output record, there are still DAQ words at the link outputs to be received. In that case, they must be stored in an additional auxiliary record, from which they are copied at the beginning of the next concentration cycle. The concentrating network may be built as a multistage interconnection network [
22], consisting of switches with two inputs and two outputs (see
Figure 4), transmitting the data transparently (“bar” mode) or swapping them (“cross” mode).
To keep track of the current occupancy of the output and auxiliary records and to configure the concentrating network according to the availability of the DAQ words at inputs, an additional “controller” block is needed, which is not shown in
Figure 3.
It counts the inputs delivering the DAQ words (“active inputs”) and assigns them the consecutive free positions in the output record. After the last position is assigned, the next active inputs are assigned the positions in the auxiliary record.
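The controller's position-assignment rule described above can be sketched in Python (a sketch under the assumption of an N-input concentrator with an N-word output record; the function name is illustrative, the real controller is VHDL):

```python
def assign_positions(daq_flags, occupancy, record_len=8):
    """Assign consecutive record positions to the active inputs.

    `daq_flags[i]` is True when input i delivers a DAQ word; `occupancy`
    is the number of already-filled positions in the output record.
    Positions < record_len land in the output record; positions
    >= record_len land in the auxiliary record (at position - record_len).
    Returns (assignments, output record complete?, new occupancy).
    """
    assigned, pos = {}, occupancy
    for i, active in enumerate(daq_flags):
        if active:
            assigned[i] = pos     # next free position, in input order
            pos += 1
    record_done = pos >= record_len
    new_occupancy = pos - record_len if record_done else pos
    return assigned, record_done, new_occupancy

# Eight inputs, four of them active, output record already filled up to 6:
assigned, done, occ = assign_positions(
    [True, False, True, True, False, False, True, False], occupancy=6)
# assigned == {0: 6, 2: 7, 3: 8, 6: 9}: inputs 3 and 6 overflow into the
# auxiliary record (its positions 0 and 1); done is True; occ == 2
```

Since the number of inputs equals the record length, the overflow can never exceed one auxiliary record, so a single auxiliary record suffices.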
The development of a data concentrator with eight inputs was presented in [
15]. Initially, the 8 × 8 Beneš network was used as the concentrating network, as it can perform any data permutation. However, finding the proper configuration of switches for a particular permutation is a complex problem. Therefore, the brute-force approach was used, and all needed configurations were stored in a table. During the development, it was found that the full Beneš network is not needed for data concentration. Therefore, the concentrator based on an 8 × 8 “reduced Beneš network” was finally proposed as the optimal solution (see
Figure 5).
However, concentrating data from more input streams is needed in certain applications. An example may be a concentration of 32-bit words to a 512-bit output record (e.g., for 8-lane PCIe Gen 4 or 16-lane PCIe Gen 3) or a concentration of 16-bit input words to a 256-bit output record.
Therefore, a concentrating network with a higher number of inputs and outputs is needed.
2. Attempt to Extend the Previous Concentrator
The basis for the attempt to directly extend the previous concentrator was the assumption that the “reduced Beneš network” (described in [
15]) of any size could concentrate data. However, without formal proof, this hypothesis could only be checked by explicitly testing all possible settings of switches.
For a 16 × 16 network, the number of switches is 32, so the number of possible configurations is 2^32 (ca. 4 billion). The tools developed in [15] have been appropriately extended, and all possible configurations have been checked. Indeed, it appeared that such a network can perform the concentration task. However, the problem is the considerable size of the table storing the necessary configurations of the switches, which must be used in the concentrator controller. There are 32 switches to be configured, and the configuration depends on the “DAQ word” flags of 16 inputs and the 4-bit number of occupied words in the output record. Therefore, storing the configuration in the table requires a memory with 2^20 32-bit words. Implementing it in an AMD/Xilinx FPGA would consume 1024 BRAMs or 128 UltraRAMs, which is unacceptable.
A solution could be transforming that table into a combinational function. An attempt was made to use the generated configuration table as a truth table and perform logic synthesis and optimization. The synthesis was attempted with two commercially available tools, AMD/Xilinx Vivado [23] and Intel/Altera Quartus [24], and with the open-source Yosys [25] environment. Unfortunately, none of those attempts resulted in combinational logic of reasonable complexity. Vivado 2022.2 implemented it as combinational logic, consuming 524,189 LUTs in an UltraScale+ FPGA. Quartus 21.1 implemented it as a memory, consuming exactly 2^20 × 32 = 33,554,432 memory bits. Therefore, this approach was abandoned and replaced with the one based on the analysis of the concentration network.
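The table-size arithmetic can be checked directly (a sketch; the variable names are illustrative):

```python
# Configuration table for the 16x16 (4-layer) network:
inputs, switches = 16, 32      # 16 DAQ-word flags, 32 switches to configure
occupancy_bits = 4             # 4-bit fill level of the output record
address_bits = inputs + occupancy_bits   # 20-bit table address
words = 2 ** address_bits                # 1,048,576 entries
total_bits = words * switches            # one configuration bit per switch

assert words == 1_048_576
assert total_bits == 33_554_432   # matches the Quartus memory usage above
```

The 1024-BRAM and 128-UltraRAM figures quoted in the text follow from mapping this 2^20 × 32-bit memory onto the fixed geometries of those blocks.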
3. Analysis of the Concentrating Network
The concentrating network described in [
15] as the “reduced Beneš network” is, in fact, the “baseline network” [
22] with a bit-reversed order of outputs. In the following, it will be referred to as a “baseline network with reversed outputs”, or BNRO for short.
The topology of this network for different numbers of layers (and, correspondingly, different numbers of inputs and outputs) is shown in
Figure 6,
Figure 7 and
Figure 8.
The structure of the baseline network is regular. The network with N + 1 layers may be created from two baseline networks with N layers by adding an input layer of switches in which output 0 is connected to the “upper network” and output 1 is connected to the “lower network” (see
Figure 9).
This construction rule also defines a simple and unambiguous routing algorithm. To route the data through the N-layer BNRO network from input k to output m, we:
- Start from input k of the network.
- Compare bit 0 of the input number with bit 0 of the target output number. The switch should be set to the “bar” mode if the bits are equal; if they differ, the switch should be set to the “cross” mode. After setting the switch to the suitable mode, find the right switch and its input in the next layer (l = 1).
- Repeat the following steps in a loop until the output of the network is reached:
  - In layer l, compare the l-th bit of the target output number and the network input number. The switch should be set to the “bar” mode if the bits are equal; if they differ, the switch should be set to the “cross” mode. After setting the switch to the suitable mode, find the right switch and its input in the next layer.
  - Increase l.
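The per-layer bit-comparison rule can be sketched in Python (the function name is illustrative; the step of locating the physical switch in the next layer, which depends on the wiring, is omitted here):

```python
def switch_modes(k, m, layers):
    """Switch modes along the path from input k to output m of an
    N-layer BNRO network: in layer l, the switch is in "bar" mode when
    bit l of k equals bit l of m, and in "cross" mode otherwise."""
    return ['bar' if ((k >> l) & 1) == ((m >> l) & 1) else 'cross'
            for l in range(layers)]

# Routing input 5 (binary 101) to output 3 (binary 011) in an 8x8 network:
modes = switch_modes(5, 3, 3)
# modes == ['bar', 'cross', 'cross']  (bits agree only at position 0)
```

When k == m, all layers stay in "bar" mode, so the data pass straight through; each differing bit flips exactly one layer to "cross".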
The above algorithm enables routing any input of the BNRO network to any of its outputs. However, it does not guarantee that the BNRO network can perform every possible data permutation. The routing is impossible if the data delivered to the two inputs of a particular switch should be routed to the same output of that switch, which results in a collision.
The results obtained in [
15] and the experiments with 16 × 16 (4-layer) network in
Section 2 suggest that such a collision should not happen in the BNRO network when used for data concentration. However, formal proof is needed.
4. Proof of the BNRO Suitability for Data Concentration
The suitability of the BNRO network for data concentration may be proven in various ways. Two of them are presented below.
4.1. Proof Based on the Mathematical Induction
We can take the network with one layer (N = 1) and check that it can perform the concentration task. That is trivial for two inputs and outputs: the network can propagate the data without swapping them (corresponding to writing the output record from location 0) or swap them (corresponding to writing the output record from location 1). If one or both inputs deliver a non-DAQ word, the corresponding data may simply be ignored. So, the BNRO with N = 1 layer can concentrate data (see
Figure 10).
Then, we must prove that, if a BNRO network with N layers can concentrate data, the network with N + 1 layers, created as shown in
Figure 9, can also do it.
Let us assume that the first DAQ word should be routed to output t. The set of DAQ words received by the network may be split into two subsets with sizes differing by no more than 1. If t is even, we route the first DAQ word to the upper subnetwork. If t is odd, we route that word to the lower subnetwork. The next DAQ words (if available) must be routed alternately to both subnetworks. Such a routing is always possible.
Suppose that the currently handled DAQ word arrives at a switch whose other input receives a non-DAQ word. In that case, we can select the switch output and, thereby, the upper or lower subnetwork as required.
Suppose that the currently handled DAQ word arrives at a switch whose other input also receives a DAQ word. In that case, we may route it to the required subnetwork, and the next DAQ word will be automatically routed to the opposite subnetwork, fulfilling the requirement of alternate routing.
As each subnetwork can perform the data concentration, both the even and the odd DAQ words may be routed to consecutive (in the modulo 2^N sense) outputs of the appropriate subnetworks.
If t is even, we configure the upper subnetwork so that the even DAQ words are routed to its outputs starting from output t/2, and the lower subnetwork so that the odd DAQ words are routed to its outputs starting from output t/2. In the whole network, those data will appear at consecutive (in the modulo 2^(N+1) sense) outputs starting from output t. An example of such concentration for a 4-layer BNRO is shown in
Figure 11.
If t is odd, we configure the lower subnetwork so that the even DAQ words are routed to its outputs starting from output (t − 1)/2, and the upper subnetwork so that the odd DAQ words are routed to its outputs starting from output (t + 1)/2. Those data will appear at consecutive (in the modulo 2^(N+1) sense) outputs starting from output t in the whole network. An example of such concentration for the four-layer BNRO is shown in
Figure 12.
Based on the mathematical induction principle, the above proves that the BNRO with any number of layers can concentrate data.
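The routing plan used in the induction step can be illustrated numerically (a sketch; the function name is illustrative, and the mapping of even whole-network outputs to the upper subnetwork follows the recursive construction described above):

```python
def destinations(flags, t):
    """Target outputs under concentration: the j-th active input (flags[i]
    is True) is sent to output (t + j) mod 2^N of the whole network."""
    n = len(flags)
    dest, j = {}, 0
    for i, active in enumerate(flags):
        if active:
            dest[i] = (t + j) % n
            j += 1
    return dest

# Four inputs (N = 2), three DAQ words, output record filled up to t = 3:
dest = destinations([True, False, True, True], t=3)
# dest == {0: 3, 2: 0, 3: 1}

# Induction-step split: even whole-network outputs belong to the upper
# subnetwork (its output m // 2), odd outputs to the lower one:
upper = {i: m // 2 for i, m in dest.items() if m % 2 == 0}
lower = {i: m // 2 for i, m in dest.items() if m % 2 == 1}
# upper == {2: 0}, lower == {0: 1, 3: 0}
```

As the proof requires for odd t, the first DAQ word (input 0) goes to the lower subnetwork, and the following words alternate between the subnetworks (lower, upper, lower), each filling its subnetwork's consecutive outputs.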
4.2. Proof Based on Analysis of Collision Possibility
By analyzing the recursive construction of the BNRO network and the topologies of BNRO for small numbers of layers (see
Figure 6,
Figure 7 and
Figure 8), we may find the rules describing the numbers of network inputs that may be connected to the particular input
i of the switch number
r located in layer
l, and numbers of network outputs that may be connected to the particular output
j of that switch. These rules are presented in
Figure 13 and
Figure 14.
With the above rules, we may analyze the conditions resulting in impossible data routing due to a collision in a certain switch. The collision happens when data from both inputs of the particular switch should be routed to the same output, as shown in
Figure 15.
The network inputs accessible from an individual switch input have 2^l consecutive numbers. The groups of numbers associated with switch inputs 0 and 1 are neighboring, so all inputs accessible from the switch have 2^(l+1) consecutive numbers (see
Figure 16a). Therefore, the difference between the numbers of two accessible inputs is not higher than 2^(l+1) − 1.
The outputs accessible from a switch output have numbers spread equally, spaced by 2^(l+1). Therefore, the difference between the numbers of two outputs accessible from the same switch output is not less than 2^(l+1) (see
Figure 16b).
Suppose that a collision in a switch should happen. In that case, the data from two network inputs accessible from the switch (with numbers differing by no more than 2^(l+1) − 1) should be routed to different network outputs accessible from the same switch output. However, the numbers of those network outputs must differ by no less than 2^(l+1). The concentration of data, however, does not insert new data words into the data stream; it may only remove certain unneeded data words from that stream. Therefore, during the concentration, for each pair of data words processed in the same concentration cycle, the difference between the numbers of their destination network outputs must be not higher than the difference between the numbers of their network inputs.
This means that a collision can never happen when the analyzed network concentrates the data, so the BNRO network can perform the concentration task.
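The two bounds that make the argument work can be checked numerically (a sketch; `l` indexes the switch layer as in the text above):

```python
# For a switch in layer l of the BNRO network:
#   - the network inputs accessible from it span at most 2^(l+1) - 1 in index;
#   - the network outputs accessible from one of its outputs are spaced by
#     2^(l+1), so two of them differ by at least 2^(l+1).
for l in range(8):
    max_input_diff = 2 ** (l + 1) - 1
    min_output_diff = 2 ** (l + 1)
    # Concentration never increases index differences, so a colliding pair
    # (input diff <= max_input_diff, output diff >= min_output_diff) would
    # require max_input_diff >= min_output_diff, which never holds:
    assert max_input_diff < min_output_diff
```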
5. Implementation
The concentrator based on the BNRO network was implemented in the VHDL language. Its block diagram is presented in
Figure 17.
Its sources are publicly available under the dual GPL/BSD license in the GitLab repository [
26].
The sources are highly parameterized. The user may define the number of layers in the concentrating network and, therefore, the number of inputs and outputs. It is also possible to define the type of data that are concentrated.
The repository contains a testbench allowing the user to verify the correct operation of the concentrator. Integer values are used as the payload. DAQ words with consecutive integer values are delivered to the consecutive inputs. The user may set the probability of the data being accepted by an input. If the data are not accepted, the input is skipped, and delivery is attempted on the next input. The data leaving the concentrator are written to a file. If the concentrator works correctly, the output file contains consecutive integers. A dedicated Python script checks the content of the generated file.
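The core of such a checker can be sketched as follows (an illustrative reimplementation, not the repository's actual script; the file format of one integer per line is assumed):

```python
def check_consecutive(path):
    """Return True if the file at `path` contains consecutive integers,
    one per line, as the concentrator testbench is expected to produce."""
    with open(path) as f:
        values = [int(line) for line in f if line.strip()]
    # Every value must exceed its predecessor by exactly 1; a "hole" or a
    # reordered word breaks the sequence and signals a concentration error.
    return all(b - a == 1 for a, b in zip(values, values[1:]))
```

An empty or single-value file trivially passes; any dropped, duplicated, or reordered word makes the check fail.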
Additionally, hardware testbenches for the KCU105 [
27] and KCU116 [
28] boards were created for the 16-input concentrator. As in [
15], the setup contains FIFOs for storing the input data to be fed into the concentrator and its output data. The block diagram of the test setup is shown in
Figure 18.
The testbench is controlled with a Python script using the standard uio_pci_generic driver. Both the script and the testbench have been extended to support a higher number of concentrated inputs.
7. Discussion
The data concentration concept presented in [
15] significantly improved the concentration performance compared to the previously used methods, such as high-frequency polling or width conversion in the input channels. It eliminated both the need for very high clock frequencies and the disturbances in the time ordering of the concentrated data.
The disadvantage of that solution was its poor scalability. It was based on an 8 × 8 interconnection network. Therefore, it natively supported up to eight inputs and an output word eight times wider than the input words.
Specific extensions have been proposed in [
15], increasing the number of inputs to 9 or 12, but at the cost of introducing additional clocks with slightly higher frequencies and higher design complexity. Attempts to use a bigger 16 × 16 network were blocked by problems with an excessively large lookup table storing the configuration of switches in the interconnection network.
The new solution described in this paper makes this concept scalable. It proves that the N-layer baseline network with reversed outputs (BNRO), with 2^N inputs and outputs, can correctly concentrate the data. It significantly simplifies the design of such a network, offering an efficient and synthesizable implementation of the BNRO controller. The new method enabled the easy implementation of a concentrator with 4 layers (16 inputs), confirmed in hardware, and with 5 layers (32 inputs), confirmed in simulations.
Currently, the maximum 512-bit size of the output word seems reasonable. Even the planned solutions for PCIe Gen5 assume that width [
29]. With the 16-bit minimum width of the concentrated data (which must contain the payload and the source ID), this gives a number of concentrated inputs equal to 32.
In the concentrating network, nothing prevents adding the 6th layer and increasing the number of inputs to 64 if the width of the output word rises to 1024 in the future.
However, the number of inputs may affect the critical path in the network controller. In the current design, where to route the data from a particular input is determined in a single clock cycle. That operation includes counting the active inputs (those delivering the DAQ words). Then, the target output is calculated based on the active-input number and the current occupancy of the output record (see
Section 3). This results in a roughly linear dependency of the critical path in the controller on the number of inputs. However, the input stage of the concentrator may also be pipelined at the cost of additional latency: the input data propagate through a required number of pipeline registers, which enables counting the active inputs and calculating the target outputs over a few cycles. Finding the optimal solution should be the subject of future research.
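The counting of active inputs mentioned above could, for instance, be split into log2(N) pipeline stages using a prefix-sum tree. A Python sketch of the idea (the function name is illustrative; the real controller is VHDL, and each loop iteration would correspond to one pipeline stage):

```python
def prefix_count(flags):
    """Kogge-Stone-style inclusive prefix sum of the DAQ flags: stage s
    adds the value from 2^(s-1)... i.e., from `s` positions back, doubling
    s each time, so an N-input count completes in log2(N) stages."""
    vals = [int(f) for f in flags]
    n, s = len(vals), 1
    while s < n:
        vals = [vals[i] + (vals[i - s] if i >= s else 0) for i in range(n)]
        s *= 2
    return vals   # vals[i] = number of active inputs among 0..i

# Four inputs, three active:
counts = prefix_count([1, 0, 1, 1])
# counts == [1, 1, 2, 3]: input i is the counts[i]-th active input
```

With `counts[i]` available, the target output of input i is simply `(occupancy + counts[i] - 1) mod 2^N`, so the remaining per-input logic stays shallow regardless of N.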