3.1 Parallel Architecture
HEVC is better suited to parallel processing than H.264/AVC because it further reduces data dependencies. The proposed architecture improves performance through parallelism: a multi-stage pipeline with parallel filters. To support parallel filter processing, a dedicated memory allocation algorithm is proposed for efficient on-chip SRAM access.
Fig. 3 shows the overall block diagram of the proposed architecture, which consists of the following components: a BS calculator, a parallel dE calculator, a β and tc calculator, parallel filters, and SRAM. The SRAM is divided into 16 blocks, which store the current pixels of a 32x32 block together with their left and above neighboring pixels. Each SRAM block stores eight 4x4 pixel blocks. Data is read from the SRAM blocks for the filtering computation and written back to them afterwards. The read/write SRAM block positions are determined by the proposed memory allocation algorithm.
Fig. 3. Block Diagram of the Proposed Architecture
The proposed architecture operates as a 4-stage pipeline, as shown in Fig. 3, where the circled numbers denote the pipeline stages. In stage 1, data is read from SRAM and the BS calculation is performed. In stage 2, the β and tc calculation and 8 parallel dE calculations are performed. In stage 3, 4x4 block filtering is performed in parallel. In stage 4, the filtered results are written back into the SRAM blocks.
3.2 Memory Allocation Algorithm
Conflict-free data access to the on-chip SRAM by the parallel filters is essential for maintaining high performance. For the proposed parallel filter architecture, we present an efficient memory allocation algorithm that avoids such conflicts. With this algorithm, each filter is guaranteed to read from and write to 2 of the 16 SRAM blocks in parallel without conflicting with the other filters.
Fig. 4. 32×32 Pixel Block (Y, Cb, Cr)
The filtering boundary of a deblocking filter is either vertical or horizontal. As shown in Fig. 3, the proposed architecture has 16 SRAM blocks. Fig. 4 shows the 32x32 block awaiting the parallel deblocking filter operation, where each named block is a 4x4 block. Our goal is to allocate these 4x4 blocks to the 16 SRAM blocks such that no data conflicts occur between the filters during the parallel filter computation.
As explained above, each SRAM block stores eight 4x4 pixel blocks. In Fig. 4, each block is named by two letters, such as AA or BC. Blocks with the same first letter belong to the same vertical filtering group. For example, in Fig. 4, the AA, AB, AC, AD, and AE blocks belong to the same vertical filtering group, and all of them share the first letter 'A'. Fig. 13 shows an example of the final data allocation, where the blocks starting with 'A' are stored in the first row of the SRAM blocks, spread evenly across them without any overlap. Therefore, the vertical filtering of this data group can be performed by 8 filters accessing different SRAM blocks simultaneously without conflicts.
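This spreading of a filtering group across the banks resembles classic skewed bank interleaving. The sketch below is illustrative only, not the paper's exact letter-based placement: it distributes two-letter blocks over 8 hypothetical banks so that every bank holds each first letter and each second letter exactly once.

```python
# Hypothetical skewed allocation, simplified to 8 banks of 8 slots each:
# bank b holds, for each first-letter index i, the block whose second-letter
# index is (i + b) % N.
N = 8
letters = "ABCDEFGH"
banks = [[letters[i] + letters[(i + b) % N] for i in range(N)] for b in range(N)]

# A vertical filtering group (fixed first letter) lands in N distinct banks,
# so N filters can fetch their operands simultaneously without bank conflicts.
for first in letters:
    holders = {b for b, blocks in enumerate(banks)
               for name in blocks if name[0] == first}
    assert len(holders) == N
```

The same argument applies symmetrically to a horizontal group (fixed second letter), since the second-letter indices within each bank are also all distinct.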
Our memory allocation algorithm provides a deterministic way to decide the SRAM location of each 4x4 block from its two-letter name. The second letter of the block name denotes its horizontal filtering group.
In HEVC, the vertical filtering shown in Fig. 5 is performed first, followed by the horizontal filtering shown in Fig. 6. For example, blocks AA, AB, AC, AD, and AE are used for one vertical filtering, and BA, BB, BC, BD, and BE for another, as shown in Fig. 5.
Fig. 5. 32×32 Pixel Block - Vertical Filtering
Fig. 6. 32×32 Pixel Block - Horizontal Filtering
For horizontal filtering, AA, BA, CA, DA, and EA belong to one filtering group, and AB, BB, CB, DB, and EB to another. Blocks in the same filtering group must be stored in different SRAM blocks to avoid data access conflicts during the parallel filter computation. By applying these rules to the 32x32 pixel block using 8 letters as shown in Fig. 4, we eventually obtain the target allocation shown in Fig. 13, in which all pixel blocks are stored in the SRAM blocks such that no access conflicts occur during the filter computation.
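The two rules reduce to a single invariant on each SRAM bank; a hypothetical checker (the bank contents below are illustrative, not the exact Fig. 13 layout) makes it explicit:

```python
def conflict_free(bank):
    """True if no two 4x4 blocks in this SRAM bank share a vertical group
    (same first letter) or a horizontal group (same second letter)."""
    firsts = [name[0] for name in bank]
    seconds = [name[1] for name in bank]
    return len(set(firsts)) == len(firsts) and len(set(seconds)) == len(seconds)

# "AA" and "AB" share the vertical group 'A'; storing them in one bank
# would serialize two parallel filters on the same SRAM port.
assert conflict_free(["AA", "BB", "CC"])
assert not conflict_free(["AA", "AB"])
```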
Fig. 7. Memory Allocation Step 1
Figs. 7 to 12 illustrate a deterministic algorithm that reaches the conflict-free memory allocation shown in Fig. 13. First, as shown in Fig. 7, allocate the B-, C-, and D- blocks with gaps of 4 slots between the rows. Note that all blocks are allocated to the SRAM blocks (S_0 to S_15) such that no SRAM block holds two names with the same letter in either the first or the second position; therefore no SRAM access conflicts occur. After this allocation, the A- and E- blocks remain to be allocated, as shown in Fig. 8.
Fig. 8. Memory Allocation Step 2
Fig. 9. Memory Allocation Step 3
Second, as shown in Fig. 9, for each SRAM column with an -A block assigned, such as SRAM_1, put AE in the first row; for each column with an -E block, put EA in the fifth row. As shown in Fig. 10, fill in the remaining letter in each column that has only one empty slot. Finally, as shown in Fig. 11, fill the remaining two rows with AA and EE in one column, and with AE and EA in the other three columns. After all the luminance blocks are allocated, the Cb and Cr blocks are allocated in a similar manner: as shown in Fig. 12, allocate the GG blocks to the SRAMs, then the FH and HF blocks. The remaining columns are allocated so that the first and second letters do not overlap, as shown in Fig. 13.
Fig. 10. Memory Allocation Step 4
Fig. 11. Memory Allocation Step 5
As long as the letters in each position do not overlap within a column, many solutions exist. In this paper, we present a deterministic algorithm that finds a working solution quickly.
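That many solutions exist can also be seen by searching for one directly. The hypothetical sketch below (not the paper's step order) uses simple backtracking on a reduced 4-letter instance to find a placement satisfying the no-overlap invariant:

```python
from itertools import product

def allocate(letters):
    """Backtracking: place every two-letter block into a bank of capacity
    len(letters) so that no bank repeats a first or a second letter."""
    n = len(letters)
    blocks = [a + b for a, b in product(letters, letters)]
    banks = [[] for _ in range(n)]

    def place(k):
        if k == len(blocks):
            return True
        name = blocks[k]
        for bank in banks:
            if (len(bank) < n
                    and all(name[0] != x[0] and name[1] != x[1] for x in bank)):
                bank.append(name)
                if place(k + 1):
                    return True
                bank.pop()  # undo and try the next bank
        return False

    return banks if place(0) else None

banks = allocate("ABCD")
```

For this small instance the first-fit search already succeeds without backtracking; the paper's deterministic construction avoids the search entirely.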
Fig. 12. Memory Allocation Step 6
Using the proposed memory allocation algorithm with the 8-filter architecture, the execution time for the filter operations on the 96 boundaries of a 32x32 pixel block is 12 clock cycles. Adding the 6 cycles of pipeline start-up and drain delay, the total execution time is 18 clock cycles.
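The cycle count follows directly from the degree of parallelism. A small arithmetic sketch (the 6-cycle pipeline overhead is taken from the figure above; the parameterization by filter count is our generalization):

```python
import math

def total_cycles(boundaries=96, parallel_filters=8, pipeline_overhead=6):
    """Filtering cycles for one 32x32 block: each cycle, the parallel
    filters each process one boundary, plus pipeline start/end delays."""
    return math.ceil(boundaries / parallel_filters) + pipeline_overhead

assert total_cycles() == 18  # 96 / 8 = 12 compute cycles + 6 overhead
```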
Fig. 13. Memory Allocation Step 7
The memory allocation algorithm can be applied to other degrees of filter parallelism in a similar fashion. If the number of SRAM blocks is increased, the number of parallel filters can be increased proportionally; for example, 16 parallel filters can be used with 32 SRAM blocks.