# Bulk Moving Mechanism on LRU for DRM/TTM

Huang Rui (Ray Huang) <ray.huang@amd.com>

# AGENDA

- Background
- LINUX® GPU Kernel
- GPUVM - GART memory setup (1-level)
- GPUVM - 4-level paging
- Per-VM Buffer Object
- Eviction (Buffer Migration)
- VM Key List Definitions
- New List Operation for Bulk Moving
- LRU Policy for Buffer Migration in Bulk Moving
- Bulk Moving Approach in TTM
- Bulk Moving use case in AMDGPU
- Performance Improvement Data
- Contribution
- Q & A

#### Background

- Who am I?
  - My name is Huang Rui (Ray). I am on the AMD Linux<sup>®</sup> graphics driver team, where for several years I have focused on the kernel driver and libdrm for new ASIC bring-up and new feature development.
  - Patch work profile:
    - https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git
- Why did we propose bulk moving?
  - Profiling the F1 2017 game benchmark showed that the application creates a large number of buffers.
  - The validation and LRU management of these buffers in TTM and the driver infrastructure turned out to be suboptimal for this scenario.
  - This led to a redesign of the buffer migration process in the code.
  - This talk demonstrates practical techniques for profiling and analyzing the scenario efficiently and for identifying the design changes needed to address it.

#### LINUX<sup>®</sup> GPU Kernel

- Linux<sup>®</sup> GPU Kernel for AMD
  - Device Init
  - Interrupt Handle
  - GMC (GFX memory controller)
  - PSP (Platform Security Processor)
  - Display
  - GFX
  - SDMA
  - MM block (UVD/VCE/VCN)





### GPUVM - GART memory setup (1-level)

- GART memory is GPU-visible system memory
  - Allocate the GART table BO in video memory; it holds the mapping to system memory.
  - The GPU reads the page table entries in the GART table BO to translate an address to the physical address (DMA bus address).
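
The flat lookup described above can be sketched in a few lines of C. This is a minimal illustration, not the AMDGPU implementation: it assumes 4 KiB pages and a simplified entry that stores only a page-aligned DMA address plus a valid bit. The names `gart_table`, `gart_map`, `gart_translate` and `GART_PTE_VALID` are hypothetical; real PTE formats carry more flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GART_PAGE_SHIFT 12u                        /* assumed 4 KiB pages */
#define GART_PAGE_SIZE  (1ull << GART_PAGE_SHIFT)
#define GART_PTE_VALID  1ull                       /* hypothetical valid bit */

/* Hypothetical flat (1-level) table: one entry per GPU page. */
struct gart_table {
	uint64_t *pte;
	size_t    num_entries;
};

/* Point GPU page `index` at the page-aligned DMA address `dma_addr`. */
static void gart_map(struct gart_table *t, size_t index, uint64_t dma_addr)
{
	assert(index < t->num_entries);
	assert((dma_addr & (GART_PAGE_SIZE - 1)) == 0);
	t->pte[index] = dma_addr | GART_PTE_VALID;
}

/* Translate a GART offset; returns 0 on success, -1 if the page is unmapped. */
static int gart_translate(const struct gart_table *t, uint64_t gart_offset,
			  uint64_t *dma_addr)
{
	size_t index = (size_t)(gart_offset >> GART_PAGE_SHIFT);

	if (index >= t->num_entries || !(t->pte[index] & GART_PTE_VALID))
		return -1;

	/* Page frame from the entry, byte offset from the original address. */
	*dma_addr = (t->pte[index] & ~(GART_PAGE_SIZE - 1)) |
		    (gart_offset & (GART_PAGE_SIZE - 1));
	return 0;
}
```

With a single level, every translation is one table read indexed by the page number, which is why the layout suits the kernel-only system domain.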



#### **GPUVM - 4-level paging**

- VMID 0
  - System context domain, used only by kernel mode.
  - The GART table uses 1-level paging (flat page table).
- VMID 1 ~ 15
  - Other context domains, used by user mode.
  - The page table is set up when the thread is created and has 4 levels.
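
To make the 4-level walk concrete, the sketch below splits a virtual address into one index per level plus a page offset. The bit widths are assumptions chosen for illustration (4 KiB pages, 512 entries per table, 48-bit addresses); the slides do not state the actual GPUVM block sizes, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VM_PAGE_SHIFT 12u   /* assumed 4 KiB pages */
#define VM_LEVEL_BITS 9u    /* assumed 512 entries per table */
#define VM_LEVELS     4u

/* Index into the table at `level` (0 = root directory, 3 = leaf page table). */
static unsigned vm_level_index(uint64_t va, unsigned level)
{
	unsigned shift;

	assert(level < VM_LEVELS);
	shift = VM_PAGE_SHIFT + VM_LEVEL_BITS * (VM_LEVELS - 1u - level);
	return (unsigned)((va >> shift) & ((1u << VM_LEVEL_BITS) - 1u));
}

/* Byte offset inside the final page. */
static unsigned vm_page_offset(uint64_t va)
{
	return (unsigned)(va & ((1ull << VM_PAGE_SHIFT) - 1ull));
}
```

Each level consumes a fixed slice of the address, so a translation reads four tables in turn, compared with the single read of the flat GART table used by VMID 0.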



#### **Per-VM Buffer Object**

- History:
  - During CPU-bound games, every BO in the BO list (and there are too many) needs to be validated.
  - Solution: reduce the work done by the BO list parser.
- New mechanism (Per-VM BO) that ensures the BO is always valid for command submission.
  - Add flag for UMD (Vulkan)
  - Share reservation object with VM root BO
  - Allow eviction and swap out when sharing same reservation
  - Ensure the Per-VM BO always valid

# **Eviction (Buffer Migration)**

Buffer migration approach:





#### VM Key List Definitions

```
/* BOs who needs a validation */
struct list_head	evicted;

/* PT BOs which relocated and their parent need an update */
struct list_head	relocated;

/* per VM BOs moved, but not yet updated in the PT */
struct list_head	moved;

/* All BOs of this VM not currently in the state machine */
struct list_head	idle;

/* regular invalidated BOs, but not yet updated in the PT */
struct list_head	invalidated;
spinlock_t		invalidated_lock;

/* BO mappings freed, but not yet updated in the PT */
struct list_head	freed;
```

#### New List Operation For Bulk Moving

```
/**
 * list_bulk_move_tail - move a subsection of a list to its tail
 * @head: the head that will follow our entry
 * @first: first entry to move
 * @last: last entry to move, can be the same as first
 *
 * Move all entries between @first and including @last before @head.
 * All three entries must belong to the same linked list.
 */
static inline void list_bulk_move_tail(struct list_head *head,
				       struct list_head *first,
				       struct list_head *last)
{
	first->prev->next = last->next;
	last->next->prev = first->prev;

	head->prev->next = first;
	first->prev = head->prev;

	last->next = head;
	head->prev = last;
}
```
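
The effect of the helper is easiest to see on a small list. The following user-space sketch re-declares a minimal `struct list_head` and repeats the same six pointer updates; `list_init`, `list_append`, `struct item`, `list_ids` and `bulk_move_demo` are hypothetical names introduced only for this illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical helper: make an empty circular list. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Hypothetical helper: link entry just before head, i.e. at the tail. */
static void list_append(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Same six pointer updates as list_bulk_move_tail() above. */
static void list_bulk_move_tail(struct list_head *head,
				struct list_head *first,
				struct list_head *last)
{
	first->prev->next = last->next;	/* unlink the run first..last */
	last->next->prev = first->prev;

	head->prev->next = first;	/* hook the run after the old tail */
	first->prev = head->prev;

	last->next = head;		/* the run's end becomes the new tail */
	head->prev = last;
}

/* Hypothetical payload type embedding a list node. */
struct item {
	int id;
	struct list_head link;
};

/* Copy up to max ids, in list order, into out[]; returns how many were copied. */
static size_t list_ids(struct list_head *head, int *out, size_t max)
{
	struct list_head *p;
	size_t n = 0;

	for (p = head->next; p != head && n < max; p = p->next) {
		struct item *it = (struct item *)((char *)p -
						  offsetof(struct item, link));
		out[n++] = it->id;
	}
	return n;
}

/* Build 1,2,3,4,5, move the run 2..4 to the tail, and check the result. */
static void bulk_move_demo(void)
{
	struct list_head head;
	struct item it[5];
	int ids[5];
	int i;

	list_init(&head);
	for (i = 0; i < 5; i++) {
		it[i].id = i + 1;
		list_append(&it[i].link, &head);
	}

	list_bulk_move_tail(&head, &it[1].link, &it[3].link);

	assert(list_ids(&head, ids, 5) == 5);
	assert(ids[0] == 1 && ids[1] == 5);
	assert(ids[2] == 2 && ids[3] == 3 && ids[4] == 4);
}
```

The run keeps its internal order and the cost is constant regardless of how many entries sit between `first` and `last`.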

### LRU Policy for Buffer Migration in Bulk Moving

TTM uses the Least Recently Used (LRU) algorithm for eviction (buffer migration).

```
static void ttm_bo_bulk_move_set_pos(struct ttm_lru_bulk_move_pos *pos,
				     struct ttm_buffer_object *bo)
{
	if (!pos->first)
		pos->first = bo;
	pos->last = bo;
}

void ttm_bo_move_to_lru_tail(struct ttm_buffer_object *bo,
			     struct ttm_lru_bulk_move *bulk)
{
	dma_resv_assert_held(bo->base.resv);

	ttm_bo_del_from_lru(bo);
	ttm_bo_add_to_lru(bo);

	if (bulk && !(bo->mem.placement & TTM_PL_FLAG_NO_EVICT)) {
		switch (bo->mem.mem_type) {
		case TTM_PL_TT:
			ttm_bo_bulk_move_set_pos(&bulk->tt[bo->priority], bo);
			break;

		case TTM_PL_VRAM:
			ttm_bo_bulk_move_set_pos(&bulk->vram[bo->priority], bo);
			break;
		}
		if (bo->ttm && !(bo->ttm->page_flags &
				 (TTM_PAGE_FLAG_SG | TTM_PAGE_FLAG_SWAPPED)))
			ttm_bo_bulk_move_set_pos(&bulk->swap[bo->priority], bo);
	}
}
EXPORT_SYMBOL(ttm_bo_move_to_lru_tail);
```
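
The bookkeeping in `ttm_bo_bulk_move_set_pos()` is small enough to model on its own. The sketch below is a simplified stand-in, assuming a single memory domain and a BO reduced to its priority; `struct bo`, `struct bulk_pos`, `struct bulk_move`, `MAX_PRIORITY` and both function names are hypothetical and do not match the TTM types field for field.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRIORITY 4	/* hypothetical; stands in for TTM_MAX_BO_PRIORITY */

/* Hypothetical, heavily simplified buffer object. */
struct bo {
	unsigned priority;
};

/* First and last BO of the run recorded for one priority level. */
struct bulk_pos {
	struct bo *first;
	struct bo *last;
};

/* One position pair per priority (only a single domain is modeled here). */
struct bulk_move {
	struct bulk_pos tt[MAX_PRIORITY];
};

/* Same rule as ttm_bo_bulk_move_set_pos(): keep first, always advance last. */
static void bulk_pos_record(struct bulk_pos *pos, struct bo *bo)
{
	if (!pos->first)
		pos->first = bo;
	pos->last = bo;
}

/* Record a BO that has just been moved to the tail of its priority's LRU. */
static void bulk_move_record(struct bulk_move *bulk, struct bo *bo)
{
	assert(bo->priority < MAX_PRIORITY);
	bulk_pos_record(&bulk->tt[bo->priority], bo);
}
```

Because every recorded BO was just placed at the LRU tail, the BOs from `first` to `last` form one contiguous run per priority, which is the precondition for moving them later with a single list splice.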



# Bulk Moving Approach in TTM

```
void ttm_bo_bulk_move_lru_tail(struct ttm_lru_bulk_move *bulk)
{
	unsigned i;

	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
		struct ttm_lru_bulk_move_pos *pos = &bulk->tt[i];
		struct ttm_mem_type_manager *man;

		if (!pos->first)
			continue;

		dma_resv_assert_held(pos->first->base.resv);
		dma_resv_assert_held(pos->last->base.resv);

		man = &pos->first->bdev->man[TTM_PL_TT];
		list_bulk_move_tail(&man->lru[i], &pos->first->lru,
				    &pos->last->lru);
	}

	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
		struct ttm_lru_bulk_move_pos *pos = &bulk->vram[i];
		struct ttm_mem_type_manager *man;

		if (!pos->first)
			continue;

		dma_resv_assert_held(pos->first->base.resv);
		dma_resv_assert_held(pos->last->base.resv);

		man = &pos->first->bdev->man[TTM_PL_VRAM];
		list_bulk_move_tail(&man->lru[i], &pos->first->lru,
				    &pos->last->lru);
	}

	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
		struct ttm_lru_bulk_move_pos *pos = &bulk->swap[i];
		struct list_head *lru;

		if (!pos->first)
			continue;

		dma_resv_assert_held(pos->first->base.resv);
		dma_resv_assert_held(pos->last->base.resv);

		lru = &pos->first->bdev->glob->swap_lru[i];
		list_bulk_move_tail(lru, &pos->first->swap, &pos->last->swap);
	}
}
EXPORT_SYMBOL(ttm_bo_bulk_move_lru_tail);
```

#### Bulk Moving Use Case in AMDGPU

- Legacy approach
  - The AMDGPU driver moves all PD/PT and Per-VM BOs onto the idle list, then moves them to the end of the LRU list one by one. As a result, many BOs are moved to the end of the LRU again and again, which seriously hurts performance.
- Bulk Moving
  - Collect all PD/PT and Per-VM BOs and move them to the end of the LRU list in a single operation instead of one by one. This reduces the cost of buffer moving.
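
The two approaches can be compared on a toy LRU. This is a user-space sketch under stated assumptions: the VM's BOs still form one contiguous run in their original order, and `struct bo`, `touch_one_by_one`, `touch_bulk` and `lru_matches` are hypothetical names, not AMDGPU or TTM functions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_unlink(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

static void list_append(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical, simplified buffer object with its LRU node. */
struct bo {
	int id;
	struct list_head lru;
};

/* Legacy approach: one unlink/relink pair per BO, n pairs in total. */
static void touch_one_by_one(struct list_head *lru, struct bo **bos, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		list_unlink(&bos[i]->lru);
		list_append(&bos[i]->lru, lru);
	}
}

/* Bulk approach: one splice, assuming first..last is a contiguous run. */
static void touch_bulk(struct list_head *lru, struct bo *first, struct bo *last)
{
	assert(first && last);

	first->lru.prev->next = last->lru.next;
	last->lru.next->prev = first->lru.prev;

	lru->prev->next = &first->lru;
	first->lru.prev = lru->prev;

	last->lru.next = lru;
	lru->prev = &last->lru;
}

/* Returns 1 if the LRU holds exactly ids[0..n) in order, else 0. */
static int lru_matches(struct list_head *lru, const int *ids, size_t n)
{
	struct list_head *p = lru->next;
	size_t i;

	for (i = 0; i < n; i++, p = p->next) {
		struct bo *bo;

		if (p == lru)
			return 0;
		bo = (struct bo *)((char *)p - offsetof(struct bo, lru));
		if (bo->id != ids[i])
			return 0;
	}
	return p == lru;
}
```

Starting from an LRU of `1, 2, 3, 9`, where BOs 1 to 3 belong to the VM and 9 does not, both functions leave the list as `9, 1, 2, 3`. The legacy path rewrites pointers once per BO, while the bulk path performs six pointer writes in total.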

## Performance Improvement Data

- GPU: Radeon™ RX Vega
- Video Memory: 8G
- System Memory: 16G
- OS: Ubuntu 18.04 LTS

|                                          | The Talos Principle (Vulkan) | Clpeak (OpenCL™) | BusSpeedReadback (OpenCL™)                                                   |
|------------------------------------------|------------------------------|------------------|------------------------------------------------------------------------------|
| Original                                 | 147.7 FPS                    | 76.86 us         | 0.319 ms (1K), 0.314 ms (2K)<br>0.308 ms (4K), 0.307 ms (8K)<br>0.310 ms (16K) |
| Original + WA (don't move PT BOs on LRU) | 162.1 FPS                    | 42.15 us         | 0.254 ms (1K), 0.241 ms (2K)<br>0.230 ms (4K), 0.223 ms (8K)<br>0.204 ms (16K) |
| Bulk Move                                | 163.1 FPS                    | 40.52 us         | 0.244 ms (1K), 0.252 ms (2K)<br>0.213 ms (4K), 0.214 ms (8K)<br>0.225 ms (16K) |

Bulk move will get the highest FPS and lowest latency.



#### Contribution

- Christian König <Christian.Koenig@amd.com>
  - He raised the original idea of bulk moving for the optimization of buffer migration.
  - I worked with him to deliver the completed solution in the kernel driver.
- Alex Deucher <alexander.deucher@amd.com>
  - Maintains and leads the AMDGPU kernel driver and supported the solution upstream.
  - Actively reviews and refines the quality of the Linux<sup>®</sup> AMDGPU driver stack.



# Thank You and Q&A

# DISCLAIMER AND ATTRIBUTIONS

#### DISCLAIMER

The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD's products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18

©2019 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Linux is a trademark of Linus Torvalds and OpenCL is a trademark of Apple Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.