Description
Neural Processing Units (NPUs) are becoming as common as GPUs in embedded SoCs, but Linux lacks a unified NPU subsystem. Current drivers are fragmented, vendor-specific, and often tuned only for vision inference (YOLO, ResNet). At the same time, newer workloads such as LLMs and multimodal models demand more flexible memory management, scheduling, and runtime integration.
This talk demystifies how NPUs work at the subsystem level, from DMA mapping and SRAM partitioning to command queue management. It walks through case studies of deploying both vision models (YOLOv8) and LLMs (LLaMA3-tiny) on NPUs, highlighting where current Linux subsystems (DRM, V4L2, accel, AI/ML proposals) succeed and where they fall short.
Finally, it shows how Edgeble has applied these lessons in real deployments, extending its SoC- and PCIe-based NPU engine drivers with quantization and model scheduling support.
The session aims to start a discussion around a more unified Linux NPU subsystem, drawing parallels to the evolution of GPU support in Linux, and invites collaboration from kernel developers, hardware vendors, and OSS communities.