Azure Edge FPGA and its support for DNNs

This post comes out of another recent discussion with a colleague, this time about deploying YOLOv4 on an Azure Edge device with an FPGA. Our team is trying to deploy a YOLOv4 model on an Azure Edge device and hopes to take advantage of the acceleration the FPGA provides. However, the official Azure Edge documentation states that only five types of DNN models are supported. The team would like to know whether YOLO can be supported, and why or why not.

Based on my understanding of FPGAs and YOLO, the short answer is no: we can’t deploy our current YOLOv4 detector on Microsoft's FPGA chips without extra effort. Why?

How does an FPGA speed up deep learning? Unlike a CPU or GPU, which run whatever software they are given, an FPGA has the algorithm implemented in the hardware itself. So the FPGA provided by Microsoft (an Intel device, according to Microsoft's documentation) has gone through the following process: 1. start with a blank FPGA with no algorithm on it => 2. engineers use a hardware description language (e.g. Verilog/VHDL) to implement CERTAIN algorithms on the chip => 3. package it. In this case, only five DNN models are supported (ResNet-50, VGG-16, etc.). This means only these five network architectures were implemented on the chip, so any other architecture will not run on the FPGA as shipped.

So what is YOLO? YOLO is a family of object detectors based on deep learning. The structure is shown below. Note the backbone inside YOLO: that is where most of the computationally expensive convolutions happen. The current YOLOv4 uses CSPDarknet-53 as its backbone. Unfortunately, CSPDarknet-53 is not among the five DNN models the FPGA supports. Therefore, we can’t run our current YOLOv4 directly on this FPGA chip.
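The reasoning above boils down to a simple membership check on the backbone. Here is a minimal sketch of it in Python. The function name and the set are my own illustration, not part of any Azure SDK, and the set only contains the two models named in this post; the official list has five entries, and I have left the rest out rather than guess at them.

```python
# Backbones named in this post as supported on the FPGA.
# The official list has five entries; the others are deliberately
# omitted here and should be filled in from the Azure documentation.
SUPPORTED_BACKBONES = {"resnet-50", "vgg-16"}


def can_run_on_fpga(backbone: str) -> bool:
    """Return True if the given backbone name is in the supported set.

    Hypothetical helper for illustration only. It compares names
    case-insensitively and ignores surrounding whitespace.
    """
    return backbone.strip().lower() in SUPPORTED_BACKBONES


# YOLOv4 as we use it today, and the same detector with a swapped backbone.
print(can_run_on_fpga("CSPDarknet-53"))
print(can_run_on_fpga("ResNet-50"))
```

Passing this check only says the backbone is one of the supported networks. It does not by itself mean the rest of the detector works unchanged.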

So what can we do? I have some ideas. This is not a complete list, just some thoughts.

Modify our current YOLOv4 by changing the backbone from CSPDarknet-53 to ResNet-50. Since ResNet-50 is supported, the convolution-heavy part of this modified version could then run on the FPGA. This is essentially what PP-YOLO from PaddlePaddle did, using a ResNet-50 variant as its backbone (https://towardsdatascience.com/pp-yolo-surpasses-yolov4-object-detection-advances-1efc2692aa62). Drawback: modifying the YOLOv4 structure takes a lot of effort, which means more time and more testing.
Forget about the FPGA and run on a CPU or GPU instead. In my previous experience, a GPU usually gives good performance, while a CPU is generally much slower.
If the FPS on a GPU or CPU is still far from our requirements, we can: 1. use YOLOv4-tiny instead of YOLOv4. 2. instead of running detection on every frame, detect every 5 frames and use tracking to fill in the gaps.
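To make the last idea concrete, here is a rough sketch of the "detect every 5 frames, track in between" loop in plain Python. The names detect_with_skipping, fake_detect and fake_track are hypothetical stand-ins I made up for illustration. In a real pipeline, detect would wrap the YOLO model and track would wrap whatever tracker we choose.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def detect_with_skipping(
    frames: Iterable,
    detect: Callable[[object], List[Box]],
    track: Callable[[object, List[Box]], List[Box]],
    interval: int = 5,
) -> List[List[Box]]:
    """Run the expensive detector every `interval` frames and the
    cheaper tracker on the frames in between."""
    results: List[List[Box]] = []
    boxes: List[Box] = []
    for i, frame in enumerate(frames):
        if i % interval == 0:
            # Key frame: run the full detector and reset the boxes.
            boxes = detect(frame)
        else:
            # In-between frame: update the previous boxes with the tracker.
            boxes = track(frame, boxes)
        results.append(boxes)
    return results


# Toy stand-ins so the sketch runs without a model or a video file.
def fake_detect(frame) -> List[Box]:
    return [(float(frame), 0.0, 10.0, 10.0)]


def fake_track(frame, boxes: List[Box]) -> List[Box]:
    # Pretend every object moves one pixel to the right per frame.
    return [(x + 1.0, y, w, h) for (x, y, w, h) in boxes]


results = detect_with_skipping(range(12), fake_detect, fake_track, interval=5)
print(len(results))
```

With interval=5, the detector runs on only one frame in five, so the overall speed depends mostly on how cheap the tracker is. The trade-off is that new objects appearing between key frames are not picked up until the next detection.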