# CPS Summer School 2022

Tutorial C4D:

A programmable and reconfigurable FPGA overlay

Alessandro Capotondi<sup>1</sup>, Daniel Madroñal<sup>2</sup>, Gianluca Bellocchi<sup>1</sup>, Andrea Marongiu<sup>1</sup>, Francesca Palumbo<sup>2</sup>

<sup>1</sup>Università degli Studi di Modena e Reggio Emilia

<sup>2</sup>Università degli Studi di Sassari





This project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 826610. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Spain, Austria,







**COMP4DRONES** will provide a framework of key enabling technologies for safe and autonomous drones that will leverage their customization and modularity for civilian services

# AGENDA

1. Introduction
2. Methodology overview
3. MDC tool
4. **OODK** overlay
5. **COMP4DRONES** use case
6. Conclusions








#### **Drone system**

The *"classic" set-up* comprises a **microcontroller unit (MCU)** that is used for control and actuation.

#### Companion computer

The *current paradigm* envisions coupling an **MCU** with a companion computer.

Heterogeneous solutions (Nvidia Tegra TX2, Xilinx Zynq US+, ...) are increasingly used.

































## Introduction Motivation

#### What has to be simplified?

- System-Level Design
  - Build and evaluate accelerator-rich systems
    - Expensive
    - Time-consuming
- Design Space Exploration (DSE)
  - Key effects only manifest at system-level
  - User knobs:
    - System optimization
    - Accelerator optimization
- Accelerator Design
  - Multi-functionality support
  - Multi working-point support





- **Step 1:** Overview of the proposed methodology (how to build a whole FPGA-based system starting from a dataflow specification)
- **Step 2:** Accelerator definition and generation (MDC workflow)
- **Step 3:** Overlay connection and usage from SW (OODK workflow)






The methodology consists of three steps:

1. Dataflow specification
2. Datapath merging and wrapper generation
3. Build the system

**Prerequisites**

- Dataflow applications
- HDL components
- Communication protocol

Backend: HWPU wrapper generation

#### **FPGA** overlay

Build the system: Overlay + HWPUs

# Methodology overview FPGA overlay





## Methodology overview HWPU accelerator wrapper



Bellocchi, Gianluca, Alessandro Capotondi, Francesco Conti, and Andrea Marongiu. "A risc-v-based fpga overlay to simplify embedded accelerator deployment." In 2021 24th Euromicro Conference on Digital System Design (DSD), pp. 9-17. IEEE, 2021.

Conti, Francesco, Pasquale Davide Schiavone, and Luca Benini. "XNOR neural engine: A hardware accelerator IP for 21.6-fJ/op binary neural network inference." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.11 (2018): 2940-2951.







#### MDC-based reconfigurable application

# Methodology overview App modeling



#### Dataflow Models





#### Directed graph of actors (functional units)

#### Actors exchange tokens (data packets) through dedicated channels
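The two properties above (a directed graph of actors, and tokens exchanged through dedicated channels) can be illustrated with a minimal sketch. The `Channel`, `Actor` and `run` names are hypothetical and are not part of MDC or OODK; the scheduler is a naive loop chosen only for readability.

```python
from collections import deque


class Channel:
    """Dedicated FIFO channel carrying tokens between two actors."""

    def __init__(self):
        self.fifo = deque()

    def put(self, token):
        self.fifo.append(token)

    def get(self):
        return self.fifo.popleft()

    def __len__(self):
        return len(self.fifo)


class Actor:
    """Functional unit that fires when a token is available on its input."""

    def __init__(self, name, func, inp, out):
        self.name, self.func, self.inp, self.out = name, func, inp, out

    def can_fire(self):
        return len(self.inp) > 0

    def fire(self):
        # Consume one input token and produce one output token.
        self.out.put(self.func(self.inp.get()))


def run(actors):
    """Keep firing actors until none of them has input tokens left."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


# Directed graph: c_in -> scale -> c_mid -> offset -> c_out
c_in, c_mid, c_out = Channel(), Channel(), Channel()
scale = Actor("scale", lambda x: 2 * x, c_in, c_mid)
offset = Actor("offset", lambda x: x + 1, c_mid, c_out)

for token in [1, 2, 3]:
    c_in.put(token)

run([scale, offset])
result = list(c_out.fifo)
print(result)  # [3, 5, 7]
```

Each actor only sees its own channels, which is what makes it possible to map actors to independent hardware units later in the flow.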





| 0 | 1 | 2 |
|---|---|---|
| 1 | 1 | 0 |
| 0 | 0 | 0 |
| X | X | 1 |

#### Methodology overview HW accelerator generation





#### HDL components library

#### Methodology overview HW accelerator integration





# Methodology overview System generation

#### A subset of the generable accelerator-rich systems



**Agile system-level design** and exploration methodology









### Edge detection using different kernels

#### **INPUT IMAGE**



### SOBEL





#### ROBERTS
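As a software reference for the two kernels named above, the following sketch applies the standard Sobel and Roberts cross operators to a tiny image. It is plain Python for illustration only, not the dataflow description used in the tutorial; the helper names `correlate` and `edge_magnitude` are hypothetical, and the gradient magnitude is approximated as |Gx| + |Gy|.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
ROBERTS_X = [[1, 0], [0, -1]]
ROBERTS_Y = [[0, 1], [-1, 0]]


def correlate(image, kernel):
    """Slide the kernel over the image (valid region only, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def edge_magnitude(image, kernel_x, kernel_y):
    """Approximate the gradient magnitude as |Gx| + |Gy|."""
    gx = correlate(image, kernel_x)
    gy = correlate(image, kernel_y)
    return [
        [abs(a) + abs(b) for a, b in zip(row_x, row_y)]
        for row_x, row_y in zip(gx, gy)
    ]


# 4x4 test image with a vertical edge between columns 1 and 2.
image = [[0, 0, 9, 9] for _ in range(4)]

sobel = edge_magnitude(image, SOBEL_X, SOBEL_Y)
roberts = edge_magnitude(image, ROBERTS_X, ROBERTS_Y)
print(sobel)    # [[36, 36], [36, 36]]
print(roberts)  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

Both operators share the same structure (two small correlations followed by an absolute-value sum) and differ only in kernel size and coefficients, which is the kind of overlap a reconfigurable datapath can exploit.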





### MDC design suite

MDC design suite: <u>https://github.com/mdc-suite</u>

- **Baseline MDC Core:** Datapath merging and CGR generation
- Structural Profiler: DSE for optimal CGR composition
- <u>Power Manager</u>: Clock and power gating by regions
- <u>Co-Processor Generator</u>: Wrapper to connect accelerator and processor

Components marked in bold are **relevant for this tutorial**.

### MDC Co-processor generation: HWPU generated by MDC





#### **MDC**-based reconfigurable application

### MDC + OODK HWPU accelerator wrapper

#### Streamer

 Specialized DMA controller that transforms streams into memory accesses

#### Controller

- Register file to host runtime parameters
- Control FSM for coarse-grained control/(re-)configuration
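The roles of the streamer and the controller can be sketched in a few lines of behavioural Python. This is a simplified model under stated assumptions: the register names (`in_base`, `out_base`, `length`), the three FSM states and the 4-byte word stride are hypothetical and do not reflect the actual HWPU wrapper interface.

```python
def stream_addresses(base, length, stride=4):
    """Yield the byte addresses accessed for one linear stream of words."""
    for i in range(length):
        yield base + i * stride


class Controller:
    """Register file plus a coarse-grained FSM (IDLE -> RUNNING -> DONE)."""

    def __init__(self):
        self.regs = {}
        self.state = "IDLE"

    def write_reg(self, name, value):
        # Runtime parameters can only be changed while the unit is not running.
        if self.state == "RUNNING":
            raise RuntimeError("cannot reconfigure while running")
        self.regs[name] = value

    def start(self):
        self.state = "RUNNING"

    def finish(self):
        self.state = "DONE"


def run_job(ctrl, memory, kernel):
    """Read words through a source stream, process them, write them back."""
    ctrl.start()
    n = ctrl.regs["length"]
    src = stream_addresses(ctrl.regs["in_base"], n)
    dst = stream_addresses(ctrl.regs["out_base"], n)
    for a_in, a_out in zip(src, dst):
        memory[a_out] = kernel(memory[a_in])
    ctrl.finish()


memory = {0x00: 1, 0x04: 2, 0x08: 3}
ctrl = Controller()
ctrl.write_reg("in_base", 0x00)
ctrl.write_reg("out_base", 0x100)
ctrl.write_reg("length", 3)
run_job(ctrl, memory, kernel=lambda x: x * x)
print(memory[0x100], memory[0x104], memory[0x108])  # 1 4 9
```

The kernel itself never sees an address: it only consumes and produces a stream, while the streamer turns that stream into memory accesses and the controller holds the job parameters.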



Hardware Processing Unit



## AGENDA

1. Introduction
2. Methodology overview
3. MDC tool
4. **Onboard Overlay Development Kit**
5. **COMP4DRONES** use case
6. Conclusions



### Overlay Development Kit Starting point

#### **PULP** architecture

- PULP stands for «Parallel Ultra Low Power»
- Open and scalable HW/SW research and development platform
- Cluster-based architecture
- RISC-V ISA compliant

Website: pulp-platform.org







## What does PULP Ecosystem include?

| Category | Components |
|----------|------------|
| RISC-V cores | RI5CY (32b), Micro-riscy (32b), Zero-riscy (32b), Ariane (64b) |
| Peripherals | JTAG, SPI, UART, I2S, DMA, GPIO |
| Interconnect | Logarithmic interconnect, APB (peripheral bus), AXI4 (interconnect) |
| Platforms, single core | PULPino, PULPissimo |
| Platforms, multi-core | Fulmine, Mr. Wolf |
| Accelerators | HWCE (convolution), Neurostream (ML), HWCrypt (crypto) |


### Overlay Development Kit Starting point 2

### HERO

- FPGA emulation of heterogeneous and massively parallel PULP systems
- Instantiable with COTS FPGA-based heterogeneous SoCs



Website: pulp-platform.org



Kurth, A., Capotondi, A., Vogel, P., Benini, L., & Marongiu, A. (2018) HERO: An open-source research platform for HW/SW exploration of heterogeneous manycore systems.



## HERO is not only HW

- Includes complete SW support
  - Linux-based OS Distribution
  - Easy to port legacy code!

## Heterogeneous programming model

Based on OpenMP 5.x

Software stack layers: User Level, Kernel Level, Hardware

https://github.com/pulp-platform/hero





## **Programming Model**



It allows writing programs that start on the host but seamlessly integrate the PMCAs (programmable many-core accelerators).

Offloads use OpenMP 4.5 target semantics and can be zero-copy (pointer passing) or copy-based.







### OODK **FPGA** overlay

#### What is it?

- Hardware abstraction layer
- Overlays the original FPGA fabric $\rightarrow$ hides hardware details
- Enables easy customization and integration of new HW accelerators

#### **Features:**

- Parametrized HW $\rightarrow$ flexible design of custom architectures
- Abstracted design flow $\rightarrow$ improved design productivity
- Programmable via standard APIs for heterogeneous compute platforms (e.g. OpenMP)

### HDL-based IPs are not easy to customize using standard HDL language features! (One can rely mainly on parametrization...)

Bellocchi, Gianluca, et al. "A risc-v-based fpga overlay to simplify embedded accelerator deployment." 2021 24th Euromicro Conference on Digital System Design (DSD). IEEE, 2021.





### OODK Architecture

### **System Domain**

- Cluster
  - Multi and single-cluster architectures
  - Agile integration of different accelerators
- L2 memory
  - Data and instruction memory
- Remapping address block (RAB)
  - An IO-MMU for translation of virtual addresses
- SoC bus
  - Highly-scalable interconnect


### OODK Architecture

### **Cluster Domain**

- HW accelerators
  - MDC-based HWPU
- RISC-V core
  - Tightly-coupled SW control: accelerator routines, data management policies, etc.
  - L1 instruction cache
- DMA
  - Specialized core for efficient L2 $\leftrightarrow$ L1 data transfers
  - Support for 2D and 1D data transfers
- L1 data memory
  - Multi-banked scratchpad data memory (not a cache!)
- Cluster interconnect
  - Highly-scalable logarithmic interconnect + Peripheral bus
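To see why a multi-banked scratchpad matters for accelerators and cores that access L1 in parallel, the sketch below counts bank conflicts for two access patterns. It assumes word-level interleaving (consecutive 32-bit words map to consecutive banks), which is an assumption made for this illustration and is not stated in the slides; `bank_of` and `conflicts` are hypothetical helper names.

```python
from collections import Counter

WORD_BYTES = 4  # assumed 32-bit words


def bank_of(byte_addr, n_banks):
    """Map a byte address to a bank, assuming word-level interleaving."""
    return (byte_addr // WORD_BYTES) % n_banks


def conflicts(addresses, n_banks):
    """Count same-cycle requests that must wait because their bank is busy."""
    per_bank = Counter(bank_of(a, n_banks) for a in addresses)
    return sum(count - 1 for count in per_bank.values())


N_BANKS = 16

# Eight requesters reading consecutive words: every request hits its own bank.
unit_stride = [WORD_BYTES * i for i in range(8)]

# Eight requesters reading with a stride of 16 words: all requests hit bank 0.
bank_stride = [WORD_BYTES * N_BANKS * i for i in range(8)]

print(conflicts(unit_stride, N_BANKS))  # 0
print(conflicts(bank_stride, N_BANKS))  # 7
```

Under this assumption, the number of banks and the stride of the accelerator's access pattern together determine how much of the scratchpad bandwidth is actually usable.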





### OODK HW/SW co-design and verification tool

### OODK (Genov+Arov)

- Compile SW test application
- Run HW/SW validation test


### OODK HW accelerator generation and integration

**Applications are mapped to HW** and an HWPU wrapper is generated.


### AROv (Accelerator-Rich Overlay) and GenOv (Generator Overlay)

**Download on GitHub:** https://github.com/gbellocchi/arov

Figure: Accelerator-Rich Overlay Generator (GenOv). Blocks shown: Python generation environment, source library, accelerator wrapper generation, runtime, device specification, GenOv Py-Lib, accelerator-rich overlay generation, source templates, acceleration kernel design flow, output environment, generated accelerator-rich overlay, Vivado HLS support, standalone accelerator, handcrafted HDL, verification testbench.

### OODK System generation

#### Choosing how to interconnect accelerators is a primary requirement

- Which type of interconnect topology better fits our needs?
- What about the clustering level?
- How do accelerators work together?
  - Accelerators can either work in parallel or sequentially

#### **Generation principles**

- User knobs:
  - System optimization
    - Memory hierarchy, control cores, DMA, etc.
    - Accelerator interconnections (generic vs. application-specific interconnects)
    - Accelerator scheduling (concurrent, serial or mixed scheduling)
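The user knobs listed above span a design space that grows multiplicatively. The following sketch enumerates it for a small, hypothetical set of values: the interconnect and scheduling options come from the list above, while the cluster counts and the dictionary layout are assumptions made only for illustration.

```python
from itertools import product

INTERCONNECTS = ["generic", "application-specific"]
SCHEDULINGS = ["concurrent", "serial", "mixed"]
CLUSTER_COUNTS = [1, 2]  # hypothetical clustering levels

# Every combination of knob values is one candidate system to generate.
design_points = [
    {"clusters": n, "interconnect": ic, "scheduling": sch}
    for n, ic, sch in product(CLUSTER_COUNTS, INTERCONNECTS, SCHEDULINGS)
]

print(len(design_points))  # 12
print(design_points[0])
```

Even with three knobs and a handful of values there are already twelve candidate systems, which is why an automated generation flow is needed to explore them.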



### OODK System generation (spec.py)

1. System information
2. Cluster information
3. HW accelerators interconnection
   - Logarithmic interconnect
   - Heterogeneous interconnect





### OODK Example #1 – Connection to Cluster Interconnect





### OODK Example #2 – Multi-Cluster Interconnection

```
class oodk_specs:

    def system(self):
        # Name of the generated system
        self.oodk_config = 'ex_2_sys_gen'
        return self

    def cluster_0(self):
        self.cl_offset = 0
        self.core = [ 'ibex', 1 ]                  # control core type and count
        self.tcdm = [ 32 , 128]                    # L1 memory configuration
        self.lic = [ [ 'kernel_A' , 'hwpu']]       # accelerators on the logarithmic interconnect
        self.hci = [ ]                             # accelerators on the heterogeneous interconnect
        return self

    def cluster_1(self):
        self.cl_offset = 0
        self.core = [ 'ibex', 1 ]
        self.tcdm = [ 32 , 128]
        self.lic = [ [ 'kernel_B' , 'hwpu']]
        self.hci = [ ]
        return self

    def cluster_2(self):
        self.cl_offset = 0
        self.core = [ 'ibex', 1 ]
        self.tcdm = [ 32 , 128]
        self.lic = [ [ 'kernel_C' , 'hwpu']]
        self.hci = [ ]
        return self
```





### **OODK** Example #3 – Heterogeneous Interconnection







## OODK System generation

#### **Test application**

- Baremetal software test
- Compiled for the OODK system
- A template version is generated together with the system itself

### **Accelerator Driver Generation**

#### **HW/SW** validation test

- RTL simulation
  - Before moving to the FPGA set-up, the generated designs are tested in a QuestaSim testbench
  - The real behavior of the baremetal application is tested
    - The RISC-V core executes the test application
    - The accelerators' functionality is validated with synthetic stimuli




# **OODK** Tutorial

**Download Sources and Installation** 

### **Download Arov+Genov Repository**

- GitHub: https://github.com/gbellocchi/arov
- **Open terminal**
  - mkdir oodk; cd oodk
  - git clone https://github.com/gbellocchi/arov
  - cd arov; source setup.sh
  - git submodule update --init --recursive

### (Optional) SW development kit: HERO repository

- GitHub: https://github.com/pulp-platform/hero
- Open terminal
  - git clone https://github.com/pulp-platform/hero
  - cd hero
  - git checkout cps-school

## OODK Tutorial Arov (Accelerator Rich Overlay)





## OODK Tutorial Genov (Generator of Overlay)

Repository layout:

- /docs -> documentation
- /genov -> generators (python) + backend (IP templates)
- /src -> your accelerator specs, your system specs


### Exercise 1 Instantiate your first Overlay

- **1 Cluster**
- 16x L1 memory banks
- 128KB L1
- 8x RISC-V 'riscy' cores
- 4x traffic\_gen, hwpe accelerators

**Create:** exercise1/specs/ov\_specs.py
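A possible starting point for this file is sketched below, following the field layout of the multi-cluster example shown earlier. Several details are assumptions rather than confirmed GenOv conventions: the class name `oodk_specs`, the `oodk_config` value, and the second element `'hwpe'` of each accelerator entry are inferred from the slides and should be checked against the examples in the repository.

```python
class oodk_specs:
    """Overlay specification for Exercise 1: a single cluster (sketch)."""

    def system(self):
        # Name of the generated system (assumed value).
        self.oodk_config = 'exercise1'
        return self

    def cluster_0(self):
        self.cl_offset = 0
        # Core type and number of cores: 8x 'riscy'.
        self.core = ['riscy', 8]
        # L1 memory: 16 banks, 128 KB.
        self.tcdm = [16, 128]
        # Four traffic_gen accelerators on the logarithmic interconnect
        # (the 'hwpe' wrapper type is an assumption).
        self.lic = [['traffic_gen', 'hwpe'] for _ in range(4)]
        # No accelerators on the heterogeneous interconnect.
        self.hci = []
        return self


specs = oodk_specs().system().cluster_0()
print(specs.core, specs.tcdm, len(specs.lic))
```

Because every method returns `self`, the calls can be chained as in the last lines, which makes it easy to inspect the resulting configuration before running the generator.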





#### Create Python Environment (first time only)

- cd genov
- make py\_env
- source local\_py\_env/bin/activate

#### Load Python Environment

- cd genov
- source local\_py\_env/bin/activate

#### **Generate the Overlay**

- cd genov
- make TARGET\_OV=<OVERLAY FOLDER NAME> ov\_gen
- make TARGET\_OV=example1 ov\_gen (in our specific case)




# Exercise 2

Instantiate your second Overlay!

- **2 Clusters**
- Cluster 0
  - 16x L1 memory banks
  - 128KB L1
  - 4x RISC-V 'riscy' cores
  - 3x traffic\_gen, hwpe accelerators
- Cluster 1
  - 8x L1 memory banks
  - 128KB L1
  - 2x RISC-V 'riscy' cores
  - 1x traffic\_gen, hwpe accelerators





# Exercise 2 exercise2/specs/ov\_specs.py

| <pre>def cluster_0(self):     self.cl_offset     self.core     self.tcdm     self.lic      self.hci     return self</pre> | <pre>= 0 = [ 'riscy', 4 ] = [ 16 , 128] = [ [ 'traffic_gen' ,        [ 'traffic_gen' ,        [ 'traffic_gen' ,        [ 'traffic_gen' ,        ]</pre> |
|---------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>def cluster_1(self):     self.cl_offset     self.core     self.tcdm     self.lic     self.hci     return self</pre>  | <pre>= 1 = [ 'riscy', 2 ] = [ 8 , 128] = [ [ 'traffic_gen' , = [ ]</pre>                                                                                |

cd genov

make TARGET\_OV=example2 ov\_gen





# Ok, ok, but how to code this thing!





# Helloworld on exercise1 Overlay

### **SW Requirements**

- Installation of HERO SDK
  - Download the prebuilt SDK from the release:
    - <u>https://github.com/gbellocchi/arov/releases/tag/cps-summer-school-22-v0.2</u>
  - Build from sources (takes time...)
    - https://github.com/pulp-platform/hero
    - git checkout cps-summer-school-22
    - Follow README.md

### Where do we exploit OpenMP?

- Offload computation from the host of the SoC to the overlay
- Parallel OpenMP pragma to parallelize execution on the **RISC-V** cores







# **OpenMP** Helloworld



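### hero/openmp-examples/helloworld/helloworld.c

```
#include <hero-target.h> // BIGPULP_MEMCPY
#include <omp.h>         // omp_get_thread_num(), omp_get_num_threads()
#include <stdio.h>       // printf()

// Functions between "declare target" and "end declare target"
// are also compiled for the overlay.
#pragma omp declare target
void helloworld(void) {
#pragma omp parallel
  printf("Hello World, I am thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
#pragma omp end declare target

int main(int argc, char *argv[]) {
  // Offload the call from the host to the overlay.
#pragma omp target device(BIGPULP_MEMCPY)
  helloworld();

  return 0;
}
```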

- cd hero
- source setup.sh
- cd openmp-examples/helloworld/
- make clean all (build the heterogeneous application for the board)
- make clean all only=pulp (build for RTL simulation in QuestaSim; use this today)



# Execute QuestaSim simulation\*

cd arov/genov

make TARGET\_OV=example1 ov\_deploy ## make design ready for deployment (simulation, build)

cd ../

make TARGET\_OV=example1 APP\_PATH=/path/to/hero/openmp-examples/helloworld GUI=0 vsim

| Simulation output                       |
|-----------------------------------------|
| # [0,0] Hello World, I am thread 0 of 8 |
| # [0,1] Hello World, I am thread 1 of 8 |
| # [0,2] Hello World, I am thread 2 of 8 |
| # [0,3] Hello World, I am thread 3 of 8 |
| # [0,4] Hello World, I am thread 4 of 8 |
| # [0,5] Hello World, I am thread 5 of 8 |
| # [0,6] Hello World, I am thread 6 of 8 |
| # [0,7] Hello World, I am thread 7 of 8 |

\* A QuestaSim installation is required. If you do not have access to any ModelSim simulator, you can also use the Intel Quartus Edition: https://www.intel.it/content/www/it/it/software/programmable/quartus-prime/questa-edition.html







# Execute QuestaSim simulation\*

cd ../

make TARGET\_OV=example1 APP\_PATH=/path/to/hero/openmp-examples/helloworld GUI=1 vsim





### Traffic Gen Accelerator example /hero/openmp-examples/cps-school-22-hwpe-example

Files in cps-school-22-hwpe-example:

- main.c
- Makefile
- traffic\_gen\_api.c
- traffic\_gen\_api.h

```
#pragma omp parallel
{
#pragma omp master
  {
    printf("I am the master, and I am going to program the accelerator\n");
    arov_struct arov;
    int offload_id;
    int cluster_id = 0;
    int acc_id = 0;

    /* Allocate the accelerator buffer in L1 memory */
    __device uint32_t *a_local = (__device uint32_t *)hero_l1malloc(1024 * sizeof(uint32_t));

    printf("Initialized the Traffic Gen %d\n", acc_id);
    arov_init(&arov, cluster_id, acc_id);

    printf("Prepare Traffic Gen %d Job descriptor\n", acc_id);
    arov_map_params_traffic_gen(&arov, cluster_id, acc_id,
        a_local,       /* buffer l1 base pointer */
        1024,          /* i/o size in word (I/O) */
        512,           /* input size in word */
        1, 1, 512,     /* total tx generated */
        1, 1, 512, 1,  /* n_reps */
        16);           /* n banks touched */

    printf("Program Traffic Gen %d\n", acc_id);
    offload_id = arov_activate(&arov, cluster_id, acc_id);
    arov_program(&arov, cluster_id, acc_id);

    /* ... the rest of main.c is not shown on the slide ... */
  }
}
```

- cd hero
- source setup.sh
- cd openmp-examples/cps-school-22-hwpe-example
- make clean all (build the heterogeneous application for the board)
- make clean all only=pulp (build for RTL simulation in QuestaSim; use this today)



# Execute QuestaSim simulation\*

cd ../

make TARGET\_OV=example1 APP\_PATH=/path/to/hero/openmp-examples/cps-school-22-hwpe-example GUI=0 vsim

Simulation output (screenshot): the master thread [0,0] prints "I am the master, and I am going to program the accelerator", "Initialized the Traffic Gen 0", "Prepare Traffic Gen 0 Job descriptor", "Program Traffic Gen 0", "Start Traffic Gen 0", "Wait for termination Traffic Gen 0" and "Traffic Gen 0 execution is complete". Each of the 8 threads prints a message that it is waiting for the termination of all the threads, followed by "That's all folks!".





# Execute QuestaSim simulation\*

#### cd ../

### make TARGET\_OV=example1 APP\_PATH=/path/to/hero/openmp-examples/cps-school-22-hwpeexample GUI=1 vsim

*Screenshot: QuestaSim waveform window. Visible hierarchy entries include `dma`, `cluster_peripherals`, `cluster_interconnect`, `tcdm_interco`, `tcdm_sram_master`, `sram`, `icache[0]`, `LIC_acc_region` and `wrapper[0]` to `wrapper[3]`; for `wrapper[0]`, the `hwpe_xbar_master[0]` and `hwpe_xbar_master[1]` signals (`add`, `wen`, `wdata`, `be`, `r_rdata`) and `hwpe_cfg_slave[0]` are shown.*








### Traffic Gen Accelerator example

`/hero/openmp-examples/cps-school-22-hwpe-example`

- `main.c`
- `Makefile`
- `traffic_gen_api.c`
- `traffic_gen_api.h`

```
#pragma omp parallel
#pragma omp master
{
    /* Only the master thread programs the accelerator */
    printf("I am the master, and I am going to program the accelerator\n");

    arov_struct arov;
    int offload_id;
    int cluster_id = 0;
    int acc_id = 0;

    /* Allocate a 1024-word buffer in L1 memory */
    __device uint32_t *a_local =
        (__device uint32_t *)hero_l1malloc(1024 * sizeof(uint32_t));

    printf("Initialized the Traffic Gen %d\n", acc_id);
    arov_init(&arov, cluster_id, acc_id);

    printf("Prepare Traffic Gen %d Job descriptor\n", acc_id);
    arov_map_params_traffic_gen(&arov, cluster_id, acc_id,
                                a_local,      /* buffer L1 base pointer */
                                1024,         /* i/o size in word (I/O) */
                                512,          /* input size in word */
                                1, 1, 512,    /* total tx generated */
                                1, 1, 512, 1, /* n_reps */
                                16);          /* n banks touched */

    printf("Program Traffic Gen %d\n", acc_id);
    offload_id = arov_activate(&arov, cluster_id, acc_id);
    arov_program(&arov, cluster_id, acc_id);
}
```
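The job descriptor above mentions the number of L1 banks touched by the traffic generator. As a rough illustration of what that quantity means, the sketch below counts the distinct banks hit by a strided word access, assuming a word-interleaved L1 memory (consecutive 32-bit words map to consecutive banks). This mapping, and the helper names `bank_of_word` and `banks_touched`, are assumptions made for the example; they are not part of the `arov` API shown in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BANKS 64u /* upper bound chosen for this sketch only */

/* Hypothetical helper: bank index of a 32-bit word, assuming
 * word-level interleaving across n_banks banks. */
unsigned bank_of_word(size_t word_index, unsigned n_banks)
{
    return (unsigned)(word_index % n_banks);
}

/* Count how many distinct banks are touched by n_words accesses that
 * start at first_word and advance by stride words each time. */
unsigned banks_touched(size_t first_word, size_t n_words,
                       size_t stride, unsigned n_banks)
{
    uint8_t seen[MAX_BANKS] = {0};
    unsigned count = 0;

    assert(n_banks > 0 && n_banks <= MAX_BANKS);

    for (size_t i = 0; i < n_words; ++i) {
        unsigned b = bank_of_word(first_word + i * stride, n_banks);
        if (!seen[b]) {
            seen[b] = 1;
            ++count;
        }
    }
    return count;
}
```

Under this assumption, 512 consecutive words over 16 banks touch all 16 banks, while a stride of 16 words keeps every access on a single bank, which is the kind of contrast a traffic generator can be used to explore.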

`cd hero`

`source setup.sh`

`cd openmp-examples/cps-school-22-hwpe-example`

- `make clean all`: build the heterogeneous application for the board
- `make clean all only=pulp`: build for RTL simulation (QuestaSim) ← use this today



## But where is the FPGA?

### **OODK also supports synthesis and implementation on FPGA**\*

`cd arov`

`make TARGET_OV=example1 fpga`

\* Synthesis and implementation require a Xilinx Vivado installation and a valid license for the target board.





## What can you do with this tool?

Automated Resource Space Exploration





## What can you do with this tool?

Automated Performance Evaluation

*Plot legend: measured L1 (Hom-FG, Het-FG, Worst-FG); measured L2 (1cl-FG, 2cl-FG, 4cl-FG, 8cl-FG, 16cl-FG); model L1 ideal; model L2 ideal; model L1 (Hom-FG, Het-FG, Worst-FG).*

# AGENDA

1. Introduction
2. Methodology overview
3. MDC tool
4. **OODK** overlay
5. **COMP4DRONES** use case
6. Conclusions

# Current application: C4D





Development and assessment of Smart and Precision Agriculture Technologies to enable:

1. Improved non-real-time actions, i.e. forecast of production volume and optimized water management;
2. Real-time field monitoring and inspection, i.e. automatic disease detection and cross-correlation of plant indexes;
3. Prompt on-field intervention, i.e. customized spot spraying.

### **TECHNICAL SET-UP**

… field data, and a spraying drone …











### **USER NEEDS**

1. Use as little pesticide as possible: proper assessment of health status and on-the-spot interventions
2. Waste as little water as possible: precise growth assessment














### **EXPECTED BENEFITS**

1. Reduced impact on the environment
2. Reduced human effort
3. Improved usability of advanced technologies by non-expert operators










# Current application: Baseline

# Current application: Scenario 2

# Current application: Scenario 3





# C4D methodology experimental results

- ✓ Overall 2x speedup of the HW implementation of the AES algorithm compared with the SW implementation
- ✓ Implementation targets a ZU9EG SoC with a resource cost of:
  - ➤ ~43.7% LUTs
  - ➤ ~11.7% FFs
  - ➤ ~13.2% BRAMs
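To give a feel for what these percentages mean in absolute terms, the sketch below converts a utilization percentage into an approximate resource count. The device totals (274,080 LUTs, 548,160 flip-flops, 912 BRAM blocks) are taken from the public ZU9EG product tables and are an assumption here, since the slides report percentages only; `approx_used` is a hypothetical helper written for this example.

```c
#include <assert.h>

/* Assumed ZU9EG device totals (not stated in the slides). */
#define ZU9EG_LUTS  274080UL
#define ZU9EG_FFS   548160UL
#define ZU9EG_BRAMS 912UL

/* Convert a utilization percentage into an approximate resource
 * count, rounded to the nearest unit. */
unsigned long approx_used(unsigned long total, double percent)
{
    assert(percent >= 0.0 && percent <= 100.0);
    return (unsigned long)((double)total * percent / 100.0 + 0.5);
}
```

With these assumed totals, the reported figures correspond to roughly 119.8k LUTs, 64.1k flip-flops and 120 BRAM blocks.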





# AGENDA

1. Introduction
2. Methodology overview
3. MDC tool
4. **OODK** overlay
5. **COMP4DRONES** use case
6. Conclusions



## Conclusions

- ✓ Simplified design of HW accelerators through MDC
- ✓ Multi-functionality, multi working-point and reconfiguration support for CGRAs
- ✓ Support for accelerators generated with different tools (e.g., CAPH, HLS)
- ✓ Agile methodology for the design and exploration of accelerator-rich systems
- ✓ Simplified validation and deployment of the generated HW/SW system
- ✓ Practical use case: COMP4DRONES



### Give us your feedback!





## Contacts

- Prof. Alessandro Capotondi: Alessandro.capotondi@unimore.it
- Dr. Daniel Madroñal: dmadronalquin@uniss.it
- Ing. Gianluca Bellocchi: gianluca.bellocchi@unimore.it
- Prof. Andrea Marongiu: a.marongiu@unimore.it
- Prof. Francesca Palumbo: fpalumbo@uniss.it






