Playing with the Cyclone V SoC system – DE0-Nano-SoC Kit/Atlas-SoC

This project is about the implementation of a System on Chip (SoC) on the Cyclone V SoC from Altera [1].
The design is implemented on the DE0-Nano-SoC Kit/Atlas-SoC evaluation board from Terasic [2], which I bought recently to experiment with the Cyclone V SoC. The Cyclone V SoC is an FPGA combined with a dual-core ARM® Cortex®-A9 hard processor system (HPS) and some peripherals. The FPGA and the ARM processor are connected by a high-speed bus system, which offers many more possibilities than a system built from a stand-alone ARM and a separate FPGA. Since everything is implemented on a single chip, one gets a long way with just the evaluation board and does not need a complicated PCB design to combine an MCU with an FPGA. This brings SoC design even into a low-budget lab :).

The test system which was designed is meant as a starting point for further designs, and to explore the possibilities of the Cyclone V SoC system.

The design demonstrates:

  • HPS-to-FPGA bridge
  • lightweight HPS-to-FPGA bridge
  • Control of LED via HPS-to-FPGA bridge
  • Hardware accelerated calculations outside the HPS
  • On chip RAM
  • DMA controller transfers, RAM to RAM and RAM to GPIO
  • Scatter-Gather DMA controller transfers, DDR-RAM to GPIO
  • Clock crossing bridge
  • Reserve a dedicated amount of the DDR-RAM for Linux and the FPGA

The hardware design

    The developed system contains two DMA controllers to test different methods of DMA driven data transfers.
    A hardware adder on the FPGA is implemented to demonstrate how more complicated calculations can be performed hardware accelerated, outside the ARM MCU.
    It is demonstrated how to use the HPS-to-FPGA bridge and lightweight HPS-to-FPGA bridge.

    Most components implemented on the FPGA fabric are available as IP-components from Altera [8].

    My starting point was the golden reference design from Altera [3], which ships with the evaluation board. On top of it I added my own components of the SoC.

    A simplified flow diagram shows the connections of the components in the FPGA and the HPS.


    The system was designed with the help of QSys. All interconnections and addresses are shown in the following:


    To connect the components to the outside world, the physical pins of the FPGA have to be wired up, and the VHDL components outside the QSys system have to be connected as well. This is done in the Verilog file “ghrd.v”, which comes with the Golden Reference Design and needs to be extended by:

    //my wires
      wire [7:0] reg1_to_add;
      wire [7:0] reg2_to_add;
      wire [7:0] add_to_reg3;
      //my add
      SimpleAdd MyAdd(
    	.reg1		(reg1_to_add),
    	.reg2		(reg2_to_add),
    	.reg3		(add_to_reg3)
      );

    This connects the QSys system to the LEDs and connects “myBus” to 8 pins of the GPIO connector. These GPIO pins are then connected to a logic analyzer for testing. “myBus” is connected via a clock bridge, which receives the data from the DMA controller at 50 MHz and passes it on to the GPIO at 2 MHz.
    Furthermore, “reg1” to “reg3” are connected to the component “MyAdd”, a hardware adder implemented in the FPGA for testing hardware-accelerated operations outside the MCU.

    The hardware adder is implemented in VHDL with the following simple code:

    -- Title           : Title
    -- Author          : Daniel Pelikan
    -- Date Created    : xx-xx-2016
    -- Description     : Description
    -- Copyright 2016. All rights reserved
    library ieee;
    use ieee.std_logic_1164.all;
    use IEEE.numeric_std.all;
    entity SimpleAdd is
    --  generic (
    --    g_Variable : integer := 10    
    --    );
    	port (
    		reg1		: in std_logic_vector(7 downto 0);
    		reg2		: in std_logic_vector(7 downto 0);
    		reg3		: out std_logic_vector(7 downto 0)
    	);
    end entity;
    architecture rtl of SimpleAdd is
    --	signal tmp : std_logic := 0 ;
    --	constant const    : std_logic_vector(3 downto 0) := "1000";
    begin
    	reg3 <= std_logic_vector(unsigned(reg1) + unsigned(reg2));
    end rtl;

    It simply adds “reg1” and “reg2” and presents the result in “reg3”.
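
    Because all three ports are 8 bits wide, the carry out of the top bit is dropped and the sum wraps around at 256. The following is a minimal software model of that behaviour in C, meant only as an illustration; the function name simple_add_model is made up for this sketch and is not part of the design.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the 8-bit SimpleAdd entity: reg3 = reg1 + reg2.
 * The carry out of bit 7 is discarded, so the result wraps modulo 256. */
uint8_t simple_add_model(uint8_t reg1, uint8_t reg2)
{
    unsigned int sum = (unsigned int)reg1 + (unsigned int)reg2;
    uint8_t reg3 = (uint8_t)(sum & 0xFFu);

    /* Masking with 0xFF is the same as taking the sum modulo 256. */
    assert(reg3 == (uint8_t)(sum % 256u));
    return reg3;
}
```

    For example, simple_add_model(10, 5) gives 15, while simple_add_model(200, 100) gives 44, since 300 does not fit into 8 bits.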

    Modifying the Linux system to reserve some of the DDR RAM for the FPGA

    Now we have the hardware design ready and need to write the software to control all components implemented in hardware.
    For the Cyclone V SoC this can be done with the “Eclipse for DS5 Altera edition”, a development environment for ARM MCUs modified to work with the Altera SoCs [4].

    The starting point here is again the default Linux System which comes with the DE0-Nano-SoC Kit/Atlas-SoC.
    By default the system uses the full 1024 MB of RAM connected to the Cyclone V SoC.

    To be able to use some of the RAM without violating the integrity of the Linux system, the RAM used by Linux needs to be limited. In this design the limit is set to 800 MB.

    The u-boot command line accepts an option that limits how much memory is assigned to the kernel: mem=xxxx [5].

    To enter the u-boot console, connect the serial cable to the serial port before booting; the baud rate is 115200.

     setenv bootargs console=ttyS0,115200 mem=1000M

    It seems that mmcboot is run by default, and it sets bootargs itself.
    The default line in u-boot looks like:

    mmcboot=setenv bootargs console=ttyS0,115200 root=${mmcroot} rw rootwait;bootz ${loadaddr} - ${fdtaddr}

    which needs to be changed to:

     mmcboot=setenv bootargs console=ttyS0,115200 mem=800M root=${mmcroot} rw rootwait;bootz ${loadaddr} - ${fdtaddr}


    From the u-boot console, this can be set with:

    setenv mmcboot 'setenv bootargs console=ttyS0,115200 mem=800M root=${mmcroot} rw rootwait;bootz ${loadaddr} - ${fdtaddr}'

    More info can be found in [6,7].

    This reserves 800 MB of DDR-RAM for the Linux system and the rest can be used for other things.

    Something similar can probably be done with a kernel module, but to keep it simple for the moment we do everything in user space.

    The idea is to have 800 MB of RAM for the Linux system, plus a 64 MB and a 16 MB region for tasks in the FPGA.

    Whether the memory limit has taken effect can be checked with the commands:
    cat /proc/meminfo
    cat /proc/iomem
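
    The arithmetic behind this layout can be checked with a few lines of C. The sketch below is only an illustration under the numbers used in this article (1024 MB installed, 800 MB for Linux, a 64 MB region at 0x32000000 and a 16 MB region at 0x36000000); the helper names and the EX_MB macro are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define EX_MB (1024u * 1024u)

/* Returns 1 if [base, base+size) lies completely above the memory handed
 * to Linux (linux_limit) and still inside the installed RAM (total_ram). */
int region_is_reserved(uint32_t base, uint32_t size,
                       uint32_t linux_limit, uint32_t total_ram)
{
    uint64_t end = (uint64_t)base + size;   /* one past the last byte */

    assert(size > 0);
    return base >= linux_limit && end <= total_ram;
}

/* Returns 1 if the two regions share at least one byte. */
int regions_overlap(uint32_t base_a, uint32_t size_a,
                    uint32_t base_b, uint32_t size_b)
{
    uint64_t end_a = (uint64_t)base_a + size_a;
    uint64_t end_b = (uint64_t)base_b + size_b;

    return base_a < end_b && base_b < end_a;
}
```

    With mem=800M the Linux memory ends exactly at 800 * 1024 * 1024 = 0x32000000, so both regions pass region_is_reserved() and do not overlap each other. With a larger limit such as mem=1000M the same check fails, because the 64 MB region would then lie inside the memory managed by Linux.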

    Software design for the HPS

    A user space application is written performing the memory mapping of all components from physical space into the user space.
    This gives access to the control registers of the components implemented in the FPGA.

    It should be kept in mind that on Linux a memory address is not necessarily the physical address. Memory addresses in user space are usually virtual addresses, and their physical mapping can even change over time. In kernel space there are more possibilities.
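
    One practical consequence when mapping physical addresses through /dev/mem is that mmap() only accepts file offsets that are a multiple of the page size. A physical register address therefore has to be split into a page-aligned base, which is passed to mmap(), and an offset that is added to the returned pointer. The sketch below shows just this split; the function names are hypothetical, and the page size is passed in as a parameter (4096 in the usage note is an assumption, real code would query sysconf(_SC_PAGESIZE)).

```c
#include <assert.h>
#include <stdint.h>

/* Page-aligned part of a physical address: usable as the mmap() offset.
 * page_size must be a power of two. */
uint32_t page_base(uint32_t phys, uint32_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return phys & ~(page_size - 1);
}

/* Remainder inside the page: added to the pointer returned by mmap(). */
uint32_t page_offset(uint32_t phys, uint32_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return phys & (page_size - 1);
}
```

    With a page size of 4096, the address 0xFF200010 splits into the base 0xFF200000 and the offset 0x10.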

    Since controlling the DMA and Scatter-Gather DMA controllers is a bit more complicated, header files have been developed to make the control easier and more intuitive.

    The header files, dma.h and sgdma.h, can be found on GitHub.

    The main code starts by including the header files needed for the system and by defining some parameters:

    #define DEBUG
    #include <stdio.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include "hwlib.h"
    #include "soc_cv_av/socal/socal.h"
    #include "soc_cv_av/socal/hps.h"
    #include "soc_cv_av/socal/alt_gpio.h"
    #include "hps_0.h"
    #include "sgdma.h"
    #include "dma.h"
    //settings for the lightweight HPS-to-FPGA bridge
    #define HW_REGS_BASE ( ALT_STM_OFST )
    #define HW_REGS_SPAN ( 0x04000000 ) //64 MB with 32 bit address space this is 256 MB
    #define HW_REGS_MASK ( HW_REGS_SPAN - 1 )
    //setting for the HPS2FPGA AXI Bridge
    #define ALT_AXI_FPGASLVS_OFST (0xC0000000) // axi_master
    #define HW_FPGA_AXI_SPAN (0x40000000) // Bridge span 1GB
    #define HW_FPGA_AXI_MASK ( HW_FPGA_AXI_SPAN - 1 )
    //SDRAM 32000000-35ffffff //64 MB
    #define SDRAM_64_BASE 0x32000000
    #define SDRAM_64_SPAN 0x3FFFFFF
    //SDRAM 36000000-36ffffff //16 MB
    #define SDRAM_16_BASE 0x36000000
    #define SDRAM_16_SPAN 0xFFFFFF
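
    The span and mask defined above are used further down to turn the physical address of a component into an offset inside the mapped window. The sketch below works through that calculation for the lightweight bridge. The two base addresses are the values I would expect from hps.h in the Altera hwlib for the Cyclone V (0xFC000000 for ALT_STM_OFST and 0xFF200000 for ALT_LWFPGASLVS_OFST); treat them as an assumption and check them against your own header. The EX_ prefix and the function name are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, mirroring hps.h of the Altera hwlib for Cyclone V. */
#define EX_ALT_STM_OFST        0xFC000000u  /* start of the mapped window  */
#define EX_ALT_LWFPGASLVS_OFST 0xFF200000u  /* lightweight bridge base     */
#define EX_HW_REGS_SPAN        0x04000000u  /* 64 MB window, as in the article */
#define EX_HW_REGS_MASK        (EX_HW_REGS_SPAN - 1u)

/* Offset of a lightweight-bridge component inside the mapped window.
 * component_base is the address QSys assigned to the component. */
uint32_t lw_window_offset(uint32_t component_base)
{
    uint32_t phys   = EX_ALT_LWFPGASLVS_OFST + component_base;
    uint32_t offset = phys & EX_HW_REGS_MASK;

    /* The window starts at ALT_STM_OFST, so base + offset must give
     * back the physical address of the component. */
    assert(EX_ALT_STM_OFST + offset == phys);
    return offset;
}
```

    For a component at the hypothetical QSys address 0x10 the offset is 0x03200010, which is then added to the pointer returned by mmap().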

    The next step is to define the pointers to the start addresses of the components and configure the memory mapping:

    int main() {
    	//pointer to the different address spaces
    	void *virtual_base;
    	void *axi_virtual_base;
    	int fd;
    	void *h2p_lw_reg1_addr;
    	void *h2p_lw_reg2_addr;
    	void *h2p_lw_reg3_addr;
    	//void *h2p_lw_myBus_addr;
    	void *h2p_led_addr; //led via AXI master
    	void *h2p_rom_addr; //scratch space via AXI master 64kb
    	void *h2p_rom2_addr;
    	void *sdram_64MB_add;
    	void *sdram_16MB_add;
    	// map the address space for the LED registers into user space so we can interact with them.
    	// we'll actually map in the entire CSR span of the HPS since we want to access various registers within that span
    	if( ( fd = open( "/dev/mem", ( O_RDWR | O_SYNC ) ) ) == -1 ) {
    		printf( "ERROR: could not open \"/dev/mem\"...\n" );
    		return( 1 );
    	}
    	//lightweight HPS-to-FPGA bridge
    	virtual_base = mmap( NULL, HW_REGS_SPAN, ( PROT_READ | PROT_WRITE ), MAP_SHARED, fd, HW_REGS_BASE );
    	if( virtual_base == MAP_FAILED ) {
    		printf( "ERROR: mmap() failed...\n" );
    		close( fd );
    		return( 1 );
    	}
    	//HPS-to-FPGA bridge
    	axi_virtual_base = mmap( NULL, HW_FPGA_AXI_SPAN, ( PROT_READ | PROT_WRITE ), MAP_SHARED, fd, ALT_AXI_FPGASLVS_OFST );
    	if( axi_virtual_base == MAP_FAILED ) {
    		printf( "ERROR: axi mmap() failed...\n" );
    		close( fd );
    		return( 1 );
    	}

    LED configuration

    The following code switches some of the LEDs connected to the LED bus ON and the others OFF. This is done by writing to the address of the LEDs through the HPS-to-FPGA bridge:

    	//configure the LEDs of the Golden Reference design
    	printf( "\n\n\n-----------Set the LEDs on-------------\n\n" );
    	//LED connected to the HPS-to-FPGA bridge
    	h2p_led_addr=axi_virtual_base + ( ( unsigned long  )( 0x0 + PIO_LED_BASE ) & ( unsigned long)( HW_FPGA_AXI_MASK ) );
    	*(uint32_t *)h2p_led_addr = 0b10111100;

    Running the code gives the pattern ON|OFF|ON|ON|ON|ON|OFF|OFF on the LEDs.
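
    To see how the written value maps to the LED pattern, the small helper below prints one entry per bit, starting with the most significant bit. It is only an illustration: the function name is made up, and the assumption that bit 7 corresponds to the first LED in the list is taken from the pattern quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns a string such as "ON|OFF|..." for bits 7 down to 0 of value.
 * The result lives in a static buffer and is overwritten by the next call. */
const char *led_pattern(uint8_t value)
{
    static char buf[32];   /* 8 entries of at most 3 chars + 7 separators + NUL */

    buf[0] = '\0';
    for (int bit = 7; bit >= 0; bit--) {
        strcat(buf, ((value >> bit) & 1u) ? "ON" : "OFF");
        if (bit > 0)
            strcat(buf, "|");
    }
    assert(strlen(buf) < sizeof buf);
    return buf;
}
```

    The value 0b10111100 is 0xBC in hexadecimal, and led_pattern(0xBC) returns the string ON|OFF|ON|ON|ON|ON|OFF|OFF.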

    Add two numbers in the FPGA

    The next implementation is the hardware adder. Here the HPS writes into the two input registers (“reg1”, “reg2”) of the adder implemented in the FPGA and reads back the output register (“reg3”) with the result. The result is calculated outside the MCU!
    The read and write to the registers happens via the lightweight HPS-to-FPGA bridge.

    	//Adder test: Two registers are connected to an adder, which places the result in the third register
    	printf( "\n\n\n-----------Add two numbers in the FPGA-------------\n\n" );
    	//the address of the two input (reg1 and reg2) registers and the output register (reg3)
    	h2p_lw_reg1_addr=virtual_base + ( ( unsigned long  )( ALT_LWFPGASLVS_OFST + PIO_REG1_BASE ) & ( unsigned long)( HW_REGS_MASK ) );
    	h2p_lw_reg2_addr=virtual_base + ( ( unsigned long  )( ALT_LWFPGASLVS_OFST + PIO_REG2_BASE ) & ( unsigned long)( HW_REGS_MASK ) );
    	h2p_lw_reg3_addr=virtual_base + ( ( unsigned long  )( ALT_LWFPGASLVS_OFST + PIO_REG3_BASE ) & ( unsigned long)( HW_REGS_MASK ) );
    	//write into register to test the adder
    	*(uint32_t *)h2p_lw_reg1_addr = 10;
    	*(uint32_t *)h2p_lw_reg2_addr = 5;
    	//read result of the adder from register 3
    	printf( "Adder result:%d + %d = %d\n", *((uint32_t *)h2p_lw_reg1_addr), *((uint32_t *)h2p_lw_reg2_addr), *((uint32_t *)h2p_lw_reg3_addr) );

    Running the system gives the following result on the terminal:

    -----------Add two numbers in the FPGA-------------

    Adder result:10 + 5 = 15

    Write to the on chip RAM

    The next piece of code writes to the two On Chip RAM modules created in the FPGA. This is done via the HPS-to-FPGA bridge.

    	//prepare the on chip memory devices
    	printf( "\n\n\n-----------write on chip RAM-------------\n\n" );
    	//ONCHIP_MEMORY2_0_BASE connected via the HPS-to-FPGA bridge
    	h2p_rom_addr=axi_virtual_base + ( ( unsigned long  )( ONCHIP_MEMORY2_0_BASE ) & ( unsigned long)( HW_FPGA_AXI_MASK ) );
    	h2p_rom2_addr=axi_virtual_base + ( ( unsigned long  )( ONCHIP_MEMORY2_1_BASE ) & ( unsigned long)( HW_FPGA_AXI_MASK ) );
    	//write some data to the scratch disk
    	for (int i=0;i<16000;i++){
    		*((uint32_t *)h2p_rom_addr+i)=i*1024;
    		*((uint32_t *)h2p_rom2_addr+i)=i+3;
    	}
    	printf( "Print scratch disks:\n" );
    	printf( "ROM1 \t ROM2\n");
    	for (int i=0;i<10;i++){
    		printf( "%d\t%d\n", *((uint32_t *)h2p_rom_addr+i),*((uint32_t *)h2p_rom2_addr+i) );
    	}

    The result on the terminal

    -----------write on chip RAM-------------

    Print scratch disks:
    0 3
    1024 4
    2048 5
    3072 6
    4096 7
    5120 8
    6144 9
    7168 10
    8192 11
    9216 12

    DMA RAM to RAM

    Next, the DMA controller is configured via the lightweight HPS-to-FPGA bridge, which should be used for component configuration tasks. The DMA controller then automatically transfers the content of on-chip RAM1 to RAM2.

    	printf( "\n\n\n-----------DMA RAM to RAM-------------\n\n" );
    	//print the content of scratchdisk 1 and 2
    	printf( "Print scratch disk 1 and 2:\n" );
    	for (int i=0;i<10;i++){
    		printf( "%d\t%d\n", *((uint32_t *)h2p_rom_addr+i),*((uint32_t *)h2p_rom2_addr+i) );
    	}
    	//create a pointer to the DMA controller base
    	h2p_lw_dma_addr=virtual_base + ( ( unsigned long  )( ALT_LWFPGASLVS_OFST + DMA_0_BASE ) & ( unsigned long)( HW_REGS_MASK ) );
    	//configure the DMA controller for transfer
    	_DMA_REG_READ_ADDR(h2p_lw_dma_addr)=ONCHIP_MEMORY2_0_BASE; //read from ROM1
    	_DMA_REG_WRITE_ADDR(h2p_lw_dma_addr)=ONCHIP_MEMORY2_1_BASE; //write to ROM2
    	_DMA_REG_LENGTH(h2p_lw_dma_addr)=4*16000;//write 16000x 4 bytes since we have a 32 bit system
    	//start the transfer
    	_DMA_REG_CONTROL(h2p_lw_dma_addr)=_DMA_CTR_WORD | _DMA_CTR_GO | _DMA_CTR_LEEN;
    	//wait for DMA to be finished
    	stopDMA();//stop the DMA controller
    	//check if data was copied
    	printf( "Print scratch disk 1 and 2:\n" );
    	for (int i=0;i<10;i++){
    		printf( "%d\t %d\n", *((uint32_t *)h2p_rom_addr+i),*((uint32_t *)h2p_rom2_addr+i) );
    	}

    The result on the terminal:

    -----------DMA RAM to RAM-------------

    Print scratch disk 1 and 2:
    0 3
    1024 4
    2048 5
    3072 6
    4096 7
    5120 8
    6144 9
    7168 10
    8192 11
    9216 12
    DMA Registers:
    status: 2
    read: 18f8
    write: 22508
    length: c868
    control: 8c
    DMA Status Registers:
    Status: BUSY
    Print scratch disk 1 and 2:
    0 0
    1024 1024
    2048 2048
    3072 3072
    4096 4096
    5120 5120
    6144 6144
    7168 7168
    8192 8192
    9216 9216
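
    The register dump above can be decoded by hand. The sketch below rebuilds the control word and decodes the status value in C. The bit positions are an assumption based on the register map of the Altera DMA controller core as I remember it from the Embedded Peripherals IP documentation; the macros in dma.h may be defined differently, so the EX_ names here are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed bit positions of the DMA controller control register. */
#define EX_DMA_CTR_BYTE  (1u << 0)   /* 8-bit transfers                  */
#define EX_DMA_CTR_WORD  (1u << 2)   /* 32-bit transfers                 */
#define EX_DMA_CTR_GO    (1u << 3)   /* start the transfer               */
#define EX_DMA_CTR_LEEN  (1u << 7)   /* stop when length reaches zero    */
#define EX_DMA_CTR_WCON  (1u << 9)   /* write to a constant address      */

/* Assumed bit positions of the status register. */
#define EX_DMA_STAT_DONE (1u << 0)
#define EX_DMA_STAT_BUSY (1u << 1)

/* Control word for the RAM to RAM transfer: WORD | GO | LEEN. */
uint32_t dma_ram_to_ram_control(void)
{
    uint32_t ctrl = EX_DMA_CTR_WORD | EX_DMA_CTR_GO | EX_DMA_CTR_LEEN;

    assert((ctrl & EX_DMA_CTR_BYTE) == 0);   /* only one transfer width set */
    return ctrl;
}

/* Human-readable name for a status register value. */
const char *dma_status_name(uint32_t status)
{
    if (status & EX_DMA_STAT_BUSY)
        return "BUSY";
    if (status & EX_DMA_STAT_DONE)
        return "DONE";
    return "IDLE";
}
```

    Under these assumptions WORD | GO | LEEN is 0x8c, which matches the printed control value, and a status of 2 decodes to BUSY, which matches the status line of the dump.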


    The DMA controller can also be used to write from the on chip RAM to the GPIO via the “myBus”. A clock crossing bridge translates the speed from 50 MHz to 2 MHz.

    	printf( "\n\n\n-----------DMA RAM to PIO-------------\n\n" );
    	//generate some test data where every byte is a different number
    	//the least significant byte is clocked out first to the PIO
    	for (int i=0;i<16000;i++){
    		*((uint32_t *)h2p_rom_addr+i)=(uint32_t)((i*4+3)%256<<24 | (i*4+2)%256<<16 | (i*4+1)%256<<8 | (i*4+0)%256<<0);
    	}
    	//configure the DMA
    	_DMA_REG_LENGTH(h2p_lw_dma_addr)=4*100;//write 100x 4bytes
    	//start the transfer
    	_DMA_REG_CONTROL(h2p_lw_dma_addr)=_DMA_CTR_BYTE | _DMA_CTR_GO | _DMA_CTR_LEEN | _DMA_CTR_WCON;//0b1010001100;
    	//some debug info
    	//wait until the whole transfer is finished
    	stopDMA();//stop the DMA controller

    The result can be logged with a logic analyzer:

    The code saves a pattern of 8-bit values from 0 to 255 into the memory. Then 400 bytes are written from the on-chip RAM to the GPIO. On the logic analyzer this shows up as the values 0-255 followed by 0-143, which together are 400 bytes.
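
    The byte sequence expected on the logic analyzer can be reproduced in software. The sketch below builds the same 32-bit test words as the loop above and extracts the n-th byte of the stream, assuming that the least significant byte of each word is sent first, as stated in the code comment. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Test word number i: four consecutive byte values, lowest in bits 7..0. */
uint32_t pattern_word(uint32_t i)
{
    return ((i * 4u + 3u) % 256u) << 24 |
           ((i * 4u + 2u) % 256u) << 16 |
           ((i * 4u + 1u) % 256u) << 8  |
           ((i * 4u + 0u) % 256u);
}

/* n-th byte of the output stream when each word is sent LSB first. */
uint8_t stream_byte(uint32_t n)
{
    uint32_t word = pattern_word(n / 4u);
    uint8_t  byte = (uint8_t)(word >> (8u * (n % 4u)));

    /* By construction the stream simply counts 0..255 and repeats. */
    assert(byte == (uint8_t)(n % 256u));
    return byte;
}
```

    The first word is 0x03020100, byte 255 of the stream is 255, byte 256 wraps back to 0, and the last of the 400 bytes (index 399) is 143.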


    Scatter Gather DMA controller – DDR RAM to GPIO

    The next test uses the Scatter-Gather DMA controller to read from the 64 MB DDR-RAM area defined before and write the data via “myBus” to the GPIO. The Scatter-Gather DMA controller gets its transfer requests from a linked list terminated by an empty transfer. This linked list is stored in a dedicated 16 MB descriptor area in the DDR-RAM, also created before. This makes it possible to build transfer queues, which the DMA controller processes one after another. The dedicated descriptor area is used because it is complicated to relate physical addresses to user-space memory addresses controlled by the Linux system.

    	//Scatter Gather DMA controller
    	printf( "\n\n------------- Scatter DMA---------------\n\n" );
    	//space for data
    	sdram_64MB_add=mmap( NULL, SDRAM_64_SPAN, ( PROT_READ | PROT_WRITE ), MAP_SHARED, fd, SDRAM_64_BASE );
    	//space for the descriptor
    	sdram_16MB_add=mmap( NULL, SDRAM_16_SPAN, ( PROT_READ | PROT_WRITE ), MAP_SHARED, fd, SDRAM_16_BASE );
    	//create descriptors in the mapped memory
    	struct alt_avalon_sgdma_packed  *sgdma_desc1=sdram_16MB_add;
    	struct alt_avalon_sgdma_packed  *sgdma_desc2=sdram_16MB_add+sizeof(struct alt_avalon_sgdma_packed);
    	struct alt_avalon_sgdma_packed  *sgdma_desc3=sdram_16MB_add+2*sizeof(struct alt_avalon_sgdma_packed);
    	struct alt_avalon_sgdma_packed  *sgdma_desc_empty=sdram_16MB_add+3*sizeof(struct alt_avalon_sgdma_packed);
    	//Address to the physical space
    	void* sgdma_desc1_phys=(void*)SDRAM_16_BASE;
    	void* sgdma_desc2_phys=(void*)SDRAM_16_BASE+sizeof(struct alt_avalon_sgdma_packed);
    	void* sgdma_desc3_phys=(void*)SDRAM_16_BASE+2*sizeof(struct alt_avalon_sgdma_packed);
    	void* sgdma_desc_empty_phys=(void*)SDRAM_16_BASE+3*sizeof(struct alt_avalon_sgdma_packed);
    	//configure the descriptor
    	//map memory of the control register
    	h2p_lw_sgdma_addr=virtual_base + ( ( unsigned long  )( ALT_LWFPGASLVS_OFST + MEMORYDMA_BASE ) & ( unsigned long)( HW_REGS_MASK ) );
    	//fill the data space
    	for (long int i=0;i<100000/*SDRAM_64_SPAN*/;i++){
    		*((uint32_t *)sdram_64MB_add+i)=i;
    	}
    	//init the SGDMA controller
    	//set the address of the descriptor
    	//start the transfer
    	//wait until transfer is complete
    	//stop the core by clearing the run register

    The Scatter-Gather DMA controller writes 10, 20 and 30 bytes from DDR-RAM to the GPIO. This is again monitored with the logic analyzer:
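
    The idea of a descriptor chain that ends in an empty descriptor can be modelled in plain C. The sketch below is a simplified model and not the real descriptor layout: struct ex_desc and all ex_ names are hypothetical and do not match struct alt_avalon_sgdma_packed. It only shows how each descriptor stores the physical address of the next one, while the program reaches the same descriptors through the mapped (virtual) table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical descriptor (NOT the real SGDMA layout). */
struct ex_desc {
    uint32_t read_addr;    /* physical source address of the data     */
    uint32_t next_phys;    /* physical address of the next descriptor */
    uint16_t bytes;        /* number of bytes to transfer             */
    uint8_t  owned_by_hw;  /* 1 = valid transfer, 0 = empty terminator */
};

#define EX_DESC_PHYS_BASE 0x36000000u  /* descriptor area, as in the article */

/* Physical address of descriptor number idx. */
uint32_t ex_desc_phys(size_t idx)
{
    return EX_DESC_PHYS_BASE + (uint32_t)(idx * sizeof(struct ex_desc));
}

/* Translate a physical descriptor address into the mapped table. */
struct ex_desc *ex_desc_from_phys(struct ex_desc *table, size_t count,
                                  uint32_t phys)
{
    size_t idx = (phys - EX_DESC_PHYS_BASE) / sizeof(struct ex_desc);

    assert(phys >= EX_DESC_PHYS_BASE && idx < count);
    return &table[idx];
}

/* Fill table[0..n-1] with transfers and table[n] with the terminator. */
void ex_build_chain(struct ex_desc *table, const uint16_t *lengths,
                    size_t n, uint32_t data_phys)
{
    uint32_t src = data_phys;

    for (size_t i = 0; i < n; i++) {
        table[i].read_addr   = src;
        table[i].next_phys   = ex_desc_phys(i + 1);
        table[i].bytes       = lengths[i];
        table[i].owned_by_hw = 1;
        src += lengths[i];
    }
    table[n] = (struct ex_desc){0};   /* empty descriptor ends the chain */
}

/* Follow the chain the way the controller would; returns total bytes. */
uint32_t ex_walk_chain(struct ex_desc *table, size_t count)
{
    uint32_t total = 0;
    struct ex_desc *d = &table[0];

    while (d->owned_by_hw) {
        total += d->bytes;
        d = ex_desc_from_phys(table, count, d->next_phys);
    }
    return total;
}

/* Three transfers of 10, 20 and 30 bytes, as in the test above. */
uint32_t ex_demo_total(void)
{
    struct ex_desc table[4];
    const uint16_t lengths[3] = {10, 20, 30};

    ex_build_chain(table, lengths, 3, 0x32000000u);
    return ex_walk_chain(table, 4);
}
```

    Walking the chain built by ex_demo_total() visits three descriptors and stops at the fourth, empty one, for a total of 60 bytes.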


    The last step is to clean up the mapped memory:

    	// clean up our memory mapping and exit
    	if( munmap( sdram_64MB_add, SDRAM_64_SPAN ) != 0 ) {
    		printf( "ERROR: munmap() failed...\n" );
    		close( fd );
    		return( 1 );
    	}
    	if( munmap( sdram_16MB_add, SDRAM_16_SPAN ) != 0 ) {
    		printf( "ERROR: munmap() failed...\n" );
    		close( fd );
    		return( 1 );
    	}
    	if( munmap( virtual_base, HW_REGS_SPAN ) != 0 ) {
    		printf( "ERROR: munmap() failed...\n" );
    		close( fd );
    		return( 1 );
    	}
    	if( munmap( axi_virtual_base, HW_FPGA_AXI_SPAN ) != 0 ) {
    		printf( "ERROR: axi munmap() failed...\n" );
    		close( fd );
    		return( 1 );
    	}
    	close( fd );
    	return( 0 );
    }

    To get the code:

    git clone

