Really slow data transfers / cache flush to memory problem

Crank
Posts: 3
Joined: Fri Jun 29, 2018 5:50 pm

Really slow data transfers / cache flush to memory problem

Tue Jul 03, 2018 10:28 am

Hi,

I'm working on a project for the SAMA5D2. I have a problem with the amount of CPU time spent on simple data transfers.

Each of these if-statements seems to take about 150 cycles; that's 600 cycles in total for just a couple of very simple memory accesses!

-----------------------------------------------
if (UART0->UART_SR & (1 << 1))
UART0->UART_THR = 0b01010101;

if (FLEXCOM4->usart.US_CSR & (1 << 1));
FLEXCOM4->usart.US_THR = 0b01010101;

if (FLEXCOM3->usart.US_CSR & (1 << 1));
FLEXCOM3->usart.US_THR = 0b01010101;

if (UART1->UART_SR & (1 << 1))
UART1->UART_THR = 0b01010101;
-----------------------------------------------

The alternative would be a DMA transfer. For that to work, the cache region has to be cleaned first, so that the data has been written from the L1 cache to DDR memory before the DMA accesses it. If I clean 64 bytes (2 cache lines) with the method below, it takes 390 cycles, just to clean two 32-byte cache lines! Eventually I will need to transfer a lot more data than that, so cache cleaning would become a really expensive operation.

Here's the cache cleaning method I'm using:

-----------------------------------------------
// Ensure that cache content has been written into DDR memory
// Call before DMA transfer starts
// SHOULD be non-blocking but doesn't seem to be?

inline __attribute__((always_inline)) void Cache_Clean_Region_Non_Blocking(const void* p_start, uint32_t length)
{
assert(length);

const uint32_t end_addr = uint32_t(p_start) + length;
uint32_t mva = uint32_t(p_start) & ~(L1_CACHE_BYTES-1);

do
{
asm("mcr p15, 0, %0, c7, c10, 1" :: "r" (mva)); // DCCMVAC
mva += L1_CACHE_BYTES;
} while (mva < end_addr);
}
-----------------------------------------------
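As a rough cross-check on the numbers above: the loop issues one DCCMVAC per 32-byte line touched by the region, so a 64-byte buffer that is not line-aligned spans three lines rather than two. The sketch below is only an illustration under stated assumptions. The names `kLineBytes`, `lines_touched` and `clean_region` are hypothetical, and the line size is assumed to be 32 bytes as mentioned above. When built for an ARMv7-A target it cleans each line with a "memory" clobber (so the compiler does not move buffer stores past the clean) and ends with a DSB, which the architecture requires before the maintenance operations can be relied on as complete. On any other target the instructions are compiled out and only the address arithmetic runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 32-byte L1 data cache lines.
constexpr std::uintptr_t kLineBytes = 32;

// Number of cache lines touched by the region [start, start + length).
constexpr std::size_t lines_touched(std::uintptr_t start, std::size_t length)
{
    return length == 0
        ? 0
        : ((start + length - 1) / kLineBytes) - (start / kLineBytes) + 1;
}

// Cleans every line of the region and returns how many lines were visited.
inline std::size_t clean_region(const void* p_start, std::size_t length)
{
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(p_start);
    const std::size_t count = lines_touched(start, length);
    std::uintptr_t mva = start & ~(kLineBytes - 1);

    for (std::size_t i = 0; i < count; ++i, mva += kLineBytes) {
#if defined(__arm__) && (__ARM_ARCH >= 7) && (__ARM_ARCH_PROFILE == 'A')
        // DCCMVAC: clean data cache line by MVA to the point of coherency.
        asm volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(mva) : "memory");
#endif
    }

#if defined(__arm__) && (__ARM_ARCH >= 7) && (__ARM_ARCH_PROFILE == 'A')
    // Wait for the outstanding clean operations before DMA is started.
    asm volatile("dsb" ::: "memory");
#endif
    return count;
}
```

Aligning DMA buffers to the line size (for example with `alignas(32)`) keeps the number of maintenance operations at the minimum for a given length.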

The last method I've tried is writing the data to non-cached internal SRAM before the DMA transfer, but this is even slower than the method above. What can be done to make these transfers much less CPU-intensive?
blue_z
Location: USA
Posts: 1770
Joined: Thu Apr 19, 2007 10:15 pm

Re: Really slow data transfers / cache flush to memory problem

Fri Jul 06, 2018 12:05 am

Crank wrote: Each of these if-statements seems to take about 150 cycles; that's 600 cycles in total for just a couple of very simple memory accesses!

-----------------------------------------------
if (UART0->UART_SR & (1 << 1))
UART0->UART_THR = 0b01010101;

if (FLEXCOM4->usart.US_CSR & (1 << 1));
FLEXCOM4->usart.US_THR = 0b01010101;

if (FLEXCOM3->usart.US_CSR & (1 << 1));
FLEXCOM3->usart.US_THR = 0b01010101;

if (UART1->UART_SR & (1 << 1))
UART1->UART_THR = 0b01010101;
-----------------------------------------------
Your claim of execution times is dubious because you provide no methodology and the code makes little or no sense. Two of these if statements are no-ops.
Memory-mapped I/O is not "simple memory accesses".
(You need to learn how to use the code tags so that formatting is preserved.)
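To illustrate the no-op remark: a semicolon directly after the condition ends the if statement with an empty body, so the write on the next line runs unconditionally. A minimal host-side sketch, using a hypothetical `FakeUart` struct in place of the real peripheral registers (it is not the SAMA5D2 register layout):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a UART register block.
struct FakeUart {
    volatile std::uint32_t status = 0;  // bit 1 plays the role of "transmitter ready"
    volatile std::uint32_t thr = 0;     // transmit holding register
};

constexpr std::uint32_t kTxReady = 1u << 1;

// Intended form: write only when the transmitter reports ready.
inline void send_if_ready(FakeUart& u, std::uint32_t value)
{
    if (u.status & kTxReady)
        u.thr = value;
}

// Form with the stray semicolon: the if has an empty body,
// so the write below always executes.
inline void send_with_stray_semicolon(FakeUart& u, std::uint32_t value)
{
    if (u.status & kTxReady);
    u.thr = value;
}
```

With `status` left at zero, `send_if_ready` leaves `thr` untouched, while `send_with_stray_semicolon` writes it anyway. Compilers typically flag the second form when extra warnings are enabled.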



Crank wrote: What can be done to make these transfers much less CPU-intensive?
Apparently you're running a standalone program without an OS?
Why are you even concerned about these timings?
The speed of serial communications is essentially constrained by the baudrate.
Is your standalone program capable of multitasking, or is it waiting for this I/O to complete?

Regards
Crank
Posts: 3
Joined: Fri Jun 29, 2018 5:50 pm

Re: Really slow data transfers / cache flush to memory problem

Fri Jul 06, 2018 12:14 pm

blue_z wrote:
Fri Jul 06, 2018 12:05 am
Why are you even concerned about these timings?
The speed of serial communications is essentially constrained by the baudrate.
Is your standalone program capable of multitasking, or is it waiting for this I/O to complete?
I'm running a bare-metal system.

When I measure the number of cycles those four if-statements consume, it's about 600 cycles, i.e. they stall the whole program for that time.

I measured it by adding those lines in the middle of the main loop and comparing the number of CPU cycles the loop takes with and without them. During testing nothing else in the system used those UARTs/FLEXCOMs, i.e. nothing waits for any of the UART/FLEXCOM operations to finish.
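A minimal sketch of how such a before/after measurement could be isolated to just the code under test. It assumes the PMU cycle counter (PMCCNTR) has already been enabled elsewhere, and the names `read_cycles` and `measure_cycles` are hypothetical. On non-ARM builds it falls back to a nanosecond clock so the harness can be tried on a host, where only differences are meaningful. The cost of two back-to-back counter reads is subtracted so that the read overhead is not attributed to the measured code.

```cpp
#include <cassert>
#include <cstdint>

#if !(defined(__arm__) && (__ARM_ARCH >= 7) && (__ARM_ARCH_PROFILE == 'A'))
#include <chrono>
#endif

// Reads a free-running 32-bit counter.
inline std::uint32_t read_cycles()
{
#if defined(__arm__) && (__ARM_ARCH >= 7) && (__ARM_ARCH_PROFILE == 'A')
    std::uint32_t value;
    // PMCCNTR: PMU cycle count register (assumed to be enabled already).
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    // Host fallback: nanoseconds truncated to 32 bits, not CPU cycles.
    using namespace std::chrono;
    return static_cast<std::uint32_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
#endif
}

// Runs fn once and returns the counter ticks it took, minus read overhead.
// Unsigned subtraction keeps the result valid across one counter wrap.
template <typename Counter, typename Fn>
std::uint32_t measure_cycles(Counter counter, Fn fn)
{
    const std::uint32_t t0 = counter();
    const std::uint32_t t1 = counter();  // back-to-back reads give the overhead
    fn();
    const std::uint32_t t2 = counter();

    const std::uint32_t overhead = t1 - t0;
    const std::uint32_t elapsed = t2 - t1;
    return elapsed > overhead ? elapsed - overhead : 0;
}
```

On the target this would be called as `measure_cycles(read_cycles, [] { /* the four if-statements */ })`, ideally several times, since the first pass may include instruction cache misses.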
