FX3: data loss from GPIF-II to USB 3.0 even with 2 threads

Anonymous
Not applicable

Dear all,

   

I am currently testing data transfer with the FX3, from GPIF-II to USB 3.0. I took the GpifToUsb project as a basis. Since I would like a clock of 32 MHz, I reduced the PCLK clock (clkdiv=12 instead of 4), and I work with the C# Streamer application (a version modified to do data logging).

   

First, I noticed that 4 buffers of 32 KB do not work properly: the throughput is very low. To solve this problem, I reduced the buffers to 4x8 KB and got almost the right throughput (32 bits x 32 MHz -> 128 MB/s; I got about 120-125 MB/s).
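As a quick check of the arithmetic above, here is a minimal sketch of the theoretical GPIF-II input rate. The function name is made up for illustration, and 1 MB is taken as 10^6 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Theoretical GPIF-II input rate in MB/s (1 MB = 10^6 bytes, an assumption):
 * one bus-wide word is sampled on every PCLK cycle. */
uint32_t gpif_rate_mbytes_per_s(uint32_t bus_width_bits, uint32_t pclk_mhz)
{
    assert(bus_width_bits % 8 == 0);          /* bus width must be whole bytes */
    return (bus_width_bits / 8) * pclk_mhz;   /* bytes per cycle * 10^6 cycles/s */
}
```

With a 32-bit bus this gives 128 MB/s at 32 MHz and 64 MB/s at 16 MHz, the two targets discussed in this thread.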

   

So, my first question is: why does the system not work well with large buffers but work well with small ones? It is really weird, because I think it should be the opposite...

   

Then I tried connecting an 8-bit counter to 8 GPIF pins to see whether I was losing data. Because of latency and long wires (I am using a breadboard), I had to lower PCLK to 16 MHz. Then, to get a throughput of about 64 MB/s, I had to select buffers of 4x2 KB, because 4x8 KB gave me a very low throughput...
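A host-side check of such a counter capture could look like the sketch below. It assumes the counter byte has already been extracted from each 32-bit word into an array; count_missing_samples is a hypothetical helper, not part of Streamer or the FX3 SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count samples missing from a capture of a free-running 8-bit counter.
 * Each element of `samples` is the counter byte taken from one bus word.
 * A gap of k missing samples shows up as a step of (k + 1) modulo 256, so
 * gaps of 255 or more samples alias and cannot be measured this way. */
size_t count_missing_samples(const uint8_t *samples, size_t n)
{
    size_t missing = 0;
    for (size_t i = 1; i < n; ++i) {
        /* Difference modulo 256 handles the 255 -> 0 wrap of the counter. */
        uint8_t step = (uint8_t)(samples[i] - samples[i - 1]);
        missing += (uint8_t)(step - 1);   /* step == 1 means nothing was lost */
    }
    return missing;
}
```

A jump from 11 to 40 in the counter value, for example, is reported as 28 missing samples.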

   

So, with the original firmware, I lose 28 words of 32 bits every time a buffer is full. This seems normal, given that there is only 1 thread/socket and there is latency while switching buffers (as explained in AN75779).
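To put that loss in perspective, here is a rough sketch of the fraction of source words that survive when a fixed number of words is dropped at every buffer switch. It assumes the source keeps clocking one word per PCLK cycle during the switch, so dropped words are gone rather than delayed; captured_ppm is a hypothetical helper, not an SDK function.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of source words that reach the buffers, in parts per million,
 * when `lost_words` are dropped at every buffer switch.
 * Assumption: the source never pauses, so each buffer period consists of
 * words_per_buffer stored words followed by lost_words dropped words. */
uint32_t captured_ppm(uint32_t buffer_bytes, uint32_t word_bytes, uint32_t lost_words)
{
    assert(word_bytes > 0 && buffer_bytes >= word_bytes);
    uint64_t words_per_buffer = buffer_bytes / word_bytes;
    uint64_t period = words_per_buffer + lost_words;
    return (uint32_t)((words_per_buffer * 1000000u) / period);
}
```

For 2 KB buffers of 32-bit words, 28 lost words per switch means roughly 5% of the data is dropped, while 1 lost word per switch is about 0.2%.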

   

So I made some modifications in order to use 2 sockets/threads to transfer data from GPIF-II to the FX3 memory. Here is the code, using CyU3PDmaMultiChannelConfig_t dmaMultiCfg; (instead of CyU3PDmaChannelConfig_t dmaCfg;):

   

    /* Four buffers of 2 KB each. */
    dmaMultiCfg.size  = 2048;
    dmaMultiCfg.count = 4;
    /* Two producer sockets (one per GPIF thread) feeding one USB consumer socket. */
    dmaMultiCfg.validSckCount = 2;
    dmaMultiCfg.prodSckId [0] = (CyU3PDmaSocketId_t)CY_U3P_PIB_SOCKET_0;
    dmaMultiCfg.prodSckId [1] = (CyU3PDmaSocketId_t)CY_U3P_PIB_SOCKET_1;
    dmaMultiCfg.consSckId [0] = (CyU3PDmaSocketId_t)CY_FX_EP_CONSUMER_SOCKET;
    dmaMultiCfg.prodAvailCount = 0;
    dmaMultiCfg.dmaMode = CY_U3P_DMA_MODE_BYTE;
    dmaMultiCfg.prodHeader = 0;
    dmaMultiCfg.prodFooter = 0;
    dmaMultiCfg.consHeader = 0;
    dmaMultiCfg.notification = CY_U3P_DMA_CB_CONS_SUSP;
    dmaMultiCfg.cb = GpifToUsbDmaMultiCallback;

    /* Create the AUTO many-to-one channel. */
    apiRetStatus = CyU3PDmaMultiChannelCreate (&glDmaMultiChHandle, CY_U3P_DMA_TYPE_AUTO_MANY_TO_ONE, &dmaMultiCfg);
    if (apiRetStatus != CY_U3P_SUCCESS)
    {
        CyU3PDebugPrint (4, "CyU3PDmaMultiChannelCreate failed, Error code = %d\n", apiRetStatus);
        CyFxAppErrorHandler(apiRetStatus);
    }

    /* Set DMA Channel transfer size */
    apiRetStatus = CyU3PDmaMultiChannelSetXfer (&glDmaMultiChHandle, CY_FX_GPIFTOUSB_DMA_TX_SIZE, 0);
    if (apiRetStatus != CY_U3P_SUCCESS)
    {
        CyU3PDebugPrint (4, "CyU3PDmaMultiChannelSetXfer failed, Error code = %d\n", apiRetStatus);
        CyFxAppErrorHandler(apiRetStatus);
    }

   

As you can see, I basically modified the code to use the "DmaMulti" objects instead of the "Dma" ones. For the state machine, I made some adaptations. There are 2 threads: the 1st thread fills in data until the !DMA_RDY_TH0 flag is set, then the state machine switches to the 2nd thread until the !DMA_RDY_TH1 flag is set. Please see the attached file.

   

With this configuration, I get better results, but not perfect ones. Instead of losing 28 words of 32 bits when the GPIF-II switches buffers, I lose only 1 word of 32 bits.

   

So I have 2 issues:

   

- Why do I have to decrease the buffer size in order to get a correct transfer?

   

- Why do I lose 1 word of 32 bits when I use 2 threads?

   

Thank you in advance,

   

Best regards,

   

Christian

1 Solution
Anonymous
Not applicable

Hi!

Good news! I managed to get the correct throughput! With 16 MHz and 32 bits, I got EXACTLY 64 MB/s! I solved my issue by using a counter in the state machine to switch buffers, instead of using the DMA_RDY_THx flags, which are set 1 clock cycle too late! I checked the data I acquired and there is no missing data anymore. OK, using a state machine counter is not the most elegant way to do it, but at least it works!
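The difference between the two switching strategies can be illustrated with a toy cycle-by-cycle model. This is only a sketch under stated assumptions, not the GPIF-II state machine itself: one word arrives per clock, the other thread's buffer is always empty when switched to, and a flag latency of 0 stands in for counter-based switching. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of buffer switching between two GPIF threads.
 * flag_latency_cycles = 1 models a "buffer full" flag seen one clock late;
 * flag_latency_cycles = 0 models a counter that switches exactly on time.
 * Returns the number of words written into an already-full buffer (lost). */
uint32_t simulate_lost_words(uint32_t total_words, uint32_t words_per_buffer,
                             uint32_t flag_latency_cycles)
{
    assert(words_per_buffer > 0);
    uint32_t lost = 0;
    uint32_t fill = 0;   /* words stored in the active thread's buffer */
    uint32_t late = 0;   /* cycles spent writing after the buffer filled */

    for (uint32_t cycle = 0; cycle < total_words; ++cycle) {
        if (fill == words_per_buffer) {
            if (late < flag_latency_cycles) {
                /* Full condition not visible yet: this write overruns. */
                ++lost;
                ++late;
                continue;
            }
            /* Switch to the other thread (assumed to have an empty buffer). */
            fill = 0;
            late = 0;
        }
        ++fill;
    }
    return lost;
}
```

With 512-word (2 KB) buffers, the model loses one word per buffer when the flag is one cycle late and none when switching is driven by an exact count.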

Now I will try to interface a quad 24-bit delta-sigma A/D converter and see whether I can control it with the FX3 state machine alone or whether I will need to add a CPLD as master! I'm so happy!!!

Best,

Christian

9 Replies
Anonymous
Not applicable

Hi!

   

I have also run into the same problem: lost data!
My FPGA data rate is 44 MB/s. First I store the data in a 2K-deep FIFO (32 bits wide), then I read the FIFO with a 90 MHz clock into the GPIF (32 bits) and use a DMA thread of the CYUSB3014 to send the data to the USB bus. Finally, I tried to use pBulkEpIn->XferData(pBulkBuf, nBulkLen) to capture the USB data. nBulkLen is 4*1024; when I use any other nBulkLen, I can't get data. The result is that I only get about 30 MB/s on the PC.

Anonymous
Not applicable

Hi!

   

Glad to know that I am not the only one who is running into issues! With certain buffer sizes, I can get almost all my data, but I lose 1 sample every time the DMA has to change buffers, even though I use 2 DMA threads as suggested in the UVC example!

   

I am back from vacation and I will try to find out why! If anyone has a suggestion about my first problem (loss of 1 sample) or my second problem (same as Cngaoxf_2607281), it would be nice!

   

Best,

   

Christian

Anonymous
Not applicable

I implemented the PIB error handling that I found in AN65974 (p. 20 of 68) and I get CYU3P_PIB_ERR_THR0_WR_OVERRUN and CYU3P_PIB_ERR_THR1_WR_OVERRUN, or only one of them if I use only 1 thread.

   

How can I solve this problem?

   

I saw that I have to set DMA descriptors in the firmware, but I can't find any example, even in the "cyfxbulklpautomanytoone" example... Could this be the reason I always lose 1 sample when I switch DMA buffers?

   

EDIT: OK, according to what I read, the DMA descriptors are set up when the DMA channel is created, and this complexity is hidden from the user. Is that right?

   

Since I use DMA_RDY_TH0/1 as the flag to move to the next state of the state machine, I suspect there is latency before the flag is set. The producer then tries to write into a buffer that is already full (that is why I get a WRITE_OVERRUN error and lose 1 sample), and the next buffer starts being filled 1 clock too late.

   

About the throughput dropping very low when I choose a large DMA buffer: I saw on my oscilloscope that the DMA_RDY_TH0/1 flag is mostly low with some short 'peaks' at high state (which means the buffer has been filled and the flag goes high while the buffer is emptied by the consumer), which is fine... But after a few transfers, the same flag gets stuck high, which means the buffer cannot be filled, and that is why I get such a low throughput... I guess I need to fix this first before going further...

   

Could any Cypress employee help me? It would be nice!

KandlaguntaR_36
Moderator
1. As you know, throughput depends on various factors, as described in AN86947 (http://www.cypress.com/documentation/application-notes/an86947-optimizing-usb-30-throughput-ez-usb-f...). You said that the throughput increased when you reduced the buffer size. What burst size, packets per Xfer, and Xfers to queue did you set with the 4x32 KB DMA buffers at 32 MHz PCLK? Did you keep the same values when you switched to 4x8 KB buffers? With the 4x32 KB buffers, did you check the throughput while varying the packets per Xfer and Xfers to queue values? If so, was there any improvement?

2. Yes, you have found the answer for the data loss of 1 word (32 bits).
Anonymous
Not applicable

Dear Srdr,

   

I replied to you, but without using the "reply" button (there was a bug when I tried it that way). So I am posting this message to make sure you get a notification!

   

Best,

   

Christian

Anonymous
Not applicable

Dear Srdr,

   

Thank you very much for your message!

   

Yes, I know the throughput depends on many factors. In my test, I do not need the maximum throughput, but I do need to get 64 MB/s with 32 bits at 16 MHz, or 128 MB/s with 32 bits at 32 MHz, because this is the amount of data that I have to get from the GPIF-II.

I can almost achieve this throughput, depending on the parameters; I guess the difference is due to the lost data (the state machine transitions are made with the DMA_RDY_THx flags). But with some parameters, the buffers are not emptied for a long time, which causes the throughput to drop. I took some pictures of the oscilloscope screen to show you what I mean. The yellow trace is the DMA_RDY_TH0 flag and the blue one is the DMA_RDY_TH1 flag.

Here are the results with Streamer (Packets per Xfer is set to 256 and Xfers to Queue is set to 64; I get almost the same results with lower values):

   

16 MHz, 32 bits (theoretical throughput = 64 MB/s):
4x2KB -> 62'300 KB/s (see picture 16MHz 4x2048), everything seems to be OK
4x4KB -> 9'000 KB/s (see pictures 16MHz 4x4096 and 16MHz 4x4096 zoom); everything is OK for some time, then the buffers stall for a long time, which is the cause of the very low throughput. Do you know why?
4x8KB -> 12'200 KB/s (buffer issues)
4x16KB -> 18'000 KB/s (buffer issues)
4x32KB -> does not work

32 MHz, 32 bits (theoretical throughput = 128 MB/s):
4x2KB -> 124'700 KB/s (OK)
4x4KB -> 124'800 KB/s (OK)
4x8KB -> 124'900 KB/s (OK)
4x16KB -> 24'500 KB/s (buffer issues)
4x32KB -> does not work

   

I can work with 4x32KB only with 1 thread; as soon as I try 2 threads in 'many-to-one' mode, I can reach only 4x16KB.

   

I hope you will be able to help me!

   

Thanks in advance.

   

Christian

   


EDITED:

Christian,

1. Lower throughput with larger buffer size (the first issue in this thread):

I reproduced your test and see the same results as you.

With the reduced clock frequency (32 MHz), a 32 KB buffer takes a long time to fill, and the USB side waits for the buffer until then.

If you decrease the buffer size to 8 KB and increase the buffer count to 16 to keep the same total buffer space, you can achieve around 123 MB/s.

Here I used a one-to-one channel only.

    dmaCfg.size  = 8192; /* CY_FX_DMA_BUF_SIZE */
    dmaCfg.count = 16;   /* CY_FX_DMA_BUF_COUNT */
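The fill-time argument can be made concrete with a small sketch. It assumes one bus-wide word per PCLK cycle and no stalls; buffer_fill_time_us is a hypothetical helper, not an SDK call.

```c
#include <assert.h>
#include <stdint.h>

/* Time, in microseconds, for the GPIF side to fill one DMA buffer,
 * assuming one bus-wide word is written on every PCLK cycle.
 * bytes / (bytes per cycle * MHz) yields microseconds directly. */
uint32_t buffer_fill_time_us(uint32_t buffer_bytes, uint32_t bus_width_bits,
                             uint32_t pclk_mhz)
{
    assert(bus_width_bits >= 8 && bus_width_bits % 8 == 0 && pclk_mhz > 0);
    return buffer_bytes / ((bus_width_bits / 8) * pclk_mhz);
}
```

Under these assumptions, a 32 KB buffer on a 32-bit bus at 32 MHz takes 256 us to fill before the USB side can start sending it, against 64 us for an 8 KB buffer.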

2. Losing one sample at each buffer switch and getting the OVERRUN error:

This may be due to socket switching. Is it possible to put a counter on the external processor, so that it stops driving data when the counter reaches its limit and waits for the DMA flag to be deasserted? This will avoid the data loss.

3. Issue with the 4x32 KB buffer size:

Since you are using a many-to-one DMA channel configured with 4x32 KB buffers for each producer path (1. Prod_0 to Cons_0, 2. Prod_1 to Cons_0), it effectively needs 4 x 32 KB x 2 = 256 KB.

By default, on parts with 512 KB of RAM, only 224 KB is allocated for the buffer area. You can see this in the cyfxtx.c file.
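The memory budget from point 3 can be checked with a short sketch. The 224 KB figure is the one quoted above; the helper names are hypothetical, and the model ignores any other allocations made from the same buffer area.

```c
#include <assert.h>
#include <stdint.h>

/* Default DMA buffer area quoted in this reply (an assumption taken from it). */
#define FX3_DEFAULT_BUFFER_AREA_BYTES (224u * 1024u)

/* Buffer memory needed by a many-to-one channel: `count` buffers of `size`
 * bytes are allocated for each producer socket. */
uint32_t multichannel_buffer_bytes(uint32_t size, uint32_t count, uint32_t producers)
{
    return size * count * producers;
}

/* Returns 1 if the request fits in the default buffer area, 0 otherwise. */
int fits_default_buffer_area(uint32_t needed_bytes)
{
    return needed_bytes <= FX3_DEFAULT_BUFFER_AREA_BYTES;
}
```

With 2 producer sockets, 4x32 KB needs 256 KB and does not fit, while 4x16 KB needs 128 KB and does, which matches the limit observed with 2 threads.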

Anonymous
Not applicable

Hi,

I am (at last) back on the project!

I will look into your last message in detail. But what I want to say is that I cannot lose any data (not even 1 bit), because I will use the FX3 to send raw data over USB to be processed by a computer or single-board computer. So if I set a throughput of 128 MB/s, I do need to get that amount of data, not only 123 MB/s.

So, since my last message, has Cypress made an example that transfers continuous data without any loss using 2 threads and without external signals, like the example with the image sensor (AN75779, if I remember correctly)?

Best,

Christian
