Cause of host EHCI USB transaction errors on bulk OUT to CY7C68001?

ErWo_4845296

I'm a software engineer with RTP Corp.  We are using the CY7C68001 EZ-USB SX2 High Speed USB Interface Device.  We have an embedded software design used for process control and have written a simple embedded USB driver.  Our devices are limited to one USB port with one USB High Speed device (the CY7C68001).  The USB device is on-board with the CPU chip-set (Intel SCH US15W).  We transfer data using four endpoints.  Two of the endpoints transfer about 1072 bytes of output data and 245 bytes of input data continually at a 1 millisecond rate.  The other two endpoints transfer 512 bytes or less of output or input data at random times to communicate with flash memory.

The USB host controller is an Intel EHCI device built into the chip-set.  The software mostly polls the EHCI, and the only interrupt is the Interrupt On Completion of transfer descriptors.  The interrupt service routine examines the descriptors to see which ones have completed, and sets the operating system event flags associated with the completed transfers.  Then it clears the interrupt request.
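A minimal sketch of the completion scan described above, written in Python purely for illustration (the actual driver is embedded code that is not shown in this thread). The bit positions follow the EHCI qTD token layout; the `Transfer` class, the `event_flags` set and the function name are hypothetical stand-ins for the driver's own structures.

```python
from dataclasses import dataclass

QTD_ACTIVE = 1 << 7    # EHCI qTD token bit 7: descriptor still owned by the controller
QTD_IOC = 1 << 15      # EHCI qTD token bit 15: Interrupt On Complete


@dataclass
class Transfer:
    name: str              # hypothetical label, e.g. "EP2 OUT"
    token: int             # qTD token dword as last written back by the controller
    signalled: bool = False


def service_completion_interrupt(transfers, event_flags):
    """Flag every transfer whose qTD has been retired (Active bit cleared)."""
    for t in transfers:
        if not t.signalled and not (t.token & QTD_ACTIVE):
            event_flags.add(t.name)   # stands in for an OS event-flag set call
            t.signalled = True
    return event_flags


pending = [
    Transfer("EP2 OUT", QTD_IOC),                # retired: Active cleared
    Transfer("EP6 IN", QTD_IOC | QTD_ACTIVE),    # still in progress
]
flags = service_completion_interrupt(pending, set())
print(sorted(flags))
```

One ordering detail that may be worth checking in a real ISR: the USBINT bit in USBSTS is write-one-to-clear, so acknowledging it before scanning the descriptors, rather than after, avoids losing a completion that lands while the scan is running.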

So long as we use only one pair of IN/OUT endpoints to transfer the 1 millisecond repetitive data, we see no problem.  When we perform output transfers to the endpoint for flash in addition to the 1 millisecond periodic outputs, we start seeing intermittent errors on the 1 ms. output.  The endpoint transferring data every millisecond gets transaction errors in the status for the USB host controller transfer descriptors.  Those transaction errors appear to be recoverable. The transfer completes successfully.  The transaction errors continue at random times after doing a single data transfer on the flash endpoints.  Along with the intermittent but frequent transaction errors on the 1 ms. outputs we sometimes see lost or incorrect data.

My understanding of a transaction error is that the USB device (CY7C68001) did not respond on USB or responded with an incorrect USB token.  That would indicate either a logic error in the CY7C68001 or check code failures on tokens or output data.  I have not found any configuration of the host controller that explains the intermittent recoverable transaction errors.  Why would communicating to a second output endpoint cause transaction errors on a different endpoint?  Is there some violation of the FIFO handshaking that could explain the transaction errors on USB?  For example, if an OUT FIFO goes from not-full to full without any output data being transferred, would that cause a transaction error?  In that case, the CY7C68001 would return ACK instead of NYET in response to an OUT transfer, but then later NAK the OUT data transfer instead of sending ACK as expected.  Would that be considered a transaction error?  Does an output FIFO going from empty to full without any data being put into it cause the CY7C68001 to violate the protocol enough to produce a transaction error in the host controller?  Is there any other scenario that might explain a transaction error detected by the host controller?
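For reference, the EHCI specification describes the qTD Transaction Error (XactErr) status bit as set when the host did not receive a valid response from the device (timeout, CRC error, bad PID and similar). The sketch below decodes the status byte and error counter of a qTD token in Python as an illustration only; the field positions come from the EHCI qTD layout, while the example token value and the assumption that CERR started at 3 are hypothetical.

```python
# Status byte of an EHCI qTD token (bits 7..0).
STATUS_BITS = {
    7: "Active",
    6: "Halted",
    5: "Data Buffer Error",
    4: "Babble Detected",
    3: "Transaction Error (XactErr)",
    2: "Missed Micro-Frame",
    1: "Split Transaction State",
    0: "Ping State / ERR",
}


def decode_qtd_token(token):
    """Split a qTD token dword into the fields most useful for debugging."""
    status = [name for bit, name in STATUS_BITS.items() if token & (1 << bit)]
    return {
        "status": status,
        "cerr": (token >> 10) & 0x3,           # remaining error count
        "bytes_left": (token >> 16) & 0x7FFF,  # Total Bytes to Transfer
        "data_toggle": (token >> 31) & 0x1,
    }


# Hypothetical write-back of a completed OUT qTD: Active cleared, all bytes
# sent, XactErr latched, and CERR decremented from an assumed 3 down to 2.
example_token = (2 << 10) | (1 << 3)
info = decode_qtd_token(example_token)
print(info)
```

A token like this one (XactErr set, Halted clear, no bytes left) matches the "recoverable" pattern in the question: at least one attempt got no valid handshake, and a retry then succeeded.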

The other odd thing about this problem is that a power cycle makes the problem stop occurring until we perform another transfer to the flash output endpoint.  Doing a software restart or hardware reset does not make the problem stop occurring.  Our host controller driver is doing a USB reset to the CY7C68001 in both cases.  Our hardware engineer has verified that even on a software restart, the CY7C68001 and the logic connected to it are also being reset via a hardware signal.

We have been able to verify that communicating with either of the pairs of OUT/IN endpoints and not the other avoids the problem.  By synchronizing the transfers for the sets of endpoints, and moving the timing relationship, we can make the problem happen more or less frequently. Even when the transfers are done separately from each other in time the problem still happens.

Since I am the software engineer, I am only somewhat familiar with the CY7C68001. I am mostly concerned about any EHCI configuration or programming errors that might explain the transaction errors.  We are preserving the PING status and data toggle across transfers to each endpoint.  We only use bulk transfers, either output or input, and we only use the asynchronous schedule.  Other than timing, there is nothing different about the transfers at 1 ms. versus the flash transfers.
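As an illustration of carrying the PING state and data toggle from one transfer to the next, the sketch below builds the token for a new qTD from the written-back token of the previous transfer on the same endpoint. It is Python pseudocode for the idea, not the driver's code: the function name and the default CERR of 3 are assumptions, and the toggle in the qTD only matters when the queue head has its Data Toggle Control bit set (otherwise the controller keeps the toggle in the queue head overlay).

```python
QTD_PING = 1 << 0      # Ping State (meaningful for high-speed OUT endpoints)
QTD_ACTIVE = 1 << 7    # hand the descriptor to the host controller
QTD_IOC = 1 << 15      # Interrupt On Complete
QTD_DT = 1 << 31       # data toggle
PID_OUT, PID_IN = 0, 1  # qTD PID Code field values (bits 9:8)


def next_token(prev_token, length, pid, cerr=3):
    """Build the token for the next qTD, carrying toggle and ping state over."""
    carried = prev_token & QTD_DT
    if pid == PID_OUT:
        carried |= prev_token & QTD_PING   # ping state applies to OUT only
    return (carried
            | (length << 16)    # Total Bytes to Transfer
            | QTD_IOC
            | (cerr << 10)      # error counter
            | (pid << 8)
            | QTD_ACTIVE)


# Previous OUT finished with toggle = 1 and the endpoint left in ping state.
previous = QTD_DT | QTD_PING
token = next_token(previous, 1072, PID_OUT)
print(hex(token))
```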

We have at most three transfers active in the asynchronous list.  The 1 ms. OUT and IN transfers are started at nearly the same time, with the OUT slightly preceding the IN.  The OUT transfer always completes before the IN transfer, since the input data is not placed into the FIFO until the expected output data has been completely received by the CY7C68001.  The third transfer that may be present in the asynchronous schedule is either an OUT or IN to one of the flash communication endpoints.  The only case where the flash transfers took significant time to complete was during a flash erase.  The IN transfer would not complete until the erase completed about 300 ms. later.  We changed the flash communication so that an IN completes almost immediately, within 125 microseconds or less.  Changing the flash IN transfer to use less time on the USB bus did not have any effect, and an OUT to the flash endpoint still caused the other OUT endpoint to get transaction errors continually afterward.

If there are hardware considerations that might explain this problem of transaction errors and incorrect output data, I would like to pass the information on to our hardware engineer.  I can see how a violation of the FIFO handshaking could cause incorrect data, but I can't explain how that would cause USB transaction errors.  My impression is that the USB logic is mostly separate from the FIFO handshaking.

Should we be looking for electrical issues on the USB bus between the host controller and the CY7C68001?  Does this even sound like a problem being caught by error check codes?

Any suggestions will be appreciated.

1 Solution

We believe that we have identified and corrected the problem causing the transaction errors.  The CY7C68001, PLD and flash memory chip were all near each other on the board and connected to the same 3.3V power source.  When an erase command was sent to the flash, it made large intermittent demands on the 3.3V power and that caused noise on the 3.3V power to the CY7C68001.  The noise apparently caused USB data errors or errors in the USB tokens, resulting in the transaction errors.

The retry of a bad output transfer usually succeeded, but in some cases, the USB data error was undetectable by the CRC.  In that case we saw the incorrect or missing data.
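To illustrate why a CRC catches most but not all corruption: USB data packets carry a 16-bit CRC (tokens carry a 5-bit one). The Python sketch below implements the CRC-16 used for USB data payloads (polynomial x^16 + x^15 + x^2 + 1, processed LSB first) and checks that every single-bit flip in a small test payload changes the CRC. The 64-byte payload is an arbitrary example, not data from this system.

```python
def usb_crc16(data: bytes) -> int:
    """CRC-16 as used for USB data packets (reflected polynomial 0xA001)."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF


payload = bytes(range(64))          # arbitrary example payload
good_crc = usb_crc16(payload)

undetected = 0
for bit in range(len(payload) * 8):
    corrupted = bytearray(payload)
    corrupted[bit // 8] ^= 1 << (bit % 8)   # flip exactly one bit
    if usb_crc16(bytes(corrupted)) == good_crc:
        undetected += 1

print(undetected)   # single-bit errors are always caught
```

Isolated bit errors are always detected, but a packet hit by a long burst of noise is effectively random data, and a 16-bit check lets roughly 1 in 65,536 such packets through. With transfers every millisecond and frequent noise events, an occasional undetected bad packet is plausible.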

We are providing 3.3V from our own on-board power circuitry and NOT using the USB interface power.  A capacitor was used to eliminate noise, but the part was supposed to be a tantalum capacitor.  An incorrect part was being used that was not a tantalum capacitor.  We replaced the part with the correct one and that appears to have corrected the problem with transaction errors on USB output transfers.

It is going to be a few days before we are certain that the problem has been corrected, but so far it has operated without errors for longer than any of our previous tests.

Thanks for the information and suggestions.

5 Replies
YatheeshD_36
Moderator

Hello,

Can you please explain the registers and their values that are set in SX2?

Do you have any control transfers before you perform out transfers to the flash?

An application interface diagram and a flow chart indicating the sequence of transfers from the host and response from the external master will be helpful to analyze the issues.

Thanks,

Yatheesh


Yatheesh,

This is the information that our hardware engineer sent me regarding the chip registers.

USB uses the default descriptor.

To use the default descriptor:

Write a length of 6.

Initiate a Write Request to register 0x30.

Write the VID, PID, and DID bytes: 0xB4, 0x04, 0x02, 0x10, 0x01, 0x00 (in nibble format per the command protocol).

Register 1 IFCONFIG = 0x42

Register 2 FLAGSAB = 0xE8

Register 3 FLAGSCD = 0x42

All other registers set to default.

IFCLK is 33 MHz, externally sourced.

Endpoint 2 is for data from USB

Endpoint 4 is for data to USB

Endpoint 6 is for data from USB

Endpoint 8 is for data to USB

Transfer type is bulk.  High speed only.
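The "nibble format" of the default-descriptor write can be sketched as follows. This is a Python illustration under stated assumptions, not the hardware engineer's logic: it assumes the SX2 command encoding in which an address byte has bit 7 set and bit 6 selecting read (1) or write (0), each data byte is sent as two transfers (high nibble, then low nibble), and the write request to register 0x30 comes first, followed by the two-byte length and then the six ID bytes. The function names are hypothetical, and the encoding should be checked against the SX2 datasheet.

```python
DESC_REG = 0x30   # SX2 descriptor register


def address_byte(reg, read=False):
    """Command address byte: bit 7 = address transfer, bit 6 = read request."""
    return 0x80 | (0x40 if read else 0x00) | (reg & 0x3F)


def nibbles(byte):
    """Split one data byte into two command transfers, high nibble first."""
    return [(byte >> 4) & 0x0F, byte & 0x0F]


def default_descriptor_commands(vid, pid, did):
    """Command-interface byte sequence that loads VID/PID/DID (length 6)."""
    payload = bytes([vid & 0xFF, vid >> 8,
                     pid & 0xFF, pid >> 8,
                     did & 0xFF, did >> 8])
    length = len(payload)
    sequence = [address_byte(DESC_REG)]
    for b in (length & 0xFF, length >> 8, *payload):
        sequence += nibbles(b)
    return sequence


# Values matching the bytes listed above: 0xB4, 0x04, 0x02, 0x10, 0x01, 0x00
commands = default_descriptor_commands(vid=0x04B4, pid=0x1002, did=0x0001)
print([hex(c) for c in commands])
```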

Sequence

Slave Read endpoint 2 until empty and transfer complete.

If empty but transfer not complete wait for not empty and continue read.

When complete slave write to endpoint 6 until done

Wait 50 us 

Check endpoint 4 if not empty read endpoint 4 until empty.

Check data available for endpoint 8 if data available write data to endpoint 8.

Go back to endpoint 2 and wait for not empty

Data is sent over USB to endpoint 2 once every millisecond - size up to 4 Kbytes

Data is sent to endpoint 6 once every millisecond - size up to 8 Kbytes

Data is sent over USB to endpoint 4 up to once per millisecond as needed - size up to 256 bytes

Data is sent to endpoint 8 up to once per millisecond as needed - size up to 256 bytes

==============================================================

The rest of this is my description of the timing.

The output to endpoint 2 and the input from endpoint 6 are asynchronous bulk transfers started at the same time every 1 millisecond.  The output transfer when we see the problem is about 1072 bytes and input transfer is about 245 bytes.  Because of the logic in the PLD connected to the CY7C68001, the output transfer completes first, and the input transfer remains in progress until the output transfer completes.  The input transfer occurs immediately after the output transfer because the PLD already has the data to send.

The PLD connected to the CY7C68001 is communicating with an I/O backplane containing digital and analog IO cards.  The PLD is busy for about 100 microseconds of each millisecond.  If the USB output transfer arrives while the PLD is busy, then both the USB output transfer and USB input transfer will remain pending until the PLD is no longer busy.  The problem occurs without regard to whether the USB output transfer has to wait for the PLD to accept the data.

Endpoints 4 and 8 are used up to every millisecond, but usually there are long periods of time up to many seconds between IO transfers for those.  They are used for communication to flash.  They were completely asynchronous with the one millisecond transfers on the other endpoints.  We changed the software so that the flash USB output or USB input is always done after the completion of the USB input from endpoint 6.  So, the USB transfers for endpoints 4 and 8 now never occur during communication with the other two endpoints.  That did not eliminate the problem, but we found that changing the amount of delay before the flash USB IO is done affected the rate at which the errors occurred.

The problem happens when we communicate with both pairs of endpoints, but does not occur if we communicate with only one pair of IN/OUT endpoints.  We also found that the problem does not happen, or is much less likely to happen if the USB output to endpoint 2 is smaller than 1072 bytes.  Since the FIFO is 2 x 512 bytes, we are wondering if this has something to do with the USB output having to wait to send more data.  Normally that situation should not cause a transaction error, but there seems to be something else wrong that results in a transaction error.  So far as we can tell, that transaction error is always recoverable with one retry.
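The arithmetic behind the 1072-byte observation can be made explicit. With a 512-byte maximum packet size, a 1072-byte bulk OUT is split into three packets, one more than a 2 x 512 double buffer can hold at once. The helper below is a hypothetical Python illustration of that split.

```python
def split_into_packets(nbytes, max_packet=512):
    """Sizes of the USB packets a bulk transfer of nbytes is broken into."""
    full, rest = divmod(nbytes, max_packet)
    return [max_packet] * full + ([rest] if rest else [])


packets = split_into_packets(1072)
print(packets)                       # [512, 512, 48]

FIFO_BUFFERS = 2                     # 2 x 512-byte double buffering on EP2
must_wait = len(packets) > FIFO_BUFFERS
print(must_wait)                     # True: the third packet needs a freed buffer
```

Any transfer of 1024 bytes or less fits in the two buffers without the host having to wait, which is consistent with the smaller transfers showing the problem less often, although it does not by itself explain a transaction error.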

We are trying to understand why we get the USB transaction errors on the outputs to endpoint 2.  What interaction with the FIFOs would cause that sort of error?  Also, along with the frequent transaction errors, we get less frequent bad or missing output data.  The bad or missing data has occurred from both output endpoints 2 and 4.  The flash endpoint 4 is used much less frequently and we see less frequent data errors, but no transaction errors.

The CY7C68001 data sheet does not describe the interaction between the USB OUT protocol and the FIFOs.  I am wondering if swapping between the 512 byte buffers for the OUT endpoint at the wrong time could cause a violation of the USB protocol.  For example, the endpoint might respond with an ACK instead of a NYET because the buffer is empty, but then respond with a NAK for the next packet because the buffer is not empty.  Is it possible for the master to somehow swap a partly full buffer for an empty buffer, leaving insufficient space to accept the next packet?
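For context, the USB 2.0 high-speed bulk OUT flow control can be modelled in a few lines. A device answers OUT data with ACK (accepted, room for another packet), NYET (accepted, no room for the next one) or NAK (not accepted), and answers PING with ACK or NAK depending on available space. The Python model below is a simplified sketch of those rules for a double-buffered endpoint; it is not a description of the SX2's internal logic, and the point at which the external master drains a buffer is an assumption.

```python
def out_handshake(free_buffers):
    """Handshake for OUT+DATA given the free packet buffers before the packet."""
    if free_buffers == 0:
        return "NAK", 0                      # data not accepted
    remaining = free_buffers - 1
    return ("ACK" if remaining else "NYET"), remaining


def ping_handshake(free_buffers):
    """Handshake for a PING token: is there room for one more packet?"""
    return "ACK" if free_buffers else "NAK"


def simulate_out_transfer(num_packets, buffers=2):
    """Log the handshakes for one transfer into a multi-buffered endpoint."""
    free, log = buffers, []
    for _ in range(num_packets):
        if free == 0:
            log.append("PING->" + ping_handshake(free))
            free += 1    # assumed: the external master drains one buffer here
            log.append("PING->" + ping_handshake(free))
        response, free = out_handshake(free)
        log.append("OUT->" + response)
    return log


log = simulate_out_transfer(3)   # 1072 bytes = 3 packets into 2 buffers
print(log)
```

All of these responses, including a NAK after an earlier ACK, are valid handshakes. The host reacts to NAK or NYET by falling back to PING, so an ill-timed buffer swap would be expected to cost time rather than produce a transaction error.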

Is there some other explanation for why we would get a transaction error?  Is there something we can look at to see if the CY7C68001 detected a transaction error?


Hello,

As you are using the default SX2 descriptors and register values for EPxCFG,  as per the datasheet:

[Image: pastedImage_0.png, datasheet excerpt showing the default endpoint configuration]

EP2 and EP4 will be configured as OUT endpoints, i.e. data from USB.

EP6 and EP8 will be configured as IN endpoints, i.e. data to USB.

Please double-check the descriptors manually on the host side and confirm that the device is enumerating with the expected device descriptors and default EPxCFG register values. Also, check that you are performing the right transfers on the right endpoints from the host.

512 byte double buffering on EP2 and transferring 1072 bytes will only affect the bandwidth and should not cause any issues for the transfers.

Please set a larger timeout value on the host side for all transfers.

Can you please capture USB traces between the host and the SX2 and share it with us?

If you do not have a hardware analyzer, then you can use a software USB analyzer like Wireshark pcap Capture.

This will give a better understanding of the issue.

Thanks,

Yatheesh

ErWo_4845296

Our USB device and the CPU chip-set are all on a single board.  There is no USB connector between the SX2 and the chip-set USB port.  We have only been able to connect an oscilloscope or logic analyzer to monitor the USB connection.  We do not have a USB analyzer or capture device.

Please also note that we are not using Windows or Linux.  We are using a proprietary embedded USB driver written by me.  The endpoint configuration, packet size and other parameters are set to constant values, and our driver does not read the descriptors from the USB device.  The device initialization consists of a USB reset, wait for connection, set address and set configuration.  There is only one configuration, the default one.

We have checked the hardware configuration of the SX2 registers and the USB host controller DMA transfer descriptors.  All of the transfers succeed, but some of the OUT transfers retry once after a "transaction error".  We see that error status bit set in the completed transfer descriptor on the host, although the transfer completes successfully.  The initial contents of the descriptors that fail are the same as the ones that succeed except for the length of the transfer.  Our transfers are small enough that we only need one qTD in the host.
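As a cross-check on the "one qTD is enough" point: an EHCI qTD has five buffer pointers, each covering one 4 KB page, and only the first may start at an offset within its page. The Python sketch below computes the page pointers for a transfer and rejects one that would not fit. The function name and the example address are hypothetical, and it assumes the buffer is physically contiguous.

```python
PAGE = 4096          # EHCI buffer pointers address 4 KB pages
QTD_POINTERS = 5     # buffer pointer fields per qTD


def qtd_buffer_pointers(phys_addr, length):
    """Buffer pointer values for one qTD covering [phys_addr, phys_addr + length)."""
    capacity = QTD_POINTERS * PAGE - (phys_addr % PAGE)
    if length > capacity:
        raise ValueError("transfer needs more than one qTD")
    pointers = [phys_addr]                    # pointer 0 keeps the page offset
    page = (phys_addr // PAGE + 1) * PAGE     # next page boundary
    while page < phys_addr + length:
        pointers.append(page)
        page += PAGE
    return pointers


# A 1072-byte buffer that straddles a page boundary needs two pointers.
pointers = qtd_buffer_pointers(0x00100F00, 1072)
print([hex(p) for p in pointers])
```

A buffer that crosses a page boundary with only the first pointer filled in is one way for otherwise identical descriptors to behave differently depending on transfer length, so it may be worth ruling out.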

What timeout do you want me to increase?  The Intel EHCI hardware has no timeout parameter that I can find.  We are not getting a time-out in our software because the transfer descriptors complete very quickly (within a few hundred microseconds).

Other than this "transaction error" status, the transfers seem to work correctly.  The critical problem is that output data is sometimes incorrect.  We also would like to understand the cause of the "transaction error" status and determine if it is related to the occasionally incorrect output data.

We have just finished testing using only a single output and single input endpoint 2/6.  We still see the problem, but only when we perform output transfers that write or erase the flash connected to our PLD through the SX2.  Once the problem starts happening, restarting the software and re-initializing the host controller does not stop the "transaction error" problem.  Even when that problem happens it is still only some of the output transfers but not all of them.  Changing output transfer sizes or changing the timing of when output transfers are done makes the problem occur more or less frequently.  In some cases we have run the hardware for hours or days without getting any of the transaction errors, but once they start, only a power cycle stops them from occurring frequently.

The next test we plan to do is to combine the flash and other data so that we perform only a single output and single input transfer per millisecond.

Our hardware engineer is also going to change the order of initializing the other SX2 registers versus the default enumeration registers.  The hardware is currently loading the enumeration registers before initializing the other registers.

Is there some hardware initialization that should be done after receiving the ENUMOK status?
