Same issue here
Yesterday I found this post from kalev, and I am a little surprised that there is no comment from Cypress on this problem; judging by the post from 04.03.2013, there does seem to be a possible issue. I also didn't find any errata sheet or note for the FX3 that refers to this problem.
With my own development board (rev 3) I can reproduce the reported problems, and this makes me nervous, because I have to finalize my first prototype design pretty soon and don't want to spend days analyzing old known issues - the first post is more than 14 months old.
FX3USBtest version. 1.3 - Data pattern = 0x00000000,0x00FFFFFF
00:00:15 Read/Write=143.5/143.3MB/s Errors PHY/LNK=162/74
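For anyone who wants to compare runs, here is a minimal sketch that parses a status line in the format shown above and turns the counters into errors per second. It assumes the PHY/LNK counts are cumulative since the start of the test, which the tool output does not state; the function names are my own and not part of FX3USBtest.

```python
import re

# Hypothetical parser for the one-line status format quoted in this thread.
STATUS_RE = re.compile(
    r"(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s+"
    r"Read/Write=(?P<rd>[\d.]+)/(?P<wr>[\d.]+)MB/s\s+"
    r"Errors PHY/LNK=(?P<phy>\d+)/(?P<lnk>\d+)"
)

def parse_status(line):
    """Extract elapsed time, throughput and error counters from one line."""
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError("line does not match the expected status format")
    elapsed = int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"])
    return {
        "elapsed_s": elapsed,
        "read_mb_s": float(m["rd"]),
        "write_mb_s": float(m["wr"]),
        "phy_errors": int(m["phy"]),
        "lnk_errors": int(m["lnk"]),
    }

def errors_per_second(status):
    """Return (PHY, LNK) error rates, assuming cumulative counters."""
    t = status["elapsed_s"]
    if t == 0:
        raise ValueError("elapsed time is zero")
    return status["phy_errors"] / t, status["lnk_errors"] / t
```

Feeding it the line above gives a rate that can be compared between boards, cables or SDK versions instead of comparing raw counts taken at different times.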
There is no change/progress related to this noise and the USB communication failure issues. Cypress has not found anything (at least they reported to me that everything is OK in the FX3).
As I have already mentioned above, there are two closely related issues:
a) FX3 is sensitive to toggling its own GPIF pins, causing USB SuperSpeed Phy/Link errors.
b) Phy/Link errors cause USB SuperSpeed communication failure.
Of course, FX3 could be better and be less sensitive, but my main concern is that I do not know the reason why the communication breaks.
What I have noticed is that sometimes the content of Bulk IN pipe packets seems to appear in EP0 control packets before the communication finally breaks - my image sends data 0xFFFFFFFF to the host and the same data appears in EP0 packets. It is as if the FX3 outputs wrong data in the packet, or the host just interprets the packets wrongly.
As I have no USB analyzer, I can't explore which side, FX3 or host, behaves wrongly. It can be the host side as well - some USB 3.0 chipsets/drivers seem to be real crap (I have had very bad experiences with the Etron driver).
I have got a bit more background information about FX3 PHY/LNK error counter behavior from Cypress Tech support. I modified my FX3USBnoise.img to reflect this new knowledge and I updated it also to the latest FX3 SDK version 1.3.1.
My own short test showed that version 1.3.1 behaves as previous ones - if PHY errors appear, sooner or later the communication breaks.
FX3GPIFnoise4.zip 139.1 K
Hi kalev, I also ran into another problem: when I start a bulk loop, data may be corrupted (some bytes are broken on reception). A workaround is to set the DMA buffer size to 16 (thank goodness this works, and so far no obvious side effects have occurred). But the instability is the severe problem. I would like to ask for several improvements to make the device work longer, and as a last resort use a USB3 PHY directly (which is difficult but seems to work).
I read your post from May last year about replacing the crystal with an oscillator to increase stability. So I have a question: if I replace my 19.200MHz crystal with oscillator, is it true that the device won't fail again? Our device is required to run for days, and frequently resetting the device (WDT) is not a good idea since re-enumeration takes at least 5 seconds. Thanks in advance. I'll go buy a 19.200MHz oscillator to test. I'll use Bus Hound to see what the host receives when the device goes down.
I have never seen data corruption myself (if not taking into account buggy FPGA images that do not satisfy timing requirements). But I also have not played a lot with different DMA buffer sizes/counts, so I may be just lucky I have selected the right combination (size=16*1024, count=4).
>if I replace my 19.200MHz crystal with oscillator, is it true that the device won't fail again?
According to my tests, it increases stability, but this does not mean that it will never fail. The same device may still fail with a slightly longer cable, in a noisier environment, with a worse adapter card, etc.
And I have experimented only with replacing the crystal on the Cypress FX3 DVK with an ASDMB-19.200MHZ-LY oscillator. I can't comment on other crystals/oscillators.
We also need very reliable communication - the device should work for several weeks/months continuously. Fortunately, in most use cases there is actually no need for SuperSpeed throughput, so we decided to run our device in USB 2.0 High-Speed mode by default. Users are advised to switch to SuperSpeed mode only if it is really needed.
As the noise level depends on the GPIF bus width, we also limited ourselves to a 16-bit bus (200MB/s satisfies us). SuperSpeed communication seems to be most sensitive to noise generated when the FX3 outputs data to the GPIF bus, i.e. when the FX3 sends data to the FPGA. Fortunately we do not need high throughput in this direction at all, so we use only 8 bits of the bus in this direction, keeping the remaining 8 bits at a steady 0. We are also considering using some other way (SPI, UART, I2C) to transfer data from the FX3 to the FPGA.
What I currently believe/hope is that, in practice, most users should be able to run our device in SuperSpeed mode with at least a 1m cable.
Hi kalev! Thanks for your reply. I found something useful:
Here's the content: Hi Ronald,
This issue can be avoided by properly decoupling all of the separate FX3 power domains. If you look at page 9 of the schematics ( http://nuand.com/bladerf.pdf ), you will see how each power domain has a ferrite bead and decoupling caps. I verified the BER of the GPIF and found it to be 0 bits after hours of testing on several different units. Are you seeing any anomalies in your sample captures?
The official DVK board seems to use LDOs and has small capacitors on AVDD and the RX/TX pair, and just srcsink with 5 threads can easily break down the transmission (can I say immediately?). Our design uses even smaller capacitors (0201) and a buck converter instead of an LDO, so I think this could also be a possible reason (I remember I did observe fewer PHY/LINK errors after replacing it with an LDO and adding several capacitors, each 10 uF). I'll post another issue and test the board with an LDO and larger capacitors tomorrow, and also send one board out to have a newer chip (fabbed in 2013) soldered on. I hope the new chip solves the stability issue. Our data rate is about 80MB/s, so USB 2.0 may not be sufficient, and migrating to GbE is not possible currently.
I tried again, but with no effect. I replaced all power supplies with LDOs, but it still fails. The chip is 1213. It's a problem that the FX3 is not very stable. I think I can only add some auto-reset mechanism.
Unfortunately they do not report what cable lengths they tested with. With a very short cable (where USB signals are very good) the effect does not show up.
We use LDOs and separate ferrite beads + decoupling caps for most of the power domains. The exception, unfortunately, is that power for the oscillator and CVDDQ is generated by a buck converter. But still, there is proper filtering with a ferrite and a cap. I remember I played a lot with adding/replacing decoupling caps and separating power domains from each other - nothing seemed to give a noticeable positive effect (everything seems to be at its best already).
And as you mentioned, Cypress FX3 kit uses LDOs and it's still far from perfect.
My experience is that my prototypes from the same batch are actually different - some of them work quite well over a 1.8+1.8=3.6m cable, some fail even with one 1.8m cable.
Also, I have noticed that sometimes the device works quite well after enumeration over a certain cable length - 0 PHY errors in several hours. But in about 1 of every 3...10 power-ups/enumerations it starts to give errors immediately (and communication breaks in less than an hour). So it's a bit random and depends on enumeration (SuperSpeed training?).
As the USB error rate clearly correlates with the FX3 outputting data on the GPIF bus, and I have not found any external component(s) that could cause this, I have to conclude that this happens inside the FX3 (could it be that the GPIF interface and the USB or clock signals use the same ground wire inside the chip? Or crosstalk inside the FX3?).
I have played only with the CYUSB3014-BZXI chip. Could it be that some other FX3 chip is less sensitive?
About your 80MB/s data rate: this is achievable with an 8-bit GPIF bus between the FPGA and the FX3. Compared to a 32-bit bus, this should create much less noise.
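The bus-width reasoning is simple arithmetic. Here is a rough sketch, assuming one transfer per GPIF clock at 100MHz (the clock value and the list of candidate widths are assumptions taken from the figures quoted in this thread, and the helper names are made up). It gives the raw bus rate only; sustained USB throughput will be lower.

```python
def gpif_throughput_mb_s(bus_width_bits, clock_mhz=100.0):
    """Raw bus rate in MB/s, assuming one transfer per clock cycle."""
    return bus_width_bits / 8 * clock_mhz

def narrowest_bus(required_mb_s, clock_mhz=100.0, widths=(8, 16, 32)):
    """Return the narrowest listed bus width that meets the required rate,
    or None if none of them does."""
    for width in sorted(widths):
        if gpif_throughput_mb_s(width, clock_mhz) >= required_mb_s:
            return width
    return None
```

Under these assumptions a 16-bit bus gives the 200MB/s mentioned earlier, and narrowest_bus(80) picks the 8-bit bus, leaving some margin over the 80MB/s requirement.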
Above posts in this thread have focused mainly on FX3 noise sensitivity and how to reduce BER (Bit Error Rate).
However, I think the main problem is that SuperSpeed communication is not reliable, and in my understanding it breaks too easily.
Let's remember once again two Tech Support clauses I presented already in my first post:
1) IO toggling results in (FX3) substrate noise (so this may be the reason for PHY errors);
2) the layers of the communication protocol (USB) are designed to recover from such dynamic errors;
The main question should be: why does the USB protocol, in practice, not recover?
Unfortunately I have no USB analyzer. Has somebody explored what actually happens when USB communication breaks? What's to blame? Host? Host chipset or driver? FX3? USB standard? Nobody?
Hi, in my latest tests I found something interesting. My board is very unstable. After many, many tests, I think I may have found my problem. When I solder two chokes in parallel, the transmission lasts longer; however, when I solder them in series the board just goes down every time a transmission starts. So I just removed all the chokes, and to my surprise the board now started to work! Although there are a few PHY/LINK errors that sometimes cause an OUT endpoint to malfunction (this happens mainly when data is transferred simultaneously on different endpoints; it seems to be a crosstalk issue in the PHY IC design), a RESET from software is able to bring the dead endpoint back to life.
The choke is 600R at 100MHz, which is officially recommended. However, I read in Cypress's official document that the core current is 800mA at maximum and the power supply should be able to provide that current. The choke is very thin - it may not be able to pass that much current. So during transmission the voltage drops heavily and the PHY goes down. I did not use an oscilloscope to check the actual core voltage. I'll check it next time. I think it may be the cause.
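The voltage-drop suspicion can be put into numbers with Ohm's law. This is only a sketch: the 0.3 ohm DC resistance used in the usage note is a made-up placeholder (the real value has to come from the choke's datasheet), and it ignores saturation and any regulator feedback. Only the 800mA and 1.2V figures come from this thread.

```python
def series(*resistances):
    """Total DC resistance of chokes in series."""
    return sum(resistances)

def parallel(*resistances):
    """Total DC resistance of chokes in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)

def rail_voltage(v_supply, i_load, r_dc):
    """Voltage left at the chip after the I*R drop across the choke(s)."""
    return v_supply - i_load * r_dc
```

With the hypothetical 0.3 ohm per choke and 0.8A of load, rail_voltage(1.2, 0.8, series(0.3, 0.3)) leaves far less at the core than rail_voltage(1.2, 0.8, parallel(0.3, 0.3)), which would at least be consistent with parallel chokes lasting longer and series chokes failing at once. A scope measurement on the core rail is still needed to confirm it.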
The final configuration of my board is a 1.2V LDO and a 1.8V LDO, and I don't think a buck converter will work properly without chokes. I'll run several tests on the buck configuration. Although the disappearance of the device has mostly been eliminated, I still have to face the real-world problem of the OUT endpoint hanging. For now I'll test and reset the OUT endpoint when a command send fails. Maybe Cypress can give us a better solution.
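The "reset the OUT endpoint when a send fails" workaround could look roughly like this on the host side. It is a sketch only: send and reset_endpoint are hypothetical callables standing in for whatever the host library provides, and it assumes a failed transfer shows up as an OSError.

```python
def send_with_reset(send, reset_endpoint, payload, max_attempts=3):
    """Try to send payload; after each failure reset the endpoint and retry.

    Returns the number of attempts used. Re-raises the last error if every
    attempt fails, so the caller can fall back to a full device reset.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(payload)
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Hypothetical hook: clear the stalled OUT endpoint before retrying.
            reset_endpoint()
```

Capping the number of attempts keeps a truly dead link from looping forever; at that point re-enumeration is probably the only option left.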
According to the latest tests, the FX3 only works with 1.8V VIO at 100MHz PPORT; reducing it to 75MHz makes it very unstable. Supplying VIO from a buck converter causes the device to disappear very quickly. Raising VIO to 2.5V soon puts the FX3 into an unresponsive state.
With 1.2V VCore, 1.8V VIO and 100MHz PPORT, an external 19.2MHz oscillator (sorry, I do not have the specification of its variation) and both LDOs without chokes, the board passes the test for at least 30 minutes and transmits about 550GB of toggling data with very few PHY errors (only a few hundred); LINK errors may occur but are very rare. When tested at 50MHz PPORT, both error counts increase rapidly and the device fails very soon.
So the final and temporary solution is to change all current boards to 2 LDOs with no chokes, with PPORT running at 100MHz. I have found that the FX3 tends to be unstable if PPORT is not running at 100MHz; it seems that the internal bus inside the FX3 runs at 100MHz. At 1.8V VIO the FPGA is OK to work under 2.5V. Cypress should give us a minimum requirement for the chokes.
I have the same problem. When I download my firmware with external data going into the FX3 GPIF, the FX3 cannot re-enumerate steadily. It disappears and then reappears in CYCONTROL endlessly. But it can re-enumerate steadily when the external data is 0x00 or 0xFF. So I think the reason is the GPIF noise you described. When I use a DC-DC power source for 1.2V, the result is worse than with an LDO. My USB cable is 1m or 1.5m. I think the 1.2V supply quality has an impact on GPIF noise. I have no good idea how to improve the stability of my own board.
I think everyone with this problem should compare SDK versions as well. I have a couple boards that work fine with the 1.2.3 SDK but with 1.3.1 either fail with phy errors or can't enumerate at all. I don't know why the upgrade in the SDK makes a difference but they're otherwise working great at full usb3 speed with the 1.2.3 sdk but with 1.3.1 can only work on usb2. There is likely some other hardware factor aside from just the SDK version but something in the SDK seems influential.
I will describe a very typical situation (about 20% of cases) that I see when the communication breaks in my noise test.
Host's read from IN pipe fails with error code USBD_STATUS_XACT_ERROR in USB driver.
Typically Control Endpoint (EP0) still continues to function so that I can read for example device string descriptors.
I have noticed that if I read a string, the device does not return string data but instead returns data that should be returned in the IN pipe, i.e. data from the GPIF bus. But what is more interesting, I specify data length=256 in my read SETUP requests (this is more than the actual string length) and the device always returns the correct amount of data, exactly matching the length of the string I requested! It is as if the FX3 sets all fields (endpoint number and data length) correctly in the data packet header but somehow the data payload in the packet is wrong. If I continue to read strings (about 100...400 times), I finally start to get data that looks very much like strings - not 100% correct, but at least it's clear that it's not from the GPIF bus.
As I have no USB analyzer, I can't state that it is the FX3 that starts to send incorrect data. Theoretically this could also be a bug in the host chipset/driver. But clearly, somewhere there are serious problems in recovering from USB protocol errors.
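To make this check repeatable, the bytes returned for a string request can be tested against the standard string descriptor layout: byte 0 is bLength, byte 1 is bDescriptorType (0x03 for strings), and the rest is UTF-16LE text. A minimal sketch, assuming the whole descriptor was returned in one read (as with the 256-byte request above); the function name is my own.

```python
def looks_like_string_descriptor(data):
    """Return True if data has the layout of a complete USB string descriptor."""
    if len(data) < 2:
        return False
    b_length, b_descriptor_type = data[0], data[1]
    if b_descriptor_type != 0x03:          # 0x03 = STRING descriptor type
        return False
    if b_length != len(data) or b_length % 2 != 0:
        return False                       # header must match the byte count
    try:
        data[2:].decode("utf-16-le")       # payload must be valid UTF-16LE
    except UnicodeDecodeError:
        return False
    return True
```

A payload filled with the 0xFFFFFFFF test pattern fails the descriptor-type check even when its length is right, so a loop over repeated reads can count how many responses are corrupted and when they start to look like strings again.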
Hi, I read one of your posts about the LNK_PHY_TX_TRIM register data and checked that register, only to find out that both SDK 1.2.3 and 1.3.1 set the same value - 0x0B569011. I dumped all of the USB3 Link Controller registers (you can look them up in the FX3_TRM) and the data is stored in the attachment.
Apparently there are differences between the two SDK versions; I counted a total of 8 registers. However, we can also see that Cypress hides several register definitions. I'll try writing these register values into the registers to see if 1.3.1 performs better.
1.2.3 is much more stable than 1.3.1; when using crystals, 1.2.3 won't cause the device to fail, but 1.3.1 will easily break down.
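Comparing two dumps by hand is error-prone, so here is a small sketch of how they could be diffed. It assumes each dump has already been loaded into a dict mapping a register name to its value; that layout is my assumption and not the format of the attached file.

```python
def diff_registers(dump_a, dump_b):
    """Return {register: (value_a, value_b)} for every register that differs.

    A register missing from one dump is reported with None on that side.
    """
    diffs = {}
    for reg in sorted(set(dump_a) | set(dump_b)):
        value_a, value_b = dump_a.get(reg), dump_b.get(reg)
        if value_a != value_b:
            diffs[reg] = (value_a, value_b)
    return diffs

def format_diff(diffs):
    """Render the differences as one 'name: old -> new' line per register."""
    def hex_or_missing(value):
        return "missing" if value is None else "0x%08X" % value
    return ["%s: %s -> %s" % (reg, hex_or_missing(a), hex_or_missing(b))
            for reg, (a, b) in diffs.items()]
```

A register such as LNK_PHY_TX_TRIM that reads 0x0B569011 in both dumps simply drops out of the result, leaving only the registers worth experimenting with.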
FX3 USB Reg.txt.zip 590 bytes