Section 9.3 of the PSoC 5LP TRM and the bus interface section of the Cortex-M3 TRM seem to describe the issues in reasonable detail.
My interpretation is that the SRAM is organized as two banks that can be accessed in parallel within a single cycle. There is a direct path from the CPU's ICode/DCode buses to the lower bank, as well as a data path through the System bus to the upper bank.
The upper bank is therefore faster for data accesses such as the stack, since those can proceed in parallel with instruction fetches. In fact, it implies contention between *any* parallel accesses below 0x20000000, such as between instruction fetches from the flash cache and data accesses to the lower SRAM bank.
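To make the address split concrete, here is a rough sketch in C of how I read it. It models only the 0x20000000 boundary mentioned above. The names `bus_for_address` and `both_below_boundary` are made up for illustration and are not part of any Cypress or ARM API, and the "may contend" rule is just my interpretation, not something taken from the TRMs.

```c
#include <stdint.h>

/* Boundary between the Cortex-M3 Code region (ICode/DCode buses)
 * and the SRAM region (System bus). */
#define SRAM_BANK_BOUNDARY 0x20000000u

/* Hypothetical labels for the two access paths discussed above. */
typedef enum { BUS_CODE, BUS_SYSTEM } bus_t;

/* Which path an access to addr would take under this interpretation. */
static inline bus_t bus_for_address(uint32_t addr)
{
    return (addr < SRAM_BANK_BOUNDARY) ? BUS_CODE : BUS_SYSTEM;
}

/* Returns 1 if both accesses fall below the boundary, which is the
 * case I suspect can contend (assumption, not a documented rule). */
static inline int both_below_boundary(uint32_t a, uint32_t b)
{
    return bus_for_address(a) == BUS_CODE && bus_for_address(b) == BUS_CODE;
}
```

Under this model, an instruction fetch from flash paired with a stack access at or above 0x20000000 gives `both_below_boundary(...) == 0`, which is why placing the stack in the upper bank looks attractive.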
Disclaimer: I may easily have gotten this all wrong, though my optimization experiments so far seem to bear it out.