Read the architecture TRM and the respective app notes about DMA. There is a setup time for each DMA transfer. In your case it's most likely 6 clock cycles, which would result in the 3 MHz you observe.
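To make the arithmetic concrete, here is a rough sketch of how a fixed per-transfer overhead caps the toggle rate. The 36 MHz bus clock is purely an illustrative assumption (your clock was not stated), and the function names are made up for this example. It also assumes each DMA transfer writes one level, so a full square-wave period needs two transfers.

```python
def dma_transfer_rate_hz(bus_clock_hz, cycles_per_transfer):
    """Upper bound on DMA transfers per second for a fixed per-transfer cost."""
    return bus_clock_hz / cycles_per_transfer


def square_wave_hz(bus_clock_hz, cycles_per_transfer):
    """Output frequency when each transfer writes one level (high or low).

    One full period needs two writes, so the frequency is half the transfer rate.
    """
    return dma_transfer_rate_hz(bus_clock_hz, cycles_per_transfer) / 2


if __name__ == "__main__":
    BUS_CLOCK_HZ = 36_000_000   # assumed value, for illustration only
    CYCLES_PER_TRANSFER = 6     # the setup cost guessed above

    rate = dma_transfer_rate_hz(BUS_CLOCK_HZ, CYCLES_PER_TRANSFER)
    freq = square_wave_hz(BUS_CLOCK_HZ, CYCLES_PER_TRANSFER)
    print(f"transfers/s: {rate / 1e6} M, square wave: {freq / 1e6} MHz")
```

Plug in your real bus clock and the cycle count from the TRM. If the 3 MHz you measured is the transfer rate rather than the square-wave frequency, the same numbers would instead point to an 18 MHz clock.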
But what do you want to achieve? Wouldn't it be easier to connect a clock to the output directly? Or do you want, in the end, to write complete bit patterns to a GPIO port?
The DMA approach will, of course, potentially create a lot of jitter.
Also, maybe not relevant, but take a look at the info in this app note about faster (non-DMA) toggling of GPIO -