Rich, you seem to be taking it personally. There's less guessing involved in this GPL discussion than you think.
I've sat there and measured the performance of the GPL interpreter, and I've spent hours tracing through it. I learned to program GPL at the byte level. You know GPL from the programmer's point of view. No offense intended but I believe that I know more about how GPL /works/ than you do. You know more about how to /use/ GPL than I do.
My opinion is that you are wrong that VDP delays are the /main/ cause of BASIC's slow performance. I back up my opinion with all the experimentation and research I've done with the machine. I have sat there with my logic analyzer and watched the system operate. I have replaced the GROMs with a microcontroller and traced the system operation. I have spent HOURS tracing through the GPL interpreter. The main cause of BASIC's slow performance, in my opinion, is the large amount of work that the GPL interpreter does to decode every instruction it reads.
If you time "MOVE 1000 FROM CPU@>A000 TO VDP@>1000" in GPL versus a tight copy loop in assembly, GPL unfortunately loses by a large margin. The actual VDP time in both cases is equivalent. But the MOVE instruction sets up the VDP address for every byte, because it has to assume that the VDP interrupt may have occurred. It also has to abstract both the source and destination addresses, since it doesn't restrict either one to a particular type of memory. This is the GPL MOVE loop, starting at >065E in the ROM (just the loop, not the setup or instruction decode):
065E
B *6 Execute source (jumps to one of the source functions below)
Source ROM or RAM:
MOVB *1+,11 Fetch
B *7 Execute destination (jumps to one of the destination functions)
Source VDP RAM:
MOVB @>83E3,*15 Write address
MOVB 1,*15
INC 1
MOVB @>FBFE(15),11 Fetch data
B *7 Execute destination
Source GROM:
MOVB 1,@>0402(13) Write GROM address
MOVB @>83E3,@>0402(13)
INC 1
MOVB *13,11 Fetch data
B *7 Execute destination
Destination RAM:
MOVB 11,*2+ Write
JMP >06CA Go on
Destination GROM:
MOVB 2,@>0402(13) Write GROM address
MOVB @>83E5,@>0402(13)
INC 2 Next address
MOVB 11,@>0400(13) Write into GRAM
JMP >06CA Go on
Destination VDP register:
CB @>83E5,14 R2 Lbyte >01?
JNE >06AC
COC @>0012,14 Version?
JNE >06A8
ORI 11,>8000 Set 16k bit
06A8
MOVB 11,@>83D4 Register value 1
06AC
MOVB 11,*15 Write
ORI 2,>0080 VDP register
MOVB @>83E5,*15 Write from Lbyte R2
INC 2 Next register
JMP >06CA Go on
Destination VDP RAM:
MOVB @>83E5,*15 Write address VDP
ORI 2,>4000 Writing
MOVB 2,*15
INC 2 Next address
MOVB 11,@>FFFE(15) Write data
06CA
DEC 8 End ?
JGT >065E No, go on
B @>083E Return GPL interpreter, set condition bit and GROM address from substack
The above code executes for every single byte of the MOVE instruction. And that's one of the faster GPL instructions, because after it's decoded, it's running entirely in 16-bit assembly. But look at the case for copying from CPU (say a cartridge) to VDP memory. There are 10 instructions per byte. Three of those incur the VDP waitstate (the same wait state would be incurred if the CPU access was in 8-bit space, which for a cartridge it is). But instructions have varying durations.
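For anyone who wants to check the "10 instructions per byte" figure, here is a rough Python tally. It is only a sketch: the per-path counts are transcribed by hand from the listing above, not measured, and the names (SOURCE, DEST, instructions_per_byte, vdp_accesses_per_byte) are mine, purely for illustration. The VDP register destination is left out because its path length varies with the branches taken.

```python
# Instructions executed per byte on each path of the MOVE loop.
# Counts are read by hand off the listing in this post (an assumption,
# not a hardware measurement). Each entry is (instructions, VDP port accesses).

LOOP_OVERHEAD = 3  # B *6 at >065E, then DEC 8 and JGT at >06CA

SOURCE = {
    "cpu": (2, 0),   # MOVB *1+,11 ; B *7
    "vdp": (5, 3),   # two address writes, INC, data fetch, B *7
    "grom": (5, 0),  # two GROM address writes, INC, data fetch, B *7
}

DEST = {
    "cpu": (2, 0),   # MOVB 11,*2+ ; JMP >06CA
    "grom": (5, 0),  # two GROM address writes, INC, data write, JMP
    "vdp": (5, 3),   # address write, ORI, address write, INC, data write
}


def instructions_per_byte(src, dst):
    """Total instructions the loop runs to move one byte."""
    return LOOP_OVERHEAD + SOURCE[src][0] + DEST[dst][0]


def vdp_accesses_per_byte(src, dst):
    """How many of those instructions touch a VDP port."""
    return SOURCE[src][1] + DEST[dst][1]


if __name__ == "__main__":
    for src in SOURCE:
        for dst in DEST:
            print(src, "->", dst,
                  instructions_per_byte(src, dst), "instructions,",
                  vdp_accesses_per_byte(src, dst), "VDP accesses")
```

The CPU to VDP case comes out at 10 instructions with 3 VDP accesses, which agrees with the hand count above.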
In terms of cycles, we have this in 16-bit ROM, so no wait states on code execution. Registers are in scratchpad. So we're running at top speed. I get 174 cycles per byte for this loop. Of that, 4 cycles are wait state for reading from the 8-bit cartridge ROM (2%), and 24 cycles are wait state for talking to the VDP, thanks to read-before-write (13%). The VDP wait state overhead is that high because GPL has to set the address every loop, so it's doing 3 times as many VDP writes as assembly has to. And even 13% is still the minority of the time.
Assembly doesn't have to be flexible. So it can be MUCH faster.
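* R15 points at the VDP write data port (>8C00), VDP address already set
LP MOVB *R0+,*R15 * write from memory to VDP, autoincrement source
   DEC R2 * one less byte to copy
   JNE LP * loop until the count reaches zero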
Let's assume the normal case: this code running in 8-bit memory, not scratchpad. Even so, I get 70 cycles per byte. 12 cycles are wait state for reading instructions from 8-bit memory (17%), 4 cycles are wait state for reading CPU memory (5%) and 8 cycles are wait state for writing VDP (11%). The code is almost 2.5x faster than the GPL loop, though, and that's running from 8-bit memory with wait states!
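If you want to redo the arithmetic yourself, here is a small Python sketch using only the cycle counts quoted in this post (174 and 70 cycles per byte, with the wait state breakdowns given for each loop). The dictionary layout and function names are mine and purely illustrative. The percentages are truncated rather than rounded, which is how the figures above were written.

```python
# Cycle figures as quoted in this post; nothing here is newly measured.
GPL_LOOP = {
    "total": 174,
    "waits": {"cartridge ROM read": 4, "VDP access": 24},
}
ASM_LOOP = {
    "total": 70,
    "waits": {"instruction fetch": 12, "CPU read": 4, "VDP write": 8},
}


def wait_percentages(loop):
    """Share of the loop spent in each wait state, truncated to whole percent."""
    return {name: int(100 * cycles / loop["total"])
            for name, cycles in loop["waits"].items()}


def cycles_without_waits(loop):
    """Cycles left if every wait state in the loop were free."""
    return loop["total"] - sum(loop["waits"].values())


# How much faster the assembly loop is, wait states included.
speedup = GPL_LOOP["total"] / ASM_LOOP["total"]

# The same comparison with every wait state removed from both loops.
speedup_no_waits = cycles_without_waits(GPL_LOOP) / cycles_without_waits(ASM_LOOP)

# GPL loop with only its VDP wait states removed.
gpl_without_vdp_waits = GPL_LOOP["total"] - GPL_LOOP["waits"]["VDP access"]

if __name__ == "__main__":
    print("GPL wait shares:", wait_percentages(GPL_LOOP))
    print("ASM wait shares:", wait_percentages(ASM_LOOP))
    print("speedup:", round(speedup, 2))
    print("speedup with no wait states:", round(speedup_no_waits, 2))
    print("GPL cycles without VDP waits:", gpl_without_vdp_waits)
```

On these numbers, taking the VDP wait states out of the GPL loop entirely still leaves 150 cycles per byte against 70 for the assembly loop, and with all wait states removed from both the GPL loop is still roughly three times longer (146 cycles against 46).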
The VDP is the SAME SPEED as the 32k memory expansion, and as every expansion card in the PEB. The difference being it's not random access. Accessing the VDP memory sequentially takes the same amount of time as accessing 8-bit RAM or ROM sequentially. In fact, if you are doing a copy loop to or from RAM or ROM, VDP can actually be FASTER, because you can drop one of the post-increments in your loop, giving the 9900 less work to do.
But until it has been PROVEN otherwise, by someone actually correcting the issue and demonstrating it, it's still just theory and opinion.