A Taken branch before NMI delays NMI execution by one full instruction

HiassofT · August 31, 2010

While analyzing some timing details of my highspeed SIO code I ran into a really weird issue:

The CPU finished executing a taken branch instruction in scanline-cycle 9, but instead of servicing the NMI in cycle 10 it started the next instruction and only then started the NMI handler.

At first I thought I was mad, or that I missed something important, but then a quick check showed that an "LDX $E0" ending in cycle 9 started the NMI at cycle 10 - as expected. I wrote a test program and compared the behaviour of "LDX $E0", "JMP", "BCS taken" and "BCS not taken" - all finishing in cycle 9. "LDX", "JMP" and "BCS not taken" worked as expected, only "BCS taken" showed the strange delay.

The program outputs the current program counter (from the stack, where RTI would return) plus the next 4 bytes. For simplicity I just put a "LDA #1" "LDA #2" "LDA #3" (i.e. $A9, $01, $A9, $02, $A9, $03), starting where the NMI should kick in at cycle 10. So an output of "A9 01 A9 02" means that everything's fine, an output of "A9 02 A9 03" means the CPU executed another full instruction.
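
The way this output is read can be sketched in a few lines of Python. This is only an illustration: `extra_instructions` is a hypothetical helper, not part of the test program, and it assumes the return address points at one of the `LDA #n` instructions described above.

```python
# Hypothetical helper, not part of vbitest: interpret the bytes printed
# after the return address that the NMI pushed on the stack.
LDA_IMM = 0xA9  # opcode of LDA #imm


def extra_instructions(dump: str) -> int:
    """Return how many LDA #n instructions ran before the NMI was taken.

    The test code is LDA #1, LDA #2, LDA #3, so the operand of the LDA
    at the return address shows how far execution got.
    """
    data = bytes.fromhex(dump)  # fromhex() skips the spaces between bytes
    if len(data) < 2 or data[0] != LDA_IMM:
        raise ValueError("return address does not point at an LDA #imm")
    return data[1] - 1


print(extra_instructions("A9 01 A9 02"))  # 0: NMI taken where expected
print(extra_instructions("A9 02 A9 03"))  # 1: one extra instruction ran
```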

Here's the link to my test program, including source code and logic analyzer captures: http://www.horus.com/~hias/tmp/vbitest-1.0.zip

And here's the screenshot of my Atari 800XL:

I also did a quick test on the Atari800 emulator (current CVS version), but it seems it doesn't implement this weird 6502 behaviour - the output is "A9 01 A9 02" in all 4 tests.

To double-check that I hadn't messed up the cycle counting I hooked up my logic analyzer. First, the samples of "LDX $E0" preceding the NMI:

Marker "A" denotes the beginning of cycle 7, where the first byte of "LDX $E0" is fetched. In cycle 8 NMI goes low and the CPU fetches the second byte of the "LDX $E0" instruction. In cycle 9 the CPU executes the "LDX $E0". In cycle 10 (marker "B") NMI goes high, and since no instruction is running the CPU starts with the 7-cycle NMI sequence, which ends in cycle 16 (marker "C" at end of this cycle/beginning of cycle 17).

Now the same with the "BCS taken":

Marker "A" / cycle 7 is where the first byte of "BCS" is fetched. In cycle 8 NMI goes low and the second byte of "BCS" is fetched; in cycle 9 the branch is taken and BCS finishes. Then in cycle 10 (marker "B") NMI goes high and the CPU loads the first byte of "LDA #1". In cycle 11 the second byte of "LDA #1" is loaded and "LDA #1" finishes. Finally, in cycle 12 (marker "C"), the NMI sequence starts; it ends in cycle 18 (marker "D" at the end of the cycle).

WTF, this is really strange and I never heard of this before. All docs I read so far mentioned that the interrupt servicing is delayed until an already started instruction finishes. But not that a new instruction is started before servicing the interrupt... At least now I know why the calculated cycles of my highspeed SIO code didn't match the results of my real-world tests :-)

Does anyone of you know something more about this?

so long,

Hias

Rybags · August 31, 2010

Weird... looks like you're just doing a branch +00 there.

Does it do the same for other situations like:

. branch back/forward by some significant amount.

. branch that crosses a page boundary.

. other types of branch instruction.

HiassofT · August 31, 2010

Weird... looks like you're just doing a branch +00 there.

Yes, in this test case I just do a branch to the next instruction, for simplicity.

Does it do the same for other situations like:

. branch back/forward by some significant amount.

. branch that crosses a page boundary.

. other types of branch instruction.

In my highspeed code I had a BNE backwards a few bytes, in the loop that checks IRQST. I first saw this strange behaviour when I finally managed to set up a test case where VBI code that was 1 cycle longer led to very occasional errors (hitting the absolute worst case is quite hard). Here's a screenshot (ignore the time-scale, it was sampled at Atari clock speed, so each sample is one Atari cycle):

I also reproduced this (with BCS backwards) in my test-code (not in the code that I uploaded before).

Ah, and thanks for reminding me to check branches across page boundaries, I'll try that tomorrow (I'm just finishing the screenshots for the other weird POKEY thing I discovered this weekend, which led to this weird CPU thing...).

so long,

Hias

Bryan · August 31, 2010

Just thinking out loud...

Normally, the last cycle of an instruction resets the T-state counter and allows execution of interrupts or an opcode fetch. It's possible that branch instructions circumvent the normal process by locking the CPU in an opcode fetch mode until the conditions of the branch are met. Consider that the MOS Hardware Manual states that a branch instruction will load the next opcode after the branch instruction on cycle 3 (T2), and will load an alternate opcode on cycle 4 (T3) (if the branch is taken) and another on cycle 5 (T4) if carry is added.

It would be interesting to see if SYNC is asserted for the 2nd instruction.

EDIT: I was thinking about the fact that the NMI occurs normally if the branch is NOT taken. This would indicate that it's the 2nd sequential opcode fetch that seems to thwart the NMI. The branch probably returns to a T0 state after 2 cycles but leaves some extra logic engaged to hold at T0 and adjust the PC if the branch is to be taken. This hold mechanism may inadvertently drop the NMI response.

Edited September 1, 2010 by Bryan

Rybags · September 1, 2010

Sounds plausible... in a sense, the 6502 is executing the next instruction so the delay might be "expected".

But aren't there other instances where the next instruction is fetched early... what about NOP?

+bob1200xl · September 1, 2010

The 65816 manual states that a Branch Taken adds one cycle to the instruction (as does a branch across a page boundary). The 65816 is supposed to follow the 6502 timings, so I would expect that the 6502 has the same result.

As a guess, the PC needs to be re-loaded on a relative branch taken and the extra cycle does this. It doesn't seem to execute the next instruction, does it?

Bob

Bryan · September 1, 2010

Sounds plausible... in a sense, the 6502 is executing the next instruction so the delay might be "expected".

But aren't there other instances where the next instruction is fetched early... what about NOP?

The 6502 will often fetch an opcode on the same cycle it completes an internal operation. If we look at AND imm, it is actually a 3 cycle operation:

cycle   instruction 1            instruction 2
C0      1: Read opcode $29
C1      2: Read argument
C2      3: Write A&arg to A      1: Read opcode...
C3                               2: etc...
C4                               3: etc...

...but it only has an impact of 2 cycles since it finishes during the opcode fetch of the next instruction. Instructions that write to memory on the last cycle can't take advantage of this feature since the bus isn't available, and neither can simple instructions with no arguments (like TAY). Apparently, the 6502 needs at least 2 dedicated cycles per instruction.
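
A toy Python sketch of this overlap, for illustration only: `overlapped_timeline` is a made-up name, and the three internal steps per instruction are taken from the AND imm example above, not from a datasheet.

```python
def overlapped_timeline(n_instructions: int, internal_steps: int = 3):
    """List (bus_cycle, instruction_index, step) for back-to-back
    instructions whose last internal step shares a bus cycle with the
    opcode fetch of the following instruction."""
    rows = []
    for i in range(n_instructions):
        # Each instruction starts one cycle before the previous one is
        # internally finished, so it only costs internal_steps - 1 cycles.
        start = i * (internal_steps - 1)
        for step in range(internal_steps):
            rows.append((start + step, i, step + 1))
    return sorted(rows)


for cycle, insn, step in overlapped_timeline(2):
    print(f"C{cycle}  instruction {insn + 1}, step {step}")
```

In this sketch two instructions finish their internal work in bus cycles C0 to C4, with step 3 of the first and step 1 of the second both landing on C2.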

HiassofT · September 1, 2010

Just thinking out loud...

Normally, the last cycle of an instruction resets the T-state counter and allows execution of interrupts or an opcode fetch. It's possible that branch instructions circumvent the normal process by locking the CPU in an opcode fetch mode until the conditions of the branch are met. Consider that the MOS Hardware Manual states that a branch instruction will load the next opcode after the branch instruction on cycle 3 (T2), and will load an alternate opcode on cycle 4 (T3) (if the branch is taken) and another on cycle 5 (T4) if carry is added.

It would be interesting to see if SYNC is asserted for the 2nd instruction.

EDIT: I was thinking about the fact that the NMI occurs normally if the branch is NOT taken. This would indicate that it's the 2nd sequential opcode fetch that seems to thwart the NMI. The branch probably returns to a T0 state after 2 cycles but leaves some extra logic engaged to hold at T0 and adjust the PC if the branch is to be taken. This hold mechanism may inadvertently drop the NMI response.

Very interesting idea!

I checked the MOS hardware manual and saw that the opcode fetch in cycles T2 and T3 didn't have the "discarded" remark like the other dummy reads. Of course this might just be a simple omission error, or (also thinking out loud) an indication that the next instruction is chained to the branch instruction if the branch was taken.

I also had a look at the SYNC line, but it looks normal. It's high in the first cycle of the branch instruction, in the first cycle of the "LDA #1" instruction and also in the first cycle of the NMI sequence. So it must be something internal inside the CPU...

so long,

Hias

atariksi · September 1, 2010

The 65816 manual states that a Branch Taken adds one cycle to the instruction (as does a branch across a page boundary). The 65816 is supposed to follow the 6502 timings, so I would expect that the 6502 has the same result.

As a guess, the PC needs to be re-loaded on a relative branch taken and the extra cycle does this. It doesn't seem to execute the next instruction, does it?

Bob

Just an aside: It seems that in updating PC, 6502 increments PC as a 16-bit register in order to fetch next instruction but in branching it's dealing with it as two 8-bit halves where updating upper half requires an extra cycle.

HiassofT · September 1, 2010

The 65816 manual states that a Branch Taken adds one cycle to the instruction (as does a branch across a page boundary). The 65816 is supposed to follow the 6502 timings, so I would expect that the 6502 has the same result.

This is right, but this additional 3rd cycle is executed in scanline cycle 9, just before the NMI handler should start. If the branch begins 1 cycle later, this third cycle is executed in scanline cycle 10, the CPU waits for this to finish and then starts the NMI handler in cycle 11:

Also, if the branch is one cycle earlier, the CPU starts the "LDA #1" in cycle 9, finishes it in cycle 10 and then starts the NMI handler in cycle 11.

So the behaviour with instructions ending in cycle 10 (and also later) is as expected, it's just this weird behaviour if a taken branch finishes in cycle 9.

As a guess, the PC needs to be re-loaded on a relative branch taken and the extra cycle does this. It doesn't seem to execute the next instruction, does it?

No, this is just a dummy read, the execution happens later, in the next cycle, where the CPU fetches the opcode again and also sets the SYNC line.

BTW, since you mentioned the 65816: could you please run the vbitest.atr on your 65816 computer (configured to 1.79MHz, if that's possible, so that the timing is identical to the 6502)? It would be really interesting if the 65816 also acts like this. A test with a 65C02 would be interesting, too (unfortunately I neither have a 65C02 nor a 65816 here).

so long,

Hias

Bryan · September 1, 2010

Anyone seen this? I'd love to get my hands on a real simulation model of the 6502.

a26-james.pdf

Bryan · September 1, 2010

I also had a look at the SYNC line, but it looks normal. It's high in the first cycle of the branch instruction, in the first cycle of the "LDA #1" instruction and also in the first cycle of the NMI sequence. So it must be something internal inside the CPU...

Interesting. According to one of the more detailed 6502 block diagrams out there, SYNC is actually an indicator of the T1 state:

http://www.weihenstephan.org/~michaste/pagetable/6502/6502.jpg

I'm not sure why it's not the T0 state, unless T0 is not the usual starting point for an instruction.

+bob1200xl · September 1, 2010

The PC is probably a 16 bit counter that can be incremented by +1 with a clock line (in the same machine cycle) or loaded in parallel from the adder, 8 bits at a time. In those instances where you need to set all 8 bits in the lower byte, it takes another machine cycle to do the add/parallel load. Plus another cycle to increment the high byte in the case of a carry from adding the offset.
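
The resulting cycle counts (2 for a branch not taken, 3 if taken, 4 if the target lies in a different page) can be written down as a small Python sketch. `branch_cycles` and the example addresses are made up for illustration; the page comparison uses the address of the instruction after the 2-byte branch, which is where the offset is applied.

```python
def branch_cycles(branch_addr: int, offset: int, taken: bool) -> int:
    """Cycle count of a 6502 relative branch located at branch_addr.

    offset is the signed 8-bit displacement (-128..127), relative to the
    address of the instruction that follows the 2-byte branch.
    """
    if not -128 <= offset <= 127:
        raise ValueError("offset must fit in a signed byte")
    if not taken:
        return 2
    next_pc = (branch_addr + 2) & 0xFFFF
    target = (next_pc + offset) & 0xFFFF
    # One more cycle if the high byte of the PC has to be adjusted as well.
    crosses_page = (next_pc & 0xFF00) != (target & 0xFF00)
    return 4 if crosses_page else 3


print(branch_cycles(0x2200, 0, False))    # branch not taken
print(branch_cycles(0x2200, 0, True))     # taken, same page
print(branch_cycles(0x22FC, 0x10, True))  # taken, target in the next page
```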

HiassofT · September 1, 2010

Anyone seen this? I'd love to get my hands on a real simulation model of the 6502.

Thanks for the info, I want one of these too! (and also models for Pokey, Antic, Gtia, of course :-)

so long,

Hias

HiassofT · September 1, 2010

Interesting. According to one of the more detailed 6502 block diagrams out there, SYNC is actually an indicator of the T1 state:

http://www.weihenstephan.org/~michaste/pagetable/6502/6502.jpg

I'm not sure why it's not the T0 state, unless T0 is not the usual starting point for an instruction.

I guess this must be a typo. I read the whole MOS hardware manual today, and it says that the SYNC signal goes high during each opcode fetch (for example on page 127, where they describe how to implement single-instruction stepping). Later, in Appendix A, where the instructions are displayed in detail, all instructions start in T0 with the opcode fetch. Only signalling T1, but not T0, wouldn't make too much sense, IMO.

so long,

Hias

drac030 · September 1, 2010

Now the same with the "BCS taken":

Marker "A" / cycle 7 is where the first byte of "BCS" is fetched. In cycle 8 NMI goes low and the second byte of "BCS" is fetched; in cycle 9 the branch is taken and BCS finishes. Then in cycle 10 (marker "B") NMI goes high and the CPU loads the first byte of "LDA #1". In cycle 11 the second byte of "LDA #1" is loaded and "LDA #1" finishes. Finally, in cycle 12 (marker "C"), the NMI sequence starts; it ends in cycle 18 (marker "D" at the end of the cycle).

So you are saying, that if the "BCS taken" instruction points back to itself, the CPU will never service the NMI?

HiassofT · September 2, 2010

So you are saying, that if the "BCS taken" instruction points back to itself, the CPU will never service the NMI?

Interesting idea, a simple endless loop, with the right timing, that locks up the entire computer :-)

Unfortunately this didn't work. In cycle 10 the branch was executed a second time, but in cycle 13 the NMI-sequence started:

But, of course, after the NMI returned the CPU was caught in an endless loop:

So it seems like the NMI triggering is crucial for this to happen.

so long,

Hias

Edited September 2, 2010 by HiassofT

HiassofT · September 2, 2010

Oh, another thing:

Could someone please run the vbitest.atr on an Atari 400 or 800? This could give some indication if this behaviour is specific to the 6502C used in XL/XEs or if it's a general (NMOS) 6502 issue.

Currently I only have XLs and XEs here, my Atari 800 is sitting some 50km away at my parents' home, and I can't get there soon...

so long & thanks,

Hias

phaeron · September 2, 2010

I did some checking on my own (attached), and I can confirm this behavior, at least on an NTSC 800XL. The NMI is delayed by a clock for a taken branch, but not for a NOP, JMP, or an LDA abs,X that crosses a page boundary. It doesn't seem to apply to a taken branch that also crosses a page (4 cycles), however.

This might be related to some behavior I saw where attempting to enable a VBI or DLI at just the right clock caused it to occur, but also one cycle late. The game Atomix Plus! was crashing on Altirra until I emulated that.

vbltiming.zip

Rybags · September 2, 2010

But for a branch taken (4-cycle), wouldn't we need to adjust the code such that everything occurs 1 cycle earlier, to test whether /NMI is delayed or not?

phaeron · September 2, 2010

Yup, my 4-cycle branch test executes the instructions one clock earlier so they end at the same times as the other tests. I did another set of tests using POKEY IRQs, and I'm seeing the same one-cycle delay there too. I think we can blame the 6502 and rule out ANTIC's slightly short NMI pulse.

Sigh, this is going to be messy to emulate.

stimer-bcc.zip

HiassofT · September 2, 2010

Hi Phaeron!

I did some checking on my own (attached), and I can confirm this behavior, at least on an NTSC 800XL. The NMI is delayed by a clock for a taken branch, but not for a NOP, JMP, or an LDA abs,X that crosses a page boundary. It doesn't seem to apply to a taken branch that also crosses a page (4 cycles), however.

Yup, my 4-cycle branch test executes the instructions one clock earlier so they end at the same times as the other tests. I did another set of tests using POKEY IRQs, and I'm seeing the same one-cycle delay there too. I think we can blame the 6502 and rule out ANTIC's slightly short NMI pulse.

Thanks a lot for doing the tests!

I extended my test-program and also can confirm your results.

So the conclusion is: the additional 1-instruction-delay happens both for NMIs and IRQs, but only if a branch instruction which takes 3 cycles (i.e. a taken branch not crossing a page boundary) finishes before the interrupt would normally be serviced by the CPU.
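
One way to turn this conclusion into something an emulator could use is the Python sketch below. The thread only establishes the observed timings; the polling points in `poll_cycles` are an assumption chosen to reproduce them (ordinary instructions sample the interrupt lines in their second-to-last cycle, a 3-cycle taken branch only in its first cycle, a 4-cycle taken branch in its first and third cycles). All names are hypothetical, and `visible_from=8` is the scanline cycle in which the traces show NMI going low.

```python
from typing import List, Tuple

# (name, cycle count, kind); kind is "branch_taken" for a taken branch,
# anything else is treated as an ordinary instruction.
Insn = Tuple[str, int, str]


def poll_cycles(start: int, cycles: int, kind: str) -> List[int]:
    """Cycles in which this instruction samples the interrupt lines.

    These polling points are an assumption of the model, not a
    documented fact taken from the thread.
    """
    if kind == "branch_taken" and cycles == 3:
        return [start]              # first cycle only
    if kind == "branch_taken" and cycles == 4:
        return [start, start + 2]   # first cycle and the one before the fix-up
    return [start + cycles - 2]     # second-to-last cycle


def interrupt_start(program: List[Insn], start: int, visible_from: int) -> int:
    """Return the cycle in which the 7-cycle interrupt sequence begins."""
    cycle = start
    for _name, cycles, kind in program:
        polls = poll_cycles(cycle, cycles, kind)
        cycle += cycles
        if any(p >= visible_from for p in polls):
            return cycle
    raise RuntimeError("program ended before the interrupt was taken")


LDA = ("LDA #n", 2, "normal")
print(interrupt_start([("LDX $E0", 3, "normal"), LDA, LDA], 7, 8))    # 10
print(interrupt_start([("BCS", 3, "branch_taken"), LDA, LDA], 7, 8))  # 12
```

With these assumptions the model also gives cycle 11 for a taken branch that starts one cycle earlier or later, cycle 10 for the 4-cycle branch, and cycle 13 for the branch-to-itself loop, which agrees with the measurements reported in this thread.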

Here's the link to the new test-program "inttest.atr", including source and logic analyzer samples: http://www.horus.com/~hias/tmp/inttest-1.0.zip

And here's the output of my PAL 800XL:

BTW: When testing on your Atari please note that the Pokey IRQ timing is quite tight (see the Pokey serial and IRQ timing details thread), so the IRQ tests might output "1 instruction delay" in all test-cases. I confirmed this with my "bad" PAL 800XL.

I also made some new logic analyzer screenshots:

Branch not taken before NMI (2 cycle branch instruction), NMI executes immediately:

Branch to same page taken before NMI (3 cycle branch instruction), here we have the additional instruction:

Branch crossing page taken before NMI (4 cycle branch instruction), NMI executes immediately:

Branch to same page taken before IRQ (3 cycle branch instruction), also with the additional instruction:

so long,

Hias

+bob1200xl · September 2, 2010

On a 1.79 MHz 65816, you get addresses of 2206, 2306, 2406, and 2507. The code in each row is the same: A9 01 A9 02.

On a 7.16 MHz 65816, you get addresses of C277 (all rows) and code of A2 05 8D 0A (all rows).

Bob

HiassofT · September 2, 2010

Hi Bob!

On a 1.79 MHz 65816, you get addresses of 2206, 2306, 2406, and 2507. The code in each row is the same: A9 01 A9 02.

Thanks a lot for testing!

I just had a quick look at the 65816 datasheet again, and it looks like all the opcodes I used in the timing-critical section need the same number of cycles as on the 6502. So the 65816 isn't affected by this "interrupt bug".

On a 7.16 MHz 65816, you get addresses of C277 (all rows) and code of A2 05 8D 0A (all rows).

Oh, well, my program isn't prepared for such fast CPUs :-) So you just see that the CPU is executing some OS ROM code (must be the "print" I used for output). But it's still good to see that the address and instructions are identical :-)

BTW: Anyone with an Atari 800 with original 6502 CPU willing to run the test?

so long,

Hias

+bob1200xl · September 3, 2010

An 800 with the 'old' 6502 (I checked):

2206 A9 01 A9 02

2308 A9 02 A9 03

2406 A9 01 A9 02

2507 A9 01 A9 02

Bob
