Jump to content







Photo

Time, it is precious

Posted by , 25 November 2005 · 86 views

Well, I mentally looked at the list to Leprechaun to-dos and decided that I needed to tackle the BLIT routine next. This is the code which copies the sprite graphics (running etc) from ROM to the strips of SuperCharger RAM that the kernel then uses. (Much like the 5200 & A8s.)And I'm glad I did, because I discovered (to my horror) that the routine barely fits in the time between the end of VSYNC and the start of the kernel. (Actually, I had to increase that time by a couple of lines even.) Each player & enemy typically needs over 600 CPU cycles to handle, with close to 500 cycles being the actual BLIT. (The rest being figuring out which player sprite gets used and setting up the BLIT.)I'm now a little concerned that I'm going to run short on time to squeeze the remaining features into the time between then end of the kernel and the start of VSYNC. (I'm particularly worried about adding a score display since that will automatically chew up a bunch of lines.)Anyway new features in the attached binary:- The colored blocks are now the right number of lines high. This really shows off the intelligent flicker.- player now changes color when dead

Attached Files






Hi there!

Hm... it seems I don't get you here. After you have fully determined which shape needs to be copied in which RAM area, you need like 600 cycles to copy it?!? I think your sprites are like 10 byte tall, that should be copied in less than 100 cycles, no?

And why do you copy it at all, can't you simply access the sprites via (),Y?

Greetings,
Manuel
  • Report
Let's deal with the second question first. It comes down to LDA (zp),Y taking 5 cycles while LDA abs,Y only needing 4. Then most of my kernel collapses into a very simple routine:
LINESUB
 STA	WSYNC	; do a single line
 STA	COLUPF	; 0
 LDA	FGBM0,Y	; 3
 STA	GRP0	; 7
 LDA	FGBM1+8,Y; 10
 STA	GRP1	; 14
 LDA	BGBM0,Y	; 17
 STA	PF1	; 21
 LDA	BGBM1,Y	; 24
 STA	PF2	; 28
 LDA	BGBM2,Y	; 31
 STA	PF0	; 35
 LDX	BGBM3,Y	; 38
 STX	PF1	; 42
 INY  ; 45
 NOP  ; 47
 AND	#$0F	; 49
 STA	PF2	; 51
 STA	PF0	; 54
 RTS  ; 57-63
I'm using SuperCharger RAM abilities to treat the BGBM0-4 and FGBM0-4 as virtual bitmaps. Unfortunately, writes to SC RAM require a bunch of cycles. So my BLIT routine looks like:
BLIT0
 LDY	YPOS+NUMSPR
3$
 CPY	#8
 BCS	1$
 LDA	(SPR_PTR),Y
 TXS
 TAX
 LDA	SCDATA,X; SuperCharger write
 TSX
 LDA	FGBM0,X; equivalent to STX FGBM0,X
 BCC	2$
1$
 LDA	SCDATA
 NOP
 LDA	FGBM0,X
2$
 INY
 INX
 TXA
 AND	#$0F
 BNE	3$
 BEQ	ZPUTSPR
Of course, any suggestions on how to reduce the number of cycles required would be appreciated. The full source is included in the attached ZIP.
  • Report
Hi there!

EricBall, on Fri Nov 25, 2005 11:53 PM, said:

Let's deal with the second question first.  It comes down to LDA (zp),Y taking 5 cycles while LDA abs,Y only needing 4.  Then most of my kernel collapses into a very simple routine

Hm... it seems like you save 6 cycles with that, but wouldn't getting rid of JSR and RTS alone save you 12?

EricBall, on Fri Nov 25, 2005 11:53 PM, said:

I'm using SuperCharger RAM abilities to treat the BGBM0-4 and FGBM0-4 as virtual bitmaps.

Hm... wouldn't the sprites require only two slices of RAM (sizeof SCREENHEIGHT)? Since you can only display two per row? If there's more in vertical seperation, you could blit them into the same slice.

EricBall, on Fri Nov 25, 2005 11:53 PM, said:

Unfortunately, writes to SC RAM require a bunch of cycles.  So my BLIT routine looks like:
BLIT0
 LDY	YPOS+NUMSPR
3$
 CPY	#8
 BCS	1$
 LDA	(SPR_PTR),Y
 TXS
 TAX
 LDA	SCDATA,X; SuperCharger write
 TSX
 LDA	FGBM0,X; equivalent to STX FGBM0,X
 BCC	2$
1$
 LDA	SCDATA
 NOP
 LDA	FGBM0,X
2$
 INY
 INX
 TXA
 AND	#$0F
 BNE	3$
 BEQ	ZPUTSPR

Admittedly I don't understand it. What is the purpose of the major branch in the code used for? You either write your sprite data to the SC RAM, but when Y>=8 then it does what? Why?

Of course, it seems like it would be a lot faster, if you'd split this multipurpose routine into two seperate ones. If I get that right, it checks with each iteration of the loop wether it is doing a sprite or not? :)

If you'd have two seperate routines, it seems you'd also save a lot of register swaps and saves/restores. Or, maybe even better than two seperate routines, have one routine doing both tasks entwined, two loops, one after the other!

EricBall, on Fri Nov 25, 2005 11:53 PM, said:

Of course, any suggestions on how to reduce the number of cycles required would be appreciated.  The full source is included in the attached ZIP.

Hm... I prefer discussing smaller/isolated parts, since I usually find it a hard task to deal with a complete source written by someone else :)

Greetings,
Manuel
  • Report
The kernel is more than that single subroutine, which is used for 13 out of every 18 line sequence. Of the remaining 5 lines, two are used to reposition one of the player sprites (alternating) and the other 3 are fairly tight for cycles. So although I could save 12 cycles removing the JSR/RTS, I need the 6 cycles saved elsewhere in the kernel.

Quote

Hm... wouldn't the sprites require only two slices of RAM (sizeof SCREENHEIGHT)? Since you can only display two per row? If there's more in vertical seperation, you could blit them into the same slice.
Exactly. FGBM0/1 are each divided into 16 line zones and may contain multiple sprites if the sprites are vertically separated. In this way I have 5 sprites which only flicker if more than two are on the same line.

Quote

Admittedly I don't understand it. What is the purpose of the major branch in the code used for? You either write your sprite data to the SC RAM, but when Y>=8 then it does what? Why? . . . If I get that right, it checks with each iteration of the loop wether it is doing a sprite or not?
Sorry, forgot to comment that. The 1$ branch writes a zero to FGBM0 (there's actually a separate BLIT1 routine for handling FGBM1). So just like a normal SkipDraw routine, it writes either 0 or the sprite data to FGBM0. I have to do this because I have no way of knowing what part of that 16 line zone was used in the last frame (it could be top, bottom, or even in the middle). This routine also handles whether the sprite is at the top, bottom or middle of the 16 line zone.
  • Report
Hi there!


EricBall, on Sat Nov 26, 2005 12:16 PM, said:

Sorry, forgot to comment that.  The 1$ branch writes a zero to FGBM0 (there's actually a separate BLIT1 routine for handling FGBM1).  So just like a normal SkipDraw routine, it writes either 0 or the sprite data to FGBM0.  I have to do this because I have no way of knowing what part of that 16 line zone was used in the last frame (it could be top, bottom, or even in the middle).  This routine also handles whether the sprite is at the top, bottom or middle of the 16 line zone.

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)

Greetings,
Manuel
  • Report

Cybergoth, on Sat Nov 26, 2005 4:02 PM, said:

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)
I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.

But the code above is also not looking very optimized. I'll have a closer look at it tonight.
  • Report
Hi there!

Thomas Jentzsch, on Sat Nov 26, 2005 6:24 PM, said:

Cybergoth, on Sat Nov 26, 2005 4:02 PM, said:

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)
I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.

Well, right now it's writing 16 bytes, and my suggestion would indeed require 24, but I think 16 * the overhead in the big loop is much more than the additional ~80 cycles for the 8 superfluous writes :)

Greetings,
Manuel
  • Report
I don't understand how the 1$ branch works. Shouldn't X be 0? :)

And how is X initialized?
  • Report

Thomas Jentzsch, on Sat Nov 26, 2005 11:24 AM, said:

Cybergoth, on Sat Nov 26, 2005 4:02 PM, said:

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)
I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.

But the code above is also not looking very optimized. I'll have a closer look at it tonight.

It might be possible to swap SC RAM with regular RAM for speed. Stuff that changes more gradually like scores or playfield graphics could be in SC RAM. Stuff that needs to be refreshed more aggressively could be in main RAM.
  • Report
I think the loop has to be rewritten so that it's counting down instead of up for one thing.

And can't the indirect addressing within the interior loop be eliminated by precalculating the pointer before entering in?

It should also be possible to unroll the loop if you have enough rom(ram) space and your sprites are all the same height (or padded in ROM).
  • Report
First, I'd like to say thanks for taking an interest. I really appreciate it and hope we can find some optimizations.

Thomas Jentzsch, on Sat Nov 26, 2005 1:24 PM, said:

I have never done anything for the SC, but it looks like writing to RAM is terribly slow. So filling the RAM twice seems to be slow too.
Correct, writes to SC RAM requires at least 9 cycles: 4 cycles read from $F0nn, then 5 cycles to read from the destination address for the actual write. (And the 5 cycle an SC requirement since the write occurs 5 address changes after the read from $F0nn.)

Cybergoth, on Sat Nov 26, 2005 10:02 AM, said:

I have a feeling that my suggestion of entwining the loops into two different ones was the right one then. Just fill the whole 16 bytes completely with zeros, then only write the 8 bytes of the sprite :)

Cybergoth, on Sat Nov 26, 2005 1:39 PM, said:

Well, right now it's writing 16 bytes, and my suggestion would indeed require 24, but I think 16 * the overhead in the big loop is much more than the additional ~80 cycles for the 8 superfluous writes :)
You might be right. The outer loop takes more than 9 cycles per itteration. So unrolling the 16 writes to zero the zone would require less time. It would probably simplify the BLIT too. Of course, ROM space isn't infinite....

Thomas Jentzsch, on Sat Nov 26, 2005 2:18 PM, said:

I don't understand how the 1$ branch works. Shouldn't X be 0?  :) And how is X initialized?
X is initialized to the top of the 16 line zone outside the loop. So it's 0,16,32.... depending upon where the sprite is onscreen vertically. Check out lepbank1.asm starting with PUTSPR.

mos6507, on Sat Nov 26, 2005 6:41 PM, said:

It might be possible to swap SC RAM with regular RAM for speed.  Stuff that changes more gradually like scores or playfield graphics could be in SC RAM.  Stuff that needs to be refreshed more aggressively could be in main RAM.
And although I'd love to use regular RAM, I'd need 320 bytes or a major rewrite of the kernel. (If it's even possible to have an equivalent kernel without using SC RAM.)

mos6507, on Sat Nov 26, 2005 6:55 PM, said:

I think the loop has to be rewritten so that it's counting down instead of up for one thing.
Yes, I know it's tradition (and typically more optimizable) in 2600 games to count backwards, but I find it very difficult to get things right when I use that method. Currently, counting backwards in this routine wouldn't gain me anything since neither index register is 0 at the end of the loop. But maybe if I separate the clear & BLIT operations as Manual suggests, then it will be possible.

mos6507, on Sat Nov 26, 2005 6:55 PM, said:

And can't the indirect addressing within the interior loop be eliminated by precalculating the pointer before entering in?
My current plans are to store the sprites in "unused" areas of ROM space (e.g. after the virtual bitmaps) rather than in a single page, which means indirect addressing is about the only possibility. You should have seen the mess before I split the routine into separate ones for FGBM0 and FGBM1, with different Y offsets.

mos6507, on Sat Nov 26, 2005 6:55 PM, said:

It should also be possible to unroll the loop if you have enough rom(ram) space and your sprites are all the same height (or padded in ROM).
All sprites are 8 lines high, but ROM space isn't infinite. I'm not sure I've got the space to pad out the sprites to 16 lines; especially if I have different sprites for the player and the enemies.
  • Report
[deleted]
  • Report
Some possible improvement suggestions:
1. split the loop
2. use LAX
3. when writing zeroes don't increase Y
4. when writing zeroes put INX instead of NOP between SC accesses (though according to my docs, 4 cycles are enough anyway)
5. instead of
  TXA
  AND #$0F
you could do
  CPX endX; precalculated
This is all pretty easy and should save you a little over 100 cycles.

If you need more cycles, you could try to use only one single counter (Y) and do something like
  LAX (SPR_PTR),Y 
  LDA SCDATA,X   
  LDA (SC_PTR),Y; setting up this pointer is a bit tricky
  INY          
  CPY endY 
  BNE 3$
  • Report

Thomas Jentzsch, on Sun Nov 27, 2005 10:41 AM, said:

Some possible improvement suggestions:
1. split the loop
2. use LAX
3. when writing zeroes don't increase Y
4. when writing zeroes put INX instead of NOP between SC accesses (though according to my docs, 4 cycles are enough anyway)
5. instead of
  TXA
  AND #$0F
you could do
  CPX endX; precalculated
This is all pretty easy and should save you a little over 100 cycles.

If you need more cycles, you could try to use only one single counter (Y) and do something like
  LAX (SPR_PTR),Y 
  LDA SCDATA,X   
  LDA (SC_PTR),Y; setting up this pointer is a bit tricky
  INY          
  CPY endY 
  BNE 3$
Tres cool. I hadn't considered LAX, but that does get around the lack of LDX (zp),Y. Setting up the pointers shouldn't be too bad since I should be able to set it up so Y counts down from 7 to 0 (which then saves some more cycles since the CPY can be skipped).
  • Report

EricBall, on Sun Nov 27, 2005 10:25 PM, said:

Tres cool.  I hadn't considered LAX, but that does get around the lack of LDX (zp),Y.  Setting up the pointers shouldn't be too bad since I should be able to set it up so Y counts down from 7 to 0 (which then saves some more cycles since the CPY can be skipped).
At the cost of some additional overhead outside the loop, yes.

But this was just peephole optimization. I wonder, if the general approach can't be optimized. Maybe you will come down to 400 cycles, but worst case each of the (5 or 5*2?) sprites can be in two 16 lines stripes. Right? So setting up those would require ~4000/8000 cycles, which is still way too much.

So maybe it would be better to track the animations and vertical sprite positions. So when a sprite isn't moving vertically, isn't animated between two frames and doesn't flicker, you don't have to BLIT it. With some scheduling, the first two points should be easy to handle. So only the flickering might become a problem. But then you could maybe use more than one RAM location/stripe.

Though, probably the easiest solution is getting rid of JSR/RTS and load the data directly from ROM. :)

Quote

It comes down to LDA (zp),Y taking 5 cycles while LDA abs,Y only needing 4.
And thinking a bit further, why don't you just use selfmodifying code here :)
  • Report

Thomas Jentzsch, on Mon Nov 28, 2005 3:47 AM, said:

Though, probably the easiest solution is getting rid of JSR/RTS and load the data directly from ROM. :)
And thinking a bit further, why don't you just use selfmodifying code here :)
This all depends on whether it is possible to create an equivalent kernel which doesn't require SC RAM. Perhaps you or Manuel has coded a kernel which I could reuse. So let's look at what my kernel does and maybe you (or Manuel) can provide some suggestions.

32 pixel x 160 line asymmetrical playfield (1LK) stored as 4 pages in SC RAM. Each 8 lines is a single row of the grid. PF colour changes used to make ropes & ladders. Blank line between rows (making a 180 line display) used to reposition one player sprite (alternating between rows).

There are 5 independent sprites (player + 4 enemies). 16<=Xpos<140, 0<=Ypos<156, 8 pixels wide, 8 lines high. Up to two sprites may be displayed on a single row without flicker. Partial sort algorithm used to provide intelligent flicker. So if there are no conflicts then no flickering. Each sprite independently coloured.
  • Report
Hi there!

EricBall, on Mon Nov 28, 2005 4:53 PM, said:

Thomas Jentzsch, on Mon Nov 28, 2005 3:47 AM, said:

Though, probably the easiest solution is getting rid of JSR/RTS and load the data directly from ROM. :)
And thinking a bit further, why don't you just use selfmodifying code here :)
This all depends on whether it is possible to create an equivalent kernel which doesn't require SC RAM. Perhaps you or Manuel has coded a kernel which I could reuse. So let's look at what my kernel does and maybe you (or Manuel) can provide some suggestions.

I addmittely have nothing that'd be close enough. I either have sprite repositionings or complex playfield, but not both. Closest homebrew example I could think of is Hunchy 2, but it doesn't do the complex color changes you do.

After I had some closer look over your complete source code, I no longer think that getting rid of JSR/RTS will help you any or is even possible at all. I (and probably TJ as well) mistakenly thought that you only gain the difference between (),Y and $,X, but this is certainly not the case: When using (),Y you'd be forced to add switchdraw code for both sprites + sprite skipping logic - certainly a killer :)

A thought I had that may bring a great speedup boost though:

How about having lots of RAM stripes with 1 sprite only? Have insane padding of zeros on bottom and top.

And then just use a set of pointers inside the kernel, all properly adjusted to the set of sprites you're going to draw.

Then, you'd only need to copy each sprite shape to RAM once at startup.

(Of course you could do that with giant mostly zero filled ROM tables as well...)

Just another idea... :)

IIRC I ran into similar problems with Jumpman but since it was planned without extra RAM I ended up with the levels getting more and more symmetrical :)

Greetings,
Manuel
  • Report

Cybergoth, on Mon Nov 28, 2005 12:30 PM, said:

A thought I had that may bring a great speedup boost though:

How about having lots of RAM stripes with 1 sprite only? Have insane padding of zeros on bottom and top. And then just use a set of pointers inside the kernel, all properly adjusted to the set of sprites you're going to draw. Then, you'd only need to copy each sprite shape to RAM once at startup. (Of course you could do that with giant mostly zero filled ROM tables as well...)  Just another idea... :)
Unfortunately, ROM isn't infinite. I've only got 6K to work with. Right now I've got ~300 bytes of code space + 2K in the last bank (which I was saving for title / end / load / music?) for the remaining features. So dedicating a bank per sprite really isn't feasible.

However, I've coded up a new blit routine based on your & TJ's suggestions which has cut 600 cycles (8 lines) off the runtime. That gives me some breathing room. I'll put up a new release after I get home and merge together the various changes I've made over the weekend.
  • Report

EricBall, on Mon Nov 28, 2005 5:53 PM, said:

This all depends on whether it is possible to create an equivalent kernel which doesn't require SC RAM.
Probably I simply just still not get the problem. :)

Here is how I understand it:
- your BLIT code fills a 16 bytes RAM area with (partial) data of no more than one sprite at a time
- your kernel reads that data and copies it into GRPx
- you are doing this to save 1 cycle for each read inside the kernel

Right?

If yes, then just use selfmodifying code inside the kernel (using abs,Y or abs,X), which directly points to sprite data. Each sprite has to be padded with 15 blank bytes then.

I will cost you 15 bytes for each sprite, but you will make things much easier.
  • Report

EricBall, on Mon Nov 28, 2005 5:53 PM, said:

32 pixel x 160 line asymmetrical playfield (1LK) stored as 4 pages in SC RAM.  Each 8 lines is a single row of the grid.  PF colour changes used to make ropes & ladders.  Blank line between rows (making a 180 line display) used to reposition one player sprite (alternating between rows).
Sounds a bit like the POP kernel.

BTW: You could save some cycles by using a mirrored playfield. Then you would avoid any writes to PF0 and the AND. So you would have 7 reads and writes (=49 cycles) instead of 7 reads, 9 writes, AND and NOP (59 cycles).
  • Report

May 2012

S M T W T F S
  12345
6789101112
13141516171819
202122 23 242526
2728293031  

Recent Comments

Search My Blog