
Chimera Queues - Brainstorming discussion


mos6507


I'm opening up this topic to gather ideas for how to implement queues on Chimera.

 

Right now we only have two hotspots reserved for each queue, a seek and a read. Reading from the read hotspot gives you the current value of the queue and indexes forward. When you write to a seek hotspot, it sets the new current index (0-255). Writes to the read hotspot will write to the current index and move the index forward. There are currently no other special features although we have a way of allowing for queues to be larger than 256 bytes. We'd like to gather ideas on how we could implement queues that would be more useful to programmers. Then we'll see what we can implement. So offer your ideas.
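To make that concrete, here's a rough sketch of how the current model reads from the 6507 side. The hotspot addresses below are made up just for illustration (and I'm ignoring how writes physically happen on the cart); only the seek-then-auto-increment behavior is what's described above:

QUEUE0_SEEK = $1E00     ; hypothetical - write: set the queue's current index (0-255)
QUEUE0_READ = $1E01     ; hypothetical - read: return current value, then advance the index

  lda #32
  sta QUEUE0_SEEK       ; jump the queue to index 32
  lda QUEUE0_READ       ; returns byte 32, index moves to 33
  sta GRP0              ; e.g. feed it straight to a TIA register
  lda QUEUE0_READ       ; returns byte 33, index moves to 34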


Right now we only have two hotspots reserved for each queue, a seek and a read. Reading from the read hotspot gives you the current value of the queue and indexes forward. When you write to a seek hotspot, it sets the new current index (0-255). Writes to the read hotspot will write to the current index and move the index forward. There are currently no other special features although we have a way of allowing for queues to be larger than 256 bytes. We'd like to gather ideas on how we could implement queues that would be more useful to programmers. Then we'll see what we can implement. So offer your ideas.

 

Are you using "magic writes", or how are you differentiating between writes and reads?

 

I'll rehash, for the benefit of any newcomers, a couple ideas I've put forth before:

 

-1- For some zero-page addresses, including NUSIZx, ENAxx, CTRLPF, AUDVx, and HMxx, it might be useful to have hotspots at $4x-$6x to take advantage of read-modify-write addressing (have the Chimera output data on D0-D5 during the read). This would save two cycles/scan line for each queue or tone generator used.
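A quick cycle count of what that trick buys (ENABL_QUEUE is a placeholder name, and the cart would presumably pre-adjust the value it drives so the modify step comes out right):

  ; Today: feed ENABL from a queue, 7 cycles per scanline
  lda ENABL_QUEUE     ; 4 - absolute read hotspot (placeholder name)
  sta ENABL           ; 3 - TIA write

  ; With a hotspot overlaid on the TIA mirror at ENABL+$40 ($5F), one
  ; zero-page read-modify-write does both halves in 5 cycles:
  asl ENABL+$40       ; 5 - read cycle: cart drives D0-D5; write cycle lands on the TIA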

 

-2- It may be useful to have a "pipeline" mode where accessing any read hotspot will return the value fetched for the previous queue and then read a value from the selected queue (which will be returned on the next read). This would be a slight nuisance on the 2600 side, but might save some time on the Chimera side.

 

Also, in addition to looking at queues, it may be helpful to look at some of the other features of 4A50 banking. They allow the 6507 to do some amazing tricks faster and more efficiently than with a flat banking space, but there is room for a little improvement (most notably, having two 256-byte banking areas instead of one).


My list of requests:

1. fast read access 4 or 3 cycles (incl. ZP hotspots)

2. consecutive hotspot read access (at least 2 or 3 times before having to pause)

3. fast write access (4 cycles, pauses are ok here)

4. many queues (at least 4, the more the better)

5. intelligent queues:

  • optional flagging (like DPC), define offset for source, start and stop row, return data from source if inside bounds, else 0 (this would make things much easier)
  • fast, direct setup of those parameters (especially for kernels with multi sprites) (like DPC, not just seeking around)
  • combining queues, two combined queues return one result, e.g.:
    i. one queue pointing at the data, the other one defines when to iterate (especially useful for playfields)
    ii. return data contains the logical result of two queues (and, or, xor)
    iii. queues can do basic operations (e.g. multiple shifts)
     
    Combining this functionality would e.g. allow huge horizontal scrolling playfields (queue1 << 2 | queue2 >> 6); see the cost sketch after this list

6. rereading without iterating (nice to have)
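To show why the combining in item 5 matters, here's roughly what the (queue1 << 2 | queue2 >> 6) example costs if the 6507 has to do it itself; all hotspot and variable names are invented:

  lda QUEUE2_READ    ; 4
  lsr                ; 2
  lsr                ; 2
  lsr                ; 2
  lsr                ; 2
  lsr                ; 2
  lsr                ; 2   A = queue2 >> 6
  sta Temp           ; 3   Temp = zero-page scratch byte
  lda QUEUE1_READ    ; 4
  asl                ; 2
  asl                ; 2   A = queue1 << 2
  ora Temp           ; 3
  sta PF1            ; 3   about 33 cycles - hopeless inside a kernel

  ; If the cart does the combining, this collapses to:
  lda PFQ_COMBINED   ; 4   hypothetical combined-queue hotspot
  sta PF1            ; 3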

Edited by Thomas Jentzsch

It would be great if there were a bubble sort (i.e. flickersort) over the values in the queues. In other words, you would load the queue with y values for the sprites and then read them out in sorted order during the kernel. Sorting these values is currently an expensive operation, and it is very difficult to flickersort both player sprites and still have time to do anything else.

 

Chris


It would be great if there were a bubble sort (i.e. flickersort) over the values in the queues. In other words, you would load the queue with y values for the sprites and then read them out in sorted order during the kernel. Sorting these values is currently an expensive operation, and it is very difficult to flickersort both player sprites and still have time to do anything else.

I suppose that's something Chimera could also do outside the kernel. After all, you would have to sort data structures; at the very least you need an object ID which gets sorted too.

 

But having some more or less generic usable sorting functionality would be very good.


But having some more or less generic usable sorting functionality would be very good.

What are your timing expectations for the sort? I mean if you were to load a queue up, then signal a sort, how long do I have to accomplish that before you need the values? I can't feed you values and sort at the same time.

 

Vern


1. fast read access 4 or 3 cycles (incl. ZP hotspots)

I think we satisfy this for reads right now, minus the ZP stuff (I still need to figure that out). For writes, as of this moment, you have to use the SuperCharger method of writing, so that slows you down a little, I am working to fix this.

 

2. consecutive hotspot read access (at least 2 or 3 times before having to pause)

This is operational, I can give access to 68K of queues that can be consecutively read from the cart, no ZP yet.

 

3. fast write access (4 cycles, pauses are ok here)

This is a tough one, but it is in the works.

 

4. many queues (at least 4, the more the better)

Right now we have 116 256-byte queues. There can be 216; I just haven't implemented the second set of 100 yet.

 

5. intelligent queues:

  • optional flagging (like DPC), define offset for source, start and stop row, return data from source if inside bounds, else 0 (this would make things much easier)

Any chance you could break down exactly what you expect?

 

  • fast, direct setup of those parameters (especially for kernels with multi sprites) (like DPC, not just seeking around)

Define fast setup of parameters. At the moment any write is kinda sorta slow.

 

  • combining queues, two combined queues return one result, e.g.:
    i. one queue pointing at the data, the other one defines when to iterate (especially useful for playfields)
    ii. return data contains the logical result of two queues (and, or, xor)
    iii. queues can do basic operations (e.g. multiple shifts)
     
    Combining this functionality would e.g. allow huge horizontal scrolling playfields (queue1 << 2 | queue2 >> 6)

I can do this. We have two types of queues: one set, 16 queues total, resides in the ARM's local SRAM; the other set, 100 (soon to be 200) total queues, resides in external SRAM. All are 256 bytes in size. The local ARM queues are very fast and a lot of manipulation can be done with them, including logic operations between queues, shifts, masks, etc. The other queues aren't so fast, as the ARM has to fetch values from the SRAM that is also shared with the VCS. I am sure we can get these things to work in the faster queues, I just need to work out a way to control all these things.

 

6. rereading without iterating (nice to have)

Already done.

 

Vern


What are your timing expectations for the sort? I mean if you were to load a queue up, then signal a sort, how long do I have to accomplish that before you need the values?

This could take quite some time, if you can do it outside the kernel. Usually you don't have to sort more than 5 to 10 values. If you can sort 10 values within a few hundred Atari 2600 cycles, that would be good enough. But probably you are much faster, right?

 

I can't feed you values and sort at the same time.

No problem, as long as the sort doesn't block the access too long.


For writes, as of this moment, you have to use the SuperCharger method of writing, so that slows you down a little, I am working to fix this.

That would be good.

 

Right now we have 116 256-byte queues. There can be 216; I just haven't implemented the second set of 100 yet.

I doubt any program will ever be able to utilize them all. So if you have to sacrifice some (or even a lot) for additional functionality, then do it. :)

 

Any chance you could break down exactly what you expect?

Currently I would have to waste a lot of bytes for padding. In a 180-line kernel I would need 180 (or maybe 90) zero padding bytes. This wastes way too much ROM.

 

We need to define:

a. the start address, where the queue should start to read from (12 bit)

b. the offset, where it should switch from returning 0 to returning the actual data at the current address (8 bit)

c. the offset where it should switch to padding again (8 bit)

 

This is how the DPC does it, and IMO very efficient.

 

This logic could either be implemented directly into one queue (the optimal solution). Or, if you allow combining two queues, an alternative would be to have one queue pointing at the data and the other at a padding data block. The padding data could then be reused by multiple sprites (still wastes space, but better than nothing).

 

Define fast setup of parameters. At the moment any write is kinda sorta slow.

Being able to control the queues efficiently. E.g. setting up the 3 parameters I mentioned above in less than 50 cycles. The DPC does this by writing 4 values to 4 hotspots (4*4 cycles). Just resetting the queue and then having to iterate until it is at the correct position is IMO not an expedient solution.
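As a sketch of the kind of interface I mean (the hotspot names, and whether the start address points into ROM or into the cart's own SRAM, are pure guesses on my part):

  lda #<SpriteGfx
  sta Q0_START_LO    ; item a: start address, low byte
  lda #>SpriteGfx
  sta Q0_START_HI    ; item a: start address, upper bits
  lda SpriteTop
  sta Q0_TOP         ; item b: offset where real data begins
  lda SpriteBottom
  sta Q0_BOTTOM      ; item c: offset where padding resumes
  ; about 26 cycles as written (more if SuperCharger-style writes
  ; are still required) - comfortably under 50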

 

I can do this. We have two types of queues, one set, 16 queues total, resides in the ARMs local SRAM, the other set, 100 soon to be 200 total queues, resides in external SRAM. All are 256 bytes in size.

So the queues point at fixed addresses? Variable addresses would IMO be very useful. So there are just a few types of queues (fast, slow, intelligent, dumb) and the number of them is only limited by the number of hotspots.

 

I am sure we can get these things to work in the faster queues, I just need to work out a way to control all these things.

If there are a few intelligent queues and some more dumb queues, that's good enough.

 

Maybe my requests are a bit spoiled by the DPC, but it works very close to the requirements.

 

BTW: Another idea:

7. Could the queues be able to read data from a different bank?


-2- It may be useful to have a "pipeline" mode where accessing any read hotspot will return the value fetched for the previous queue and then read a value from the selected queue (which will be returned on the next read).

I had the same idea last night. :)

 

Especially when complex operations are done on the data, it might be better to do it that way, because any necessary delays would then be hidden.

 

Even if consecutive reads from those queues might require some pause, this is better than not implementing complex functionality due to timing problems.


What are your timing expectations for the sort?

This could take quite some time, if you can do it outside the kernel. Usually you don't have to sort more than 5 to 10 values.

Assuming you load the values into a 'fast' queue and it's only 5 or 10 items, I should be able to get that to you on the next access best case, and worst case a couple cycles. Should be really fast. I thought you were talking about 256 items. But obviously the time would depend on the number of items.

 

My ARM programming is all done in a manner that tries to keep paths deterministic. So even if you send me a presorted list, I am going to take the same amount of time to sort it as a worst-case list. The idea behind that is I want to minimize jumping with conditional code. This is really important to keep timings predictable.

 

I will set up a hotspot so the VCS can see when the ARM is busy. That way you can perform special functions and find out how long the ARM takes. You wouldn't have to sample the spot in your final code, just while you're developing code. For some things, like disk access, you won't have a choice; you will have to poll the busy hotspot.
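On the VCS side that polling could look something like this (the hotspot addresses and which bit signals busy are just assumptions for the example):

SORT_TRIGGER = $1FE0   ; hypothetical "start sort" hotspot
ARM_BUSY     = $1FE1   ; hypothetical busy/status hotspot

  sta SORT_TRIGGER     ; kick off the operation
Wait
  lda ARM_BUSY
  bmi Wait             ; assume bit 7 = 1 while the ARM is still working
  ; queue now holds the sorted values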

 

Vern


-2- It may be useful to have a "pipeline" mode where accessing any read hotspot will return the value fetched for the previous queue and then read a value from the selected queue (which will be returned on the next read).

I had the same idea last night. :)

 

Especially when complex operations are done on the data, it might be better to do it that way, because any necessary delays would then be hidden.

 

Even if consecutive reads from those queues might require some pause, this is better than not implementing complex functionality due to timing problems.

So what you want here is a hotspot that just reads out the heads of all your queues?

 

Vern


So what you want here is a hotspot that just reads out the heads of all your queues?

I am not sure if we understand each other correctly here.

 

We need one read hotspot per queue (which reads the head). But maybe the data returned is delayed by one, to gain more flexible timing (for the Chimera).

Edited by Thomas Jentzsch

This is how the DPC does it, and IMO very efficient.

The queue setup here is a lot different than the DPC. There is a 128K SRAM chip on the cart. 64K of that is accessible to the VCS directly through banking and SuperCharger writes; all the game code resides somewhere in here (or will be swapped in from disk). The second half cannot be touched directly by the VCS. This is where the ARM stores most of the queue data. The VCS doesn't have linear access to the queue data. It can only read/write values through a hotspot. The 256 hotspots are available in all banks. Not all of them control queues, but a good chunk of them do.

 

So for this cart you would load a queue with data through a series of hotspot writes or have it swapped in from disk, then you would access all the data through the single hotspot for that queue.

 

If I understand you correctly, you want to setup a queue, provide a start index value, a stop index value. Then as you read data out, a 0x00 is returned every time unless the current position is in between the index values you provided. How is this different than filling the queue with your data and zeros everywhere else, then just reading the queue out?

 

Define fast setup of parameters. At the moment any write is kinda sorta slow.

Being able to control the queues efficiently. E.g. setting up the 3 parameters I mentioned above in less than 50 cycles. The DPC does this by writing 4 values to 4 hotspots (4*4 cycles). Just resetting the queue and then having to iterate until it is at the correct position is IMO not an expedient solution.

You do not have to iterate queues to a correct position. Each queue has a seek hotspot. You simply write a value and the queue goes there.

 

BTW: Another idea:

7. Could the queues be able to read data from a different bank?

Queues are not dependent on current bank selection. All queues are available all the time.

 

Vern


Here's a thought - when you are trying to use a missile, or the ball, to make a sprite, you need HMxx data, NUSIZx/CTRLPF, and ENAxx data.

 

Looking at the NUSIZx/CTRLPF bytes, you want to change the size of the missile/ball but leave the size of the player, or the parameters of the playfield, the same. If those things are unknown (i.e., they might change during the kernel), you'd have to OR them in, like this:

  lda (NusizPtr),Y        ; fetch this line's missile/ball size bits
  ora CurrentPlayerSize   ; merge in the player-size bits that must be preserved
  sta NUSIZ0              ; write the combined value to the TIA

Obviously, you could use the queue in place of the lda (),Y; it would be pretty cool if you could do the ORA operation behind the scenes as well.
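In other words, if the cart ORed the player-size bits in behind the scenes, the sequence above could collapse to something like this (NUSIZ_QUEUE is a made-up hotspot name):

  lda NUSIZ_QUEUE     ; 4 cycles - value arrives already ORed with the player size
  sta NUSIZ0          ; 3 cycles
  ; versus 5 + 3 + 3 = 11 cycles for the lda (),Y / ora / sta version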

I don't know if it would be easier to just set up a separate queue with the values to be ORAed in or if some other method would be easier.

 

...

 

Just some thoughts. :)


We need one read hotspot per queue (which reads the head). But maybe the data returned is delayed by one, to gain more flexible timing (for the Chimera).

That's done; that's exactly how the queues work, except the data is not delayed by one. There is one hotspot for reads and writes and one hotspot for seeks, for each queue.

 

There is a very good chance I am misunderstanding you. If possible use an example.

 

Vern


Obviously, you could use the queue in place of the lda (),Y; it would be pretty cool if you could do the ORA operation behind the scenes as well.

If what you want here is to perform a logical operation against data retrieved from the queue, that's possible and I am looking into it right now.

 

Vern


If I understand you correctly, you want to setup a queue, provide a start index value, a stop index value. Then as you read data out, a 0x00 is returned every time unless the current position is in between the index values you provided.

Yup.

 

How is this different than filling the queue with your data and zeros everywhere else, then just reading the queue out?

Reading it is ok, but filling the queue will require a massive amount of time. E.g. clearing 256 bytes in an unrolled loop would require ~9*256 = 2304 cycles. Do you plan a "clear" hotspot for a queue?

 

Let me give you an example:

I need to display two different sprites at different vertical positions inside a kernel. One is displayed in even frames and one in odd frames. Since it is the same kernel, I have to use the same queue. So I have to set up the queue each frame like this:

1. clear the queue

2. copy sprite data into queue

3. position queue so that the sprite is displayed at the correct vertical position

 

I understand the queues will wrap around, so maybe 1. is not necessary and I will only have to overwrite the old sprite data from the previous frame (may still take some time). But then 3. gets complicated, because I have to position the queue based on the position of the previous frame(s).

 

You do not have to iterate queues to a correct position. Each queue has a seek hotspot. You simply write a value and the queue goes there.

Sounds ok.

 

Queues are not dependent on current bank selection. All queues are available all the time.

So we have a whole bank just for the kernel and non-queued data. And get all graphic data via queues from the SRAM, right? Cool! :cool:


There is a very good chance I am misunderstanding you. If possible use an example.

Supercat and I are just worried that complex queue operations may slow down the read access to the queue. So we had the idea of a buffer register. Whenever you read from the queue, you read from the buffer and the result is returned immediately. Then the Chimera iterates and uses the next few cycles to fill the buffer register.

 

So reading would still be as fast as possible (without waiting for a complex operation result), at the cost of some pause between reads.
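From the 2600 side it would look something like this (hotspot name invented); the only cost is one throwaway read to prime the buffer:

  lda QUEUE_READ      ; priming read - result is stale, starts the first fetch
  ; (other kernel work here gives the cart time to refill the buffer)
  lda QUEUE_READ      ; returns the value buffered since the last access
  sta GRP0            ; and kicks off the fetch for the next read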


1. clear the queue

2. copy sprite data into queue

3. position queue so that the sprite is displayed at the correct vertical position

Why can't you just reuse the data and reseek the queue? Why clear the queue and input the same information that was there before?

 

A queue-clear operation would be possible if needed.

 

Queues are not dependent on current bank selection. All queues are available all the time.

So we have a whole bank just for the kernel and non-queued data. And get all graphic data via queues from the SRAM, right? Cool! :cool:

Right, actual queue data doesn't reside in VCS-addressable space; it's completely separate.


Supercat and I are just worried that complex queue operations may slow down the read access to the queue. So we had the idea of a buffer register. Whenever you read from the queue, you read from the buffer and the result is returned immediately. Then the Chimera iterates and uses the next few cycles to fill the buffer register.

In a sense that is exactly what happens. When you read a queue, you actually read the physical hotspot. When the VCS accesses the hotspot, the ARM kicks in and grabs the next value and places it into the hotspot, so the next time the VCS hits the hotspot, it will get the next value. So it is using the buffered approach you recommended.

 

Vern

Edited by Delicon

Why can't you just reuse the data and reseek the queue? Why clear the queue and input the same information that was there before?

Because I want to display two different-looking sprites. And they could not only have different graphics, but also different colors and size (both horizontal and vertical). So I would have to set up three queues that way (or even more for very complex graphics).

 

An easy-to-understand example is Pac-Man: there, all four ghosts are displayed by one sprite, flickering over four frames. Colors and size are constant inside the kernel, but the graphics differ.


Because I want to display two different-looking sprites. And they could not only have different graphics, but also different colors and size (both horizontal and vertical). So I would have to set up three queues that way (or even more for very complex graphics).

Why not use a different queue for each sprite? Seems like if you have time to switch queue parameters, then you would have time to change queues.

 

What if there was a special set of hotspots that could be reassigned to different queues?

 

Vern

Edited by Delicon

I echo TJ, there has to be a way to populate the queues *very* quickly. Automatically would be best.

 

Some possibilities:

Have a hotspot to clear a queue. I.e., write the queue-number to hotspot CLEAR and that queue is zeroed out.

Even better: be able to fill a queue with a value other than zero (write the queue number to the hotspot first then write the value).

Even better: be able to fill *regions* of a queue with a certain value - i.e., write $9F to the first 100 spots in queue #2.

 

Holy grail: what TJ requested: filling part of a queue with values (plural) from ROM. So you have a sprite graphic in ROM from $F000 to $F00F and you fill spots 32-47 in queue #3 with those graphics. Fill automatically, that is, by writing, say, the initial queue position, the final queue position, and the address from which to pull the data to fill those positions.

So, using above example:

write 32

write 47

write $F000

Done!
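Spelled out as hotspot writes it might look something like this (the hotspot names and parameter order are just guesses, and the 16-bit address obviously needs two writes):

  lda #32
  sta Q3_FILL_FIRST   ; first queue position to fill
  lda #47
  sta Q3_FILL_LAST    ; last queue position to fill
  lda #<$F000
  sta Q3_FILL_SRC_LO  ; source address, low byte
  lda #>$F000
  sta Q3_FILL_SRC_HI  ; high byte - ARM copies $F000-$F00F into slots 32-47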

 

Doable?

Edited by vdub_bobby

Why not use a different queue for each sprite? Seems like if you have time to switch queue parameters, then you would have time to change queues.

 

What if there was a special set of hotspots that could be reassigned to different queues?

That would solve the problem.

 

Another problem would be that you need at least one queue for each sprite (multiplied by the number of animation frames). So if a sprite is 10 pixels tall on average, your approach wastes 246 bytes (or 96%) of space. So 128k of SRAM shrinks to 5.2k if you only use it for sprites.

 

I somehow don't like that. ;)
