vdub_bobby, on Fri Mar 9, 2007 1:44 PM, said:
Assembly:
;--count set bits in v
; trashes v, accumulator, result is in c
lda #0
sta c
Loop
lda v
beq Done
inc c
dec v
and v
sta v
bne Loop
Done
Seems like that should be improveable.
I'll take a stab at it...
Execution time for your version is 11 cycles if v is 0, or 5+n*23-1 cycles if v is not zero (where n is the number of bits set in v). Average time (4 bits set in input) is 96 cycles.
One obvious improvement would be to use the X register instead of memory location c. Use "ldx #0" to initialize (saves 3 cycles, assuming "c" was a zero page address), and "inx" instead of "inc c" (saves 3 cycles per set bit in v). Also the code gets 2 bytes smaller. Of course, if you already had written a lot of code that used this as a subroutine, all that code would have to change (to look for the result in X instead of c). Depending on how the calling code used the result, you might end up losing a lot more than the 2 bytes you've saved.
Another improvement would be to move the Loop label down two lines (below the "beq Done")... You only need to do "lda v : beq Done" once, before the first loop iteration (to catch the special case where the initial value of v is zero). At the end of the loop, you're making this check again (the "bne Loop")... if this branch is taken (back to Loop), it's redundant to check v for zero again (you're guaranteed it's not, since you just took the "bne Loop" branch). This will save you 4 cycles per set bit in v, and it won't break compatibility with (hypothetical) existing code that calls the routine, either. (It will add 4 constant cycles before the loop, though)
Applying both improvements (X register and moving the label) would save 7 cycles per loop iteration, which is a 30% speedup (of just the loop)... not bad.
Code looks like:
;--count set bits in v
; trashes v, accumulator, result is in X register
ldx #0
lda v
beq Done
Loop
inx
dec v
and v
sta v
bne Loop
Done
Timing is 8 cycles if v is 0, or 7+16*n-1 if v is not 0... Average case (4 bits set in input) is 70 cycles, 28% faster than the original, assuming I've gotten all the numbers right (if not, I apologize). Total code size is 15 bytes (3 bytes smaller than the original).
However, there are a couple of other ways to count set bits that are faster at the expense of more code. I was messing around with this a while back, and came up with 5 other solutions (I never came up with the Kernighan method on my own, though). Here are a couple of them:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Count the number of set bits in A
; Unrolled version
; Result in A
; Uses one byte of RAM at tmp (preferably ZP)
; On exit: X and Y untouched, C/Z/N undefined
; Code size: 36 bytes (+1 for RTS = 37)
; Execution time: Constant, 5+(7*8) = 61 cycles (+6 for RTS = 67)
countbits_3:
sta tmp; 3
lda #0 ; 2
rol tmp; 5
adc #0 ; 2
rol tmp; 5
adc #0 ; 2
rol tmp; 5
adc #0 ; 2
rol tmp; 5
adc #0 ; 2
rol tmp; 5
adc #0 ; 2
rol tmp; 5
adc #0 ; 2
rol tmp; 5
adc #0 ; 2
rol tmp; 5
adc #0 ; 2
rts
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Count the number of set bits in A
; Nybble table version
; Result in A
; Uses one byte of RAM at tmp (preferably ZP)
; Execution time: Constant, 35 cycles (+6 for RTS)
; Code size: 17 bytes table + 19 bytes code = 36 bytes (+1 for RTS = 37)
; This is faster than the fully-unrolled version,
; but it trashes all the registers and flags. Could use PHA/PLA instead of
; TAY/TYA, to preserve the Y register, at the expense of 2 cycles.
nyb_bits: byte 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
countbits_4:
tay ; 2
and #$0F ; 3
tax ; 2
lda nyb_bits,x; 4
sta tmp ; 3
tya ; 2
lsr ; 2
lsr ; 2
lsr ; 2
lsr ; 2
tax ; 2
lda nyb_bits,x; 4
clc ; 2
adc tmp ; 3
rts ; 6
Neither of these is all that original. The unrolled version is obvious to any 6502'er, and the nybble-table idea came from some code I saw either here or on the [stella] list (I couldn't find the original code, so I rewrote it from scratch).
Here's one I came up with on my own. It's not all that fast, but I like it because it fits in 11 bytes of code and doesn't use a temp location in RAM. Also, if I added a ROL before the RTS, the accumulator and carry flag would keep their original values, which might simplify things for the calling code.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Count the number of set bits in A
; Result in Y
; Uses no RAM
; On exit:
; Y = result, X = 0, A = undefined, C = undefined, Z = 1, N = 0
; Min execution time (input $00, 0 bits set):
; 4+(2+3+2+3)*8-1 = 83 cycles
; Avg execution time (input $AA, 4 bits set):
; 4+(2+3+2+3)*4+(2+2+2+2+3)*4-1 = 87 cycles
; Max execution time (input $FF, 8 bits set):
; 4+(2+2+2+2+3)*8-1 = 91 cycles
; (add 6 cycles for RTS to all 3 counts)
; Code size: 11 bytes (+1 for RTS = 12)
countbits_2:
ldx #8; 2
ldy #0; 2
cb2loop
rol ; 2
bcc cb2_noiny; 2/3
iny ; 2
cb2_noiny
dex ; 2
bne cb2loop ; 3
; rol; adding this ROL preserves the original values of A and C (and sets Z and N according to the A value)
rts; 6