How about the TIA Painter's demos? The asymmetrical one would be easy to alter into block letters for a message. To save ROM space, you can use a nested loop when drawing the playfield (so that each line of data is repeated n times). Here's an example that does that (replace the corresponding lines in the asymmetrical source code):
;////////////// Start To Draw Playfield ///////////////////////////////
       LDY #23                  ;2 @64 set 24 lines
Draw_Picture_Loop:
       LDX #7                   ;2 @62 set 8 scanlines for each pixel
Block_Pixels_Loop:
       LDA (pf0_vector),Y       ;5 @67 load PF0 before the scanline...
       STA WSYNC                ;3 @70 ...6 cycles short of the 76-cycle line
       STA PF0                  ;3 @3
       LDA (pf1_vector),Y       ;5 @8
       STA PF1                  ;3 @11
       LDA (pf2_vector),Y       ;5 @16
       STA PF2                  ;3 @19
       LDA (pf0_vector),Y       ;5 @24 reload the same PF0 byte...
       ASL                      ;2 @26 ...and shift its low nibble up, so one
       ASL                      ;2 @28    ROM table serves both halves
       ASL                      ;2 @30
       ASL                      ;2 @32
       STA PF0                  ;3 @35
       LDA (pf4_vector),Y       ;5 @40
       STA PF1                  ;3 @43
       LDA (pf5_vector),Y       ;5 @48
       STA PF2                  ;3 @51
       DEX                      ;2 @53
       BPL Block_Pixels_Loop    ;2 @55 (3 cycles when taken)
       DEY                      ;2 @57
       BPL Draw_Picture_Loop    ;3 @60 (taken; LDX then ends at @62)
;////////////// End Of Display ////////////////////////////////////////
24 lines at 8 scanlines high each = 192 scanlines...that would be sufficient for smallish messages: 10 letters across (3 pixels wide + a blank column, which uses all 40 playfield pixels) by 4 rows (5 pixels tall, with a blank row after each, which uses all 24 lines). Fudge the numbers around a bit if you want more rows. You could also use a pixel-height variable and a number-of-rows variable if you want that to be adjustable for screens that need more lines than others...just use those instead of the immediate values in the first couple of lines. There are 6 cycles to spare before the WSYNC, so it's safe.
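Here's a rough sketch of the variable version of those first couple of lines. The names row_count and pixel_height are made up for illustration (they are not in the demo source), and both are assumed to live in zero page and to hold "count minus 1", the same way the immediate values 23 and 7 do.

```asm
; --- in the RAM (zero-page) segment; names are hypothetical ---
row_count     ds 1              ; number of data rows minus 1 (23 in the original)
pixel_height  ds 1              ; scanlines per row minus 1   (7 in the original)

; --- in the kernel, replacing LDY #23 / LDX #7 ---
       LDY row_count            ;3 zero-page load instead of immediate
Draw_Picture_Loop:
       LDX pixel_height         ;3 one cycle slower than LDX #7
Block_Pixels_Loop:
       ; ...rest of the kernel unchanged...
```

Since LDX from zero page takes 3 cycles instead of 2, the spare time before the WSYNC on the row-change path drops from 6 cycles to 5, which still fits. Keep (row_count+1) x (pixel_height+1) at 192, or pad the remainder with blank scanlines, so the frame stays the same height.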
Before executing the kernel, set aside a group of 5 two-byte vectors in zero page (pf0_vector, etc.) and store the LSB/MSB of each data table's address in them. Indirect-Y addressing does the rest.
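The vector setup could look something like this. It's only a sketch: the table labels (PF0_Data and so on) are placeholders for whatever your data tables are actually called, and it assumes the vectors are declared in zero page with the low byte first, which is what indirect-Y addressing requires.

```asm
; --- in the RAM (zero-page) segment ---
pf0_vector    ds 2              ; low byte, then high byte
pf1_vector    ds 2
pf2_vector    ds 2
pf4_vector    ds 2
pf5_vector    ds 2

; --- somewhere before the kernel, e.g. during vertical blank ---
; PF0_Data..PF5_Data are hypothetical labels for 24-byte ROM tables.
       LDA #<PF0_Data           ; LSB of the table address
       STA pf0_vector
       LDA #>PF0_Data           ; MSB of the table address
       STA pf0_vector+1
       LDA #<PF1_Data
       STA pf1_vector
       LDA #>PF1_Data
       STA pf1_vector+1
       LDA #<PF2_Data
       STA pf2_vector
       LDA #>PF2_Data
       STA pf2_vector+1
       LDA #<PF4_Data
       STA pf4_vector
       LDA #>PF4_Data
       STA pf4_vector+1
       LDA #<PF5_Data
       STA pf5_vector
       LDA #>PF5_Data
       STA pf5_vector+1
```

Each byte of the PF0 table carries both halves of the screen: the high nibble is the left-side PF0 and the low nibble is the right-side PF0, which the four ASLs shift into place. Try to keep each table from straddling a page boundary, because LDA (zp),Y takes an extra cycle when the indexed address crosses a page and that would throw off the cycle counts in the kernel.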