LLX > Neil Parker > Apple II > Kaleidoscope
(There's also a version in Flash now too. The Java applet that used to be on this page is archived here.)
Those of you who have been in the Apple II world long enough to remember the DOS 3.3 System Master disk may recall the Integer BASIC program on it called COLOR DEMO. One of its menu options is KALEIDOSCOPE, which draws pretty symmetrical pictures on the lo-res screen.
I occasionally like to run COLOR DEMO and watch the pretty kaleidoscope colors. Listing the source code, I noticed that it doesn't run forever--it stops after a fixed number of iterations. But each iteration was rather slow, and I never had the patience to watch the whole thing from start to finish.
Then one day I studied the listing more closely, and it occurred to me that the source code was rather inefficient, and that I could probably speed it up quite a bit with a little rewriting--maybe by enough to watch the whole thing. So this article is about my adventures optimizing KALEIDOSCOPE.
The original Integer BASIC source code, stripped of extraneous fluff, is as follows:
700 GR : CALL -936: FOR W=3 TO 50: FOR I=1 TO 19: FOR J=0 TO 19:K=I+J 750 COLOR=J*3/(I+3)+I*W/12 760 PLOT I,K: PLOT K,I: PLOT 40-I,40-K: PLOT 40-K,40-I 770 PLOT K,40-I: PLOT 40-I,K: PLOT I,40-K: PLOT 40-K,I 780 NEXT J,I,W
There are a couple of obvious things that can be improved in this code. In line 750, the expression I*W/12 is recomputed on each pass through the J loop, even though it doesn't depend on J. Likewise, there's no need to recompute 40-I and 40-K for each PLOT statement.
So here's an improved routine:
700 GR : CALL -936: FOR W=3 TO 50: FOR I=1 TO 19: X=I*W/12: I40=40-I: FOR J=0 TO 19:K=I+J:K40=40-K 750 COLOR=J*3/(I+3)+X 760 PLOT I,K: PLOT K,I: PLOT I40,K40: PLOT K40,I40 770 PLOT K,I40: PLOT I40,K: PLOT I,K40: PLOT K40,I 780 NEXT J,I,W
(Moving X=I*W/12 and I40=40-I out of the J loop is called loop invariant removal. Precalculating I40 and K40 instead of recalculating them each time is called common subexpression elimination.)
But that's not the best that can be done. Multiplication on any 6502-based hardware is always a somewhat slow operation, due to the 6502's lack of a hardware multiply instruction. The J*3 operation can clearly be made more efficient by writing it as J+J+J. What's not quite so easy to see is that something similar can be done with I*W...note that each time through the W loop, I*W starts out with the value of W, and each time through the I loop, it increases its value by W. Thus the following code is mathematically the same as the above, but uses only quick additions instead of slow multiplications:
700 GR : CALL -936: FOR W=3 TO 50:IW=0: FOR I=1 TO 19:IW=IW+W: X=IW/12: I40=40-I: FOR J=0 TO 19:K=I+J:K40=40-K 750 COLOR=(J+J+J)/(I+3)+X 760 PLOT I,K: PLOT K,I: PLOT I40,K40: PLOT K40,I40 770 PLOT K,I40: PLOT I40,K: PLOT I,K40: PLOT K40,I 780 NEXT J,I,W
(The IW optimization is called strength reduction).
This code runs noticeably faster than the original code, but I still found it too slow to wait for. Clearly more drastic measures are called for...
Machine language! That's right...my next step was to hand-compile the BASIC into into 6502 code. Not only does machine code run much faster than interpreted BASIC could ever hope to, there are additional optimizations that wouldn't be of much help to the BASIC but would help machine language.
First the variables, all on page zero for speed. All the variables used in the BASIC code can be stored in single bytes, except for IW, which needs to store values up to 950. The two additional variables A1L and A1H are temporaries used in the division by 12 and by (I+3). Memory locations were chosen to avoid conflict with BASIC's zero-page usage.
1 W = 6 2 I = 7 3 J = 8 4 K = 9 5 A1L = $3C 6 A1H = $3D 7 K40 = $FC 8 I40 = $FD 9 IWL = $FE 10 IWH = $FF
Necessary I/O locations:
11 KBD = $C000 12 KBDSTRB = $C010 13 LORES = $C056 14 AN3OFF = $C05F
Necessary ROM routines:
15 PLOT = $F800 16 SETCOL = $F864 17 SETGR = $FB40 18 HOME = $FC58
Now for the code. First we initialize the lo-res screen and clear the text area at the bottom. The LDA LORES is done because SETGR doesn't, and the LDA AN3OFF is in case somebody was using double-res graphics recently.
19 ORG $2000 2000: 20 40 FB 20 JSR SETGR ;GR 2003: AD 56 C0 21 LDA LORES 2006: AD 5F C0 22 LDA AN3OFF 2009: 20 58 FC 23 JSR HOME ;HOME
Now the W loop. This just starts it out--the increment of W and test for the final value are done at the bottom of the loop.
200C: A9 03 24 LDA #3 ;FOR W=3 TO 50 200E: 85 06 25 STA W
Initialize IW. I40 is also initialized here so instead of subtracting I from 40 each time, we can just DEC it each time through the I loop. This is another strength reduction, which wasn't worthwhile in BASIC but is in machine language.
2010: A9 00 26 WLP LDA #0 ;IW=0 2012: 85 FE 27 STA IWL 2014: 85 FF 28 STA IWH 2016: A9 28 29 LDA #40 ;I40=40 2018: 85 FD 30 STA I40
Now the I loop begins. It works just like the W loop.
201A: A9 01 31 LDA #1 ;FOR I=1 TO 19 201C: 85 07 32 STA I
IW=IW+W is straightforward.
201E: A5 06 33 ILP LDA W ;IW=IW+W 2020: 18 34 CLC 2021: 65 FE 35 ADC IWL 2023: 85 FE 36 STA IWL 2025: 90 02 37 BCC ILP1 2027: E6 FF 38 INC IWH
Here IW is divided by 12, and the result is stored in the X register (which is safe because nothing else in the code clobbers it). The division is done using a standard 16-bit by 8-bit long division algorithm (with the dividend in A1L and A1H), coded inline because it's used only once and it saves the cycles that JSR/RTS would use.
2029: 85 3C 39 ILP1 STA A1L ;X=IW/12 202B: A5 FF 40 LDA IWH 202D: 85 3D 41 STA A1H 202F: A0 10 42 LDY #16 ;(inline 16/8 division) 2031: A9 00 43 LDA #0 2033: 06 3C 44 DIV1LP ASL A1L 2035: 26 3D 45 ROL A1H 2037: 2A 46 ROL 2038: C9 0C 47 CMP #12 203A: 90 04 48 BCC DIV1A 203C: E9 0C 49 SBC #12 203E: E6 3C 50 INC A1L 2040: 88 51 DIV1A DEY 2041: D0 F0 52 BNE DIV1LP 2043: A6 3C 53 LDX A1L
Here's the previously-promised decrement of I40. Also, K40 is set up to receive the same treatment.
2045: C6 FD 54 DEC I40 ;I40=I40-1 2047: A5 FD 55 LDA I40 ;K40=I40 2049: 85 FC 56 STA K40
The beginning of the J loop. The end of the loop (below) will guarantee that the accumulator holds the current value of J for the K=I+J computation.
204B: A9 00 57 LDA #0 ;FOR J=0 TO 19 204D: 85 08 58 STA J 204F: 18 59 JLP CLC ;K=I+J 2050: 65 07 60 ADC I 2052: 85 09 61 STA K
Since you can't control-C a machine language program, this gives the user a chance to quit at any time by pressing a key.
2054: AD 00 C0 62 LDA KBD ;IF PEEK(KBD)>127 THEN DONE 2057: 10 06 63 BPL NOKEY 2059: 8D 10 C0 64 STA KBDSTRB 205C: 4C DD 20 65 JMP DONE
The color computation. The shift-and-add computation below is quicker that adding J to itself three times. Note that J is never greater than 19, so the carry flag is automatically guaranteed to be clear before the first two ADC instructions. Another inline division is used (A1L by A1H, result in A1L)...this one only needs to divide 8 bits by 8 bits.
205F: A5 08 66 NOKEY LDA J ;COLOR=J*3/(I+3)+X 2061: 0A 67 ASL 2062: 65 08 68 ADC J 2064: 85 3C 69 STA A1L 2066: A5 07 70 LDA I 2068: 69 03 71 ADC #3 206A: 85 3D 72 STA A1H 206C: A9 00 73 LDA #0 ;(inline 8/8 division) 206E: A0 08 74 LDY #8 2070: 06 3C 75 DIV2LP ASL A1L 2072: 2A 76 ROL 2073: C5 3D 77 CMP A1H 2075: 90 04 78 BCC DIV2A 2077: E5 3D 79 SBC A1H 2079: E6 3C 80 INC A1L 207B: 88 81 DIV2A DEY 207C: D0 F2 82 BNE DIV2LP 207E: 8A 83 TXA 207F: 18 84 CLC 2080: 65 3C 85 ADC A1L 2082: 20 64 F8 86 JSR SETCOL
The eight plot commands. These could be made faster by rearranging them so that all the plots with the same Y coordinate are done in groups, which minimizes the number of times the row address has to be calculated. But I kept the plot order the same to guarantee that the display would identical to the original BASIC display.
2085: A4 07 87 LDY I ;PLOT I,K: etc... 2087: A5 09 88 LDA K 2089: 20 00 F8 89 JSR PLOT 208C: A4 09 90 LDY K 208E: A5 07 91 LDA I 2090: 20 00 F8 92 JSR PLOT 2093: A4 FD 93 LDY I40 2095: A5 FC 94 LDA K40 2097: 20 00 F8 95 JSR PLOT 209A: A4 FC 96 LDY K40 209C: A5 FD 97 LDA I40 209E: 20 00 F8 98 JSR PLOT 20A1: A4 09 99 LDY K 20A3: A5 FD 100 LDA I40 20A5: 20 00 F8 101 JSR PLOT 20A8: A4 FD 102 LDY I40 20AA: A5 09 103 LDA K 20AC: 20 00 F8 104 JSR PLOT 20AF: A4 07 105 LDY I 20B1: A5 FC 106 LDA K40 20B3: 20 00 F8 107 JSR PLOT 20B6: A4 FC 108 LDY K40 20B8: A5 07 109 LDA I 20BA: 20 00 F8 110 JSR PLOT
The second half of the K40 strength reduction, and the end of the J loop:
20BD: C6 FC 111 DEC K40 ;K40=K40-1 20BF: E6 08 112 INC J ;NEXT J 20C1: A5 08 113 LDA J 20C3: C9 14 114 CMP #20 20C5: D0 88 115 BNE JLP
The end of the I loop. Alas, the top of the loop is too far away to reach with a BNE.
20C7: E6 07 116 INC I ;NEXT I 20C9: A5 07 117 LDA I 20CB: C9 14 118 CMP #20 20CD: F0 03 119 BEQ NOILP 20CF: 4C 1E 20 120 JMP ILP
And the end of the W loop:
20D2: E6 06 121 NOILP INC W ;NEXT W 20D4: A5 06 122 LDA W 20D6: C9 33 123 CMP #51 20D8: F0 03 124 BEQ DONE 20DA: 4C 10 20 125 JMP WLP 20DD: 60 126 DONE RTS --End assembly, 222 bytes, Errors: 0
And that's it.
The machine-language version runs in about 20 seconds on a standard Apple II, and in about 8 seconds on a IIGS at full speed. That's plenty fast enough to satisfy my curiosity--I had no problem waiting for this version to finish.
So was it worth it? Well...that depends. I was glad to finally have a version that I could watch from start to finish, but the final pattern isn't really any more spectacular than the early stages.
I had entirely too much fun writing and optimizing the machine language version.
Here's the source code and binary for those who want to try the machine language version themselves but don't want to type in the hex code manually. And here's the assembly listing for those who want to see it without the running commentary.
(Just so Applesoft programmers don't feel left out, only a few minor modifications need to be made to convert the Integer program to Applesoft.)
700 GR : HOME : FOR W = 3 TO 50:IW = 0: FOR I = 1 TO 19:IW = IW + W: X = INT (IW / 12): I4 = 40 - I: FOR J = 0 TO 19:K = I + J:K4 = 40 - K 750 COLOR= INT ((J + J + J) / (I + 3)) + X 760 PLOT I,K: PLOT K,I: PLOT I4,K4: PLOT K4,I4 770 PLOT K,I4: PLOT I4,K: PLOT I,K4: PLOT K4,I 780 NEXT : NEXT : NEXT
(This could be speeded up slightly by putting all the numeric constants in variables, since Applesoft can look up a variable faster than it can parse a constant. It can be speeded up a lot by compiling with the Beagle Compiler, but nowhere near as much as the handwritten assembly code above.)
Update: Since writing the above, I've learned a couple of things: First, the Integer BASIC code apparently didn't originate in COLOR DEMO, but dates back to the famous Red Book, the original reference manual that shipped with early Apple II's. I've never seen an authentic Red Book, but apparently the kaleidoscope program was originally called ROD'S COLOR PATTERN, and was attributed to Randy Wigginton.
Second, I'm not the only person out there in Internet-land who's tried hand-compiling this code. The earliest ones I've seen were published in the Apple Assembly Line newsletter--the January 1984 issue contains a 68000 version by Bob Urschel (intended to run on an Apple II with a 68000 coprocessor card), and the March 1984 issue contains a 6502 version by Charles H. Putney. There's also a version by John B. Matthews.
None of these versions uses quite the same approach as my code. Bob
Urschel's version is a straight translation of the original Integer
BASIC, with no attempts at optimization at all. Charles Putney's
version does the loop invariant removals and the common subexpression
eliminations, but uses table lookups for
J*3/(I+3), and the PLOTs. John Matthews's version is
somewhere between those two extremes, with no optimizations for the color
computation, but using table lookups to speed up the PLOTs. My effort
probably falls somewhere between the Matthews and Putney versions.
LLX > Neil Parker > Apple II > Kaleidoscope
Original: June 4, 2004
Modified: August 21, 2005--added Update
Modified: February 5, 2007--Apple Assembly Line is at a new URL
Modified: June 1, 2008--John B. Matthews moved to a new URL
Modified: October 20, 2008--added link to Flash version