Introduction to 80x86 Assembler
Atrevida Game Programming Tutorial #12
Copyright 1997, Kevin Matz, All Rights Reserved.
This chapter and several more following it will introduce 80x86 assembly
language. We'll use Intel 8088-level assembly language, which is fully
compatible with all Intel (and compatible) 80x86 processors, including
the latest Pentium chips.
What is assembly language?
The CPU (central processing unit) or microprocessor acts as an interpreter. It reads instructions one at a time from memory and performs each action. These instructions are in what is called machine language. The instructions are just bytes of binary data. Machine language instructions, which may have data associated with them, may be one byte long; most are two or three bytes long, and some are larger.
For example, there is an instruction which instructs the processor to clear the carry flag. (Don't worry about what this means.) This particular instruction has a one-byte machine language code: "11111000b" (F8 hex). Whenever the processor reads in the instruction F8 hex, it will clear the carry flag.
Multi-byte instructions often include data, usually called operands. Operands are normally extra bytes added after an instruction code; the instruction codes themselves are called opcodes. The extra bytes might hold a value representing a certain register, or an address, or some arbitrary value (perhaps the ASCII code of a certain character).
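To make the opcode/operand split a bit more concrete, here is a small sketch of a few instructions next to the bytes they usually assemble to. Don't worry about what the instructions do yet. The byte values shown are the usual 8086 encodings; an assembler is free to pick an equivalent alternative encoding for some instructions, so treat this as an illustration rather than a guaranteed listing.

```asm
CLC                 ; F8        One-byte opcode, no operands
INC AX              ; 40        One-byte opcode; the register is part of the opcode
MOV AL, 65          ; B0 41     Opcode B0, then one operand byte (41 hex = 65 dec)
MOV AX, 5           ; B8 05 00  Opcode B8, then a two-byte operand, low byte first
INT 10h             ; CD 10     Opcode CD, then the interrupt number as the operand
```

Notice that the two-byte operand 0005 hex is stored as 05 00: that is the little-endian byte order at work.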
To write a machine-language program, we would first need a big chart listing all of the different instructions, and all of the corresponding opcodes, either in binary or hexadecimal. Then we could write a program by writing down all of the necessary instructions in sequence.
As far as I know, this was the way programming was done in the very early years of computing when stored-program computers were new (before that, most computers required "re-wiring" to do different tasks; grab an encyclopedia and look at the photos of ENIAC and its buddies). There are a few problems with this method though, which assembly language and higher-level languages solve.
The first problem is that of unreadability: a program consisting of pages upon pages of hex digits is not very readable, at least for humans. Imagine searching for a bug in a stack of pages of hex digits.
The second problem involves the lack of variables, or at least variable names. You have to manually set aside and specify addresses to chunks of memory to serve as variables.
The third problem, which is even more serious, involves addressing issues. If you want to insert extra instructions into the middle of a program, you have to recalculate any addresses or relative addresses (described later, don't worry) in your program to accommodate the changes in location of the different sections of code. This makes modifying existing programs incredibly difficult.
Assembly language solves these problems. Assembly language allows you to refer to instructions by longer names, called mnemonics. So, instead of remembering or looking up F8 hex, you can use the mnemonic "CLC" to clear the carry flag. The majority of mnemonics are three-letter abbreviations, although they can be anywhere between two letters (e.g. "JE") and nine or so letters (e.g. "CMPXCHG8B") in length. These mnemonics will seem confusing and hard to remember at first, but they do actually become quite easy to memorize. They are far better than long strings of hex digits.
Assembly language also allows the use of labels and variables, which let you give names to addresses in a program without specifying the actual "concrete" addresses.
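Here is a minimal sketch of what a label buys you. It uses the JMP (jump) instruction, which belongs to the branching material of a later chapter, so just treat it as a preview; the label name "Skip" is made up for this example.

```asm
    MOV AX, 1           ; Let AX = 1
    JMP Skip            ; Jump to the label "Skip"; the assembler
                        ; works out the actual address for us
    MOV AX, 99          ; Never executed, because we jump over it
Skip:
    INC AX              ; Execution continues here; AX is now 2
```

If you later insert more instructions between the JMP and the label, you don't have to change anything: the assembler recalculates the jump for you the next time you assemble the program.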
An assembler is a program that takes an assembler source code file and (perhaps in conjunction with a linker) translates it to machine language, so that it can be run (that is, interpreted by the processor). It's really just like a compiler, except that there is a one-to-one relationship for the assembler's instructions: a mnemonic, plus its operands (data), always translates to a single machine language opcode, plus the operands after it. In a compiler, commands and structures normally expand into several (or many) machine language instructions.
Why use assembly language?
Assembly language instructions perform "tiny little actions". Instructions exist to do such things as increment, decrement, add, subtract, perform logical operations (AND, OR, XOR, etc.), compare two values, move values from one register to another, and so on. You won't find any assembly language instructions to write text to the screen or handle keyboard input.
This means assembler programs have to deal with a lot more detail. For writing to the screen, for example, you can't just call a pre-written function like printf(). You have to write your own special routines in assembler to write to the screen (which, in this case, can be made a bit easier using interrupts).
So a major disadvantage is the micromanagement that you must handle. It almost always takes longer to write a program in assembler than it does to write an equivalent program in a high-level language. So why would anyone use assembly language?
The two biggest reasons are control and speed. Assembler allows you direct access to whatever hardware devices and resources you want. You can do anything however you like, whereas in high-level languages, if you don't like the way a built-in function works, there's not much you can do about it. This added control lets you optimize for speed. Properly written assembler programs can run much faster than compiled programs, for several reasons. Compilers often generate redundant code or code that could execute faster if written a different way. (Modern compilers are actually very good at producing optimized code, but there is still room for improvement.) Also, high-level languages often perform extra error checking, such as bounds checking. True, C does basically no error checking, which is why it's generally faster than other languages. Assembler gives you total control over these matters, so you can decide how efficiently a routine should run.
In terms of readability and maintainability, high-level languages are far superior to assembler. The worst aspect of assembler is its complete lack of portability -- you can't easily convert your program to other platforms that use different families of microprocessors. Other makes and models of processors use different instruction sets and assembly languages.
Today, assembly language is mainly used for speeding up critical routines used in larger programs that are written in high-level languages such as C or C++. In this and the following chapters, we'll write both independent, stand-alone programs, and functions to be used in C and C++ programs.
Which assembler should I use?
These tutorials are written with Borland's Turbo Assembler (TASM) in mind. It seems to be the most widely-used assembler for the PC, so it's more or less the standard. I'm satisfied with it. The latest version of TASM as of this writing is 5.0. It's impossible to find in stores and it's not cheap, although you can get a slight discount if you're a student.
A good alternative is the shareware assembler A86, along with its debugger D86. These should be reasonably easy to find on the internet -- do an FTP search or visit some software repository sites and look for "A86". If you do decide to use A86, please register it; I believe the registration fee is about $50 US.
Microsoft has or had an assembler, MASM -- I don't think they update or support it any longer. There are also some other lesser-known commercial brands of assemblers, and there are some old shareware ones such as CHASM.
The assembler code here has only been tested with TASM. For other assemblers, you may need to do some basic conversions. The modern assemblers are more or less similar. The instructions and opcodes must be the same across all assemblers for the PC; they mainly differ in the formatting of the "overhead" assembler directives and structures, and they differ in terms of fancy new features.
While you're choosing an assembler, you might want to go out and get a book on assembly language. These tutorials will cover all of the important points, but it's always good to have a second source of reference. More importantly, though, make sure that you get an assembly language book that has a good instruction set listing at the back. If computer books are too expensive (yes, they certainly are), you can use an on-line instruction set reference such as http://www.qzx.com/pc-gpe/intel.doc.
Getting started
Most assembler books and tutorials start with binary and hexadecimal numbers. I'm assuming that you've read Chapters 1 through 3; if you have, congratulations, you're already familiar with binary and hex, so we don't need to bother with it here.
I'm also assuming you've read Chapters 4 through 6, so when we have to deal with memory addressing, interrupts, and hardware ports, you'll already know what's going on and we'll only need to learn how to use them with assembler.
Registers
The 8088 processor's registers were covered in Chapter 5, but they are so important that I'm going to briefly review them here. You'll probably find it helpful if you print out or sketch a copy of the following diagram:
        |<------------------- Word -------------------->|
        |<---- Byte (high) ---->|<---- Byte (low) ----->|
bits:    15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
AX      |          AH           |          AL           |  "Accumulator"
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
BX      |          BH           |          BL           |  "Base"
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
CX      |          CH           |          CL           |  "Count"
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
DX      |          DH           |          DL           |  "Data"
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
CS      |                                               |  Code Segment
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
DS      |                                               |  Data Segment
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
ES      |                                               |  Extra Segment
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
SS      |                                               |  Stack Segment
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
IP      |                                               |  Instruction Ptr.
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
SI      |                                               |  Source Index
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
DI      |                                               |  Destination Index
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
SP      |                                               |  Stack Pointer
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
BP      |                                               |  Base Pointer
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
FLAGS   |  |  |  |  |OF|DF|IF|TF|SF|ZF|  |AF|  |PF|  |CF|  (Flags)
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Here are brief explanations of the registers. If you've read Chapter 5 recently and you feel familiar with this material, skip it!
AX, BX, CX, and DX are the "scratch-pad" or "general-purpose" 16-bit registers. Although they can be used for whatever miscellaneous purposes, some instructions require that certain data be present in particular registers: AX is usually used for storing values to be operated on by mathematical operations. CX is often used as a counter. BX and DX are occasionally used to store address segments or offsets. These registers can also be accessed in 8-bit fragments: AH, BH, CH, and DH are the high 8-bit registers, encompassing the most-significant (left-most) bytes. AL, BL, CL, and DL are the low 8-bit registers, encompassing the least-significant (right-most) bytes. (Little-endian order doesn't apply to the processor's registers, although you could re-draw the diagram a bit to simulate that order.)
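A short sketch may help show how the 8-bit halves relate to the full 16-bit register. It uses the MOV and INC instructions, which are explained later in this chapter; for now, just follow the values in the comments. The particular numbers are arbitrary ones picked for illustration.

```asm
MOV AX, 1234h       ; AX = 1234 hex, so AH = 12 hex and AL = 34 hex
MOV AL, 0FFh        ; Only the low byte changes: AX = 12FF hex
INC AL              ; AL wraps around from FF hex to 00 hex; AH is
                    ; left alone, so AX = 1200 hex (not 1300 hex)
MOV AH, 0           ; Only the high byte changes: AX = 0000 hex
```

The point to take away is that AH and AL are not separate storage: they are simply the two halves of AX, and an 8-bit operation on one half never spills over into the other.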
The segment registers are CS, DS, ES, and SS. CS is the code segment register, storing the segment portion of the address that points to the next instruction in memory to be executed. The data segment register, DS, refers to the segment in which data (variables) are stored. ES stands for "extra segment". It can be considered a general-purpose segment register, although it does have uses with certain string operations. SS is the stack segment register. Its function and its uses with SP and BP are best left for later discussion.
IP, SI, DI, SP, and BP are offset registers. They work in conjunction with the segment registers. CS:IP points to the next instruction in memory. DS:SI and ES:DI can be used to reference variables, and they are frequently used with string operations. Again, SP and BP will be discussed later.
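FLAGS stores several single-bit boolean status flags. A flag is set if it is 1, and is clear if it is 0. The flags will be described as we need them; here are their meanings:
OF    Overflow flag
DF    Direction flag
IF    Interrupt enable flag
TF    Trap (single-step) flag
SF    Sign flag
ZF    Zero flag
AF    Auxiliary carry flag
PF    Parity flag
CF    Carry flag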
The newer processors have more registers: the 386 and up, for example, have 32-bit registers (eg. EAX), additional segment registers, debugging registers, and so on. We won't concern ourselves with these registers here.
Using some simple instructions
We're ready to start using some simple instructions. Let's start with the INC instruction. It takes the form:
INC destination
Let's increment the AX register. We would write:
INC AX
This would add 1 to whatever value was stored in the 16-bit AX register. If we wanted to decrement AX or some other register, we would use the DEC instruction:
DEC destination
To decrement the AX register, we would write:
DEC AX
The other registers work too (although for now, let's just stick to the general-purpose registers):
INC BX
DEC CL
INC DH
DEC DX
Many instructions take two operands. One of the most commonly used instructions is MOV, which means "move". MOV is analogous to the assignment operator ("=" in C and BASIC, ":=" in Pascal, etc.) in high-level languages (so actually, "copy" would be a better name than "move"). It takes the form:
MOV destination, source
source is copied to destination. In C, we might want to set one variable equal to another. If our variables were x and y, we might say "x = y;" to make x equal y. In assembler, if we wanted BX to equal DX, we would use:
MOV BX, DX ; Let BX = DX
(In fact, in Turbo or Borland C/C++, we could use the register pseudo-variables and write "_BX = _DX;".)
Note that comments can be added by using a semicolon. Semicolons in assembler work the same way "//" does in C++: when a semicolon is encountered, everything after it on that line is ignored.
If we wanted to copy the contents of AH to CL, we would write:
MOV CL, AH ; Let CL = AH
The source operand need not be a register. It can be an arbitrary value (a "literal"), such as 5, or a variable (discussed later). This is perfectly acceptable:
MOV DL, 5 ; Let DL = 5
This lets DL equal 5 dec. Literal values are assumed to be decimal; if you want to specify a hexadecimal number, add an "h" (uppercase and lowercase both work), and also add a "0" prefix if the first digit is between A and F. ("5Ah" means 5A hex. "05Ah" also means 5A hex, but the leading "0" is unnecessary. "0FFh" means FF hex. But, to the assembler, "FFh" looks like a variable name.) For example:
MOV AX, 0E75Ah ; Let AX = E75A hex
Binary values can be specified by adding a "b" or "B" to the end of the binary number:
MOV AL, 10101010b ; Let AL = 10101010 bin
A prefixed "0" is not necessary, although it wouldn't hurt either.
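To tie the three notations together, here is a small sketch that loads the same values using decimal, hexadecimal, and binary literals. The values themselves are arbitrary examples.

```asm
MOV AL, 65          ; Decimal:     AL = 65 dec
MOV AL, 41h         ; Hexadecimal: 41 hex is also 65 dec
MOV AL, 01000001b   ; Binary:      01000001 bin is also 65 dec

MOV BL, 200         ; Decimal:     BL = 200 dec
MOV BL, 0C8h        ; Hexadecimal: C8 hex = 200 dec; the leading "0" is
                    ; required because the first digit is a letter
```

Each group of instructions leaves exactly the same bit pattern in the register; the notation only changes how the number is written in the source file.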
Perhaps we want to add two values together. Instruction ADD takes the form:
ADD destination, source
The value of source is added to the value already existing at destination, and the result is stored back in destination. So, let's suppose we want to add 5 and 3. Here's one way:
MOV AX, 5           ; Let AX = 5
MOV BX, 3           ; Let BX = 3
ADD AX, BX          ; Let AX = AX + BX (using C notation: AX += BX)
After these instructions, AX would contain 8. We can reduce the number of instructions needed from three to two by using a literal as the source operand for ADD:
MOV AX, 5           ; Let AX = 5
ADD AX, 3           ; Let AX = AX + 3
Let's try some more instructions. SUB, for subtract, works in the same way that ADD does:
SUB destination, source
source is subtracted from destination, and the result is stored back in destination:
MOV AX, 200         ; Let AX = 200 dec
MOV BX, 50          ; Let BX = 50 dec
SUB AX, BX          ; Let AX = AX - BX (using C notation: AX -= BX)
AX would then contain 150 dec. Or, using a literal value for the source operand:
MOV AX, 200         ; Let AX = 200 dec
SUB AX, 50          ; Let AX = AX - 50 dec; AX will be 150 dec
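As a slightly longer sketch, here is how INC, DEC, ADD, and SUB might be chained together, with the running values shown in the comments. The numbers are arbitrary; the thing to watch is that each result lands back in the destination operand.

```asm
MOV AX, 100         ; Let AX = 100 dec
MOV BX, 25          ; Let BX = 25 dec
ADD AX, BX          ; AX = 100 + 25 = 125 dec
SUB AX, 5           ; AX = 125 - 5  = 120 dec
INC AX              ; AX = 121 dec
DEC BX              ; BX = 24 dec
SUB AX, BX          ; AX = 121 - 24 = 97 dec
```

Because the destination is overwritten each time, copy a value to another register with MOV first if you will need the original later.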
What if we wanted to do bitwise logical operations? Conveniently, we are provided instructions such as AND, OR, XOR, and NOT:
AND destination, source
source and destination are ANDed together and the result is stored in destination:
MOV AL, 00001111b   ; Let AL = 00001111 bin
AND AL, 10101010b   ; Let AL = 00001111 bin AND 10101010 bin ("&" in C)
AL would then contain 00001010 bin.
The OR instruction takes the form:
OR destination, source
The source and destination are ORed together and the result goes back into destination:
MOV AL, 00001111b   ; Let AL = 00001111 bin
OR AL, 10101010b    ; Let AL = 00001111 bin OR 10101010 bin ("|" in C)
AL would then contain 10101111 bin.
The XOR instruction works the same way:
XOR destination, source
source and destination are XORed together and the result is stored in destination:
MOV AL, 00001111b   ; Let AL = 00001111 bin
XOR AL, 10101010b   ; Let AL = 00001111 bin XOR 10101010 bin ("^" in C)
AL would then contain 10100101 bin.
NOT, of course, is a single-operand instruction:
NOT destination
All the bits in destination are flipped and the result is stored back in destination:
MOV AL, 00001111b   ; Let AL = 00001111 bin
NOT AL              ; Let AL = NOT AL ("~" in C)
AL would then contain 11110000 bin.
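In practice, these instructions are most often used with "masks" to clear, set, or toggle individual bits. The following is only a sketch with a bit pattern I picked for illustration; it works on bit 5 of AL and leaves the other bits alone.

```asm
MOV AL, 01100001b   ; Let AL = 01100001 bin
AND AL, 11011111b   ; Clear bit 5:  AL = 01000001 bin
OR  AL, 00100000b   ; Set bit 5:    AL = 01100001 bin
XOR AL, 00100000b   ; Toggle bit 5: AL = 01000001 bin
XOR AL, AL          ; Anything XORed with itself is 0: AL = 00000000 bin
NOT AL              ; Flip every bit: AL = 11111111 bin
```

The rule of thumb: AND with a 0 bit clears that bit, OR with a 1 bit sets it, and XOR with a 1 bit flips it. Bits that line up with the "neutral" value (1 for AND, 0 for OR and XOR) pass through unchanged.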
Notes on code formatting
Just like with other programming languages, there exists a never-ending debate between programmers regarding the way that code should be formatted. The traditional way to write assembler code was to strictly group everything in columns. This was apparently done to facilitate the transfer of mainframe computers' assembler code from paper to punched cards (other languages like COBOL had, or still have, similar requirements). For example, all labels (described later) were to begin in column 1 and extend no further than column 11, instructions (mnemonics) were to begin in column 12 and extend no further than column 18, and operands and comments were to start in some particular columns only. In addition, everything was to be in upper case. Code in this style might look something like this:
START:     MOV    AX , 5          ; LET AX = 5
           MOV    BX , 2Eh        ; LET BX = 2E HEX
           ADD    AX , BX         ; LET AX = AX + BX
           INC    AX
           DEC    BX
           AND    AX , BX
It's very organized, but at the same time it's a bit difficult to read. More modern styles do not as strictly separate the instructions and the operands. The AX, BX, AX, AX, BX, and AX in the third column in the above code sample don't have any relationship to each other, other than the fact that they are all operands. The operands should be closer to the instructions, because they make more sense that way, and it's easier to read. Also, most assembler programmers nowadays seem to prefer lower-case (mixed with capitals where sensible), which is probably easier to read.
Start:
    mov ax, 5           ; Let ax = 5
    mov bx, 2eh         ; Let bx = 2e hex
    add ax, bx          ; Let ax = ax + bx
    inc ax
    dec bx
    and ax, bx
I myself still use upper-case for instructions and registers, but I use mixed-case for labels and variable names, and I normally indent the "actual code" (not labels) four spaces, so that the labels stand out. These are just my personal preferences, of course. Develop a style that you feel comfortable with, and try to use it consistently.
Calling interrupts
We're going to write an actual assembler program soon, but none of the instructions we have learned about really do anything noticeable. Adding or subtracting or incrementing values in registers won't show anything on the screen. So how can we indicate that our program actually does anything, and how can we test that it works?
We can use interrupts -- recall Chapter 5. There are interrupt services that can write characters or strings to the screen, so we'll use one of them.
If you're worried that calling interrupt services in assembler will be difficult, you'll be pleased to learn that it's actually easier to do in assembler than it is in C/C++. All you need to do is set up the registers as per the interrupt service's description in an interrupt list, and then you simply use the INT instruction:
INT interrupt_number
So, let's use interrupt number 10 hex, service 0A hex, to display a letter "A" on the screen. If you don't have an interrupt list handy, here's the quick description:
INT 10h, Service 0Ah
Write Character Only at Cursor Position
Input:  AH = 0Ah
        AL = ASCII code of character to write
        BH = Screen page number
        CX = Count of characters to write
Output: The character is written CX times to the screen,
        starting at the current cursor position.
We can use MOV instructions to set up the registers. Then, to call the interrupt service, we simply use "INT 10h":
MOV AH, 0Ah         ; Interrupt service number 0A hex
MOV AL, 65          ; 65 dec is the ASCII code for "A"
MOV BH, 0           ; Screen page is 0 (default)
MOV CX, 1           ; Write one character only
INT 10h             ; Call the interrupt service and
                    ; write the "A" to the screen
That should work. The assembler should also let you replace the line
MOV AL, 65
with either
MOV AL, "A"
or
MOV AL, 'A'
The "A" or 'A' is essentially "converted" to its ASCII equivalent, 65 dec.
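The CX input in the service description suggests an easy variation: ask for more than one copy of the character. This is a sketch along those lines; the asterisk and the count of 5 are just values I picked for the example.

```asm
MOV AH, 0Ah         ; Interrupt service number 0A hex
MOV AL, '*'         ; Character to write
MOV BH, 0           ; Screen page is 0 (default)
MOV CX, 5           ; Write the character five times
INT 10h             ; "*****" is written, starting at the
                    ; current cursor position
```

Only the AL and CX values differ from the single-character version; the rest of the register setup stays the same.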
While we're dealing with interrupt calls, let's take a look at one more example. This next example will probably be used in every stand-alone assembler program we write.
To end an assembler program, we actually need to call a DOS interrupt service. It's Interrupt 21 hex, Service 4C hex:
INT 21h, Service 4Ch
Terminate Program
Input:  AH = 4Ch
        AL = DOS errorlevel return code
Output: The program ends, and control is transferred to the
        calling program or operating system shell.
Normally, the errorlevel is 0, to indicate that no errors occurred. If a serious error did happen in your program (perhaps running out of memory), you would usually set the errorlevel to some non-zero value. That way, another program that was run after your program could take special actions based on the errorlevel.
So, in your assembler programs, wherever you want the program to quit, you can use:
MOV AH, 4Ch         ; Interrupt service number 4C hex
MOV AL, 0           ; Errorlevel return code 0
INT 21h             ; Call the interrupt service to
                    ; terminate the program
Here's an optimization tip: it takes slightly longer for the processor to execute two MOV instructions that move 8 bits each than it does to execute one MOV instruction that moves 16 bits. The two MOV instructions in the above code sample are copying to both halves of the AX register, so why not write a single 16-bit value to AX using one MOV instruction? Here's how we might do it:
MOV AX, 4C00h       ; Let AH = 4Ch (interrupt service
                    ; number), and let AL = 0
                    ; (errorlevel return code)
INT 21h             ; Call the interrupt service to
                    ; terminate the program
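The same trick works when you want to report a problem. As a sketch, here is how a program might exit with an errorlevel of 1; the value 1 is an arbitrary choice for this example, and what a given non-zero errorlevel means is entirely up to your program.

```asm
MOV AX, 4C01h       ; Let AH = 4Ch (interrupt service
                    ; number), and let AL = 1
                    ; (errorlevel return code)
INT 21h             ; Call the interrupt service to
                    ; terminate the program
```

The high byte of the literal ends up in AH and the low byte in AL, so the errorlevel is always the last two hex digits.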
Writing our first assembler program
We're ready now to actually write an assembler program! Let's incorporate the code samples given above to write the letter "A" to the screen and to terminate the program.
Type in this program (no, don't cut and paste, you won't learn anything that way), and call it TEST1.ASM, or some other reasonable filename:
------- TEST1.ASM begins -------
%TITLE "TEST1.ASM: Asm. Test Program #1 -- writes a character to the screen"

    IDEAL
    MODEL small             ; For a .EXE file
    STACK 256

    CODESEG

Start:
    ; Write the "A" to the screen, using Interrupt 10 hex, Service 0A hex:
    MOV AH, 0Ah             ; Interrupt service number 0A hex
    MOV AL, 65              ; 65 dec is the ASCII code for "A"
    MOV BH, 0               ; Screen page is 0 (default)
    MOV CX, 1               ; Write one character only
    INT 10h                 ; Call the interrupt service and
                            ; write the "A" to the screen

    ; Exit the program:
    MOV AX, 4C00h           ; Interrupt service number 4C hex,
                            ; return error code 0
    INT 21h                 ; Call interrupt service to end the
                            ; program

    END Start               ; Begin execution at label "Start:"
------- TEST1.ASM ends -------
Now, time for some explanations.
The "%TITLE" line is an assembler directive (like a preprocessor directive in C/C++). It simply identifies the program: you can put your name or the date or the program's purpose here. It can be omitted if you like.
The "IDEAL" line is another directive. It turns on TASM's Ideal mode, which has more features and is easier to work with than MASM-style syntax (MASM is Microsoft's Macro Assembler). If you leave out the "IDEAL" directive, TASM will emulate MASM.
"MODEL small" is another directive. "MODEL" lets you choose which memory model you wish to work with (discussed later). There are many choices; if you have configured your Turbo or Borland C/C++ compiler to use different memory models, then you'll be familiar with memory model names such as tiny, small, medium, large, and huge. There are others. For now, we'll use small, which is more than adequate for most of our stand-alone programs.
"STACK 256" tells the assembler to set aside 256 bytes of memory for the program's stack segment. We'll learn about the stack later.
"CODESEG" tells the assembler that the code segment starts at that line. The code segment is where you put your actual assembly-language code. (With older assemblers, you might have to chant mystical incantations like "CODE SEGMENT PARA PUBLIC 'CODE'" to do the same thing.)
"Start:" is a label. The place where you want execution to begin is traditionally labelled "Start:", although there's nothing wrong with calling it something else. You can jump or branch to a label in the same way you can GOTO to a label in a high-level language.
The remaining part of the program looks familiar. You simply place your assembly-language code in this area. Notice the code to terminate the program.
The very last line reads "END Start". The "END" tells the assembler to stop reading the assembler source file -- anything after the "END" line will be ignored. What does the "Start" do here, though? This tells the assembler to begin the execution of the program at the "Start:" label. Again, you could rename "Start" to something else, or, if you have many labels in a program, you could specify a different one.
Let's assemble this program and see if it works. If you're using Turbo Assembler, and you have your PATH set correctly, get to a DOS prompt and enter:
TASM TEST1.ASM
This produces TEST1.OBJ. To create a .EXE file, use the Turbo Linker utility:
TLINK TEST1.OBJ
This will produce TEST1.EXE, as well as a map file, TEST1.MAP. You can now run TEST1.EXE, and you should see an "A" displayed.
That was a lot of work just to get a single letter displayed! You'll find that, in assembler, everything takes lots of work. That's why we normally use high-level languages as much as possible, and use assembler only when necessary.
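If you'd like something to experiment with, here is a sketch of a small variation on TEST1.ASM. The filename TEST2.ASM and the choice of character are my own inventions for this example. It reuses exactly the same skeleton, but works out the character code with ADD and writes three copies of it.

```asm
%TITLE "TEST2.ASM: hypothetical variation on TEST1.ASM"

    IDEAL
    MODEL small             ; For a .EXE file
    STACK 256

    CODESEG

Start:
    ; Work out the character code with arithmetic:
    MOV AL, 'A'             ; AL = 65 dec, the ASCII code for "A"
    ADD AL, 2               ; AL = 67 dec, the ASCII code for "C"

    ; Write the character three times using Int 10 hex, Service 0A hex:
    MOV AH, 0Ah             ; Interrupt service number 0A hex
    MOV BH, 0               ; Screen page is 0 (default)
    MOV CX, 3               ; Write three characters
    INT 10h                 ; "CCC" is written to the screen

    ; Exit the program:
    MOV AX, 4C00h           ; Service 4C hex, errorlevel 0
    INT 21h                 ; Terminate the program

    END Start               ; Begin execution at label "Start:"
```

It assembles and links the same way as TEST1.ASM, with the new filename substituted. Note that setting AH after the ADD does not disturb AL, since the two are separate halves of AX.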
If you've got Turbo Assembler, you've probably also got Turbo Debugger. This is a helpful utility that lets you step through assembler code in the same way you can step through C/C++ code in the Turbo or Borland C/C++ IDE. It lets you watch all of the processor's registers change as you step through the program one instruction at a time.
I'll briefly show you how to step through our TEST1 program using Turbo Debugger. If you don't have a debugger, or you're using D86 or CodeView or something else, skip the next paragraph.
Assemble the program using "TASM /zi TEST1.ASM". The /zi command-line option tells TASM to include full debugging information. Link the object code using "TLINK /v TEST1.OBJ". The /v option tells TLINK to include full symbolic debugging information. Then run Turbo Debugger with "TD TEST1". From the View menu, select "CPU". On the right, you'll see the contents of the processor's registers and flags. In the center, you'll see the assembler code, and on the left, you'll see addresses and then the actual machine language equivalents (in hex) of the assembler code. You can use the F8 key to single-step through the code. Watch the registers change to new values. When the program writes something to the screen, use Alt-F5 to see the output screen. For more details, see your Turbo Debugger manual.
Summary
In this chapter, we've learned about what assembly language is, how it is related to machine language, and when and how we should use it. We've reviewed the 8088's registers and flags. We've learned how to use the instructions INC, DEC, MOV, ADD, SUB, AND, OR, XOR, NOT, and INT. We've seen some code formatting styles. And we've written and assembled a simple assembler program.
In the next tutorial, we'll learn how to use variables (not just registers), we'll learn more instructions, and we'll find out how to use the stack. We'll eventually also find out about branching, comparing, looping, multiplying and dividing, addressing modes, flags, string instructions, hardware ports, clock cycles, and other fun topics. We'll also discover how to integrate assembler code with our C/C++ programs.