1. Verwendung von Zeigern in C Programmen / Using Pointers in C Programs

Werner Van Belle¹ - werner@yellowcouch.org, werner.van.belle@gmail.com

1- Yellowcouch;

Abstract: This tutorial clarifies what pointers are and how they are written (syntax), where their advantages lie and where their disadvantages show. Because programming beginners often have problems with pointers, questions of programming style and program readability in connection with pointers are also discussed.

Reference: Werner Van Belle; Verwendung von Zeigern in C Programmen / Using Pointers in C Programs;

This tutorial explains what pointers are and how they can be used in C. The tutorial also covers certain aspects regarding programming style: how to avoid pointer problems.

Beginners often have difficulties understanding pointers, mainly because someone who already has problems understanding the content of regular variables will find indirections to variables an even larger obstacle.

Secondly, and this is especially true for C, pointers are closely related to the hardware at hand, the choice of compiler and compiler options, and one should have a fairly correct view of the memory layout of their program to be able to deal efficiently with pointers. Although the C language specification makes some efforts to shield the programmer from the hardware, it actually doesn't succeed that well.

Let's start with the basics: a C program. The shown program has a main routine which allocates 5 variables on its stackframe. These are a, which is an unsigned character (or a byte), b, which is an unsigned integer, c which is an array of 5 integers and i which is the counter we use to iterate over elements. We also allocate a structure in the a_structure variable which is of type point, as declared above.

When the program starts, all these variables are allocated on the stack, so let's have a look at a memory dump of our program.

In the shown memory dump, blue numbers are memory content and should be read from left to right and top to bottom. Each row contains both a decimal and hexadecimal representation. The red numbers are the addresses. In the left column the segment is given (the base address with a 0 in the last digit) and in the top row the offset that should be added to obtain the correct address. For instance the top row right element is 64 (0x40 hex) which has address 0xbff91dff. Now, where are our variables stored ?

Variable a, with content 'D', is stored at address 0xbff91e1f. Variable b is stored from address 0xbff91e18 to 0xbff91e1b. It is interesting to observe that this was clearly a 32 bit compiler, hence the 4 bytes necessary to store a normal integer. The order in which the bytes are stored is determined by the endianness of the hardware and will differ if you have a Motorola (as opposed to Intel) processor. Variable c contains an array of 5 unsigned integers and ranges from address 0xbff91e04 to address 0xbff91e17. This is simply a consecutive series of the 5 elements stored in the array, and consequently we can mark the start and stop boundary of each individual integer. The point structure a_structure is stored from 0xbff91df0 to 0xbff91dff and contains two doubles: x and y, both 8 bytes long.

Although I referred to a variable by its start and stop address, in general, when people talk about 'the address of variable X', they actually want to know the starting address. So in summary: a has address 0xbff91e1f, b has address 0xbff91e18, c has address 0xbff91e04, i has address 0xbff91e00, a_structure has address 0xbff91df0, a_structure.y has address 0xbff91df8 and c[3] has address 0xbff91e10.

To understand that your variables are stored in memory and that they have an address is the first idea necessary to understand pointers.

However, is it sufficient to talk only about the address of something ? Suppose that I would ask what the content at address 0xbff91dfc is. In this case one would look it up and say: 215 (0xd7), which would have been correct if we were talking about a byte. However, often we might be referring to the integer stored at this position, in which case we would need to take the 4 bytes starting there: 215, 227, 52, 64. (The resulting value also depends on the endianness of the hardware.)

It is not sufficient to point to a location; one should also know which type to expect at that address. In other words: to understand the content of an address one should know what type is supposed to be stored there. Without this information neither the compiler nor the human can guess what we are actually referring to.

The type to which we refer is the second idea that is necessary for understanding pointers.

A Pointer is an address and an expected type. The address is runtime information and is just a number that, as any other ordinary value, can be passed around. The expected type is compile time information such that the compiler knows what to do when it sees the address. The combination of both is called a pointer.

Next, we will look into how we can declare pointers and use them in C.

Declaration

Because pointers are addresses that can be used as value, we need some way to store them in variables, and that in turn means that we need some way to declare variables of a pointer type.

In C a pointer is declared as Type * varname, in which Type tells the compiler what type to expect at the address contained in the variable varname. The * makes clear to the compiler that we are not talking about a variable of Type in this declaration.

In the provided examples: A is a pointer to a character. B is a pointer to an integer and Root is a pointer to a node structure. These three variables will each contain an address, so they will all be of the same size; 4 bytes on a 32 bit compiler and 8 bytes on a 64 bit compiler.

A confusing source of mistakes is the fact that the star in a declaration associates with the variable name. If we write char* A, B as a means to declare both A and B as pointers to a character, then we will find that the compiler interprets this as A being a pointer to a character and B just a character. If we wanted both to be pointers, we should have written char *A, *B.

We can also add extra stars if we want, in which case we create pointers to pointers. E.g: char** a is a pointer to a pointer to a character. This is sometimes used to create multidimensional arrays.

To demonstrate how such pointers appear at runtime we fall back onto the knowledge that strings in C are actually 'pointers to characters'. Every time we write a string literal, e.g. "zero terminated string", the compiler will allocate this string (or rather this sequence of characters) in the data segment and replace the literal with the starting address of the sequence.

The memory dump shows the content of 2 memory ranges. The first is our data segment, starting at address 0x804..., the second is our runtime stack, stored in the stack segment, starting at 0xbfb...

In this program the compiler placed our zero terminated string at address 0x8049649. This address is then assigned to variable a, which is a pointer to a character. a itself is stored in the stack segment at address 0xbfba8240 and the content of a is 0x8049649, which is the address of our string. So a is pointing to the start of our string. In typical computer science fashion we also draw a cloud and an arrow.

Operations on the address

Pointers can appear in two types of expressions. One type deals with the addresses, the other with the content at the address pointed to. We start by describing a number of operations that only work upon the address.

Assignment

At runtime, a pointer is merely an address which is used as a value. If we assign one pointer to another, then we just copy the address. The content of the pointers (that is the memory to which is pointed by the pointer) is not touched upon in any way.

This means that it becomes possible to have two variables referring to the same content. In the example, both variables a and b will be pointing to the same string.

Arithmetic

Other address operations are addition (+), subtraction (-), incrementing (++) and decrementing (--) of pointers. Such operations take the address and treat it as if it were an integer.

If a is a pointer to char (with address 0x0001 for instance), then a+5 will be a pointer to a character as well, but with address 0x0006 instead.

Now, here is of course a tricky situation: the compiler is sufficiently smart to take into account the size of the pointed-to type.

If a were an int* then a+5 would be 21 (dec) because 21=1+5*4 (the size of an integer). This trick makes it possible to map array arithmetic onto pointer arithmetic and vice versa.

This illustrates the memory layout of two such pointers. a points to a string at address 0x8049649 and b, calculated as a+5, is 0x804964e, which is 5 characters further down in the string.

Address of

The next operation that deals with addresses is the 'address of'-operator, written as an &.

If we have an expression Expr that is of a certain type Type, then taking the address of this expression is done with &Expr. This will return the start address of the expression at runtime. The type of the compound expression &Expr is Type*.

The example allocates first: 4 normal variables, after which we, in the second part, declare 5 pointers: ptr1, ptr2, ptr3, ptr4 and ptr5. In the third part we start assigning the addresses of the variables to these pointers.

We will now have a look at these various assignments.

The first assignment of &a to ptr1 takes the address of a, which is 0xbf98981b and places this in ptr1. The type of &a is a pointer to a character because a itself is a character.

We can continue assigning all these addresses to the various pointers. &s has as type 'pointer to a point structure' (written as struct point*) and is assigned to ptr2 which is of the same type.

&s.y has type 'pointer to a double' (written as double*) and is assigned to ptr3 which is of the same type. The address of s.y lies in the middle of the s structure.

&c[3] has type unsigned int* and is indeed assigned to an unsigned int* in ptr4.

Now an unexpected possibility arises at ptr5. ptr5 is declared as a 'pointer to a point structure'. We however assign &b to it, but &b is a pointer to an unsigned integer. This clearly makes no sense, so will the compiler object ?

The answer is no: the compiler does compile this, although it might give a warning. It will take the address of b (0xbf989814) and place it in ptr5.

Later on we will see what problems this causes.

Using Pointer Content

Up to now we saw how we can deal with the addresses contained in a pointer. Of course, passing addresses around is nice and makes sharing of data structures possible. It does however not explain how we can use the content of a pointer. For that purpose we need some extra syntax.

To refer to (and consequently read or write) the content of a pointer, one uses the * operator in front of the expression. If Expr is of type T*, then *Expr is of type T.

The example declares two variables a and b, after which we assign the address of a to ptr1. To use the content of ptr1 (the character), we write *ptr1. If we do so the program will start dealing with the content at this address as if it were a character and in this case assign the letter 'F' to it. Because ptr1 refers to a memory location also occupied by variable a, we effectively modified the content of a as well. When we print a we will see that it has become an 'F' instead of 'D'.

To make the distinction between address and content operations clear, we introduce an example that relies on both.

We again use variables a and b, both characters. ptr1 and ptr2 are both pointers to characters. ptr1 is initialized to the address of a and ptr2 is initialized to the address of b.

In the next statement ptr1=ptr2, ptr2 is assigned to ptr1. In this statement the address contained in ptr2 is placed in ptr1. So ptr1 and ptr2 both point to a memory location also occupied by variable b.

In the next statement *ptr1='F' we assign to the content of the memory to which is pointed (hence *ptr1). Thereby we effectively modify variable b.

ptr1 and ptr2 point to memory occupied by variable a and b respectively.

ptr1 receives the address contained in ptr2 and now points also to b

Writing to the content of ptr1, written as *ptr1 results in a modification of variable b.

Bad Aliases

Let's now come back to the ptr5 issue we discussed before. ptr5 is a pointer to a point structure. However, the address we passed into ptr5 was actually pointing to an integer. The compiler will probably warn you about this issue, but will otherwise happily continue with the assignment.

So what will happen now if we were to write (*ptr5).x=0 ?

Writing (*ptr5).x=0 means that we write to the memory pointed to by ptr5 (which thus starts at address 0xbf989814). When we write to this address the compiler assumes that a double is stored there (point.x is a double) and will thus write 8 bytes (a double is 8 bytes long) with value 0. In this case the range 0xbf989814 to 0xbf98981b will be affected, thereby affecting both variable b and a in a fairly unpredictable fashion.

This is a bad thing (tm) and the main reason why pointers are such a dangerous tool. If you use a pointer that is declared to be of the wrong type then you might end up in deep shit.

Why pointers

In the remainder of this tutorial we focus on the reasons behind, and the uses of, pointers.

Data sharing

The first major advantage offered by pointers is that of sharing of data. Since multiple pointers can refer to the same address, it becomes possible to affect multiple data structures (all pointing to the same address) by only modifying the content at this address.

Suppose we have a point that has two fields x and y, both pointing to a double.

If we then declare 4 points, top-left (tl), top-right (tr), bottom-left (bl) and bottom-right (br), such that the y coordinates of the top-left and top-right points both refer to the same address, in this case the address of a double called top, then a modification to top will directly affect both points.

Call by Reference

Using the possibility to share data between different sections of the program makes it also possible to speed up programs. Instead of passing along full structures and their complete content, we could as well pass a pointer along.

This is particularly useful if we think about larger structures, such as a line in our example. Instead of passing around 32 bytes we could only pass along a reference. However, let us first look at the 'normal' method of passing values.

In a normal pass-by-value program we can declare functions that take structures as arguments. Each time a function is called the runtime copies the structure onto the stack and the callee will deal with the received structure. In this example print_point is called twice from within print_line, so it will copy 2 times 16 bytes onto the stack (16 bytes because a point contains 2 doubles, each 8 bytes).

Calling print_line itself afterwards requires the copying of the 32 bytes of the line structure. So in total we copied 64 bytes just to print a line.

To modify the above program to use references (or pointers) instead of structures we need to declare that print_point accepts a pointer to a point structure instead of a point structure.

When doing so, we should also pay attention to the body of print_point. It must be written slightly different because it first needs to dereference p before it can obtain the x and y fields of the structure.

And the last necessary modification is that we need to pass a pointer into print_point instead of a point itself. Consequently, we need to take the addresses of the fields a and b in our line l, which is given by &l.a and &l.b.

Doing the math shows that the runtime now only needs to copy 2 times 4 bytes (which is the size of an address) and 32 bytes for the original print_line. 40 bytes instead of 64 bytes.

We can go a step further and also pass the line structure as a pointer to a line structure instead of a line structure.

To do this, print_line needs to accept a struct line* argument and instead of directly using l, it needs to dereference it first.

In this case the runtime only copies 12 bytes, which -if we only look at memory copy operations- is a speedup larger than a factor 5 compared against the pass-by-value method.

The more observant of you may have noticed that the previous program became hard to read, and expressions such as &((*l).b) are probably the reason why pointers confuse people without end.

In C there is a shorthand notation to avoid such convoluted expressions. Instead of writing (*ptr_to_structure).field one can also write ptr_to_structure->field.

That means that &((*l).b) can be written as &l->b.

Using the -> notation we obtain the following rewrite.

One of the dangers of passing values as a reference lies in the modified semantics. When passing values as reference the value is shared across various parts of the program instead of copied.

Suppose we have a print_point that affects our point structure by setting the x field to 0 (this is admittedly a rather stupid 'printing' function). This assignment will also affect the caller and even the line passed into print_line.

To demonstrate the difference between call-by-reference and call-by-value we need to look at the stack frames. The left part of the picture illustrates passing by value. Each structure is pushed onto the stack. The right stackframe on the other hand has been passing pointers around instead. When we write to p.x (or p->x in the second case) we observe the following:

left: the effect remains local (the print_point stackframe will be removed and thus the caller doesn't notice any effect)

right: the print_point and print_line frames will be removed, but the effect remains visible to the caller(s).

Memory Management

Pointers are also helpful to dynamically allocate memory. We don't always know at compile time how much data our program will need. This means that we need to allocate the memory at runtime, and that means in turn that we need to work dynamically with newly generated addresses. Essentially, we want to allocate and deallocate memory.

The C library offers such functionality through functions such as malloc, memcpy, memset, free and so on. To understand what they return we need to look into three new concepts.

void*

First is the fact that memory management functions return void*. This is not a pointer to the void data structure; it is instead a pointer to nothing. void* is just an address without type information. void* is called a void pointer, or a generic pointer.

Type casting

Because void pointers are only addresses, the compiler can never generate code to deal with the content of these addresses. As such, before a void pointer becomes usable we need to convert it to a proper pointer type. This is done through a regular type cast: (T*)(Expr) will cast Expr into a type T*. (In C, unlike C++, assigning a void* to a variable of type T* performs this conversion implicitly, so the cast may be omitted there.)

NULL

The last aspect of memory allocation is that such functions can return a NULL pointer. A NULL pointer is a pointer that contains as address the value 0x0000.

NULL pointers are valuable to denote that something doesn't exist yet, that something no longer exists, or as a type of special return value. NULL is an invalid pointer that refers to memory that should not be accessed. Writing to it (and on some systems even reading from it) will cause your program to fail.

Sadly enough, not only NULL pointers can be invalid.

Suppose that I would assign a random integer to a pointer. This would mean that the pointer would contain as address this random integer and thus point to a random memory location. Since we cannot know at program writing time what will be stored at this location (if anything related to our program is stored there at all), accessing such an address will cause our program to crash, or might lead to spurious, erratic behavior.

You might be writing to a random location in your data segment, stack segment, or even code segment. You might as well be writing to a random location in the segments of other programs (this is true on systems without hardware supported memory protection), or you could be writing to the kernel memory (on very old systems such as the Atari machines built around the 6502 processor). Clearly, such invalid pointers are quite painful because we cannot recognize them simply by looking at the address.

In practice however an invalid pointer is not always directly invalid; it might as well refer to memory that belongs to our program, but not to what you expect. E.g: our ptr5 example, where we cast a pointer to the wrong type.

How can we avoid invalid pointers ?

Tip #1: The first step is of course to notice compiler warnings where the compiler tells you that you are casting something to an incompatible type.

Tip #2: The dereferencing of a NULL pointer can happen if we don't check against NULL, so it is always useful to ensure that a pointer is not NULL before dereferencing it. An assert statement is a useful tool to do this since it can be compiled out.

Tip #3: When you free an address it might make sense to set the pointers referring to that address to NULL afterwards. This makes it possible at a later stage to verify whether the pointer is invalid.

Tip #4: Initialize pointers. In C variables are not automatically initialized, which means that if you do not set a pointer to NULL it will be pointing to a random (and very likely invalid) address. Just make sure to set them to NULL or to a sensible value when you declare them. This is also true if we allocate a structure. In that case, allocate the necessary memory, cast the void pointer to whatever structure you want and initialize the content of that structure.

Another source of pointer mayhem is pointer arithmetic.

Tip #5: If you allocate an array of size N, then you can index that array from 0 to N-1. Although most people know this, they still often violate it. The best trick to avoid this is to introduce bound checks such as assert(i<size && i>=0).

Tip #6: If you use a pointer to walk through memory (ptr++, or ptr--) make sure you notice a zero terminating byte.

Tip #7: Do not use functions that do not take the size of the target location into account. Notorious examples are strcpy, gets and some other C library functions. These have caused so many breaches of security that it would be worthwhile to figure out which type of brain damage caused somebody to declare these to be part of the C library.

Tip #8: Use memory allocation debuggers (e.g: efence, dmalloc). These have a large repertoire of possibilities such as putting a signature in front and behind each allocated memory block (if it changed at deallocation time, the memory got corrupted). They can also report which pieces of memory were not properly deallocated.

Tip #9: Avoid the use of the & operator. Very few high level programming languages still offer this operator and in a well designed program it has little place. There are however two other good reasons why we should avoid this. First of all, it allows one to take the address of a structure allocated on the runtime stack. As soon as you return such a pointer to the caller, the structure is removed but the pointer remains, making it an invalid pointer. Secondly, and this is related to C++ programs, using an & declaration instead of a * declaration results in slower code with GCC.

Tip #10: One could use pointer arithmetic *(ptr+5) to deal with arrays. One can also use array indices ptr[5]. The latter is more readable and in most situations leads to more efficient code since the compiler knows that ptr will not be modified in the inner loop of a for statement.

There is an entire collection of invalid pointers caused by poor memory management: freeing an invalid pointer, freeing the same structure multiple times, freeing content of a structure that is still referenced elsewhere. And not as painful, but also indicative of a crappy design: forgetting to free the content of data structures, or worse: being unable to decide whether the content of a structure should be freed or not.

These types of errors often indicate poor program design.

Program documentation is important for yourself and co-developers. The easiest way to do that is to specify for each variable whether it can be NULL, whether it might or should not be modified further down and who owns the memory (and thus who should manage it).

As an example, consider a function that will put a key and value in a high level data structure by means of a cache. In this case we could say that both key and value can be NULL. This is not directly obvious since the key will be compared with other keys, so the fact that this function can deal with a NULL key is important information. We also know that the content of key will not be modified by the cache. However, it is necessary to know that the caller should not modify the content of key either after passing it into this function (the main reason being that a modification to the key will screw up the sorting in the cache). The content of value on the other hand can change since the cache doesn't read the content of value. Neither key nor value will be owned by the cache, so the caller is responsible for deallocating the memory. This also means that the key should be removed from the cache before it is deallocated.

Tip #11: Document pointers: who owns the memory, will the content be modified by the callee, may it still be modified by the caller, can the pointer be NULL, can we write to the content ?

Depending on the compiler, hardware and compiler options, there exist a number of esoteric issues with pointers, which often have to do with memory protection schemes and pointer sizes.

On a Solaris system for instance it is possible to read *(0x3) as a character, but if you try to read it as an integer you will get a bus error.

Some compilers (Borland C) make it possible to work with long and short pointers. This just confuses people since they need to figure out how to convert them to each other.

Tip #12: Avoid short pointers, long pointers and features provided to you by your specific version of a compiler. Stick to one pointer interpretation that is all encompassing.

Tip #13: Avoid memory protection schemes that alias memory for the sheer joy of it (e.g: mmapping the same file to different parts in memory is possible, but should maybe not be done, since you will not be able to tell by looking at the pointers that they are effectively the same content).

Tip #14: Avoid writing to memory that is supposed to remain constant (e.g: writing to the content of a string literal; it is not certain whether the compiler shares identical literals or not).

Regarding programming style there are two other tips that can be given. The first is that a function should work in its locality and should not reach too far out of its data space. This is essentially a modified version of the Law of Demeter (which was defined for object oriented programs), but it remains quite valid:

Tip #15: A function should only access its arguments, locally instantiated variables, global variables and direct fields in the above, but not a second indirection.

A good argument why expressions such as a->b->c are a bad thing is that we never checked whether a->b is not NULL. If we need to use double indirections too often, we might think about rewriting the function to deal with a->b, instead of a itself.

The second tip that helps with programming style is the use of handles. A handle is a structure that wraps a void pointer. Handles are particularly successful if you need to design an API for your library that can be used by other programmers.

Tip #16: Handles are structures that contain only a void pointer. The library will cast these handles to proper pointers before using them. This makes it possible for the library user to be blissfully unaware of the internal structures of the library. This means that the library is responsible for memory management, again something the library user doesn't need to care about and the library user doesn't need to work too much with those darn stars.

High level data structures

Another area where pointers are actually necessary is that of high level data structures such as lists (doubly linked or not), trees (balanced or not) and other structures that can refer to themselves.

In this example, we see how a doubly linked list of 4 elements is represented in memory. The cell structure itself refers to a previous and a next element, each of which is again of type 'cell'. This circular reference could not be written without a pointer since that would mean that the compiler would crash or be busy ad infinitum while calculating the size of the cell structure.

In such high level data structures NULL pointers play an important role since they designate the end of the data structure. However, a good way to avoid needing to check continuously against NULL pointers is to use sentinels. Instead of setting the first and last element of a list to NULL we could allocate two structures that designate the first and last element of the list. Any insertion into the list will now be done after the start-sentinel, meaning that the code no longer needs to check whether the newly inserted element happens to be the first, or the last, or both. The same is true for delete operations. Quite a lot of the insertion and deletion logic becomes much simpler if we create two fake cells.

Tip #17: Use sentinels in high level data structures. They tend to make the algorithms simpler, faster and by not introducing NULL pointers at every possible location also make it possible to omit NULL checks.

Polymorphism

Although this starts to be related to object oriented programming, many C programs need some way to represent and deal with data structures that have extra data attached. For instance in a graphics library a pointer to a graphics object could be a pointer to a point, a pointer to a line, a pointer to a square and so on. Casting such a 'graphics object pointer' into the proper type is not possible without adding runtime type information. Of course, the danger is clear: casting to the wrong type will lead to great pain.

Summary

In this tutorial we saw what pointers are, how to declare them, and how to use the address and the content of a pointer. Pointers turned out to be necessary for data sharing, pass by reference, memory management, high level data structures and polymorphic content.

Because pointers offer you a very powerful tool, compilers cannot verify the correctness of your program and you are left to your own devices to deal with and avoid invalid pointers.

Tips to avoid pointer problems

1. Listen to the compiler when it tells you that you are assigning pointers of incompatible type.
2. Check whether a pointer is NULL before using it.
3. After freeing a pointer set it to NULL.
4. Initialize pointers, also in freshly (and/or dynamically) allocated structures.
5. Do bound checking on array content, whether you access the array as pointer or as index: assert(i<size && i>=0)
6. When walking through memory (ptr++, or ptr--) make sure you notice a null terminating byte.
7. Do not use functions that do not take the size of the target into account. Notorious examples are strcpy, gets and some other C library functions.
8. Use memory allocation debuggers such as efence, dmalloc and others.
9. Avoid the use of the & operator. Very few high level programming languages still offer this operator and in a well designed program it has little place.
10. Use array indices instead of pointer arithmetic.
11. Document pointers: who owns the memory, will the content be modified by the callee, may it still be modified by the caller, can the pointer be NULL ?
12. Avoid short pointers, long pointers and features provided to you by your specific version of a compiler. Stick to one pointer interpretation that is all encompassing.
13. Avoid memory protection schemes that alias memory for the sheer joy of it.
14. Avoid writing to memory that is supposed to remain constant (e.g: writing to the content of a string literal; it is not certain whether the compiler shares identical literals or not).
15. A function should only access its arguments, locally instantiated variables, global variables and direct fields in the above, but not a second indirection.
16. Make the user of your library use handles instead of pointers.
17. Use sentinels in high level data structures.

To end, a quick link to http://xkcd.com/138/

http://werner.yellowcouch.org/