
Refactoring Methods

General Remarks

On this page we only want to give an overview of some of the methods we apply in the refactoring of C code. These methods differ from the methods described e.g. in Martin Fowler's book about refactoring. One of the reasons is that we developed the methods for C, whereas Martin Fowler describes them for Java. Another, even more relevant, reason is that we set up the methods for safety-critical microcontroller applications which have to use MISRA C and usually follow a defined, rigid architecture and design rules. Refactoring in this environment has to follow defined rules. For more details we recommend reading the page on refactoring principles, where we go into more detail on this subject.

Make the Code Error-free according to the MISRA Checking Tool

I discussed the ideas of safe coding in C on the pages about design, therefore I will not repeat them here. The main goal of this refactoring step is to let an automatic code checking tool run on your code and to analyze its output log file. These tools come up with most of the important errors you can introduce into a C program while you code. These are potential errors like:

if (a = b){... some code ...}

where almost certainly it was meant to be:

if (a == b){... some code ...}

The tool will also come up with error messages concerning suspicious type casts. If you have:

void my_function(void)
{
    unsigned char x;
    unsigned short y;

    /* ... some code ... */
    x = y;
    /* ... some code ... */
}

The assignment of y to the variable x will lead to a loss of the upper 8 data bits. This may be perfectly o.k. because the data range of y stays within 8 bits. Then you can do a simple type cast to solve the problem:

void my_function(void)
{
    unsigned char x;
    unsigned short y;

    /* ... some code ... */
    x = (unsigned char)y;
    /* ... some code ... */
}

Refactoring at this point has to check the possible data ranges and find a satisfying solution for the problem. This may be a simple type cast to keep the checking tool silent, but it may also mean a bigger re-design of the code portion to avoid the loss of bits in an assignment.
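To illustrate one possible form of such a re-design (a sketch only, the names and the clamping limit are just an example): instead of silencing the tool with a bare cast, the value can be clamped to the target range first, so the narrowing cast is provably loss-free.

void my_function(void)
{
    unsigned char x;
    unsigned short y;

    /* ... some code ... */

    if (y > 0xFFu)
    {
        y = 0xFFu;                /* clamp y to the 8-bit range of x */
    }
    x = (unsigned char)y;         /* the cast can no longer lose data bits */

    /* ... some code ... */
}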
The overall goal is to work on the code until the checking tool does not report any more errors. This will guarantee a certain quality of the code, which is suitable for safety-critical systems.

Get rid of Bitfields

The bitfields in focus here are the so-called ANSI bitfields. They are commonly named that way because they were already defined in the old ANSI C standard (1989). They are defined as follows:

struct BF {
    int bit0:1;
    int bit1:1;
    int bit2:1;
    int bit3:3;
    int bit4:1;
    int bit5:1;
    int bit6:1;
} bitfield;

An individual bit is accessed as:

bitfield.bit1 = 1;

There are several problems involved in these bitfields. First of all, not all compilers allow the same kind of definitions. Some compilers allow multiple data types, e.g. "unsigned short" or "unsigned char", for the definition of the individual bits. The data types can be all basic data types and they can be mixed in the same struct definition. Other compilers rigidly stick to the old ANSI definition, which only allows the data type "int" for the bits and multi-bit fields in the struct (a multi-bit field has e.g. the size of 3 bits in our example). However, if you are forced to use the data type "int", this in reality makes the individual bit a signed number with the value -1. Therefore, if you use a bit in any calculation, it will contribute -1 or 0 to the calculation, although you most likely would expect 1 and 0. This means you have a problem due to unexpected values in your calculation, but you may also have a problem with the portability of your code. Only if you restrict yourself to using "int" as the data type for the bits can you be sure that your code is portable across all compilers.

Another problem is that the memory representation of your ANSI bitfield is completely implementation dependent. You simply do not know how the bits are stored in memory. This means you cannot access the bitfield as a complete word, character or long variable. Such a way of accessing the bitfield would come in handy if you think of a fast initialization: you could initialize the bitfield by writing to it as a word variable and assigning 0 to it. Since you do not know how it is represented in memory, you are forced to initialize it bit by bit via the symbolic expressions. Another problem due to the varying memory representations may occur if you have to store memory data, e.g. to a file, in one CPU environment and reload the data in another environment with a different memory representation of bitfields.

Because of these problems I came up with a method to use normal variables such as unsigned short or unsigned long as bitfields. With some smart macros I can set a bit e.g. with the expression bitfield |= 0x8000; and reset the same bit with the expression bitfield &= ~0x8000;. With a couple of defines for the bits and the operations this leads to an easy-to-use method for bitfields which can also be accessed as normal variables. Their size is the size of the basic data type used in the definition. I also came up with some functions to read and write the multi-bit fields inside a bitfield. They look a bit complex, but runtime measurements showed that this way of dealing with bitfields is even faster than the operations on ANSI bitfields.
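A minimal sketch of the idea (the names and defines are only illustrative, not my original macro set): a plain unsigned variable serves as the bitfield, with defines for the individual bit masks.

typedef unsigned short T_UWORD;

#define BIT_MOTOR_ON ((T_UWORD)0x8000u)
#define BIT_ERROR    ((T_UWORD)0x0001u)

static T_UWORD status_bits;

void bitfield_example(void)
{
    status_bits = 0u;                          /* fast whole-word initialization */
    status_bits |= BIT_MOTOR_ON;               /* set a bit                      */
    status_bits &= (T_UWORD)~BIT_ERROR;        /* reset a bit                    */

    if ((status_bits & BIT_MOTOR_ON) != 0u)    /* test a bit                     */
    {
        /* ... some code ... */
    }
}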

Because of these problems you should have a refactoring step which replaces all ANSI-bitfields with the smarter and less problematic solution.

Data Types have to be used according to the Design Rules

In the design rules I set up, we first of all prohibited all floating point variables. Only a few microcontrollers have a floating point unit and can handle them without overhead. A micro which does not have an FPU will go into quite lengthy library calls to deal with them. This is simply a runtime killer. On top of that, there was never a series-production application which really required the precision and data range of floats. Another problem is that floats are a bit tricky in compare operations (if, for, while, etc.). The expression if (a == b) will rarely be fulfilled if both or one of the variables is a float. The same is true for loops. For these reasons I banned floating point values completely from all systems I was responsible for.

Other data type issues are mainly related to the mixing of data types. You should avoid mixing signed and unsigned calculations. Bring everything to one kind. However, if you are forced to mix them, you should make sure that the conversions are safe and that e.g. the signed value is definitely not negative before you assign it to an unsigned variable. Or have you ever seen this example?

#include <stdio.h>

int main(void)
{
    unsigned char x;

    x = ~0xAA;

    if (x == ~0xAA)
    {
        printf("I should come out here\n\n");
    }
    else
    {
        printf("I should not be here\n\n");
    }
    getchar();
    return 0;
}

Everybody would expect to come out in the "I should come out here" branch. But the program doesn't! The problem comes from the implicit integer promotion the C compiler performs: ~0xAA is an int (e.g. 0xFF55 on a 16-bit compiler), while the assignment to x truncates it back to the 8 bits 0x55, so the two values compared are not equal. A problem due to the implicit mixing of data types. The refactoring should take care of this and bring the code into a safe condition.
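A sketch of how the refactored comparison could look (one possible fix among several): the truncation is made explicit on both sides, so no implicitly promoted int value sneaks into the compare.

#include <stdio.h>

int main(void)
{
    unsigned char x;

    x = (unsigned char)~0xAA;                  /* the truncation is now explicit */

    if (x == (unsigned char)~0xAA)             /* compare two 8-bit values       */
    {
        printf("I should come out here\n\n");  /* and this branch really is taken */
    }
    else
    {
        printf("I should not be here\n\n");
    }
    return 0;
}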

No Global Variables

I described the danger of global variables sufficiently in the design section of this web-site and there is no need to repeat it here. It is also against the idea of object orientation to have global variables. One of the outstanding features of object orientation is that the data are combined with the operative code in the same module and that these data are encapsulated. This means that they cannot be accessed from other modules (objects) except via defined get-functions. If your code needs global variables it is suffering from much more serious design sins and should be refactored to conform to a true object oriented design. After you have done this it should be no problem to make the former global variables at least static and thus not linkable from other modules.
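As a small sketch of what this encapsulation looks like in C (module and function names are invented): the data is a static variable inside the module and the outside world only sees the get-function.

/* speed_sensor.c */
static unsigned short speed_value;         /* encapsulated: not linkable from other modules */

void SpeedSensor_Update(void)
{
    speed_value = 0u;                      /* ... read and scale the hardware input ... */
}

unsigned short SpeedSensor_GetSpeed(void)  /* the defined get-function */
{
    return speed_value;
}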

Standard C-Libraries avoided?

Most programmers will be puzzled by this demand to get rid of the use of standard C libraries. But here I am talking about safety-critical microcontroller applications. The demands of the related international standards show clearly that your source code has to undergo certain quality and test measures before it can be certified in the system it is used in. In other words, there is a big fuss about the code YOU wrote, and then you just use the code SOMEBODY ELSE wrote in the form of standard libraries. And on top of it, in most cases you do not even have the source code of the library, so you are not even able to see how it was done. What? Do you believe this code is perfect just because it was sold to you when you bought your compiler? Fairytales! It was written by programmers, and they make mistakes. If you want to have control of what is happening in your safety-critical system you must not use third party software, and if you have to use it for some reason it has to be transparent regarding the source code, the tests which have been performed on it and the other quality measures which were applied. Usually you do not get this proof. You just get a heap of binaries. Make your own libraries, test them according to the measures required by the international standards and make sure by refactoring that they are used everywhere in your code.

Check the Control and Data Flow

As indicated in the design section, a good system should be data flow driven in its design and the data flow should be uni-directional as far as possible. This means that the outputs which have to be generated by the system need certain data. The output should acquire the data it needs to perform the output. This data is processed in some way and in turn relies on further data, until you reach the inputs of your system. In other words, your objects should have get-functions but preferably no set-functions. Make sure in the refactoring phase that this design principle is adhered to. It may mean bigger changes in the design if you have to restructure the code, but it is worthwhile.
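A short sketch of the pull principle (all names are invented): the output module acquires what it needs through get-functions, so the data flows in one direction only.

extern unsigned short PressureFilter_GetPressure(void);  /* hypothetical supplier module */

void Solenoid_Update(void)
{
    unsigned short pressure;

    pressure = PressureFilter_GetPressure();  /* the output pulls the data it needs */

    /* ... calculate and write the solenoid output from 'pressure' ... */
}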

Use defined Naming Conventions

Naming conventions are a bit formalistic and thus not very highly esteemed. However, if you name your variables, functions and modules consistently and according to a defined system, you will have the great benefit that everybody will immediately know what the code is all about. Starting from the programmer himself, the colleague who does a review, the next programmer who wants to re-use the code or do maintenance on it, the tester: all will benefit from it. I will not go into the naming convention itself since I did this on another page, but use this refactoring method and apply the naming convention to make the code easier to understand.

Use defined Templates

Define templates for your code and use them! There should be clearly identified sections where you put typedefs, defines, function prototypes, the source code itself, etc. The reasons are similar to those for using defined names for variables and functions. It makes the whole code easier to understand and everybody knows where to find what.
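A sketch of what such a template could look like (the section names are only an example; your own template may look different):

/* my_module.c                                                          */
/* ------------------ includes ----------------------------------------*/
#include "datatypes.h"

/* ------------------ defines and macros ------------------------------*/

/* ------------------ typedefs ----------------------------------------*/

/* ------------------ function prototypes (module private) ------------*/
static void my_private_helper(void);

/* ------------------ module data (static) ----------------------------*/

/* ------------------ source code -------------------------------------*/
static void my_private_helper(void)
{
    /* ... some code ... */
}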

Keep the Header Structure simple

Some years ago I came as a newcomer into a project and tried to get all the code and headers collected and to compile them. It took me three days to succeed! O.k., there were some configuration management problems and I had to get some files from certain colleagues instead of being able to get them from the code repository. But most of the time I wasted finding out which headers I needed and in which of their versions. Finally, after I succeeded, I found that for the literal handful of source code modules I needed almost 50 different header files which were nested, had cross includes and all sorts of dependencies. Many of the headers were included because a single define was needed out of the hundreds they contained. I cleaned this out after I was put in charge of the design. Now the same software needs 5 headers and they do not have any dependencies, cross includes or circular includes. And they are single level includes, i.e. there are no nestings. The refactoring step needs to be based on a clear concept and "design" of your headers. You need to know which headers there shall be and what they have to contain. Then make sure that they do not depend on each other. Include them in the right sequence rather than making nestings. The following example shows how to reduce the nestings:

The header file macros.h makes use of a basic data type in the form of a redefinition of "unsigned short":

#define HIGHEST_BIT ((T_UWORD)0x8000)

In order to be able to do this, the file macros.h would have to include the file which contains the redefinition of "unsigned short":

#include"datatypes.h"
#define HIGEST_BIT(T_UWORD)0x8000

The datatypes.h file contains the following typedef:

typedef unsigned short T_UWORD;

But the problem is that many other headers will also need these data type redefinitions. Now imagine that the same happens with other defines which are located in other headers. The result is a nested header structure with circular and cross includes, like the one I described above. The solution is quite simple. You just have to prohibit sub-includes completely, and then, after you have set up a clear concept of what should be in which header, simply include the headers in the right sequence in your source files:

#include"datatypes.h"
#include"macros.h"

Then all headers which are included after the inclusion of datatypes.h have the data types available and do not need to include the file themselves.

Split up C-Functions

There are a few indicators which tell you whether a C function should be split up or not. One of the indicators is the parameter list. If more than 4 parameters are passed to a function, you should have a closer look to find a way to split up the function. The number 4 is based on the usual number of parameters which can be passed to a function in CPU registers. If you have more parameters, they go on the stack. This value may differ depending on the CPU and compiler you use. In some cases it may be six parameters, but not more. This rule is concerned with the runtime and resources of your application, but at the same time it can be said that the more parameters a function has, the more complex it is. A candidate to be split into several smaller functions.

I have seen functions with 40 input interface variables. Not as pass parameters but as global variables. A closer look showed that the function contained a number of independent sections. Each section had e.g. 10 code lines and used 5 of the variables, but the sections were not related to each other in any way. This was an ideal candidate to make a function out of each of these sections and use a pass parameter interface to hand them the needed data. Generally speaking, if a function is bigger than what you can display on a single screen page, you should find a way to split it up.
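A sketch of such a split (all names are invented for illustration): each independent section becomes its own small function with a pass parameter interface instead of working on the global variables directly.

static unsigned short raw_speed;
static unsigned short old_speed;
static unsigned short speed_limit;
static unsigned char  over_limit;

/* former section 1 of the big function, now a small function of its own */
static unsigned short filter_speed(unsigned short raw, unsigned short old)
{
    return (unsigned short)((raw + old) / 2u);
}

/* former section 2, which had nothing to do with section 1 */
static unsigned char check_limit(unsigned short speed, unsigned short limit)
{
    return (unsigned char)((speed > limit) ? 1u : 0u);
}

void process_inputs(void)
{
    unsigned short speed;

    speed = filter_speed(raw_speed, old_speed);
    over_limit = check_limit(speed, speed_limit);
}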

Use temporary Variables the Right Way

There are some commonly practiced sins concerning local variables. First of all, you should not write to a variable which came in via the pass parameter interface. It is an input, and good programming style is that you never ever overwrite an input. Neither a temporary variable which came in via the pass parameter interface nor static or global data. Always introduce a new variable if you have to modify one of your inputs. If you do not observe this, it may be hard to test your software.

Secondly, avoid reusing the same local variable for different purposes. I saw programs which had a variable called "temp" which was used in sequence as a loop counter, for storing intermediate calculation results, etc. Use a separate variable for each of these purposes. If you fear that you will end up with too many local variables in your function and that they are therefore put on the stack, you should split the function!
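A minimal sketch of both rules (the names are invented): the incoming parameter is never overwritten and every intermediate result gets its own local variable.

unsigned short scale_value(unsigned short raw_input)
{
    unsigned short offset_corrected;
    unsigned short scaled;

    offset_corrected = (unsigned short)(raw_input + 10u);  /* new variable, input untouched     */
    scaled = (unsigned short)(offset_corrected * 2u);      /* another purpose, another variable */

    return scaled;
}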

Use local variables for all operations in your function and interface to the outside world, i.e. static variables or get-functions of other objects, at the beginning and at the end of your function. This advice is contrary to what you find in other refactoring literature, but there are some good reasons for it. First of all there is the subject of runtime again. If you call a get-function multiple times in a piece of code, the function will always be called and executed. This generates some overhead, because in order to call a function the address of the next instruction after the function call has to be stored on the stack, some registers may be saved to the stack, there has to be a jump to the called function and finally a jump back to the stored next instruction of your calling program part. All this is considerable overhead compared to simply accessing a variable in memory. Therefore call the sub-function once and use the value, which will be held in a register, throughout your part of the program. This is a trade-off between an object oriented design on the one hand and optimized speed on the other hand. The same is true for the use of static data in your piece of code. If a variable which resides permanently in memory is used multiple times, the program will do a fetch from memory each time it comes across the variable. This fetch operation also consumes extra time, which you can avoid if the value is loaded into a register (local variable) and then used in the program. The following example illustrates this. It uses pseudo assembler code in the comments to illustrate what operations the CPU performs on the memory to run through a sequence of code lines:

void my_function(void)
{
    /* ... some code ... */

    if (My_StaticValue < -8191)          /* Register1 = Memory(My_StaticValue) */
                                         /* Register2 = -8191                  */
                                         /* Register1 COMPARE TO Register2     */
    {
        My_StaticValue = -8191;          /* Memory(My_StaticValue) = Register2 */
    }
    else
    {
        if (My_StaticValue > 8191)       /* Register1 = Memory(My_StaticValue) */
                                         /* Register2 = 8191                   */
                                         /* Register1 COMPARE TO Register2     */
        {
            My_StaticValue = 8191;       /* Memory(My_StaticValue) = Register2 */
        }
    }
}

Now I show the same code lines for the case where the value does not reside in memory but comes in via the pass parameter interface:

void my_function(int My_Value)           /* Register1 contains My_Value        */
{
    /* ... some code ... */

    if (My_Value < -8191)                /* Register2 = -8191                  */
                                         /* Register1 COMPARE TO Register2     */
    {
        My_Value = -8191;                /* Register1 = Register2              */
    }
    else
    {
        if (My_Value > 8191)             /* Register2 = 8191                   */
                                         /* Register1 COMPARE TO Register2     */
        {
            My_Value = 8191;             /* Register1 = Register2              */
        }
    }
}

As you can see, the second design option uses fewer code lines and therefore executes faster. Less code simply means faster execution. On top of that you have to consider that access to memory is a slow operation. A line accessing memory can take double or even more of the time that an operation using only registers would take. There are influences like wait states and other bus issues which slow these operations down.

But there is another subject which gives us a good reason for interfacing to the outside world at the function boundaries and using temporary variables inside the function. As you can see in the example above concerning the data types, the implicit promotion and type casting mechanism of C compilers can play quite nasty tricks on you. By copying every used variable into locals you can use local variables which have the natural bit size of the registers and thus avoid automatic promotion by the compiler. In other words, you simply perform explicitly a step which the compiler would do implicitly anyway, without you being aware of it. The casting and promotion between the variable types of various bit sizes is under your control after you have done this refactoring step. The code for a 16-bit CPU could look like this:

unsigned short my_function(unsigned char my_offset)
{
    unsigned short x;
    unsigned short y;

    x = (unsigned short)my_StaticCharacterX;
    y = (unsigned short)my_StaticCharacterY;

    if (x <= y)
    {
        x = y + (unsigned short)my_offset;
    }
    else
    {
        x = y - (unsigned short)my_offset;
    }
    return (x);
}

Resolve multiple Definitions

Have you ever seen data definitions using a struct in a struct in a struct in a blabla and so on?

There is the feeling of the programmer, or some other reason, to group data together in structs. Certainly there is a place for structs in C, but nesting them over more than one level turns into a nightmare. Access to an element then looks like:

My_struct.my_first_substruct.my_second_substruct.yet_another_struct.element = 1;

Because this is quite a mouthful and turns unreadable in the code, smart programmers tend to have the idea to make defines to shorten the "mouthful". Then they do something like this:

#define STRUCT_L1 My_struct.my_first_substruct
#define STRUCT_L2 STRUCT_L1.my_second_substruct.yet_another_struct
#define MY_ELEMENT STRUCT_L2.element

And after they have made this proud piece of redefinitions they are going to use "MY_ELEMENT" in their code. The result is that the person doing maintenance or testing on these structures and their redefinitions will simply go cuckoo after a day of work trying to find the bottom of this swamp. No joke, I have seen nestings of ten levels and more! Get rid of this. Break it up. There is no logical reason for grouping every single variable or constant used in the system into ONE struct. There are ways to convince your linker to pack data together in memory, and there are ways to break up the monster structs into smaller ones. Simple basic data types, simple and small structs, and if needed a few arrays: that is what you should go for, and your refactoring should achieve this.
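A sketch of the direction to take (the names are invented): small, flat structs, one per real-world thing, accessed over a single level.

typedef struct
{
    unsigned short position;
    unsigned short speed;
} T_MOTOR_DATA;

typedef struct
{
    unsigned short voltage;
    unsigned char  error_flag;
} T_SENSOR_DATA;

static T_MOTOR_DATA  motor;
static T_SENSOR_DATA sensor;

void struct_example(void)
{
    motor.position = 1u;       /* one level of access, no #define tricks needed */
    sensor.error_flag = 0u;
}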

Use Defines instead of Fixed Numbers

This is a simple refactoring method. Instead of using fixed values like 3.14 you should make a define like:

#define PI 3.14

Then in your code you should write "PI" instead of typing "3.14" at the places where you need the constant. This makes the code easier to understand, especially if your defines have sensible names. Furthermore, you can easily change the define at one place and all your code uses the updated value, of course only if you used the define consistently at all places. This avoids inconsistencies in case a constant changes. No need to look through all the modules in this case. The change is effective at all places.

Avoid Function like Macros

Some programmers love function-like macros. They seem to use them at every possible occasion. However, they have a number of problems and therefore I prohibited them in all projects I was responsible for. Let's look at one of the problems. Imagine we have the following macro to do an absolute value calculation:

#define abs(x) (x>0)?x:-x

Then we use the macro in the following code lines:

int main(void)
{
    int b;
    int c;
    int y;

    b = 5;
    c = 7;

    y = abs(b-c);
    return 0;
}

What you would expect is the result 2 being assigned to y, because 5 - 7 equals -2 and the absolute value of -2 is 2. But the function-like macro does not do the trick. The actual result is -12, which is simply wrong. The reason for this is that the macro undergoes a text replacement, and thus the code line y = abs(b-c); expands to:

y = (b-c>0)?b-c:-b-c

Therefore the result is -b-c which is -12.

There are other problems with macros. If you do not explicitly use type casts to bring the involved variables to the data types you intend, they will be automatically promoted to "int", which could involve a change of the sign and the bit size of the value. This also leads to unwanted effects. I simply prohibited function-like macros because there is no logical reason for having them. Just make a simple function out of the macro, and if you fear the overhead make the function "inline"; it will then be expanded into every code portion where the function is called, just like a macro would be, however without the dangers I just pointed out. Just one further hint: inline functions need to be in the same module as the code which makes use of them, but if you put the inline function into a header it can even be used by every module which includes the header. For this subject you should disregard the MISRA standard. MISRA has no problem allowing function-like macros (although it comes up with some restrictions), but the 2004 MISRA prohibits in rule 8.2 putting code into a header. At this point the standard is simply stupid and incorrect, and you should make your own sensible rules here, like I pointed out in this refactoring step.
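A sketch of the replacement (assuming your compiler supports inline, e.g. C99 or a vendor extension): the argument is evaluated exactly once, so the text replacement problem from above cannot occur.

static inline int abs_int(int x)
{
    return (x > 0) ? x : -x;
}

/* usage: y = abs_int(b - c);  this really yields 2 for b = 5 and c = 7 */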

Simplify logical Conditions

Some multiple conditions in C programs can be very hard to comprehend, and they are even harder to test. The same is true for nested "if" conditions. The following line may serve as an example:

if((a == b) || (c == y) || (z != a)) { printf("hello world");}

You can split this up into a few simple if conditions which make it much easier to see what is going on:

tmp = 0;
if(a == b) tmp = 1;
if(c == y) tmp = 1;
if(z != a) tmp = 1;

if (tmp == 1) printf("hello world");

Sorry for this simple example; it should just show the principle of this refactoring step. If you look at the assembler lines the compiler generates out of the two examples, you will find that the effort is almost the same, so runtime or other resource issues cannot be an argument against it. But the split version is much easier to understand, and it is much easier to modify and test.

Cut your Modules the Right Way

How should your modules look? There are a few simple metrics which can help you to set them up in a way that they have a good size. As already mentioned, each function in a module should not be bigger than what can be displayed on one screen page. Then there is the number of get-functions in a module. More than 20 is definitely too many and an indicator that you should find a way to split the module up. Between 10 and 20 is o.k. if it can be justified that the module is good this way. 10 or fewer get-functions is ideal. Set-functions should be the exception in order to keep the data flow unidirectional. The number of private functions (sub-functions only used inside the same module) can be as high as needed. Again, their size should not be bigger than a screen page. If there are very many very little ones you should also have a look into the design of the module. The number of public functions (except get-functions) should not exceed 3. In addition to that there may be an init-function which initializes the module. These numbers are values from experience and reflect the good designs I have seen so far.

In addition to that you should have a look at the interfaces of a module (object). It is best if they use basic data types. If you have arrays and structs (i.e. pointers to them) at the interfaces, you should have a second look to see whether this can be avoided. Very often you can get rid of such an interface by simply moving one function from one module to another.

Last but not least you should always make sure that your modules constitute objects in the sense of object orientation, i.e. they should represent an object of the real world, e.g. a sensor, a switch, a solenoid. Of course you should also make sure that you have an architecture which defines layers in your software (e.g. a HW abstraction layer). All these principles should influence your refactoring activities on the size and shape of your modules.
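As a sketch of what the public face of such a module could look like (the names are only examples): an init-function, very few other public functions and a handful of get-functions, all with basic data types at the interface.

/* pressure_sensor.h */
#ifndef PRESSURE_SENSOR_H
#define PRESSURE_SENSOR_H

void           PressureSensor_Init(void);          /* init-function               */
void           PressureSensor_Update(void);        /* one of at most 3 public fns */
unsigned short PressureSensor_GetPressure(void);   /* get-function, basic type    */
unsigned char  PressureSensor_GetErrorFlag(void);  /* get-function, basic type    */

#endif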
