Analyzing IEEE 754 Floatingpoint Values in C/C++
Kenneth R. Vollmar
Southwest Missouri State University
901 S. National Ave.
Springfield, MO 65804
(417) 8365789
KenVollmar@mail.smsu.edu
ABSTRACT
Results of floatingpoint operations can confuse beginning computer science students because the true behavior and value of floatingpoint numbers are not apparent when using the default options for precision and output formats. This paper demonstrates a simple technique in C/C++ to isolate the bit fields of floatingpoint data formats, aiding in the analysis of a floatingpoint number. Several demonstrations are made using this analysis technique to explain program behavior with floatingpoint data types.
Keywords
Floatingpoint, IEEE 754, computer arithmetic, bits, programming
INTRODUCTION
All computer scientists are likely to encounter the following behavior of floatingpoint data types, which may be especially confusing in introductory courses:

The accuracy of floatingpoint data types is not as expected.

Floatingpoint data types may be transparently mixed within a program.

The evident value is not always the true value.

Mathematical laws such as the distributive property apparently do not hold.

Mathematically equivalent operations give different results.

Assignment of intermediate data to a variable makes a difference in final results.

A logical test for equality of differenttyped floatingpoint variables fails, though the same values were assigned.
Beginning computer science students are likely to use only the default formatting options for precision and output, but then the true behavior and value of floatingpoint numbers are not apparent. Many introductory explanations of floatingpoint data types are simply that floatingpoint data types contain realvalued numbers in a certain range. Such oversimplifying descriptions cannot give students a true grasp of the capabilities, limitations, and methods for use of the floatingpoint data type.
The apparent anomalies may be explained by analysis on the floatingpoint data types, using a technique to examine the individual bits in a floatingpoint number.
THE IEEE 754 FLOATINGPOINT DATA TYPE
Almost every modern computer and compiler uses the IEEE 754 floatingpoint standard [2]. “This standard has greatly improved both the ease of porting floatingpoint programs and the quality of computer arithmetic.” [6, p. 277] It is most likely that university computing environments will conform to the IEEE standard. Importantly, there is some debate about the compliance of Java [8] with the entire standard, although the storage format is not in question.
The IEEE 754 standard completely describes the behavior and operations on floatingpoint data types, including advanced options such as the use of rounding modes and other computational techniques and controls. This paper focuses on the storage of numeric values within the floatingpoint data type.
Typical documentation in a textbook or with a compiler describes the floatingpoint data type as capable of containing real numbers, and gives the range of a floatingpoint data type. This statement is correct, but not complete. It is easy to miss important characteristics of the floatingpoint data type: Only some real numbers can be represented, and the precision of the representation varies with the size of the number.
The number of bit patterns in an Nbit integer data type is 2^{N}, the same as the number of bit patterns in an Nbit floatingpoint data type. A student may find this paradoxical when reading typical documentation, such as shown in Table 1. [1]

Type

Length

Range

long

32 bits

2,147,483,648 to 2,147,483,647

float

32 bits

3.4 x 10^{38} to 3.4 x 10^{+38}
[sic]

Table 1. Typical description of ranges of data types [1]
How can the ranges of the two data types be so different when the number of bit patterns are the same? It is true that both data types can contain 4,294,967,296 different bit patterns. Those patterns in the case of the 32bit integer data type are mapped to integers in a two’s complement representation. Those patterns in the case of the IEEE 754 32bit floatingpoint data type are nonevenly distributed in the range . Students should be aware that no 32bit data type can represent as many numbers as only the integers in the range , and that the number of real numbers in any range is infinite and therefore cannot possibly be represented in a finite number of bits.
The value represented by the IEEE 754 32bit floatingpoint data type is given by the formula
(1)^{SignBit} (1 + Significand) 2^{(Exponent  127)}
For instance, in the range [2,4), the representable numbers are shown in Table 2. The number of bits in the significand is a constant, so the number of representable values between powers of two remains constant. However, the distance between neighboring representable values changes for each value of exponent, as shown in Table 3. Some bit patterns are used to store special values, as shown in Table 4 (from [7]).
Bit representation

Value

(1)^{0} (1.000…000) 2^{1}

(1 + 0/2^{23}) 2 = 2

(1)^{0} (1.000…001) 2^{1}

(1 + 1/2^{23}) 2 2.000000238418579102

(1)^{0} (1.000…010) 2^{1}

(1 + 2/2^{23}) 2 2.000000476837158203

. . .

. . .

(1)^{0} (1.111…110) 2^{1}

(1 + (2^{23} 2)/2^{23}) 2 3.999999046325683594

(1)^{0} (1.111…111) 2^{1}

(1 + (2^{23} 1)/2^{23}) 2 3.999999284744262695

Table 2. Representable numbers in the numeric range [2,4). (IEEE 754 32bit floatingpoint)

Subrange

Stored exponent

True
exponent

Size of gap between representable numbers

(special case)

0



[2^{126}, 2^{125})

1

126

2^{149}

[1, 2)

127

0

2^{23}

[2^{i}, 2^{i}^{+1})

127 + i

i

2^{(}^{i}^{23)}

[2^{126}, 2^{127})

254

127

2^{124}

(special case)

255


 Table 3. Positive representable numbers. (IEEE 754 32bit floatingpoint)

IEEE 754 32bit float pattern

Value

exponent = 0; significand 0 (at least one bit in significand is nonzero)

(1)^{sign} 2^{}^{126} 0.significand (subnormal numbers)

exponent = 0; significand = 0 (all bits in significand are zero)

(1)^{sign} 0.0 (signed zero)

sign = 0; exponent = 255; significand = 0 (all bits in significand are zero)

+INF (positive infinity)

sign = 1; exponent = 255; significand = 0 (all bits in significand are zero)

INF (negative infinity)

sign = ?; exponent = 255; significand 0 (at least one bit in significand is nonzero)

NaN (NotaNumber)

Table 4. Special cases in IEEE 754 32bit floatingpoint numbers.
A C/C++ PROGRAM TO ISOLATE THE FIELDS OF A FLOATINGPOINT NUMBER
Operations on the individual bits of a data type are only available in C/C++ for integer data types, not floatingpoint types. The following process converts the bits (not the value) of the floatingpoint type into the bits of a samesized integer type. The process is generally more convenient and portable for 32bit floatingpoint data types than for 64bit types, since there is usually an available integer type whose byte length is at least 32 bits. Information on using this technique for the 64bit type appears at the end of this section.
In order to determine the value of a floatingpoint number, the data fields (sign, exponent, and significand) must be isolated as individual numbers. The fields of the IEEE 754 32bit data type are shown in Figure 1.

Bit 31

Bits 30…23

Bits 22...0

SignBit

Exponent

Significand

Figure 1. Bit fields of IEEE 754 32bit float type
Two’s complement representation is not used by any part of the data type. The SignBit indicates the sign of the entire number. The 8bit stored exponent is “biased” by a value of 127 from the “true” exponent. The significand indicates a fraction to be added to the number 1, which is not stored and is a “hidden bit.” For instance, the value will contain 0 in the sign field, in the exponent field, and in the significand.
Separating one bit field from another within the data type requires bitwise logical operations, which in C/C++ are not allowed on floatingpoint data types. The floatingpoint data type bits must be accessed as if they belonged to an integer data type. The union construct of C/C++ is a means of accessing the same memory location using different memory types.
// C/C++ union construct
union {
unsigned long itype; // Appropriatelength data type
float ftype;
} data;
// Assign to the floatingpoint form of the number
data.ftype = 3.14159;
// Show bits of the integer form of the number
cout << hex << data.itype;
The union construct must be carefully programmed, and may not be portable. A critically important guideline is to make each member of the union the same byte size. Otherwise, some memory is available to one member and not to another. Defensive programming practices should be used to verify the size of each member at runtime. For instance, the previous code fragment should be accompanied by some check on the boolean condition (sizeof(long) == sizeof(float)).
The use of unsigned on the integer data type in the union is important during logical shifts. On right shifts of signed integer types, the left side of the data is padded with the value in the sign bit position. On right shifts of unsigned integer types, the left side of the data is padded with zeroes.
C++ does include a “bit field” construct that allows specification of the number of bits to be used to store data. Theoretically, bit fields could be constructed to duplicate the length of bit fields of the floatingpoint data type. However, there is not enough visibility into the internals of the bit field construct to ensure correct operation. For instance, the declaration of bit fields in a certain order in a data structure does not guarantee that the bits are actually stored in that order.
The significand may be isolated by the bitwise AND operation, such as
unsigned long significand; // Appropriatelength data type
// Bitwise AND (single ampersand)
significand = data.itype & 0x7fffff; // 23 bit mask
Note the use of the single ampersand for a bitwise AND. The hidden bit may be made explicit by
significand = 0x800000; // Mask is a 1 in the 24th bit
The exponent may be isolated by
int stored_exponent;
// Bitwise rotate right 23 bits (dumping the significand off
// the right end) and retain only the exponent (i.e., remove
// the sign bit).
stored_exponent = (data.itype >> 23) & 0xff;
which leaves the exponent in its biased state. The biased, stored form of the exponent is a positive number, but the true exponent may be negative. It is important that the data type of the true exponent not be an unsigned number.
long true_exponent; // Must be signed; may be negative
true_exponent = stored_exponent  127;
Once the fields of an IEEE floatingpoint data type are available, a short program may be written showing each bit’s contributions to the floatingpoint value. An example C++ program is available at the author’s home page.
The value represented by the IEEE 754 64bit floatingpoint data type is given by the formula
(1)^{SignBit} (1 + Significand) 2^{(Exponent  1023)}
with fields shown in Figure 2.

Bit 63

Bits 62…52

Bits 51...0

SignBit

Exponent

Significand

Figure 2. Bit fields of IEEE 754 64bit float type
An implementation of the same technique for the 64bit floatingpoint type (double) may need to use arrays of some integer type to create a 64bit storage space in the union construct, since there is not necessarily an integer type whose size is at least 64 bits. The array may be of any integer type (char, int, long), provided the two members of the union are the same byte size. It is possible that the internal order of the bytes of the two union members will not be the same. On each computer platform, the location of the floatingpoint bits within the integer array should be verified by comparison to an expected result.
For instance, the following union may be used for double types:
// C/C++ union construct
union {
unsigned long itype[2]; // 64 bits total, in two longs
double dtype;
} data;
On a PC the sign bit of the double is contained in the Most Significant Bit of the second integer (itype[1] & 0x80000000), not the first integer (itype[0]) as might be supposed.
PRACTICE CASES FOR ANALYSIS OF FLOATINGPOINT BEHAVIOR
The following case studies illustrate some apparent anomalies in the use of floatingpoint data types. They will be resolved using bitwise analysis.
Practice case 1. The floatingpoint data type is perceived as less accurate than expected.
The problem is taken from [6, p. 86], a book of lab exercises. According to the provided test results, a program using float data types is in error by a single unit, but a program using double data types is correct. The problem asks for the odds of winning a lottery in which r balls are selected from a bin containing n distinct balls. The formula is given as
Number of combinations =
The lab book includes a test case for r = 6, n = 54, resulting in 25,827,165 combinations. A straightforward solution, such as
float combinations = 1;
int n = 54, r = 6;
for (int index = 0; index < r; index++)
{
combinations *= (n  index) / float(index + 1);
}
cout << "Combinations = " << combinations << endl << endl;
outputs the number of combinations as 25827166, one unit greater than the given answer for the lab book’s test case. Students might attribute this difference to a typographical error in the lab book.
The last iteration of the test case computation, , is unrepresentable in a standard floatingpoint format. By examining the bits of the IEEE 754 32bit float representation of 49/6, we see the represented value is approximately 8.166666984558105469 (too high, but the nearest representable value). When multiplied by the accumulating product of the previous terms, 3162510, the result rounds to an incorrect value.
Practice case 2. Equality of floatingpoint data types
Some uses of floatingpoint numbers provide more accuracy than the programmer ever wanted. An error much smaller than the programmer cares about can wreck her program. In this case, the number is more accurate than may be stored in the floatingpoint type, so the stored value of the floatingpoint number is not equal to the number itself.
Suppose a float variable is to be initialized with , using the standard math library in ,
#define M_PI 3.14159265358979323846
and immediately compared with M_PI.
float f = M_PI;
while (f == M_PI) . . .
The logical test for equality fails, even with no intervening operation on the variable between assignment and comparison. An analysis such as
cout.precision(20);
cout << " Value of f is " << f << endl;
cout << " Value of M_PI is " << M_PI << endl;
results in the output
Value of f is 3.141592741012573242
Value of M_PI is 3.141592653589793116
and it is evident that the math library’s value of M_PI is not representable in the 32bit floatingpoint data type. However, with a double data type,
double d = M_PI;
if (d == M_PI) . . .
the equality test succeeds, with output
Value of d is 3.141592653589793116
Value of M_PI is 3.141592653589793116
Here the two values compare as equal, even though the stored value for M_PI is not the same as the definition of M_PI in . The use of the standard strict equality test (==) on floatingpoint data types should be discouraged.
Practice case 3. The values of different floatingpoint data types are not comparable, even though they may be transparently mixed within a program.
Consider the following code, which performs the same operations on floatingpoint variables f and g, but with different intermediate steps:
float f, g;
float divisor; // see text for value
f = g = 1.0 / 10.0;
f = (f / divisor) * divisor;
g = g / divisor;
g = g * divisor;
if (f == g)
cout << "f == g";
else
cout << "f NOT= g";
cout << endl;
When divisor has a value of 2.0, the program outputs f == g, but when divisor has value 3.0, the program outputs f NOT= g. This is anomalous: not only do mathematically equivalent statements produce unequal results, but the program gives different results when the divisor changes.
The double data type is transparently used for intermediate operations on the variable f, and the result is assigned to the float type f only upon completion. In contrast, the subresult is assigned to float type g during the computation sequence, then further computation is done using that float value.
The value 1.0 / 10.0 is not exactly representable in a standard binary floatingpoint format. The initial assignment of the value to f and g means that all the bits of the significand are useful in specifying the value. 1.0 / 10.0 is represented with a zero in the Least Significant Bit (bit 0) of the IEEE 754 32bit significand, and a 1 in bit 1. The “loss” of the zero value in bit 0 during division by 2 is not observed. However, when the divisor is 3, the 1 in bit 1 is also lost and cannot be restored by multiplication in the reverse.
Exercise 1. Using the union construct, implement a routine that displays an IEEE 754 number in the form
(1)^{SignBit} (1 + Significand) 2^{(Exponent  127)}
or which displays, using a loop, the cumulative effect on a numeric value of increasingly lesssignificant bits.
Exercise 2. Research another floatingpoint format, such as MILSTD1750A. Convert a number in IEEE 754 32bit floatingpoint format to the other floatingpoint format. What might be advantages of another format? The MILSTD1750A 32bit floatingpoint format is (from Most Significant Bit to Least Significant Bit) one bit of sign, 23 bits of mantissa, and 8 bits of exponent. There are no hidden bits in the mantissa, which is normalized, and the exponent is in 2’s complement.
Exercise 3. Another technique of number storage is “slash arithmetic,” in which real numbers are stored as a numerator, denominator pair. Create storage for a numerator and denominator in a 32bit field. What are the ranges and precision of your implementation? How would arithmetic operations be done on a slash arithmetic data type?
Exercise 4. Add two integer data types using only bitwise logical operations and data shifting. The operands must be masked one bit at a time, maintaining a carry and shifting it to the nexthigher position. What should happen when the true result has a carry out of the highestavailable bit?
Exercise 5. An integer data type is an implementation of a “fixedpoint” data type, with the “radix point” defined to the right of the leastsignificant bit. Assume that the radix point is at another position, and use an integer data type to represent fractional values. What changes are necessary to arithmetic operations for this “new” type? How do the new arithmetic operations differ from the usual integer operations? [4]
Exercise 6. Describe the value of the indices during the bodies of the loops below.
for (int i = 0; i < 5; i += 0.25)
for (float f = 16777210.0; f < 16777220.0; f += 1.0)
Answer: In the i loop, the effect is an infinite loop with the value of i stuck at 0. The cause is that the sum of the loop index and the loop increment is unrepresentable in the data type of the loop index. The integer sum is rounded down to the next representable value, zero.
The f loop iterates, incrementing by 1, while the value of f is less than , since below that value an increment of 1.0 is representable in a float data type. However, similarly to the integer loop, the value 1.0 will become unrepresentable, and at that point the loop will become stuck with a static value of f.
Programming Notes:
Code fragments in this paper were programmed in C++ using Borland C++ 5.02 on a PCcompatible computer. The rounding mode used is “toward nearest.” Sample code is available at the author’s web page or by email.
REFERENCES 
Excerpted from the online version of Borland 5.02 C++ Programmer’s Reference.

IEEE Standard for Binary FloatingPoint Arithmetic (IEEE Standard for Binary FloatingPoint Arithmetic, ANSI/IEEE Std 7541985 (IEEE 754)) (Revised 1990). See http://standards.ieee.org/.

Goldberg, David. March 1991. “What Every Computer Scientist Should Know About FloatingPoint Arithmetic.” ACM Computing Surveys. Volume 23, Number 1. pp. 548. New York: ACM Press. Available at several web sites, including http://hoth.stsci.edu/public/compilers/commontools/numerical_comp_guide/goldberg1.doc.html#674.

Labrosse, Jean J. “FixedPoint Arithmetic for Embedded Systems.” C/C++ Users Journal, Feb. 1998. pp. 2128. Taken from “Embedded System Building Blocks.” ISBN 0879304405. R & D Books, July 1996.

Patterson, David A. and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, Second Edition. Morgan Kaufmann Publishers, Inc. 1997.

Roberge, James and George Smith. Introduction to Programming in C++. Jones and Bartlett, 1995.

Numerical Computation Guide, Sun Microsystems, Inc. Chapter 2 gives good coverage of the IEEE Standard 754 [2]. Available at several web sites, including http://hoth.stsci.edu/public/compilers/commontools/numerical_comp_guide.

Kahan, William and Joseph D. Darcy. “How Java’s FloatingPoint Hurts Everyone Everywhere.” Originally presented 1 March 1998 at the invitation of the ACM 1998 Workshop on Java for High–Performance Network Computing held at Stanford University. The document is available at http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf.
