Analyzing ieee 754 Floating-point Values in C/C++

Yüklə 63.2 Kb.
ölçüsü63.2 Kb.
Analyzing IEEE 754 Floating-point Values in C/C++

Kenneth R. Vollmar

Southwest Missouri State University

901 S. National Ave.

Springfield, MO 65804

(417) 836-5789


Results of floating-point operations can confuse beginning computer science students because the true behavior and value of floating-point numbers are not apparent when using the default options for precision and output formats. This paper demonstrates a simple technique in C/C++ to isolate the bit fields of floating-point data formats, aiding in the analysis of a floating-point number. Several demonstrations are made using this analysis technique to explain program behavior with floating-point data types.


Floating-point, IEEE 754, computer arithmetic, bits, programming


All computer scientists are likely to encounter the following behavior of floating-point data types, which may be especially confusing in introductory courses:

  • The accuracy of floating-point data types is not as expected.

  • Floating-point data types may be transparently mixed within a program.

  • The evident value is not always the true value.

  • Mathematical laws such as the distributive property apparently do not hold.

  • Mathematically equivalent operations give different results.

  • Assignment of intermediate data to a variable makes a difference in final results.

  • A logical test for equality of different-typed floating-point variables fails, though the same values were assigned.

Beginning computer science students are likely to use only the default formatting options for precision and output, but then the true behavior and value of floating-point numbers are not apparent. Many introductory explanations of floating-point data types are simply that floating-point data types contain real-valued numbers in a certain range. Such oversimplifying descriptions cannot give students a true grasp of the capabilities, limitations, and methods for use of the floating-point data type.

The apparent anomalies may be explained by analysis on the floating-point data types, using a technique to examine the individual bits in a floating-point number.


Almost every modern computer and compiler uses the IEEE 754 floating-point standard [2]. “This standard has greatly improved both the ease of porting floating-point programs and the quality of computer arithmetic.” [6, p. 277] It is most likely that university computing environments will conform to the IEEE standard. Importantly, there is some debate about the compliance of Java [8] with the entire standard, although the storage format is not in question.

The IEEE 754 standard completely describes the behavior and operations on floating-point data types, including advanced options such as the use of rounding modes and other computational techniques and controls. This paper focuses on the storage of numeric values within the floating-point data type.

Typical documentation in a textbook or with a compiler describes the floating-point data type as capable of containing real numbers, and gives the range of a floating-point data type. This statement is correct, but not complete. It is easy to miss important characteristics of the floating-point data type: Only some real numbers can be represented, and the precision of the representation varies with the size of the number.

The number of bit patterns in an N-bit integer data type is 2N, the same as the number of bit patterns in an N-bit floating-point data type. A student may find this paradoxical when reading typical documentation, such as shown in Table 1. [1]





32 bits

-2,147,483,648 to 2,147,483,647


32 bits

3.4 x 10-38 to 3.4 x 10+38


Table 1. Typical description of ranges of data types [1]

How can the ranges of the two data types be so different when the number of bit patterns are the same? It is true that both data types can contain 4,294,967,296 different bit patterns. Those patterns in the case of the 32-bit integer data type are mapped to integers in a two’s complement representation. Those patterns in the case of the IEEE 754 32-bit floating-point data type are non-evenly distributed in the range . Students should be aware that no 32-bit data type can represent as many numbers as only the integers in the range , and that the number of real numbers in any range is infinite and therefore cannot possibly be represented in a finite number of bits.

The value represented by the IEEE 754 32-bit floating-point data type is given by the formula

(-1)SignBit (1 + Significand) 2(Exponent - 127)

For instance, in the range [2,4), the representable numbers are shown in Table 2. The number of bits in the significand is a constant, so the number of representable values between powers of two remains constant. However, the distance between neighboring representable values changes for each value of exponent, as shown in Table 3. Some bit patterns are used to store special values, as shown in Table 4 (from [7]).

Bit representation


(-1)0 (1.000…000) 21

(1 + 0/223) 2 = 2

(-1)0 (1.000…001) 21

(1 + 1/223) 2 2.000000238418579102

(-1)0 (1.000…010) 21

(1 + 2/223) 2 2.000000476837158203

. . .

. . .

(-1)0 (1.111…110) 21

(1 + (223- 2)/223) 2 3.999999046325683594

(-1)0 (1.111…111) 21

(1 + (223- 1)/223) 2 3.999999284744262695

Table 2. Representable numbers in the numeric range [2,4). (IEEE 754 32-bit floating-point)


Stored exponent



Size of gap between representable numbers

(special case)


[2-126, 2-125)




[1, 2)




[2i, 2i+1)

127 + i



[2126, 2127)




(special case)


Table 3. Positive representable numbers. (IEEE 754 32-bit floating-point)

IEEE 754 32-bit float pattern


exponent = 0; significand 0 (at least one bit in significand is nonzero)

(-1)sign 2-126 0.significand (subnormal numbers)

exponent = 0; significand = 0 (all bits in significand are zero)

(-1)sign 0.0 (signed zero)

sign = 0; exponent = 255; significand = 0 (all bits in significand are zero)

+INF (positive infinity)

sign = 1; exponent = 255; significand = 0 (all bits in significand are zero)

-INF (negative infinity)

sign = ?; exponent = 255; significand 0 (at least one bit in significand is nonzero)

NaN (Not-a-Number)

Table 4. Special cases in IEEE 754 32-bit floating-point numbers.


Operations on the individual bits of a data type are only available in C/C++ for integer data types, not floating-point types. The following process converts the bits (not the value) of the floating-point type into the bits of a same-sized integer type. The process is generally more convenient and portable for 32-bit floating-point data types than for 64-bit types, since there is usually an available integer type whose byte length is at least 32 bits. Information on using this technique for the 64-bit type appears at the end of this section.

In order to determine the value of a floating-point number, the data fields (sign, exponent, and significand) must be isolated as individual numbers. The fields of the IEEE 754 32-bit data type are shown in Figure 1.

Bit 31

Bits 30…23

Bits 22...0




Figure 1. Bit fields of IEEE 754 32-bit float type

Two’s complement representation is not used by any part of the data type. The SignBit indicates the sign of the entire number. The 8-bit stored exponent is “biased” by a value of 127 from the “true” exponent. The significand indicates a fraction to be added to the number 1, which is not stored and is a “hidden bit.” For instance, the value will contain 0 in the sign field, in the exponent field, and in the significand.

Separating one bit field from another within the data type requires bitwise logical operations, which in C/C++ are not allowed on floating-point data types. The floating-point data type bits must be accessed as if they belonged to an integer data type. The union construct of C/C++ is a means of accessing the same memory location using different memory types.

// C/C++ union construct

union {

unsigned long itype; // Appropriate-length data type

float ftype;

} data;

// Assign to the floating-point form of the number

data.ftype = 3.14159;

// Show bits of the integer form of the number

cout << hex << data.itype;

The union construct must be carefully programmed, and may not be portable. A critically important guideline is to make each member of the union the same byte size. Otherwise, some memory is available to one member and not to another. Defensive programming practices should be used to verify the size of each member at runtime. For instance, the previous code fragment should be accompanied by some check on the boolean condition (sizeof(long) == sizeof(float)).

The use of unsigned on the integer data type in the union is important during logical shifts. On right shifts of signed integer types, the left side of the data is padded with the value in the sign bit position. On right shifts of unsigned integer types, the left side of the data is padded with zeroes.

C++ does include a “bit field” construct that allows specification of the number of bits to be used to store data. Theoretically, bit fields could be constructed to duplicate the length of bit fields of the floating-point data type. However, there is not enough visibility into the internals of the bit field construct to ensure correct operation. For instance, the declaration of bit fields in a certain order in a data structure does not guarantee that the bits are actually stored in that order.

The significand may be isolated by the bitwise AND operation, such as

unsigned long significand; // Appropriate-length data type

// Bitwise AND (single ampersand)

significand = data.itype & 0x7fffff; // 23 bit mask

Note the use of the single ampersand for a bitwise AND. The hidden bit may be made explicit by

significand |= 0x800000; // Mask is a 1 in the 24th bit

The exponent may be isolated by

int stored_exponent;

// Bitwise rotate right 23 bits (dumping the significand off

// the right end) and retain only the exponent (i.e., remove

// the sign bit).

stored_exponent = (data.itype >> 23) & 0xff;

which leaves the exponent in its biased state. The biased, stored form of the exponent is a positive number, but the true exponent may be negative. It is important that the data type of the true exponent not be an unsigned number.

long true_exponent; // Must be signed; may be negative

true_exponent = stored_exponent - 127;

Once the fields of an IEEE floating-point data type are available, a short program may be written showing each bit’s contributions to the floating-point value. An example C++ program is available at the author’s home page.

The value represented by the IEEE 754 64-bit floating-point data type is given by the formula

(-1)SignBit (1 + Significand) 2(Exponent - 1023)

with fields shown in Figure 2.

Bit 63

Bits 62…52

Bits 51...0




Figure 2. Bit fields of IEEE 754 64-bit float type

An implementation of the same technique for the 64-bit floating-point type (double) may need to use arrays of some integer type to create a 64-bit storage space in the union construct, since there is not necessarily an integer type whose size is at least 64 bits. The array may be of any integer type (char, int, long), provided the two members of the union are the same byte size. It is possible that the internal order of the bytes of the two union members will not be the same. On each computer platform, the location of the floating-point bits within the integer array should be verified by comparison to an expected result.

For instance, the following union may be used for double types:

// C/C++ union construct

union {

unsigned long itype[2]; // 64 bits total, in two longs

double dtype;

} data;

On a PC the sign bit of the double is contained in the Most Significant Bit of the second integer (itype[1] & 0x80000000), not the first integer (itype[0]) as might be supposed.


The following case studies illustrate some apparent anomalies in the use of floating-point data types. They will be resolved using bitwise analysis.

Practice case 1. The floating-point data type is perceived as less accurate than expected.

The problem is taken from [6, p. 86], a book of lab exercises. According to the provided test results, a program using float data types is in error by a single unit, but a program using double data types is correct. The problem asks for the odds of winning a lottery in which r balls are selected from a bin containing n distinct balls. The formula is given as

Number of combinations =

The lab book includes a test case for r = 6, n = 54, resulting in 25,827,165 combinations. A straightforward solution, such as

float combinations = 1;

int n = 54, r = 6;

for (int index = 0; index < r; index++)


combinations *= (n - index) / float(index + 1);


cout << "Combinations = " << combinations << endl << endl;

outputs the number of combinations as 25827166, one unit greater than the given answer for the lab book’s test case. Students might attribute this difference to a typographical error in the lab book.

The last iteration of the test case computation, , is unrepresentable in a standard floating-point format. By examining the bits of the IEEE 754 32-bit float representation of 49/6, we see the represented value is approximately 8.166666984558105469 (too high, but the nearest representable value). When multiplied by the accumulating product of the previous terms, 3162510, the result rounds to an incorrect value.

Practice case 2. Equality of floating-point data types

Some uses of floating-point numbers provide more accuracy than the programmer ever wanted. An error much smaller than the programmer cares about can wreck her program. In this case, the number is more accurate than may be stored in the floating-point type, so the stored value of the floating-point number is not equal to the number itself.

Suppose a float variable is to be initialized with , using the standard math library in ,

#define M_PI 3.14159265358979323846

and immediately compared with M_PI.

float f = M_PI;

while (f == M_PI) . . .

The logical test for equality fails, even with no intervening operation on the variable between assignment and comparison. An analysis such as


cout << " Value of f is " << f << endl;

cout << " Value of M_PI is " << M_PI << endl;

results in the output

Value of f is 3.141592741012573242

Value of M_PI is 3.141592653589793116

and it is evident that the math library’s value of M_PI is not representable in the 32-bit floating-point data type. However, with a double data type,

double d = M_PI;

if (d == M_PI) . . .

the equality test succeeds, with output

Value of d is 3.141592653589793116

Value of M_PI is 3.141592653589793116

Here the two values compare as equal, even though the stored value for M_PI is not the same as the definition of M_PI in . The use of the standard strict equality test (==) on floating-point data types should be discouraged.

Practice case 3. The values of different floating-point data types are not comparable, even though they may be transparently mixed within a program.

Consider the following code, which performs the same operations on floating-point variables f and g, but with different intermediate steps:

float f, g;

float divisor; // see text for value

f = g = 1.0 / 10.0;

f = (f / divisor) * divisor;

g = g / divisor;

g = g * divisor;

if (f == g)

cout << "f == g";


cout << "f NOT= g";

cout << endl;

When divisor has a value of 2.0, the program outputs f == g, but when divisor has value 3.0, the program outputs f NOT= g. This is anomalous: not only do mathematically equivalent statements produce unequal results, but the program gives different results when the divisor changes.

The double data type is transparently used for intermediate operations on the variable f, and the result is assigned to the float type f only upon completion. In contrast, the subresult is assigned to float type g during the computation sequence, then further computation is done using that float value.

The value 1.0 / 10.0 is not exactly representable in a standard binary floating-point format. The initial assignment of the value to f and g means that all the bits of the significand are useful in specifying the value. 1.0 / 10.0 is represented with a zero in the Least Significant Bit (bit 0) of the IEEE 754 32-bit significand, and a 1 in bit 1. The “loss” of the zero value in bit 0 during division by 2 is not observed. However, when the divisor is 3, the 1 in bit 1 is also lost and cannot be restored by multiplication in the reverse.

suggested Exercises

Exercise 1. Using the union construct, implement a routine that displays an IEEE 754 number in the form

(-1)SignBit (1 + Significand) 2(Exponent - 127)

or which displays, using a loop, the cumulative effect on a numeric value of increasingly less-significant bits.

Exercise 2. Research another floating-point format, such as MIL-STD-1750A. Convert a number in IEEE 754 32-bit floating-point format to the other floating-point format. What might be advantages of another format? The MIL-STD-1750A 32-bit floating-point format is (from Most Significant Bit to Least Significant Bit) one bit of sign, 23 bits of mantissa, and 8 bits of exponent. There are no hidden bits in the mantissa, which is normalized, and the exponent is in 2’s complement.

Exercise 3. Another technique of number storage is “slash arithmetic,” in which real numbers are stored as a numerator, denominator pair. Create storage for a numerator and denominator in a 32-bit field. What are the ranges and precision of your implementation? How would arithmetic operations be done on a slash arithmetic data type?

Exercise 4. Add two integer data types using only bitwise logical operations and data shifting. The operands must be masked one bit at a time, maintaining a carry and shifting it to the next-higher position. What should happen when the true result has a carry out of the highest-available bit?

Exercise 5. An integer data type is an implementation of a “fixed-point” data type, with the “radix point” defined to the right of the least-significant bit. Assume that the radix point is at another position, and use an integer data type to represent fractional values. What changes are necessary to arithmetic operations for this “new” type? How do the new arithmetic operations differ from the usual integer operations? [4]

Exercise 6. Describe the value of the indices during the bodies of the loops below.

for (int i = 0; i < 5; i += 0.25)

for (float f = 16777210.0; f < 16777220.0; f += 1.0)

Answer: In the i loop, the effect is an infinite loop with the value of i stuck at 0. The cause is that the sum of the loop index and the loop increment is unrepresentable in the data type of the loop index. The integer sum is rounded down to the next representable value, zero.

The f loop iterates, incrementing by 1, while the value of f is less than , since below that value an increment of 1.0 is representable in a float data type. However, similarly to the integer loop, the value 1.0 will become unrepresentable, and at that point the loop will become stuck with a static value of f.

Programming Notes:

Code fragments in this paper were programmed in C++ using Borland C++ 5.02 on a PC-compatible computer. The rounding mode used is “toward nearest.” Sample code is available at the author’s web page or by e-mail.


  1. Excerpted from the on-line version of Borland 5.02 C++ Programmer’s Reference.

  2. IEEE Standard for Binary Floating-Point Arithmetic (IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985 (IEEE 754)) (Revised 1990). See

  3. Goldberg, David. March 1991. “What Every Computer Scientist Should Know About Floating-Point Arithmetic.” ACM Computing Surveys. Volume 23, Number 1. pp. 5-48. New York: ACM Press. Available at several web sites, including

  4. Labrosse, Jean J. “Fixed-Point Arithmetic for Embedded Systems.” C/C++ Users Journal, Feb. 1998. pp. 21-28. Taken from “Embedded System Building Blocks.” ISBN 0879304405. R & D Books, July 1996.

  5. Patterson, David A. and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, Second Edition. Morgan Kaufmann Publishers, Inc. 1997.

  6. Roberge, James and George Smith. Introduction to Programming in C++. Jones and Bartlett, 1995.

  7. Numerical Computation Guide, Sun Microsystems, Inc. Chapter 2 gives good coverage of the IEEE Standard 754 [2]. Available at several web sites, including

  8. Kahan, William and Joseph D. Darcy. “How Java’s Floating-Point Hurts Everyone Everywhere.” Originally presented 1 March 1998 at the invitation of the ACM 1998 Workshop on Java for High–Performance Network Computing held at Stanford University. The document is available at

Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur © 2016
rəhbərliyinə müraciət

    Ana səhifə