Notes@HKU by Jax

Number Representation

Value representation

Computers have finite number of bits to represent numbers (e.g. 2322^32 for 32-bit).

  • Overflow: When the result of an operation is too large to be represented in the number of bits available.
  • Underflow: When the result of an operation is too small to be represented in the number of bits available.

Radix number systems

A position number system. Each digit aia_i is multiplied by the base rr raised to the power of its position ii.

(anan1...a1a0)r(a_na_{n-1}...a_1a_0)_r

Radix-r to radix-10

Base 10 value=i=0n1ai×ri\text{Base 10 value} = \sum_{i=0}^{n-1} a_i \times r^i

Radix-10 to radix-2

Integer part: Repeatedly divide by 2 and record the remainder.

Example
11÷2=5 remainder 15÷2=2 remainder 12÷2=1 remainder 01÷2=0 remainder 111=10112\begin{align*} 11 \div 2 &= 5 \text{ remainder } 1 \\ 5 \div 2 &= 2 \text{ remainder } 1 \\ 2 \div 2 &= 1 \text{ remainder } 0 \\ 1 \div 2 &= 0 \text{ remainder } 1 \\ 11 &= 1011_2 \end{align*}

Fractional part: Repeatedly multiply by 2 and record the integer part.

Example
0.25×2=0.5 integer 00.5×2=1.0 integer 10.25=0.012\begin{align*} 0.25 \times 2 &= 0.5 \text{ integer } 0 \\ 0.5 \times 2 &= 1.0 \text{ integer } 1 \\ 0.25 &= 0.01_2 \end{align*}

Conversion for r=2nr=2^{n} can be done replacing division by 22 with division by 2n2^n.

Sign representation

A function f(b)f(b) represents the value of a binary number bb based on the sign representation.

Excess K

Uses a Sign bit: 0 for positive, 1 for negative. The excess KK is added to the value to represent the sign.

  • Range: K-K to 2n1K2^n - 1 - K where nn is the number of bits.
  • Top half is positive, bottom half is negative.
f(b)=bKf(b)=b-K
Example
  • Excess 127: 1-1 is represented as 1262126_2.
  • Excess 128: 1-1 is represented as 1272127_2.

One's complement

Saves the leftmost bit for the sign. The binary representation is inverted for negative numbers.

  • Range: 2n1+1-2^{n-1}+1 to 2n112^{n-1}-1.
f(b)={bif b0(2n1)binvert{b}if b<0f(b)=\begin{cases} b & \text{if } b \geq 0 \\ (2^n-1)-b \equiv \text{invert}\{b\} & \text{if } b < 0 \end{cases}

Two's complement

Save leftmost bit for the sign. The binary representation is inverted and 1 is added for negative numbers.

  • Range: 2n1-2^{n-1} to 2n112^{n-1}-1.
f(b)={bif b02nbinvert{b}+1if b<0f(b)=\begin{cases} b & \text{if } b \geq 0 \\ 2^n-b \equiv \text{invert}\{b\}+1 & \text{if } b < 0 \end{cases}

Only two's complement can sum up numbers of different signs without special cases, to get the sum.

Floating-point representation

Represent a floating point binary number as long bits.

IEEE Floating Point Format

First convert the binary number to scientific notation and have a single digit to the left of the radix point (normalized):

Value=sign(1.f)2eb\text{Value} = \mathrm{sign}\cdot(1.f)\cdot2^{e-b}

Where ff is the Significand, ee is the exponent, and bb is the bias.

The binary representation is:

[sign]1 bit[e]x bits[——f——]lx bits\underbrace{[\text{sign}]}_{1 \text{ bit}} \underbrace{[\text{---}e\text{---}]}_{x \text{ bits}} \underbrace{[\text{------}f\text{------}]}_{l-x \text{ bits}}
  • Sign bit: +ve    0+\text{ve} \implies 0, ve    1-\text{ve} \implies 1
  • Significand ff: Add trailing zeros to fill bits of available length

The following table shows the parameters for multiple IEEE 754 formats:

ParameterBinary32Binary64Binary128
Storage (ll)32 bits64 bits128 bits
Exponent (xx)8 bits11 bits15 bits
Bias (bb)127102316383
Converting 13.375 to IEEE Binary32

Binary32: 8,23,1278,23,127

13.3751101.001213.375 \equiv 1101.001_2
  • Sign bit: 00 (positive)
  • Normalize: +1.1010012×23+1.101001_2 \times 2^3
    • f=1010012f=101001_2 (ignore leading 1)
    • Significand: 1010010000000000000000010100100000000000000000 (add trailing zeros)
  • Exponent: 33
    • Biased Exponent: 3+127=1301000001023+127=130 \equiv 10000010_2

Representation: 0100 0001 0101 0110 0000 0000 0000 00000100\ 0001\ 0101\ 0110\ 0000\ 0000\ 0000\ 0000

Converting Binary32 to decimal

Consider C0F21000016C0F210000_{16}:

1Sign bit 100 0000 1Exponent  111 0010 0001 0000 0000 0000Significand\underbrace{1}_{\text{Sign bit}}\ \underbrace{100\ 0000\ 1}_{\text{Exponent}}\ \ \underbrace{111\ 0010\ 0001\ 0000\ 0000\ 0000}_{\text{Significand}}
  • Sign bit: 11 (negative)
  • Exponent: 100 0000 12=129129127=2100\ 0000\ 1_2 = 129 \rightarrow 129-127=2
  • Significand:
    • Remove training 0s and add leading "1."
    • 1.111 0010 00012=1.89111101.111\ 0010\ 0001_2 = 1.89111\dots_{10}
Value=1.89111×22=7.56444\text{Value} = -1.89111 \times 2^2 = -7.56444

Note that the range of ee is 12541\rightarrow254 for Binary32. 00 and 255255 are reserved for special cases. !!! info "Representable numbers"

The range of floating point numbers is limited by the number of bits used for the significand and exponent.

For a given format with E.bitsE.bits exponent bits, F.bitsF.bits significand bits and s={1if special case for 0 and 2e10otherwises=\begin{cases}1 & \text{if special case for 0 and } 2^e-1 \\ 0 & \text{otherwise}\end{cases}:

  • Range:
    • Bias value: b=2E.bits11b=2^{E.bits-1}-1
    • All values: ±(1.f)2eb\pm(1.f)\cdot2^{e-b}
    • Use smallest values: f=02,e=s    ±1.02sb=±2sb\begin{align*} f=0_2, e=s \implies& \pm 1.0\cdot 2^{s-b}\\ =& \pm2^{s-b} \end{align*}
    • Use largest values: f=1112(length: F.bits)1.111...= 1+0.1+0.01++2F.bits= 22F.bits (geometry series)e=2E.bits1s    ±(22F.bits)22E.bits1s(2E.bits11)=±(22F.bits)22E.bits12E.bitss\begin{align*} f=111\dots_2 (\text{length: }F.bits) \rightarrow 1.111... =&\ 1 + 0.1 + 0.01 + \dots + 2^{-F.bits}\\ =&\ 2-2^{-F.bits}\ (\text{geometry series})\\ e=2^{E.bits}-1-s \implies& \pm(2-2^{-F.bits})\cdot2^{2^{E.bits}-1-s-(2^{E.bits-1}-1)}\\ = & \pm(2-2^{-F.bits})\cdot2^{2^{E.bits-1}-2^{E.bits}-s} \end{align*}
    • Value±[2sb,(22F.bits)22E.bits12E.bitss]\therefore \text{Value} \in \pm[2^{s-b},(2-2^{-F.bits})\cdot2^{2^{E.bits-1}-2^{E.bits}-s}]
  • Spacing:
    • Represented numbers are unevenly spaced as spacing depends on exponent, which scales logarithmically.
    • Spacing grows as we go further from 0 as scaling factor of 2e(2E.bits11)2^{e-(2^{E.bits-1}-1)}.

Operations

2's complement

  • Addition: Add numbers as unsigned, discard overflow.
  • Subtraction: Add the 2's complement of the number to be subtracted.
  • Multiplication: Shift-and-add unsigned numbers, and adjust sign.

Shift and add

For each bit in the multiplier, if it is 1, add the multiplicand shifted by the position of the bit.

To compute 11×1311\times -13:

    1011 - Multiplicand (11)
   x1101 - Multiplier (13)
-------- (Shift by digit from right)
    1011
   0000.
  1011..
 1011...
-------- (Sum vertical)
10001111 - (-143)

Result is negative: 100011112011100002011100012=14310001111_2 \rightarrow 01110000_2 \rightarrow 01110001_2 = -143

Floating point

Addition and subtraction

Basic steps:

  1. Align exponents
  2. Add / subtract number (exclude exponent)
  3. Normalize result
  4. Round to fit significand
Addition example

To compute 1.12×20+1.001×211.5+2.251.1_2 \times 2^0 + 1.001 \times 2^1 \equiv 1.5 + 2.25:

  • Align exponents: 1.12×20+10.012×201.1_2 \times 2^0 + 10.01_2 \times 2^0

  • Add numbers: 10.012+1.12=11.11210.01_2 + 1.1_2 = 11.11_2

  • Normalize: 1.1112×211.111_2 \times 2^1

  • Result: 1.1112×21=3.751.111_2 \times 2^1 = 3.75

Multiplication and division

Basic steps:

  1. Add / subtract exponents (subtract / add bias bb if working with IEEE)
  2. Multiply / divide numbers
  3. Normalize result
  4. Round to fit significand
  5. Determine sign
Multiplication example

To compute 1.12×20×1.0012×211.5×2.251.1_2 \times 2^0 \times 1.001_2 \times 2^1 \equiv 1.5 \times 2.25:

  • Add exponents: 0+1=10+1=1
  • Multiply numbers: 1.12×1.0012=1.101121.1_2 \times 1.001_2 = 1.1011_2
  • To normalize, use exponent: 11.0112×2111.011_2 \times 2^1
  • Normalize: 1.10112×211.1011_2 \times 2^1 (unchanged)
  • Result: 1.10112×21=3.3751.1011_2 \times 2^1 = 3.375

Note that we skipped subtracting bias as we're working with real binary numbers instead of a IEEE representation.

Multiplication example (IEEE)

To compute 3fc0000016×40100000161.5×2.253fc00000_{16} \times 40100000_{16} \equiv 1.5 \times 2.25:

  • Extract exponents: 3fc0000016=0 01111111 1000000000000000000000023fc00000_{16} = 0\ 01111111\ 10000000000000000000000_2, 4010000016=0 10000000 00100000000000000000000240100000_{16} = 0\ 10000000\ 00100000000000000000000_2
  • Add exponents and subtract bias: 011111112+100000002011111112127+128127=10000000201111111_2 + 10000000_2 - 01111111_2 \equiv 127 + 128 - 127 = 10000000_2
  • Extract numbers by removing trailing zeros and multiply: 1.12×1.0012=1.101121.1_2 \times 1.001_2 = 1.1011_2
  • Normalize: 1.10112×21000000020111111121.1011_2 \times 2^{10000000_2 - 01111111_2}
  • Result: 0 10000000 101100000000000000000002=3.3750\ 10000000\ 10110000000000000000000_2 = 3.375

On this page