
Single precision vs double precision

ArkTheYO

What is the difference between single precision and double precision numbers?


28 minutes ago, M.Yurizaki said:

For other technical bits, single precision is 4 bytes, double precision is 8 bytes. Double precision performance also tends to be a lot slower than single precision, so unless you have a need for it, don't use double precision when working in floating point.

It's a bit more complex, because it depends on the hardware. For x86:

  • The x87 FPU uses 80-bit registers internally, so a conversion is needed for both float and double, and both are equally fast. The exception is very large datasets, where the fact that doubles move twice as many raw bytes can come into play (specifically, running out of cache).
  • When using SSE vector code, 2 double calculations can be done in one pass versus 4 float calculations, making float faster if correctly implemented. But not everything is or can be vectorized; actual mileage varies depending on the problem (a rough timing sketch follows after the lists below).

In general:

  • If the hardware supports only double then float requires conversion and will be slower.
  • If the hardware supports only float then double will require multiple passes, making it slower.
  • If neither is supported, both have to be software emulated and double will be slower. But on such a low-end platform you'd probably want to avoid floating point altogether.
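
For a rough feel of the performance side, here's a minimal C sketch (the array size, initialization, and timing method are arbitrary choices of mine, not anything from the posts above) that times a naive sum over a large array of floats and then over the same data as doubles:

```c
/* Sketch: naive float vs. double summation timing.
 * Results depend heavily on compiler flags (e.g. -O2 -march=native for
 * SSE/AVX auto-vectorization, -ffast-math for reordering the sums) and on
 * whether the working set fits in cache, as discussed above. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 1u << 24;              /* 16M elements (arbitrary choice) */
    float  *f = malloc(n * sizeof *f);      /* 64 MiB of floats */
    double *d = malloc(n * sizeof *d);      /* 128 MiB of doubles */
    if (!f || !d) return 1;

    for (size_t i = 0; i < n; ++i) {
        f[i] = (float)i * 0.5f;
        d[i] = (double)i * 0.5;
    }

    clock_t t0 = clock();
    float fsum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        fsum += f[i];
    clock_t t1 = clock();

    double dsum = 0.0;
    for (size_t i = 0; i < n; ++i)
        dsum += d[i];
    clock_t t2 = clock();

    printf("float : sum = %g, %.3f s\n", fsum,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("double: sum = %g, %.3f s\n", dsum,
           (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(f);
    free(d);
    return 0;
}
```

Compile it once without optimization and once with something like gcc -O2 -march=native and compare; when the arrays don't fit in cache, the float version often wins largely because it touches half as many bytes.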

On 9/3/2017 at 1:25 AM, ArkTheYO said:

What is the difference between single precision and double precision numbers?

TL;DR If you want the what without the why:

Single precision floats allow you to represent many (but not all) real numbers between +/- 3.4 * 10^(38) = +/- (1 + (1 - 2^(-23))) * 2^(254 - 127). (The largest exponent field, 255, is reserved for infinities and NaNs, so 254 is the biggest usable one.)

Double precision floats allow you to represent many (but not all) real numbers between +/- 1.8 * 10^(308) = +/- (1 + (1 - 2^(-52))) * 2^(2046 - 1023). (Again, the largest exponent field, 2047, is reserved for infinities and NaNs.)
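
If you just want to sanity-check those limits, here's a tiny C sketch (my own addition, using the standard <float.h> constants):

```c
/* Print the largest finite float and double, plus their storage sizes. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_MAX = %e (float,  %zu bytes)\n", (double)FLT_MAX, sizeof(float));
    printf("DBL_MAX = %e (double, %zu bytes)\n", DBL_MAX, sizeof(double));
    return 0;
}
```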

 

If you want the why (and I'll admit this is surely too complicated of an answer for the question given):

Single precision floats are represented with 32 bits in the form

x | x x x x x x x x | x x x x x x x x x x x x x x x x x x x x x x x

s |     exponent   |                    mantissa/fractional               |

Where s = sign bit (positive=0 or negative=1)

Exponent = 8 bits that can represent any number between 0 and 255 (0 to 2^(8) - 1).

Mantissa/Fractional = 23 bits that are used to let you represent numbers that aren't exact powers of 2. (the first bit represents 1/2, the second one 1/4, ..., the last 1/2^(23))

 

The way you figure out what number is represented is by doing +/- (1 + 0.f1 f2 f3...f23) * 2^(exponent - 127).

So, for example, 127 = 01111111, so 1.0 is represented as 0 | 0 1 1 1 1 1 1 1 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0. You interpret this as +1 * (1 + 0.0) * 2^(127 - 127) = 1.0

Another example: 3.0 is represented as +1 * (1 + 0.5) * 2^(1) = 0 | 1 0 0 0 0 0 0 0 | 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

-7.5 is -1 * (1 + 0.875) * 2^(2) = 1 | 1 0 0 0 0 0 0 1 | 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
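
If you want to verify those bit patterns yourself, here's a small C sketch (my own addition, not part of the original explanation) that pulls the sign, exponent, and mantissa fields out of a float:

```c
/* Dump the sign / exponent / mantissa fields of a 32-bit float so the
 * hand-worked examples above can be checked. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static void dump(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret the 32 bits */
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8-bit biased exponent */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 fraction bits */
    printf("%6.2f -> sign=%u  exponent=%3u (2^%d)  mantissa=0x%06X\n",
           x, (unsigned)sign, (unsigned)exponent, (int)exponent - 127,
           (unsigned)mantissa);
}

int main(void)
{
    dump(1.0f);   /* exponent 127, mantissa 0 */
    dump(3.0f);   /* exponent 128, mantissa 0x400000 (0.5) */
    dump(-7.5f);  /* exponent 129, mantissa 0x700000 (0.875) */
    return 0;
}
```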

 

Why subtract 127? 127 is called the bias, and it's so that you can represent numbers less than 1 as well. Otherwise, 1.0 would be the smallest thing you could represent. 

This is also leaving out some cases where the (1 + 0.f1f2...) turns into (0 + 0.f1f2) (called "denormalized numbers"), but that's okay.

 

Double precision floats use exactly the same system but they have 64 bits instead of 32. They add 3 of those to the exponent and 29 to the fractional/mantissa bits.

x | x x x x x x x x x x x | x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x

s |         exponent        |                                                     mantissa/fractional                                                              |

So, you go from your largest exponent being 255 to 2047 = 2^(11) - 1 which greatly expands your range of representable values (both in how large you can get and how close to 0 you can get), and you go from 23 to 52 fractional bits which basically lets you have a lot more decimal point precision if it's needed.
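
A quick way to see those extra mantissa bits in action (again, a small sketch of my own, using the standard FLT_EPSILON/DBL_EPSILON constants, which are the gaps between 1.0 and the next representable value):

```c
/* Compare how closely float and double can represent 1/3, and their
 * machine epsilons. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("float : 1/3 = %.20f   epsilon = %e\n",
           (double)(1.0f / 3.0f), (double)FLT_EPSILON);
    printf("double: 1/3 = %.20f   epsilon = %e\n",
           1.0 / 3.0, DBL_EPSILON);
    return 0;
}
```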

 

Now, this is just how the standard is defined in IEEE 754-1985 (the 2008 revision pretty much just added new formats and left single and double floats alone), and, like Unimportant said, once you get into the actual implementation of this standard on various platforms...

On 9/3/2017 at 7:54 AM, Unimportant said:

It's a bit more complex, because it depends on hardware

 


Some benchmarks of arithmetic speed at different floating-point precisions:

http://nicolas.limare.net/pro/notes/2014/12/12_arit_speed/

Quote

Follow-up on my notes on code speedup. We measure the computation cost of arithmetic operations on different data types and different (Intel64) CPUs. We see that 64 bits integer is slow, 128 bits floating-point is terrible and 80 bits extended precision not better, division is always slower than other operations (integer and floating-point), and smaller is usually better. Yes, that was expected, but backed by hard code and numbers it's better, isn't it?

 

