
Add one letter and the program runs 100x faster: the power of optimization

So I discovered this issue today at work and I thought you might be interested, as it is quite extraordinary.

 

Background: I'm computing a small neural network on a Cortex-M4F, where you need the hyperbolic tangent (tanh) function a lot. So one just uses the tanh() function from math.h. Wrong!

 

The tanh() function uses double precision even if you pass a float as the argument, and you don't even get a warning. Because the floating-point unit (FPU) of the M4F doesn't support double precision, the function has to be emulated in software and takes about 150 us. But if you use tanhf(), single precision is used and the FPU does the work, so it takes only 1.6 us. Adding a single letter can make a huge difference.
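By the way, the toolchain can be made to catch this silent promotion. Here's a minimal sketch assuming a GCC-based ARM toolchain (Clang has the same flag); -Wdouble-promotion exists precisely for targets whose FPU is single precision only:

// Build with e.g.:
//   arm-none-eabi-gcc -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard \
//                     -Wdouble-promotion -O2 ...

#include <math.h>

float activation(float x)
{
    return tanh(x);     // -Wdouble-promotion should flag this call:
                        // x is implicitly converted to double
    // return tanhf(x); // stays in single precision, nothing to flag
}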

 

TL;DR:

#include <math.h>

float a, b;
a = tanh(b);  // float b promoted to double, emulated in software: ~150 us
a = tanhf(b); // stays single precision on the FPU: ~1.6 us

// -> speed-up: 93.75x

 

Side note: If you are working on a machine whose FPU supports double precision (e.g. x86), the difference is much smaller.



You should post some detailed profiling data if possible. I am curious to see the differences in overall application performance between builds.



2 hours ago, trag1c said:

You should post some detailed profiling data if possible. I am curious to see the differences in overall application performance between builds.

Sadly I do not have full profiling information; I was just using an oscilloscope to measure a few functions.

 

For the entire application we didn't get a speed-up, as we had never used the tanh() function in the first place because of its low speed; an approximation was used instead. By replacing the approximation with tanhf() we gained accuracy and actually lost a little bit of speed.
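For the curious: a typical cheap tanh replacement looks something like this. This is an illustrative sketch of the general technique, not necessarily the approximation we used:

// Rational approximation of tanh: it evaluates to exactly +/-1 at
// x = +/-3, so the clamp below is continuous. Worst-case error is
// around 2%, noticeably worse than tanhf(), but it costs only a few
// multiplies and one divide.
static float fast_tanh(float x)
{
    if (x >  3.0f) return  1.0f;  // tanh saturates towards +/-1
    if (x < -3.0f) return -1.0f;
    float x2 = x * x;
    return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}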

 

The entire network, with 14 neurons, takes 750 us to compute. As it is used in a real-time application we can't use large networks, but even with the small one the classification is about 99% accurate.
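Per neuron the work is essentially a multiply-accumulate loop followed by one tanhf() call. Here's a minimal sketch of one dense layer; sizes and names are made up for illustration, not taken from our actual code:

#include <math.h>

#define N_IN  8
#define N_OUT 4

// One fully connected layer: out[i] = tanhf(dot(W[i], in) + b[i]).
// Everything is single precision, so it all runs on the M4F's FPU.
static void dense_tanh(const float W[N_OUT][N_IN], const float b[N_OUT],
                       const float in[N_IN], float out[N_OUT])
{
    for (int i = 0; i < N_OUT; i++) {
        float acc = b[i];
        for (int j = 0; j < N_IN; j++)
            acc += W[i][j] * in[j];
        out[i] = tanhf(acc);  // tanh() here would add ~150 us per neuron
    }
}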


