Why are these calculations so slow to perform?

1 minute ago, AluminiumTech said:

You're using C++ though......

It shouldn't make much of a difference tbh.

Are you by any chance printing the numbers in each iteration of the loop? That would explain why it takes you so long.

On 25/08/2016 at 0:29 PM, AluminiumTech said:

............................

 

That was embarrassing... After I fixed it, it took 3 seconds to do 2 million calculations.

 
 
 

Just did it in Python and it took about 3 seconds on my 8350 with 1 thread, but 10 threads took about 30.


D:\vbox\Documents\Projects\Threads>python threads.py
starting thread 0 at 2016-08-27 19:35:22.034958
took 2.960297 seconds

D:\vbox\Documents\Projects\Threads>python threads.py
starting thread 0 at 2016-08-27 19:36:13.347054
starting thread 1 at 2016-08-27 19:36:13.353051
starting thread 2 at 2016-08-27 19:36:13.363071
starting thread 3 at 2016-08-27 19:36:13.410558
starting thread 4 at 2016-08-27 19:36:13.510572
starting thread 5 at 2016-08-27 19:36:13.552088
starting thread 6 at 2016-08-27 19:36:13.647092
starting thread 7 at 2016-08-27 19:36:13.673596
starting thread 8 at 2016-08-27 19:36:13.700095
starting thread 9 at 2016-08-27 19:36:13.783609
took 28.743302 seconds

 
import math
from datetime import datetime
from threading import Thread

numThreads = 10

def myfunc(i, numThreads):
    # each thread does ~2 million sqrt calculations and times itself
    startTime = datetime.now()
    #print("starting thread {0} at {1}".format(i, startTime))
    for x in range(1, 2000000):
        c = math.sqrt((x ** 2) + (x ** 2))
    # ints are passed by value, so a shared running total wouldn't
    # accumulate across threads; report this thread's elapsed time instead
    elapsed = (datetime.now() - startTime).total_seconds()
    if i == numThreads - 1:
        print("took {0} seconds".format(elapsed))

for i in range(numThreads):
    t = Thread(target=myfunc, args=(i, numThreads))
    t.start()
    if i == numThreads - 1:
        print("{} threads started".format(numThreads))

 


26 minutes ago, vorticalbox said:

Just did it in Python and it took about 3 seconds on my 8350 with 1 thread, but 10 threads took about 30.

Makes sense since you're doing 2,000,000 per thread. 10x the work, 10x the time.


 

29 minutes ago, fizzlesticks said:

Makes sense since you're doing 2,000,000 per thread. 10x the work, 10x the time.

Yeah, thought it would scale a little better, but hey. Weirdly, it hardly increases CPU usage at all.


Just now, vorticalbox said:

 

Yeah, thought it would scale a little better, but hey. Weirdly, it hardly increases CPU usage at all.

Python doesn't have the same kind of threading as other languages. Due to the GIL, only one thread can execute Python code at a time, so the most you'll get out of a Python program is one core's worth of CPU usage. By adding more threads you're actually slowing the program down; it would be faster to do all 20,000,000 calculations in a single thread and avoid the extra work of switching between threads.
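
A quick way to see this (just a sketch, not from anyone's post here; the function name, the 8,000,000 iteration count, and the thread counts are made up for illustration) is to split the same total amount of CPU-bound work across 1 and then 4 threads. On CPython the 4-thread run is no faster, and usually a bit slower:

import math
import time
from threading import Thread

def work(n):
    # CPU-bound loop; it holds the GIL for essentially its whole runtime
    for x in range(n):
        math.sqrt(x * x + x * x)

total = 8000000

for threads in (1, 4):
    start = time.time()
    pool = [Thread(target=work, args=(total // threads,)) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    print("{} thread(s): {:.2f}s".format(threads, time.time() - start))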


2 minutes ago, fizzlesticks said:

Python doesn't have the same kind of threading as other languages. Due to the GIL, only one thread can execute Python code at a time, so the most you'll get out of a Python program is one core's worth of CPU usage. By adding more threads you're actually slowing the program down; it would be faster to do all 20,000,000 calculations in a single thread and avoid the extra work of switching between threads.

Yeah, I will look into multiprocessing

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing

and see if I can get it to scale better.


35 minutes ago, vorticalbox said:

Yeah, I will look into multiprocessing

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing

and see if I can get it to scale better.

That will certainly scale much better up to your thread count, but it's probably not worth the work when things like numpy exist.

 

import time
import numpy as np

size = 20000000
start = time.perf_counter()

# element-wise over all 20 million values: square, add, square root
xs = np.square(np.random.rand(size))
ys = np.square(np.random.rand(size))
zs = np.sqrt(np.add(xs, ys))

print(time.perf_counter() - start)

20,000,000 calculations takes 0.5 seconds for me (numpy runs the loop in compiled C over whole arrays instead of interpreting Python bytecode per element).


I got bored, opened Notepad, and wrote some quick PHP code to do this multithreaded (latest PHP 7 plus the pthreads extension installed and enabled).

 

<?php
ini_set('memory_limit','4096M'); // only needed if you want to store the results in some array somewhere

$threads = 8; // at least 1 required
$chunks = 2; // how many chunks should each thread do
$chunksize = 10000000; // how many sqrt operations each thread should do per chunk

class WorkerThreads extends Thread
{
    private $workerId;
    private $chunks;
    private $chunksize;

    public function __construct($id, $chunk_cnt, $size)
    {
        $this->workerId = $id;
        $this->chunks = $chunk_cnt;
        $this->chunksize = $size;
    }

    public function run()
    {
        echo "Worker {$this->workerId} running " . PHP_EOL;
        for ($i = 1; $i <= $this->chunks; $i++) {
            $time_s = microtime(true);
            for ($j = 0; $j < $this->chunksize; $j++) {
                $a = mt_rand();
                $b = mt_rand();
                $d = sqrt($a * $a + $b * $b);
            }
            $time_e = microtime(true);
            $time_f = $time_e - $time_s;
            echo "Worker {$this->workerId} worked chunk $i in $time_f seconds" . PHP_EOL;
        }
    }
}

$time_s = microtime(true);
 
// Worker pool
$workers = [];
 
// Initialize and start the threads
foreach (range(1, $threads) as $i) {
    $workers[$i] = new WorkerThreads($i,$chunks,$chunksize);
    $workers[$i]->start();
}
 
// Let the threads come back
foreach (range(1, $threads) as $i) {
    $workers[$i]->join();
}

$time_e = microtime(true);
$time_f = $time_e-$time_s;

echo "Finished processing ". number_format($threads*$chunks*$chunksize)." square roots on $threads threads in $time_f seconds.".PHP_EOL;

 

The code above gets all my FX-8320 cores going and outputs something like this (it's 10 million per chunk; I wanted it to take a long time so I could see the CPU usage actually go up on all cores):

 

 

d:\Programs\php>php test.php
Worker 1 running
Worker 2 running
Worker 3 running
Worker 4 running
Worker 5 running
Worker 6 running
Worker 7 running
Worker 8 running
Worker 1 worked chunk 1 in 5.2623009681702 seconds
Worker 6 worked chunk 1 in 5.1532950401306 seconds
Worker 2 worked chunk 1 in 5.2923021316528 seconds
Worker 4 worked chunk 1 in 5.3003029823303 seconds
Worker 8 worked chunk 1 in 5.2202990055084 seconds
Worker 3 worked chunk 1 in 5.4963138103485 seconds
Worker 7 worked chunk 1 in 5.5463171005249 seconds
Worker 5 worked chunk 1 in 6.1193499565125 seconds
Worker 6 worked chunk 2 in 5.1422939300537 seconds
Worker 4 worked chunk 2 in 5.160295009613 seconds
Worker 8 worked chunk 2 in 5.0782899856567 seconds
Worker 2 worked chunk 2 in 5.2553009986877 seconds
Worker 3 worked chunk 2 in 5.3773081302643 seconds
Worker 7 worked chunk 2 in 5.3383049964905 seconds
Worker 1 worked chunk 2 in 5.9453399181366 seconds
Worker 5 worked chunk 2 in 5.1042909622192 seconds
Finished processing 160,000,000 square roots on 8 threads in 11.27564406395 seconds.

 

 

 


50 million calculations on floating-point numbers in roughly 70k microseconds. Without OpenMP it takes roughly 160k microseconds.

#include <iostream>
#include <random>
#include <chrono>
#include <cmath>
const int n = 50'000'000;
float a[n], b[n], c[n];
int main()
{
	std::random_device rd;
	std::mt19937 mt(rd());
	std::uniform_real_distribution<float> dist(0.0, 1999999973.0);
	std::cout << "Generating dataset...\n";
	for (int i = 0; i < n; ++i)
	{
		a[i] = dist(mt);
		b[i] = dist(mt);
	}
	std::cout << "Done generating dataset.\n";
	std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
	// split the loop across cores when OpenMP is enabled
	#pragma omp parallel for
	for (int i = 0; i < n; ++i)
	{
		c[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]);
	}
	std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
	std::cout << "Executed " << n << " P. theorems in " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds.\n";
	std::cin.get();
	return 0;
}
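
One build note in case anyone tries this: the #pragma omp line is silently ignored unless OpenMP is enabled at compile time, for example with -fopenmp on GCC/Clang or /openmp on MSVC. Without the flag the loop just runs single-threaded, which matches the ~160k-microsecond figure above.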

 


10 hours ago, fizzlesticks said:

That will certainly scale much better up to your thread count, but it's probably not worth the work when things like numpy exist.

 


import time
import numpy as np

size = 20000000
start = time.perf_counter()

# element-wise over all 20 million values: square, add, square root
xs = np.square(np.random.rand(size))
ys = np.square(np.random.rand(size))
zs = np.sqrt(np.add(xs, ys))

print(time.perf_counter() - start)

20,000,000 calculations takes 0.5 seconds for me (numpy runs the loop in compiled C over whole arrays instead of interpreting Python bytecode per element).

I'm new to Python, so I will have a look at numpy; I hadn't heard of it before. I was actually using this topic as a way to learn threading in Python.


@fizzlesticks having a look at your code, doesn't that square 2 random numbers from the 20M rather than calculating all 20M values?

 

Edit: 

 

New and improved code that uses threads better than before. Due to memory issues I could only make a list of 10 million numbers, so I added units to loop over the 10 million.

from multiprocessing import Pool
from datetime import datetime
import math

# set to cpu cores
threads = 8
# 10 million numbers
calculations = 10 ** 7
# times to loop over the number above
units = 1

def f(x):
    for i in range(units):
        c = math.sqrt((x ** 2) + (x ** 2))

if __name__ == '__main__':
    print("creating list")
    myList = list(range(calculations))
    print("List created")
    print("Starting {} calculations".format(calculations * units))
    for i in range(1, threads + 1):
        print("Started using {} thread(s)".format(i))
        startTime = datetime.now()
        p = Pool(i)
        p.map(f, myList)
        # shut this pool down before starting the next, larger one
        p.close()
        p.join()
        print("Took {} with {} thread(s)".format((datetime.now() - startTime).total_seconds(), i))
    print("end")

Output

 

creating list
List created
Starting 10000000 calculations
Started using 1 thread(s)
Took 25.403118 with 1 thread(s)
Started using 2 thread(s)
Took 13.44821 with 2 thread(s)
Started using 3 thread(s)
Took 9.273681 with 3 thread(s)
Started using 4 thread(s)
Took 7.438988 with 4 thread(s)
Started using 5 thread(s)
Took 6.19076 with 5 thread(s)
Started using 6 thread(s)
Took 5.531793 with 6 thread(s)
Started using 7 thread(s)
Took 5.097657 with 7 thread(s)
Started using 8 thread(s)
Took 4.846125 with 8 thread(s)
end

As you can see, it scales very well with more threads :P
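
Worth noting (a sketch, not something I've benchmarked): Pool actually starts worker processes, not threads, which is why it sidesteps the GIL. And since each call to f does so little work, a fair chunk of the runtime goes to pickling list items back and forth between processes. Pool.map takes an optional chunksize argument that batches items per worker, so something like this (10000 is an arbitrary illustrative value, not a tuned one) may cut that overhead:

# batch items sent to each worker process instead of handing them out one at a time
p.map(f, myList, 10000)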


21 hours ago, Nineshadow said:

 


const int n = 50'000'000;

 

What's up with the '000'? That shouldn't compile. If it does on your compiler, the value might not be fifty million. Or is it a typo?


11 hours ago, vorticalbox said:

@fizzlesticks having a look at your code, doesn't that square 2 random numbers from the 20M rather than calculating all 20M values?

Nope, passing an array to most numpy functions will do the operation element-wise on each item in the array. For example, here's the code with some print statements to show what's going on.

 

Code:


import numpy as np

xs = np.array(list(range(11)))
ys = np.array(list(reversed(range(11))))

print(xs)
print(ys)
print()

xs = np.square(xs)
ys = np.square(ys)

print(xs)
print(ys)
print()

zs = np.add(xs, ys)

print(zs)
print()

zs = np.sqrt(zs)

print(zs)

 

Output:


[ 0  1  2  3  4  5  6  7  8  9 10]
[10  9  8  7  6  5  4  3  2  1  0]

[  0   1   4   9  16  25  36  49  64  81 100]
[100  81  64  49  36  25  16   9   4   1   0]

[100  82  68  58  52  50  52  58  68  82 100]

[ 10.           9.05538514   8.24621125   7.61577311   7.21110255
   7.07106781   7.21110255   7.61577311   8.24621125   9.05538514  10.        ]

 

 


6 minutes ago, Unimportant said:

What's up with the '000'? That shouldn't compile. If it does on your compiler, the value might not be fifty million. Or is it a typo?

It's new in C++14: you can use ' to separate groups of digits, which is more readable than trying to count how many zeros there are. It's like how we use a comma or period in the real world, but those already mean something in C++ (the comma operator and the decimal point), so they chose '.


59 minutes ago, Unimportant said:

What's up with the '000'? That shouldn't compile. If it does on your compiler, the value might not be fifty million. Or is it a typo?

Digit separators are a new feature in C++14, as @fizzlesticks mentioned.

auto integer_literal = 1'000'000;
auto floating_point_literal = 0.000'015'3;
auto binary_literal = 0b0100'1100'0110;

 


22 hours ago, Nineshadow said:

Digit separators are a new feature in C++14, as @fizzlesticks mentioned.


auto integer_literal = 1'000'000;
auto floating_point_literal = 0.000'015'3;
auto binary_literal = 0b0100'1100'0110;

 

That is so awesome, I hope Python copies it :x


  • 2 weeks later...
On 8/25/2016 at 7:05 AM, Nineshadow said:

Are you by any chance printing the results each iteration?

Because something like this:


#include <iostream>
#include <fstream>
#include <random>
#include <chrono>
float a, b, c;
int main()
{
	std::random_device rd;
	std::mt19937 mt(rd());
	std::uniform_real_distribution<double> dist(0.0, 1999999973.0);
	std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
	for (int i = 0; i < 1100000; ++i)
	{
		a = dist(mt), b = dist(mt);
		c = a*a + b*b;
	}
	std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
	std::cout << "Time : " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;
	std::cin.get();
    return 0;
}

Takes me around ~750000 microseconds. But going with random numbers each time isn't exactly a great way of doing it, since results can vary a lot.

omg, use a namespace

