CPU vs GPU instruction set difference

BotDamian

I think you should also take into account that, compared to a CPU-only solution, you'll most likely introduce a fair amount of additional complexity into your codebase. Unless performance in that particular part of your application is absolutely mission critical as mentioned above, it may not be worth it from that perspective alone.


22 hours ago, BotDamian said:

So the thing is that it will filter (not sort) every 3 seconds. One moment it can be an array of 100 items, the next second an array with 10k items, then 4k items.
In short, we will never know how many items there will be; there could even be 5 million. Similar to AI, you don't know exactly how much data comes out; it depends.

 

That's why I'm asking: if the Raspberry Pi 4 can filter 1 million items in a second, then this shouldn't be a problem. If it takes a second longer than 3 s, that's not bad, but it shouldn't become 20 s.

Well, one benefit is that filtering is a lot less intensive than sorting.

 

Nothing beats running a quick trial. Generate a bunch of fake data (like 10 million items) and try filtering it. Set up enough data to illustrate a realistic worst-case scenario, then run your code and see how long it takes. What I would do in this kind of scenario is create a few sets of test data: the worst case, but also the 99% case, and see how each runs.

 

If it doesn't meet your requirements, then you at least know how close you are. E.g. if it runs in 20 seconds, you know you might need either something stronger than a Pi 4 or a move to the GPU; if it runs in about 5 seconds, you might be able to tweak your current code to get it within your threshold.
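As a minimal sketch of that kind of trial (made-up data and a hypothetical threshold; the real objects and predicate will differ):

```kotlin
import kotlin.system.measureNanoTime

fun main() {
    // Fake worst-case data set: 10 million ints cycling through 0..999
    val data = IntArray(10_000_000) { it % 1000 }

    // Time a single filter pass (here: keep values below 500)
    var kept = 0
    val elapsed = measureNanoTime {
        kept = data.count { it < 500 }
    }

    println("kept $kept of ${data.size} items in ${elapsed / 1_000_000} ms")
}
```

On the JVM you'd want to repeat the measurement several times so JIT warm-up doesn't dominate the first run.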


14 hours ago, wanderingfool2 said:

Nothing beats running a quick trial. Generate a bunch of fake data, and try filtering (like 10 million).

I wrote a small application in Kotlin that generates 100M objects (containing an Int) and then uses stream and parallelStream to filter them.

[Chart: mean filter duration vs. number of remaining items, Stream vs. Parallel Stream]

What's interesting to see is that, as the number of items remaining in the list after the operation increases, at some point the single-threaded implementation actually becomes faster than the multi-threaded one. My guess would be that merging the results starts to become significant overhead.

 

Takeaway: the overhead of merging results can easily eat up all the gains you get from doing things in parallel. So be sure to benchmark with actual data and don't assume that things are faster just because you're running on multiple threads.


3 hours ago, Eigenvektor said:

I wrote a small application in Kotlin that generates 100M objects (containing an Int) and then uses stream and parallelStream to filter them.

haha, by chance do you still have your source?  [For the record, I believe your results and mostly agree with your conclusion]

 

I do agree, the overhead of merging is likely the reason. Filtering is an O(n) operation, while sorting is O(n log n), which matters given the results need to be done within 3 seconds: filtering will take comparatively less time on larger datasets. But the real benefit is that a multi-threaded sort can effectively run the same operations as a single-threaded one, because the algorithm divides naturally into independent sub-problems.
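To put rough numbers on that O(n) vs. O(n log n) gap, a quick comparison on synthetic data (absolute times will vary by machine, so no expected output here):

```kotlin
import kotlin.system.measureNanoTime

fun main() {
    // 5 million pseudo-shuffled ints (31 is coprime to 1,000,000)
    val data = IntArray(5_000_000) { (it * 31) % 1_000_000 }

    // O(n): one comparison per element
    val tFilter = measureNanoTime { data.count { it < 500_000 } }

    // O(n log n): sortedArray() copies the array and sorts the copy
    val tSort = measureNanoTime { data.sortedArray() }

    println("filter: ${tFilter / 1_000_000} ms, sort: ${tSort / 1_000_000} ms")
}
```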

 

I suspect that with careful manipulation you could get a better result with filtering, but then it comes back to whether the time invested is worth it. Filtering on a GPU could still potentially be quicker, as it has higher-speed RAM and is built for quick comparisons.

 

While I know it won't match @BotDamian's dataset (and a Raspberry Pi 4 has a slower processor and RAM), the fact that you filtered 100M in at most 1.5 s does show that he likely won't have to worry as much.


1 minute ago, wanderingfool2 said:

haha, by chance do you still have your source?  [For the record, I believe your results and mostly agree with your conclusion]

I do 🙂 I can also attach the whole project if needed, but here's the relevant source code:

import java.io.File
import java.io.IOException
import java.text.NumberFormat
import java.util.Locale
import java.util.stream.Collectors
import java.util.stream.IntStream
import kotlin.system.measureNanoTime
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.jfree.chart.ChartUtils
import org.jfree.chart.JFreeChart
import org.jfree.chart.axis.NumberAxis
import org.jfree.chart.plot.XYPlot
import org.jfree.chart.renderer.xy.XYLineAndShapeRenderer
import org.jfree.chart.title.TextTitle
import org.jfree.data.xy.XYSeries
import org.jfree.data.xy.XYSeriesCollection

class FilterBenchmark {

    companion object {
        const val ITEM_COUNT = 100_000_000
        const val RUNS = 20

        const val MAX_INDEX = 10_000
        const val STEP_SIZE = 500
    }

    class Entity(
        val index: Int
    )

    class Result(
        val meanItems: Double,
        val meanDuration: Double,
    )

    private val numberFormat = NumberFormat.getInstance(Locale.ENGLISH)

    fun run() {
        print("Create $ITEM_COUNT items... ")
        val source = IntStream.range(0, ITEM_COUNT)
            .mapToObj { index -> Entity(index % MAX_INDEX) }
            .collect(Collectors.toList())
        println("done")

        val results = mutableMapOf<String, MutableList<Result>>()
        results["Stream"] = mutableListOf()
        results["Parallel Stream"] = mutableListOf()

        for (filter in 0..MAX_INDEX step STEP_SIZE) {
            results["Stream"]?.add(streamFilter(filter, source))
            results["Parallel Stream"]?.add(parallelStreamFilter(filter, source))
        }

        plotChart(results)
    }

    private fun streamFilter(filter: Int, source: List<Entity>): Result {
        return measure("Stream") {
            source.stream()
                .filter { it.index < filter }
                .collect(Collectors.toList())
        }
    }

    private fun parallelStreamFilter(filter: Int, source: List<Entity>): Result {
        return measure("ParallelStream") {
            source.parallelStream()
                .filter { it.index < filter }
                .collect(Collectors.toList())
        }
    }

    private fun measure(
        name: String,
        method: () -> List<Entity>
    ): Result {
        val filteredStats = DescriptiveStatistics(RUNS)
        val durationStats = DescriptiveStatistics(RUNS)

        // 0..RUNS would run RUNS + 1 times; we want exactly RUNS measurements
        for (n in 0 until RUNS) {
            var filtered: List<Entity>
            val duration = measureNanoTime {
                filtered = method()
            }

            filteredStats.addValue(filtered.size.toDouble())
            durationStats.addValue(duration / 1_000_000.0)
        }

        println(
            """$name took an average of ${numberFormat.format(durationStats.mean)} ms to filter ${
                numberFormat.format(
                    filteredStats.mean
                )
            } out of ${numberFormat.format(ITEM_COUNT)} items in $RUNS runs"""
        )
        return Result(filteredStats.mean, durationStats.mean)
    }

    private fun plotChart(results: MutableMap<String, MutableList<Result>>) {
        val dataset = XYSeriesCollection()

        for (entry in results.entries) {
            val values = XYSeries(entry.key)

            for (result in entry.value) {
                values.add(result.meanItems, result.meanDuration)
            }

            dataset.addSeries(values)
        }

        try {
            val chart = createChart(dataset)

            ChartUtils.saveChartAsJPEG(
                File("output.jpg"),
                chart,
                1024,
                1024
            )
        } catch (e: IOException) {
            e.printStackTrace()
        }
    }

    private fun createChart(dataset: XYSeriesCollection): JFreeChart {
        val domainAxis = NumberAxis("Remaining items after filter operation")
        val rangeAxis = NumberAxis("Mean filter duration [ms]")

        val renderer = XYLineAndShapeRenderer()
        renderer.defaultShapesVisible = true

        val plot = XYPlot(
            dataset,
            domainAxis,
            rangeAxis,
            renderer
        )

        val sub1 = "Filter ${numberFormat.format(ITEM_COUNT)} items"
        val sub2 = "Ryzen 9 5900x @ 2200 MHz (POWERSAVE)"

        val chart = JFreeChart("Average filter duration", plot)
        chart.addSubtitle(TextTitle(sub1))
        chart.addSubtitle(TextTitle(sub2))
        return chart
    }
}

fun main() {
    FilterBenchmark().run()
}

 

External dependencies are Apache Commons Math3 (for calculating the mean) and JFreeChart for plotting the chart.

 

I will readily admit that this is a very naive brute-force implementation. I'm sure there are any number of optimizations that could be made (first of all, avoiding copying data). Initially I just wanted to see how big the difference between stream and parallelStream would be, but while playing with array sizes and different filters I decided to add the chart to see how it would behave.

 

9 minutes ago, wanderingfool2 said:

While I know it won't match @BotDamian's dataset (and if run on a Rasp Pi 4 with a slower processor and RAM), the fact you did 100M in 1.5s as a max does show that he likely won't have to worry as much.

Yeah, and those 1.5 seconds only happen if you copy the 100M source list into a 100M result set, basically duplicating all of the data. I'm using a naive stream implementation that creates a new array for the results, so I think a lot of that time is down to memory.

 

If you just iterate over the 100M array and produce little to no results, the speed is a lot higher. An implementation that works without copying would likely be a lot faster. It would also be interesting to see the difference between a work-stealing thread pool vs. splitting the list into equal chunks (e.g. each thread only iterates a certain index range). I'm not entirely certain how parallelStream does it; I'd have to look it up.
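For the fixed-chunk variant, a sketch of what that could look like (a hypothetical helper, not the actual parallelStream internals): each thread filters its own index range via subList, and the partial results are concatenated in order at the end.

```kotlin
import java.util.concurrent.Executors

// Split the list into equal index ranges, one per thread, filter each
// chunk independently, then concatenate the partial results in order.
fun <T> chunkedParallelFilter(source: List<T>, threads: Int, predicate: (T) -> Boolean): List<T> {
    val pool = Executors.newFixedThreadPool(threads)
    val chunkSize = (source.size + threads - 1) / threads
    val futures = (0 until threads).map { t ->
        pool.submit<List<T>> {
            val from = t * chunkSize
            val to = minOf(from + chunkSize, source.size)
            if (from >= to) emptyList() else source.subList(from, to).filter(predicate)
        }
    }
    val result = futures.flatMap { it.get() } // merge step: this is the overhead discussed above
    pool.shutdown()
    return result
}

fun main() {
    val data = (0 until 1_000_000).toList()
    val kept = chunkedParallelFilter(data, threads = 8) { it % 2 == 0 }
    println(kept.size) // prints 500000
}
```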

 

Note that I limited my CPU to run at a constant 2.2 GHz using the powersave governor (Linux). If I remove that limitation, the results actually skew in favor of single-threaded (higher boost clocks, I suppose).

[Chart: the same benchmark without the clock limit]

 


The difference is that one of my entries is more complicated than an array of objects with a single key-value pair.

 

It will have about 7 keys, then probably 2-3 arrays of 0-100 objects that in turn contain key-values/arrays/objects.

 

I was also thinking about using MongoDB or LokiJS for this. Both of them use BSON, and as far as I'm aware MongoDB is written in C++.

 

But MongoDB wouldn't be the best fit for this; I would prefer something without querying etc.



4 hours ago, BotDamian said:

The difference is that one entry is more complicated than an array of objects with one key value pair.

This was just a quick example I cobbled together in an hour or two to be able to toy around with it and see how much effect multiple threads would have. I've since added an implementation that uses a for-each loop, and it already beats both stream implementations by a mile. In the worst-case scenario (no items removed) it takes around 500 ms.
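The actual loop implementation wasn't posted, so this is an assumed shape of such a for-each variant: a plain loop appending into a preallocated ArrayList, skipping the Stream pipeline entirely.

```kotlin
// Plain-loop filter: preallocating the result list avoids repeated
// array growth, and there's no Stream/spliterator machinery involved.
fun filterLoop(source: List<Int>, threshold: Int): List<Int> {
    val result = ArrayList<Int>(source.size)
    for (value in source) {
        if (value < threshold) result.add(value)
    }
    return result
}

fun main() {
    val data = List(1_000_000) { it % 100 }
    println(filterLoop(data, 50).size)
}
```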

 

In any case, to get more realistic results for your particular use case you'll have to create your own set of benchmarks with objects that resemble real-world data as closely as possible.

 

4 hours ago, BotDamian said:

It will have about 7 keys, then probably 2-3 arrays from 0-100 objects that also have key value/arrays/objects.

As a rule of thumb: the more complex your filter gets, the more CPU cycles it will use, and the larger your objects get, the more expensive copying data around becomes. In my simple example most of the time is spent copying data around, since the filter itself is a single trivial comparison.

 

Also keep in mind that the larger your data set gets, the less likely it is to fit into memory. If that is the case and you need to swap to disk, I/O will have a much larger impact, and the CPU cycles spent on your filter might no longer matter as much.

 

4 hours ago, BotDamian said:

I was also thinking about using MongoDB or lokiJS for that. Both of them use BSON, as far as I'm aware MongoDB is written in C++.

But mongodb wouldn't be the best fit for that, I would prefer something without querying etc.

Try to be as objective as possible when considering technologies to use. Don't let personal preferences keep you from using the best tool for the job. Also, where does the data you're trying to filter come from? Is it written on disk, is it created on the fly, or is it streamed in over the network?

 

If the data is on disk already, then keeping it in a database might actually be ideal. Databases are built for speed when it comes to querying data, and they have many optimization strategies, like indexes, built in. If it is streamed in over the network and you're filtering on the fly, then maybe the CPU doesn't actually matter as much, because the network might become the limiting factor.

 

A small sample project similar to mine can help you determine where the actual bottleneck is, before you waste a lot of time writing the best possible GPU filter there is only to discover that slow network I/O makes all of those improvements irrelevant.


On 7/15/2021 at 11:53 PM, BotDamian said:

That's why I'm asking, if the Rasp Pi 4 can filter 1mil items in a second then this shouldn't be a problem. If it takes a second longer than 3s it's not bad, it shouldn't become 20s tho.

Note that the Pi 4 does not support OpenCL; you have to use OpenGL 2.1/OpenGL ES 3.0 or Vulkan 1.0.

 

OpenGL is a simpler API, but this version doesn't even support compute shaders. The data has to go through the graphics pipeline and can only be passed to the shaders as vertex or texture buffers, and the output has to be a texture buffer as well. This makes it harder to use for general-purpose computation.

 

With the new open source driver, Vulkan 1.0 compute shaders should work now, according to this article. The API is significantly more complex, though.

