CPU vs GPU instruction set difference

BotDamian

I think you should also take into account that, compared to a CPU-only solution, you'll most likely introduce a fair amount of additional complexity into your codebase. Unless performance in that particular part of your application is absolutely mission critical as mentioned above, it may not be worth it from that perspective alone.


22 hours ago, BotDamian said:

So the thing is that it will filter (not sort) every 3 seconds. One moment it can be an array of 100 items, the next second an array with 10k items, then 4k items.
In short, we will never know how many items there will be; there could even be 5 million. Similar to AI, you don't know exactly how much data comes out; it depends.

 

That's why I'm asking: if the Raspberry Pi 4 can filter 1 million items in a second, then this shouldn't be a problem. If it takes a second longer than 3 s, that's not bad, but it shouldn't become 20 s.

Well, one benefit is that filtering is a lot less intensive than sorting.

 

Nothing beats running a quick trial. Generate a bunch of fake data (like 10 million items) and try filtering it. Set up enough data to illustrate a realistic worst-case scenario, then run your code and see how long it takes. What I would do in this kind of scenario is create a few sets of test data: the worst case, but also the 99% case, and see how each runs.

 

If it doesn't meet your requirements, then you at least know how close you are. E.g. if it runs in 20 seconds, you know you might need either something stronger than a Pi 4 or a move to the GPU; if it runs in about 5 seconds, you might be able to tweak your current code to get it within your threshold.
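As a minimal sketch of that kind of trial (made-up data and a hypothetical threshold; the real objects and predicate will differ):

```kotlin
import kotlin.system.measureNanoTime

fun main() {
    // Fake worst-case data set: 10 million ints cycling through 0..999
    val data = IntArray(10_000_000) { it % 1000 }

    // Time a single filter pass (here: keep values below 500)
    var kept = 0
    val elapsed = measureNanoTime {
        kept = data.count { it < 500 }
    }

    println("kept $kept of ${data.size} items in ${elapsed / 1_000_000} ms")
}
```

On the JVM you'd want to repeat the measurement several times so JIT warm-up doesn't dominate the first run.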


14 hours ago, wanderingfool2 said:

Nothing beats running a quick trial. Generate a bunch of fake data, and try filtering (like 10 million).

I wrote a small application in Kotlin that generates 100M objects (containing an Int) and then uses stream and parallelStream to filter them.

[Chart: mean filter duration vs. number of remaining items, Stream vs. Parallel Stream]

What's interesting to see is that, as the number of items remaining in the list after the operation increases, at some point the single-threaded implementation actually becomes faster than the multi-threaded one. My guess would be that merging the results starts to become significant overhead.

 

Takeaway: the overhead of merging results can easily eat up all the gains you get from doing things in parallel. So be sure to benchmark with actual data and don't assume that things are faster just because you're running on multiple threads.


3 hours ago, Eigenvektor said:

I wrote a small application in Kotlin that generates 100M objects (containing an Int) and then uses stream and parallelStream to filter them.

haha, by chance do you still have your source?  [For the record, I believe your results and mostly agree with your conclusion]

 

I do agree, the overhead of merging is likely the reason. Filtering is an O(n) operation, while sorting is O(n log n), which matters given the results need to be done within 3 seconds: filtering will take comparatively less time on larger datasets. But the real benefit is that a multi-threaded sort can effectively run the same operations as a single-threaded one, because the algorithm divides naturally into independent sub-problems.
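To put rough numbers on that O(n) vs. O(n log n) gap, a quick comparison on synthetic data (absolute times will vary by machine, so no expected output here):

```kotlin
import kotlin.system.measureNanoTime

fun main() {
    // 5 million pseudo-shuffled ints (31 is coprime to 1,000,000)
    val data = IntArray(5_000_000) { (it * 31) % 1_000_000 }

    // O(n): one comparison per element
    val tFilter = measureNanoTime { data.count { it < 500_000 } }

    // O(n log n): sortedArray() copies the array and sorts the copy
    val tSort = measureNanoTime { data.sortedArray() }

    println("filter: ${tFilter / 1_000_000} ms, sort: ${tSort / 1_000_000} ms")
}
```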

 

I suspect that with careful manipulation you could get a better result with filtering, but then it comes back to whether the time invested is worth it. Filtering on a GPU could still potentially be quicker, as it has higher-speed RAM and is built for quick comparisons.

 

While I know it won't match @BotDamian's dataset (and a Raspberry Pi 4 has a slower processor and RAM), the fact that you filtered 100M in at most 1.5 s does show that he likely won't have to worry as much.


1 minute ago, wanderingfool2 said:

haha, by chance do you still have your source?  [For the record, I believe your results and mostly agree with your conclusion]

I do 🙂 I can also attach the whole project if needed, but here's the relevant source code:

import java.io.File
import java.io.IOException
import java.text.NumberFormat
import java.util.Locale
import java.util.stream.Collectors
import java.util.stream.IntStream
import kotlin.system.measureNanoTime
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.jfree.chart.ChartUtils
import org.jfree.chart.JFreeChart
import org.jfree.chart.axis.NumberAxis
import org.jfree.chart.plot.XYPlot
import org.jfree.chart.renderer.xy.XYLineAndShapeRenderer
import org.jfree.chart.title.TextTitle
import org.jfree.data.xy.XYSeries
import org.jfree.data.xy.XYSeriesCollection

class FilterBenchmark {

    companion object {
        const val ITEM_COUNT = 100_000_000
        const val RUNS = 20

        const val MAX_INDEX = 10_000
        const val STEP_SIZE = 500
    }

    class Entity(
        val index: Int
    )

    class Result(
        val meanItems: Double,
        val meanDuration: Double,
    )

    private val numberFormat = NumberFormat.getInstance(Locale.ENGLISH)

    fun run() {
        print("Create $ITEM_COUNT items... ")
        val source = IntStream.range(0, ITEM_COUNT)
            .mapToObj { index -> Entity(index % MAX_INDEX) }
            .collect(Collectors.toList())
        println("done")

        val results = mutableMapOf<String, MutableList<Result>>()
        results["Stream"] = mutableListOf()
        results["Parallel Stream"] = mutableListOf()

        for (filter in 0..MAX_INDEX step STEP_SIZE) {
            results["Stream"]?.add(streamFilter(filter, source))
            results["Parallel Stream"]?.add(parallelStreamFilter(filter, source))
        }

        plotChart(results)
    }

    private fun streamFilter(filter: Int, source: List<Entity>): Result {
        return measure("Stream") {
            source.stream()
                .filter { it.index < filter }
                .collect(Collectors.toList())
        }
    }

    private fun parallelStreamFilter(filter: Int, source: List<Entity>): Result {
        return measure("ParallelStream") {
            source.parallelStream()
                .filter { it.index < filter }
                .collect(Collectors.toList())
        }
    }

    private fun measure(
        name: String,
        method: () -> List<Entity>
    ): Result {
        val filteredStats = DescriptiveStatistics(RUNS)
        val durationStats = DescriptiveStatistics(RUNS)

        // 0..RUNS would run RUNS + 1 times; we want exactly RUNS measurements
        for (n in 0 until RUNS) {
            var filtered: List<Entity>
            val duration = measureNanoTime {
                filtered = method()
            }

            filteredStats.addValue(filtered.size.toDouble())
            durationStats.addValue(duration / 1_000_000.0)
        }

        println(
            """$name took an average of ${numberFormat.format(durationStats.mean)} ms to filter ${
                numberFormat.format(
                    filteredStats.mean
                )
            } out of ${numberFormat.format(ITEM_COUNT)} items in $RUNS runs"""
        )
        return Result(filteredStats.mean, durationStats.mean)
    }

    private fun plotChart(results: MutableMap<String, MutableList<Result>>) {
        val dataset = XYSeriesCollection()

        for (entry in results.entries) {
            val values = XYSeries(entry.key)

            for (result in entry.value) {
                values.add(result.meanItems, result.meanDuration)
            }

            dataset.addSeries(values)
        }

        try {
            val chart = createChart(dataset)

            ChartUtils.saveChartAsJPEG(
                File("output.jpg"),
                chart,
                1024,
                1024
            )
        } catch (e: IOException) {
            e.printStackTrace()
        }
    }

    private fun createChart(dataset: XYSeriesCollection): JFreeChart {
        val domainAxis = NumberAxis("Remaining items after filter operation")
        val rangeAxis = NumberAxis("Mean filter duration [ms]")

        val renderer = XYLineAndShapeRenderer()
        renderer.defaultShapesVisible = true

        val plot = XYPlot(
            dataset,
            domainAxis,
            rangeAxis,
            renderer
        )

        val sub1 = "Filter ${numberFormat.format(ITEM_COUNT)} items"
        val sub2 = "Ryzen 9 5900x @ 2200 MHz (POWERSAVE)"

        val chart = JFreeChart("Average filter duration", plot)
        chart.addSubtitle(TextTitle(sub1))
        chart.addSubtitle(TextTitle(sub2))
        return chart
    }
}

fun main() {
    FilterBenchmark().run()
}

 

External dependencies are Apache Commons Math3 (for calculating the mean) and JFreeChart for plotting the chart.

 

I will readily admit that this is a very naive brute-force implementation. I'm sure there are any number of optimizations that could be made (first of all, avoiding copying data). Initially I just wanted to see how big the difference between stream and parallelStream would be, but while playing with array sizes and different filters I decided to add the chart to see how it would behave.

 

9 minutes ago, wanderingfool2 said:

While I know it won't match @BotDamian's dataset (and if run on a Rasp Pi 4 with a slower processor and RAM), the fact you did 100M in 1.5s as a max does show that he likely won't have to worry as much.

Yeah, and those 1.5 seconds only happen if you copy the 100M source list into a 100M result set, basically duplicating all of the data. I'm using a naive stream implementation that creates a new array for the results, so I think a lot of that time is down to memory.

 

If you just iterate over the 100M array and produce little to no results, the speed is a lot higher. An implementation that works without copying would likely be a lot faster. It would also be interesting to see the difference between a work-stealing thread pool vs. splitting the list into equal chunks (e.g. each thread only iterates a certain index range). I'm not entirely certain how parallelStream does it; I'd have to look it up.
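For the fixed-chunk variant, a sketch of what that could look like (a hypothetical helper, not the actual parallelStream internals): each thread filters its own index range via subList, and the partial results are concatenated in order at the end.

```kotlin
import java.util.concurrent.Executors

// Split the list into equal index ranges, one per thread, filter each
// chunk independently, then concatenate the partial results in order.
fun <T> chunkedParallelFilter(source: List<T>, threads: Int, predicate: (T) -> Boolean): List<T> {
    val pool = Executors.newFixedThreadPool(threads)
    val chunkSize = (source.size + threads - 1) / threads
    val futures = (0 until threads).map { t ->
        pool.submit<List<T>> {
            val from = t * chunkSize
            val to = minOf(from + chunkSize, source.size)
            if (from >= to) emptyList() else source.subList(from, to).filter(predicate)
        }
    }
    val result = futures.flatMap { it.get() } // merge step: this is the overhead discussed above
    pool.shutdown()
    return result
}

fun main() {
    val data = (0 until 1_000_000).toList()
    val kept = chunkedParallelFilter(data, threads = 8) { it % 2 == 0 }
    println(kept.size) // prints 500000
}
```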

 

Note that I limited my CPU to run at a constant 2.2 GHz using the powersave governor (Linux). If I remove that limitation, the results actually skew in favor of single-threaded (higher boost clocks, I suppose).

[Chart: the same benchmark without the clock limit]

 


The difference is that one of my entries is more complicated than an array of objects with a single key-value pair.

 

It will have about 7 keys, then probably 2-3 arrays of 0-100 objects that in turn contain key-values/arrays/objects.

 

I was also thinking about using MongoDB or LokiJS for this. Both of them use BSON, and as far as I'm aware MongoDB is written in C++.

 

But MongoDB wouldn't be the best fit for this; I would prefer something without querying etc.



4 hours ago, BotDamian said:

The difference is that one entry is more complicated than an array of objects with one key value pair.

This was just a quick example I cobbled together in an hour or two to be able to toy around with it and see how much effect multiple threads would have. I've since added an implementation that uses a for-each loop, and it already beats both stream implementations by a mile. In the worst-case scenario (no items removed) it takes around 500 ms.
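The actual loop implementation wasn't posted, so this is an assumed shape of such a for-each variant: a plain loop appending into a preallocated ArrayList, skipping the Stream pipeline entirely.

```kotlin
// Plain-loop filter: preallocating the result list avoids repeated
// array growth, and there's no Stream/spliterator machinery involved.
fun filterLoop(source: List<Int>, threshold: Int): List<Int> {
    val result = ArrayList<Int>(source.size)
    for (value in source) {
        if (value < threshold) result.add(value)
    }
    return result
}

fun main() {
    val data = List(1_000_000) { it % 100 }
    println(filterLoop(data, 50).size)
}
```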

 

In any case, to get more realistic results for your particular use case you'll have to create your own set of benchmarks with objects that resemble real-world data as closely as possible.

 

4 hours ago, BotDamian said:

It will have about 7 keys, then probably 2-3 arrays from 0-100 objects that also have key value/arrays/objects.

As a rule of thumb: the more complex your filter gets, the more CPU cycles it will use, and the larger your objects get, the more expensive copying data around becomes. In my simple example most of the time is spent copying data around, since the filter itself is a single trivial comparison.

 

Also keep in mind that the larger your data set gets, the less likely it is to fit into memory. If that is the case and you need to swap to disk, I/O will have a much larger impact, and the CPU cycles spent on your filter might no longer matter as much.

 

4 hours ago, BotDamian said:

I was also thinking about using MongoDB or lokiJS for that. Both of them use BSON, as far as I'm aware MongoDB is written in C++.

But mongodb wouldn't be the best fit for that, I would prefer something without querying etc.

Try to be as objective as possible when considering technologies to use. Don't let personal preferences keep you from using the best tool for the job. Also, where does the data you're trying to filter come from? Is it written on disk, is it created on the fly, or is it streamed in over the network?

 

If the data is on disk already, then keeping it in a database might actually be ideal. Databases are built for speed when it comes to querying data, and they have many optimization strategies, like indexes, built in. If it is streamed in over the network and you're filtering on the fly, then maybe the CPU doesn't actually matter as much, because the network might become the limiting factor.

 

A small sample project similar to mine can help you determine where the actual bottleneck is, before you waste a lot of time writing the best possible GPU filter there is only to discover that slow network I/O makes all of those improvements irrelevant.


On 7/15/2021 at 11:53 PM, BotDamian said:

That's why I'm asking, if the Rasp Pi 4 can filter 1mil items in a second then this shouldn't be a problem. If it takes a second longer than 3s it's not bad, it shouldn't become 20s tho.

Note that the Pi 4 does not support OpenCL; you have to use OpenGL 2.1/OpenGL ES 3.0 or Vulkan 1.0.

 

OpenGL is a simpler API, but this version doesn't even support compute shaders. The data has to go through the graphics pipeline and can only be passed to the shaders as vertex or texture buffers, and the output has to be a texture buffer as well. This makes it harder to use for general-purpose computation.

 

With the new open source driver, Vulkan 1.0 compute shaders should work now, according to this article. The API is significantly more complex, though.

