Efficiency of string.contains()

HeaterUp

I was thinking about this a lot earlier: how efficient is C#'s string.Contains() method relative to the number of characters in the parameter? For example, I am working with around 20,000 Active Directory users and I have to check for certain strings in a specific attribute, the distinguishedName (for those of you who know AD well). I am wondering, given

string example = "OU=this, CN=that"; 

Would example.Contains("this") be quicker than example.Contains("OU=this"), since the first has fewer characters to match?


Having fewer characters to match will always be quicker. I don't think there's an algorithm where the total time is the same no matter the input. There are ways to make each window comparison O(1) (hash the string you want to find, then slide a window across the string being searched and hash the characters inside it), but the setup will always take longer for a longer pattern simply because there is more to process.
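
Something like this rolling-hash (Rabin-Karp style) search is what I mean. This is only a sketch; the method name RabinKarpContains and the constants are my own, not anything built into .NET:

static bool RabinKarpContains(string text, string pattern)
{
    if (pattern.Length == 0) return true;
    if (pattern.Length > text.Length) return false;

    const int B = 256;          // base of the polynomial hash
    const int M = 1000000007;   // large prime modulus

    long patternHash = 0, windowHash = 0, highPow = 1;

    // highPow = B^(pattern.Length - 1) mod M, used to drop the leading character
    for (int i = 0; i < pattern.Length - 1; i++)
        highPow = (highPow * B) % M;

    // hash the pattern and the first window of the text
    for (int i = 0; i < pattern.Length; i++)
    {
        patternHash = (patternHash * B + pattern[i]) % M;
        windowHash = (windowHash * B + text[i]) % M;
    }

    for (int i = 0; ; i++)
    {
        // hashes match: verify character by character to rule out collisions
        if (windowHash == patternHash &&
            string.CompareOrdinal(text, i, pattern, 0, pattern.Length) == 0)
            return true;

        if (i + pattern.Length >= text.Length)
            return false;

        // slide the window: drop text[i], append text[i + pattern.Length]
        windowHash = (windowHash - text[i] * highPow % M + M) % M;
        windowHash = (windowHash * B + text[i + pattern.Length]) % M;
    }
}

Each window comparison is O(1) on average, but as noted, hashing the pattern up front still costs more for a longer pattern.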


String.Contains() uses an ordinal search. It iterates over the characters only once until it finds a matching first character, then keeps comparing as long as the sequence matches. If the match turns out to be invalid it has to back up and resume the scan, so if you have many partial-match sequences it will do slightly more than a single pass over the string, but the overhead is negligible.
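
Roughly speaking, that behaviour looks like this naive ordinal scan. This is just an illustrative sketch, not the actual framework code:

static bool NaiveContains(string text, string pattern)
{
    if (pattern.Length == 0) return true;

    for (int start = 0; start + pattern.Length <= text.Length; start++)
    {
        int matched = 0;

        // advance while the characters keep matching
        while (matched < pattern.Length && text[start + matched] == pattern[matched])
            matched++;

        if (matched == pattern.Length)
            return true; // full match

        // partial match failed: fall back and retry from the next position
    }

    return false;
}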


According to the .NET source browser, String.Contains is implemented by a sliding window search.

 

This implementation gives a run time that is dependent on the location of the first occurrence of the search string and the length of the search string.

 

The longer the search string is and the farther its first occurrence is from the beginning of the string being searched, the longer it takes to find the match. Additionally, partial matches slow down the search as well.
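
If you want to see whether that difference actually matters for your data, a quick-and-dirty Stopwatch comparison is easy to throw together. The iteration count and names below are just illustrative (a proper measurement would use something like BenchmarkDotNet):

using System;
using System.Diagnostics;

class ContainsTiming
{
    static void Main()
    {
        string example = "OU=this, CN=that";
        const int iterations = 1_000_000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            example.Contains("this");
        sw.Stop();
        Console.WriteLine($"Contains(\"this\"):    {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            example.Contains("OU=this");
        sw.Stop();
        Console.WriteLine($"Contains(\"OU=this\"): {sw.ElapsedMilliseconds} ms");
    }
}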

 

If your strings are formatted such that there is a way to exploit the formatting to get a faster search, it's possible to implement your own. Some introductory information on how to do so is available at https://docs.microsoft.com/en-us/dotnet/api/system.string.contains?view=netframework-4.8
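
For example, if the distinguishedName is always a comma-separated list of key=value parts, a format-aware check could look something like the sketch below. The helper name HasOu and the exact DN layout are assumptions on my part:

// Assumes the DN is a comma-separated list of "key=value" parts,
// e.g. "CN=that,OU=this,DC=example,DC=com".
static bool HasOu(string distinguishedName, string ouValue)
{
    foreach (string part in distinguishedName.Split(','))
    {
        string component = part.Trim();
        if (component.StartsWith("OU=", StringComparison.Ordinal) &&
            string.Equals(component.Substring(3), ouValue, StringComparison.Ordinal))
            return true;
    }
    return false;
}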

 

In your specific example, it wouldn't matter if you searched "OU=this" or just "this" because the "OU=" always precedes "this" and must be checked anyway. 

 

If, however, "OU=" sometimes exists without "this", then it would be faster to just search for "this" in the case where "OU=<not this>" appears before "OU=this".

 

However, this is a micro optimization and you shouldn't worry about it unless you already know that this search is slowing you down beyond acceptable limits.


I feel like I'm missing something, so apologies if this is unhelpful, but I think it might be faster to turn this into a hash table. In your case you might need two, one for hash['OU'] = 'this' and another for hash['this'] = 'OU'. I guess technically you could combine them into one if you make the value an object rather than another string.

 

That would give you O(1) lookup assuming you're searching by the exact string that's already stored every time.   If you need to find a substring within that string, you might try using a Trie.  I believe there are already several implementations that can be used in .NET.

 

In both cases, however, you'd be trading slightly higher memory usage for much faster lookups.
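
To make that concrete, here's a rough sketch of building such a lookup up front. The name BuildDnIndex and the comma-separated key=value DN format are assumptions on my part:

using System;
using System.Collections.Generic;

// Maps each "key=value" DN component (e.g. "OU=this") to the
// distinguishedNames that contain it, so later checks become dictionary lookups.
static Dictionary<string, List<string>> BuildDnIndex(IEnumerable<string> distinguishedNames)
{
    var index = new Dictionary<string, List<string>>(StringComparer.Ordinal);

    foreach (string dn in distinguishedNames)
    {
        foreach (string part in dn.Split(','))
        {
            string component = part.Trim();
            if (!index.TryGetValue(component, out var matches))
                index[component] = matches = new List<string>();
            matches.Add(dn);
        }
    }

    return index;
}

// Usage: index.TryGetValue("OU=this", out var users) is then an O(1) lookup
// instead of a substring scan over every DN.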

 

Again, sorry if this isn't helpful. Could you perhaps make up a few examples and show exactly what sort of operations you'd be performing on them?

 

 

 


39 minutes ago, JacobFW said:

I feel like I'm missing something, so apologies if this is unhelpful, but I think it might be faster to turn this into a hash table. In your case you might need two, one for hash['OU'] = 'this' and another for hash['this'] = 'OU'. I guess technically you could combine them into one if you make the value an object rather than another string.

 

That would give you O(1) lookup assuming you're searching by the exact string that's already stored every time.   If you need to find a substring within that string, you might try using a Trie.  I believe there are already several implementations that can be used in .NET.

 

In both cases, however, you'd be trading slightly higher memory usage for much faster lookups.

 

Again, sorry if this isn't helpful. Could you perhaps make up a few examples and show exactly what sort of operations you'd be performing on them?

 

 

 

It would be even worse, as you need to parse the whole text to build the hash set, unless the text is always in the same format and you know exactly where everything is, index-wise, beforehand.


I don't really think either option is notably better than the other.

 

A few points that might be helpful:

1. Are you able to index the attributes (as in search indexing)? I would assume name is an indexed attribute by default. From what I can see, FQL (Fast Query Language) is supported in AD. FQL queries run in seconds in my experience, though I've not used it in AD. If you see an option for KQL, give that a try as well.

2. Are you able to do this operation asynchronously (Parallel.ForEach, tasks, etc.)? See the sketch after this list.

3. I don't know if this is viable for you, but can you not just load all the user data first and then, once everything has loaded, do string.Contains? String.Contains on 20k items shouldn't be slow IMO; if the data is being loaded repeatedly, then it will take a long time.

4. I'm not sure what you've written in your app, but if each time you do string.Contains you're checking your entire data source for one thing, it's more efficient to loop over your data once and check for everything you're looking for in that single pass. I'm too lazy to explain in depth; maybe someone can elaborate.

5. Can you trim your data source just before you run a query?
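
As a rough sketch of points 2 and 3 combined, assuming the DN strings are already loaded into memory (the names below are illustrative):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// distinguishedNames is assumed to already hold the ~20k DN strings loaded from AD
var matches = new ConcurrentBag<string>();

Parallel.ForEach(distinguishedNames, dn =>
{
    // IndexOf with StringComparison.Ordinal keeps the comparison culture-free
    if (dn.IndexOf("OU=this", StringComparison.Ordinal) >= 0)
        matches.Add(dn);
});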

 

 

I'm fairly sure the way you're doing this can be optimized, but without being able to see the code it's hard to say.

 


5 hours ago, Franck said:

It would be even worse, as you need to parse the whole text to build the hash set, unless the text is always in the same format and you know exactly where everything is, index-wise, beforehand.

No, it just means there's a high initial cost.

Whether or not that cost is justified depends entirely on the operations, how many of them he has to do, and how often the table would have to be rebuilt.

 
