So I'm currently trying to build a deal finder for Craigslist that looks up benchmark scores for the listed GPUs and basically gives me price/performance ratings on used offerings.

The problem at hand is the following:

I already have a list of GPU names and scores. That's done, but now I have to compare the Craigslist offering titles to that list for partial matches.

E.g.: "XFX Radeon RX 570 RS XXX Edition, 4GB GDDR5, RX-570P4DFD6/RX-570" has to correlate to "Radeon RX 570" from the list, rather than "GeForce GTX 570", which is also on the list.

If anybody has an idea for an algorithm, I'd love to discuss it...

"You know it'll clock down as soon as it hits 40°C, right?" - "Yeah ... but it doesnt hit 40°C ... ever  😄"

 

GPU: MSI GTX1080 Ti Aero @ 2 GHz (watercooled) CPU: Ryzen 5600X (watercooled) RAM: 32GB 3600Mhz Corsair LPX MB: Gigabyte B550i PSU: Corsair SF750 Case: Hyte Revolt 3

 

https://linustechtips.com/topic/799297-c-matching-strings/

First off, the RX 570 != GTX 570. :P

 

This is actually a fairly difficult problem, mainly because humans are the ones typing up the descriptions (presumably), and things like putting the model number down sometimes doesn't occur to people.

Substring searching and a large data store of video cards to look through is about all you can do without getting into significantly more complex solutions. Even then, you'll have to review its matches to ensure they're actually matches.

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10055362

In reply to HarryNyquist (1 minute ago):

Shit :P I just grabbed a quick example and didn't notice.

Yeah, I'm looking for a way to filter out irrelevant info. There must be a way to reduce the titles to the model number. And honestly, if you don't put the model in the title, I don't give a crap.

I realize it's not gonna be perfect, but I'm still quite interested in it.

I thought about reducing the title to only numbers, but then there are experts who put "2048 MB RAM" in the title, and all of a sudden I can't use that anymore.

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10055370

I'm not a C# dev, but for partial matching you might want to look at regular expression matching. You can match subgroups and put these in a 'fuzzy' string search on a SQL database. But then, I may be underestimating your problem here.

 

CPU: i7-12700KF Grill Plate Edition // MOBO: Asus Z690-PLUS WIFI D4 // RAM: 16GB G.Skill Trident Z 3200MHz CL14 

GPU: MSI GTX 1080 FE // PSU: Corsair RM750i // CASE: Thermaltake Core X71 // BOOT: Samsung Evo 960 500GB

STORAGE: WD PC SN530 512GB + Samsung Evo 860 500GB // COOLING: Full custom loop // DISPLAY: LG 34UC89G-B

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10055458

In reply to Limecat86 (34 minutes ago):

It's a shame really, but this is the first time I actually took the time to learn about fuzzy search, and it's a great help. Thanks!

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10055573

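using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var cards = new List<string>
        {
            "RX-570P4DFD6/RX-570",
            "XFX Radeon RX 570 RS XXX Edition",
            "NVIDIA GeForce GTX 1080 Ti",
            "ROG-STRIX-GTX1080TI-O11G-GAMING"
        };

        foreach (var card in cards)
        {
            // 2-3 letters, any number of separators, then 3-4 digits
            var match = Regex.Match(card, @"[a-zA-Z]{2,3}[ \-_\.]{0,}\d{3,4}");

            // this one is more strict
            //var match = Regex.Match(card, @"[GRTX]{2,3}[ \-_\.]{0,1}\d{3,4}");

            if (match.Success)
            {
                // e.g. "RX-570" or "GTX 1080"
                var model = match.ToString();
                Console.WriteLine(model);
            }
        }
    }
}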

maybe this is a push in the right direction...

 

The regex looks for 2 to 3 letters, such as RX or GTX; it could be more specific with [GRTX] instead of [a-zA-Z]. Then it matches 0 or more spaces, underscores, hyphens, or dots; you can limit this to a single one if necessary by changing {0,} to {0,1}. Finally, it matches 3 to 4 digits for the model number.

 

the next thing you have to do is to compare the match to your entered model number
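
A minimal sketch of that comparison step, assuming the benchmark list holds names like "Radeon RX 570". The class ModelKeyLookup and its method names are made up for illustration. It runs the same pattern over both the listing title and the list entries, strips the separators, and looks the result up in a dictionary:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ModelKeyLookup
{
    // Same idea as the pattern above: 2-3 letters, optional separators, 3-4 digits.
    static readonly Regex ModelPattern = new Regex(@"[a-zA-Z]{2,3}[ \-_.]*\d{3,4}");

    // "RX-570", "rx 570" and "RX570" all become "RX570"; null if nothing matches.
    public static string ExtractKey(string text)
    {
        Match m = ModelPattern.Match(text);
        if (!m.Success) return null;
        return Regex.Replace(m.Value, @"[ \-_.]", "").ToUpperInvariant();
    }

    // Index the known model names by their normalized key.
    public static Dictionary<string, string> BuildIndex(IEnumerable<string> models)
    {
        var index = new Dictionary<string, string>();
        foreach (string model in models)
        {
            string key = ExtractKey(model);
            if (key != null) index[key] = model;
        }
        return index;
    }

    // Returns the list entry whose key equals the title's key, or null.
    public static string FindModel(string title, Dictionary<string, string> index)
    {
        string key = ExtractKey(title);
        string model;
        return key != null && index.TryGetValue(key, out model) ? model : null;
    }

    static void Main()
    {
        var index = BuildIndex(new[] { "Radeon RX 570", "GeForce GTX 570", "GeForce GTX 1080 Ti" });
        Console.WriteLine(FindModel("XFX Radeon RX 570 RS XXX Edition", index)); // Radeon RX 570
        Console.WriteLine(FindModel("ROG-STRIX-GTX1080TI-O11G-GAMING", index));  // GeForce GTX 1080 Ti
    }
}
```

Since the pattern stops at the digits, suffixes like "Ti" are ignored, so "GTX 1080" and "GTX 1080 Ti" would end up under the same key here.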

 

edit: apologies for the code formatting... the forum makes it quite difficult

#killedmywife #howtomakebombs #vgamasterrace

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10055694

The String class has a method called "Contains()", which returns a bool. You can define a number of standard card names, e.g. "rx 570", "radeon 570", etc., and try to find those definitions within your list.
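
A rough sketch of what that could look like. The alias table, the class name AliasMatcher, and the spellings in it are only assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;

static class AliasMatcher
{
    // Hypothetical alias table: canonical name -> spellings to look for (all lowercase).
    static readonly Dictionary<string, string[]> Aliases = new Dictionary<string, string[]>
    {
        { "Radeon RX 570",   new[] { "rx 570", "rx570", "rx-570", "radeon 570" } },
        { "GeForce GTX 570", new[] { "gtx 570", "gtx570", "gtx-570", "geforce 570" } },
    };

    // Returns the first canonical name with an alias contained in the title, or null.
    public static string FindModel(string title)
    {
        string lower = title.ToLowerInvariant();
        foreach (var entry in Aliases)
        {
            foreach (string alias in entry.Value)
            {
                if (lower.Contains(alias))
                {
                    return entry.Key;
                }
            }
        }
        return null;
    }

    static void Main()
    {
        Console.WriteLine(FindModel("XFX Radeon RX 570 RS XXX Edition, 4GB GDDR5")); // Radeon RX 570
        Console.WriteLine(FindModel("EVGA GeForce GTX570 1280MB"));                  // GeForce GTX 570
    }
}
```

A plain substring check has no notion of word boundaries, so "rx 570" would also be found inside "RX 5700"; every alias has to be maintained by hand as well.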

Quote or tag if you want me to answer! PM me if you are in a real hurry!

Why do Java developers wear glasses? Because they can't C#!

 

My Machines:

The Gaming Rig:

-Processor: i5 6600k @4.6GHz

-Graphics: GTX1060 6GB G1 Gaming

-RAM: 2x8GB HyperX DDR4 2133MHz

-Motherboard: Asus Z170-A

-Cooler: Corsair H100i

-PSU: EVGA 650W 80+bronze

-AOC 1080p ultrawide

My good old laptop:

Lenovo T430

-Processor: i7 3520M

-4GB DDR3 1600MHz

-Graphics: intel iGPU :(

-Not even 1080p

 

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10057469

In reply to simson0606 (27.6.2017, 3:26 PM):

Thanks for the advice, but sadly it's not gonna cut it. Here is an example of it in use:

[image: regex.JPG]

The model column represents the results of the regex, and as you can see, not only does it not work perfectly, but I also found some genius who didn't even put a model in the title :P

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10071531

In reply to dany_boy (27.6.2017, 9:40 PM):

Yeah, sorry, but I'm looking more for an automated solution that doesn't only work for specified models.

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10071536

On 28.6.2017 at 4:13 PM, Erik Sieghart said:

Hey approximate string matching is something I actually code professionally. Not in C# though.

I don't believe only regex is the right solution, since regex matches patterns and isn't particularly well suited to matching unpredictable text. It can be useful for confirming that an approximate string match is valid, however.

What you need is a combination of creating tokens, matching patterns, a lookup table, and fuzzy searching. Fuzzy searching is most commonly implemented with Levenshtein distances. Wikipedia has the pseudo code necessary to implement it; it's not terribly complicated until you start trying to optimize for limitations on your parameters. C# likely has a library for it somewhere that's already implemented anyway.

 

The first thing you're going to have to do is start splitting tokens out. Splitting by spaces, slashes, hyphens, and commas is usually good. Then you run your dictionary of terms over the tokens. If you recognize something that matches perfectly, then you can bin it in that particular category. If it has multiple matches you'll have to bin it as "multiple matches". Categories can be things like manufacturer, vendor, model number, etc.

After that you have to take what's left and start running your fuzzy search on tokens for all the data you don't have, to bin items more strongly. The item should be matched to the item it has the most matches for (this can be weighted). If it's too ambiguous, then it may be better to either toss the item or bin it as "unmatched".

This is a laborious approach, as a fuzzy search requires running distances on everything in your dictionary for every single token in a brute force sort of approach, although you could speed it up by escaping the calculation early if it's clear it won't be a match (such as when comparing a short token to a long token).

In summary building a fuzzy search is easy, making it useful can be much harder.
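
The token and lookup-table steps described here could be sketched roughly as follows. The table contents, the category names, and the class TokenBinner are assumptions for illustration; the "multiple matches" case and the fuzzy pass over the leftovers are not covered:

```csharp
using System;
using System.Collections.Generic;

static class TokenBinner
{
    // Hypothetical lookup table: known token -> category (case-insensitive).
    static readonly Dictionary<string, string> Lookup =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "XFX", "vendor" }, { "MSI", "vendor" }, { "ASUS", "vendor" },
            { "Radeon", "brand" }, { "GeForce", "brand" },
            { "RX", "series" }, { "GTX", "series" },
        };

    // Splits the title into tokens and returns category -> tokens.
    // Tokens that are not in the table land in "unmatched".
    public static Dictionary<string, List<string>> Bin(string title)
    {
        var bins = new Dictionary<string, List<string>>();
        char[] separators = { ' ', '/', '-', ',' };
        foreach (string token in title.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string category;
            if (!Lookup.TryGetValue(token, out category))
            {
                category = "unmatched";
            }

            List<string> list;
            if (!bins.TryGetValue(category, out list))
            {
                list = new List<string>();
                bins[category] = list;
            }
            list.Add(token);
        }
        return bins;
    }

    static void Main()
    {
        foreach (var bin in Bin("XFX Radeon RX 570 RS XXX Edition, 4GB GDDR5"))
        {
            Console.WriteLine(bin.Key + ": " + string.Join(", ", bin.Value));
        }
    }
}
```

Whatever ends up in "unmatched" is what the fuzzy search would then be run on.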

Fuzzy search seems like an advanced enough solution, but sadly I'm nowhere near familiar enough with it, so my results using the existing libraries are really bad. Like, worse than the regex solution. So unless you want to walk me through everything, I guess I'll be looking for a different solution.

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10071590

Convert all strings to uppercase or lowercase (or use case-insensitive functions to search for text within a string).

You want to "pretty-fy" your title first to make it easier to parse.

Look for some keywords in the text like RX or GTX or GT and check what character is BEFORE and AFTER these keywords. If it's not a SPACE, then ADD a space. Basically you want to convert "RX470" to "RX 470", or if you have "GTX1070", you want to convert it to "GTX 1070".

Optionally, do the same for memory: look for "GB" and add a space before it if the preceding character is not a space, for example convert "8GB" to "8 GB". You could also do this for MB (e.g. 6144 MB for the GTX 980 Ti in the image above).

Now you can search the whole text for double spaces and replace them with a single space, because you're going to split the title into an array of strings later.

So for example, your "Sapphire Radeon RX460 OC" becomes "Sapphire Radeon RX 460 OC", which is then split into an array of words.

Now you can simply use a for loop to go through each word. If you find the keyword "RX" or "GTX", look at the previous word or the next word. If it's "470" or "580" or whatever for RX, put the proper version in a variable in your code. If the keyword is GTX, then look before and after for "1060", "1070", etc.

 

This won't solve the problem with some special codes like  GTX650-DC-1GD5 but it would be an improvement
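
As a rough sketch of these steps (the class TitleNormalizer and the exact regex patterns are my own assumptions, and it only looks at the word after the keyword, not the one before):

```csharp
using System;
using System.Text.RegularExpressions;

static class TitleNormalizer
{
    public static string Normalize(string title)
    {
        string s = title.ToUpperInvariant();
        // Space between a series keyword and digits glued to it: "RX470" -> "RX 470".
        s = Regex.Replace(s, @"\b(RX|GTX|GT)(?=\d)", "$1 ");
        // Same for memory sizes: "8GB" -> "8 GB".
        s = Regex.Replace(s, @"(?<=\d)(GB|MB)\b", " $1");
        // Collapse repeated whitespace into a single space.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }

    // Returns e.g. "RX 460" if a keyword is followed by a 3-4 digit word, else null.
    public static string FindModel(string title)
    {
        string[] words = Normalize(title).Split(' ');
        for (int i = 0; i < words.Length - 1; i++)
        {
            bool isKeyword = words[i] == "RX" || words[i] == "GTX" || words[i] == "GT";
            if (isKeyword && Regex.IsMatch(words[i + 1], @"^\d{3,4}$"))
            {
                return words[i] + " " + words[i + 1];
            }
        }
        return null;
    }

    static void Main()
    {
        Console.WriteLine(Normalize("Sapphire Radeon RX460 OC")); // SAPPHIRE RADEON RX 460 OC
        Console.WriteLine(FindModel("Sapphire Radeon RX460 OC")); // RX 460
    }
}
```

A code like GTX650-DC-1GD5 becomes "GTX 650-DC-1GD5" and is still not recognized, which matches the limitation mentioned above.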

 

 

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10071607

This is an example of one of many problems that seem easy for a human to comprehend but turn out to be very difficult for a computer to automate.

Fuzzy searching is probably going to be your best bet, but as you've said, that may have already exceeded your skill level. It might take more time to automate than to parse it yourself.

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10073315

In reply to cluelessgenius (27/06/2017, 0:34 PM):

Regular expression for each card number, then check which card is a match - if there's more than one match, you would have to sort that out and take it into account next time.
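
One way this could look, as a sketch only: the catalogue layout (display name, series, number) and the class PerCardRegex are assumptions, not something from the actual score list.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PerCardRegex
{
    // Hypothetical catalogue: display name, series keyword, model number.
    static readonly string[][] Cards =
    {
        new[] { "Radeon RX 570", "RX", "570" },
        new[] { "GeForce GTX 570", "GTX", "570" },
    };

    // No letter or digit directly before the series, optional separators
    // between series and number, and no further digit after the number.
    public static Regex BuildPattern(string series, string number)
    {
        string pattern = @"(?<![A-Za-z0-9])" + Regex.Escape(series)
                       + @"[ \-_]*" + Regex.Escape(number) + @"(?!\d)";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    // Returns every catalogue entry whose pattern matches the title.
    public static List<string> FindMatches(string title)
    {
        var hits = new List<string>();
        foreach (string[] card in Cards)
        {
            if (BuildPattern(card[1], card[2]).IsMatch(title))
            {
                hits.Add(card[0]);
            }
        }
        return hits;
    }

    static void Main()
    {
        string title = "XFX Radeon RX 570 RS XXX Edition, 4GB GDDR5, RX-570P4DFD6/RX-570";
        Console.WriteLine(string.Join(", ", FindMatches(title))); // Radeon RX 570
    }
}
```

If FindMatches returns more than one entry, the title is ambiguous and needs the manual sorting out mentioned above.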

||| Drakon (Desktop Build) |||

|| CPU: 3800X || Cooler: Kraken X63 || Motherboard: B450 Aorus M || Memory: HyperX DDR4-3200MHz 16G ||

|| Storage: 512GB 970 Pro + 500GB 850 EVO + 250GB 850 EVO + 1TB HDD + 2TB HDD || Graphics Card: RX 5700 XT Red Devil || Case: Thermaltake Core V21 || PSU: XFX XTR 750W 80+Gold || 

 

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10073323

On 1.7.2017 at 3:25 AM, Erik Sieghart said:

Fuzzy searching is easier to use than regex.

 

The Levenshtein distance is the number of insertions, deletions, and substitutions necessary to change one word into another. To turn "cat" into "bat" you would substitute the c with a b and then they would be the same word, so the distance is 1. Easy. Its usage is theoretically easy as well:

int distance = LevenshteinDistance(stringA, stringB);

You can use this value as a "similarity rating" of how close one word is to another. Let's say your token is "GT": you can match it to "GTX" with a confidence based on the Levenshtein distance. It's not as complicated as people may have you believe.
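
For reference, a plain textbook version of the function could look like this. It is a hand-rolled sketch, not a particular library's implementation, and the Similarity helper (distance scaled by the longer length) is just one possible way to turn the distance into a rating:

```csharp
using System;

static class Fuzzy
{
    // Classic dynamic-programming Levenshtein distance on characters.
    public static int LevenshteinDistance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        // Turning a prefix into the empty string costs one deletion per character.
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,   // deletion
                                            d[i, j - 1] + 1),  // insertion
                                   d[i - 1, j - 1] + cost);    // substitution
            }
        }
        return d[a.Length, b.Length];
    }

    // 1.0 means identical, 0.0 means nothing in common.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)LevenshteinDistance(a, b) / longest;
    }

    static void Main()
    {
        Console.WriteLine(LevenshteinDistance("cat", "bat")); // 1
        Console.WriteLine(LevenshteinDistance("GT", "GTX"));  // 1
        Console.WriteLine(Similarity("GT", "GTX"));
    }
}
```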

It looks like I'm gonna cut up the titles at every occurrence of a space and every time digits switch to letters or the other way around. Then I'm gonna try to implement my own version of the Levenshtein distance, using those fragments as the units instead of individual characters.

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10085471

OK, it is done! In the end I threw all the code away because it started to get way too confusing, and rewrote it. The result is pretty simple and basic :). Now maybe only a bit of performance enhancement is left to do.

using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        string title = "XFX Radeon RX 570 RS XXX Edition, 4GB GDDR5, RX-570P4DFD6/RX-570";
        string model = "Radeon RX 570";
        string[] parts = splitintoparts(title);
        int diff = levenshtein(parts, splitintoparts(model));

        Console.WriteLine(String.Join(" ; ", parts));
        Console.WriteLine(diff);
    }

    // Levenshtein distance where the unit is a whole part instead of a character
    static int levenshtein(string[] s1, string[] s2)
    {
        int n = s1.Length;
        int m = s2.Length;
        int[,] d = new int[n + 1, m + 1];

        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // first column and first row: i deletions or j insertions
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }
        for (int j = 0; j <= m; j++)
        {
            d[0, j] = j;
        }

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s2[j - 1] == s1[i - 1]) ? 0 : 1;

                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,
                                            d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }

    // cuts the string at separator characters and wherever digits and non-digits meet
    static string[] splitintoparts(string whole)
    {
        string splitchars = " ,.-/";
        List<string> parts = new List<string>();
        int lastsplit = 0;

        // i runs one past the end so the last part is cut off as well
        for (int i = 0; i <= whole.Length; i++)
        {
            bool cut = i == whole.Length
                || splitchars.IndexOf(whole[i]) >= 0
                || (i > 0 && Char.IsDigit(whole[i - 1]) != Char.IsDigit(whole[i]));
            if (!cut)
            {
                continue;
            }

            string parttoadd = strip(whole.Substring(lastsplit, i - lastsplit));
            if (!String.IsNullOrWhiteSpace(parttoadd))
            {
                // each part is kept only once
                if (!parts.Contains(parttoadd))
                {
                    parts.Add(parttoadd);
                }
                lastsplit = i;
            }
        }
        return parts.ToArray();
    }

    // removes separator characters and upper-cases the rest
    static string strip(string before)
    {
        string specialchars = " ,.-/";
        string after = "";

        for (int j = 0; j < before.Length; j++)
        {
            if (specialchars.IndexOf(before[j]) < 0)
            {
                after += before[j];
            }
        }
        return after.ToUpper();
    }
}

this way parts correctly contains the following entries:

XFX ; RADEON ; RX ; 570 ; RS ; XXX ; EDITION ; 4 ; GB ; GDDR ; 5 ; P ; DFD ; 6 

 

and diff is 11 :)
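
To show how such a distance separates "Radeon RX 570" from "GeForce GTX 570", here is a small self-contained sketch. It is not the code above: Tokenize is a simplified regex-based stand-in for splitintoparts, and the class BestModelPicker and the PickBest method are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BestModelPicker
{
    // Simplified stand-in for splitintoparts: runs of letters or runs of digits,
    // upper-cased, each token kept only once in order of first appearance.
    public static string[] Tokenize(string text)
    {
        var seen = new HashSet<string>();
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text.ToUpperInvariant(), @"[A-Z]+|\d+"))
        {
            if (seen.Add(m.Value)) tokens.Add(m.Value);
        }
        return tokens.ToArray();
    }

    // Levenshtein distance with whole tokens as the unit.
    public static int TokenDistance(string[] a, string[] b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Returns the model with the smallest token distance to the title
    // (the first one wins on a tie).
    public static string PickBest(string title, IEnumerable<string> models)
    {
        string[] titleTokens = Tokenize(title);
        string best = null;
        int bestDistance = int.MaxValue;
        foreach (string model in models)
        {
            int distance = TokenDistance(titleTokens, Tokenize(model));
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = model;
            }
        }
        return best;
    }

    static void Main()
    {
        string title = "XFX Radeon RX 570 RS XXX Edition, 4GB GDDR5, RX-570P4DFD6/RX-570";
        Console.WriteLine(PickBest(title, new[] { "GeForce GTX 570", "Radeon RX 570" })); // Radeon RX 570
    }
}
```

For this title the distance to "Radeon RX 570" is 11 and the distance to "GeForce GTX 570" is 13, so the Radeon entry wins.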

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10099968

So, a quick and probably last update:

Sadly the website only allows the first 50 pages to be loaded, so that's about 1200 offers out of about 7500. Sounds bad, but hey, better than nothing.

Interestingly enough, the best price/PassMark score right now seems to belong to the 500 series from Nvidia, though that probably has something to do with them being handed out for 10-30 bucks.

Also, I can probably adapt this very easily for CPUs as well.

I mean, startup time until the calculations are done is about 2 minutes, but hey, who cares.

https://linustechtips.com/topic/799297-c-matching-strings/#findComment-10105260
