
Retrieving data from a website (HTML) in C#

lubblig

So, I'm in the process of building myself a simple C# program that gets data from divs, paragraphs, etc. on HTML websites. This is because I only want to retrieve some of the data on the site, not all of the fancy pictures and so on.

 

Therefore I would like to either load the built-in WebBrowser control with Visible = false, or download the website's source code, so that I don't have to see the page but can still get data from, for example, <div class="blah">DATA I want</div> and <p>Other data I want</p>, and skip everything else like <img src="images/blah.img" />.

 

So then, when the webpage has loaded out of sight, I can retrieve the data I want (like "DATA I want" and "Other data I want") and store it in strings.

 

 

So, my question is pretty much how to get a page in a simple way, find the specific data I want, and convert only that into strings that I can display however I want in my C# program.

 

 

Please ask if there is anything unclear about this. I can understand if there is, since I had trouble formulating this in English...

 

Thanks


What you are talking about is website scraping. If you had carried out a search on Google, you'd have seen an absolute ton of resources detailing, more or less exactly, how to accomplish what you are after.


You could simply save the HTML as a string and then parse it however you like.

 

That is indeed the basic gist of the process. You can use regular expressions (Regex) for the parsing.
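For illustration, here is a minimal sketch of that idea (the URL and the "blah" class are just placeholders from the original post, not a real site); keep in mind that regex only holds up for simple, predictable markup:

using System;
using System.Net;
using System.Text.RegularExpressions;

class Scraper
{
    static void Main()
    {
        // Download the raw HTML as a single string (placeholder URL).
        string html = new WebClient().DownloadString("http://example.com/");

        // Grab the contents of every <div class="blah">...</div>.
        foreach (Match m in Regex.Matches(html, "<div class=\"blah\">(.*?)</div>", RegexOptions.Singleline))
        {
            Console.WriteLine(m.Groups[1].Value.Trim());
        }
    }
}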


What you are talking about is website scraping. If you had carried out a search on Google, you'd have seen an absolute ton of resources detailing, more or less exactly, how to accomplish what you are after.

Oh, I didn't know it was called that. Thanks. My searches didn't really result in anything, which is why I asked here.

 

I'll google it as soon as I've got some spare time :)


You could simply save the HTML as a string and then parse it however you like.

Huh, that's pretty smart. Why the heck didn't I think of that ;)


Oh, I didn't know it was called that. Thanks. My searches didn't really result in anything, which is why I asked here.

 

I'll google it as soon as I've got some spare time :)

 

Huh, that's pretty smart. Why the heck didn't I think of that ;)

 

Why even waste time reinventing the wheel? Have a look at HtmlAgilityPack.
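For reference, a minimal HtmlAgilityPack sketch (assuming the NuGet package is installed; the URL and the "blah" class are placeholders taken from the original post, not real values):

using System;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        // Download and parse the page in one step.
        var doc = new HtmlWeb().Load("http://example.com/");

        // Select every <div class="blah"> and every <p> via XPath.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='blah'] | //p");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                Console.WriteLine(node.InnerText.Trim());
            }
        }
    }
}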


Why even waste time reinventing the wheel? Have a look at HtmlAgilityPack.

The Agility Pack is great, but I think he wants to build the functionality from scratch.


I actually have a decent amount of practice with this! Alright, so to start off you need to get the raw HTML from the website that you want. The simplest implementation I have for that is as follows:

// Requires: using System.Net;
public static string retrieveData(string url)
{
    // Download the page's HTML source as a single string.
    WebClient connection = new WebClient();
    return connection.DownloadString(url);
}

This is used in a static context, such as a static class, not on an object instance (although you could use it that way by removing static).

 

Now, finding and extracting the stuff you want is a bit tricky to explain without knowing the page and its format, but usually how I go about this is to find a string that occurs just before what you want and a string that occurs just after it. So, for example, if your data is

<div class="specialclass">THIS IS WHAT YOU WANT</div>

then the two strings you will use are <div class="specialclass"> and </div>.

 

Once you have the raw HTML from the function above, you need to split it up using built-in string functions and arrays. Keeping with the implementation above, the code will look something like this:

// Pretend we are in a function somewhere.
string res = retrieveData("http://websitething.com/");

// res is the full HTML, so now we split on the opening div tag.
string[] firstarray = res.Split(new string[] { "<div class=\"specialclass\">" }, StringSplitOptions.None);

// firstarray is split in two: the first part is the HTML you don't want,
// and the second part starts with the data you do want. So now we split it again!
string[] secondarray = firstarray[1].Split(new string[] { "</div>" }, StringSplitOptions.None);

// We took the second part of the first array and split it again; now the first
// part is what we want and the second part is just rubbish.
string datayouwant = secondarray[0];

// Now you have your data!

Now, this is a simple way of looking at it that you can follow easily; however, if I were programming this with space in mind, I would implement it as follows:

string data = retrieveData("http://somewebsite.com")
    .Split(new string[] { "<div class=\"specialclass\">" }, StringSplitOptions.None)[1]
    .Split(new string[] { "</div>" }, StringSplitOptions.None)[0];

This should yield the same result without a bunch of variables taking up memory.

 

Also, understand that you should put this in a try/catch, or at least check each array to see whether the split actually found the marker, because if it didn't, calling [1] on the array will throw an index-out-of-range error.
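For example, a guarded version of the split could look something like this (a sketch that reuses the retrieveData helper and the placeholder markers from above):

string res = retrieveData("http://websitething.com/");
string[] firstarray = res.Split(new string[] { "<div class=\"specialclass\">" }, StringSplitOptions.None);

// If the marker wasn't found, Split returns a single-element array,
// so check the length before indexing into [1].
if (firstarray.Length > 1)
{
    string datayouwant = firstarray[1].Split(new string[] { "</div>" }, StringSplitOptions.None)[0];
    Console.WriteLine(datayouwant);
}
else
{
    Console.WriteLine("Marker not found in the page.");
}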

 

Hope this helps!


I actually have a decent amount of practice with this! [...] Hope this helps!

 

Awesome, thanks! I haven't been able to work on my project due to school, but as soon as I have some spare time I'll try to come up with something on my own. Thank you for this, though. It's great seeing an example and getting ideas for how I plan on doing it (and, as a worst-case scenario, I'll borrow some of your code if what I'm thinking of fails ;) ).

