
Retrieving data from a website (HTML) in C#

lubblig

So, I'm in the process of building myself a simple C# program that gets data from divs, paragraphs, etc. on HTML websites. This is because I only want to retrieve some of the data on the site, not all of the fancy pictures and so on.

 

Therefore I would like to either load the built-in WebBrowser control with Visible = false, or download the website's source code, so that I don't have to see the page but can still get data from, for example, <div class="blah">DATA I want</div> and <p>Other data I want</p>, and skip everything else like <img src="images/blah.img" />.

 

So then, when the webpage has loaded out of sight, I can retrieve the data I want (like "DATA I want" and "Other data I want") and store it in strings.

 

 

So, my question is pretty much how to get a page in a simple way, find the specific data I want, and convert only that into strings that I can display however I want in my C# program.

 

 

Please ask if there is anything unclear about this. I can understand if there is, since I had trouble formulating this in English...

 

Thanks


What you are talking about is website scraping. If you had carried out a search on Google, you'd have seen an absolute ton of resources detailing, more or less exactly, how to accomplish what you are after.


You could simply save the HTML as a string and then parse it however you like.

 

That is indeed the basic gist of the process. You can use regular expressions (Regex) for the parsing.
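For illustration, here is a minimal sketch of that idea (the URL and the "blah" class are just placeholders from the original post, not a real site); keep in mind that regex only holds up for simple, predictable markup:

using System;
using System.Net;
using System.Text.RegularExpressions;

class Scraper
{
    static void Main()
    {
        // Download the raw HTML as a single string (placeholder URL).
        string html = new WebClient().DownloadString("http://example.com/");

        // Grab the contents of every <div class="blah">...</div>.
        foreach (Match m in Regex.Matches(html, "<div class=\"blah\">(.*?)</div>", RegexOptions.Singleline))
        {
            Console.WriteLine(m.Groups[1].Value.Trim());
        }
    }
}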


What you are talking about is website scraping. If you had carried out a search on Google, you'd have seen an absolute ton of resources detailing, more or less exactly, how to accomplish what you are after.

Oh, I didn't know it was called that. Thanks. My searches didn't really result in anything, which is why I asked here.

 

I'll google it as soon as I've got some spare time :)


You could simply save the HTML as a string and then parse it however you like.

Huh, that's pretty smart. Why the heck didn't I think of that ;)


Oh, I didn't know it was called that. Thanks. My searches didn't really result in anything, which is why I asked here.

 

I'll google it as soon as I've got some spare time :)

 

Huh, that's pretty smart. Why the heck didn't I think of that ;)

 

Why even waste time reinventing the wheel? Have a look at HtmlAgilityPack.
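For reference, a minimal HtmlAgilityPack sketch (assuming the NuGet package is installed; the URL and the "blah" class are placeholders taken from the original post, not real values):

using System;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        // Download and parse the page in one step.
        var doc = new HtmlWeb().Load("http://example.com/");

        // Select every <div class="blah"> and every <p> via XPath.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='blah'] | //p");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                Console.WriteLine(node.InnerText.Trim());
            }
        }
    }
}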


Why even waste time reinventing the wheel? Have a look at HtmlAgilityPack.

The Agility Pack is great, but I think he wants to build the functionality from scratch.


I actually have a decent amount of practice with this! Alright, so to start off you need to get the raw HTML from the website that you want. The simplest implementation I have for that is as follows:

// Requires: using System.Net;
public static string retrieveData(string url)
{
    // Download the page's HTML source as a single string.
    WebClient connection = new WebClient();
    return connection.DownloadString(url);
}

This is used in a static context, such as a static class, not on an object instance (although you could use it that way by removing static).

 

Now, finding and extracting the stuff you want is a bit tricky to explain without knowing the page and its format, but usually how I go about this is to find a string that occurs just before what you want and a string that occurs just after it. So, for example, if your data is

<div class="specialclass">THIS IS WHAT YOU WANT</div>

then the two strings you will use are <div class="specialclass"> and </div>.

 

Once you have the raw HTML from the function above, you need to split it up using built-in string functions and arrays. Keeping with the implementation above, the code will look something like this:

// Pretend we are in a function somewhere.
string res = retrieveData("http://websitething.com/");

// res is the full HTML, so now we split on the opening div tag.
string[] firstarray = res.Split(new string[] { "<div class=\"specialclass\">" }, StringSplitOptions.None);

// firstarray is split in two: the first part is the HTML you don't want,
// and the second part starts with the data you do want. So now we split it again!
string[] secondarray = firstarray[1].Split(new string[] { "</div>" }, StringSplitOptions.None);

// We took the second part of the first array and split it again; now the first
// part is what we want and the second part is just rubbish.
string datayouwant = secondarray[0];

// Now you have your data!

Now, this is a simple way of looking at it that you can follow easily; however, if I were programming this with space in mind, I would implement it as follows:

string data = retrieveData("http://somewebsite.com")
    .Split(new string[] { "<div class=\"specialclass\">" }, StringSplitOptions.None)[1]
    .Split(new string[] { "</div>" }, StringSplitOptions.None)[0];

This should yield the same result without a bunch of variables taking up memory.

 

Also, understand that you should put this in a try/catch, or at least check each array to see whether the split actually found the marker, because if it didn't, calling [1] on the array will throw an index-out-of-range error.
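For example, a guarded version of the split could look something like this (a sketch that reuses the retrieveData helper and the placeholder markers from above):

string res = retrieveData("http://websitething.com/");
string[] firstarray = res.Split(new string[] { "<div class=\"specialclass\">" }, StringSplitOptions.None);

// If the marker wasn't found, Split returns a single-element array,
// so check the length before indexing into [1].
if (firstarray.Length > 1)
{
    string datayouwant = firstarray[1].Split(new string[] { "</div>" }, StringSplitOptions.None)[0];
    Console.WriteLine(datayouwant);
}
else
{
    Console.WriteLine("Marker not found in the page.");
}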

 

Hope this helps!


I actually have a decent amount of practice with this! [...] Hope this helps!

 

Awesome, thanks! I haven't been able to work on my project due to school, but as soon as I have some spare time I'll try to come up with something on my own. Thank you for this, though. It's great seeing an example and getting ideas for how I plan on doing it (and, as a worst-case scenario, I'll borrow some of your code if what I'm thinking of fails ;) ).

