
Trying to get a program that takes information about a website

PedroBarbosa

Hey!

I'm trying to pull information from a site like the Steam market:

http://steamcommunity.com/market/search?appid=730#p1_price_asc

I want to write a program that accesses the website and extracts the weapons and their respective prices, just to test some charts.

What's the best way to do that?

 

I'm thinking of something like:

using System.Net;

using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://steamcommunity.com/market/search?appid=730#p1_price_asc");
}

And then work with the htmlCode string to extract the information?


Python with BeautifulSoup.

Or whatever C#'s equivalent is, but Python would probably be the easiest.

--Neil Hanlon

Operations Engineer


I assume the variable htmlCode contains the entire HTML document as a string?

If so, every individual result is wrapped in a block like this:

<div class="market_listing_row market_recent_listing_row market_listing_searchresult" id="result_0">
    <img id="result_0_image" src="http://steamcommunity-a.akamaihd.net/economy/image/fWFc82js0fmoRAP-qOIPu5THSWqfSmTELLqcUywGkijVjZYMUrsm1j-9xgEObwgfEh_nvjlWhNzZCveCDfIBj98xqodQ2CZknz51O_W0DyR3TR7HA7JfX_Q3ywTlDi8mppNiBYS087hSL13s5oeVZ7d4ONEfF5ODUvWBZgmp6Ro5g6JfKcSP8ynxnXO-jzlq88o/62fx62f" style="border-color: #D2D2D2;" class="market_listing_item_img" alt="">
    <div class="market_listing_right_cell market_listing_their_price">
        <span class="market_table_value">
            Vanaf:<br>
            <span style="color:white">$0.03 USD</span>
        </span>
        <span class="market_arrow_down" style="display: none"></span>
        <span class="market_arrow_up" style="display: none"></span>
    </div>
    <div class="market_listing_right_cell market_listing_num_listings">
        <span class="market_table_value">
            <span class="market_listing_num_listings_qty">6,259</span>
        </span>
    </div>
    <div class="market_listing_item_name_block">
        <span id="result_0_name" class="market_listing_item_name" style="color: #D2D2D2;">Nova | Predator (Minimal Wear)</span>
        <br>
        <span class="market_listing_game_name">Counter-Strike: Global Offensive</span>
    </div>
</div>

Your task is then to:

- Cut away everything in the string before and after this list of results.

- Turn the individual result elements into objects. (I don't know about C#, I'm a Java guy.)
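The two steps above can be sketched in Python using only the standard library. This is a rough illustration, not a tested scraper: the Listing class, ListingParser and parse_listings are hypothetical names, and the rule for finding the price (the span styled "color:white") is an assumption taken from the single sample row quoted above.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Trimmed copy of the result row quoted above (image and arrow spans omitted).
SAMPLE = """
<div class="market_listing_row market_recent_listing_row market_listing_searchresult" id="result_0">
  <div class="market_listing_right_cell market_listing_their_price">
    <span class="market_table_value">Vanaf:<br><span style="color:white">$0.03 USD</span></span>
  </div>
  <div class="market_listing_right_cell market_listing_num_listings">
    <span class="market_table_value"><span class="market_listing_num_listings_qty">6,259</span></span>
  </div>
  <div class="market_listing_item_name_block">
    <span id="result_0_name" class="market_listing_item_name" style="color: #D2D2D2;">Nova | Predator (Minimal Wear)</span>
    <br>
    <span class="market_listing_game_name">Counter-Strike: Global Offensive</span>
  </div>
</div>
"""


@dataclass
class Listing:
    name: str = ""
    price: str = ""
    quantity: str = ""


class ListingParser(HTMLParser):
    """Creates one Listing per <div class="market_listing_row ...">."""

    # CSS class on a <span> -> Listing field that its text fills.
    FIELDS = {
        "market_listing_item_name": "name",
        "market_listing_num_listings_qty": "quantity",
    }

    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "market_listing_row" in classes:
            self.listings.append(Listing())
        elif tag == "span" and self.listings:
            if attrs.get("style") == "color:white":
                # Assumption from the sample: only the price span has this style.
                self._field = "price"
            else:
                self._field = next(
                    (f for css, f in self.FIELDS.items() if css in classes), None)

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._field and self.listings:
            current = self.listings[-1]
            setattr(current, self._field, getattr(current, self._field) + data.strip())


def parse_listings(html):
    """Return a Listing object for every result row found in html."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.listings


print(parse_listings(SAMPLE))
```

Because the parser only reacts to rows with the market_listing_row class, there is no need to cut the string manually first; everything outside the result list is simply ignored.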



Python:

__author__ = 'Neil'

# Python 2 script (urllib2 and the print statement); requires BeautifulSoup 4.
from bs4 import BeautifulSoup
import urllib2


def main():
    steam_page = urllib2.urlopen("http://steamcommunity.com/market/search?appid=730#p1_price_asc")
    soup = BeautifulSoup(steam_page)
    # Every search result sits in its own "market_listing_row" div.
    items = soup.findAll("div", {"class": "market_listing_row"})
    item_list = []
    for item in items:
        item_name = item.find("span", {"class": "market_listing_item_name"}).contents[0].strip()
        # The price is the nested span inside the first "market_table_value" span.
        market_value = item.find("span", {"class": "market_table_value"}).contents[3].contents[0].strip("$ ")
        game_name = item.find("span", {"class": "market_listing_game_name"}).contents[0].strip()
        quantity = item.find("span", {"class": "market_listing_num_listings_qty"}).contents[0].strip()
        item_list.append({
            "name": item_name,
            "value": market_value,
            "game": game_name,
            "quantity": quantity
        })
    print item_list


if __name__ == '__main__':
    main()

https://gist.github.com/NeilHanlon/c4e746765f9475dba43b

 

This does what you want, I believe.

 

[Screenshot: 2015-05-23_1038.png]

 

You can also easily write this out as CSV to use in Excel, etc.
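As a rough idea of what the CSV step could look like, here is a small sketch using Python's csv module. It is written for Python 3 (unlike the Python 2 script above), items_to_csv is a hypothetical helper name, and the sample row simply reuses the values from the result row quoted earlier in the thread.

```python
import csv
import io


def items_to_csv(item_list):
    """Render a list of item dicts as CSV text with a header row."""
    fieldnames = ["name", "value", "game", "quantity"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(item_list)
    return buffer.getvalue()


# One row in the same shape the script above appends to item_list.
sample = [{
    "name": "Nova | Predator (Minimal Wear)",
    "value": "0.03 USD",
    "game": "Counter-Strike: Global Offensive",
    "quantity": "6,259",
}]
print(items_to_csv(sample))
```

The csv module quotes the quantity automatically because it contains a comma, so the file should open cleanly in a spreadsheet. To write straight to disk instead, the same DictWriter can be given a file opened with newline="".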

--Neil Hanlon

Operations Engineer


Here's what I've been using recently in C#. It requires:

using System.Text.RegularExpressions;

public string GetBetween(string main, string start, string finish, int index = 0)
{
    // Lazily match the shortest text between the escaped start and finish markers.
    Match gbMatch = new Regex(Regex.Escape(start) + "(.+?)" + Regex.Escape(finish)).Match(main, index);
    if (gbMatch.Success)
    {
        return gbMatch.Groups[1].Value;
    }
    else
    {
        return string.Empty;
    }
}

Usage:

string strhtml = "<p>hello world</p>";
string text = GetBetween(strhtml, "<p>", " world</p>");

text will be "hello".
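For comparison with the Python suggestions elsewhere in the thread, the same "text between two markers" idea might look like this in Python. It is only a sketch; get_between is a hypothetical name mirroring the C# helper above.

```python
import re


def get_between(main, start, finish, index=0):
    """Return the shortest text between start and finish, or "" if absent."""
    # Escape both markers so characters like "<" or "?" are matched literally.
    pattern = re.compile(re.escape(start) + "(.+?)" + re.escape(finish))
    match = pattern.search(main, index)
    return match.group(1) if match else ""


print(get_between("<p>hello world</p>", "<p>", " world</p>"))
```

One thing to keep in mind with this approach: by default "." does not match a newline in either .NET or Python regular expressions, so markers that sit on different lines of the HTML will not match unless single-line mode (RegexOptions.Singleline in .NET, re.DOTALL in Python) is enabled.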



The thing is: what you see on the website differs from the source code...

If you try to get the source code of this page:

http://steamcommunity.com/market/search?appid=730#p1_price_asc

you will get the source code of this one: http://steamcommunity.com/market/search?appid=730

And I would like to do that over and over again, going through all the pages in a loop (that's easy), but the source code doesn't change whether I request:

http://steamcommunity.com/market/search?appid=730#p1_price_asc or http://steamcommunity.com/market/search?appid=730#p2_price_asc

It's always the same.
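The reason is that everything after the "#" is a URL fragment, which HTTP clients keep to themselves and never send to the server; presumably the page's own JavaScript reads it and loads the matching results afterwards. A small Python sketch to illustrate (server_url is a hypothetical helper name):

```python
from urllib.parse import urldefrag


def server_url(url):
    """Return the part of the URL that is actually sent to the server."""
    return urldefrag(url).url


for page in (1, 2):
    url = "http://steamcommunity.com/market/search?appid=730#p%d_price_asc" % page
    print(url, "->", server_url(url))
```

Both URLs reduce to the same request, which is why the downloaded source never changes between "pages".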



 

Ah, so the problem is that the link does not lead directly to the filtered and re-sorted page :/

Try filtering locally on "Counter-Strike: Global Offensive"? (Edit: that does not solve it.)



Do you know how many pages there are?

You could do a lookup first to get the total, and then pull all the data; that cuts it all down to 2 requests.

{"success":true,"start":0,"pagesize":10,"total_count":4379,"results_html":.... }

You can look at total_count.

So the requests could look like:

// lookup: (count=1)
http://steamcommunity.com/market/search/render/?query=&start=0&count=1&search_descriptions=0&sort_column=price&sort_dir=asc&appid=730 ->
{"success":true,"start":0,"pagesize":1,"total_count":4377,"results_html": ... }

// fetch: (count=4377)
http://steamcommunity.com/market/search/render/?query=&start=0&count=4377&search_descriptions=0&sort_column=price&sort_dir=asc&appid=730 ->
{"success":true,"start":0,"pagesize":100,"total_count":4377,"results_html": ... }

Edit:

Ah okay, they have a limit of 100 results per request, so you need to pull 100 at a time.

But you can work out the offsets from pagesize and total_count, then change start to page * 100 for the page you want to look at.

4377/100 = 43.77 pages, so 44 requests. Well, still better than pulling 10 at a time :)
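The arithmetic above can be sketched in a few lines of Python. page_url and page_starts are hypothetical helper names; the query parameters and the cap of 100 are taken from this post and may of course change on Steam's side.

```python
import math
from urllib.parse import urlencode

RENDER_URL = "http://steamcommunity.com/market/search/render/"
PAGE_SIZE = 100  # per-request cap observed in this post


def page_url(start, count=PAGE_SIZE):
    """Build a render URL with the same query parameters as the post."""
    params = [
        ("query", ""),
        ("start", start),
        ("count", count),
        ("search_descriptions", 0),
        ("sort_column", "price"),
        ("sort_dir", "asc"),
        ("appid", 730),
    ]
    return RENDER_URL + "?" + urlencode(params)


def page_starts(total_count, page_size=PAGE_SIZE):
    """Return the start offset of every page needed to cover total_count."""
    pages = math.ceil(total_count / page_size)
    return [page * page_size for page in range(pages)]


starts = page_starts(4377)
print(len(starts), "requests, last one starts at", starts[-1])
print(page_url(starts[-1]))
```

Rounding up with math.ceil matters here: the last page holds only the remaining 77 items, but it still costs one request.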


I don't have enough time to give you a full response, but:

The method you are using right now should work fine.

If you haven't invested too much time in the project, I would suggest switching to the HTML Agility Pack.

It was designed specifically for working with pages and extracting information from them, and it gets rid of a lot of the awkward string manipulation.

You can grab the package through NuGet, assuming you are running a newer version of Visual Studio. If not, you may need to install the NuGet plugin.


I made a small node application to show you one way you can run through all pages.

var request = require('request'), //https://www.npmjs.com/package/request
    cheerio = require('cheerio'); //https://www.npmjs.com/package/cheerio

function getUrl(offset){
    return "http://steamcommunity.com/market/search/render/?query=&start="+offset+"&count=100&search_descriptions=0&sort_column=price&sort_dir=asc&appid=730";
}

var offset = 0,
    target = 0,
    result = [];

var pull = function(error, response, raw){
    if (!error) {
        try{
            var json = JSON.parse(raw);
            target = json.total_count;
            offset = json.start + 100;
            parseItems(json.results_html);
            if(target > offset){
                request(getUrl(offset), pull);
            }else{
                //done loading all pages: do something with result array
            }
        }catch(err){
            console.log("error",err);
            console.log("data", raw);
        }
    }else{
        console.log(error);
    }
};

var parseItems = function(html){
    var $ = cheerio.load(html, {xmlMode: true});
    $(".market_listing_row").each(function(index, item){
        result.push({
            "name"      : $(item).find(".market_listing_item_name").text(),
            "price"     : $(item).find(".market_table_value span").first().text(),
            "quantity"  : $(item).find(".market_listing_num_listings_qty").text()
        });
    });
};

request(getUrl(offset), pull);

 


Node? Can I "run" Node in Visual Studio 2013?

//EDIT: Yes, I can. I'm going to test it out... thanks so far.

The only problem is that Steam has some kind of anti-spam measure, I think: after some pages it just returns "null" for a while, and that crashes the program.



 

Then you need to set a timeout between the pulls.

if(target > offset){
  setTimeout(function(){ request(getUrl(offset), pull); }, 5000); //5sec delay
}else{
  //done loading all pages: do something with result array
}
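For anyone following along in Python rather than Node, the same "wait between pulls" idea could be sketched like this. It is an illustration under assumptions: fetch is a hypothetical caller-supplied function that returns the raw response body for a given start offset, the throttled response is assumed to be the literal JSON value null (as described above), and the retry count and back-off are arbitrary choices.

```python
import json
import time


def fetch_all(fetch, delay=5.0, max_retries=3, sleep=time.sleep):
    """Collect results_html from every page, pausing between requests.

    fetch(start) is a hypothetical function returning the raw JSON body
    for the page that begins at offset start.
    """
    pages, start, total = [], 0, None
    while total is None or start < total:
        for attempt in range(max_retries):
            data = json.loads(fetch(start))
            if data:  # a throttled "null" body parses to None
                break
            sleep(delay * (attempt + 2))  # wait longer before each retry
        else:
            raise RuntimeError("no data for start=%d after retries" % start)
        total = data["total_count"]
        pages.append(data["results_html"])
        start = data["start"] + 100
        if start < total:
            sleep(delay)  # same 5 second pause as the setTimeout above
    return pages
```

Passing sleep and fetch in as parameters is just a convenience so the loop can be tried out with fake data before pointing it at the real endpoint. The key difference from the Node version is that a null response is retried after a longer pause instead of crashing on the missing total_count.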

 


OK, thanks!

One more thing: is there a way to change the language of the page?

I'm Portuguese, and if I go to the page myself I see everything in Portuguese, but when my program requests the page it gets the US version. The problem is that the prices then come back in USD, and a few other things are different too.

I only need to know the URL if possible; the code I can manage myself :)





I guess it's taken from the browser's language header:

Accept-Language: da-DK,da;q=0.8,en-US;q=0.6,en;q=0.4
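If that guess is right, the program would need to send the same kind of header itself. A minimal Python sketch of building such a request follows; localized_request is a hypothetical helper name, the Portuguese language list is just an example value, and whether Steam actually switches language or currency based on this header is not confirmed anywhere in this thread.

```python
from urllib.request import Request

URL = ("http://steamcommunity.com/market/search/render/"
       "?query=&start=0&count=100&search_descriptions=0"
       "&sort_column=price&sort_dir=asc&appid=730")


def localized_request(url, languages="pt-PT,pt;q=0.8,en;q=0.5"):
    """Build (but do not send) a request that asks for Portuguese content."""
    return Request(url, headers={"Accept-Language": languages})


req = localized_request(URL)
# urllib stores header names capitalized, hence the lower-case "l" here.
print(req.get_header("Accept-language"))
```

The request object can then be passed to urllib.request.urlopen to actually send it. The q values only express preference order: Portuguese first, English as a fallback.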