
How do I create a simple web scraping program?

Speaker1264

Does anyone know how to create a simple web scraping program, or a good place to ask/get more information?

 

I want to write a simple program that opens a text file that contains a Newegg web-link, such as this, reads that link, opens it, and retrieves the price of that item, and then saves that price into a new text file.  Anyone have any idea on how to do it?

 

Or, better yet, it would open an Excel file and do the same thing, but overwrite the cell containing the old price.


To get that, you could just use the file_get_contents function in PHP. Then use something like DOMXPath with the element's id (which seems to be "singleFinalPrice"), and with that you will be able to get the price from the "contents" attribute of that element.

 

I am not sure about writing to an Excel spreadsheet, but reading from one is really simple. First save it as a CSV file, and then use something like this to get all of the URLs that you want to pull information from.
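
For anyone who would rather do the CSV step in Ruby, here is a minimal sketch using the standard csv library. It assumes one URL per row in the first column and no header row; the helper name urls_from_csv and the sample rows are made up for illustration.

```ruby
require 'csv'

# Return every non-empty value from the first column of CSV text.
# Assumption: one URL per row, in the first column, with no header row.
def urls_from_csv(csv_text)
  CSV.parse(csv_text)
     .map { |row| row[0] }   # first column only
     .compact                # drop blank rows (their first cell is nil)
     .map(&:strip)
     .reject(&:empty?)
end

# Hypothetical sample data standing in for the exported spreadsheet
sample = "http://www.example.com/item-a,19.99\nhttp://www.example.com/item-b,5.00\n"
urls_from_csv(sample).each { |url| puts url }
```

To read an actual file instead of a string, passing File.read of the file's path to the helper should work the same way.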

 

Hope that helps

Link to comment
Share on other sites

Link to post
Share on other sites

The "singleFinalPrice" element is added dynamically, it won't be in the source.

After screwing around for a bit, there doesn't seem to be a single consistent way to get the price on every page. The most common would be a regex search for "price":(\d+\.\d\d), but again, that won't work on every page.
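
As a rough illustration of that regex search, here is a small Ruby sketch. The helper name extract_price and the sample strings are made up; only the pattern itself comes from the post above, and it returns nil on pages where the pattern does not appear.

```ruby
# Pattern from the post: the literal text "price": followed by digits,
# a dot, and exactly two more digits.
PRICE_PATTERN = /"price":(\d+\.\d\d)/

# Return the first matching price as a Float, or nil when nothing matches.
def extract_price(html)
  match = html.match(PRICE_PATTERN)
  match ? match[1].to_f : nil
end

# Hypothetical snippets standing in for real page source
sample = '{"name":"Some CPU","price":239.99,"currency":"USD"}'
puts extract_price(sample).inspect                          # 239.99
puts extract_price('<html>no price here</html>').inspect    # nil
```

Checking for nil before using the result is what lets a script skip or flag the pages where this approach fails.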


I looked at the source and isn't the price stored in "product_sale_price"?

Then you can make a simple script in, for example, Python to just open the URL, find that value, and parse it into a string/float.

Just a quick idea @Speaker1264

 

UPDATE:

I wrote some very quick code in Ruby, so it's far from perfect, but it (sort of) works:

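require 'net/http'
require 'uri'

# Fetch the raw page source as a string
def open(url)
  Net::HTTP.get(URI.parse(url))
end

page_content = open('http://www.newegg.com/Product/Product.aspx?Item=N82E16819117372')

# On the line that mentions product_sale_price, print the 6 characters starting at offset 33
page_content.split("\n").each do |line|
  puts line[33,6] if line.include? "product_sale_price"
end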

that outputs: "239.99"

Then you can ">>" it to a txt file
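
If you would rather not depend on the fixed offset (line[33,6] only fits a six-character price at that exact position) or on the shell redirect, something like the sketch below might do. It assumes the price is the first decimal number on the product_sale_price line; the helper names, the sample line, and the file name are all made up for illustration.

```ruby
require 'tmpdir'

# Pull the first number shaped like 123.45 out of a line, or nil if there is none.
# Assumption: the price is the first such number on the line.
def price_from_line(line)
  line[/\d+\.\d\d/]
end

# Append one price per line to a text file, creating the file if it is missing.
def append_price(path, price)
  File.open(path, 'a') { |file| file.puts(price) }
end

# Hypothetical line standing in for the real page source
sample_line = "product_sale_price:['1299.99'],"

# Write into a temporary directory so this demo leaves nothing behind
Dir.mktmpdir do |dir|
  path = File.join(dir, 'prices.txt')
  price = price_from_line(sample_line)
  append_price(path, price) if price
  puts File.read(path)
end
```

Opening the file in 'a' mode does the same job as ">>" in the shell: each run adds a new line instead of overwriting the file.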


The "singleFinalPrice" element is added dynamically, it won't be in the source.

Ah yes, thanks, did not even pick that up. I feel stupid now lol.
