Need help with a program

Cryosec

I'm coding a program that will extract the text from a specific HTML tag in webpages loaded by URL from a file (around a hundred links). The problem is, I have no idea how to go through the HTML source of each webpage, read the text from the specific tag (there is only one in each page) and, if needed, delete unwanted text (I need numbers, but there could be unnecessary text with them).

 

What would be the best language to use? And how can I do it?

Computer Case: NZXT S340 || CPU: AMD Ryzen 5 1600 || Cooler: CM Hyper212 Evo || MoBo: MSI B350 Mortar || RAM Vengeance LPX 2x8GB 3200MHz || PSU: Corsair CX600 || SSD: HyperX Fury 120GB & 240GB || HDD: WD Blue 1TB + 1TB 2.5'' backup drive || GPU: Sapphire Nitro+ RX 580 4GB

Laptop 1 HP x360 13-u113nl

Laptop Lenovo z50-75 with AMD FX-7500 || OS: Windows 10 / Ubuntu 17.04

DSLR Nikon D5300 w/ 18-105mm lens

Maybe try PHP 5.

"Learn from yesterday, live for today, hope for tomorrow. The important thing is not to stop questioning." -Albert Einstein

7 minutes ago, Cryosec said:

What would be the best language to use?

What language(s) do you know already? Chances are you won't need to learn a new language just for this task.

 

8 minutes ago, Cryosec said:

And how can I do it?

Look into web scraping.
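
To make "web scraping" a bit more concrete, here is a minimal sketch in Python using only the standard library. It assumes the links live in a plain text file with one URL per line. The function names (`read_urls`, `fetch_html`, `scrape_all`) are made up for this example, and `extract` stands for whatever function you write to pull the value out of a page.

```python
from urllib.request import urlopen


def read_urls(path):
    """Return the non-empty lines of a text file (assumed: one URL per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_html(url, timeout=10):
    """Download one page and decode it to text.

    Falls back to UTF-8 when the server does not declare a charset
    (an assumption that will not suit every site).
    """
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_all(path, extract, fetch=fetch_html):
    """Call extract(html) for every URL in the file.

    Pages that fail to download are reported and stored as None,
    so one dead link does not stop the whole run.
    """
    results = {}
    for url in read_urls(path):
        try:
            results[url] = extract(fetch(url))
        except (OSError, ValueError) as error:
            print(f"skipping {url}: {error}")
            results[url] = None
    return results
```

Passing the download function in as `fetch` is only there so the loop can be tried out without network access. For about a hundred links a plain sequential loop like this should be enough.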

Just now, madknight3 said:

What language(s) do you know already?

C++, Python and Java. I'm still learning other languages, so just these for the moment.

In Java you could use something like Jsoup. It can fetch the page from a URL and parse it for you. The syntax is very easy: you just have to give it a CSS selector to get an HTML element.

Just now, Cr3at1v3 said:

In Java you could use something like Jsoup. It can fetch the page from a URL and parse it for you. The syntax is very easy: you just have to give it a CSS selector to get an HTML element.

I'll try this out, thanks

17 minutes ago, Gachr said:

 

 

Unless the website is so malformed that a real parser won't work, or you're trying to parse inline JS, using regex on HTML is a pretty bad idea.

Using an HTML/XML parser like the ones @madknight3 listed will be much easier, less prone to breaking on small changes to the website, and faster.
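
As a rough illustration of the parser approach, here is a sketch in Python built on the standard library's `html.parser`. The parser finds the element, and a regex is applied only to the already extracted text to keep the numbers. `TagTextExtractor`, `tag_text` and `numbers_in` are names invented for this example, and the `span` with `id="price"` is a made-up page layout. The number pattern assumes plain digits with an optional decimal point, so thousands separators or decimal commas would need a different pattern.

```python
import re
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside the first element matching a tag name (and id)."""

    def __init__(self, tag, element_id=None):
        super().__init__()
        self.tag = tag                # lowercase tag name, e.g. "span"
        self.element_id = element_id  # optional id attribute to match
        self.depth = 0                # > 0 while inside the target element
        self.done = False             # True once the target has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == self.tag:
                self.depth += 1       # nested element with the same tag name
        elif tag == self.tag and (
            self.element_id is None or dict(attrs).get("id") == self.element_id
        ):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def tag_text(html, tag, element_id=None):
    """Return the whitespace-normalised text of the target element ('' if absent)."""
    parser = TagTextExtractor(tag, element_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Assumed number format: optional minus sign, digits, optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def numbers_in(text):
    """Return every number found in plain text as a float."""
    return [float(match) for match in NUMBER.findall(text)]


page = '<p>intro 5</p><span id="price">Price: <b>42.50</b> EUR (approx. 3 items)</span>'
text = tag_text(page, "span", "price")
print(text)              # Price: 42.50 EUR (approx. 3 items)
print(numbers_in(text))  # [42.5, 3.0]
```

The regex here never touches the markup, so a change in the page's attributes or nesting only affects the parser step, not the number cleanup.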
