
Web scraping with Python 3

hirushaadi
Solved by Sauron:

On the first page of the BeautifulSoup documentation there's an example that does what you asked...

 

print(soup.get_text())

 

Purpose: to scrape the first paragraph from Wikipedia

 

The Code:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Web_scraping"
r = requests.get(url)
x = r.content

soup = BeautifulSoup(x, 'html.parser')

all_p_div = soup.find("div", {"class":"mw-parser-output"})
all_p = all_p_div.find_all("p")
first_p = all_p[0]

# print all paragraphs
# for i in range(len(all_p)):
# 	print(f'{all_p[i]}')

print(first_p)

 

output:

 

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is <a href="/wiki/Data_scraping" title="Data scraping">data scraping</a> used for <a href="/wiki/Data_extraction" title="Data extraction">extracting data</a> from <a href="/wiki/Website" title="Website">websites</a>. The web scraping software may directly access the <a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a> using the <a href="/wiki/Hypertext_Transfer_Protocol" title="Hypertext Transfer Protocol">Hypertext Transfer Protocol</a> or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a <a href="/wiki/Internet_bot" title="Internet bot">bot</a> or <a href="/wiki/Web_crawler" title="Web crawler">web crawler</a>. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local <a href="/wiki/Database" title="Database">database</a> or spreadsheet, for later <a href="/wiki/Data_retrieval" title="Data retrieval">retrieval</a> or <a href="/wiki/Data_analysis" title="Data analysis">analysis</a>.

 

What I want is to get the text in plain format, with the <a> tags removed.



Since I'm not sure what your library does or what it is, a quick and dirty solution would be to take the (I assume) string output and run it through a loop with a simple if check that looks for "<a" and "/a>".

 

What it would do is read through the whole first_p and, when the if triggers on "<a", stop adding characters to a temporary string variable. Then, when "/a>" is seen, it starts appending to tempstring again (with whatever method you prefer).

 

There are a bunch of ways to detect the "<a": a simple substring check if you like doing it the old way, split(), and so on.
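A minimal sketch of that idea in Python (assuming first_p from the snippet above is turned into a string with str(first_p); this variant keeps the link text and only skips the <a ...> and </a> markup itself):

html = str(first_p)  # the paragraph as raw HTML text

tempstring = ""
in_a_tag = False
for i, ch in enumerate(html):
    # entering an opening <a ...> or a closing </a> tag
    if html.startswith("<a", i) or html.startswith("</a", i):
        in_a_tag = True
    if not in_a_tag:
        tempstring += ch  # keep normal text (and any non-<a> tags)
    if in_a_tag and ch == ">":
        in_a_tag = False  # tag finished, resume copying

print(tempstring)

Note this is a quick hack: the "<a" check would also trip on tags like <abbr>, so a real HTML parser is still the safer route.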


8 minutes ago, hirusha.adikari said:

Purpose: to scrape the first paragraph from Wikipedia

What I want is to get the text in plain format, with the <a> tags removed.

Why scrape a service that has an API?
https://www.mediawiki.org/wiki/API:Main_page

 

And if you want to access it via Python, here are a couple:

https://pypi.org/project/wikipedia/

https://pypi.org/project/Wikipedia-API/
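The MediaWiki API can also return the intro of an article as plain text directly, so there is nothing to strip afterwards. A rough sketch with requests, using the standard "extracts" query (verify the parameters against the API docs):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",     # TextExtracts: page extracts
    "exintro": 1,           # only the lead section
    "explaintext": 1,       # plain text instead of HTML
    "titles": "Web scraping",
}

data = requests.get(API_URL, params=params).json()
page = next(iter(data["query"]["pages"].values()))
print(page["extract"])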



On the first page of the BeautifulSoup documentation there's an example that does what you asked...

 

print(soup.get_text())
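Applied to the snippet from the first post, that would simply be (first_p as defined there):

print(first_p.get_text())  # the paragraph text with all tags, <a> included, stripped out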

 



19 hours ago, Biohazard777 said:

Why scrape a service that has an API?
https://www.mediawiki.org/wiki/API:Main_page

 

And if you want to access it via Python, here are a couple:

https://pypi.org/project/wikipedia/

https://pypi.org/project/Wikipedia-API/

Tried them before, but for me they were not reliable for some reason.



19 hours ago, Sauron said:

On the first page of the BeautifulSoup documentation there's an example that does what you asked...

 


print(soup.get_text())

 

Thank you, but this works perfectly for me:
 

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p>', 'html.parser')
for a in soup.find_all('a'):
    a.unwrap()  # unwrap() is the current bs4 name for replaceWithChildren()

print(soup)  # <p>Hello Google</p>

 

Link to original article: https://stackoverflow.com/questions/19080957/how-to-remove-all-a-href-tags-from-text
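Applied to the Wikipedia paragraph from the first post, the same approach would look roughly like this (first_p as defined there):

# replace every <a> tag inside the paragraph with its own contents
for a in first_p.find_all('a'):
    a.unwrap()

print(first_p)  # the <a> tags are gone, the link text remains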



1 month later...

You should take a look into Selenium; it's simple to use, and I had a web scraper up and running in 10 minutes. You can specify tags, classes, or even full XPaths that you specifically want to look for. You will need to download the web driver for the browser you're using, and you can use Selenium from most common programming languages (C#, Java, Python, etc.).

 

https://www.selenium.dev/
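For what it's worth, a minimal sketch with the Python bindings (assumes Selenium 4 with the matching web driver available; the CSS selector is the same mw-parser-output guess used in the first post):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome(), etc.
driver.get("https://en.wikipedia.org/wiki/Web_scraping")

# first paragraph inside the article body; .text already strips the markup
first_p = driver.find_element(By.CSS_SELECTOR, "div.mw-parser-output > p")
print(first_p.text)

driver.quit()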

 

 


9 hours ago, ComputerSaysNo said:

You should take a look into Selenium; it's simple to use, and I had a web scraper up and running in 10 minutes. You can specify tags, classes, or even full XPaths that you specifically want to look for. You will need to download the web driver for the browser you're using, and you can use Selenium from most common programming languages (C#, Java, Python, etc.).

 

https://www.selenium.dev/

 

 

Selenium is a browser automation tool. Although it can be used for web scraping, it's most definitely not ideal. Requests with a parser, like what they're doing here, or a purpose-built scraping framework such as Scrapy are much better options.
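For comparison, a minimal Scrapy spider for the same first-paragraph job might look like this (the spider name is made up, and the selector is the same layout assumption as before):

import scrapy

class WikiFirstParagraphSpider(scrapy.Spider):
    # run with: scrapy runspider spider.py -o out.json
    name = "wiki_first_paragraph"
    start_urls = ["https://en.wikipedia.org/wiki/Web_scraping"]

    def parse(self, response):
        first_p = response.css("div.mw-parser-output > p")[0]
        # ::text collects only the text nodes, so the <a> markup disappears
        yield {"text": "".join(first_p.css("::text").getall())}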

