
Web scraping with Python 3

hirushaadi
Solved by Sauron:

On the first page of the BeautifulSoup documentation there's an example that does what you asked...

 

print(soup.get_text())

 

Purpose: to scrape the first paragraph from Wikipedia

 

The Code:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Web_scraping"
r = requests.get(url)
x = r.content

soup = BeautifulSoup(x, 'html.parser')

all_p_div = soup.find("div", {"class":"mw-parser-output"})
all_p = all_p_div.find_all("p")
first_p = all_p[0]

# print all paragraphs
# for i in range(len(all_p)):
# 	print(f'{all_p[i]}')

print(first_p)

 

output:

 

<p><b>Web scraping</b>, <b>web harvesting</b>, or <b>web data extraction</b> is <a href="/wiki/Data_scraping" title="Data scraping">data scraping</a> used for <a href="/wiki/Data_extraction" title="Data extraction">extracting data</a> from <a href="/wiki/Website" title="Website">websites</a>. The web scraping software may directly access the <a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a> using the <a href="/wiki/Hypertext_Transfer_Protocol" title="Hypertext Transfer Protocol">Hypertext Transfer Protocol</a> or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a <a href="/wiki/Internet_bot" title="Internet bot">bot</a> or <a href="/wiki/Web_crawler" title="Web crawler">web crawler</a>. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local <a href="/wiki/Database" title="Database">database</a> or spreadsheet, for later <a href="/wiki/Data_retrieval" title="Data retrieval">retrieval</a> or <a href="/wiki/Data_analysis" title="Data analysis">analysis</a>.

 

What I want is to get the text in plain format, with the <a> tags removed.



Since I'm not sure what your library does or what it is, a quick and dirty solution would be to take the (I assume) string output and run it through a loop with a simple if check that looks for "<a" and "/a>".

 

What it would do is read through the whole first_p and, when the if triggers on "<a", stop adding characters to a temporary string variable. Then, when "/a>" is seen, it starts appending to tempstring again (with whatever method you prefer).

 

There are a bunch of ways to detect the "<a": a simple substring check if you like doing it the old way, split(), and so on.
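A minimal sketch of that idea in Python (assuming first_p from the snippet above is turned into a string with str(first_p); this variant keeps the link text and only skips the <a ...> and </a> markup itself):

html = str(first_p)  # the paragraph as raw HTML text

tempstring = ""
in_a_tag = False
for i, ch in enumerate(html):
    # entering an opening <a ...> or a closing </a> tag
    if html.startswith("<a", i) or html.startswith("</a", i):
        in_a_tag = True
    if not in_a_tag:
        tempstring += ch  # keep normal text (and any non-<a> tags)
    if in_a_tag and ch == ">":
        in_a_tag = False  # tag finished, resume copying

print(tempstring)

Note this is a quick hack: the "<a" check would also trip on tags like <abbr>, so a real HTML parser is still the safer route.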


8 minutes ago, hirusha.adikari said:

Purpose: to scrape the first paragraph from Wikipedia

What I want is to get the text in plain format, with the <a> tags removed.

Why scrape a service that has an API?
https://www.mediawiki.org/wiki/API:Main_page

 

And if you want to access it via Python, here are a couple:

https://pypi.org/project/wikipedia/

https://pypi.org/project/Wikipedia-API/
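The MediaWiki API can also return the intro of an article as plain text directly, so there is nothing to strip afterwards. A rough sketch with requests, using the standard "extracts" query (verify the parameters against the API docs):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",     # TextExtracts: page extracts
    "exintro": 1,           # only the lead section
    "explaintext": 1,       # plain text instead of HTML
    "titles": "Web scraping",
}

data = requests.get(API_URL, params=params).json()
page = next(iter(data["query"]["pages"].values()))
print(page["extract"])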



On the first page of the BeautifulSoup documentation there's an example that does what you asked...

 

print(soup.get_text())
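Applied to the snippet from the first post, that would simply be (first_p as defined there):

print(first_p.get_text())  # the paragraph text with all tags, <a> included, stripped out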

 



19 hours ago, Biohazard777 said:

Why scrape a service that has an API?
https://www.mediawiki.org/wiki/API:Main_page

 

And if you want to access it via Python, here are a couple:

https://pypi.org/project/wikipedia/

https://pypi.org/project/Wikipedia-API/

Tried them before, but for me they were not reliable for some reason.



19 hours ago, Sauron said:

On the first page of the BeautifulSoup documentation there's an example that does what you asked...

 


print(soup.get_text())

 

Thank you, but this works perfectly for me:
 

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p>', 'html.parser')
for a in soup.find_all('a'):
    a.unwrap()  # unwrap() is the current bs4 name for replaceWithChildren()

print(soup)  # <p>Hello Google</p>

 

Link to original article: https://stackoverflow.com/questions/19080957/how-to-remove-all-a-href-tags-from-text
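Applied to the Wikipedia paragraph from the first post, the same approach would look roughly like this (first_p as defined there):

# replace every <a> tag inside the paragraph with its own contents
for a in first_p.find_all('a'):
    a.unwrap()

print(first_p)  # the <a> tags are gone, the link text remains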



1 month later...

You should take a look into Selenium; it's simple to use, and I had a web scraper up and running in 10 minutes. You can specify tags, classes, or even full XPaths that you specifically want to look for. You will need to download the web driver for the browser you're using, and you can use Selenium from most common programming languages (C#, Java, Python, etc.).

 

https://www.selenium.dev/
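For what it's worth, a minimal sketch with the Python bindings (assumes Selenium 4 with the matching web driver available; the CSS selector is the same mw-parser-output guess used in the first post):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome(), etc.
driver.get("https://en.wikipedia.org/wiki/Web_scraping")

# first paragraph inside the article body; .text already strips the markup
first_p = driver.find_element(By.CSS_SELECTOR, "div.mw-parser-output > p")
print(first_p.text)

driver.quit()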

 

 


9 hours ago, ComputerSaysNo said:

You should take a look into Selenium; it's simple to use, and I had a web scraper up and running in 10 minutes. You can specify tags, classes, or even full XPaths that you specifically want to look for. You will need to download the web driver for the browser you're using, and you can use Selenium from most common programming languages (C#, Java, Python, etc.).

 

https://www.selenium.dev/

 

 

Selenium is a browser automation tool. Although it can be used for web scraping, it's most definitely not ideal. Requests with a parser, like what they're doing here, or a purpose-built scraping framework such as Scrapy are much better options.
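For comparison, a minimal Scrapy spider for the same first-paragraph job might look like this (the spider name is made up, and the selector is the same layout assumption as before):

import scrapy

class WikiFirstParagraphSpider(scrapy.Spider):
    # run with: scrapy runspider spider.py -o out.json
    name = "wiki_first_paragraph"
    start_urls = ["https://en.wikipedia.org/wiki/Web_scraping"]

    def parse(self, response):
        first_p = response.css("div.mw-parser-output > p")[0]
        # ::text collects only the text nodes, so the <a> markup disappears
        yield {"text": "".join(first_p.css("::text").getall())}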

