Python Program Not Working Correctly

LtStaffel

Hello everyone,

 

I am trying to write my own Python script that, given a link, finds all the links on that page, then all the links on those pages, and so on. I believe this is called a scraper.

 

However, it apparently isn't finding any links. The only code I didn't write myself is the regex URL-finder string, which I imagine is probably the problem. I'm going to research it and see if I can figure out whether anything is wrong with it, but the code doesn't throw any errors, so I'm a bit lost.

 

import requests
import re
links = []
url = "LINK"
links.append(url)
for i in links:
    urli = requests.get(i)
    text = urli.text
    for j in text:
        link = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', j)
        if type(link) == str:
            links.append(link)
            continue
        elif type(link) == list:
            for k in link:
                links.append(k)
            continue
        else:
            continue
    continue
print " ".join(links)

Thanks in advance.

Join the Appleitionist cause! See spoiler below for answers to common questions that shouldn't be common!

Spoiler

Q: Do I have a virus?!
A: If you didn't click a sketchy email, haven't left your computer physically open to attack, haven't downloaded anything sketchy/free, know that your software hasn't been exploited in a new hack, then the answer is: probably not.

 

Q: What email/VPN should I use?
A: Proton mail and VPN are the best for email and VPNs respectively. (They're free in a good way)

 

Q: How can I stay anonymous on the (deep/dark) webzz???....

A: By learning how to de-anonymize everyone else; if you can do that, then you know what to do for yourself.

 

Q: What Linux distro is best for x y z?

A: Lubuntu for things with little processing power, Ubuntu for normal PCs, and if you need to do anything else then it's best if you do the research yourself.

 

Q: Why is my Linux giving me x y z error?

A: Have you not googled it? Are you sure StackOverflow doesn't have an answer? Does the error tell you what's wrong? If the answer is no to all of those, message me.

 


How many levels deep are you hoping to go? Also, it would be better to use a set rather than a list to store the URLs, as recursive linking could get you stuck in an infinite loop.
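A rough sketch of what that could look like, assuming Python 3 and the requests/regex approach from the original post; max_depth is a made-up parameter to cap the crawl, and the visited set is what stops recursive links from looping forever:

import re
import requests

# the same URL regex from the original post, compiled once
URL_RE = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

def crawl(start_url, max_depth=2):
    visited = set()              # URLs already fetched; a set avoids infinite loops
    frontier = [(start_url, 0)]  # (url, depth) pairs still to visit
    while frontier:
        url, depth = frontier.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            text = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue             # skip pages that fail to load
        # run the regex over the whole page text, not one character at a time
        for link in URL_RE.findall(text):
            if link not in visited:
                frontier.append((link, depth + 1))
    return visited

print(" ".join(crawl("LINK")))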


On 27/11/2016 at 7:12 PM, mtwest said:

How many levels deep are you hoping to go? Also, it would be better to use a set rather than a list to store the URLs, as recursive linking could get you stuck in an infinite loop.

HyperloopText Markup Language

 

@LtStaffel what will you use that for?


21 hours ago, Niemand said:

HyperloopText Markup Language

 

@LtStaffel what will you use that for?

To help find links on a website.



Like @fizzlesticks said, use the BeautifulSoup library.  Something like this:

 

import bs4
from urllib.request import urlopen  # note: urllib.request, not urllib.requests

page = bs4.BeautifulSoup(urlopen(whatever), "html.parser")  # "whatever" is the URL to fetch
links = page.find_all("a")                                  # every <a> tag on the page
links = [i['href'] for i in links if i.has_attr('href')]    # keep only tags that actually have an href

Something like that should get you a list of all links on the page.
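One caveat: href values are often relative paths (e.g. /about), so if you want to feed them back into a crawl you would normally resolve them against the page's URL first. A small sketch, assuming the links list from above and a hypothetical base URL:

from urllib.parse import urljoin

base = "https://example.com/some/page"              # hypothetical URL the links were scraped from
absolute = [urljoin(base, href) for href in links]  # turns "/about" into "https://example.com/about"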

