How do I tell wget not to go outside of the current webpage?
Here is an example wget script that downloads all of the images on a webpage, taken from https://github.com/eduardschaeli/wget-image-scraper/blob/master/scraper.sh:
#!/bin/sh
for line in $(cat sites.txt); do
    # strip the leading "http://" (cut -c8- drops the first 7 characters)
    stripped_url=$(echo "$line" | cut -c8-)
    target_folder="downloads/$(echo "$stripped_url" | sed 's/\//_/g')"

    echo ""
    echo "Scraping $stripped_url"
    echo "-----------------------------------"

    echo "> creating folder $target_folder"
    mkdir -p "$target_folder"

    echo "> scraping $stripped_url"
    wget -e robots=off \
        -H -nd -nc -np \
        --recursive -p \
        --level=1 \
        --accept jpg,jpeg,png,gif,webm \
        --convert-links -N \
        -P "$target_folder" "$stripped_url"

    echo ""
    echo "> Finished scraping $stripped_url"
done
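In case it matters, this is how each target folder name is derived from a URL: the scheme is stripped off the front, then every slash is replaced with an underscore. A small standalone sketch of that step (my own reworking; the function names are illustrative, and I use POSIX parameter expansion instead of cut -c8- so it handles https:// as well as http://):

```shell
#!/bin/sh
# strip_scheme / target_folder_for are hypothetical names, not from the script.

strip_scheme() {
    # ${1#*://} removes everything up to and including "://",
    # so it works for any scheme length (unlike cut -c8-).
    printf '%s\n' "${1#*://}"
}

target_folder_for() {
    # Same transformation as the script: prefix "downloads/",
    # then turn every "/" in the remainder into "_".
    printf 'downloads/%s\n' "$(strip_scheme "$1" | sed 's/\//_/g')"
}

target_folder_for "https://boards.4channel.org/v/thread/441907771"
# -> downloads/boards.4channel.org_v_thread_441907771
```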
When I run it on a page such as https://boards.4channel.org/v/thread/441907771/the-game-awards-2018, it downloads all of the images on that page, but it also follows links and downloads images from other boards, such as 4channel.org/a/, and so on.
Before the domain change to 4channel, the scraper was working fine, but now I can't figure out how to stop it from wandering off to other domains.
Do I have to blacklist every other link on the page?
Thank you