I recently needed to download every image referenced in some HTML, and I wanted to share the script that I wrote for it.
Download Images from HTML - Introduction
I had some WordPress posts that I wanted to back up, and I wanted to make sure that I had the images as well as the text.
This was a simple script, but it solved a problem that I had run into recently. That said, I could see myself using this or an updated version for some OSINT or phishing work in the future.
This will be another short script like the one in my last post, since I've been automating some small problems as of late!
Parsing the HTML
First, I tried to just use regex to parse out any img src attributes from my flat text file.
As I should have expected, and as this Stack Overflow answer warns, this proved to be annoying at best.
Instead, I ended up using Beautiful Soup to parse the file and extract the links that I wanted. Then I downloaded the images and saved them locally.
Note that I was using a flat text file with multiple WordPress posts, but this process should work fine for any HTML document that you are interacting with.
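To give a feel for why the regex approach got annoying, here is a minimal sketch that compares a naive pattern against a real parser. To keep it dependency-free I am assuming the standard library's html.parser here rather than Beautiful Soup, and the class name, function name, and sample markup are all made up for illustration.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(html):
    """Return the src of each <img> in the given HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Made-up sample markup: different quoting, attribute order, case, and line breaks
sample = """
<p><img src="https://example.com/a-300x400.jpg" alt="a"></p>
<p><IMG alt='b' SRC='https://example.com/b.jpg'/></p>
<p><img
     class="wide"
     src="https://example.com/c.jpg"></p>
"""

# The naive regex only catches the first tag; the parser finds all three
naive = re.findall(r'<img src="([^"]+)"', sample)
parsed = extract_img_sources(sample)
print(len(naive), len(parsed))
```

The regex could be extended to cover each of these variations, but every new case makes it harder to read, which is roughly where I gave up and reached for a parser.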
You can find the code for this process below. There is also a small loop with some regex in the middle of the script. It rewrites each link so that the script downloads the original image file from WordPress, as opposed to a resized version. You should not see any adverse effects from this section if your filenames do not end in what looks like a resolution (e.g., -300x400.jpg). Note that the pattern only looks at .jpg files.
Also, I'm sure that RecViking is proud of me for finally using some regex again.
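As a quick illustration of that regex step on its own, here is a small sketch. The helper name and example URLs are made up, and the pattern is slightly stricter than the one in my script: it requires at least one digit on each side of the x and is anchored to the end of the string, which is an assumption on my part about what resized filenames look like.

```python
import re

# Trailing WordPress-style size suffix such as "-300x400" right before .jpg
SIZE_SUFFIX = re.compile(r"-[0-9]+x[0-9]+\.jpg$")


def original_url(link):
    """Return the link with any -WIDTHxHEIGHT suffix removed (hypothetical helper)."""
    return SIZE_SUFFIX.sub(".jpg", link)


# Made-up example URLs
resized = original_url("https://example.com/uploads/IMG_1119-300x400.jpg")
untouched = original_url("https://example.com/uploads/IMG_1119.jpg")
edited = original_url("https://example.com/uploads/IMG_2151-e1468181469199.jpg")
print(resized)
print(untouched)
print(edited)
```

Only the first URL changes. The -e1468181469199 style suffix is left alone because it does not look like a resolution.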
import re
import shutil

import bs4
import requests

links = []
origLinks = []

# https://stackoverflow.com/questions/18042661/using-bs4-to-extract-text-in-html-files
# Read the saved posts from a local file and parse them
with open('posts.txt', encoding='utf-8') as f:
    page = f.read()
soup = bs4.BeautifulSoup(page, "html.parser")
for node in soup.find_all('img'):
    # Skip any img tag that has no src attribute
    if node.get('src'):
        links.append(node['src'])

# https://docs.python.org/3.4/library/re.html
# Strip a size suffix such as -300x400 so that we fetch the original upload
for link in links:
    if re.search(r"-[0-9]*x[0-9]*\.jpg", link):
        origLinks.append(re.sub(r"-[0-9]*x[0-9]*\.jpg", ".jpg", link))
    else:
        origLinks.append(link)

# https://www.dev2qa.com/how-to-download-image-file-from-url-use-python-requests-or-wget-module/
for link in origLinks:
    # Use the last path segment of the URL as the local filename
    filename = link.rsplit('/', 1)[-1]
    resp = requests.get(link, stream=True)
    # Let urllib3 undo any gzip/deflate encoding while streaming
    resp.raw.decode_content = True
    with open(filename, 'wb') as localFile:
        shutil.copyfileobj(resp.raw, localFile)
    del resp
When I executed this script, it downloaded all my images as expected!
root@kali:~/Documents/imgExtract# ls
imgExtract.py  posts.txt
root@kali:~/Documents/imgExtract# python imgExtract.py
root@kali:~/Documents/imgExtract# ls *.jpg
13177978_10206776407554475_5913650144102677662_n.jpg  IMG_1951.jpg
IMG_1119.jpg                                          IMG_1952.jpg
IMG_1120.jpg                                          IMG_1953.jpg
IMG_1136.jpg                                          IMG_1954.jpg
IMG_1195.jpg                                          IMG_2151-e1468181469199.jpg
IMG_1237.jpg                                          IMG_2160.jpg
IMG_1238.jpg                                          IMG_2161-e1468181539638.jpg
IMG_1245.jpg                                          IMG_2163.jpg
IMG_1247.jpg                                          IMG_2210-e1468181566583.jpg
IMG_1270.jpg                                          IMG_2418-e1468183375745.jpg
IMG_1271.jpg                                          IMG_2426.jpg
IMG_1934.jpg                                          IMG_2427.jpg
IMG_1936.jpg                                          IMG_2429-e1468183409990.jpg
IMG_1948.jpg
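The local filenames above come straight from the last path segment of each URL. That was fine for my WordPress links, but if your links carry query strings or end in a slash, something a bit more defensive might help. This is only a sketch under those assumptions; the helper name, the default filename, and the example URLs are all hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Derive a local filename from a URL's path (hypothetical helper)."""
    # urlsplit separates the path from any ?query or #fragment
    path = urlsplit(url).path
    # Decode percent-escapes first, then keep only the final path segment
    name = posixpath.basename(unquote(path))
    # Fall back to a placeholder when the path ends in a slash
    return name or default


# Made-up example URLs
plain = filename_from_url("https://example.com/wp-content/uploads/IMG_1951.jpg")
with_query = filename_from_url("https://example.com/uploads/IMG_1952.jpg?w=300")
no_name = filename_from_url("https://example.com/uploads/")
print(plain, with_query, no_name)
```

With a helper like this, the download loop could call filename_from_url(link) instead of splitting on the last slash.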
As usual, you can find the code and any updates in my GitHub repository.
Please feel free to submit a pull request if you end up using this for anything else, especially anything offensive-security related.
Download Images from HTML - Conclusion
This was a simpler script, but it solved a problem that I was having.
I'm not sure if it qualifies as a "security" tool, but I could see a few uses for it here and there.
I also found some older EverSec challenges, so I'm hoping to go through them soon and see if any are worth blogging about.