Pro-tip: Using wget Mirror Mode with Custom HTML Attributes
In this quick post I will explain how to edit an external website transparently, allowing wget
to follow links it would not have seen otherwise.
Wget mirror issues
You may already know this: wget has an awesome --mirror option that, combined with --page-requisites, captures an entire website with all the images, JavaScript, CSS and fonts it needs. With --convert-links it also rewrites the links to be relative, so the copy can be browsed locally.
Here is the command I use for example:
wget --mirror --convert-links --adjust-extension \
--page-requisites --no-parent --progress=dot \
--recursive --level=6 https://target.com/
This will give me a folder with everything nice and tidy with directories, HTML files, etc.
wget is able to follow links in various places like href, img src, iframe… as you can see in the source code.
But some websites use custom attributes like the famous data-src, used by plenty of lazy-loading JavaScript libraries back when native lazy loading was not yet implemented in our browsers.
wget does not offer any option to follow custom attributes on a tag: the --follow-tags option only selects which HTML tags are considered, it does not let you control the attribute itself. So we will have to cheat if we want it to download our images.
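To make the problem concrete, here is a minimal Python sketch of a link extractor that, like wget, only looks at a fixed set of attributes. It is an illustration rather than wget's actual logic; the class name SrcCollector, the function image_urls and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collect <img> URLs the way a crawler that only knows `src` would."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            # Only the standard attribute is considered; data-src is ignored.
            if name == "src" and value:
                self.urls.append(value)


def image_urls(html):
    """Return the image URLs visible through the `src` attribute only."""
    collector = SrcCollector()
    collector.feed(html)
    return collector.urls


# Hypothetical markup: one regular image, one lazy-loaded image.
page = '<img src="/logo.png"><img data-src="/photo.jpg" class="lazy">'
print(image_urls(page))  # the lazy-loaded /photo.jpg is not in the list
```

Renaming data-src to src in the markup before it reaches the extractor is enough for the second image to be picked up, which is exactly what the rest of this post sets up for wget.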
Using a proxy to edit the HTML on the fly
If you cannot control the website you are mirroring, one option is to put a local proxy in front of it.
That's what we are going to do: instead of wget going directly to our target, its requests will go through a proxy:
    local
┌────────────┐
│            │
│  ┌──────┐  │         ┌──────────┐
│  │ wget ├──┼────────►│target.com│
│  └──────┘  │         └──────────┘
│            │
│            │
│  ┌──────┐  │         ┌──────────┐
│  │ wget │  │         │target.com│
│  └──┬───┘  │         └──────────┘
│     │      │              ▲
│     ▼      │              │
│  ┌──────┐  │              │
│  │proxy ├──┼──────────────┘
│  └──────┘  │
│            │
└────────────┘
This proxy can be anything but let’s use an open-source one: https://mitmproxy.org/ ❤️
We start by writing a small Python script that will edit all the HTTP responses passing through this proxy:
"""
Fix the lazy loader SRC.
"""
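from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    # Called by mitmproxy for every response passing through the proxy.
    if flow.response and flow.response.content:
        # Rename the lazy-loading attribute so wget sees a regular src.
        flow.response.content = flow.response.content.replace(
            b"data-src", b"src"
        )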
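The byte replacement above is blunt: it rewrites every occurrence of data-src, including ones in JavaScript or visible text, and it runs on non-HTML responses too. A slightly more careful variant could look like the sketch below. The helper names is_html and promote_lazy_attributes are hypothetical, and the regular expression is an assumption about how the target writes its attributes. Inside the mitmproxy script, they would be called from response() with the response's Content-Type header value and flow.response.content.

```python
import re

# Match data-src / data-srcset only when used as an attribute name,
# i.e. preceded by whitespace and followed by "=".
LAZY_ATTR = re.compile(rb"(?<=\s)data-(src(?:set)?\s*=)")


def is_html(content_type):
    """True for header values such as 'text/html; charset=utf-8'."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def promote_lazy_attributes(body):
    """Rename data-src/data-srcset attributes to src/srcset in an HTML body."""
    return LAZY_ATTR.sub(rb"\1", body)


# Hypothetical response body: the attribute is renamed, the prose is left alone.
sample = b'<img class="lazy" data-src="/photo.jpg"><p>about data-src</p>'
print(promote_lazy_attributes(sample))
```

Note that if an image already carries a placeholder src, either approach leaves the tag with two src attributes, so it is worth looking at the target's markup before relying on this.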
Then we can launch the proxy:
mitmdump -s src.py
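This starts a proxy listening on localhost:8080. Since the target is served over HTTPS, wget must also trust mitmproxy's CA certificate (created under ~/.mitmproxy/ on first run, and usable with wget's --ca-certificate option). Now let's tell wget to use this proxy and we are good to go!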
wget -e use_proxy=yes -e http_proxy=localhost:8080 -e https_proxy=localhost:8080 \
--mirror --convert-links --adjust-extension \
--page-requisites --no-parent --progress=dot \
--recursive --level=6 https://target.com/
Every request now passes through our proxy and our Python script: the <img data-src tags are seen as <img src and our local mirror will be complete!
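A quick way to sanity-check the result is to scan the mirrored HTML for lazy-loading attributes that slipped through. The sketch below assumes the mirror was written to a directory named after the host (wget's default layout, so target.com here) and that pages were saved with an .html suffix thanks to --adjust-extension; the function name is hypothetical.

```python
import re
from pathlib import Path

# A data-src or data-srcset attribute that is still present in the markup.
LAZY_ATTR = re.compile(rb"\sdata-src(?:set)?\s*=")


def files_with_lazy_attributes(mirror_dir):
    """Return mirrored HTML files that still contain data-src attributes."""
    leftovers = []
    for path in sorted(Path(mirror_dir).rglob("*.html")):
        if LAZY_ATTR.search(path.read_bytes()):
            leftovers.append(path)
    return leftovers


# Example usage (directory name is an assumption):
#   for path in files_with_lazy_attributes("target.com"):
#       print(path)
```

An empty result suggests every page went through the rewriting script; any file listed was probably fetched without the proxy.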
This tip can be used for a lot of other use-cases, and thanks to Mitmproxy, it’s really easy to set up. Happy proxifying! 👋