Wget Static Backup of Dynamic Sites
Creating static backups of dynamic websites using wget and handling query string URL encoding issues
Creating Static Backups of Dynamic Websites
When backing up dynamic websites (sites that use PHP, ASP, or similar server-side processing), you often want to create a static HTML version for archival purposes. Tools like wget can mirror entire websites, but there are challenges when dealing with URLs that contain query strings.
The Query String Problem
Dynamic sites use URLs like:
index.php?module=news&page=1
article.php?id=123&category=tech
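When you create a static backup, wget on Unix-like systems typically saves each of these pages as a file whose name contains the literal question mark and query string. When a browser follows a link to such a page, Apache (or another web server) treats everything after the ? as a query string and looks for a file named index.php instead of the saved file. This causes 404 errors or broken links in your backup.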
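The split a server performs on such a URL can be illustrated with a small shell function. This is only a sketch: split_url is a hypothetical helper name, not part of wget or Apache, and it mimics just the path/query separation, not real request handling.

```shell
#!/bin/sh
# split_url URL: print the path and query string a server would see.
# Hypothetical helper for illustration only.
split_url() {
    url=$1
    case $url in
        *\?*)
            path=${url%%\?*}    # text before the first "?"
            query=${url#*\?}    # text after the first "?"
            ;;
        *)
            path=$url           # no "?" at all: the whole URL is the path
            query=
            ;;
    esac
    printf 'path=%s query=%s\n' "$path" "$query"
}
```

Calling split_url 'index.php?module=news&page=1' prints path=index.php query=module=news&page=1, which shows why the server looks for index.php and not for the file saved under the full name.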
The Solution: URL Encoding
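The issue occurs because a question mark (?) has special meaning in a URL: it separates the path from the query string. To make static backups work properly, you need to URL-encode the question marks in links as %3F, so that the server treats them as part of the file name.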
Before (broken in static backup):
index.php?module=news&page=1
After (works in static backup):
index.php%3Fmodule=news&page=1
The Fix
After creating your static backup with wget, run this Perl command to fix all the query string URLs:
perl -pi -e 's/index\.php\?/index.php%3F/g' ./*
This command:
- Uses perl -pi -e for in-place editing
- Searches for index.php? (written as index\.php\? so that the dot and the question mark match literally)
- Replaces it with index.php%3F (URL-encoded question mark)
- Runs on the files matched by ./*, which covers the current directory only; files in subdirectories need a recursive tool such as find
Using wget for the Initial Backup
To create the initial static backup, use wget with appropriate options:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.com
This will:
- Mirror the entire site (--mirror)
- Convert links to work locally (--convert-links)
- Add .html extensions where appropriate (--adjust-extension)
- Download CSS, JS, and images (--page-requisites)
- Never ascend above the starting directory (--no-parent); wget stays on the starting host by default
After running wget, apply the Perl fix above to handle any remaining query string issues.
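One way to check whether any unencoded links are left is to search the backup for them. This is a sketch under assumptions: list_unfixed_links is a hypothetical function name, and grep -r is a widely available extension (GNU and BSD grep) and not part of POSIX.

```shell
#!/bin/sh
# list_unfixed_links DIR: print the names of files under DIR (default:
# current directory) that still contain an unencoded "index.php?" link.
# Hypothetical helper; relies on grep's recursive -r option.
list_unfixed_links() {
    grep -rl 'index\.php?' "${1:-.}"
}
```

If the function prints nothing, no file under the directory contains index.php? any more.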