Wget Static Backup of Dynamic Sites

wget backup static dynamic url-encoding perl web

Creating static backups of dynamic websites using wget and handling query string URL encoding issues

Creating Static Backups of Dynamic Websites

When backing up dynamic websites (sites that use PHP, ASP, or similar server-side processing), you often want to create a static HTML version for archival purposes. Tools like wget can mirror entire websites, but there are challenges when dealing with URLs that contain query strings.

The Query String Problem

Dynamic sites use URLs like:

  • index.php?module=news&page=1
  • article.php?id=123&category=tech

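On Unix-like systems, wget saves each of these pages as a file whose name contains the query string, question mark included. When the static copy is served by Apache (or another web server), the server treats everything after the ? in a link as a query string and looks only for the part before it, for example index.php. That file does not exist in the backup, so the link returns a 404 error or the wrong page.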
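To make the mismatch concrete, the short sketch below splits one of the example links at the first question mark, roughly the way a web server does. It is illustrative only; the variable names are assumptions and do not come from wget or Apache.

```shell
# Split a link at the first "?" the way a web server does (illustrative only).
url='index.php?module=news&page=1'

path=${url%%\?*}     # everything before the first "?": the file the server looks for
query=${url#*\?}     # everything after the first "?": handed over as parameters

echo "path:  $path"
echo "query: $query"
```

The server searches for the path part alone, while the file in the backup carries the question mark and the parameters in its name.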

The Solution: URL Encoding

A question mark (?) in a URL marks the start of the query string. Encoding it as %3F in each link makes the server treat it as part of the file name, so the request matches the file that wget saved.

Before (broken in static backup):

index.php?module=news&page=1

After (works in static backup):

index.php%3Fmodule=news&page=1
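As a rough illustration of the encoding step, the sketch below rewrites the first question mark of a single link in plain shell. The function name encode_qmark is made up for this example and is not part of wget or Perl.

```shell
# "?" is character 63 (hex 3F), which is where the %3F escape comes from.
printf '%%%02X\n' "'?"

# Percent-encode only the first "?" in a link, leaving the rest untouched.
encode_qmark() {
  case $1 in
    *\?*) printf '%s%%3F%s\n' "${1%%\?*}" "${1#*\?}" ;;
    *)    printf '%s\n' "$1" ;;
  esac
}

encode_qmark 'index.php?module=news&page=1'
```

Links without a question mark pass through unchanged, so the function is safe to apply to every link.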

The Fix

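After creating your static backup with wget, run this command from the backup's root directory to fix the index.php query string URLs:

find . -type f -exec perl -pi -e 's/index\.php\?/index.php%3F/g' {} +

This command:

  • Uses find . -type f to pass every regular file in the current directory and its subdirectories to Perl (a plain ./* glob would cover only the top level)
  • Uses perl -pi -e for in-place editing
  • Searches for index.php? (escaped as index\.php\? so the dot and the question mark match literally)
  • Replaces it with index.php%3F (URL-encoded question mark)

If the site's dynamic pages use a different script name, such as article.php, adjust the pattern accordingly.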

Using wget for the Initial Backup

To create the initial static backup, use wget with appropriate options:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.com

This will:

  • Mirror the entire site (--mirror)
  • Convert links to work locally (--convert-links)
  • Add .html extensions where appropriate (--adjust-extension)
  • Download CSS, JS, and images (--page-requisites)
  • Never ascend above the starting directory (--no-parent)

After running wget, apply the Perl fix above to handle any remaining query string issues.
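One way to confirm that nothing was missed is to search the backup for links that still contain an unencoded question mark. The sketch below is a minimal check under that assumption; the function name remaining_links and the scratch files are hypothetical.

```shell
# List files under a directory that still contain an unencoded "index.php?" link.
# Prints nothing when every link has been rewritten.
remaining_links() {
  grep -rl 'index\.php?' "${1:-.}"
}

# Scratch directory standing in for a mirror (made-up file names).
check=$(mktemp -d)
printf '<a href="index.php%%3Fid=1">fixed</a>\n' > "$check/fixed.html"
printf '<a href="index.php?id=2">missed</a>\n'   > "$check/missed.html"

# Only the file with the unencoded link is reported.
remaining_links "$check"
```

On a real backup, call remaining_links with the backup's root directory; an empty result suggests every index.php link was encoded.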