How to use wget to mirror a website without .html file extensions

Wget is a great and powerful tool, but it has some hidden features that are not really common in the Linux world.

For example, it has a set of default settings that can be altered via a somewhat mysterious .wgetrc file (good luck on a Mac!).

How to mirror a website onto a local machine?

wget -m -k -K URL 

This will crawl all pages in the site, save them to disk and rewrite URLs to relative ones so you can browse them locally.
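
If you do not know the short flags by heart, the same command can be spelled with long options. This is only a sketch: mirror_site is a function name I made up for illustration, and http://example.com/ is a placeholder URL, not the site I was mirroring.

```shell
#!/bin/sh
# Long-option spelling of: wget -m -k -K URL
# mirror_site is a hypothetical helper name, not part of wget.
mirror_site() {
    # --mirror           : recursive download with timestamping, no depth limit
    # --convert-links    : rewrite links in saved pages so they work locally
    # --backup-converted : keep each page's unconverted original as *.orig
    wget --mirror --convert-links --backup-converted "$1"
}

# Placeholder call, commented out so nothing is downloaded by accident:
# mirror_site "http://example.com/"
```

By default wget puts the result in a directory named after the host, so the placeholder call above would create example.com/ in the current directory.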

It's cool, but for me it also added .html extensions, so index.php became index.php.html (the same effect as wget's -E / --html-extension option).
What I wanted was to crawl the site to flatten it into static HTML pages. The site was about 4 years old, and I wanted to make it easier to deploy
and safer than some freaky old crappy PHP app sending emails and god knows what else.

How to turn off html extensions in wget

So I finally found out you can switch off .html extensions by adding the -e switch with a command from wgetrc, like this:

wget -m -k -K -e html_extension=Off URL 

This makes an exact copy of all pages and does not append any extensions. In wget 1.12 and later, the same setting is also available under the name adjust_extension.
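
A quick way to check the result is to look for files that still picked up an extra .html. The sketch below assumes the mirror was saved in a local directory you pass as an argument; find_appended_html is a made-up helper name, and the pattern is only a heuristic (it flags any file name with a second dot before .html).

```shell
#!/bin/sh
# List files that look like they got an extra .html appended,
# for example index.php.html. A plain about.html is not reported.
# find_appended_html is a hypothetical helper, not a wget feature.
find_appended_html() {
    find "$1" -type f -name '*.*.html'
}

# Example (placeholder directory name, commented out):
# find_appended_html example.com
```

If the command prints nothing, no file in the mirror matches the pattern.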

Be careful

This will not mirror any content that is referenced only from JavaScript or CSS files, such as background images or images used in rollover actions (wget 1.12 and later does parse CSS, but it never runs JavaScript).

This will also not mirror any dangling pages that happen to be in Google but cannot be reached by following links from the main page.

So consider all of this before deleting your old PHP-driven site... you might still need it some day ;-)
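
If you know the addresses of such dangling pages, one possible workaround is to put them in a text file and hand that file to wget with -i. This is a sketch under assumptions: mirror_extra_pages and extra-urls.txt are names I made up, and you have to collect the URLs yourself (from search results, old logs and so on).

```shell
#!/bin/sh
# Mirror pages that are not linked from the main page.
# $1 is a text file with one URL per line (hypothetical file, you create it).
mirror_extra_pages() {
    if [ ! -r "$1" ]; then
        echo "cannot read URL list: $1" >&2
        return 1
    fi
    # Same options as the main mirror command, with -i reading start URLs from the file.
    wget -m -k -K -e html_extension=Off -i "$1"
}

# Example (commented out so nothing is downloaded by accident):
# mirror_extra_pages extra-urls.txt
```

Each URL in the file is used as another starting point for the crawl, so pages linked only from those dangling pages get picked up too.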
