Tuesday, March 18, 2014

How to Download An Entire Website?

Sometimes we like to have an entire site, or part of it, archived on our disk so that we can read it even without an Internet connection. But how do we do it? Here are two easy-to-use tools that do the trick.

1. wget:

"wget" is a simple yet powerful command line tool. As usual it is a free software that respects your freedom as well as gives you the freedom to read a site even without a working Internet connection. This tool is available by default in almost all GNU/Linux distributions. It has lot of options. You can use it to download a single file or a web page. You can also download recursively, i.e. a page and all the links in that page. You can choose to download pictures, style sheets, JavaScript files. You can choose the domains from which the files can be downloaded (by this, you can skip unwanted ads getting downloaded).

Here is a sample wget command with which you can download this site.

wget --recursive --timestamping --span-hosts --page-requisites --adjust-extension --convert-links --domains=blogspot.in,bp.blogspot.com --no-parent techmusicnmore.blogspot.in

Let us see the options one by one.

--recursive

This tells wget to download a page and all the links in that page recursively. Use it with caution, as some sites have many levels of links and you may end up downloading a lot of data. You may want to specify the depth using the --level=n option to restrict downloads to 'n' levels of depth.
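
For instance, to go no deeper than two levels of links (the URL is a placeholder):

wget --recursive --level=2 https://example.com/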

--timestamping

This is a clever option that downloads a file only if a newer version is available. For example, if you have already downloaded a site, or canceled a download midway, you can restart it without worrying about wasting your bandwidth re-downloading all the files. This option makes wget fetch only missing files and re-download only outdated ones.
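
So if a download gets interrupted, just rerun the same command with this option and wget will fetch only what is missing or outdated. A rough example (placeholder URL):

wget --recursive --timestamping https://example.com/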

--span-hosts

This tells wget to look into other servers if needed. Since our blog's images are stored on other domain(s), we need to give this option. But be warned that this may cause you to download the entire Internet! So restrict the domains by white-listing them using the --domains option or black-listing them using the --exclude-domains option.

--domains=server1,server2

As discussed under the previous option, this restricts downloads to just the domains specified in the comma-separated list.
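
For example, the command below (with placeholder domain names) follows links to other hosts but downloads only from the two listed domains; swap --domains for --exclude-domains to black-list instead:

wget --recursive --span-hosts --domains=example.com,images.example.com https://example.com/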

--page-requisites

This option downloads all the artifacts required to display a page properly, that is, images, scripts, CSS and so on.

--adjust-extension

This option cleverly adds an .html extension to downloaded pages that do not already have one. Some sites name their pages with other extensions like .jsp or .gsp; in such cases too, the files will be saved with an .html extension.

--convert-links

This option converts the links in the downloaded pages so that they point to the local copies. Without it, the pages will still be downloaded, but their links will keep pointing to the actual website.
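
Putting the last three options together, here is a rough sketch (placeholder URL) that saves a single article along with its images and style sheets so that it displays properly offline:

wget --page-requisites --adjust-extension --convert-links https://example.com/articles/some-article.jsp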

--no-parent

This tells wget to download ONLY the links/files below the given URL; it will not download the parents of the current link. For example, say you have a web page listing all articles written by a particular author. If you use wget without this option, other pages such as the home page and the 'about us' page may be downloaded as well. With this option, only the pages under that author's URL will be downloaded.
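
For instance, with a placeholder author page like this, only the pages under /author/foo/ will be fetched:

wget --recursive --no-parent --convert-links https://example.com/author/foo/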

wget in action; downloading this blog ;-)

These are not the only things you can do; wget has a lot more to offer. Some websites try to prevent automated downloads like this. You can work around them by acting like a real user: spacing out your requests and masquerading them as coming from a real browser (see the sketch below). For the details you may need to go through the manual; the link is given below.
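
As a rough sketch (the delays and the user-agent string are just examples), options like these make wget pause between requests and identify itself as a regular browser:

wget --recursive --wait=5 --random-wait --limit-rate=200k --user-agent="Mozilla/5.0 (X11; Linux x86_64)" https://example.com/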

wget manual: https://www.gnu.org/software/wget/manual/wget.html

2. HTTrack

If you are not running GNU/Linux, my first advice is to switch your OS. If you can't, or you need a GUI-based alternative to wget even on GNU/Linux, here is HTTrack. It has the options wget has, plus a lot more that wget doesn't. That is mainly because wget is a general-purpose downloader, while HTTrack is developed specifically for website archiving and crawling, so it has a slight edge over wget for this job.

This awesome tool is available for Windows, OS X, Android and of course GNU/Linux and BSD platforms. It is free software released under the GPLv3, so have no guilt in using it. If you are a GNU/Linux user, download the source package using wget and compile it locally for extra fun!
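
If you prefer the command line there too, a typical HTTrack invocation looks roughly like this (the URL, output directory and filter are placeholders; check the manual linked below for the exact options):

httrack "https://example.com/" -O "/home/user/mysite" "+*.example.com/*" -v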

You can get HTTrack from here: http://www.httrack.com/
HTTrack for Android: https://play.google.com/store/apps/details?id=com.httrack.android
HTTrack manual: http://www.httrack.com/html/index.html
