mike chambers | about

Using Jekyll to archive a Wordpress-based blog

Wednesday, December 25, 2013

I recently finished migrating my blog from Wordpress to Jekyll (complete post on that soon). One of the main reasons for the move is that I didn’t want to keep maintaining MySQL, Apache, PHP and everything else required by Wordpress. However, before I shut everything down, I wanted to create a static archive of the entire Wordpress blog, in case I find some content in the future that did not export correctly to the Jekyll-based blog.

There are a number of Wordpress plugins that will create a static version of the site which you can download and archive, but they require that you have zip support compiled into PHP (which I don’t). Since I made sure to maintain the same URLs when migrating to Jekyll, I realized I could use Jekyll itself to generate a list of blog post URLs, which I could then pass to wget to download an archive.

First, here is the file that Jekyll will use to generate the list of URLs to archive:

---
layout: nil
---
{% for post in site.posts %}http://www.mikechambers.com/blog2{{ post.url }}
{% endfor %}

I place this in _utils/siteurls.txt.

You can then build the site using:

jekyll build

which will create the file at _site/_utils/siteurls.txt. The file will contain one URL per line for each blog post on the Wordpress blog.

http://www.mikechambers.com/blog2/2012/04/08/simple-http-server-for-local-testing/
http://www.mikechambers.com/blog2/2012/04/02/north-american-flash-community-tour/
http://www.mikechambers.com/blog2/2012/02/22/flash-roadmap-whitepaper-published/
...

You can then pass this file to wget to download and save all of the posts (adjust the path to wherever your generated siteurls.txt lives).

wget -nc -x -P blog -i siteurls.txt

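Here is what each of the options means:

-nc : skip any file that has already been downloaded (no-clobber)
-x : force creation of a directory hierarchy that mirrors each URL (host and path)
-P blog : save everything under the blog directory
-i siteurls.txt : read the list of URLs to download from siteurls.txt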

This creates an archive of all of the URLs contained in the siteurls.txt file, with each post placed in a directory based on the URL path. You now have an archive of all of your Wordpress posts.
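Before shutting the old server down, it may be worth confirming that every URL in the list actually ended up on disk. The following is a minimal sketch, not part of the original workflow: the function names url_to_path and check_archive are made up for illustration, and it assumes that wget, run with -x and -P, saves each URL under prefix/host/path and stores URLs ending in a slash as index.html.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from wget or Jekyll).

# Map a URL to the path that `wget -x -P <prefix>` is assumed to save it under.
# Assumes plain http/https URLs with a path and no query string.
url_to_path() {
    prefix=$1
    url=$2
    path=${url#*://}                      # drop the scheme, keep host and path
    case $path in
        */) path="${path}index.html" ;;   # directory-style URLs become index.html
    esac
    printf '%s/%s\n' "$prefix" "$path"
}

# Print each URL from the list whose file is missing under the prefix.
# Returns non-zero if at least one file is missing.
check_archive() {
    prefix=$1
    list=$2
    missing=0
    while IFS= read -r url; do
        [ -n "$url" ] || continue         # ignore blank lines
        if [ ! -f "$(url_to_path "$prefix" "$url")" ]; then
            printf 'missing: %s\n' "$url"
            missing=1
        fi
    done < "$list"
    return "$missing"
}
```

With the commands above, this would be run as check_archive blog siteurls.txt. Any URL it reports can simply be fetched again, since -nc makes rerunning wget skip files that are already present.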

Note that this does not archive the entire Wordpress site. For example, it doesn’t archive the main index page or the category pages. If you would like to archive those as well, just manually add their URLs to the file (or use Jekyll to generate the category URLs).
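Adding the extra URLs by hand can be done with a small helper like the one below. This is only a sketch: add_url is a made-up name, and the example URLs in the usage note are assumptions about how the old site was laid out, so check them against your own blog first.

```shell
#!/bin/sh
# Hypothetical helper: append a URL to the list unless that exact line
# is already present, so rerunning it does not create duplicates.
add_url() {
    list=$1
    url=$2
    # -x matches whole lines only, -F treats the URL as a literal string
    grep -qxF -- "$url" "$list" 2>/dev/null || printf '%s\n' "$url" >> "$list"
}
```

For example, add_url siteurls.txt "http://www.mikechambers.com/blog2/" would add what is presumably the main index page, and a category page could be added the same way using whatever URL pattern your Wordpress install uses. Run wget again afterwards; with -nc it only fetches the new entries.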
