HTTrack: new go-to program for web mirroring / archiving
Thursday, April 2nd, 2009
Faced with a big site full of URLs like http://mysite.com/Internal1.asp?id=357 to mirror & archive, I recently tried out a new (to me) tool, HTTrack. I’ve fiddled with wget for this sort of job in the past, but it always takes me ages of man-page reading to get my options right, and even then not everything seems to work out.
This time around, for example, I’d convinced myself that wget -r -N -l inf --no-remove-listing -E -k -p http://mysite.com would do the trick. It mostly did, except for seemingly random pages that didn’t get all of their links converted.
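For anyone else squinting at that command, here is a rough annotated version of it. The flag meanings are as I read them in the wget manual; the `opts`, `url`, and `cmd` variable names are just for illustration, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Annotated form of the wget invocation quoted above.
# Flag descriptions follow the wget manual; variable names are illustrative only.
url="http://mysite.com"           # placeholder host used in this post

opts=""
opts="$opts -r"                   # recurse into linked pages
opts="$opts -N"                   # timestamping: skip files not newer than the local copy
opts="$opts -l inf"               # no limit on recursion depth
opts="$opts --no-remove-listing"  # keep the .listing files from FTP retrievals
opts="$opts -E"                   # save HTML pages with an .html suffix
opts="$opts -k"                   # convert links in saved pages for local viewing
opts="$opts -p"                   # also fetch page requisites (images, CSS)

cmd="wget$opts $url"
echo "$cmd"                       # print for review; run it with: eval "$cmd"
```

Worth noting that --no-remove-listing only matters for FTP retrievals, so it probably wasn't doing anything for an HTTP site like this one.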
HTTrack, on the other hand, did The Right Thing without any switches or arguments whatsoever. It was a bit more of a pain to get running; even though it’s in MacPorts, right now the port lags behind the latest release, so I had to actually type ./configure and make myself. Well worth it for a usable mirror.
