Duke4.net Forums: How to download a website




How to download a website

Micky C

  • Honored Donor
  • 3,426

#1

My university online account is going to shut down permanently in about 22 hours, and there are countless files across dozens of courses that I wouldn't mind having a copy of.

What's a quick and dirty way of automatically and recursively downloading all the PDFs, images, Word and PowerPoint documents buried within this labyrinth? In case it matters: I need to log in with a username and password to access the site.

Polymer wishlist: Global illumination, SSAO, Bloom, reflective surfaces, adjustable specular (wetness), volumetric lighting.
Mapper of the Month December 2014.

Herbs? Tell me more!
0

Hendricks266

  • EDuke32 Senior Developer
  • 5,686

  #2

Try wget --mirror.
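For anyone finding this later, a minimal invocation might look like the following (the URL is a placeholder, not the actual course site):

```shell
# --mirror implies recursion, infinite depth, and timestamping.
# --convert-links rewrites links so the copy works offline;
# --page-requisites also grabs the CSS/images each page needs;
# --no-parent keeps wget from climbing above the starting directory.
wget --mirror --convert-links --page-requisites --no-parent \
     https://lms.example.edu/courses/
```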
2

Forge

  • 5,382

#3

Pretty cold of the university. The couple I attended allow alumni to maintain accounts and access new and old material.
0

Paul B

  • 423

#4

I would have made some recommendations, but I'm not sure how well a site can be downloaded or ripped if its contents are stored in a database. You might end up with the CSS/HTML formatting cruft but not the actual content you're looking for, since that requires authentication and access to the DB server. Let us know how you make out.


This post has been edited by Paul B: 13 December 2017 - 12:18 PM

0

Micky C

  • Honored Donor
  • 3,426

#5

Hendricks266, on 13 December 2017 - 09:23 AM, said:

Try wget --mirror.


Sounds like it could be useful; I'll get back to you in a few hours. The only issue is that the site also has many videos on it which I don't need to download. Is there a way of excluding those? So far it looks like -R "*.mp4" might do what I want.
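If the reject option works as hoped, the combined command might look like this (placeholder URL; -R/--reject takes a comma-separated list of file-name patterns):

```shell
# Mirror the site while skipping video files.
# Quoting the patterns stops the shell from expanding the globs
# before wget sees them.
wget --mirror --convert-links --page-requisites --no-parent \
     --reject "*.mp4,*.avi,*.mov" \
     https://lms.example.edu/courses/
```

Note that wget may still briefly download some rejected files before deleting them, since it sometimes needs to parse a page to find further links.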

Forge, on 13 December 2017 - 09:32 AM, said:

Pretty cold of the university. The couple I attended allow alumni to maintain accounts and access new and old material.


Yeah, a bit of a shame. I still have access since I'm still working at the university; however, they moved to a new online system a few years ago and are getting rid of the old one, which has most of my older files. So I'm lucky I had access this long.

0

Micky C

  • Honored Donor
  • 3,426

#6

No luck with wget.

It seems to download the page that asks for a username and password, and that's it. I've tried a few different ways of passing the username and password to wget, but none of them seem to work.
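A common stumbling block here is that --user/--password only covers HTTP Basic auth, not form-based logins. One workaround for form logins is to establish a session first and reuse its cookie. A sketch, where the /login path and the username/password field names are guesses — check the login form's HTML for the real ones:

```shell
# Option 1: HTTP Basic auth (only if the site actually uses it).
wget --mirror --user=myname --password=secret \
     https://lms.example.edu/courses/

# Option 2: form-based login. Post the credentials once, saving the
# session cookie, then mirror the site with that cookie attached.
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'username=myname&password=secret' \
     https://lms.example.edu/login
wget --mirror --load-cookies cookies.txt \
     https://lms.example.edu/courses/
```

Another route is to export the cookies from a browser where you're already logged in (various extensions can write a cookies.txt) and feed that to --load-cookies directly.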

0

leilei

  • 410

#7

I'd suggest the ScrapBook extension (or its forks ScrapBook Plus and ScrapBook X) for pre-Quantum Firefox, or any of the older forks like Pale Moon. It's a very dirty way of saving a page recursively (it rewrites links and doesn't preserve directories), but it may work if wget and httrack fail for you. I don't have any better ideas in mind since this is time-critical; just tossing this out there.


This post has been edited by leilei: 13 December 2017 - 06:36 PM

1

Drek

  • 1,206

#8

Not too sure if this will get past the login, but here is the program I used once to save a page's worth of links to documents. It lets you include or exclude file types, too:
http://www.httrack.com/
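HTTrack's command-line filters use +/- patterns, so the include/exclude idea from earlier maps over directly. A sketch with a placeholder URL:

```shell
# Mirror into ./mirror, keeping documents and skipping video.
# -O sets the output directory; the trailing +/- arguments are
# HTTrack scan filters (include matching files, exclude matching files).
httrack "https://lms.example.edu/courses/" -O ./mirror \
        "+*.pdf" "+*.doc*" "+*.ppt*" "-*.mp4"
```

HTTrack can also capture login cookies via its browser-proxy capture mode, which may help where plain wget authentication fails.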
1

Paul B

  • 423

#9

Micky, I don't know of any universities that would use static pages to host web content. It's usually served dynamically through PHP, Java, or ASP. Anyone who knows anything about web programming doesn't store information in static HTML files anymore.

If you really need to get the information you can go about it several ways.

1) Ask the IT department to provide you with the content, or find out what technologies they use to host the website. There are many different languages and databases that can serve web content. Professionally built sites store as much information as possible in databases, since access times are much faster and databases handle volume. If you have access to the internal network the data is hosted on, you could possibly dump the data you want with a simple one-line command. If you know what services are being used, you'll know how to go about extracting the information; otherwise you're shooting in the dark.

2) If possible, get on the same internal network as the web server and run an Nmap scan against it to see what services and versions are running. You can also collect information from the site's URLs to see what scripting languages are in use, and if you're lucky you might come across a lazy admin who has hard-coded credentials in the website's DB connect string. You may also want to track down a program similar to this: http://binhgiang.sou...l%20system.html It dumps databases and rips websites for you.

3) Social engineering: call IT from within the university and ask them to hand over the last known good web backups of the server on behalf of the IT director before it's decommissioned. Make it seem like this should have been done last week.

4) Failing the above three steps: DDoS their network, crash their firewalls, and run in and take the data like a thief.


This post has been edited by Paul B: 13 December 2017 - 09:50 PM

1

Micky C

  • Honored Donor
  • 3,426

#10

Thanks for the advice, everyone. I tried a few more things, but in the end I decided to bite the bullet and spend about two hours simply going through the pages and manually downloading files. I have pretty much everything I need now. Fortunately, I've tutored such a range of courses that I have access to a lot of content on the new system anyway.

0

Forge

  • 5,382

#11

It might not help you now, but it could help others: any chance of convincing the online content director to archive the older stuff instead of discarding it in the transition from one system to the other?

Maybe have the uni offer their students a copy of the material in digital format upon graduation.
0

All copyrights and trademarks are property of their respective owners. Yes, our forum uses cookies. © 2017 Voidpoint, LLC
