Some time ago there was a comment saying that Wikileaks had done a major file dump at file.wikileaks.org, so I took a look with a browser. There it was.
A day or three later I started a scrape of it. It has a “no robots” setting that I had to get past, but then the scrape proceeded just fine. Until it didn’t. AT&T hiccuped and I had to start over. I had the delay between requests set to 20 seconds (to keep the load on their server low, not cause issues, and look more like a human), but at that rate it would take 3 days just to re-walk all the files I’d already downloaded and confirm that “nothing changed”. So I set the delay down to 10 seconds.
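(As an aside: if the worry is looking too mechanical, GNU wget also has a --random-wait flag that varies each pause between 0.5x and 1.5x of the -w value. A minimal sketch, just that flag bolted onto the kind of fetch I’m doing; the full command is further below:)

# Vary the pause between requests (0.5x to 1.5x of the -w value)
# so the fetch pattern looks less like a fixed-interval robot.
wget -w 10 --random-wait -np -m -e robots=off https://file.wikileaks.org/file/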
The scrape seemed to be running fine for the last few days, so I’d largely ignored it. Until I checked on it tonight and found it failing with:
HTTP request sent, awaiting response... 502 Bad Gateway
2020-04-30 xx:xx:xx ERROR 502: Bad Gateway.
That is often because a front-end server on their end can’t talk to the actual backend (i.e. not my problem), but it can also be a DNS problem (mine is fine) or a few other things. It might also be that their admin (or some automated widget) detected my 10-second requests and blacklisted my IP for a while… (Often these blocks time out after a day or two of doing nothing, so right now I’m no longer running the scrape.)
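Ruling out the DNS angle is quick, by the way. A couple of stock lookups (assuming the usual “host” tool is installed) compare my resolver against a public one:

# Confirm the name resolves via my default resolver...
host file.wikileaks.org
# ...and via Google's public DNS, to catch a local resolver problem.
host file.wikileaks.org 8.8.8.8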
So it would be helpful to me if someone could, even just in a browser, take a look at their site and see if it is working for you (in which case it’s a “me problem”) or not.
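For anyone who’d rather check from a command line than a browser, something like this (assuming curl is installed) prints just the HTTP status line; a 200 means it’s up for you, while a 502 from a different network would point away from a block on my IP:

# HEAD request only; -s is silent, -I asks for headers.
curl -sI https://file.wikileaks.org/file/ | head -n 1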
Also, just as an FYI, here’s a script fragment of what I’m using for the scrape under Linux. The actual guts of it are just the one line with wget. The rest is just to make sure it’s in the right directory when I launch it and that data goes into the archive / scratch area.
cd /scrape/archive/Wikileaks
pwd; ls
wget -w 10 -np -m -e robots=off https://file.wikileaks.org/file/
The -w 10 parameter says wait 10 seconds between requests; -np says “no parent”, i.e. don’t go “up” the hierarchy and off to God Only Knows Where if there’s a link out and up; -m sets a bunch of other flags to “mirror”, i.e. recursively descend, the target site; and -e robots=off says to ignore the robots.txt file that tells automated scripts to stay out.
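For the curious: per the wget man page, -m is currently equivalent to -r -N -l inf --no-remove-listing, i.e. recursion, timestamp checking, infinite depth, and keeping FTP directory listings. The -N timestamping is what lets a restarted scrape skip files that haven’t changed:

# These two lines do the same thing; -m is just shorthand.
wget -m -w 10 -np -e robots=off https://file.wikileaks.org/file/
wget -r -N -l inf --no-remove-listing -w 10 -np -e robots=off https://file.wikileaks.org/file/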
It worked a champ for a while:
root@odroid:/scrape/archive/Wikileaks/file.wikileaks.org# du -ms *
35955	file
1	robots.txt
Having downloaded almost 36 GB of stuff that I’ve not yet looked at, at all. So it’s fundamentally shown to work.
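(For a quick sanity check on what actually landed, a couple of stock commands do the job; nothing fancy, just find and du:)

cd /scrape/archive/Wikileaks/file.wikileaks.org
find file -type f | wc -l    # how many files came down
du -sh file                  # total size, human readable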
Leaving me with really only three likely / probable cases:
1) 10 seconds was too short and I’m on a block list as a detected robot.
2) Someone is blocking their site, attacking Wikileaks.
3) Technical fault. Either inside their shop or on some network gear in between.
IF other folks get in “no problem”, then 2 & 3 leave the list…
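And if it does turn out to be case 1, the easy test is the same fetch from a different network (phone tether, a VPS, a friend’s machine). A minimal sketch, assuming a shell somewhere on another IP:

# -S prints the server's response headers; --spider fetches nothing.
# Works there + fails here = my IP is likely blocked (case 1).
# Fails everywhere = their shop or the path (cases 2 or 3).
wget -S --spider https://file.wikileaks.org/file/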