Wikileaks File Scrape

Some time ago there was a comment saying that Wikileaks had done a major file dump at file.wikileaks.org, and I took a look with a browser. There it was.

A day or three later I started a scrape of it. The site has a “no robots” setting that I had to get past, but then the scrape proceeded just fine. Until it didn’t. AT&T hiccupped and I had to start over. I had the delay between requests set to 20 seconds (to keep the load on their server low, not cause issues, and look more like a human), but at that pace it would take 3 days just to re-cover all the files I’d already downloaded and confirm that “nothing changed”. So I set the delay down to 10 seconds.
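For scale: at one request every 20 seconds that’s 4,320 requests a day, so a 3 day re-check implies something like 13,000 files already fetched; halving the delay roughly halves the re-check time.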

That seemed to be running fine for the last few days, so I’d largely ignored it. Until I checked on it tonight and found it failing with:

HTTP request sent, awaiting response... 502 Bad Gateway
2020-04-30 xx:xx:xx ERROR 502: Bad Gateway.

That is often because a front-end server on their side can’t talk to the actual backend (i.e. not my problem), but it can also be that DNS is not right (mine is), among other possibilities. It might also be that their admin (or some automated widget) detected my 10 second requests and blacklisted my IP for a while… (Such blocks often time out after a day or two of doing nothing, so right now I’m no longer running the scrape.)

So it would be helpful to me if someone can, even just in a browser, take a look at their site and see if it is working for you (which would make it a “me problem”) or not.
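For anyone who would rather check from a command line than a browser, something like this ought to show both whether the site answers and whether the name resolves (it assumes curl and dig are installed; nslookup works in place of dig):

# Ask for just the response headers; a healthy site answers "200 OK",
# a broken backend answers "502 Bad Gateway"
curl -sI https://file.wikileaks.org/file/ | head -n 1

# Check that the name resolves, to rule out a DNS issue on your end
dig +short file.wikileaks.org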

Also, just as an FYI, here’s a script fragment of what I’m using for the scrape under Linux. The actual guts of it are just the one line with wget. The rest just makes sure it’s in the right directory when I launch it, so the data goes into the archive / scratch area.

cd /scrape/archive/Wikileaks    # work out of the archive / scratch area
pwd; ls                         # sanity check: right directory, and what's already there

wget -w 10 -np -m -e robots=off https://file.wikileaks.org/file/

The -w 10 parameter says to wait 10 seconds between requests. -np says “no parent”, i.e. don’t go “up” the hierarchy and off to God Only Knows Where if there’s a link out and up. -m sets a bunch of other flags to “mirror”, that is, recursively descend the target site. Finally, -e robots=off says to ignore the robots.txt file that would otherwise forbid automated scrapes.

It worked like a champ for a while:

root@odroid:/scrape/archive/Wikileaks/file.wikileaks.org# du -ms *
35955	file
1	robots.txt

It had downloaded almost 36 GB of stuff, which I’ve not yet looked at, at all. So the process is fundamentally shown to work.

That leaves me with really only three likely / probable cases:

1) 10 seconds was too short and I’m on a block list as a detected robot.

2) Someone is blocking their site, attacking Wikileaks.

3) Technical fault. Either inside their shop or on some network gear in between.

IF other folks get in “no problem”, then 2 & 3 leave the list…
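If it does end up being case 1, wget has a couple of standard knobs that might make a restarted scrape look less mechanical. This is just a sketch of a variant I might try, not what I’ve been running:

cd /scrape/archive/Wikileaks

# Same mirror, but vary the pause (--random-wait picks 0.5x to 1.5x of -w,
# so 5 to 15 seconds here) and cap the transfer rate, both of which read as
# less "robotic" to simple rate detectors.
wget -w 10 --random-wait --limit-rate=500k -np -m -e robots=off \
     https://file.wikileaks.org/file/

Note that the long re-check after any restart is baked into -m: it turns on timestamping (-N), so wget still makes one polite request per already-downloaded file just to ask whether it changed. Segmenting the scrape by subdirectory and skipping the parts already known to be done avoids most of that.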

13 Responses to Wikileaks File Scrape

  1. Julian Jones says:

    I can get in no problem, working for download this a.m. from UK.

  2. Rienk says:

    I’m getting 502 Bad Gateway from the Netherlands, May 1, 11:00 local. I’ll try again later.

  3. Tim. says:

    I can get in from UK. Some files will not open.

  4. Gary P Smith says:

    502 Bad Gateway from Texas.

  5. E.M.Smith says:

    Given the variable, and geographically diverse, failure:

    My guess at this point is “It isn’t me” and “it isn’t them” leaving “network issues”.

    I’d speculate that with half the world at home watching Netflix or browsing the internet, and service providers using QOS (Quality Of Service) router settings to down-rank non-human traffic (i.e. promote real-time media), the failure is a side effect of heavy congestion.

    I’ll likely segment my scrape by subdirectory (skipping the long re-check of done stuff) and run it in chunks tonight.

    Thanks all for the function / fail data. It really helped.

  6. Nancy & John Hultquist says:

    Not a computer guru here, but it might be that Hillary has gotten to those files with a dishcloth and disinfectant. This ingredient has been hard to come by in recent weeks by regular folks.
    Sorry – couldn’t resist.

  7. philjourdan says:

    NO problems from Virginia about Noon, EDT

  8. Rienk says:

    20:00 local and I can now traverse the directories but not download files. Is torrenting an option for you?

  9. Harry Whitehat says:

    Also trying to scrape data there. It’s over 400GB from my original calculations using the info provided in the index.html files, so be prepared before grabbing everything. Best to look for what you want first and grab specific things. I’ve seen a 502 a number of times over the past week or so, but it is presently working. To not deal with DNS lookup issues, do an nslookup on the site. There are 3 or 4 IPs that it returns. Add them to your hosts file to circumvent DNS. Speed of the site has been up and down all week. Pages load fast at times, other times you can go brew a pot of coffee in between clicks. HTH

  10. dadgervais says:

    I tried the .torrent file they provide using Transmission in Ubuntu and got a 31GB volume full of files and folders. I did get a few tracker errors and connection fails, but the download succeeded and the file verified. Took about 6 hours. I’m still seeding (18 hours up time) so it still seems fine. Don’t think we need to scrape the whole site.

  11. E.M.Smith says:

    @dadgervais & Rienk:

    I can do torrent easily. And have. I just didn’t look for it… I think I’ll set one of them running too.

    Need to scrape? Nope. Like doing it? Ah, yeah…. ;-)

  12. E.M.Smith says:

    @Harry Whitehat:

    Nice idea on the DNS hard coding. I have a caching DNS server in my net, so only one outside lookup per day…

    I have several TB of free disk so not worried there. Time to complete over internet matters though…

  13. jim2 says:

    I haven’t been able to play xbox on-line with a friend for the last week. Either the internet or xbox servers or both are way overloaded.
