Scraping Wikileaks – A Partial Story

I was looking at the Wikileaks cache of Hillary’s Emails and thought maybe it would be a good idea to have an archival copy. You know, what with Biden and all the censorship, who knows how long before Wikileaks is nuked.

See: “https://file.wikileaks.org/file/clinton-emails/”

Index of /file/clinton-emails/
../
Clinton_Email_August_Release/ 01-Jan-1970 00:01 –
Clinton_Email_December_Release/ 01-Jan-1970 00:01 –
Clinton_Email_February_13_Release/ 01-Jan-1970 00:01 –
Clinton_Email_February_19_Release/ 01-Jan-1970 00:01 –
Clinton_Email_February_26_Release/ 01-Jan-1970 00:01 –
Clinton_Email_February_29_Release/ 01-Jan-1970 00:01 –
Clinton_Email_January_29_Release/ 01-Jan-1970 00:01 –
Clinton_Email_January_7_Release/ 01-Jan-1970 00:01 –
Clinton_Email_July_Release/ 01-Jan-1970 00:01 –
Clinton_Email_June_Release/ 01-Jan-1970 00:01 –
Clinton_Email_May_Release/ 01-Jan-1970 00:01 –
Clinton_Email_November_Release/ 01-Jan-1970 00:01 –
Clinton_Email_October_Release/ 01-Jan-1970 00:01 –
Clinton_Email_September_Release/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_10/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_11/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_12/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_13/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_14/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_15/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_16/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_17/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_18/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_19/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_2/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_20/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_21/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_22/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_23/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_24/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_25/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_26/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_27/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_28/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_29/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_3/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_30/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_31/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_32/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_4/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_5/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_6/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_7/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_8/ 01-Jan-1970 00:01 –
Litigation_F-2016-07895_9/ 01-Jan-1970 00:01 –
Nov03_2016/ 01-Jan-1970 00:01 –
Nov04_2016/ 01-Jan-1970 00:01 –
Powell_9-23-2016/ 01-Jan-1970 00:01 –
readme.txt 08-Oct-2018 20:06 200

The “readme” says:

ems@OdroidN2:/T1/ext/Clinton/file.wikileaks.org/file/clinton-emails$ cat readme.txt
This directory contains raw data obtained and used by WikiLeaks to create searchable archive available at https://wikileaks.org/clinton-emails/

Now all published in one place for ease of mirroring.

Now you might think that if they are doing this to encourage mirroring, they would allow automated copy to a mirror. Think again…

I started with a plain ‘wget’ and was promptly dropped. OK, they have a “robots” file, so automated downloads get rejected. There are options to wget to get around that. So I used a modest set:

wget -e robots=off -r -np -w 30 --random-wait \
-U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" \
'https://file.wikileaks.org/file/clinton-emails/'

Note that this is really all one line; I’ve wrapped it with trailing backslashes so you won’t have to scroll off the edge of the screen to see it.

The “-e robots=off” sets wget to ignore the robots request file. -r is recursive descent to get several layers (the default depth is 5, but some option choices make it unlimited). Then -np means “no parent” so it doesn’t take the ../ and crawl back up to download ALL of Wikileaks…

Then we get to the stuff that plays with what you look like. The “-w 30” says to wait 30 seconds between download attempts. This is done both to be polite (i.e. not hammer their site with a tidal wave of computer generated download requests) and to appear more “like a person” at a keyboard taking a moment to find the next link. “--random-wait” says to ‘mix it up’ by a semi-random but bounded variation. Again, more human like. Finally we get to the User Agent text line.
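
To make the “bounded variation” concrete: the GNU Wget manual describes --random-wait as varying the delay between 0.5 and 1.5 times the -w value, so -w 30 should land somewhere between 15 and 45 seconds. A minimal sketch of that jitter, in case a hand-rolled loop ever needs it (the name jittered_wait is made up for illustration and is not part of wget):

```shell
# jittered_wait BASE: print a delay in seconds between 0.5*BASE and 1.5*BASE,
# mimicking how the wget manual describes --random-wait.
# (Hypothetical helper; wget does this itself when given -w and --random-wait.)
jittered_wait() {
    awk -v w="$1" 'BEGIN { srand(); printf "%.1f\n", w * (0.5 + rand()) }'
}

# Example: print one delay for the 30 second base used above.
jittered_wait 30
```

Since awk seeds srand() from the clock, two calls within the same second return the same value; that is good enough for illustration but not for anything that needs real randomness.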

Each device / program / browser has a ‘tag’ that identifies what it is. This both lets a web site know any ‘quirks’ their page might need to work around, and it identifies you as a real browser with a person behind it (sometimes ;-) In this case, the string is the one Internet Explorer 6 sent on Windows NT 5.1 (the “Mozilla/4.0 (compatible; …)” lead-in is just its compatibility token). Yeah, I likely ought to find a newer User Agent handle to use. Maybe Chrome on a Chromebook, but this is what I’ve used for years, and it still seems to work (though some sites will now tell you that you are on an unsupported browser and to update…)

At any rate, that worked (starting this morning after my prior ‘lockout’ from being a Bad Robot timed out and I could download again…) until now, a few hours later. 2 or 3 maybe? Hey, being polite is also sometimes slow…

Then what happened is that it started to give Gateway Errors. This can be because they are just swamped with traffic or having technical difficulties, but IMHO it more often means their computer has become suspicious that your computer is not really a person. Total Volume? Persistence over time? Who knows…

--2021-01-21 16:43:30-- https://file.wikileaks.org/file/clinton-emails/Clinton_Email_August_Release/C05765990.pdf
Connecting to 192.168.0.254:3128... connected.
Proxy request sent, awaiting response... 504 Gateway Time-out
Retrying.

--2021-01-21 16:43:32-- (try: 2) https://file.wikileaks.org/file/clinton-emails/Clinton_Email_August_Release/C05765990.pdf
Reusing existing connection to file.wikileaks.org:443.
Proxy request sent, awaiting response... 502 Bad Gateway
2021-01-21 17:45:32 ERROR 502: Bad Gateway.

--2021-01-21 16:45:34-- https://file.wikileaks.org/file/clinton-emails/Clinton_Email_August_Release/C05765994.pdf
Connecting to 192.168.0.254:3128... connected.
Proxy request sent, awaiting response... 502 Bad Gateway
2021-01-21 17:48:35 ERROR 502: Bad Gateway.

Now I know my proxy server is working fine, as I’m still using it, so it’s on their end.
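
To put a number on how many requests failed, the error lines can be tallied from a saved log. A rough sketch, assuming the run was started with wget's -o option so the output landed in a file (the file name wget.log and the function name count_gateway_errors are placeholders, not anything from the run above):

```shell
# count_gateway_errors LOGFILE: print "<count> <status>" for each 5xx
# status that appears in an "ERROR 5xx" line of a wget log.
count_gateway_errors() {
    grep -o 'ERROR 5[0-9][0-9]' "$1" | sort | uniq -c | awk '{ print $1, $3 }'
}

# Usage, with wget.log standing in for wherever the log was written:
#   count_gateway_errors wget.log
```

Only the final "ERROR" lines are counted, so a 504 that wget retried and then gave up on as a 502 shows up once, under 502.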

The “good news” is that wget can be set to not re-download what it already has (I’ll need to look that up and add it) and it has an ‘index’ file to check through, so a ‘restart’ can often fill in the missing bits. BUT, as I’m likely again on the ‘ban for a while’ list, I can’t do this until tomorrow (or maybe the next day if they have exponential growth of timeouts…)
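
The option in question is most likely -nc (--no-clobber). Per the wget manual, with -r it keeps files that are already on disk and reads the saved .html index pages locally to find the links that are still missing. A sketch of the restart, not yet tried against Wikileaks (resume_mirror is just an illustrative name):

```shell
# resume_mirror [PREFIX...]: rerun the mirror, skipping files already on disk.
# -nc (--no-clobber) keeps existing files; the rest matches the command above.
# Anything passed as arguments is put in front of the command, so
# "resume_mirror echo" prints the command line instead of running it.
resume_mirror() {
    "$@" wget -e robots=off -r -np -nc -w 30 --random-wait \
        -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" \
        'https://file.wikileaks.org/file/clinton-emails/'
}

# Preview only; nothing is downloaded.
resume_mirror echo
```

For the real thing, call resume_mirror with no arguments from the same directory the first run was started in, so wget finds the existing file.wikileaks.org/ tree.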

So what did I get “so far”?

ems@OdroidN2:~clinton-emails$ du -ks *
848 Clinton_Email_August_Release
364 Clinton_Email_December_Release
72 Clinton_Email_February_13_Release
76 Clinton_Email_February_19_Release
112 Clinton_Email_February_26_Release
208 Clinton_Email_February_29_Release
116 Clinton_Email_January_29_Release
156 Clinton_Email_January_7_Release
168 Clinton_Email_July_Release
232 Clinton_Email_June_Release
40 Clinton_Email_May_Release
600 Clinton_Email_November_Release
520 Clinton_Email_October_Release
456 Clinton_Email_September_Release
16 Litigation_F-2016-07895
16 Litigation_F-2016-07895_10
16 Litigation_F-2016-07895_11
16 Litigation_F-2016-07895_12
12 Litigation_F-2016-07895_13
8 Litigation_F-2016-07895_14
20 Litigation_F-2016-07895_15
16 Litigation_F-2016-07895_16
12 Litigation_F-2016-07895_17
12 Litigation_F-2016-07895_18
16 Litigation_F-2016-07895_19
20 Litigation_F-2016-07895_2
16 Litigation_F-2016-07895_20
12 Litigation_F-2016-07895_21
8 Litigation_F-2016-07895_22
8 Litigation_F-2016-07895_23
8 Litigation_F-2016-07895_24
8 Litigation_F-2016-07895_25
8 Litigation_F-2016-07895_26
8 Litigation_F-2016-07895_27
8 Litigation_F-2016-07895_28
8 Litigation_F-2016-07895_29
48 Litigation_F-2016-07895_3
8 Litigation_F-2016-07895_30
12 Litigation_F-2016-07895_31
8 Litigation_F-2016-07895_32
16 Litigation_F-2016-07895_4
48 Litigation_F-2016-07895_5
20 Litigation_F-2016-07895_6
64 Litigation_F-2016-07895_7
16 Litigation_F-2016-07895_8
40 Litigation_F-2016-07895_9
48 Nov03_2016
16 Nov04_2016
8 Powell_9-23-2016
8 index.html
4 readme.txt

The ones with 8 KB size are just holding the index.html for that directory, so you can see which directories were built but not yet populated. I can, I suppose, go look at those index files and figure out how many files are missing in any one directory. But there’s a faster way:

“A Find is a terrible thing to waste. -E.M.Smith”

The find command is your friend, even if it is a pain to be around and hard to understand…

ems@OdroidN2:~clinton-emails$ find . -type f -print | wc -l
80
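
As an aside, the slower route, reading each directory's index.html and comparing it with what is on disk, could look something like the sketch below. It assumes the saved index pages list each file as a plain href="..." link, the way a stock web-server directory listing does; that has not been checked against the real pages, and count_missing is a made-up name:

```shell
# count_missing TOPDIR: for each subdirectory holding an index.html, print
# how many files the index lists, how many are on disk, and the difference.
count_missing() {
    for dir in "$1"/*/; do
        idx="${dir}index.html"
        [ -f "$idx" ] || continue
        # Count href targets that are files, i.e. that do not end in "/".
        listed=$(grep -o 'href="[^"]*"' "$idx" | grep -vc '/"$')
        # Count regular files in this directory, leaving out the index itself.
        have=$(find "$dir" -maxdepth 1 -type f ! -name index.html | wc -l | tr -d ' ')
        printf '%s listed=%d have=%d missing=%d\n' \
            "$(basename "$dir")" "$listed" "$have" "$((listed - have))"
    done
}

# Usage, from inside the clinton-emails directory:
#   count_missing .
```

The -maxdepth option is a GNU/BSD find extension rather than strict POSIX, which is fine on a typical Linux box.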

So I got 80 regular files before it crapped out. Since there are about 30,000, that’s 375 days IFF I’m only banned until the next day after each run.
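
The arithmetic behind that estimate, as a small sketch (30,000 is the rough file count from above; the function names are illustrative). For comparison, it also shows how long the 30 second wait alone would take if there were no lockouts at all:

```shell
# runs_needed TOTAL PER_RUN: number of runs (rounded up) to fetch TOTAL files
# when each run gets PER_RUN files before the lockout.
runs_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# wait_days TOTAL WAIT_S: whole days spent just waiting WAIT_S seconds per file.
wait_days() {
    echo $(( $1 * $2 / 86400 ))
}

runs_needed 30000 80    # prints 375
wait_days 30000 30      # prints 10
```

So the lockout, not the polite delay, is what stretches the job from roughly ten days to about a year.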

I think I need a better camo pattern for my scrape…

If anyone has a better idea on how to mirror a Wikileaks sub-tree, I’d love to hear it. Clearly my wget Foo is not strong enough (yet…)

It does look a lot like Wikileaks wants to know about, control, and finger anyone who mirrors what they have, per:

https://pastebin.com/6q6E3Z0C

#### tutorial about how to set up a wikileaks mirror on Debian with Apache webserver ####
#### see http://213.251.145.96/mass-mirror.html for further information ####

# create a DNS entry (named ‘wikileaks.mydomain.com’ here) in your DNS provider interface
# make it reach to your webserver IP

# login to your webserver

# create the user that WL will use (named ‘wikileaks’ here)
useradd -m -d /home/wikileaks wikileaks

# set the wikileaks ssh public key
mkdir /home/wikileaks/.ssh
wget http://213.251.145.96/id_rsa.pub -O /home/wikileaks/.ssh/authorized_keys
[…]
# go to http://46.59.1.2/mass-mirror.html to register your mirror
# for the name used in this example, the fields value would be :
# login : ‘wikileaks’
# password : empty (since we use ssh here)
# absolute path : /var/www/wikileaks.mydomain.com
# hostname : wikileaks.mydomain.com
# checkbox : that’s up to you ;)

# after submitting, the WL teams will begin to send updates through ssh and rsync

So, OK, I can see why they might not want Bad Actors (State or otherwise) to make fake mirrors of Wikileaks, and why they want to assure reputation and data quality, but…

It also means a “capture the king” strategy can lead to all those other ‘authorized’ copies.

It also is a big pile of effort, to mirror ALL of wikileaks, when you just want one, off-the-record and incognito, copy of part of it.

So I’m going on to Morning Coffee #2 while I have a bit of a think… Also, since I’m locked out, I can’t read their “go here for more information” page to find out more… This is another case where a VPN would be helpful, for bypassing IP timers / lockouts.

About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in Tech Bits.

17 Responses to Scraping Wikileaks – A Partial Story

  1. gallopingcamel says:

    This is what the “Left” does. They kill the messenger (e.g. Julian Assange or Snowden) to avoid having to defend their criminality.

    This strategy can only work when the “Media” are corrupt. Donald Trump summed it up when he said the “Media” are the enemies of the people.

  2. p.g.sharrow says:

    That bad “gateway” is something I get often with my setup. I’m on an old satellite modem, a hundred feet of wire and an old router. If any one gets flaky, generally from heat-caused throttling, I get that message. Sometimes the modem or router forgets who it is, refuses to work, and I have to reboot the system.
    Maybe all of your security looks like a black hole or a government agency from their end…pg

  3. philjourdan says:

    I do not much care for Snowden. He reaped what he sowed. But I really wanted Assange Pardoned. Not because I think he broke any laws (but that is irrelevant to the deep state), but because his persecution is pure government terrorism. He never was here. He never touched a thing here. He is not a citizen. And he did no more than the NY Crimes and WaPo with the Pentagon papers. The difference is, he did it to the left.

  4. WatchinIt says:

    Perhaps could grab the “go here for more information” and grab the file/screenshot if you give me directions. DM or here. Course if that worked you’d reach out to someone with creds ;-)

  5. WatchinIt says:

    @philjourdan
    Actually it was the Trump Administration that went after Assange – spearheaded by Mike Pompeo and Jeff Sessions, with able assistance from Pence, and Richard Grenell (under direct orders from Trump) worked to pry him out of the Ecuadorean Embassy. The Obama DOJ investigated Assange since 2010, but didn’t charge him. Chelsea Manning was court-martialled and sentenced to 35 yrs, which Obama commuted in 2017. The only charges arising from the Clinton email leaks were Robert Mueller’s indictment of 12 Russians and 3 Russian companies as part of his investigation of Russian interference in the 2016 election.

    From Greenwald’s piece on the Trump DOJ charges of Assange, unsealed Apr 2019.. https://theintercept.com/2019/04/11/the-u-s-governments-indictment-of-julian-assange-poses-grave-threats-to-press-freedoms/

    “The first crucial fact about the indictment is that its key allegation — that Assange did not merely receive classified documents from Chelsea Manning but tried to help her crack a password in order to cover her tracks — is not new. It was long known by the Obama DOJ and was explicitly part of Manning’s trial, yet the Obama DOJ — not exactly renowned for being stalwart guardians of press freedoms — concluded that it could not and should not prosecute Assange because indicting him would pose serious threats to press freedom. In sum, today’s indictment contains no new evidence or facts about Assange’s actions; all of it has been known for years.”

  6. gallopingcamel says:

    You guys just proved what I said earlier.

    You expressed your opinions about Assange and Snowden but failed to mention the corruption that Assange exposed in the DNC and the fact that the murder of Seth Rich was plausibly part of the “Cover Up”.

    Never mind what you think of Snowden or his motives. The important question is why are three letter agencies spying on American citizens? Note that I used the present tense since there has been no attempt to correct the abuses.

  7. E.M.Smith says:

    @G.C.:

    Because they can, and there is no countervailing power to stop them.

  8. Rienk says:

    Just for fun I tried to download it as well. Since I’m on windows (yeah I know), I use Winhttrack.
    I think it is a front end gui for wget. Didn’t work with standard setting so I set it to ignore robots.txt and I changed the user agent to some random Palemoon version. After 40 or so minutes I’m 6000 files in at 60kB/min. So far 0 errors.

  9. philjourdan says:

    @Watchinit – Sorry, you are wrong. The deep state went against Assange. Trump did not. And Obama was using the server as his blackmail on Clinton. So why bother with Assange? The bottom line is that Assange is a threat to the deep state and McConnell blackmailed Trump into NOT pardoning him. Those are the facts.

    And GC – while I do not agree with all exposure (especially the northern type), I welcome ALL exposure. Snowden did his job. His fault? The old USSR. Sorry, that is not exposure, that is espionage.

  10. philjourdan says:

    @Rienk – you can download a Windows wget. It is CLI, but you see more of the operation.

  11. Rienk says:

    @ philjourdan says: 26 January 2021 at 11:44 pm
    Thanks, I sort of remember that from long ago when I was looking into scraping websites. Being a point and click guy, first computer was an Atari 1040ST, so I settled on httrack.

  12. philjourdan says:

    @Rienk – yea we have chewed the same road. I have found that almost every Linux trick is available on Windows. Had to! Company requires Windows. So I search for those things in a Windows version and find them.

    I have WSL on my home computer. But sadly, my work computer is a 2018 version that will not accept WSL! I work for a backward company.

  13. Rienk says:

    @philjourdan, On the other hand, I can fully understand people not wanting to go to windows 10. Trying to turn off telemetry is not for the faint of heart and networking has changed enough to cause some serious head scratching.

    Starting with linux is difficult because I don’t have a mental map of how the system is laid out. I can follow a recipe but when it doesn’t work….. So I do have a raspberry pi working as a time server and as a pihole web filter with dns via unbound. I can even get gentoo installed on a pi but audio only worked in mono and distorted and I haven’t a clue where to even start looking.

  14. E.M.Smith says:

    @Rienk:

    Audio used to be fairly reliable, then Poettering made “Pulse Audio” that’s a Swiss Army Knife of audio to please folks who have no life beyond audio systems… and made it a PITA for Regular Folks to just have regular audio.

    MY guess is you have Pulse Audio installed and need to learn how to use and configure it. (I prefer Alsa https://alsa-project.org/wiki/Main_Page as in my experience it generally just works…)

    Alternatively, find out how to remove PulseAudio and install Alsa…

    https://forums.linuxmint.com/viewtopic.php?t=70675

    How To: Remove PulseAudio & replace it with ALSA (Mint 10)
    Post by mads » Sat Apr 16, 2011 6:28 am

    The following How To was originally posted on the thread: Is Pulse Audio the worst tragedy in the history of Linux?

    Please note:
    – There is no reason to remove PulseAudio unless you are having some issues with it.
    – This how to is only for Linux Mint 10 users. (May work with earlier or later versions, not tested.)
    – LMDE users: please refer to this guide.

    The prior thread link Points to:

    Re: Is Pulse Audio the worst trajedy in the history of Linux
    Post by mads » Thu Nov 11, 2010 7:35 pm

    Nick_Djinn wrote:
    I mean, some people need it, which is why it should be in the repos, or even in the software center with special scripting instructions for fully changing over to PA if your need it…..but it just breaks too many peoples systems for it to be included by default on an operating system intended for noobs who dont know how to trouble shoot.
    Couldn’t be more agreed..!

    Nick_Djinn wrote:
    Sound working perfectly with ALSA in KDE.

    Poettering, the same person who brought you SystemD….

  15. Rienk says:

    @E.M.Smith, Thanks for the links! I just looked, and it is Pulse Audio. Now I have some learning to do. I’ve read about the fun you had with systemd. I think I’ll go the ALSA route.

    R

  16. E.M.Smith says:

    @Rienk:

    You are most welcome…

    FWIW, when you look at a bit of code and everyone is complaining about it but a few who love it, that’s usually a good clue that it’s crap code. Poettering has made 2 Giant Examples that have exactly that kind of profile. A lot of folks who hate it, and it is known to cause “issues” while being big, fat, complex and not that reliable. FWIW, I’d not hire him as a programmer.

  17. Jim Masterson says:

    @E.M.Smith

    >>
    FWIW, I’d not hire him as a programmer.
    <<

    Yeah, and I wouldn’t hire me as a programmer either. But I enjoy coding.

    Jim
