Last modified: 2014-03-11 13:12:31 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T64468, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 62468 - Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
Status: UNCONFIRMED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown (Other open bugs)
Version: 1.23.0
Hardware: All
OS: All
Importance: Lowest enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2014-03-10 02:26 UTC by Nathan Larson
Modified: 2014-03-11 13:12 UTC
CC List: 2 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Nathan Larson 2014-03-10 02:26:21 UTC
I propose to add an option to have the Internet Archiver (and/or other robots) retrieve raw wikitext. This way, if a wiki goes down, it will be easier to create a successor wiki by gathering that data from the Internet Archive. As it is now, all that can be obtained are the parsed pages. That's okay for a static archive, but one might want to revive the wiki for further editing.

I would say that someone should write a script to convert the parsed pages retrieved from the Internet Archive back into wikitext, but that will run into problems with templates and such, unless it's designed to identify them and recreate them. It would be a much easier and cleaner solution to just make the wikitext available from the get-go.
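
For illustration, here is a minimal sketch of such a "gather the wikitext" script, assuming the wiki exposes the standard MediaWiki API and index.php under /w/ (example.org is a placeholder, not part of the proposal):

import requests

API = "https://example.org/w/api.php"      # placeholder wiki
INDEX = "https://example.org/w/index.php"  # placeholder wiki
session = requests.Session()

def all_pages():
    # Walk list=allpages with API continuation to yield every page title.
    params = {"action": "query", "list": "allpages",
              "aplimit": "max", "format": "json"}
    while True:
        data = session.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def raw_wikitext(title):
    # action=raw returns the current wikitext of the page as plain text.
    return session.get(INDEX, params={"title": title, "action": "raw"}).text

for title in all_pages():
    with open(title.replace("/", "_") + ".wiki", "w", encoding="utf-8") as f:
        f.write(raw_wikitext(title))

Of course, this only works while the wiki is still up; the point of the request is that the Internet Archive would already hold the same raw pages by the time the wiki goes down.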
Comment 1 Nathan Larson 2014-03-10 02:29:05 UTC
I was going to say, there should also be an option to let the Archiver access Special:AllPages or a variant of it, so that all the pages can be easily browsed; currently, when browsing archived pages, it often seems necessary to find the page one is looking for by going from link to link, category to category, etc.
Comment 2 Nathan Larson 2014-03-10 03:06:13 UTC
Theoretically, you could put something in your robots.txt allowing the Internet Archiver to index the edit pages: https://www.mediawiki.org/wiki/Robots.txt#Allow_indexing_of_edit_pages_by_the_Internet_Archiver

I'm not sure how well the particular implementation suggested there works, though; from what I can tell, it doesn't. Also, most archived wiki pages I've seen haven't had an "Edit" link.
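
For reference, the directives that manual section describes are roughly of this shape (a sketch, not a verbatim copy; the linked page is authoritative):

User-agent: ia_archiver
Allow: /*&action=edit

ia_archiver is the user-agent string the Wayback Machine's crawler honors in robots.txt.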
Comment 3 Nemo 2014-03-10 07:32:58 UTC
It makes little sense to "archive" wikitext via action=edit; there is action=raw for that. But the IA crawler won't follow action=raw links (there are none), and as you say, there is no indication that fetching action=edit would work.
I propose two things:
1) install Heritrix and check whether it can fetch action=edit: if not, file a bug and see what they say; if yes, ask the IA folks on the "FAQ" forum and see what they say;
2) just download such data yourself and upload it to archive.org: you only need wget --warc (sketched below) and can then upload it in your favorite way.
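
A sketch of option 2, assuming you have already built a list of action=raw URLs to fetch (for instance with a script like the one in the bug description); the file names are placeholders:

wget --warc-file=mywiki-raw --input-file=raw-urls.txt

This writes mywiki-raw.warc.gz alongside the downloaded files, and that WARC is what you would upload to an archive.org item.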
Comment 4 Nathan Larson 2014-03-10 14:01:01 UTC
I suspect most MediaWiki installations have robots.txt set up as recommended at [[mw:Manual:Robots.txt#With_short_URLs]], with

User-agent: *
Disallow: /w/

See for example:

* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw

So, they couldn't retrieve action=raw even if they wanted to. In fact, if I were to set up a script to download it, might I not be in violation of robots.txt, which would make my script an ill-behaving bot? I'm not sure my moral fiber can handle an ethical breach of that magnitude. However, some sites do allow indexing of their edit and raw pages, e.g.

https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=edit
https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=raw
https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=edit
https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=raw

Dramatica and RationalWiki use all kinds of secret sauce, though, so who knows what's going on there. Normally, edit pages have a <meta name="robots" content="noindex,nofollow" /> tag, but that's not the case with Dramatica or RationalWiki edit pages. Is there some config setting or extension that changes the robot policy on edit pages? Also, I wonder if they had to tell the Internet Archive to archive those pages, or if the Internet Archive just did it on its own initiative.
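
A quick way to check this for any particular wiki (example.org and the page title are placeholders) is to fetch an edit view and look for the robots meta tag:

import requests

html = requests.get("https://example.org/w/index.php",
                    params={"title": "Main_Page", "action": "edit"}).text
# MediaWiki normally marks edit views noindex,nofollow via a robots meta tag
# like the one quoted above.
print('<meta name="robots"' in html)

If this prints False, the edit page is not sending a robots meta tag at all, and indexing is governed only by robots.txt (and any X-Robots-Tag header the server might add).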
Comment 5 Nemo 2014-03-11 11:15:28 UTC
IA doesn't crawl on request.
On what "Allow" directives and other directives do or should take precedence, please see (and reply) on https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-strictly-by-wayback-machine
Comment 6 Nathan Larson 2014-03-11 12:28:18 UTC
(In reply to Nemo from comment #5)
> IA doesn't crawl on request.
> On what "Allow" directives and other directives do or should take
> precedence, please see (and reply) on
> https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-
> strictly-by-wayback-machine

I might reply to that, as more information becomes available. Today, I set my site's robots.txt to say:

User-agent: *
Disallow: /w/

User-agent: ia_archiver
Allow: /*&action=raw

So, I guess a few months from now, I'll see whether the archive of my wiki for 12 March 2014 and thereafter has the raw pages. If not, that's a bug, I think.
Comment 7 Nemo 2014-03-11 12:39:45 UTC
(In reply to Nathan Larson from comment #6)
> So, I guess a few months from now, I'll see whether the archive of my wiki
> for 12 March 2014 and thereafter has the raw pages. If not, that's a bug, I
> think.

Did you include dofollow links to action=raw URLs in your skin?
Comment 8 Nathan Larson 2014-03-11 13:12:31 UTC
(In reply to Nemo from comment #7)
> Did you include dofollow links to action=raw URLs in your skin?

I put this in MediaWiki:Sidebar:

**{{fullurl:{{FULLPAGENAMEE}}|action=raw}}|View raw wikitext

As a backup, I also added a sidebar link to Special:WikiWikitext (per instructions at [[mw:Extension:ViewWikitext]]) just to be sure. Of course, most people won't want to have that on their sidebar. I started a page on this at [[mw:Manual:Internet Archive]].
