Last modified: 2014-03-11 10:45:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T64494, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 62494 - Add exception in robots.txt to allow the Internet Archiver to index action=raw
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: wmf-deployment
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-03-10 16:54 UTC by Nathan Larson
Modified: 2014-03-11 10:45 UTC
CC List: 6 users

Description Nathan Larson 2014-03-10 16:54:59 UTC
Presently, robots.txt has "Disallow: /w/". Thus, the raw wikitext of pages isn't accessible via the Internet Archive; see e.g. https://web.archive.org/web/20140307111730/http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw

This is in contrast to sites like WikiIndex, which allow it: https://web.archive.org/web/20131021230044/http://wikiindex.org/index.php?title=Welcome&action=raw

We should allow the Internet Archiver to index these pages so that the raw wikitext will be available for future generations, even if the page goes away. See [[mw:Manual:Robots.txt#Allow_indexing_of_raw_pages_by_the_Internet_Archiver]].
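
To make the request concrete, here is a minimal Python sketch of how a per-crawler exception could sit alongside "Disallow: /w/". Everything in it is an assumption for illustration: the rule set is hypothetical (it is not Wikimedia's actual robots.txt), the crawler is assumed to identify itself as ia_archiver, and the matching follows the wildcard and longest-match semantics later written down in RFC 9309, which a given crawler may or may not honor.

```python
import re

# Hypothetical rule groups, keyed by lower-cased user-agent token.
# Not the real Wikimedia robots.txt; "ia_archiver" is an assumed token.
RULES = {
    "ia_archiver": [("allow", "/*&action=raw"), ("disallow", "/w/")],
    "*": [("disallow", "/w/")],
}


def _matches(pattern, path):
    """Prefix match where '*' stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


def is_allowed(agent, path):
    """Longest matching pattern wins; on a tie, allow beats disallow."""
    rules = RULES.get(agent.lower(), RULES["*"])
    best = (-1, True)  # (pattern length, allowed); no match means allowed
    for kind, pattern in rules:
        if _matches(pattern, path):
            best = max(best, (len(pattern), kind == "allow"))
    return best[1]
```

Under these assumptions, is_allowed("ia_archiver", "/w/index.php?title=Main_Page&action=raw") is true, while the same path is still refused for any other agent and action=edit is still refused for ia_archiver. Note that the sketch's pattern only matches when action=raw is preceded by "&", so it depends on the parameter order in the URL.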
Comment 1 jeremyb 2014-03-10 17:03:11 UTC
(In reply to Nathan Larson from comment #0)
> We should allow the Internet Archiver to index these pages so that the raw
> wikitext will be available for future generations, even if the page goes
> away.

We already regularly dump the DBs and push those dumps to Internet Archive.

What else does action=raw get us?
Comment 2 Nathan Larson 2014-03-10 17:09:14 UTC
(In reply to jeremyb from comment #1)
> We already regularly dump the DBs and push those dumps to Internet Archive.
> 
> What else does action=raw get us?

I guess it depends; is there a way to get wikitext of individual pages without downloading the whole dump, assuming the page is no longer on-wiki?
Comment 3 Nemo 2014-03-10 18:04:47 UTC
(In reply to jeremyb from comment #1)
> We already regularly dump the DBs and push those dumps to Internet Archive.
> 
> What else does action=raw get us?

Integration in Wayback machine. Not sure it's worth it though.

(In reply to Nathan Larson from comment #2)
> I guess it depends; is there a way to get wikitext of individual pages
> without downloading the whole dump, assuming the page is no longer on-wiki?

Server-side bzgrep? :P Probably not.
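
For comparison, this is roughly what the "download the whole dump" route looks like once the file is local: a Python sketch that streams a bzip2-compressed XML dump and returns one page's wikitext. It assumes the usual MediaWiki export layout (page elements containing title and revision/text); the function name and the sample data are made up for illustration, and it has not been run against a real dump.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def find_wikitext(dump_stream, wanted_title):
    """Return the text of the last <text> element of the page, or None."""
    with bz2.open(dump_stream, "rb") as xml_stream:
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title = text = None
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text
            if title == wanted_title:
                return text
            elem.clear()  # drop the page's children to keep memory low
    return None


if __name__ == "__main__":
    # Tiny made-up dump, compressed in memory to stand in for a real file.
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
        b"<page><title>Main Page</title><revision><text>Welcome</text></revision></page>"
        b"<page><title>Other</title><revision><text>Other text</text></revision></page>"
        b"</mediawiki>"
    )
    print(find_wikitext(io.BytesIO(bz2.compress(sample)), "Other"))
```

The scan is linear in the size of the dump, which is the cost the question in comment #2 is trying to avoid.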
