Last modified: 2014-03-11 10:45:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T64494, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 62494 - Add exception in robots.txt to allow the Internet Archiver to index action=raw


Summary:	Add exception in robots.txt to allow the Internet Archiver to index action=raw

Status:	NEW

Product:	Wikimedia
Classification:	Unclassified
Component:	Site requests (Other open bugs)
Version:	wmf-deployment
Hardware:	All All

Importance:	Low enhancement (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-03-10 16:54 UTC by Nathan Larson
Modified:	2014-03-11 10:45 UTC
CC List:	6 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Nathan Larson 2014-03-10 16:54:59 UTC

Presently, robots.txt has "Disallow: /w/". Thus, the raw wikitext of pages isn't accessible via the Internet Archive; see e.g. https://web.archive.org/web/20140307111730/http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw

This is in contrast to sites like WikiIndex, which allow it: https://web.archive.org/web/20131021230044/http://wikiindex.org/index.php?title=Welcome&action=raw

We should allow the Internet Archiver to index these pages so that the raw wikitext will be available for future generations, even if the page goes away. See [[mw:Manual:Robots.txt#Allow_indexing_of_raw_pages_by_the_Internet_Archiver]].
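
To illustrate the kind of exception being requested, here is a rough sketch (not the actual Wikimedia robots.txt) checked with Python's standard urllib.robotparser. The user-agent token "ia_archiver" and the rule set in ROBOTS_TXT are assumptions made for illustration. Because the standard-library parser only does prefix matching and has no wildcard support, this sketch allows every /w/index.php URL for that agent rather than only action=raw; a real rule would probably need to be narrower.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: an exception for one crawler, placed ahead of
# the general "Disallow: /w/" rule. The agent name is an assumption.
ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /w/index.php
Disallow: /w/

User-agent: *
Disallow: /w/
"""

# Raw-wikitext URL taken from the bug description.
RAW_URL = "http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw"


def can_fetch(agent: str, url: str) -> bool:
    """Return True if the sketch rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        print(agent, can_fetch(agent, RAW_URL))
```

With these rules the archiver's agent may fetch the raw page while other crawlers stay blocked from /w/, which is the behaviour this request is after.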

Comment 1 jeremyb 2014-03-10 17:03:11 UTC

(In reply to Nathan Larson from comment #0)
> We should allow the Internet Archiver to index these pages so that the raw
> wikitext will be available for future generations, even if the page goes
> away.

We already regularly dump the DBs and push those dumps to Internet Archive.

What else does action=raw get us?

Comment 2 Nathan Larson 2014-03-10 17:09:14 UTC

(In reply to jeremyb from comment #1)
> We already regularly dump the DBs and push those dumps to Internet Archive.
> 
> What else does action=raw get us?

I guess it depends; is there a way to get wikitext of individual pages without downloading the whole dump, assuming the page is no longer on-wiki?

Comment 3 Nemo 2014-03-10 18:04:47 UTC

(In reply to jeremyb from comment #1)
> We already regularly dump the DBs and push those dumps to Internet Archive.
> 
> What else does action=raw get us?

Integration with the Wayback Machine. Not sure it's worth it, though.

(In reply to Nathan Larson from comment #2)
> I guess it depends; is there a way to get wikitext of individual pages
> without downloading the whole dump, assuming the page is no longer on-wiki?

Server-side bzgrep? :P Probably no.
