Last modified: 2014-03-10 12:20:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T58893, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 56893 - Let Internet Archive's Wayback machine archive etherpads
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Etherpad
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: http://etherpad.wikimedia.org/robots.txt
Keywords: easy
Depends on:
Blocks:
Reported: 2013-11-11 11:06 UTC by Nemo
Modified: 2014-03-10 12:20 UTC
CC List: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Nemo 2013-11-11 11:06:36 UTC
We all make heavy use of web.archive.org and we're expanding it ([[mw:Archived Pages]]), so let's use it also for Etherpad. 
Akosiaris tells me the current robots.txt is just the default, so this is IMHO a trivially desirable change.

Hopefully, adding this should be enough (https://webarchive.jira.com/browse/HER-1):

User-agent: ia_archiver
Allow: /
Allow: /p/

But once deployed it's easy to check with their new live-retrieving/on-demand saving feature.
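
As a rough offline check of the proposed rules, the following sketch feeds a robots.txt body to Python's standard urllib.robotparser and asks whether ia_archiver may fetch a pad URL. The ia_archiver group is the one proposed above; the catch-all "Disallow: /" group and the pad name "example" are assumptions made only for illustration, since the actual default robots.txt is not quoted in this report.

```python
from urllib.robotparser import RobotFileParser

# The ia_archiver group is the one proposed in this report. The catch-all
# group is an ASSUMPTION standing in for the unquoted default robots.txt.
ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /
Allow: /p/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "example" is a hypothetical pad name.
    pad_url = "http://etherpad.wikimedia.org/p/example"
    print("ia_archiver:", is_allowed(ROBOTS_TXT, "ia_archiver", pad_url))
    print("SomeOtherBot:", is_allowed(ROBOTS_TXT, "SomeOtherBot", pad_url))
```

This only shows how one parser reads the rules; whether the Internet Archive's crawler interprets them the same way still has to be confirmed against the live site.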

More background from #wikimedia-tech:
akosiaris> [...] I must say etherpad.wikimedia.org never was intended for permanent storage. Preservation of a pad is up to the people interested in preserving that pad in another format. The software is well known to corrupt pads (hopefully the latest issues are resolved with 1.3.0 but we never know when others might show up) and restoring a pad from database backups is nigh on impossible. [...]
Nemo_bis> akosiaris: that's what I'm saying :) if we don't plan to make archives, let's let others do so
Comment 1 Alexandros Kosiaris 2014-02-04 12:50:51 UTC
Commenting just to make something clear. Changing the robots.txt will not have the Internet Archive automagically archive pads. The reason being that no links exist for any spider to follow. It might be possible for pads whose links have been posted in various places to be archived but whether that will happen or not depends entirely on IA's spider implementation. The "no links" problem can be solved by having a page list all pads. That in turn could possibly be solved with any of the various pad listing plugins but last we checked none of them were production quality.

Some more info can be found here:
https://bugzilla.wikimedia.org/show_bug.cgi?id=30240
Comment 2 Nemo 2014-02-04 13:06:11 UTC
Yes, I plan to list or submit all publicly known URLs myself later.
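
One way such a submission list could be prepared is sketched below. It only builds URLs and makes no network requests. The "https://web.archive.org/save/" prefix is an assumption about the Wayback Machine's on-demand saving endpoint, the /p/ path is taken from the robots.txt rule proposed in the description, and the pad names are hypothetical.

```python
from urllib.parse import quote

# ASSUMPTION: prefix of the Wayback Machine's on-demand saving endpoint.
SAVE_PREFIX = "https://web.archive.org/save/"
# Pad URLs are assumed to live under /p/ on the host named in this report.
PAD_BASE = "http://etherpad.wikimedia.org/p/"


def save_urls(pad_names):
    """Build one on-demand save URL per pad name (no requests are sent)."""
    # Percent-encode each name so spaces and slashes stay in one path segment.
    return [SAVE_PREFIX + PAD_BASE + quote(name, safe="") for name in pad_names]


if __name__ == "__main__":
    # Hypothetical pad names, for illustration only.
    for url in save_urls(["Example_pad", "my pad"]):
        print(url)
```

The resulting list could then be opened or fetched at a polite rate; any rate limits on the archive's side are not covered here.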
Comment 3 Alexandros Kosiaris 2014-02-04 13:08:55 UTC
Do we know this approach will work?
Comment 4 Nemo 2014-02-04 13:12:07 UTC
(In reply to comment #3)
> Do we know this approach will work?

What do you mean? This bug is currently "Let Internet Archive's Wayback machine archive etherpads", not "Ensure Internet Archive's Wayback machine has copies of all etherpads". As long as retrieval works, this bug can be closed. Enhancing the crawling over their average performance will be a separate effort.
Comment 5 Alexandros Kosiaris 2014-02-04 13:23:32 UTC
Quite true. I was looking at the forest and forgot about the tree. Anyway, I'll submit a patchset to implement this.
Comment 6 Alexandros Kosiaris 2014-03-10 12:20:42 UTC
I believe this is fixed by https://gerrit.wikimedia.org/r/#/c/117845/

I will now close this ticket; feel free to reopen.
