Last modified: 2013-12-21 18:53:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T60758, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 58758 - Prevent mirrors of Wikipedia from including NOINDEXED pages
Status: RESOLVED WONTFIX
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: wmf-deployment
Hardware: All
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-12-20 20:11 UTC by Steven Walling
Modified: 2013-12-21 18:53 UTC
CC List: 11 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Steven Walling 2013-12-20 20:11:41 UTC
This may actually be impossible, but I'm filing a bug to discuss strategies for preventing mirrors of Wikipedia from including pages we NOINDEX. A good example of this is user pages or user talk pages, and the new Draft namespace on English Wikipedia.

Technically speaking these pages are free content just like anything else on Wikipedia (with the exception of fair use images, etc.). However, there are good reasons for us to not want content to be indexed by search engines and found by readers. 

Numerous times, I've had Wikipedians bring up the valid point that mirrors erode our ability to control search indexing, because they mirror content we NOINDEX, but do not replicate the contents of our robots.txt. 

In practice there may be no way to prevent this. Even if that's the case, we should record why that is and WONTFIX this, as a point of reference.
Comment 1 Chad H. 2013-12-20 20:17:32 UTC
(In reply to comment #0)
> This may actually be impossible, but I'm filing a bug to discuss strategies
> for preventing mirrors of Wikipedia from including pages we NOINDEX. A good
> example of this is user pages or user talk pages, and the new Draft namespace
> on English Wikipedia.

I can't think of any possible way to /enforce/ this, nor should we. We definitely shouldn't redact the info from the API or dumps (which I assume are the two most common ways of mirroring us).

Now, we might be able to expose the NOINDEX to reusers and encourage people to respect it, but I can't see any way of preventing people from using the content if they really want it.
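
For illustration, a reuser could already check whether a given page carries __NOINDEX__ through the action API's pageprops query. A rough sketch, assuming the standard /w/api.php endpoint and the 'noindex' page property name (both assumptions, not something stated in this bug):

// Sketch only: ask the action API for the "noindex" page property that
// __NOINDEX__ sets. Endpoint and property name are assumptions.
async function isNoindexed(title: string): Promise<boolean> {
  const params = new URLSearchParams({
    action: "query",
    prop: "pageprops",
    ppprop: "noindex",
    titles: title,
    format: "json",
  });
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  const data = await res.json();
  const pages = data?.query?.pages ?? {};
  // With the default format, pages is keyed by page ID; the property is
  // present (typically with an empty value) only when the page is NOINDEXed.
  return Object.values(pages).some(
    (p: any) => p.pageprops && "noindex" in p.pageprops
  );
}

isNoindexed("Draft:Example").then((flag) => console.log(flag));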
Comment 2 Chad H. 2013-12-20 20:17:58 UTC
(Also, this isn't really a search thing for me and Nik. __NOINDEX__ is core, not Cirrus)
Comment 3 James Forrester 2013-12-20 20:28:59 UTC
Maybe we could drop the NOINDEX'ed namespaces (and maybe even pages) from the primary dumps?

However, looking at it, the main dump people use seems to be pages-articles-multistream, which is only : (main), Template:, File:, Category: and Project: – so Draft: wouldn't enter?
Comment 4 Chad H. 2013-12-20 20:36:36 UTC
(In reply to comment #3)
> Maybe we could drop the NOINDEX'ed namespaces (and maybe even pages) from the
> primary dumps?
> 
> However, looking at it, the main dump people use seems to be
> pages-articles-multistream, which is only : (main), Template:, File:, Category:
> and Project: – so Draft: wouldn't enter?

Indeed. And we wouldn't want to drop them from the "full" dump since they do still need to get dumped :)

I doubt Ariel wants a new "full dump except things that are NOINDEXED" :)
Comment 5 Derk-Jan Hartman 2013-12-20 20:47:02 UTC
OK, this is 'crazy' I'm sure, but it's the only thing I could come up with.

Make NOINDEX add the *.domainname into the page_props of the article DB row. Have NOINDEX also add this value into the HTML structure of the page, not visible but with an id.

DB gets dumped.
DB gets imported.

A different host + MediaWiki + the dump leads to the same wikipedia.org domain name being rendered into the HTML structure.

JS in MediaWiki core checks pages for the presence of the hidden noindex element. If it finds that the value doesn't match what it expects, the JS blanks the page.

For scraped content, the only thing I can come up with is to put this noindex element somewhere smack in the middle of the content (still hidden, of course), have it contain the blanking script fully inline, and then hope they scrape the HTML DOM instead of the text content and are careless enough not to filter out scripts :D
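
A minimal sketch of the client-side half of this idea, purely to make the mechanism concrete; the element id, the data attribute, and the check below are hypothetical names, not anything MediaWiki actually emits:

// Illustrative only: if the hidden origin marker the parser supposedly embedded
// does not match the host actually serving the page, blank the content.
// "mw-noindex-origin" and "data-origin-host" are made-up names for this sketch.
function blankIfMirrored(): void {
  const marker = document.getElementById("mw-noindex-origin");
  if (!marker) {
    return; // no marker present, nothing to verify
  }
  const expectedHost = marker.getAttribute("data-origin-host");
  if (expectedHost && window.location.hostname !== expectedHost) {
    // Served from somewhere other than the wiki that produced this content.
    document.body.textContent = "";
  }
}

document.addEventListener("DOMContentLoaded", blankIfMirrored);

Any mirror that strips scripts or the marker defeats this, as the comment itself concedes.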
Comment 6 Bawolff (Brian Wolff) 2013-12-21 02:07:56 UTC
I don't think we should prevent people from downloading it if they want it. All (non-deleted) content should be available for download. We could maybe encourage people to download the articles-only dump, but it's important that all content is available.
Comment 7 Chad H. 2013-12-21 07:54:30 UTC
Comment 5 made me cry and I would revert any such hackery on the spot :p

(In reply to comment #6)
> I don't think we should prevent people from downloading it if they want it.
> All (non-deleted) content should be available for download. We could maybe
> encourage people to download the articles-only dump, but it's important that
> all content is available.

This.
Comment 8 Matthew Flaschen 2013-12-21 18:36:25 UTC
(In reply to comment #0)
> This may actually be impossible, but I'm filing a bug to discuss strategies
> for preventing mirrors of Wikipedia from including pages we NOINDEX. A good
> example of this is user pages or user talk pages, and the new Draft namespace 
> on English Wikipedia.

Preventing them is a WONTFIX.

For reference, the user namespace is not NOINDEXed by default on English Wikipedia, though __NOINDEX__ works.

> Technically speaking these pages are free content just like anything else on
> Wikipedia (with the exception of fair use images, etc.).

Yes, this (along with the Right to Fork) is why we must not do this.  If we exclude the pages from the dumps, it will make the freedom of the content much less meaningful.  It would also encourage people to mirror by crawling the HTML (or even worse, mirroring it live), which is a poor practice and loses a lot of information from the dumps.

> Numerous times, I've had Wikipedians bring up the valid point that mirrors
> erode our ability to control search indexing, because they mirror content we
> NOINDEX, but do not replicate the contents of our robots.txt. 

Free content means giving up some control over what people do with it.  The edit screen used to say, "If you do not want your writing to be edited mercilessly and redistributed at will, do not submit it."  It no longer says that, but it's just as true under our current licenses.

Wikipedia has a high overall search engine ranking, and sites simply mirroring drafts (which by definition are generally not ready for prime time) probably won't rank that high. But I accept that this could change, that it does not apply to many other sites, and that there are probably exceptions even on Wikipedia.

People have to comply with our license (attribution, stating license, etc.), but they are allowed to distribute everything with or without marking it NOINDEX.  It is reasonable to encourage mirrors to preserve the robot policies on their own HTML output, though.

Since a3aac44 in 2010 (pages last saved before then don't seem to have it, judging by a check of the akwiki dump), __NOINDEX__ and __INDEX__ have been stored in the page_props table (along with all other __DOUBLEUNDERSCORE__ magic words). That table is dumped, so it is relatively easy to check the flag on a per-page basis.
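
As an illustration, a script could scan the dumped page_props table for that property; a rough sketch, assuming the usual gzipped SQL dump layout, the 'noindex' property name, and a hypothetical filename:

// Sketch: scan a gzipped page_props SQL dump for rows whose pp_propname is
// 'noindex'. Assumes the usual "INSERT INTO `page_props` VALUES (...)" rows with
// columns starting (pp_page, pp_propname, pp_value, ...).
import { createReadStream } from "node:fs";
import { createGunzip } from "node:zlib";
import { createInterface } from "node:readline";

async function noindexedPageIds(dumpPath: string): Promise<number[]> {
  const lines = createInterface({
    input: createReadStream(dumpPath).pipe(createGunzip()),
    crlfDelay: Infinity,
  });

  const ids: number[] = [];
  const row = /\((\d+),'noindex',/g; // pp_page followed by pp_propname = 'noindex'

  for await (const line of lines) {
    if (!line.startsWith("INSERT INTO")) continue;
    for (const match of line.matchAll(row)) {
      ids.push(Number(match[1]));
    }
  }
  return ids;
}

// Hypothetical filename; substitute the actual page_props dump for the wiki in question.
noindexedPageIds("enwiki-latest-page_props.sql.gz").then((ids) =>
  console.log(`${ids.length} NOINDEXed pages`)
);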

I don't think the namespace robot policies are currently anywhere in the dump.  I've filed this as bug 58805.
Comment 9 Matthew Flaschen 2013-12-21 18:53:30 UTC
It doesn't look like the commit hash link worked.  It's https://git.wikimedia.org/commitdiff/mediawiki%2Fcore.git/a3aac44f0481fb635877f161b8208ba830e83a78 .


