Last modified: 2014-06-29 00:33:07 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T50856, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 48856 - Prevent zero.wikipedia.org from being indexed by search engines
Status: REOPENED
Product: MediaWiki extensions
Classification: Unclassified
Component: ZeroPortal
Version: unspecified
Hardware: All All
Importance: High major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-05-27 12:03 UTC by Benjamin Chen
Modified: 2014-06-29 00:33 UTC
CC List: 5 users

Description Benjamin Chen 2013-05-27 12:03:20 UTC
https://www.google.com/search?q=site:en.zero.wikipedia.org

Sometimes searching for certain terms displays a zero.wikipedia.org URL as the first result. This should be avoided, as users on unsupported carriers (or non-mobile users) have no direct way to jump to the standard site.
Comment 1 Jon 2013-05-28 00:22:05 UTC
https://bugzilla.wikimedia.org/show_bug.cgi?id=35233

Unfortunately, without a cache flush these pages will stay in search results for some time, but they will eventually disappear.

*** This bug has been marked as a duplicate of bug 35233 ***
Comment 2 MZMcBride 2013-05-28 01:01:13 UTC
I don't understand how Google was able to index these pages. Isn't the page content restricted to specific IP addresses? How was Googlebot able to index the page content?

While it's difficult to test, it appears that pages such as <http://en.zero.wikipedia.org/wiki/Britney_Spears> do not have a "noindex" directive within them.
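
[Editor's sketch] One way to check a saved copy of a page for a "noindex" directive is to parse its HTML and look for a robots meta tag. The following is a minimal sketch in Python, not the check anyone in this thread actually ran; the class and function names are hypothetical, and it only inspects `<meta name="robots">` tags, not HTTP headers or robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html_text):
    """Return True if the HTML carries a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives
```

Calling `has_noindex()` on the saved HTML of a zero page would show whether such a tag is present in that copy.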
Comment 3 Jon 2013-05-28 16:44:27 UTC
I can't speak to the IP address content restriction (although personally I don't understand it and think that, at the very least, there should be a link rather than the current broken experience. What if someone shares a link, for example?). I also believe that if people share a zero link that is the same as a normal Wikipedia page link, it will boost the Wikipedia page's ranking.

Anyway, in terms of indexing, if Google finds a page, the first thing it should do is look for the canonical link tag [1]. If it finds one, then instead of indexing the current page (the zero one in this case) it boosts the non-zero page's ranking.

If you look at the cached versions of these indexed pages, the canonical link tag is not there, so they got indexed (see bug 35233). When the page HTML for these pages is rewritten, it will have the canonical URL, and they will disappear from search results without any further work.
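
[Editor's sketch] To see whether a given copy of a page carries a canonical link tag, the HTML can be scanned for `<link rel="canonical">`. This is a minimal Python sketch under that assumption; the names `CanonicalLinkParser` and `canonical_url` are hypothetical, and it says nothing about how Google itself processes the tag.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr.get("href")


def canonical_url(html_text):
    """Return the canonical URL declared in the HTML, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical
```

A result of `None` for a cached zero page would match the situation described here: no canonical tag, so the page was indexed in its own right.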

As far as I know, we don't set a noindex directive, and I don't believe we should. Since people share links, and might share a zero link via some other service (which may also be free of data charges), I think we should improve the experience for users who land on this page but are not on zero. Instead of wiping content, we should either automatically redirect or show a different banner linking to the original content.

So with this in mind, MZ, should a new bug be created, or do we still want to brute-force this via a noindex?

[1] http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139394
Comment 4 MZMcBride 2013-05-28 23:50:53 UTC
This appears to be related to Gerrit change #64629 (I63a4542c9792e4979f2a9668d0a5c858f21f591b).
Comment 5 MZMcBride 2013-05-29 00:02:40 UTC
(In reply to comment #3)
> So with this in mind MZ should a new bug be created [...]

I filed bug 48921 to track the issue I think you were describing.
Comment 6 dr0ptp4kt 2013-05-29 21:42:14 UTC
We will be setting up a no-index rule for zero.wikipedia.org requests. The business team confirmed that zero.wikipedia.org pages are not supposed to be in the Google index. I prefer that we have a re-crawl of the site first to help get Google's existing canonical links updated. But eventually, the business team wants zero.wikipedia.org out of the search index completely.
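
[Editor's sketch] A host-based no-index rule could look something like the Python sketch below, which returns an `X-Robots-Tag: noindex` response header for requests to zero hosts. This is an illustration only: the thread does not say which mechanism the Gerrit changes use, and the function name and host pattern are assumptions.

```python
import re

# Assumed pattern: zero.wikipedia.org with an optional <language> prefix.
ZERO_HOST = re.compile(r"(?:[a-z0-9-]+\.)?zero\.wikipedia\.org", re.IGNORECASE)


def robots_headers_for_host(host):
    """Return extra response headers for the given Host header value."""
    # Drop an optional ":port" suffix before matching.
    hostname = host.split(":", 1)[0].strip()
    if ZERO_HOST.fullmatch(hostname):
        return {"X-Robots-Tag": "noindex"}
    return {}
```

With this rule, `en.zero.wikipedia.org` gets the noindex header, while `en.m.wikipedia.org` and `en.wikipedia.org` are left untouched.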

MZMcBride, as to why the pages were indexed, from what I can tell:

* At some point, a code change resulted in article content other than the "Sorry" warning being echoed into the <language>.zero.wikipedia.org pages below the warning (making them on par, if I understand correctly, with <language>.m.wikipedia.org pages sans the warning).
* With the fulltext content from each <language>.zero.wikipedia.org page, Google's crawlers were able to discover more links.
* In the absence of a canonical link for each <language>.zero.wikipedia.org page, Google's algorithms wouldn't have had a perfect, non-heuristic means of identifying the pages as being the same. The heuristics seem to have correctly classified a number of pages as dupes, but not all of them, judging by a site:en.zero.wikipedia.org Google search, for example.

My Gerrit change #64113 was introduced to stop content from being echoed below the "Sorry" warning. This, in concert with Jon's Gerrit change #61809, will allow the Google index to self-correct, although, as you note, my Gerrit change #64629 provides the means to have no indexing whatsoever.
Comment 7 dr0ptp4kt 2013-06-18 23:11:15 UTC
<cross-posted from mailing list>

Update: 
We've added an enhancement to Wikipedia Zero so that if a user who isn't on a participating carrier network navigates to a Wikipedia Zero page on <language>.zero.wikipedia.org, such as http://en.zero.wikipedia.org/wiki/Muse_%28band%29 , the user will be presented with an option to visit the canonical URL of the article. If clicked, the canonical URL should get the user to the mobile or desktop version of the page, based on device type.

We're hoping that by next week the Google index will be refreshed so as to correctly mark the <language>.zero.wikipedia.org pages as duplicate pages in the omitted section. Upon confirmation of as much, the current plan is to introduce https://gerrit.wikimedia.org/r/#/c/69420/ to prevent indexing of <language>.zero.wikipedia.org altogether.
Comment 8 dr0ptp4kt 2013-06-27 00:54:04 UTC
<cross-posted from mailing list>

Okay, it looks like the index of zero.wikipedia.org pages in Google has shrunk by some 20 million entries. Nonetheless, a number of really old pages (e.g., going back to 6-May-2013) are still in the Google index with article text. I'll set a reminder to check on the Google index again in 30 days, and hopefully we can finally put the no-index rules in place at that time.

The good news is that many of the pages are now correctly suppressed in natural search as non-canonical pages. In other words, a user would need to go through omitted results or do a site:<domain> search to see them.
Comment 9 dr0ptp4kt 2013-10-23 19:52:15 UTC
Ineligible zerodot pageview attempts are now redirected to Special:ZeroRatedMobileAccess. So even the robots.txt-defined ineligible pages on zerodot are bound to fall out of specific search engine query results (e.g., site:zero.wikipedia.org is currently a way to google for pages that have been previously indexed with content "below the [warning] fold").
