Last modified: 2013-06-28 22:05:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T37233, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 35233 - Mobile sites being indexed by search engines
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Jon
Duplicates: 50400
Depends on:
Blocks:

Reported: 2012-03-14 22:14 UTC by MZMcBride
Modified: 2013-06-28 22:05 UTC (History)
10 users (show)

Attachments
https://www.google.com/search?ix=acb&sourceid=chrome&ie=UTF-8&q=disable+gnu+generalpublic+license#hl=en&sclient=psy-ab&q=disable+gnu+general+public+license&oq=disable+gnu+general+public+license&aq=f&a (24.80 KB, image/png)
2012-04-13 02:25 UTC, Patrick Reilly

Description MZMcBride 2012-03-14 22:14:03 UTC
Sites such as <http://en.m.wikipedia.org> are currently being indexed by search engines such as Google.

I don't believe having these mobile sites indexed is necessary or appropriate.

I'd like to see the sites marked as "no index", via robots.txt or a <meta> tag or whatever other reliable method is available. I thought this was already the case, but it clearly is not: <https://www.google.com/search?q=site%3Aen.m.wikipedia.org>.
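
As a rough illustration of the robots.txt option mentioned above, the following sketch uses Python's standard `urllib.robotparser` to test whether a crawler would be allowed to fetch a mobile URL. The robots.txt content and the helper name `may_crawl` are hypothetical, not what Wikimedia serves. Note that robots.txt only controls crawling; a blocked URL may still be listed by a search engine if other pages link to it, which is why a "noindex" <meta> tag is the other option raised here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a mobile host (an assumption for illustration,
# not the file actually served at en.m.wikipedia.org).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://en.m.wikipedia.org/wiki/Main_Page"
    print(may_crawl(ROBOTS_TXT, "Googlebot", url))
```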
Comment 1 MZMcBride 2012-03-14 22:20:14 UTC
The mobile site previously had "<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />". This was removed in r113378.
Comment 2 Patrick Reilly 2012-03-14 22:20:47 UTC
We removed the NOINDEX, NOFOLLOW at Google's request. They want to index the mobile site for their mobile search index.
Comment 3 AlexSm 2012-03-22 20:48:17 UTC
Does the mobile site support NOINDEX and MediaWiki:Robots.txt? There is a complaint here: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#User_talk_pages_not_NOINDEXed_for_mobile_site
Comment 4 MZMcBride 2012-04-05 22:15:21 UTC
(In reply to comment #3)
> Does the mobile site support NOINDEX and MediaWiki:Robots.txt? There is a
> complaint here:
> http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#User_talk_pages_not_NOINDEXed_for_mobile_site

The __NOINDEX__ part is covered by bug 35425. And it looks like <https://en.m.wikipedia.org/robots.txt> loads properly.
Comment 5 Erik Moeller 2012-04-10 02:36:32 UTC
Reopening.

Indexing for the purpose of measuring is fine, but seeing m.wikipedia.org search results in Google's non-mobile search seems confusing and misleading to our users. But that's what's currently happening, and we need to stop it.

Patrick or Tomasz, can you send me the communication that's occurred so far with Google so I understand where they're coming from, and give me any additional background that may be helpful?
Comment 6 Erik Moeller 2012-04-13 01:54:46 UTC
The noindex, nofollow has been restored for now.

If you see Google search results which include m., please report them here.

We'll try to comply with Google's request in a way that doesn't affect other crawlers, but only once they've confirmed that they can fully exclude m. from Google search results.
Comment 7 Erik Moeller 2012-04-13 02:21:33 UTC
My own experience, specifically with Google: I've seen m. results come up for ordinary searches a few days ago. Now I find it much harder to reproduce. However, I can reliably get them both for "site:m.wikipedia.org" and for some searches which pick up unique m. content. For example, the mobile site includes the phrase "Disable images", so the word "disable", in combination with some searches, brings up m. results. (Example: search for "disable gnu general public license" and you'll see a simple.m result.)

So it looks like they're trying aggressively to filter but not always succeeding.
Comment 9 Erik Moeller 2012-04-15 23:47:36 UTC
Google's recommendation is to use rel="canonical", i.e. to allow crawlers to crawl the mobile site but to signal that the content is substantially identical to the desktop version. They're recommending for it to be crawler-visible to pick up mobile-optimized pages and serve those directly to users of Google mobile search.

I'm fine with giving this a go. Supposedly Bing, Yahoo! and Google all support rel="canonical" to filter duplicate content.

We'll still see increased crawler traffic relative to noindex but this should help to reliably exclude m. pages from the desktop index.
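
A minimal sketch of the rel="canonical" approach, assuming the mobile host differs from the desktop host only by an extra "m" label (as with en.m.wikipedia.org and en.wikipedia.org). The function names `desktop_url` and `canonical_link` are hypothetical and this is not the code the mobile site actually runs.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def desktop_url(mobile_url: str) -> str:
    """Map a mobile URL to its desktop counterpart.

    Assumption: the only difference is one "m" label in the host name,
    e.g. en.m.wikipedia.org -> en.wikipedia.org.
    """
    parts = urlsplit(mobile_url)
    labels = parts.netloc.split(".")
    if "m" in labels:
        labels.remove("m")  # drops only the first "m" label
    return urlunsplit(parts._replace(netloc=".".join(labels)))


def canonical_link(mobile_url: str) -> str:
    """Build the <link> element a mobile page could emit in its <head>."""
    href = escape(desktop_url(mobile_url), quote=True)
    return f'<link rel="canonical" href="{href}">'


if __name__ == "__main__":
    print(canonical_link("http://en.m.wikipedia.org/wiki/Main_Page"))
```

The mobile page stays crawlable; the tag only tells crawlers which URL should be treated as the primary copy.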
Comment 10 MZMcBride 2012-12-04 17:19:43 UTC
(In reply to comment #9)
> Google's recommendation is to use rel="canonical", i.e. to allow crawlers to
> crawl the mobile site but to signal that the content is substantially identical
> to the desktop version. They're recommending for it to be crawler-visible to
> pick up mobile-optimized pages and serve those directly to users of Google
> mobile search.
> 
> I'm fine with giving this a go. Supposedly Bing, Yahoo! and Google all support
> rel="canonical" to filter duplicate content.
> 
> We'll still see increased crawler traffic relative to noindex but this should
> help to reliably exclude m. pages from the desktop index.

When I look at the page source of <http://en.m.wikipedia.org/> currently, I notice two things:

<meta name="robots" content="noindex,nofollow"/>

and...

<link rel="canonical" href="http://en.wikipedia.org/wiki/Main_Page" >

So what is needed to resolve this bug? Simply removing the noindex/nofollow HTML output (presumably this is behind a PHP configuration variable)?
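
To check a page for the two signals quoted above, something like the following sketch could be used. It relies only on Python's standard `html.parser`; the class name `IndexingSignals` and the function `audit` are hypothetical, and the sample markup is just the two tags from this comment wrapped in a <head>.

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Collect robots <meta> directives and the rel="canonical" link."""

    def __init__(self):
        super().__init__()
        self.robots = set()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex,nofollow".
            self.robots.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )
        elif tag == "link" and "canonical" in (attr.get("rel") or "").lower().split():
            self.canonical = attr.get("href")


def audit(html: str) -> dict:
    """Report whether a page carries noindex and which canonical URL it names."""
    parser = IndexingSignals()
    parser.feed(html)
    return {"noindex": "noindex" in parser.robots, "canonical": parser.canonical}


if __name__ == "__main__":
    page = (
        '<head><meta name="robots" content="noindex,nofollow"/>'
        '<link rel="canonical" href="http://en.wikipedia.org/wiki/Main_Page" ></head>'
    )
    print(audit(page))
```

A page that reports both "noindex" and a canonical URL is in the state described in this comment.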
Comment 11 Gerrit Notification Bot 2013-05-01 18:18:43 UTC
Related URL: https://gerrit.wikimedia.org/r/61809 (Gerrit Change I1790f38880458588b9ccc5c2d5e0fa67ff00e386)
Comment 12 Gerrit Notification Bot 2013-05-01 18:33:50 UTC
https://gerrit.wikimedia.org/r/61809 (Gerrit Change I1790f38880458588b9ccc5c2d5e0fa67ff00e386) | change APPROVED and MERGED [by Jdlrobson]
Comment 13 Jon 2013-05-28 00:22:05 UTC
*** Bug 48856 has been marked as a duplicate of this bug. ***
Comment 14 James Alexander 2013-06-24 09:41:32 UTC
Reopening this bug to avoid filing a duplicate. I came across it for mediawiki.org during a Google search. Perhaps we just didn't get all WMF sites?

Replication: 

Google search for Wikimedia bugzilla groups [oh the irony]
https://www.google.com/search?q=wikimedia+bugzilla+groups&oq=wikimedia+bugzilla+groups

For me the 4th option was:
User:AKlapper (WMF)/BugzillaAdminPolicy - MediaWiki
https://m.mediawiki.org/wiki/User:AKlapper.../BugzillaAdminPolicy
Comment 15 Jon 2013-06-24 17:34:37 UTC
As stated in:
https://bugzilla.wikimedia.org/show_bug.cgi?id=48856#c1 (dated 2013-05-28 00:22:05 UTC)

This will remain the case without a cache flush.

I believe our caches run for 6 weeks so please reopen bug if you notice this behaviour at the end of July.

(In this particular example the search result in question hasn't been edited since Apr 23, 2013 so will still be loading from cache)
Comment 16 Jon 2013-06-28 22:05:20 UTC
*** Bug 50400 has been marked as a duplicate of this bug. ***
