Last modified: 2013-06-28 22:05:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T37233, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 35233 - Mobile sites being indexed by search engines


Summary:	Mobile sites being indexed by search engines

Status:	RESOLVED FIXED

Product:	Wikimedia
Classification:	Unclassified
Component:	General/Unknown (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal (vote)
Target Milestone:	---
Assigned To:	Jon

URL:
Whiteboard:
Keywords:

Duplicates:	50400 (view as bug list)
Depends on:
Blocks:

Reported:	2012-03-14 22:14 UTC by MZMcBride
Modified:	2013-06-28 22:05 UTC (History)
CC List:	10 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
https://www.google.com/search?ix=acb&sourceid=chrome&ie=UTF-8&q=disable+gnu+generalpublic+license#hl=en&sclient=psy-ab&q=disable+gnu+general+public+license&oq=disable+gnu+general+public+license&aq=f&a (24.80 KB, image/png) 2012-04-13 02:25 UTC, Patrick Reilly	Details

Description MZMcBride 2012-03-14 22:14:03 UTC

Sites such as <http://en.m.wikipedia.org> are currently being indexed by search engines such as Google.

I don't believe having these mobile sites indexed is necessary or appropriate.

I'd like to see the sites marked as "no index", via robots.txt or a <meta> tag or whatever other reliable method is available. I thought this was already the case, but it clearly is not: <https://www.google.com/search?q=site%3Aen.m.wikipedia.org>.

Comment 1 MZMcBride 2012-03-14 22:20:14 UTC

The mobile site previously had "<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />". This was removed in r113378.

Comment 2 Patrick Reilly 2012-03-14 22:20:47 UTC

We removed the NOINDEX, NOFOLLOW at Google's request. They want to index the mobile site for their mobile search index.

Comment 3 AlexSm 2012-03-22 20:48:17 UTC

Does the mobile site support NOINDEX and MediaWiki:Robots.txt? There is a complaint here: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#User_talk_pages_not_NOINDEXed_for_mobile_site

Comment 4 MZMcBride 2012-04-05 22:15:21 UTC

(In reply to comment #3)
> Does the mobile site support NOINDEX and MediaWiki:Robots.txt? There is a
> complaint here:
> http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#User_talk_pages_not_NOINDEXed_for_mobile_site

The __NOINDEX__ part is covered by bug 35425. And it looks like <https://en.m.wikipedia.org/robots.txt> loads properly.

Comment 5 Erik Moeller 2012-04-10 02:36:32 UTC

Reopening.

Indexing for the purpose of measuring is fine, but seeing m.wikipedia.org search results in Google's non-mobile search seems confusing and misleading to our users. But that's what's currently happening, and we need to stop it.

Patrick or Tomasz, can you send me the communication that's occurred so far with Google so I understand where they're coming from, and give me any additional background that may be helpful?

Comment 6 Erik Moeller 2012-04-13 01:54:46 UTC

The noindex, nofollow has been restored for now.

If you see Google search results which include m., please report them here.

We'll try to comply with Google's request in a way that doesn't affect other crawlers, and once they've confirmed that they can fully exclude m. from Google search results.

Comment 7 Erik Moeller 2012-04-13 02:21:33 UTC

My own experience, specifically with Google: I've seen m. results come up for ordinary searches a few days ago. Now I find it much harder to reproduce. However, I can reliably get them both for "site:m.wikipedia.org" and for some searches which pick up unique m. content. For example, the mobile site includes the phrase "Disable images", so the word "disable", in combination with some searches, brings up m. results. (Example: search for "disable gnu general public license" and you'll see a simple.m result.)

So it looks like they're trying aggressively to filter but not always succeeding.

Comment 8 Patrick Reilly 2012-04-13 02:25:09 UTC

Created attachment 10416 [details]
https://www.google.com/search?ix=acb&sourceid=chrome&ie=UTF-8&q=disable+gnu+generalpublic+license#hl=en&sclient=psy-ab&q=disable+gnu+general+public+license&oq=disable+gnu+general+public+license&aq=f&a

https://www.google.com/search?ix=acb&sourceid=chrome&ie=UTF-8&q=disable+gnu+generalpublic+license#hl=en&sclient=psy-ab&q=disable+gnu+general+public+license&oq=disable+gnu+general+public+license&aq=f&aqi=&aql=1&gs_nf=1&gs_l=serp.3...5222.5222.0.5509.1.1.0.0.0.0.52.52.1.1.0.cfis.1.&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=24ec0a65142634f2&biw=1275&bih=806&ix=acb

Comment 9 Erik Moeller 2012-04-15 23:47:36 UTC

Google's recommendation is to use rel="canonical", i.e. to allow crawlers to crawl the mobile site but to signal that the content is substantially identical to the desktop version. They're recommending for it to be crawler-visible to pick up mobile-optimized pages and serve those directly to users of Google mobile search.

I'm fine with giving this a go. Supposedly Bing, Yahoo! and Google all support rel="canonical" to filter duplicate content.

We'll still see increased crawler traffic relative to noindex but this should help to reliably exclude m. pages from the desktop index.
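
A minimal sketch of the rel="canonical" approach described above, in Python. It assumes, hypothetically, that a mobile host differs from its desktop host only by an "m" label (as with en.m.wikipedia.org and en.wikipedia.org); the function names are illustrative and are not taken from the mobile site's actual code. Hosts that do not follow that pattern would need their own mapping.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def desktop_url(mobile_url: str) -> str:
    """Return the desktop URL for a mobile URL by dropping the 'm' host label.

    Assumption: the mobile host is the desktop host plus an 'm' label,
    e.g. en.m.wikipedia.org -> en.wikipedia.org.
    """
    parts = urlsplit(mobile_url)
    labels = [label for label in parts.netloc.split(".") if label != "m"]
    return urlunsplit(
        (parts.scheme, ".".join(labels), parts.path, parts.query, parts.fragment)
    )


def canonical_link_tag(mobile_url: str) -> str:
    """Build the <link rel="canonical"> tag a mobile page could emit."""
    href = escape(desktop_url(mobile_url), quote=True)
    return f'<link rel="canonical" href="{href}">'


if __name__ == "__main__":
    print(canonical_link_tag("http://en.m.wikipedia.org/wiki/Main_Page"))
```

The mobile page stays crawlable, and the tag tells crawlers which URL to treat as the primary copy.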

Comment 10 MZMcBride 2012-12-04 17:19:43 UTC

(In reply to comment #9)
> Google's recommendation is to use rel="canonical", i.e. to allow crawlers to
> crawl the mobile site but to signal that the content is substantially identical
> to the desktop version. They're recommending for it to be crawler-visible to
> pick up mobile-optimized pages and serve those directly to users of Google
> mobile search.
> 
> I'm fine with giving this a go. Supposedly Bing, Yahoo! and Google all support
> rel="canonical" to filter duplicate content.
> 
> We'll still see increased crawler traffic relative to noindex but this should
> help to reliably exclude m. pages from the desktop index.

When I look at the page source of <http://en.m.wikipedia.org/> currently, I notice two things:

<meta name="robots" content="noindex,nofollow"/>

and...

<link rel="canonical" href="http://en.wikipedia.org/wiki/Main_Page" >

So what is needed to resolve this bug? Simply removing the noindex/nofollow HTML output (presumably this is behind a PHP configuration variable)?
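
The check described above (looking at the page source for both signals) can be sketched in Python with the standard library's html.parser. The class and function names are illustrative assumptions, and the sample markup is the two tags quoted in this comment; this is not how any crawler actually evaluates the page.

```python
from html.parser import HTMLParser


class IndexingSignals(HTMLParser):
    """Collect robots meta directives and the canonical link from HTML."""

    def __init__(self):
        super().__init__()
        self.robots = []       # lowercased directives, e.g. ["noindex", "nofollow"]
        self.canonical = None  # href of <link rel="canonical">, if present

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <meta ... />.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.robots += [d.strip().lower() for d in content.split(",") if d.strip()]
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")


def inspect_signals(html: str):
    """Return (robots_directives, canonical_href) found in the markup."""
    parser = IndexingSignals()
    parser.feed(html)
    parser.close()
    return parser.robots, parser.canonical


if __name__ == "__main__":
    sample = (
        '<meta name="robots" content="noindex,nofollow"/>\n'
        '<link rel="canonical" href="http://en.wikipedia.org/wiki/Main_Page" >\n'
    )
    robots, canonical = inspect_signals(sample)
    if "noindex" in robots and canonical is not None:
        print("Page sends both noindex and rel=canonical:", canonical)
```

A page that reports both is sending the mixed signal this comment points out: noindex asks crawlers to drop the page, while rel="canonical" only makes sense if they are allowed to read it.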

Comment 11 Gerrit Notification Bot 2013-05-01 18:18:43 UTC

Related URL: https://gerrit.wikimedia.org/r/61809 (Gerrit Change I1790f38880458588b9ccc5c2d5e0fa67ff00e386)

Comment 12 Gerrit Notification Bot 2013-05-01 18:33:50 UTC

https://gerrit.wikimedia.org/r/61809 (Gerrit Change I1790f38880458588b9ccc5c2d5e0fa67ff00e386) | change APPROVED and MERGED [by Jdlrobson]

Comment 13 Jon 2013-05-28 00:22:05 UTC

*** Bug 48856 has been marked as a duplicate of this bug. ***

Comment 14 James Alexander 2013-06-24 09:41:32 UTC

Reopening this bug so as to avoid a duplicate. I came across it for mediawiki.org during a Google search. Perhaps we just didn't get all WMF sites?

Replication: 

Google search for Wikimedia bugzilla groups [oh the irony]
https://www.google.com/search?q=wikimedia+bugzilla+groups&oq=wikimedia+bugzilla+groups

For me the 4th option was:
User:AKlapper (WMF)/BugzillaAdminPolicy - MediaWiki
https://m.mediawiki.org/wiki/User:AKlapper.../BugzillaAdminPolicy‎

Comment 15 Jon 2013-06-24 17:34:37 UTC

As stated in:
https://bugzilla.wikimedia.org/show_bug.cgi?id=48856#c1 (dated 2013-05-28 00:22:05 UTC)

This will remain the case without a cache flush.

I believe our caches run for 6 weeks, so please reopen the bug if you notice this behaviour at the end of July.

(In this particular example the search result in question hasn't been edited since Apr 23, 2013 so will still be loading from cache)

Comment 16 Jon 2013-06-28 22:05:20 UTC

*** Bug 50400 has been marked as a duplicate of this bug. ***
