Last modified: 2012-05-17 08:26:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T38581, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 36581 - HTML garbage that might be bad for search engines (e.g. <span dir="auto">inside h1)
Status: RESOLVED INVALID
Product: MediaWiki
Classification: Unclassified
Component: Parser
Version: 1.19
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2012-05-07 04:46 UTC by Edward Chernenko
Modified: 2012-05-17 08:26 UTC

Description Edward Chernenko 2012-05-07 04:46:19 UTC
It would be good to have an option to disable these tags inside <h1>.

Frankly, I don't care much about multiple text directions, whereas a "badly optimized HTML" report from SEO analyzers looks like a problem.
Comment 1 Andre Klapper 2012-05-07 10:36:27 UTC
> isn't that bad for search engines?

Care to elaborate? Who cares about "badly optimized HTML", and why?
Comment 2 Edward Chernenko 2012-05-17 07:47:15 UTC
I'd say that every wiki except Wikipedia should care.
Most encyclopedias get >60% of visitors from search engines.

The <h1> tag is always used by a search engine to determine a page's topic, its most important keywords, and its thematic category.

Knowing that, there are two reasons why the garbage inside <h1> is bad:

1) it was never expected that <h1> would contain anything but plain text (every book on HTML and SEO says so), so a search engine might not (I stress MIGHT) parse its contents to the end. That means "span", "auto", etc. may be recognized as the high-priority keywords for the page, at the expense of the actual keywords.

2) some search engines do not like garbage in H1-H6 tags; they treat it as malformed HTML or even as an attempt to deceive the search engine, and the ranking of pages containing it may be penalized.

This problem is not just about <h1> and those spans. The HTML generated by MediaWiki contains a ton of things (like <meta name="generator">, etc.) which every SEO guidebook strongly advises against.

Personally, I had to devise a filter for my wiki that erases all these tags.
However, I believe it would be great to have a "$wgTurnOffAllTheseUnneededHTMLGarbage" variable in the MediaWiki config, which would spare others the same work.
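
For illustration only, here is a minimal sketch of what such an output filter might look like. It is not the filter actually used on the reporter's wiki, and it is written in Python rather than as a MediaWiki extension. The names HeadingSpanStripper and strip_heading_spans are hypothetical, and the sample markup is an assumed shape of a page heading. It unwraps <span dir="auto"> only when the span sits inside an <h1>, and leaves all other markup untouched.

```python
from html.parser import HTMLParser


class HeadingSpanStripper(HTMLParser):
    """Re-emit HTML, unwrapping <span dir="auto"> elements found inside <h1>."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []          # pieces of the rewritten document
        self.h1_depth = 0      # > 0 while the parser is inside an <h1>
        self.span_stack = []   # one flag per open <span> in <h1>: True if dropped

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1_depth += 1
        elif tag == "span" and self.h1_depth:
            drop = dict(attrs).get("dir") == "auto"
            self.span_stack.append(drop)
            if drop:
                return  # skip the opening tag but keep its text content
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.h1_depth and self.span_stack:
            if self.span_stack.pop():
                return  # skip the end tag that matches a dropped <span>
        elif tag == "h1" and self.h1_depth:
            self.h1_depth -= 1
            self.span_stack.clear()
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_heading_spans(html):
    """Return html with <span dir="auto"> wrappers removed from <h1> elements."""
    parser = HeadingSpanStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Assumed sample heading; real MediaWiki output may differ.
page = '<h1 id="firstHeading"><span dir="auto">Main Page</span></h1>'
print(strip_heading_spans(page))  # <h1 id="firstHeading">Main Page</h1>
```

A parser-based approach is used here instead of a regular expression so that nested spans inside the heading are matched to the correct closing tag. Spans outside <h1>, and spans inside <h1> without dir="auto", are emitted unchanged.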
Comment 3 Amir E. Aharoni 2012-05-17 08:26:40 UTC
Closing INVALID:

1. MediaWiki is designed for managing information online, not for pleasing bad SEO analyzers.

2. It's wrong that every HTML book suggests only text in H1. The W3C HTML standard doesn't. SEO-ers and search engine designers are welcome to read it and implement it.

3. Most importantly, this bug doesn't define "garbage". If you think that there are particular HTML elements or attributes that should be possible to disable, configuration variables can be added for it, but it must be properly defined. Patches welcome.
