Last modified: 2014-09-19 11:12:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T72721, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70721 - http://wikipedia.org/index.html takes you somewhere unexpected.
Status: RESOLVED WONTFIX
Product: Wikimedia
Classification: Unclassified
Component: Apache configuration
Version: wmf-deployment
Hardware: All All
Importance: Lowest enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-09-11 15:02 UTC by Steve Baker
Modified: 2014-09-19 11:12 UTC (History)

Description Steve Baker 2014-09-11 15:02:51 UTC
Go to http://wikipedia.org/index.html - note that you're not at the Wikipedia home page!

Background: If I enter the URL "https://en.wikipedia.org/zebra", or "https://wikipedia.org/zebra" or even "https://en.wikipedia.org/wiki/zebra" I get a 404 page that helpfully redirects me to the English language Zebra article after a 5 second pause. That's a nice touch. (although less so if you're a non-English speaker!)

However, it has an unintended consequence. If I go to Wikipedia's main page using https://wikipedia.org/index.html or https://wikipedia.org/index.htm or https://wikipedia.org/index.php - I get redirected to the technical article "Webserver directory index" (via the redirect "index.html" or whatever). This is not a useful behavior! For non-English speakers, it's a very bad thing!

I think most people would expect http://wikipedia.org/index.html to take them to the main page.

In case you doubt the depth of the problem, note that the "index.html" redirect comes up as a remarkably frequently-accessed page. In 2008 it was the 5th most visited page in the entire encyclopedia - with the only actual article to beat it being the one about the 2008 Olympic games!   About 1.5 million hits per month go to the *articles* index.html, index.php and index.htm - which is an insanely unlikely number for a relatively obscure topic.  That suggests that about 1.5% of the people trying to get to our home page (many of whom are non-English speakers) are winding up at this very obscure article about webserver directories instead of the home page!

IMHO, we should change that 404 page to treat "index.html", "index.php" and "index.htm" as special cases and redirect you to the main page (preferably without the 5 second delay) instead of this rather obscure article!
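
A minimal sketch of the special case proposed above, written in Python only for illustration. The real 404 handler is not shown in this report, so `INDEX_NAMES`, `MAIN_PAGE` and `special_case_redirect` are made-up names, and "/" is just an assumed address for the main page:

```python
from typing import Optional

# Hypothetical names: the actual 404 page code is not shown in this bug report.
INDEX_NAMES = {"index.html", "index.htm", "index.php"}
MAIN_PAGE = "/"  # assumed target; the real main page URL may differ


def special_case_redirect(path: str) -> Optional[str]:
    """Return the main page URL for a top-level directory-index path, else None.

    None means "fall through to the existing 404 behaviour".
    """
    name = path.strip("/")
    return MAIN_PAGE if name in INDEX_NAMES else None


print(special_case_redirect("/index.html"))
print(special_case_redirect("/zebra"))
```

Only top-level paths are matched here, so something like "/wiki/index.html" would still fall through to the existing behaviour.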

I think this should be a trivial check in whatever creates our 404 page - and will improve the Wikipedia experience for 1.5 million people every month.  It should be fixed.
Comment 1 Andre Klapper 2014-09-11 17:04:41 UTC
This was discussed already on IRC and I don't consider it important enough to create a redirect rule for this specific cornercase.
It might be unexpected for some users but not too hard either to find the main page with yet another click, as the error page offers you that link.
Comment 2 Steve Baker 2014-09-11 17:22:43 UTC
WHAT?!   This is affecting at least 1.5 million people per month!   How can you possibly not consider that important?

You say "it might be unexpected for some users" - but since (for sure) this obscure topic isn't remotely that interesting, we know for sure that it's unexpected for 1.5 million visitors each month.   So it's DEFINITELY unexpected.

I took an informal poll around the office here at work (we develop web software) and not one person expected the result you actually get here.

I can't believe that you don't think it's worth fixing!   The fix has gotta be really trivial to do - and you don't think it's worth doing it to help 1.5 MILLION people?!
Comment 3 Sam Reed (reedy) 2014-09-11 17:24:30 UTC
Why would people be going to index.html?
Comment 4 Bartosz Dziewoński 2014-09-11 17:25:33 UTC
That's most likely not people, but bots. I'm sure you could get some stats about the user-agents and so on from Analytics if you asked nicely, that would help inform our actions.
Comment 5 Bartosz Dziewoński 2014-09-11 17:27:22 UTC
(I'm going to mark the bug UNCONFIRMED until we get some data, so that we can either fix it if it's a real issue or wontfix it properly?)
Comment 6 Chad H. 2014-09-11 17:33:13 UTC
(In reply to Bartosz Dziewoński from comment #4)
> That's most likely not people, but bots. I'm sure you could get some stats
> about the user-agents and so on from Analytics if you asked nicely, that
> would help inform our actions.

Agreed, that's the most likely case. Either bots or a misbehaving browser (or plugin) of some sort. In any case, stats will help us figure out what's actually going on here.

To the analyticsmobile!
Comment 7 Steve Baker 2014-09-11 17:46:44 UTC
Even if it's a herd of misbehaving bots...wouldn't we want those bots to end up at the "expected" place?

The "index.htm" page is almost certainly being hit by bots (or perhaps just one bot) - it gets a steady 680 hits per day...plus or minus a handful.

But "index.php" and "index.html" get millions of hits (in 2008, "index.html" was the second most popular article on the entire site!!)...I don't think bots would hit the ".html" and ".php" URLs that much more often than ".htm"...but regular people and strange browser behavior certainly might.
Comment 8 Chad H. 2014-09-11 17:52:55 UTC
(In reply to Steve Baker from comment #7)
> Even if it's a herd of misbehaving bots...wouldn't we want those bots to end
> up at the "expected" place?
> 

Depends on what the bot expects? Maybe the bot is lazy and just puts en.wikipedia.org/articlename in and expects content? In which case, moving it to a new location for them might break behavior. Maybe the solution isn't redirecting, but getting the bot author to fix their code :)

Anyway, stats shall help (I've pinged analytics to please weigh in here).

> The "index.htm" page is almost certainly being hit by bots (or perhaps just
> one bot) - it gets a steady 680 hits per day...plus or minus a handful.
> 

Good to know.

> But "index.php" and "index.html" get millions of hits (in 2008, "index.html"
> was the second most popular article on the entire site!!)...I don't think
> bots would hit the ".html" and ".php" URL's that much more often than
> ".htm"...but regular people and strange browser behavior certainly might.

I'm not entirely convinced regular people are doing this. I've never once seen a person type a url and actually include the index.html part unless they're copying it from something. Strange browser behavior is more likely here imho.
Comment 9 Steve Baker 2014-09-11 19:04:48 UTC
> Depends on what the bot expects? Maybe the bot is lazy and just
> puts en.wikipedia.org/articlename in and expects content? 

Doesn't work...why would these hypothetical bots be viewing this particular article tens of thousands of times more often than any other comparable article?  Why is it accessing "index.html" *and* "index.php"?  (They both redirect to the same article)

The comparable place "index.htm" (no 'l') gets almost exactly 680 hits per day...THAT is robotic behavior and the quantity of bots is believable.  680 'broken' (arguably) bots from around the world accessing the 'wrong' (arguably) location.  But 200,000 bot hits per day with this same behavior?

I actually don't think it matters - whether it's people or bots - wouldn't we want them to get to the expected place?   Sure, if it were very few hits - but it's not. 1.5 million per month is 1.5% of all hits to the real front page.
Comment 10 Chad H. 2014-09-11 19:25:57 UTC
(In reply to Steve Baker from comment #9)
> I actually don't think it matters - whether it's people or bots - wouldn't
> we want them to get to the expected place?   Sure, if it were very few hits
> - but it's not. 1.5 million per month is 1.5% of all hits to the real front
> page.

I'm not opposed to a redirect, I just want us to be well informed as to what's going on  :)
Comment 11 Erik Zachte 2014-09-12 21:31:55 UTC
I scanned one day of the 1:1000 sampled squid log, so multiply all numbers by 1000.

I find 1587 lines with index.html, of which only 34 are without curid.

Most lines are like https://en.wikipedia.org/wiki/index.html?curid=32681660
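
For anyone wanting to repeat the split, here is a rough sketch of telling the two kinds of line apart. It assumes each log line has already been reduced to its URL (the squid log format itself is not shown here), and `has_curid` is a made-up helper name:

```python
from urllib.parse import parse_qs, urlsplit


def has_curid(url: str) -> bool:
    """True if the URL's query string carries a non-empty curid parameter."""
    query = urlsplit(url).query
    return "curid" in parse_qs(query)


print(has_curid("https://en.wikipedia.org/wiki/index.html?curid=32681660"))
print(has_curid("https://en.wikipedia.org/wiki/index.html"))
```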

Out of 1587 only 68 had a user agent that did not contain crawl, spider, bot or http (http is by unofficial convention only used by bots).

Of the lines with index.html?curid= the following bots were found:

   8 Android (compatible baidu spider)
  13 AhrefsBot 
 113 Googlebot
   1 Mail.RU_Bot
   3 YandexBot
1337 bingbot
  21 iPhone etc (but compatible GoogleBot) 
   1 Sogu web spider

Of course bingbot doesn't have to be Bing really. Some bots cloak.
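
The user agent check described above can be sketched roughly like this. The four substrings come from this comment; matching case-insensitively is an assumption, and `looks_like_bot` is a made-up name. As noted, cloaked bots will slip through any check of this kind:

```python
# Substrings taken from the comment above; the case-insensitive match is an assumption.
BOT_MARKERS = ("crawl", "spider", "bot", "http")


def looks_like_bot(user_agent: str) -> bool:
    """True if the user agent contains any of the usual bot markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


print(looks_like_bot("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))
print(looks_like_bot("Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0"))
```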

Does this answer your question?
Comment 12 Bartosz Dziewoński 2014-09-18 21:28:12 UTC
(In reply to Erik Zachte from comment #11)
> I find 1587 lines with index.html, of which only 34 without curid.
> 
> Most lines are like https://en.wikipedia.org/wiki/index.html?curid=32681660

So these don't actually visit index.html, it's just the stats that are wrong.

Using 34/1587 as the percentage of real visits, we arrive at about 1000 hits per day. This is comparable with other articles on these subjects, like "Web server" or "HTTP". This thousand includes both humans and bots, right?
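
One way to reproduce that estimate, as a sketch: it assumes the 1.5 million recorded hits per month quoted earlier in this bug and a 30-day month, neither of which comes from the squid sample itself.

```python
recorded_hits_per_month = 1_500_000  # figure quoted earlier in this bug
days_per_month = 30                  # assumption, for a rough daily rate
no_curid_lines = 34                  # sampled lines without curid (comment 11)
index_html_lines = 1587              # all sampled index.html lines (comment 11)

real_fraction = no_curid_lines / index_html_lines
real_hits_per_day = recorded_hits_per_month / days_per_month * real_fraction

print(f"{real_fraction:.1%} of lines, about {real_hits_per_day:.0f} hits per day")
```

That comes out at roughly 2% of the recorded hits, or a little over 1000 per day.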


> Out of 1587 only 68 had a user agent that did not contain crawl,spider,bot
> or http (http is by unofficial convention only user for bots) 

I'm curious how many of the non-curid URLs are non-bots.


Either way, this seems to be just a stats issue and we do not actually have millions of humans every month accidentally learning everything about webserver directory indices. I suggest re-closing this bug as WONTFIX. Steve?
Comment 13 Steve Baker 2014-09-18 23:39:43 UTC
Ah...so to put it simply:

http://wikipedia.org/index.html does indeed redirect to the article of that name...but http://wikipedia.org/index.html?whatever doesn't...but it *does* increment the stats for the article of that name.

Then, most of the hits we're recording for this article are of the "index.html?whatever" variety, so there isn't a problem for those people.

(Interestingly:  http://wikipedia.org/Zebra?curid=32681660 takes you to the same place!)

OK - then I guess we can call this a don't-fix issue.   I'll pass the news on to the affected article talk pages so they can understand what's going on.
Comment 14 Bartosz Dziewoński 2014-09-19 11:12:23 UTC
Thanks for getting this to everyone's attention :)
