Last modified: 2014-09-23 22:41:08 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T34514, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 32514 - Access to HTTP 404 logs for Wiktionary
Status: REOPENED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: analytics
Depends on:
Blocks:
Reported: 2011-11-20 07:05 UTC by Srikanth Logic
Modified: 2014-09-23 22:41 UTC (History)


Attachments
November 1-3, en.wikt, top 300 requested non-existent pages (14.65 KB, text/plain)
2011-11-26 23:12 UTC, Dan Collins

Description Srikanth Logic 2011-11-20 07:05:46 UTC
Many users look up words on Wiktionary by typing the word directly into the URL; if the entry doesn't exist, they get a 404 (page not found). Is it possible to get access to these 404 logs and share them with the community, so that the community can create entries for nonexistent pages that readers are looking up?
Comment 1 Mark A. Hershberger 2011-11-21 20:17:09 UTC
We don't log this information.  You should be able to use the page views (http://dumps.wikimedia.org/other/) and compare it to the list of existing pages.

This would take some work, but I think it is your best bet.
Comment 2 Roan Kattouw 2011-11-22 21:50:11 UTC
We log /some/ of this information, but not all. We have 1:1000 sampled Apache logs that we use for internal analysis, but we don't release these publicly because they contain private data. I guess we could release anonymized versions of them, but we don't do that currently.

The page view statistics at e.g. http://stats.grok.se are obtained using the UDP logger that counts the requests for each page but doesn't write a log line to disk for each request (disk I/O tends to be the limiting factor here, AIUI). With the UDP log stream it should definitely be possible to produce 404 statistics.

Reopening because this isn't as impossible as suggested in comment 1.
Comment 3 Ariel T. Glenn 2011-11-22 21:55:10 UTC
The sampled logs wouldn't be of use here anyways; most misses would probably not even show up there. 

We could write code to grab 404 statistics, but that still wouldn't cover a chunk of the cases here (URLs that are well formed but point to an article that isn't written yet). Comparing a list of known viewed titles against known titles on the project is the best bet for right now, and doable immediately by anyone who can script a little bit.
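
The comparison described above could be sketched roughly as follows. This is a minimal sketch, not the script anyone in this thread actually ran. It assumes the page view files contain space-separated lines of the form "project title count bytes", that the English Wiktionary appears under the project code "en.d", and that the set of existing titles has already been loaded from a title list using the same underscore and encoding conventions. The function name missing_page_counts is hypothetical.

```python
from collections import Counter


def missing_page_counts(pagecount_lines, existing_titles, project="en.d"):
    """Sum view counts for requested titles that are not in existing_titles.

    Assumed line format (an assumption, not confirmed by this thread):
        "<project> <title> <count> <bytes>"
    """
    counts = Counter()
    for line in pagecount_lines:
        parts = line.rstrip("\n").split(" ")
        # Skip malformed lines and lines for other projects.
        if len(parts) != 4 or parts[0] != project:
            continue
        title = parts[1]
        try:
            views = int(parts[2])
        except ValueError:
            continue
        if title not in existing_titles:
            counts[title] += views
    return counts
```

Calling counts.most_common(300) on the result would give a "top 300" style list; titles would still need normalising (encoding, case, underscores) before the membership test can be trusted.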
Comment 4 Dan Collins 2011-11-26 23:12:16 UTC
Created attachment 9565
November 1-3, en.wikt, top 300 requested non-existent pages

I have attached the 300 most commonly requested pages that do not exist on the English Wiktionary. Each of these pages was requested 50 times or more in a three-day period. Some of the titles look a little strange, for example "%25D8%25AC%25D9%2585%25D8%25A7%25D8%25B9", as though it was URL-encoded twice, but the names used are the ones I got from the pageviews dumps. The list mostly contains URL fragments (index.php is the number 1 most requested), but there are also some strange ones, as well as years and Unicode gunk. I'll let the script keep running and update the attachment with data for the full month.
Comment 5 Srikanth Logic 2011-11-27 06:24:58 UTC
Nice. Can a monthly log of this be made available somewhere like dumps.wikimedia.org? My original request had Tamil Wiktionary in mind; the URL encoding needs to be decoded for the final output to be useful, and the Unicode might not be junk there. Once we have data from across Wiktionaries over a period, we could probably find patterns to exclude junk and give some useful data to the community.
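
The decoding step mentioned above could look something like the sketch below. It assumes the titles are UTF-8 text that has been percent-encoded one or more times, as the example in comment 4 appears to be; the helper name fully_unquote and the max_rounds limit are hypothetical choices for illustration.

```python
from urllib.parse import unquote


def fully_unquote(title, max_rounds=3):
    """Percent-decode a title repeatedly until it stops changing.

    max_rounds is an arbitrary safety limit (an assumption) so that a
    title is never decoded more times than seems plausible.
    """
    for _ in range(max_rounds):
        decoded = unquote(title)
        if decoded == title:
            break
        title = decoded
    return title


# The doubly encoded title quoted in comment 4 decodes to four Arabic letters.
print(fully_unquote("%25D8%25AC%25D9%2585%25D8%25A7%25D8%25B9"))
```

One caveat: decoding until the string stabilises would also alter a title that legitimately contains a literal percent-escape sequence, so the output is best treated as a hint for human review rather than a canonical title.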
Comment 6 Nemo 2012-08-23 22:01:28 UTC
So what is this currently, a request for an automated, regularly updated, filtered version of the UDP logs? (Maybe to be sent to the analytics team?)
A request for a tool which works on stats.grok.se data or its replacement of the mysterious future?
Please clarify summary and component.
Comment 7 stratoprutser 2013-02-27 23:28:15 UTC
Not exactly the same, but I developed a small JavaScript extension that is handy for dealing with the 404s described in Srikanth Logic's scenario.

See http://en.wiktionary.org/wiki/Wiktionary:Beer_parlour/2013/February#Yahoo_Pipe_for_404s for the thread.

I realize it would probably be better if the API response were parsed entirely inside Wiktionary rather than through Pipes, but well, that's the idea...
Comment 8 Andre Klapper 2013-03-01 19:28:02 UTC
Srikanth: Could you answer comment 6, please?

[removing ops keyword -> analytics area]
Comment 9 Sumana Harihareswara 2014-09-23 22:41:08 UTC
Pinging Srikanth once more. :)
