Last modified: 2014-08-27 22:39:00 UTC
I want to integrate the pagecounts of Tool Labs (resp. Labs) into the tool https://tools.wmflabs.org/wikiviewstats/ . For this, it would be necessary to have access to redacted webproxy logs, covering both the old web (Apache) and new web (lighttpd) setups. It would be very helpful if these logs could be structured in the same way as the current pagecount dumps and be released on a per-hour basis.

Further suggestions:
- The identifier could be toollabs resp. labs.toollabs.
- The query-string part of the URL (?xyz=...) should be removed completely.

References:
1.) IRC, Petan, Jan 2, 2014
2.) WIP: Tools: Add infrastructure for AWStats https://gerrit.wikimedia.org/r/#/c/80332/
3.) IRC, scfc_de, Jan 2, 2014:
scfc_de: hedonil: I hope to have finished puppetizing tools-webproxy by the end of the week (the AWStats stuff is done IIRC). As -webproxy is the heart of the web access, review & deployment will then be *very* careful :-), but in general, depending on Coren's schedule, it should be deployable between the end of next week and the end of the month.

The current pagecount dumps are generated on a per-hour basis and share the following structure:

Filename, e.g.: pagecounts-20140101-020000.gz
Fields: 1. identifier  2. pagetitle  3. hits  4. bytes

En.d perform 3 60088
En.d rainforest 3 33780
En.d servers 3 22471
En.d situation 1 107043
En.d upwards 1 32565
En.d variety 2 59495
En Allergy 3 324964
En Arthur_Rubinstein 1 0
En Article 1 0
En British_cuisine 1 191021

Hierarchical structure of the identifier:
en   - Wikipedia (en)
en.b - Wikibooks (en)
en.d - Wiktionary (en)
en.n - Wikinews (en)
etc.
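For reference, a single dump line in this format can be split into its four fields with a few lines of Python (a minimal sketch; the function name is my own, not part of any existing tooling):

```python
def parse_pagecounts_line(line):
    """Split one pagecounts-dump line into its four fields:
    identifier, pagetitle, hits, bytes (all whitespace-separated)."""
    identifier, pagetitle, hits, size = line.split()
    return identifier, pagetitle, int(hits), int(size)

# Example with a line from the dump excerpt above:
# parse_pagecounts_line("En.d perform 3 60088")
# -> ("En.d", "perform", 3, 60088)
```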
That should be relatively simple to do. I do not, however, have the bandwidth to write this myself at this time. The logs are currently in Apache common format; if someone provides a suitable script to generate them I'll add them to the tool chain.
(18:38:31) hedonil: YuviPanda: Coren: AFAIK apergos is the one who manages this log stuff in operations. Maybe one could borrow some lines from their script, so that logfiles are summarized per hour and share the same structure. It seems to be quite powerful, as it handles all syslogs and access logs from the Varnishes. So I added apergos to CC; maybe they can provide some help.
Heh, I don't manage it, I just know where stuff that lands on dumps.wikimedia.org comes from. Just for the sake of clarification, we have logs written already that get saved someplace?
Now that the new YuviProxy is in place, I just need access to log dumps (IPs stripped out). sed & awk will do the rest of the job.
Any progress on this? As already mentioned, both nginx proxies (domainproxy & urlsproxy) went live. Thus it should be trivial to run some sed to sanitize the logs and make them publicly available on /dumps or /shared, even if they can't be summarized in the requested manner, which would still be fine. Any objections to that approach? If yes, which, and why? If not, when?
I can make redacted logs available in a familiar pattern, with the following stripped out:
1. IP address
2. Referrer fields
The only problem is that the proxy's logs are currently rotated pretty frequently, so I'll have to find some method of archiving them.
Great! (Keeping UA & referrer would be fine, though, as they are already present in the tools' logs.) Concerning archiving - maybe one could steal some ideas from the production Varnishes ;-)
Hmm, I don't see any non-WMF referrers in the access.log (looked at heritage's logs). Can someone verify/confirm?
After conversations with Coren: Lighty's default log format doesn't record referrers, though there's no particular reason for that. So I'll just strip out IPs.
So, the current plan would be to:
1. Have logrotate set to rotate the logs daily.
2. Set up a post-processing script that runs after rotation has happened and strips IPs (more probably, just replaces them with 127.0.0.1).
3. Move them somewhere appropriate.
This would incur a one-day delay before logs are made available, which I guess is OK?
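Step 2 above could be sketched roughly like this (a hedged illustration only, not the deployed script; it assumes the client IPv4 address is the first field of each log line, as in the common/combined formats):

```python
import re

# Matches an IPv4 address at the very start of a log line.
_LEADING_IP = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}")

def scrub_ip(line):
    """Replace the leading client IP with 127.0.0.1, as proposed above."""
    return _LEADING_IP.sub("127.0.0.1", line)
```

Running every rotated file through such a filter before publishing would keep the line layout intact for downstream awk/sed processing while removing the sensitive field.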
Would it be possible to logrotate/process them on an hourly basis? Like: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/ Just to be compatible and to allow more fine-grained analysis (load over the day). That would be really great, if at all possible.
(In reply to metatron from comment #11)
> Would it be possible to logrotate/process them on an hourly basis?
> Like: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/
> Just to be compatible and to allow a more fine-grained analysis (load over
> day). this would be really great. If anyhow possible..

Well, this only applies if the logs are not summarized. But with an hourly rotation, the individual files are kept small, and the delay for near-real-time analysis would only be one hour (instead of one day).
If you need some helping hands, provide me with ~100k raw log lines and I'll write a bash script with awk that summarizes & formats the logs exactly like the pageview dumps.
@metatron: Help would be appreciated! I've copied a scrubbed-of-IPs sample log (with 1000 entries) to /shared/sample-nginx-log/cleaned-samplelog.log. If you can write a script (Python, please? Pretty please?) that summarizes them like the pageview dumps, I'd be happy to get it puppetized.
Hopefully the 1000 log entries are enough. I can provide a larger sample if needed.
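A minimal Python sketch of such a summarizer, under assumptions about the sample log's layout (common-log-style lines with a quoted request and a byte count after the status code; the regex and function name are mine):

```python
import re
from collections import defaultdict

# Naive common-log matcher: quoted request line, status code, byte count.
LOG_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) (\d+|-)')

def summarize(lines, project="labs.tools"):
    """Aggregate hits and bytes per request path, pageview-dump style."""
    stats = defaultdict(lambda: [0, 0])   # path -> [hits, bytes]
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        path, _status, size = m.groups()
        path = path.split("?", 1)[0]      # strip query string, as requested
        entry = stats[path]
        entry[0] += 1
        entry[1] += 0 if size == "-" else int(size)
    return [f"{project} {p} {h} {b}" for p, (h, b) in sorted(stats.items())]
```

The real field layout of cleaned-samplelog.log may differ; the regex would need adjusting accordingly before puppetizing anything.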
And I guess the format would be:
1. toolname
2. url
3. hits
4. bytes
I wonder if we should actually augment this with other stats, such as:
5. error responses (non-200)
6. UAs
Perhaps the solution is to run something like AWStats or similar on the nginx host itself. Investigating.
Working on the little routine right now. It will provide two formats; query strings are stripped from both request and referrer.

1) "Tools Aggregate Format"
Period: per hour & per day
Format: toolname - hits size 2xx 3xx 4xx 5xx

tools.admin - 5 69588 5 0 0 0
tools.awb - 1 349 0 0 0 1
tools.betacommand-dev - 2 1639 1 0 1 0
tools.bibleversefinder - 2 3195 2 0 0 0
tools.blockcalc - 4 42144 4 0 0 0
tools.bookmanagerv2 - 4 5990 0 0 4 0
tools.catfood - 5 24568 4 0 0 1
tools.catscan2 - 11 392817 5 0 6 0
tools.checkwiki - 4 241 4 0 0 0
tools.cluebot - 5 7886 3 2 0 0
tools.connectivity - 3 11448 0 0 0 3
tools.croptool - 1 19553 1 0 0 0
tools.dewikinews-rss - 3 31143 3 0 0 0
tools.dupdet - 1 11433 1 0 0 0
tools.enwp10 - 7 25721 0 0 0 7
tools.geohack - 383 3705405 319 49 15 0
tools.glamtools - 30 881853 30 0 0 0

2) "Std. pageview-dumps format" (to be compatible)
Period: per hour
Format: project request hits size

labs.tools / 3 51037
labs.tools /Tool_Labs_logo_thumb.png 2 20140
labs.tools /admin/img/desc_dark.png 1 1036
labs.tools /admin/libs/jquery.js 2 57324
labs.tools /admin/libs/jquery.tablesorter.min.js 2 11228
labs.tools /apple-touch-icon-precomposed.png 1 1382
labs.tools /apple-touch-icon.png 1 6451
labs.tools /awb/stats/ 1 349
labs.tools /betacommand-dev/UserCompare/TreCoolGuy.html 1 1491
labs.tools /betacommand-dev/cgi-bin/uc 1 148
labs.tools /bibleversefinder/ 2 3195
labs.tools /blockcalc/index.php 1 974
labs.tools /blockcalc/style/backdrop.png 1 37681
labs.tools /blockcalc/style/style.css 1 361
labs.tools /blockcalc/style/wikimedia-toolserver-button.png 1 3128
labs.tools /bookmanagerv2/w/index.php 2 714
labs.tools /catfood/catfood.php 5 24568
labs.tools /catscan2/catscan2.php/CategoryIntersect.php 1 16041
labs.tools /catscan2/catscan2.php/Gallery.php 2 7451
labs.tools /catscan2/cross_cats.php 1 357043
labs.tools /catscan2/pages_in_cats.php 2 3889
labs.tools /catscan2/quick_intersection.php 5 8393
labs.tools /checkwiki/cgi-bin/checkwiki_bots.cgi 4 241
labs.tools /cluebot/ 5 7886
labs.tools /connectivity/cgi-bin/go.sh 3 11448
labs.tools /croptool/ 1 19553
labs.tools /dewikinews-rss/ 1 16589
labs.tools /dewikinews-rss/kategorie 2 14554
labs.tools /dupdet/compare.php 1 11433
labs.tools /enwp10/cgi-bin/list2.fcgi 7 25721
labs.tools /favicon.ico 17 256462
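The per-tool aggregation could be sketched roughly like this (a hedged illustration, not the actual routine; it assumes the tool name is the first path segment and buckets statuses by their first digit):

```python
from collections import defaultdict

def aggregate(records):
    """records: iterable of (path, status, bytes) tuples.
    Emits lines in the "Tools Aggregate Format":
    toolname - hits size 2xx 3xx 4xx 5xx."""
    # tool -> [hits, bytes, n2xx, n3xx, n4xx, n5xx]
    stats = defaultdict(lambda: [0, 0, 0, 0, 0, 0])
    for path, status, size in records:
        name = path.strip("/").split("/", 1)[0]
        if not name:                 # requests for "/" have no tool segment
            continue
        s = stats["tools." + name]
        s[0] += 1                    # hits
        s[1] += size                 # bytes
        bucket = status // 100       # 2, 3, 4 or 5
        if 2 <= bucket <= 5:
            s[bucket] += 1           # slots 2..5 hold the class counters
    return [f"{t} - {' '.join(map(str, s))}" for t, s in sorted(stats.items())]
```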