Last modified: 2014-08-27 22:39:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T61222, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 59222 - Request to access redacted webproxy logfiles of (Tool) Labs
Status: NEW
Product: Wikimedia Labs
Classification: Unclassified
Component: General (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2014-01-02 19:13 UTC by metatron
Modified: 2014-08-27 22:39 UTC (History)
CC: 11 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description metatron 2014-01-02 19:13:18 UTC
I want to integrate the pagecounts of Tool Labs (resp. Labs) into the tool https://tools.wmflabs.org/wikiviewstats/ . For this, it would be necessary to have access to redacted webproxy logs, covering both the old web (Apache) and new web (lighttpd) setups.

It would be very helpful if these logs could be structured in the same way as the current pagecount dumps and be released on a per-hour basis.

Further suggestions:
- the identifier could be toollabs resp. labs.toollabs
- the query-string part of the URL (?xyz=..) should be removed completely


Reference:
1.) IRC Petan Jan 2, 2014

2.) WIP: Tools: Add infrastructure for AWStats
https://gerrit.wikimedia.org/r/#/c/80332/

3.) IRC scfc_de Jan 2, 2014
scfc_de: hedonil: I hope to have finished puppetizing tools-webproxy by the end of the week (the AWStats stuff is done IIRC).  As -webproxy is the heart of the web access, review & deployment will then be *very* careful :-), but in general, depending on Coren's schedule, it should be deployable by between the end of next week and the end of the month.


The current pagecount dumps are generated on a per-hour basis and share the following structure:

filename eg:
pagecounts-20140101-020000.gz

1.: identifier  2.: pagetitle  3.: hits  4.: bytes

En.d perform 3 60088
En.d rainforest 3 33780
En.d servers 3 22471
En.d situation 1 107043
En.d upwards 1 32565
En.d variety 2 59495
En Allergy 3 324964
En Arthur_Rubinstein 1 0
En Article 1 0
En British_cuisine 1 191021

hierarchical structure of identifier

en    - Wikipedia   (en)
en.b  - Wikibooks   (en)
en.d  - Wiktionary   (en)
en.n  - Wikinews    (en)
etc.
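For illustration, a line of this dump format can be parsed in a few lines of Python. This is a minimal sketch; the field names are my own and not part of any dump specification:

```python
# Parse one pagecounts dump line: "identifier pagetitle hits bytes".
# Field names below are illustrative; the dump itself is untyped text.
def parse_pagecount_line(line):
    identifier, title, hits, size = line.split()
    return {
        "identifier": identifier,   # e.g. "en.d" = English Wiktionary
        "title": title,
        "hits": int(hits),
        "bytes": int(size),
    }

rec = parse_pagecount_line("En.d perform 3 60088")
print(rec["identifier"], rec["hits"], rec["bytes"])
```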
Comment 1 Marc A. Pelletier 2014-01-06 17:00:33 UTC
That should be relatively simple to do.  I do not, however, have the bandwidth to write this myself at this time.

The logs are currently in Apache common format; if someone provides a suitable script to generate them I'll add them to the tool chain.
Comment 2 metatron 2014-01-06 19:09:12 UTC
(18:38:31) hedonil: YuviPanda: Coren: AFAIK  apergos is the one who manages this log stuff in operations. maybe one could borrow some lines of his script, so that logfiles are summarized per hour and share the same structure. Seems to be powerful as it handles all syslogs and accesslogs from the varnishes.

So I added apergos to CC. Maybe he can provide some help.
Comment 3 Ariel T. Glenn 2014-01-08 08:16:27 UTC
Heh, I don't manage it, I just know where the stuff that lands on dumps.wikimedia.org comes from. Just for the sake of clarification: do we already have logs written that get saved someplace?
Comment 4 metatron 2014-05-04 18:30:17 UTC
Now that the new YuviProxy is in place, I just need access to log dumps (with IPs stripped). sed & awk will do the rest of the job.
Comment 5 metatron 2014-06-06 19:23:08 UTC
Any progress on this thing?

As already mentioned, both nginx proxies (domainproxy & urlsproxy) went live.
Thus it should be trivial to run some sed to sanitize the logs and make them publicly available on /dumps or /shared.

Even if they can't be summarized in the requested manner, that would still be fine.

Any objections to that approach? If yes, which, and why? If not, when?
Comment 6 Yuvi Panda 2014-06-06 20:05:21 UTC
I can make redacted logs available in a familiar pattern, with the following stripped out:

1. IP Address
2. Referrer fields

The only problem is that currently the proxy's logs are rotated pretty frequently, so I'll have to find some method of archiving them.
Comment 7 metatron 2014-06-06 20:11:03 UTC
Great! (UA & referer would be fine, though, as they are already present in the tools' logs.) Concerning archiving - maybe one could steal some ideas for this from the production varnishes ;-)
Comment 8 Yuvi Panda 2014-06-06 20:12:28 UTC
Hmm, I don't see any non-WMF referrers in the access.log (I looked at heritage's logs). Can someone verify/confirm?
Comment 9 Yuvi Panda 2014-06-06 20:22:03 UTC
After conversations with Coren:

Lighty's default format doesn't record referrers, and there's no reason to change that. So I'll just strip out IPs.
Comment 10 Yuvi Panda 2014-06-06 20:23:39 UTC
So, the current plan would be to:

1. Have logrotate set to rotate logs daily
2. Set up a post-processing script that runs after the rotation has happened and strips IPs (more likely, just replaces them with 127.0.0.1)
3. Move the logs somewhere appropriate

This would incur a one-day delay before logs are made available, which I guess is OK?
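The IP-stripping step in the plan above could be sketched as follows. This assumes the client IP is the first whitespace-delimited field, which holds for the default Apache/nginx common and combined log formats:

```python
import re

# Match the first whitespace-delimited field of a log line (the client IP
# in common/combined log formats).
IP_RE = re.compile(r"^\S+")

def scrub_ip(line):
    # Replace the leading client IP with a loopback placeholder;
    # everything else on the line is left untouched.
    return IP_RE.sub("127.0.0.1", line, count=1)

line = '203.0.113.9 - - [06/Jun/2014:20:23:39 +0000] "GET /catfood/ HTTP/1.1" 200 24568'
print(scrub_ip(line))
```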
Comment 11 metatron 2014-06-06 20:33:38 UTC
Would it be possible to logrotate/process them on an hourly basis?
Like: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/
Just to be compatible and to allow more fine-grained analysis (load over the day). This would be really great, if at all possible.
Comment 12 metatron 2014-06-06 20:41:51 UTC
(In reply to metatron from comment #11)
> Would it be possible to logrotate/process them on an hourly basis? 
> Like: https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-06/
> Just to be compatible and to allow a more fine-grained analysis (load over
> day). this would be really great. If anyhow possible..

Well, this only applies if the logs are not summarized. But with an hourly rotation, single files are kept small and the delay for "near-real-time" analysis would only be 1 hour (instead of 1 day).
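An hourly rotation like the one requested could look roughly like the following logrotate fragment. This is a sketch only: the log path and the postrotate script name are assumptions, and the `hourly` directive requires a reasonably recent logrotate plus a cron entry that invokes logrotate every hour (by default it runs daily):

```
# Hypothetical /etc/logrotate.d/tools-webproxy fragment (paths assumed)
/var/log/nginx/access.log {
    hourly
    rotate 24
    compress
    sharedscripts
    postrotate
        # hypothetical hook: scrub IPs, then publish to /dumps or /shared
        /usr/local/bin/scrub-and-publish.sh
    endscript
}
```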
Comment 13 metatron 2014-06-06 20:57:41 UTC
If you need some helping hands, provide me with some 100k raw log lines and I'll write a bash script with awk that summarizes & formats the logs exactly like the pageview dumps.
Comment 14 Yuvi Panda 2014-06-14 18:59:30 UTC
@metatron: Help would be appreciated! I've copied a scrubbed-of-IPs sample log (with 1000 entries) to /shared/sample-nginx-log/cleaned-samplelog.log. If you can write a script (Python please? pretty please?) that summarizes them to be like the pageview dumps, I'd be happy to get that puppetized.
Comment 15 Yuvi Panda 2014-06-14 19:14:14 UTC
Hopefully the 1000 log entries are enough. I can provide a larger sample if needed.
Comment 16 Yuvi Panda 2014-06-14 19:36:00 UTC
And I guess the format would be:

1. toolname
2. url
3. hits
4. bytes

I wonder if we should actually augment this with other stats, such as:

5. error responses (non-200)
6. UAs.

Perhaps the solution is to run something like AWStats on the nginx host itself. Investigating.
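A summarizer along the lines requested above could be sketched like this. The log-line regex and the "labs.tools" project label are assumptions based on the formats discussed in this bug, not the actual proxy configuration:

```python
import re
from collections import defaultdict

# Summarize access-log lines (common log format) into
# "project path hits bytes" rows, mirroring the pageview-dump layout.
LOG_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3}) (\d+)'
)

def summarize(lines, project="labs.tools"):
    totals = defaultdict(lambda: [0, 0])   # path -> [hits, bytes]
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue                       # skip malformed lines
        path, status, size = m.groups()
        path = path.split("?", 1)[0]       # strip the query string
        totals[path][0] += 1
        totals[path][1] += int(size)
    return ["%s %s %d %d" % (project, p, h, b)
            for p, (h, b) in sorted(totals.items())]

sample = [
    '127.0.0.1 - - [14/Jun/2014:19:00:01 +0000] "GET /cluebot/?x=1 HTTP/1.1" 200 1500',
    '127.0.0.1 - - [14/Jun/2014:19:00:02 +0000] "GET /cluebot/ HTTP/1.1" 200 1600',
]
for row in summarize(sample):
    print(row)   # -> labs.tools /cluebot/ 2 3100
```

Extending this to the extra columns suggested above (non-200 responses, UAs) would only mean widening the accumulator.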
Comment 17 metatron 2014-08-27 22:39:00 UTC
Working on the little routine right now. It will provide two formats.

- query strings are stripped from both the request and the referer

1.) "Tools Aggregate Format"

Period: per hour & per day
Format: toolname - hits, size, 2xx, 3xx, 4xx, 5xx

tools.admin - 5 69588 5 0 0 0
tools.awb - 1 349 0 0 0 1
tools.betacommand-dev - 2 1639 1 0 1 0
tools.bibleversefinder - 2 3195 2 0 0 0
tools.blockcalc - 4 42144 4 0 0 0
tools.bookmanagerv2 - 4 5990 0 0 4 0
tools.catfood - 5 24568 4 0 0 1
tools.catscan2 - 11 392817 5 0 6 0
tools.checkwiki - 4 241 4 0 0 0
tools.cluebot - 5 7886 3 2 0 0
tools.connectivity - 3 11448 0 0 0 3
tools.croptool - 1 19553 1 0 0 0
tools.dewikinews-rss - 3 31143 3 0 0 0
tools.dupdet - 1 11433 1 0 0 0
tools.enwp10 - 7 25721 0 0 0 7
tools.geohack - 383 3705405 319 49 15 0
tools.glamtools - 30 881853 30 0 0 0


2.) "Std. pageview-dumps format" (to be compatible)

Period: per hour
Format: project request hits size

labs.tools / 3 51037
labs.tools /Tool_Labs_logo_thumb.png 2 20140
labs.tools /admin/img/desc_dark.png 1 1036
labs.tools /admin/libs/jquery.js 2 57324
labs.tools /admin/libs/jquery.tablesorter.min.js 2 11228
labs.tools /apple-touch-icon-precomposed.png 1 1382
labs.tools /apple-touch-icon.png 1 6451
labs.tools /awb/stats/ 1 349
labs.tools /betacommand-dev/UserCompare/TreCoolGuy.html 1 1491
labs.tools /betacommand-dev/cgi-bin/uc 1 148
labs.tools /bibleversefinder/ 2 3195
labs.tools /blockcalc/index.php 1 974
labs.tools /blockcalc/style/backdrop.png 1 37681
labs.tools /blockcalc/style/style.css 1 361
labs.tools /blockcalc/style/wikimedia-toolserver-button.png 1 3128
labs.tools /bookmanagerv2/w/index.php 2 714
labs.tools /catfood/catfood.php 5 24568
labs.tools /catscan2/catscan2.php/CategoryIntersect.php 1 16041
labs.tools /catscan2/catscan2.php/Gallery.php 2 7451
labs.tools /catscan2/cross_cats.php 1 357043
labs.tools /catscan2/pages_in_cats.php 2 3889
labs.tools /catscan2/quick_intersection.php 5 8393
labs.tools /checkwiki/cgi-bin/checkwiki_bots.cgi 4 241
labs.tools /cluebot/ 5 7886
labs.tools /connectivity/cgi-bin/go.sh 3 11448
labs.tools /croptool/ 1 19553
labs.tools /dewikinews-rss/ 1 16589
labs.tools /dewikinews-rss/kategorie 2 14554
labs.tools /dupdet/compare.php 1 11433
labs.tools /enwp10/cgi-bin/list2.fcgi 7 25721
labs.tools /favicon.ico 17 256462
