Last modified: 2014-05-29 17:25:30 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T56359, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 54359 - country data identified by row number rather than actual country name (or ISO code)
Status: REOPENED
Product: Analytics
Classification: Unclassified
Component: Visualization
Version: unspecified
Hardware: All All
Importance: Unprioritized major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on: 54611 54612
Blocks:
Reported: 2013-09-20 00:35 UTC by Asaf Bartov
Modified: 2014-05-29 17:25 UTC (History)

Description Asaf Bartov 2013-09-20 00:35:22 UTC
(this is not, strictly, a limn bug, but a bug in how we generate the data, though I guess it could be resolved by adding a feature to limn, such as an explicit "key" column rather than relying on row numbers.)

We use Limn to display "active editors by country", based on the files generated by scripts, visible in /datasources.  In the past (I don't know how to reproduce this), a graph created and saved featuring data for country X, _after some time_ (i.e. when different data was generated by the scripts) began displaying data for country Y.  

My theory, confirmed at one point by Evan, was that the graphs relied on row numbers in the CSVs, and for whatever reason, countries must have been added or removed from the report (or perhaps countries with 0 editors don't get listed at all?  That could make for very erratic changes month over month...), which shifted the row numbers, which caused the graph to display false and misleading data.

I cannot overstate how crucial this is: it makes sharing links to the actual graphs (as distinct from saving screenshots of them) impossible, because we can't guarantee a future viewer of that link would actually see correct data.

Since I have no insight into the data-generating scripts themselves, I can't ascertain this is no longer a problem, nor can I force this to reproduce.  But I tried to describe the problem as technically as I can, to help you decide if this can still happen.

If you are confident it can't happen any more, I'd be thrilled to hear, and you can close this bug.
Comment 1 Diederik van Liere 2013-09-20 00:46:25 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://mingle.corp.wikimedia.org/projects/analytics/cards/1164
Comment 2 christian 2013-09-20 10:06:58 UTC
We recently fixed a bug that is very similar to the problem you
describe. Let me make sure it's only related and not the same thing.

So until some weeks back, when looking at the active editors for
country X (e.g.: [1]), the graph showed the data for one of the
countries in the set {X, Y1, Y2, Y3}. Reloading the page might show
the data for a different country of the same set {X, Y1, Y2, Y3}. When
ordering Y1, Y2, Y3 alphabetically, they were really close to each
other.

The root cause seems to have been column order mismatches between
different versions of the same file. This problem was solved by trying
to make sense of ~17k files and removing ~15k stale files/duplicates.


(In reply to comment #0)
> [...] than relying on row numbers.)
> [...]
> My theory, confirmed at one point by Evan, was that the graphs relied on row
> numbers

Sorry to be nitpicking here, but since you are talking both here and
also some lines above about /row/ numbers, let me make sure we are
talking about the same files. Do you really mean /row/ numbers (that
could totally be the case, but would hint towards you using files that
I have not yet discovered in our repos), or /column/ numbers (As for
example used in [2])?

The files produced by Evan's geowiki scripts (i.e.: "Active Editor"
data) rely on column numbers. Yes, the column number of country X in
file Z.csv might change between any given day.

And in fact, they not only “might” change, they actually do change often.

(For the current geowiki dashboards, graphs, ... those frequent
changes are not a problem, as we regenerate the relevant files for
each run using the current csvs)
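
To illustrate why a stored column number is fragile, here is a minimal sketch. The two CSV snapshots, the country values, and the helper names `value_by_index` and `value_by_name` are all hypothetical and are not taken from the geowiki scripts or from Limn; they only show what happens when a new column shifts the ones after it.

```python
import csv
import io

# Hypothetical snapshots of one data file on two different days. A new
# country column ("Georgia") appears on day 2 and shifts the later columns.
DAY_1 = "date,France,Germany,Ghana\n2013-08-01,120,340,15\n"
DAY_2 = "date,France,Georgia,Germany,Ghana\n2013-09-01,125,8,350,16\n"


def value_by_index(csv_text, column_index):
    """Return the first data row's value at a fixed column position."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[1][column_index]


def value_by_name(csv_text, column_name):
    """Return the first data row's value from the column with this header."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    return row[column_name]


germany_index = 2  # position of "Germany" at the time the graph was saved

# Index-based lookup silently switches to another country on day 2.
by_index = [value_by_index(DAY_1, germany_index),
            value_by_index(DAY_2, germany_index)]

# Name-based lookup keeps following "Germany" wherever its column moves.
by_name = [value_by_name(DAY_1, "Germany"),
           value_by_name(DAY_2, "Germany")]

print(by_index)
print(by_name)
```

In this sketch the index-based lookup ends up reading the "Georgia" column on the second day, while the name-based lookup still reads "Germany".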

> [...], which
> shifted the row numbers, which caused the graph to display false and
> misleading
> data.

If that graph displayed false data, that's a real problem from my
perspective.

But since you are using past tense in your description and you also
state that you cannot force the problem to reproduce… are we still
affected by the problem?

If so, could you point me to a concrete file/URL that causes problems?

> Since I have no insight into the data-generating scripts themselves, [...]

The scripts are at
  https://gerrit.wikimedia.org/r/#/admin/projects/analytics/geowiki
. As usual: Patches welcome :-)
You can find a rough overview of the geowiki dataflow at
  https://wikitech.wikimedia.org/wiki/Analytics/Geowiki#Dataflow
.

[1] http://gp.wmflabs.org/graphs/en_germany_all
[2] http://gp.wmflabs.org/data/datafiles/gp/en_all.csv
Comment 3 Asaf Bartov 2013-09-24 02:58:17 UTC
Thanks for this insightful comment!  

Yes, this sounds exactly like the problem we had been having.  My use of "row number" is naive -- that is, I meant that the data was identified by index (incorrectly assumed a row index rather than a column index) rather than by key.  So yes, it matches the column number problem you describe.

So I'm hoping this is resolved now, but I am still confused by two statements you make:
1. On one hand, you say the "problem was solved by trying to make sense of ~17k files and removing ~15k stale files/duplicates."
2. OTOH, you say "Yes, the column number of country X in file Z.csv might change between any given day."

So... if graphs still rely on column numbers, are we still in essentially the same situation, wherein we can't trust a graph to still be pointing at data for the same country after N days/months?
Comment 4 christian 2013-09-24 11:41:56 UTC
(In reply to comment #3)
> So... if graphs still rely on column numbers, are we still in essentially the
> same situation, wherein we can't trust a graph to still be pointing at data
> for
> the same country after N days/months?

I do not think so.

On the one hand, we are not only generating the data files, but also
the graph definitions daily. So the referenced columns within the
graph files and the columns in the data files should correspond.
Even after columns got rearranged.

For example if column X1 of data file Z becomes X2, the corresponding
graph file for Z is also updated to use X2 instead of X1.
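
A rough sketch of that idea, assuming a made-up graph definition format (a dict with `label` and `column` keys) rather than the real Limn graph files, and a hypothetical helper `build_graph_definition`: each run derives the column index from the current CSV header, so the definition follows the column when it moves.

```python
import csv
import io


def build_graph_definition(csv_text, country):
    """Build a (hypothetical) graph definition for one country.

    The column index is looked up in the header of the CSV that was just
    generated, so it always matches the current column order.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    return {"label": country, "column": header.index(country)}


# Hypothetical data files from two consecutive runs.
old_file = "date,France,Germany\n2013-08-01,120,340\n"
new_file = "date,France,Georgia,Germany\n2013-09-01,125,8,350\n"

old_graph = build_graph_definition(old_file, "Germany")
new_graph = build_graph_definition(new_file, "Germany")

print(old_graph)
print(new_graph)
```

Because the definition is rebuilt together with the data, the stored index changes from 2 to 3 here, and the graph keeps pointing at the same country.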

On the other hand, the cleaning up of the served data repositories
made sure that no stale files (with outdated column indices) are
lying around, waiting to be picked up by limn.

As it seems the problem we fixed matches your observations, I am
marking the bug as fixed for now.

However, if you notice (once gp.wmflabs.org goes online with data
again) that graphs come with wrong captions, do let us know and reopen
the bug.

Thanks!
Comment 5 Asaf Bartov 2013-09-24 18:13:27 UTC
Excellent, thanks!  This was the biggest show-stopper for me.
Comment 6 christian 2013-09-25 18:34:34 UTC
It seems it was a short party :-(

Meanwhile, there was a change that allows users to create their own graphs.
Those "user-created" graphs are not updated if we recreate the
"script-generated" graphs.

So while "script-generated" graphs are not affected, the "user-created"
graphs are now again affected.
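
One possible way to guard graphs that are never regenerated, sketched under assumptions: the graph stores the country label next to the column index (the `label`/`column` dict and the `resolve_column` helper are hypothetical, not part of Limn), and the index is re-checked against the current header before use.

```python
def resolve_column(graph, header):
    """Return the column index a saved graph should read from.

    The stored index is trusted only if the header at that position still
    matches the stored label. Otherwise the label is looked up again, and a
    KeyError is raised if the country is no longer in the file.
    """
    index = graph["column"]
    if index < len(header) and header[index] == graph["label"]:
        return index
    if graph["label"] in header:
        return header.index(graph["label"])
    raise KeyError("column %r not found in data file" % graph["label"])


# Hypothetical user-created graph, saved when "Germany" was column 2.
saved_graph = {"label": "Germany", "column": 2}

# Hypothetical header of today's data file, with a new column inserted.
current_header = ["date", "France", "Georgia", "Germany"]

resolved = resolve_column(saved_graph, current_header)
print(resolved)
```

This is only an illustration of keying by name instead of by position; whether it fits Limn's actual graph format is not something this report establishes.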

Hence, reopening the bug.
Comment 7 Andre Klapper 2014-05-29 17:25:30 UTC
[moving tickets as per bug 65903]
