Last modified: 2013-09-04 11:51:28 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T31687, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 29687 - Some pages have a 'null' lastmod field in the sitemap!
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Maintenance scripts
Version: 1.17.x
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2011-07-02 22:07 UTC by Dan Bolser
Modified: 2013-09-04 11:51 UTC (History)
CC List: 5 users

Description Dan Bolser 2011-07-02 22:07:42 UTC
I'm creating a sitemap for my wiki with the following command:

php /memberroot/dmb/public_html/metabase/mw/maintenance/generateSitemap.php \
--fspath /memberroot/dmb/public_html/metabase/mw/sitemap \
--server http://metadatabase.org \
--urlpath http://metadatabase.org/sitemap


When I load this into Google Webmaster Tools, almost everything works fine. However, a couple of pages have a weird 'null' (empty) lastmod field:

        <url>
                <loc>http://metadatabase.org/wiki/Main_Page</loc>
                <lastmod></lastmod>
                <priority>1.0</priority>
        </url>


and:
        <url>
                <loc>http://metadatabase.org/wiki/Help:About</loc>
                <lastmod></lastmod>
                <priority>0.5</priority>
        </url>


It's always these two pages!


This causes Google to barf with an error about an incorrect date format.
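
Entries like the two above can be found before the sitemap is submitted by scanning the generated XML for <url> elements whose <lastmod> is present but empty. The following is a minimal Python sketch, not part of MediaWiki: the function name find_empty_lastmod and the second sample URL are hypothetical, and it assumes the file is well-formed XML that may or may not declare the sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}url' down to 'url'."""
    return tag.rsplit("}", 1)[-1]


def find_empty_lastmod(xml_text):
    """Return the <loc> of every <url> whose <lastmod> is present but empty."""
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter():
        if local_name(elem.tag) != "url":
            continue
        # Map each child tag (loc, lastmod, priority) to its stripped text.
        fields = {local_name(c.tag): (c.text or "").strip() for c in elem}
        if "lastmod" in fields and not fields["lastmod"]:
            bad.append(fields.get("loc", ""))
    return bad


# First entry is copied from the report; the second is a made-up valid entry.
SAMPLE = """<urlset>
  <url>
    <loc>http://metadatabase.org/wiki/Main_Page</loc>
    <lastmod></lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://metadatabase.org/wiki/Example_Page</loc>
    <lastmod>2011-07-05T21:40:06Z</lastmod>
    <priority>0.5</priority>
  </url>
</urlset>"""

print(find_empty_lastmod(SAMPLE))
```

A missing <lastmod> is deliberately not reported, since the element is optional in a sitemap; only the empty form shown in this report is flagged.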
Comment 1 Dan Bolser 2011-07-03 14:25:26 UTC
Here is the exact error message from Google Webmaster Tools:

6680	Invalid date
An invalid date was found. Please fix the date or formatting before resubmitting.	

Parent tag: url
Tag: lastmod
Value:

Problem detected on: Jul 3, 2011
Comment 2 Mark A. Hershberger 2011-07-04 00:44:13 UTC
Another possibility might be http://www.mediawiki.org/wiki/Extension:GoogleNewsSitemap
Comment 3 Brion Vibber 2011-07-05 20:51:10 UTC
Dan, can you check what the page_touched values for these rows in the page table are?

Normally this should carry a timestamp, which in MediaWiki on MySQL is stored as a 14-character string (YYYYMMDDHHMMSS). Null or empty *should* end up formatting the current time, though there may be some bad values or such.
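
To make the expected format concrete, here is a minimal Python sketch (not MediaWiki code; to_lastmod is a hypothetical helper) that turns a 14-character YYYYMMDDHHMMSS value into a W3C datetime string of the kind a sitemap lastmod can hold, and returns None for anything that is not 14 digits forming a real date. It assumes the stored value is in UTC.

```python
from datetime import datetime


def to_lastmod(page_touched):
    """Convert 'YYYYMMDDHHMMSS' to 'YYYY-MM-DDTHH:MM:SSZ', or None if malformed."""
    # Reject anything that is not exactly 14 ASCII digits.
    if len(page_touched) != 14 or not (page_touched.isascii() and page_touched.isdigit()):
        return None
    try:
        parsed = datetime.strptime(page_touched, "%Y%m%d%H%M%S")
    except ValueError:
        return None  # 14 digits, but not a real date or time
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_lastmod("20110705214006"))  # 2011-07-05T21:40:06Z
print(to_lastmod(""))                # None
```

A check like this, run over the page_touched column, would separate well-formed timestamps from bad values without relying on how the sitemap script treats them.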
Comment 4 Dan Bolser 2011-07-06 09:28:06 UTC
I debugged this a bit with TimStarling, but the results are still a bit confusing.

It seems that MediaWiki is corrupting the page_touched field!

I recently imported the data for this wiki from MW 1.11 (using MySQL dump 10.13  Distrib 5.1.56) into MW 1.17 (using MySQL 4.1.22-standard-log). Since that import, I touched a couple of pages (guess which?) and discovered this problem with the site map.

Since reporting the two problem pages, I ran a process that touched many pages, and here is the state of the page_touched field:

mysql> select page_touched, count(*) from mb_page group by page_touched limit 20;
+----------------+----------+
| page_touched   | count(*) |
+----------------+----------+
| 2.01107052046E |      606 |
| 2.01107052047E |     1179 |
| 2.01107052048E |     1116 |
| 2.01107052049E |     1255 |
| 2.01107052092E |        2 |
| 2.01107052094E |        1 |
| 2.01107052095E |        1 |
| 2.01107052096E |        5 |
| 2.01107052097E |      275 |
| 2.01107052098E |      132 |
| 2.01107052227E |        2 |
| 2.01107052229E |        1 |
| 2.0110705223E+ |        1 |
| 20070810130314 |        1 |
| 20090609211125 |        1 |
| 20100315174918 |        1 |
| 20110705214006 |        1 |
+----------------+----------+
17 rows in set (0.01 sec)


With help from TimStarling, I checked the data in my import 'dump' file, and concluded that it looks fine. I re-imported the dump, and it looked fine in the database (no 'corruption' like the above). I then followed the 1.17 DB update procedure (first using the GUI and then again using the CLI), and both looked fine (no corruption).

Then I 'recovered' the correct page_touched field using the update (to keep my changes post import).

Then I edited a page, and saw the same corruption!


Before edit:

mysql> select * from mb_page where page_title = "Main_Page"; 
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
|    4792 |              0 | Main_Page  |                   |        93887 |                0 |           0 | 0.940133380737 | 20090811074455 |       15093 |     1069 |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
1 row in set (0.01 sec)


After edit:

mysql> select * from mb_page where page_title = "Main_Page"; 
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_latest | page_len |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
|    4792 |              0 | Main_Page  |                   |        93888 |                0 |           0 | 0.940133380737 | 2.01107060996E |       15097 |     1074 |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+-------------+----------+
1 row in set (0.02 sec)



As I said, in the interim I had touched many pages. Going to Google Webmaster Tools, I now see many errors!


Seems pretty clear, now that I've set it out, that MW 1.17 (+ extensions) is borking the page_touched field on this version of MySQL, leading to an error in the sitemap.
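
The corrupted page_touched values are all exactly 14 characters long and look like the start of a number in scientific notation. This report does not establish the cause, but as one hypothesis, the Python sketch below formats a numeric timestamp as a float and cuts the result to the 14-character column width. Everything in it is an assumption made for illustration: the helper name as_truncated_float, the two input timestamps, and the choice of 12 significant digits, which was picked only because it reproduces strings seen in the query output above.

```python
def as_truncated_float(timestamp, width=14):
    """Format a numeric timestamp with 12 significant digits, then truncate."""
    text = format(float(timestamp), ".12G")  # e.g. '2.0110705223E+13'
    return text[:width]                      # keep only what a 14-char column holds


# Hypothetical inputs; the outputs match two of the values in the table above.
print(as_truncated_float(20110705223000))  # 2.0110705223E+
print(as_truncated_float(20110705204600))  # 2.01107052046E
```

If something along these lines is happening, the timestamp is being handled as a number rather than as a string somewhere between MediaWiki and the database; where that occurs is not determined here.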
Comment 5 Andre Klapper 2012-10-16 13:52:34 UTC
Dan: Is this still a problem? 
If so, which MW version do you use nowadays?
