
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T49407, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 47407 - Tool to check ZIM integrity needed
Status: NEW
Product: openZIM
Classification: Unclassified
Component: zimlib (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Lowest enhancement
Target Milestone: ---
Assigned To: Kiran Mathew Koshy
Depends on:
Blocks:
Reported: 2013-04-19 12:35 UTC by Kelson [Emmanuel Engelhart]
Modified: 2014-10-16 12:08 UTC (History)
CC List: 1 user
See Also:
Web browser: ---
Mobile Platform: ---

Description Kelson [Emmanuel Engelhart] 2013-04-19 12:35:14 UTC
We need to have a way to check the quality of a ZIM file.

At least the following should be checked:
* (WARNING) has a welcome page
* (ERROR) broken local HTML links
* (WARNING) redundant content
* ...

------- Comment #1 From Tommi Mäkitalo 2010-03-27 10:35:25 -------

I will make a zimlint for that.

------- Comment #2 From Emmanuel Engelhart 2010-04-16 10:52:30 -------

It should also be checked that the HTML content does not have any online dependencies.
For example <img src=http://....
Also in the CSS.

/* This bug was migrated from the previous openZIM bug tracker */
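
A minimal illustration of the online-dependency check discussed in comment #2 above (a sketch, not code from zimlib or zimcheck): scan HTML or CSS text for absolute http(s) references with a C++11 regex. The helper name and the pattern are assumptions for illustration only and will miss cases such as protocol-relative URLs, srcset attributes or @import rules.

#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of zimlib/zimcheck): return the absolute
// http(s) references found in a chunk of HTML or CSS text.
std::vector<std::string> findOnlineDependencies(const std::string& content)
{
    // Matches src/href attributes and CSS url() values pointing to http(s)://
    static const std::regex pattern(
        R"((src|href)\s*=\s*["']?https?://[^"'\s>]+|url\(\s*["']?https?://[^"')\s]+)",
        std::regex::icase);

    std::vector<std::string> hits;
    for (std::sregex_iterator it(content.begin(), content.end(), pattern), last;
         it != last; ++it)
        hits.push_back(it->str());
    return hits;
}
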
Comment 1 Kelson [Emmanuel Engelhart] 2013-05-14 18:19:26 UTC
Here is a detailed list of things which should be checked:

1 - Internal checksum: launch the internal checksum verification

2 - Dead internal URLs: check all ZIM-internal URLs and verify that the target exists. That means CSS/JavaScript loading URLs, image src and url href... and probably a few others

3 - Check that URLs in CSS files are not external, and that internal URLs are valid

4 - Verify that there are no online dependencies (images, JavaScript/CSS loading, ...) in the HTML code

5 - Check if the following metadata entries are there: title, creator, publisher, date, description, language. Check if date and language are in the correct format. http://openzim.org/wiki/Metadata

6 - Verify that the favicon is there

7 - Verify that the main page header entry is defined and points to valid content.

8 - Check duplicate content: be sure that the same content is not available under two different URLs, for example the same picture twice.

9 - Verify that internal URLs are not absolute
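
To illustrate checks 5 and 6 from the list above, here is a rough sketch (not the actual zimcheck code) that walks a ZIM file once with zimlib and reports missing metadata entries and a missing favicon. It assumes zimlib's classic iteration interface (zim::File, File::const_iterator, Article::getNamespace()/getUrl()); exact method names may differ between zimlib versions, and the favicon URL is assumed to be "favicon".

#include <zim/file.h>

#include <iostream>
#include <set>
#include <string>

int main(int argc, char* argv[])
{
    if (argc != 2)
    {
        std::cerr << "usage: metacheck file.zim" << std::endl;
        return 1;
    }

    zim::File file(argv[1]);

    std::set<std::string> metadata;   // URLs found in the 'M' (metadata) namespace
    bool hasFavicon = false;

    for (zim::File::const_iterator it = file.begin(); it != file.end(); ++it)
    {
        if (it->getNamespace() == 'M')
            metadata.insert(it->getUrl());
        if (it->getUrl() == "favicon")          // assumed favicon URL (check 6)
            hasFavicon = true;
    }

    const char* required[] = { "Title", "Creator", "Publisher",
                               "Date", "Description", "Language" };
    for (const char* name : required)
        if (metadata.count(name) == 0)
            std::cout << "WARNING: missing metadata entry " << name << std::endl;

    if (!hasFavicon)
        std::cout << "WARNING: no favicon entry found" << std::endl;

    return 0;
}
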
Comment 2 Kiran Mathew Koshy 2013-05-25 15:14:08 UTC
I have implemented a primitive version of the above tool...

https://github.com/kiranmathewkoshy/zimcheck/

It implements the following checks:
1- Internal checksum
2- Verify that there are no online dependencies
3- Check for all metadata entries
4- Verify favicon.png
5- Main page header
6- Duplicate content


Although the search for duplicate content was initially slow on large files, I have managed to speed it up to run in less than 2 minutes on the 2.6 GB Wikipedia ZIM file.

However, checking internal URLs is still slow, and since it is a CPU-intensive process, I have decided to divide the work across a few threads.

Also note that the regex library used is part of C++11, and I'm not sure whether the rest of zimlib is compatible with C++11.
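
As a rough sketch of the threading plan mentioned above (not the actual zimcheck design), the article indices can simply be striped over a few std::thread workers. checkArticle() below is a hypothetical stand-in for the per-article URL check and has to be safe to call from several threads at once.

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-article internal URL check.
void checkArticle(std::size_t /*articleIndex*/)
{
    // ... open the article, extract its links, verify each target exists ...
}

// Split the article indices across a few worker threads (simple striping).
void checkAllArticles(std::size_t articleCount, unsigned workerCount = 4)
{
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w)
        workers.emplace_back([=]() {
            for (std::size_t i = w; i < articleCount; i += workerCount)
                checkArticle(i);
        });
    for (std::thread& t : workers)
        t.join();
}
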
Comment 3 Kelson [Emmanuel Engelhart] 2013-05-27 11:55:59 UTC
My feedback; sorry that I only speak about things which do not work ;)

* It would be good to have help about the available options and their purpose printed when calling "zimcheck", "zimcheck --help" or "zimcheck -h"

* When running it against ICD-10, there seems to be a problem with the favicon... but the favicon is OK AFAIK

* It reports an "unknown mime type code 65535" (with the same file)... it is unclear to me what this means.

* In the code, add the license at the top of the file (GPLv2 for the openZIM project)

* Regarding the redundancy check: computing a hash for all the contents in every case seems a little bit slow to me. I propose a way to get it done faster (see the sketch after this comment):
1 - go through all articles and save their sizes
2 - for all articles with the same size, make a hash comparison

* I think the check of internal URLs will always return false, for the simple reason that external "href" links (but not external dependencies) are allowed in the pages... and that is what you check.
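
A sketch of the two-pass redundancy check proposed in the comment above (not the actual zimcheck implementation): bucket articles by size first, then hash only the articles whose size occurs more than once. The callbacks for obtaining an article's size and content hash are left to the caller, since the corresponding zimlib calls are not shown here.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <vector>

void reportDuplicates(std::size_t articleCount,
                      const std::function<std::size_t(std::size_t)>& articleSize,
                      const std::function<std::uint64_t(std::size_t)>& articleHash)
{
    // Pass 1: bucket article indices by content size (cheap).
    std::map<std::size_t, std::vector<std::size_t> > bySize;
    for (std::size_t i = 0; i < articleCount; ++i)
        bySize[articleSize(i)].push_back(i);

    // Pass 2: hash only the articles whose size occurs more than once.
    for (const auto& bucket : bySize)
    {
        if (bucket.second.size() < 2)
            continue;
        std::map<std::uint64_t, std::vector<std::size_t> > byHash;
        for (std::size_t i : bucket.second)
            byHash[articleHash(i)].push_back(i);
        for (const auto& group : byHash)
            if (group.second.size() > 1)
                std::cout << "WARNING: " << group.second.size()
                          << " articles appear to share identical content\n";
    }
}
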
Comment 4 Kiran Mathew Koshy 2013-05-28 21:52:58 UTC
A more polished version of the program, which fixes a few of the bugs mentioned above, has been implemented.

https://github.com/kiranmathewkoshy/zimcheck/

The internal URL check and the redundancy check have not been modified. They will be updated in the next version.
Comment 5 Kelson [Emmanuel Engelhart] 2013-05-29 19:17:01 UTC
* Usage is more or less OK (a few visual things to fix; have a look at the help of tools like "grep" or "perl"). Option handling should use pre-existing code instead of reinventing the wheel; have a look at "getopt"

* "./zimcheck ICD10-fr.zim" prints the usage(); it should run the checks.

* Code style should be similar to the rest of zimlib, and clean.

My general advice is: take your time and try to code as perfectly as you can. Don't ask for a review if you yourself can see a better way to do it or something you could still improve. What matters is quality, not quantity. Don't try to do everything/all features; focus on a few features, but try to implement them in the most intelligent and beautiful manner. And most important: test your own code as much as possible.
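
As an illustration of the getopt suggestion above (this is not the actual zimcheck interface, only a sketch with an assumed option set), option parsing with getopt_long could look roughly like this: print the help for -h/--help, print the usage only when no file is given, and otherwise run the checks on the given file.

#include <getopt.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    static const struct option longOptions[] = {
        { "help", no_argument, nullptr, 'h' },
        { nullptr, 0, nullptr, 0 }
    };

    int c;
    while ((c = getopt_long(argc, argv, "h", longOptions, nullptr)) != -1)
    {
        switch (c)
        {
        case 'h':
            std::printf("usage: zimcheck [options] file.zim\n"
                        "  -h, --help   print this help and exit\n");
            return 0;
        default:   // unknown option: getopt_long has already printed a message
            return 1;
        }
    }

    if (optind >= argc)   // no ZIM file given: fall back to printing the usage
    {
        std::fprintf(stderr, "usage: zimcheck [options] file.zim\n");
        return 1;
    }

    std::printf("running checks on %s ...\n", argv[optind]);
    // ... the actual checks would be invoked here ...
    return 0;
}
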
Comment 6 Kiran Mathew Koshy 2013-06-08 18:04:16 UTC
Everything except the MIME checks has been implemented; the MIME checks can be implemented after functions to return the MIME types are implemented in zimlib.

https://github.com/kiranmathewkoshy/zimcheck
