Last modified: 2011-03-13 18:06:32 UTC
This is my first time using bugzilla, so if I did something wrong, let me know. I wrote a small bit of code that I think will make Special:Shortpages more useful. It checks the text of the pages and excludes those which are redirects, soft redirects, links to the wiktionary definition (basically soft redirects), or the template for copyvios. I've tested the code on my sandbox wiki, so I'm fairly confidant it's stable. The two problems with it that I know of are that the count at the top is wrong ("Showing below up to 8 results starting with #1." lists the number before filtering, so will actually be lower) and the related problem that instead of having 1000 results, it has 1000 - results filtered. I'm not entirely sure how to fix these yet, but I'll get there.
Created attachment 1439 [details] Diff for patch I think this is the correct format.
Performance murder.
Excluding based on redirect status, if done, should be done based on the page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is not the only possible form so it would not be correct. If done, this should probably be done in the query rather than a filter afterwards. This may require changes to the indexes on the table to be done efficiently. Excluding based on size, if done, should be done based on a cap to the page_len field in the query. The filters for specific templates look a bit dodgy; also they're very specific to one web site and would be incorrect for everybody else in the world. If done, it may or may not be cheaper and more accurate to check the templatelinks table rather than loading page text out of indirected, compressed external storage.
(In reply to comment #3) > Excluding based on redirect status, if done, should be done based on the > page_is_redirect field. Pulling from page text is expensive, and "#REDIRECT" is > not the only possible form so it would not be correct. > Yeah, that makes sense, I hadn't thought of that. > If done, this should probably be done in the query rather than a filter > afterwards. This may require changes to the indexes on the table to be done > efficiently. Hm, I'll see if I can find out how to do that in SQL, I don't know that well (or at all). > Excluding based on size, if done, should be done based on a cap to the > page_len field in the query. Okay, I'll put that, once I look up how to do that in MySQL. > The filters for specific templates look a bit dodgy; also they're very specific to > one web site and would be incorrect for everybody else in the world. If done, it > may or may not be cheaper and more accurate to check the templatelinks > table rather than loading page text out of indirected, compressed external > storage. Yeah, that's true, I have been thinking of wikipedia. I've been thinking about it, and I think it would probably be enough to exclude any page that includes a template, that would be much more general, and I think that would help find the right one. > Performance murder. Granted, I have no clue how long this would take to run, but since they are cached for a long time, it would only matter once. I think it would be able to run during an off-time, I don't think it'd take *too* long. It does take time to grab the text, but all of these pages will be less than 1k, and the majority will be around 15 bytes, so I don't think that would be overly time consuming, at least if only run once in a while.
"It does take time to grab the text, but all of these pages will be less than 1k, and the majority will be around 15 bytes" Er, well, the SQL being run on Special:Shortpages is what's used to cache up the content on larger installations like the Wikimedia cluster. So the cron job to cache the things would hit all the pages, thus you're still looking at an unpleasant load. Clean implementation in MySQL is probably not too much of a problem, however and wouldn't be a bad thing; we can justify adding a short time to the execution by the improved "algorithm" we'd be using and the more accurate result set.
Created attachment 1444 [details] New patch Okay, I realized that this script already excludes redirects, so I removed that from my patch. Added limit to size in the SQL, and generalized script so all pages with "{{" are excluded. Number problem as above is still present.
Anyone have any comments on my new version of the patch?
(In reply to comment #7) > Anyone have any comments on my new version of the patch? Would still be too expensive to look at the text of each like that. Consider doing something with the templatelinks table to test for transclusions in the page; this means your query is more efficient and doesn't pull redundant data off the MySQL server in the first place.
Closing due to a number of reasons. 1. Inefficient, performance-killing patch. 2. There is an effective limit on page size provided by the limit on results when the page is being cached. 3. That a page contains a template doesn't mean it's a "long" page.