Last modified: 2014-02-25 14:07:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T27931, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 25931 - Implement efficient way to select random page from specified category on Wikimedia wikis
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal enhancement with 2 votes
Target Milestone: ---
Assigned To: Bawolff (Brian Wolff)
Duplicates: 29373
Depends on:
Blocks: 31254
Reported: 2010-11-15 02:15 UTC by jlatta6
Modified: 2014-02-25 14:07 UTC
Description jlatta6 2010-11-15 02:15:38 UTC
Hi,
I was hoping there would be a way to bring "Categories" into the Random Page function. Often I would like to stumble around Wikipedia using the Random Page link while focusing on a certain subject, such as Computer Science or Technology. Using the Random Page function of Wikipedia is fun, but being able to focus randomly on a category would help me learn about information I never knew to search for in the first place.
Comment 1 Bawolff (Brian Wolff) 2010-11-15 05:13:22 UTC
That's Bug 2170. I'm not sure whether this bug should be duped to that, as your request is more about making it work on Wikipedia, and the result of Bug 2170 was an extension that is not currently enabled on Wikimedia (and is not likely to be in the future).
Comment 2 Antoine "hashar" Musso (WMF) 2011-03-13 14:59:06 UTC
We have to review and then enable the extension 
http://www.mediawiki.org/wiki/Extension:RandomInCategory

Rephrased title and changed component
Comment 3 Bawolff (Brian Wolff) 2011-03-14 02:52:00 UTC
(In reply to comment #2)
> We have to review and then enable the extension 
> http://www.mediawiki.org/wiki/Extension:RandomInCategory
> 
> Rephrased title and changed component

The comments on bug 2170 seem to indicate the extension is not efficient enough, at least for enwikipedia.
Comment 4 Sam Reed (reedy) 2011-03-22 01:40:24 UTC
mysql> describe select page_title, page_namespace FROM page JOIN categorylinks ON (page_id=cl_from) AND cl_to="Test" AND page_random >= 0.2265\G
+----+-------------+---------------+--------+---------------------------------+--------------+---------+------------------------------+------+--------------------------+
| id | select_type | table         | type   | possible_keys                   | key          | key_len | ref                          | rows | Extra                    |
+----+-------------+---------------+--------+---------------------------------+--------------+---------+------------------------------+------+--------------------------+
|  1 | SIMPLE      | categorylinks | ref    | cl_from,cl_timestamp,cl_sortkey | cl_timestamp | 257     | const                        |    3 | Using where; Using index |
|  1 | SIMPLE      | page          | eq_ref | PRIMARY,page_random             | PRIMARY      | 4       | enwiki.categorylinks.cl_from |    1 | Using where              |
+----+-------------+---------------+--------+---------------------------------+--------------+---------+------------------------------+------+--------------------------+
2 rows in set (0.00 sec)



That's against enwiki. It doesn't seem '''that''' bad...
Comment 5 Roan Kattouw 2011-03-23 15:19:04 UTC
The worst case scenario I could think of was:
mysql> select page_title, page_namespace FROM page JOIN categorylinks ON (page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 limit 1;
1 row in set (0.15 sec)

That's not very fast, but it's much faster than I feared it would be.
Comment 6 Tim Starling 2011-03-24 03:48:12 UTC
(In reply to comment #5)
> The worst case scenario I could think of was:
> mysql> select page_title, page_namespace FROM page JOIN categorylinks ON
> (page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 limit 1;
> 1 row in set (0.15 sec)
> 
> That's not very fast, but it's much faster than I feared it would be.

The worst case is pretty realistic here, since there's not much point in picking a random article from a small category. 

I'd like to hear what Domas thinks about it. What would happen if it was linked to from every category page?
Comment 7 Roan Kattouw 2011-03-27 13:04:52 UTC
(In reply to comment #6)
> The worst case is pretty realistic here, since there's not much point in
> picking a random article from a small category. 
> 
Note that my worst case also includes a very high page_random value. Even setting it to >= 0.99 makes the query run in 0.00 seconds. With 0.995 it took 0.10 seconds. So roughly speaking, this query may take up to 100-150 ms, but only if run on a large category and only in 1% of those cases. In other cases it seems to run in <10 ms.
Comment 8 Domas Mituzas 2011-04-07 02:59:50 UTC
this random isn't really random
Comment 9 Domas Mituzas 2011-04-07 03:01:51 UTC
I mean this random isn't anywhere close to any idea of 'random'
Comment 10 Tim Starling 2011-04-07 03:11:35 UTC
He means that because there is no "ORDER BY page_random", the query just fetches the first page from the category that satisfies the page_random condition. So for page_random >= 0.99, it will scan 100 pages on average, even if the category is 100k pages. Comment #4 shows that it uses the cl_timestamp index, so it's very likely to return a page from the first 10 or so pages that were added to the category.
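The bias described here can be illustrated with a small simulation. This is only a sketch under stated assumptions: the list position stands in for the cl_timestamp index order, the list values stand in for page_random, and the function names are made up for illustration. It does not touch a real database.

```python
import random
import statistics

def first_match(page_randoms, threshold):
    """No ORDER BY: walk rows in index order and stop at the first match."""
    for position, value in enumerate(page_randoms):
        if value >= threshold:
            return position
    return None

def ordered_match(page_randoms, threshold):
    """ORDER BY page_random LIMIT 1: the smallest value at or above the threshold."""
    candidates = [
        (value, position)
        for position, value in enumerate(page_randoms)
        if value >= threshold
    ]
    return min(candidates)[1] if candidates else None

def simulate(pick, page_randoms, trials, rng):
    """Run `pick` with a fresh random threshold each time; keep the positions found."""
    positions = (pick(page_randoms, rng.random()) for _ in range(trials))
    return [p for p in positions if p is not None]

rng = random.Random(42)
# Hypothetical category: 2,000 pages listed in the order they were added.
category = [rng.random() for _ in range(2_000)]

unordered = simulate(first_match, category, 1_000, rng)
ordered = simulate(ordered_match, category, 1_000, rng)
print("median position without ORDER BY:", statistics.median(unordered))
print("median position with ORDER BY:   ", statistics.median(ordered))
```

Without the ORDER BY, the chosen positions cluster at the very start of the category; with it, they spread over the whole list, at the cost of having to look at every matching row.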
Comment 11 Domas Mituzas 2011-04-07 03:14:10 UTC
Yes, that's what I mean, thanks Tim!
Comment 12 MZMcBride 2011-04-07 03:32:32 UTC
Hmm. Using the Toolserver's enwiki_p:

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Jan_van_Deinsen |              0 |
+-----------------+----------------+
1 row in set (0.34 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Jan_van_Deinsen |              0 |
+-----------------+----------------+
1 row in set (0.00 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.999 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Jan_van_Deinsen |              0 |
+-----------------+----------------+
1 row in set (0.00 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.989 ORDER BY page_random ASC limit 1;
+---------------+----------------+
| page_title    | page_namespace |
+---------------+----------------+
| Pavol_Baláž   |              0 |
+---------------+----------------+
1 row in set (0.19 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.19 ORDER BY page_random ASC limit 1;
+---------------+----------------+
| page_title    | page_namespace |
+---------------+----------------+
| Anthony_Tupou |              0 |
+---------------+----------------+
1 row in set (29.98 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.28 ORDER BY page_random ASC limit 1;
+--------------+----------------+
| page_title   | page_namespace |
+--------------+----------------+
| Chris_Lehane |              0 |
+--------------+----------------+
1 row in set (3.80 sec)

mysql> select page_title, page_namespace FROM page JOIN categorylinks ON(page_id=cl_from) AND cl_to='Living_people' and page_random>=0.432901 ORDER BY page_random ASC limit 1;
+-----------------+----------------+
| page_title      | page_namespace |
+-----------------+----------------+
| Civard_Sprockel |              0 |
+-----------------+----------------+
1 row in set (2.96 sec)

This seems unacceptably slow. I think it'd be fairly trivial to disable this for categories with greater than X members until a better solution is implemented, however.
Comment 13 Domas Mituzas 2011-04-07 03:51:32 UTC
"Better solution" needs more disk space, memory and IOPS, even if we have a decent index for that. Is it worth it for such a feature, or is this just another "would be nice" eyecandy?
Comment 14 Sam Reed (reedy) 2011-04-07 12:55:30 UTC
(In reply to comment #13)
> "better solution" needs more disk space, memory and IOPS, even if we have a
> decent index for that - is it worth for such feature, or is this just another
> "would be nice" eyecandy?

I think it seems to be the latter - "would be nice" eyecandy

Unless the initial requester can tell us why otherwise
Comment 15 Brion Vibber 2011-06-13 16:53:48 UTC
*** Bug 29373 has been marked as a duplicate of this bug. ***
Comment 16 Steve Sperandeo 2011-09-29 15:25:12 UTC
While I'm not the original requester, I did post a duplicate bug (Bug 29373).

I can assure you such a feature wouldn't be just eyecandy or a "would be nice" feature. It would be a serious learning tool.

Most people use Wikipedia as a reference tool, linked from Google's results. However, there are cases when you don't want to use it as a book, but as a study guide. For example, when someone is new to a field, they'll want to immerse themselves in the subject and learn as much vocabulary about the subject as possible. Even people who have been out of the subject for a while would benefit from brushing up from time to time.

I personally use the Random Article feature every day. I have a link on my bookmarks toolbar in Chrome that I click to learn about random things and broaden my knowledge. However, most of the articles that I land on are biographies of athletes, which are totally useless to me.

I've been a computer scientist for about 8 years now. And I can say with certainty that if I had a "Random Computer Science Article" button on my toolbar, I'd use it every day too. Just look at how huge the subject is: http://en.wikipedia.org/wiki/Computer_science

I'm sure other people would like to immerse themselves in a subject, like Biology, Finance or some other large field. 

I think that was the reasoning for this request.

Hope that helps!

PS. Thanks for working on Wikimedia and Wikipedia. It's really appreciated by so many people. Cheers!
Comment 17 Helder 2011-09-29 18:44:56 UTC
If we had a way to restrict the random pages to a specific category, the feature could be used to get a random chapter of a book from Wikibooks, such as a random recipe from [[b:Cookbook]], or a random animal from [[Wikijunior:Animal Alphabet]] or a random sonnet from [[s:Category:Sonnets]].

I marked this as a blocker for bug 25931.
Comment 18 jlatta6 2011-11-24 04:19:09 UTC
Steve et al.,

I am now using StumbleUpon to do this. It works pretty well, except that you can't just search a category off the top of your head; StumbleUpon will "stumble" through your pre-defined interests on Wikipedia. Aside from that it works OK. Below is the link on how to do it.

http://getsatisfaction.com/stumbleupon/topics/why_cant_i_specifically_stumble_wikipedia_with_my_chrome_plugin


Comment 19 Sumana Harihareswara 2012-04-04 14:28:53 UTC
Victor, could you put this extension into  https://www.mediawiki.org/wiki/Git/Conversion/Extensions_queue so it can be moved to Git, which is a prerequisite for deployment on Wikimedia Foundation sites?  Thanks.
Comment 20 Sumana Harihareswara 2012-04-25 01:55:56 UTC
Asher Feldman has agreed to do a database administration review of this extension.
Comment 21 Sumana Harihareswara 2012-04-25 18:19:19 UTC
Some discussion in IRC just now (#mediawiki):

<sumanah> "Waiting for database administration review by Asher Feldman, waiting for author Victor Vasiliev to move extension to Git."  bug https://bugzilla.wikimedia.org/show_bug.cgi?id=25931  in https://www.mediawiki.org/wiki/Review_queue#Extensions
<vvv> Well

<vvv> I think it would fail the first stage
<vvv> Because it uses an unindexed query IIRC

<RoanKattouw> It's also not really indexable
<RoanKattouw> Unless you add a cl_random field to categorylinks
<vvv> Yes, this is the problem we found back in 2007
<RoanKattouw> As written it'd have to fetch all categorylinks rows for a given category, join them against page, then do a range scan on page_random

<RoanKattouw> The range scan may or may not be indexed, I couldn't say offhand, but joining an entire category against the page table is a problem when you consider stuff like [[Category:Living people]]
<vvv> Well, it was not even me who split it as an extension
<RoanKattouw> (I'm assuming the query is something like SELECT stuff FROM categorylinks, page WHERE cl_to='Category_name' AND page_id=cl_from AND page_random > 0.123 ORDER BY page_random LIMIT 1; )
<vvv> I believe someone made it after the original version was reverted
<vvv> And cl_random would sound something you would want to have in core
<RoanKattouw> That would probably want to be in core yeah
<RoanKattouw> But adding cl_random is an expensive operation

<vvv> RoanKattouw: I remember nobody wanted to do it because switching masters were done manually back then
<RoanKattouw> Still, switching masters is not what people do for fun
<RoanKattouw> To me it just seems like a lot of effort for a minor feature

<RoanKattouw> OTOH on smaller wikis it might work, but when Sumana and I talked to Asher about this last night he said he didn't want to assume that small wikis will stay small forever
<RoanKattouw> It would be particularly ironic if we enabled it on some Indic language wiki because it's small, while 3 floors above me there's people whose job it is to try and get that wiki to grow

<binasher> vvv: RoanKattouw: if the query is like what roan mentioned above, -1.  this sort of thing should be done with a search engine and is probably even doable with the one we have
Comment 22 MZMcBride 2012-04-26 03:57:54 UTC
(In reply to comment #21)
> <binasher> vvv: RoanKattouw: if the query is like what roan mentioned above,
> -1.  this sort of thing should be done with a search engine and is probably
> even doable with the one we have

What do you mean? Which search engines return random pages?
Comment 23 Tim Starling 2012-04-26 04:19:43 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > <binasher> vvv: RoanKattouw: if the query is like what roan mentioned above,
> > -1.  this sort of thing should be done with a search engine and is probably
> > even doable with the one we have
> 
> What do you mean? Which search engines return random pages?

Search engines have pregenerated document lists stored in an efficient format for various criteria. Usually the presence or absence of a given keyword is the criterion of interest, but membership in a category can be handled in the same way. Since the list is pregenerated, the length is known, so you can choose a random offset into the category and perhaps even skip to that offset efficiently. Asher probably means that if Lucene doesn't have such a feature already, it could be patched in.
Comment 24 MZMcBride 2012-04-26 04:41:53 UTC
For those wondering, this bug is not a duplicate of bug 2170. Bug 2170 is about having the feature generally available in MediaWiki (which was implemented as the "RandomInCategory" extension). This bug is about having the feature available on exceptional MediaWiki installations, namely those that run Wikimedia wikis.
Comment 25 Asher Feldman 2012-04-27 00:05:47 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > (In reply to comment #21)
> > > <binasher> vvv: RoanKattouw: if the query is like what roan mentioned above,
> > > -1.  this sort of thing should be done with a search engine and is probably
> > > even doable with the one we have
> > 
> > What do you mean? Which search engines return random pages?
> 
> Search engines have pregenerated document lists stored in an efficient format
> for various criteria. Usually the presence or absence of a given keyword is the
> criterion of interest, but membership in a category can be handled in the same
> way. Since the list is pregenerated, the length is known, so you can choose a
> random offset into the category and perhaps even skip to that offset
> efficiently. Asher probably means that if Lucene doesn't have such a feature
> already, it could be patched in.

Indeed, the method Tim outlines would let you grab a random result from any search engine that supports pagination.  

You can also get randomized output directly from a search engine given control over sorting, which would normally be in descending order on an IR score. Solr has a random result module and it's implementable in Lucene, including version 2 which we run in production.

See the section "Bonus! For those of you trapped in Lucene 2" at the bottom of:
http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order
Comment 26 Asher Feldman 2012-04-27 01:25:46 UTC
How to implement a category based random feature in wikipedia without touching the database:

1) Send lucene an incategory query with a limit of 1 to cheaply get the total number of articles indexed in the given category. Stuff this in memcache with a reasonable ttl (couple hours?) and try to grab it from there next time so lucene is only called once.

2) Send the same category query with an offset of rand(0, $doc_count - 1)

3) Redirect to the article returned in step 2.

Command line example to get a random "Domesticated animals" article.

Step one - the very first item in the response is the match count (36 in this case): 

asher@bast1001:~/srchtest$ curl 'http://search1001:8123/search/enwiki/incategory:%22Domesticated%20animals%22?limit=1'
36
#info search=[search1001,search1001], highlight=[search1004] in 4 ms
#no suggestion
#interwiki 0 0
#results 1
12.586081 0 Genomics_of_domestication
#h.text [] [] [+] date+November+2011
#h.text [] [] [] Genomics+is+the+study+of+the+structure%2C+content%2C+and+evolution++of+genomes+%2C+or+the+entire+genetic+information+of+
#h.date 2012-04-04T12:25:13Z
#h.wordcount 2252
#h.size 15955

Step two - pick a random number between 0 and 35. Let's go with 16.

asher@bast1001:~/srchtest$ curl 'http://search1001:8123/search/enwiki/incategory:%22Domesticated%20animals%22?offset=16&limit=1'
36
#info search=[search1001,search1001], highlight=[search1005] in 3 ms
#no suggestion
#interwiki 0 0
#results 1
6.2930403 0 Fancy_pigeon
#h.text [] [] [+] Fancy+pigeons+are+domesticated++varieties+of+the+Rock+Pigeon++%28Columba+livia%29.+
#h.text [] [] [] They+are+bred+by+pigeon+fanciers++for+various+traits+
#h.date 2012-02-21T12:23:54Z
#h.wordcount 903
#h.size 7354

Hi, Fancy Pigeon!
Comment 27 Asher Feldman 2012-04-27 06:47:47 UTC
Our current build of lsearchd won't go deeper than an offset of 100000 (SearchEngine.java:protected static int maxoffset = 100000;) so for categories like Living People, we wouldn't be able to provide random results over the full set, just the first 100k as they appear in the index, which appears to be ordered on create time.

Actually getting the 100kth result (upper latency bound) takes ~280ms

asher@bast1001:~/srchtest$ curl 'http://search1001:8123/search/enwiki/incategory:%22Living%20people%22?limit=1&offset=99999&searchall=0'
567274
#info search=[search1001,search1001], highlight=[search1005] in 283 ms
#no suggestion
#interwiki 0 0
#results 1
1.4743276 0 Boris_Boillon

If you ditch the join and take the same approach with mysql, it's several times faster than lucene:

mysql> select cl_from from categorylinks where cl_to='Living_people'  limit 1 offset 99999;
+----------+
| cl_from  |
+----------+
| 13546433 |
+----------+
1 row in set (0.06 sec)

The worst case for Living_people isn't great (~350ms), but still faster than lucene would be if we upped lsearchd's max offset:

mysql> select cl_from from categorylinks where cl_to='Living_people'  limit 1 offset 560000;
+----------+
| cl_from  |
+----------+
| 27345638 |
+----------+
1 row in set (0.35 sec)
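For reference, the join-free offset approach can be sketched against a toy table. This uses Python's sqlite3 rather than MySQL, the table has only the two columns needed here, and the ORDER BY is an addition to make the toy deterministic; none of this is the production schema or query.

```python
import random
import sqlite3

def random_member_by_offset(conn, category, rng=random):
    """Pick a uniformly random member of `category` using COUNT(*) plus LIMIT/OFFSET."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM categorylinks WHERE cl_to = ?", (category,)
    ).fetchone()
    if count == 0:
        return None
    offset = rng.randrange(count)  # 0 .. count - 1
    (page_id,) = conn.execute(
        "SELECT cl_from FROM categorylinks WHERE cl_to = ? "
        "ORDER BY cl_from LIMIT 1 OFFSET ?",
        (category, offset),
    ).fetchone()
    return page_id

# Toy data (hypothetical): 100 page IDs in one category.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.execute("CREATE INDEX cl_to_from ON categorylinks (cl_to, cl_from)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [(page_id, "Living_people") for page_id in range(1, 101)],
)
print(random_member_by_offset(conn, "Living_people"))
```

The result is uniform over the category, but the database still has to walk past `offset` index entries on every call, which is why the timings above grow from 0.06 sec at offset 99999 to 0.35 sec at offset 560000.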
Comment 28 Domas Mituzas 2012-04-27 14:13:42 UTC
We need more features that scan full datasets to return single row.
Comment 29 Asher Feldman 2012-04-27 16:43:17 UTC
Domas is of course right. Adding a precomputed cl_random column+index is needed to make this feature acceptable via mysql. Doing so incurs a permanent cost. Alternatively, we could add the existing page_random field to the lucene index and make it searchable to eliminate offset scanning there. The latter may be cheaper.
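A minimal sketch of what the cl_random column+index could look like, using sqlite3 as a stand-in for MySQL. The index name, the helper names, and the wrap-around fallback are assumptions made for illustration, not an agreed schema.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_random REAL)"
)
# The composite index lets the lookup below seek straight to the first
# qualifying row instead of scanning the whole category.
conn.execute("CREATE INDEX cl_random_idx ON categorylinks (cl_to, cl_random)")

def add_member(conn, page_id, category, rng=random):
    """Store a precomputed random value with each category membership row."""
    conn.execute(
        "INSERT INTO categorylinks VALUES (?, ?, ?)",
        (page_id, category, rng.random()),
    )

def random_member(conn, category, rng=random):
    """Return the member with the smallest cl_random at or above a random point."""
    row = conn.execute(
        "SELECT cl_from FROM categorylinks "
        "WHERE cl_to = ? AND cl_random >= ? ORDER BY cl_random LIMIT 1",
        (category, rng.random()),
    ).fetchone()
    if row is None:
        # Assumption: wrap around to the lowest cl_random when nothing is above.
        row = conn.execute(
            "SELECT cl_from FROM categorylinks "
            "WHERE cl_to = ? ORDER BY cl_random LIMIT 1",
            (category,),
        ).fetchone()
    return row[0] if row else None

for page_id in range(1, 21):
    add_member(conn, page_id, "Living_people")
print(random_member(conn, "Living_people"))
```

The permanent cost mentioned above shows up here as the extra column on every categorylinks row plus the additional index that has to be stored and maintained.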
Comment 30 Bawolff (Brian Wolff) 2012-04-30 17:42:30 UTC
>1) Send lucene an incategory query with a limit of 1 to cheaply get the total
>number of articles indexed in the given category. Stuff this in memcache with a
>reasonable ttl (couple hours?) and try to grab there next time so lucene is
>only called once.

I was under the impression that Lucene's incategory only worked for categories directly listed on a page (aka not inherited from a template). That would be a major negative point for using a Lucene-based solution (unless that issue could be fixed)
Comment 31 MZMcBride 2012-04-30 23:18:34 UTC
(In reply to comment #30)
>>1) Send lucene an incategory query with a limit of 1 to cheaply get the total
>>number of articles indexed in the given category. Stuff this in memcache with a
>>reasonable ttl (couple hours?) and try to grab there next time so lucene is
>>only called once.
> 
> I was under the impression that Lucene's incategory only worked for categories
> directly listed on a page (aka not inherited from a template). That would be a
> major negative point for using a Lucene-based solution (unless that issue
> could be fixed)

True, but tangential. The relevant bug is bug 18861.
Comment 32 Asher Feldman 2012-05-03 20:58:17 UTC
(In reply to comment #30)
> I was under the impression that Lucene's incategory only worked for categories
> directly listed on a page (aka not inherited from a template). That would be a
> major negative point for using a Lucene-based solution (unless that issue
> could be fixed)

There's no reason why the indexer couldn't pull in categorylinks instead of whatever it's doing now (parsing wikitext?), but we are currently short on resources when it comes to developing around Lucene. An upgraded search infrastructure with real-time indexing and greater accessibility around index definitions could open the door to all sorts of features that aren't currently practical in MediaWiki at Wikipedia scale.
Comment 33 Sumana Harihareswara 2012-05-15 01:55:07 UTC
Asher says that the easiest path to implementing this in a way that performs suitably is the precomputed cl_random column+index solution mentioned in comment 29 -- though it has a real cost in terms of hardware utilization.  Assigning to Victor to see whether he would like to follow up on this.
Comment 34 Yuvi Panda 2013-04-13 06:16:48 UTC
Update: The e3 team implemented a limited version of this in https://gerrit.wikimedia.org/r/#/c/52468/ and https://gerrit.wikimedia.org/r/#/c/51881/ - it only makes it available for a pre-configured small set of categories. 

Most of the code required for this is written; it just needs extracting out to its own Extension (RedisRandomCategory?), and perhaps an API Module. And there shouldn't be *too* many performance issues in getting this on the cluster, considering that it is already deployed (albeit in a limited way) by E3.
Comment 35 Bawolff (Brian Wolff) 2013-04-16 14:23:18 UTC
(In reply to comment #34)
> Update: The e3 team implemented a limited version of this in
> https://gerrit.wikimedia.org/r/#/c/52468/ and
> https://gerrit.wikimedia.org/r/#/c/51881/ - it only makes it available for a
> pre-configured small set of categories. 
> 
> Most of the code required for this is written, it just needs extracting out to
> its own Extension (RedisRandomCategory?), and perhaps an API Module. And there
> shouldn't be *too* much performance issues in getting this on cluster,
> considering that it is already deployed (albeit in a limited way) by E3.

I was reading up on Redis, and it sounds really cool. However, what I've gathered from my brief look is that it stores all data in memory (?). I can't imagine that would scale to all cats on enwiki (let alone all cats everywhere).
Comment 36 Yuvi Panda 2013-04-16 17:58:44 UTC
(In reply to comment #35)
> I was reading up on redis, and it sounds really cool. However what ive gathered
> from my brief look is that it stores all data in memory (?) I can't imagine
> that would scale to all cats on enwiki (let alone all cats everywhere)

Ah, you're right! Though with appropriate swapping, I suppose you could use it indefinitely (as it swaps out unused pages). But yes, Redis doesn't look like it meets our exact requirements as is.
Comment 37 Gerrit Notification Bot 2013-07-04 16:24:31 UTC
Change 71997 had a related patch set uploaded by Brian Wolff:
Add Special:RandomInCategory.

https://gerrit.wikimedia.org/r/71997
Comment 38 Bawolff (Brian Wolff) 2013-07-04 16:31:51 UTC
(In reply to comment #37)
> Change 71997 had a related patch set uploaded by Brian Wolff:
> Add Special:RandomInCategory.
> 
> https://gerrit.wikimedia.org/r/71997

I had an idea for an efficient method that doesn't need a schema change. It does, however, give quite biased results in some cases. (You can have two of cheap [in the amount of ops work needed for a schema change], fast, and good. This one is cheap and fast.)

I think this is good enough for the common use case of people just wanting an entry from a category that is different from last time they hit the random button. (For example to get a random thing out of articles for cleanup or whatever). To do something better would need a schema change, or some other more exotic solution. I think this method could be "good" enough for now.
----

Algorithm is:
*Get earliest and newest cl_timestamp in a category
*Pick a date in between
*Pick an offset between 0 and 30
*Get the page that is offset number of pages after the date picked.

Thoughts?
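The steps above could be sketched like this. It is a minimal illustration, not the code in change 71997: timestamps are assumed to be integers, the category is an in-memory list sorted by cl_timestamp, and the fallback when the offset runs past the end of the category is an assumption.

```python
import bisect
import random

MAX_OFFSET = 30  # "Pick an offset between 0 and 30"

def random_in_category(entries, rng=random):
    """entries: list of (cl_timestamp, page) tuples sorted by cl_timestamp."""
    if not entries:
        return None
    timestamps = [ts for ts, _ in entries]
    # Earliest and newest cl_timestamp, then a date in between (inclusive).
    picked = rng.randint(timestamps[0], timestamps[-1])
    # First entry whose timestamp is at or after the picked date.
    start = bisect.bisect_left(timestamps, picked)
    index = start + rng.randint(0, MAX_OFFSET)
    if index >= len(entries):
        # Assumption: if the offset overshoots, fall back to offset 0.
        index = start
    return entries[index][1]

# Hypothetical category with made-up timestamps and page names.
category = [(1000, "Page_A"), (2000, "Page_B"), (2500, "Page_C"), (9000, "Page_D")]
print(random_in_category(category))
```

Both lookups map onto the cl_timestamp index that comment #4 shows being used, which is why no schema change is needed.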
Comment 39 Bawolff (Brian Wolff) 2013-07-04 19:00:25 UTC
A possible tweak could also be to randomly change whether we use cl_timestamp > random_timestamp or cl_timestamp < random_timestamp (along with asc vs desc), which might even things out if one had a category with mostly old entries from very long ago and a few outlier new entries from very recently.
Comment 40 Nemo 2013-07-05 15:55:08 UTC
(In reply to comment #38)
> Algorithm is:
> *Get earliest and newest cl_timestamp in a category
> *Pick a date in between
> *Pick an offset between 0 and 30
> *Get the page that is offset number of pages after the date picked.
> 
> Thoughts?

So, the downside of this is that bulks of pages added to the category at similar times would be constantly underrepresented, if I understand correctly. Those might be category renames, bot additions, new templates including the category... It may make it very hard to clear such big backlogs, while giving a better representation of more "human" (slow) additions to the category.
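A toy simulation of that concern, under invented numbers: 100 "slow" pages added one every 100 time units, plus 1,000 pages added in bulk at a single timestamp. The picker follows the algorithm quoted above, with an assumed fallback to offset 0 when the offset overshoots; it is a sketch, not the merged implementation.

```python
import bisect
import random

MAX_OFFSET = 30

def pick(timestamps, rng):
    """Return the index chosen by the date-plus-offset algorithm."""
    picked = rng.randint(timestamps[0], timestamps[-1])
    start = bisect.bisect_left(timestamps, picked)
    index = start + rng.randint(0, MAX_OFFSET)
    return index if index < len(timestamps) else start  # assumed fallback

# Hypothetical category: slow human additions plus one bulk (bot) addition.
slow = [(ts, "slow") for ts in range(0, 10_000, 100)]
bulk = [(5_000, "bulk")] * 1_000
entries = sorted(slow + bulk)
timestamps = [ts for ts, _ in entries]
kinds = [kind for _, kind in entries]

rng = random.Random(0)
picks = [pick(timestamps, rng) for _ in range(20_000)]
bulk_picks = [i for i in picks if kinds[i] == "bulk"]

bulk_share_of_category = kinds.count("bulk") / len(kinds)
bulk_share_of_picks = len(bulk_picks) / len(picks)
distinct_bulk_pages = len(set(bulk_picks))

print(f"bulk pages are {bulk_share_of_category:.0%} of the category")
print(f"bulk pages are {bulk_share_of_picks:.0%} of the picks")
print(f"distinct bulk pages ever picked: {distinct_bulk_pages} of {len(bulk)}")
```

In this setup the bulk pages make up most of the category but a minority of the picks, and only the first few dozen of them in index order are ever reachable.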
Comment 41 MZMcBride 2013-07-08 12:18:51 UTC
Personally, I'd rather see a schema change or Lucene/Solr improvements cover this.
Comment 42 Gerrit Notification Bot 2013-08-01 17:38:12 UTC
Change 71997 merged by Brion VIBBER:
Add Special:RandomInCategory.

https://gerrit.wikimedia.org/r/71997
Comment 43 Bawolff (Brian Wolff) 2013-08-01 17:39:23 UTC
(In reply to comment #41)
> Personally, I'd rather see a schema change or Lucene/Solr improvements cover
> this.

Perhaps open a separate bug for that, since this patch has been merged.
Comment 45 Nemo 2013-08-31 11:11:39 UTC
(In reply to comment #44)
> Thx for this new special page.
> 
> It doesn't work on [[Catégorie:Portail:Hélicoptères/Articles liés]]
> 
> https://fr.wikipedia.org/wiki/Sp%C3%A9cial:RandomInCategory/Portail:H%C3%A9licopt%C3%A8res/Articles_li%C3%A9s

The Portail: prefix is being eaten. Can you check if it happens with any prefix matching the name of a namespace, or just with any : prefix, and file a bug?
Comment 46 Bawolff (Brian Wolff) 2013-08-31 18:29:33 UTC
(In reply to comment #44)
> Thx for this new special page.
> 
> It doesn't work on [[Catégorie:Portail:Hélicoptères/Articles liés]]
> 
> https://fr.wikipedia.org/wiki/Sp%C3%A9cial:RandomInCategory/Portail:H%C3%A9licopt%C3%A8res/Articles_li%C3%A9s

This is fixed on master. The next time Wikimedia sites are updated (Thursday), it should be fixed there too.

Until then, include the category: prefix with the page name and it should work.
