Last modified: 2012-05-28 19:10:53 UTC
Reported by NemoBis: between 7,000 and 8,000 wiki API URLs are listed here: https://code.google.com/p/wikiteam/source/browse/trunk/batchdownload/lists/ — import them all into the stats mediawiki table.
Imported thousands of raw API URLs. So far the database expected a wiki name and a stats URL, but since we can also ask the API for the wiki name itself, I added an "import" function to the update script. It is now going through all wikis in the mediawikis list that lack a name, trying to fetch the name and statistics, then updating the db. Running now... mediawikis already has more articles than wikia and wiktionaries, and growing.
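The fetch step described above can be sketched roughly like this. The `action=query&meta=siteinfo` request is the standard MediaWiki API call for sitename and statistics; the helper function names are illustrative, not the actual update script:

```python
import json
import urllib.parse
import urllib.request


def siteinfo_url(api_url):
    """Build a siteinfo query URL for a given api.php endpoint."""
    params = urllib.parse.urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general|statistics",
        "format": "json",
    })
    return api_url + "?" + params


def parse_siteinfo(payload):
    """Extract the sitename and article count from the decoded JSON response."""
    query = payload["query"]
    return query["general"]["sitename"], query["statistics"]["articles"]


def fetch_siteinfo(api_url):
    """Fetch and parse siteinfo from a live wiki (network access required)."""
    with urllib.request.urlopen(siteinfo_url(api_url), timeout=30) as resp:
        return parse_siteinfo(json.load(resp))
```

Older MediaWiki versions return fewer `statistics` fields, so a real importer would have to treat most keys as optional.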
Awesome! I hope it doesn't matter, but there are also some wikinews.org subdomains in those lists. Some bluwiki.com subdomains may need to be moved to another list?
Yeah, expect more leftovers. The current strategy is "dump everything in there", update, and afterwards clean up the non-working ones, the duplicates, and those belonging in other tables.
4775 wikis successfully imported.
Still needs cleanup of the non-working ones; I thought it would be better to do that after renaming the existing wikis (see the other ticket), because some of the new ones had duplicate-name conflicts with existing ones (of course).
Cleanup after import: moved the anarchopedia.org URLs into their own table. This adds a few languages, and also revealed duplicates caused by them changing their language subdomains from a 3-letter to a 2-letter scheme. Added language / local language names from Wikipedia where the prefix was the same. Deleted a few duplicates, e.g. kept "es" in favor of "spa" and "www.spa". Some are still missing their language names now: hrv? bos? nno? nor? ..
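The 3-letter to 2-letter collapse could be automated with a small lookup. A minimal sketch, assuming a hand-made excerpt of ISO 639-2 / 639-1 pairs for the codes mentioned above (the real cleanup was done manually in the db):

```python
# Hand-made excerpt of ISO 639-2 -> ISO 639-1 pairs; extend as needed.
ISO3_TO_ISO1 = {"spa": "es", "hrv": "hr", "bos": "bs", "nno": "nn", "nor": "no"}


def canonical_lang(prefix):
    """Map a language subdomain to its 2-letter form, if known.

    Also strips a leading "www." so "www.spa" and "spa" collapse to "es".
    """
    prefix = prefix.removeprefix("www.")
    return ISO3_TO_ISO1.get(prefix, prefix)
```

Two entries that map to the same canonical code are duplicate candidates; unknown prefixes pass through unchanged.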
More cleanup. Changed all old API URLs in the db to end in just "api.php", dropping any API parameters that may have been added manually; those are globally defined in the config now. After that, deleted all (~300) that then had URLs duplicating newly added ones. Still fetching full siteinfo from all of them, wherever it works, with a different number of fields returned depending on the MediaWiki version.
Deleted duplicates, i.e. the same wikis reachable on different domains / URLs:
- Archiplanet (greatbuildings.com seems to be the same)
- wikidoc (they have different language subdomains, but it is just one wiki)
- sourcewatch (also disinfopedia.org)
- bluwiki: deleted the old method-0 entry and kept just one new one (yes, a LOT of subdomains, but they all share the same API and stats)... but keep watching them, as they say: "Coming soon: Automated MediaWiki deployment. Get a unique MediaWiki installation on your own subdomain."
- bgwiki com/net, various www duplicates, etc.
---
Dropped the unique index on the old "name" column to allow duplicates there. Set name to si_sitename wherever name was NULL. This will show wikis with duplicate names like "My wiki test" (from the API) that are actually different wikis; humans can later overwrite the value in "name" if desired. We also have 44 wikis just named "Wiki" — I can provide a list if you want to find "manual names" for them.
---
Dupe detection in the table: please see http://meta.wikimedia.org/wiki/User:Mutante/mw-dupes — I am generating the wikitable syntax with a mysql command, but can't paste the table here because it triggers the spam filter for containing blocked wiki URLs :p
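The name backfill and duplicate-name report can be sketched with an in-memory database. Table and column names (`mediawikis`, `name`, `si_sitename`) are taken from the comments above, but the schema here is a simplified assumption, and the demo uses SQLite rather than the actual MySQL table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE mediawikis (id INTEGER PRIMARY KEY, name TEXT, si_sitename TEXT)"
)
cur.executemany(
    "INSERT INTO mediawikis (name, si_sitename) VALUES (?, ?)",
    [
        (None, "My wiki test"),        # two different wikis, same API sitename
        (None, "My wiki test"),
        ("Known Wiki", "Something"),   # manual name already set, left alone
    ],
)

# Backfill: copy the API-reported sitename where no manual name exists yet.
cur.execute("UPDATE mediawikis SET name = si_sitename WHERE name IS NULL")

# Report names shared by more than one row -- candidates for manual renaming.
dupes = cur.execute(
    "SELECT name, COUNT(*) FROM mediawikis GROUP BY name HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('My wiki test', 2)]
```

The same `GROUP BY name HAVING COUNT(*) > 1` query is also what would produce the list of the 44 wikis named just "Wiki".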
Then add these, too: http://s23.org/wiki/Wikistats/WikiHoster.net
Please add that to the other bug I just opened for new hives/farms; I found a lot more. All "real" duplicates (same name and same stats) are deleted from the table now :) I also removed wikinews / pardus. See the other bug for what is left with the same names but different stats. Calling the import done now and setting this to resolved.
Removed upon request