
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T38265, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 36265 - import several thousand wikis from wikiteam lists
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: wikistats
Version: unspecified
Hardware: All
OS: All
Importance: Normal minor
Target Milestone: ---
Assigned To: Daniel Zahn
Depends on:
Blocks:
Reported: 2012-04-26 08:39 UTC by Daniel Zahn
Modified: 2012-05-28 19:10 UTC
CC List: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Daniel Zahn 2012-04-26 08:39:55 UTC
reported by NemoBis.

Between 7,000 and 8,000 wiki API URLs are listed here:

https://code.google.com/p/wikiteam/source/browse/trunk/batchdownload/lists/

Import them all into the wikistats mediawikis table.
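
For illustration, a bulk import along these lines could look roughly like this (Python sketch only; the mediawikis table, its column names, the connection details, and the local lists/ checkout are assumptions, not how the wikistats code actually does it):

# Sketch: read the checked-out wikiteam list files and dump the raw API URLs into the
# "mediawikis" table. Column names and connection details are assumptions, not the real schema.
import glob
import pymysql

db = pymysql.connect(host="localhost", user="wikistats", password="...", database="wikistats")

with db.cursor() as cur:
    # assumes the lists/ directory from the wikiteam repository is checked out locally
    for listfile in glob.glob("lists/*"):
        with open(listfile, encoding="utf-8", errors="replace") as f:
            for line in f:
                url = line.strip()
                if url:
                    # dump everything in; non-working entries and dupes get cleaned up later
                    cur.execute("INSERT INTO mediawikis (statsurl) VALUES (%s)", (url,))
db.commit()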
Comment 1 Daniel Zahn 2012-04-26 08:55:28 UTC
imported thousands of raw API URLs.

So far the database expected a wiki name and stats URL, but since we can also ask the API for the wiki name itself, I added an "import" function to the update script. It now goes through all wikis in the mediawikis table that have no name yet, tries to fetch the name and statistics, and then updates the db.

running now...  mediawikis already has more articles than wikia and wiktionaries.. and growing
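
Roughly what that import step boils down to (sketch only; the column names and connection details here are guesses, not the actual wikistats schema):

# Sketch of the "import" step: for every wiki without a name, ask its api.php for the
# sitename and statistics via the siteinfo API, then write the result back.
# Column names ("good", "total", etc.) and connection details are assumptions.
import json
import pymysql
import urllib.request
from urllib.parse import urlencode

db = pymysql.connect(host="localhost", user="wikistats", password="...", database="wikistats")

def fetch_siteinfo(api_url):
    params = urlencode({"action": "query", "meta": "siteinfo",
                        "siprop": "general|statistics", "format": "json"})
    with urllib.request.urlopen(api_url + "?" + params, timeout=30) as f:
        return json.load(f)["query"]

with db.cursor() as cur:
    cur.execute("SELECT id, statsurl FROM mediawikis WHERE name IS NULL")
    for wiki_id, api_url in cur.fetchall():
        try:
            info = fetch_siteinfo(api_url)
        except Exception:
            continue  # non-working wikis are left for the later cleanup pass
        cur.execute("UPDATE mediawikis SET name=%s, si_sitename=%s, good=%s, total=%s WHERE id=%s",
                    (info["general"]["sitename"], info["general"]["sitename"],
                     info["statistics"].get("articles"), info["statistics"].get("pages"), wiki_id))
db.commit()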
Comment 2 Nemo 2012-04-26 10:26:22 UTC
Awesome!

I hope it doesn't matter, but there are also some wikinews.org subdomains in those lists. Some bluwiki.com subdomains may need to be moved to another list?
Comment 3 Daniel Zahn 2012-04-26 10:45:01 UTC
Yeah, expect more leftovers. The current strategy is "dump everything in there", update, and afterwards clean up the non-working ones, the dupes, and those belonging to other tables.
Comment 4 Daniel Zahn 2012-04-26 13:04:52 UTC
4775 wikis successfully imported.
Comment 5 Daniel Zahn 2012-04-27 19:16:04 UTC
Still needs cleanup of the non-working ones; I figured that would be better done after renaming the existing wikis (see the other ticket), because some of the new ones had duplicate-name conflicts with existing ones (of course).
Comment 6 Daniel Zahn 2012-04-29 10:26:02 UTC
Cleanup after import: moved anarchopedia.org URLs into their own table. That adds a few languages and also revealed duplicates, because they changed their language subdomains from a 3-letter to a 2-letter scheme. Added language / local language names from Wikipedia where the prefix was the same. Deleted a few duplicates, e.g. kept "es" in favor of "spa" and "www.spa". Some are still missing the language names now: hrv? bos? nno? nor? ..
Comment 7 Daniel Zahn 2012-04-29 14:23:10 UTC
More cleanup: changed all old API URLs in the db to just end in "api.php" and dropped any API parameters that may have been added manually; these are globally defined in the config now.

after that deleted all (~ 300) that now had duplicate URLs with newly added ones.

Still fetching full siteinfo from all of them .. wherever it works, with a different number of fields returned depending on the MW version..
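
For the record, the normalization amounts to something like this (sketch; same caveat as before about column names and connection details being assumptions):

# Sketch of the URL cleanup: cut every stored API URL down to ".../api.php" (dropping any
# manually added parameters), then delete rows whose URL now duplicates another row's.
import pymysql

db = pymysql.connect(host="localhost", user="wikistats", password="...", database="wikistats")

with db.cursor() as cur:
    cur.execute("SELECT id, statsurl FROM mediawikis")
    for wiki_id, url in cur.fetchall():
        if "api.php" in url:
            normalized = url.split("api.php", 1)[0] + "api.php"
            if normalized != url:
                cur.execute("UPDATE mediawikis SET statsurl=%s WHERE id=%s", (normalized, wiki_id))
    # keep one row per URL; which duplicate to keep is a policy choice, here the lowest id
    cur.execute("DELETE a FROM mediawikis a JOIN mediawikis b "
                "ON a.statsurl = b.statsurl AND a.id > b.id")
db.commit()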
Comment 8 Daniel Zahn 2012-05-06 10:21:15 UTC
deleted duplicates, same wikis on different domains / URLs:

Archiplanet (greatbuildings.com is the same it seems)

wikidoc (they have different language subdomains but they are just 1 wiki)

sourcewatch (also disinfopedia.org)

bluwiki: deleted the old method-0 entry and kept just one new one. (Yes, a LOT of subdomains, but they all have the same API and stats)... but keep watching them, as they say:
---
Coming soon: Automated MediaWiki deployment
Get a unique MediaWiki installation on your own subdomain. 
---
bgwiki .com/.net .. various www duplicates ... etc.


--
Dropped the unique index on the old "name" column to allow duplicates there. Set name to si_sitename everywhere name was null. This will show wikis with duplicate names like "My wiki test" (from the API) which are actually different wikis. Later, humans can overwrite the name in "name" if desired.
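
In SQL terms, roughly (mysql statements wrapped in Python just for illustration; the index name and connection details are assumptions):

# Sketch of the two changes above: drop the unique index on "name", then backfill
# name from si_sitename wherever it is still null.
import pymysql

db = pymysql.connect(host="localhost", user="wikistats", password="...", database="wikistats")
with db.cursor() as cur:
    cur.execute("ALTER TABLE mediawikis DROP INDEX name")  # allow duplicate names
    cur.execute("UPDATE mediawikis SET name = si_sitename WHERE name IS NULL")
db.commit()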

Also we have 44 wikis just named "Wiki". I can provide you with a list if you want to find "manual names" for them.


---

dupe detection in the table: please see here:

http://meta.wikimedia.org/wiki/User:Mutante/mw-dupes

I am creating the MediaWiki table syntax for that page with a mysql command, but I can't paste the table here because it triggers the spam filter, since it contains wiki URLs that are blocked :p
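
The kind of query behind that list looks roughly like this (sketch; column names are assumptions as before):

# Sketch of a dupe check along the lines of the list linked above: sitenames that occur
# more than once, with all of their URLs.
import pymysql

db = pymysql.connect(host="localhost", user="wikistats", password="...", database="wikistats")

with db.cursor() as cur:
    cur.execute("SELECT si_sitename, COUNT(*), GROUP_CONCAT(statsurl SEPARATOR ' ') "
                "FROM mediawikis GROUP BY si_sitename HAVING COUNT(*) > 1 "
                "ORDER BY COUNT(*) DESC")
    for sitename, n, urls in cur.fetchall():
        print("%dx %s: %s" % (n, sitename, urls))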
Comment 9 Robert Hanke 2012-05-06 13:01:32 UTC
Then add these, too:
http://s23.org/wiki/Wikistats/WikiHoster.net
Comment 10 Daniel Zahn 2012-05-06 13:23:21 UTC
Please add that to the other bug for new hives/farms I just opened. I found a lot more. All "real" duplicates (same name and same stats) are deleted from the table now :)

I also removed wikinews / pardus. See the other bug for what is left now that has the same names but different stats.

calling the import done now and setting this to resolved.
Comment 11 Robert Hanke 2012-05-28 19:10:53 UTC
Removed upon request
