Last modified: 2014-07-19 03:57:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T63508, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 61508 - [Regression] Jenkins: Jobs for npm testing are broken due to npm certificate issues on the new slaves
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Continuous integration
Version: wmf-deployment
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Krinkle
Duplicates: 66048
Reported: 2014-02-18 19:43 UTC by Krinkle
Modified: 2014-07-19 03:57 UTC (History)

Description Krinkle 2014-02-18 19:43:23 UTC
Last working job:
  https://integration.wikimedia.org/ci/job/mwext-VisualEditor-npm/942/console

  - Feb 16, 2014
  - Building remotely on integration-slave01

  - node v0.10.22
  - npm v1.1.38

First failing job:
  https://integration.wikimedia.org/ci/job/mwext-VisualEditor-npm/943/console

  - Feb 18, 2014
  - Building remotely on integration-slave02

  - node v0.8.2
  - npm v1.1.39

16:23:41 npm ERR! Error: SSL Error: CERT_UNTRUSTED
16:23:41 npm ERR!     at ClientRequest.<anonymous> 
16:23:41 npm ERR!     at Socket.ondata (stream.js:38:26)
16:23:41 npm ERR!  [Error: SSL Error: CERT_UNTRUSTED]

Server log:
  https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&diff=99596&oldid=99549


  - February 17
    - 16:15 hashar: Jenkins deleting slave integration-slave01
    - 16:14 hashar: Jenkins added two labs slaves with 4 CPU: integration-slave02 and integration-slave03
    - 08:46 hashar: Upgrading Jenkins, half an hour downtime



So npm was upgraded one minor version, and nodejs was downgraded *2 major versions*, and (possibly unrelated) it seems to be unable to verify the certificate properly.

According to existing bug reports, this is related to the certificate being self-signed. That shouldn't be a problem, however, since a validation mechanism for their official certificate ships with the npm package. Upstream recommends upgrading to the most recent minor version, but an outdated npm doesn't seem to be the cause, considering the bug started happening for us overnight (no certificate change upstream) when we went from v1.1.38 up to v1.1.39 (not down).
Comment 1 Krinkle 2014-02-18 20:10:29 UTC
So the certificates partially ship with nodejs, not the npm package.

Upstream nodejs takes care to backport these cert changes to v0.8. However, the slaves were not only downgraded from v0.10 to v0.8 but also to a much older minor release of the v0.8 branch.

There have been 24 (!) minor releases since v0.8.2; the latest is v0.8.26.
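A minimal sketch of how a slave's installed version could be compared against a required minimum, so that a silent downgrade like this one is caught early. The helper names `version_ge` and `strip_v` are hypothetical, and the sketch assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# version_ge A B: succeed when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available on the host.
version_ge() {
    # Under version sort, the smaller of the two versions comes first;
    # A >= B holds exactly when that first line is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# strip_v VERSION: drop the leading "v" that `node --version` prints.
strip_v() {
    printf '%s\n' "${1#v}"
}

# Example use on a slave (not executed here):
#   version_ge "$(strip_v "$(node --version)")" 0.10.0 || echo "node is too old"
```

Version sort matters here because a plain string comparison would rank 0.8 above 0.10.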

In case this particular cert change can be mitigated by only upgrading npm, I've done a manual upgrade of npm on the individual instances using npm itself:

 $ sudo -s
 $ npm conf set strict-ssl false
 $ npm install -g npm
 npm 1.1.39
 ...
 npm 1.4.3
 $ npm conf set strict-ssl true
 $ cd /tmp && mkdir foo123 && cd foo123
 $ npm install jshint
 success

The jobs still fail after this because the new labs slaves are also missing grunt.

 $ sudo -s
 $ npm install -g grunt-cli
 success


Note that I couldn't interact with npm the normal way because for some reason /home is read-only on the integration instances (even as root, npm still uses your original home as the location for some temporary work and caching).

Had to set `export HOME=/root` to bypass that.
Comment 2 Krinkle 2014-02-18 20:26:12 UTC
krinkle at integration-slave03:

# Enter root and fix HOME so that npm doesn't put cache in /home
# which is read-only in labs (why?)
$ sudo -s
$ export HOME=/root

# Temporarily disable ssl check
$ npm conf set strict-ssl false

# Remove symlink to apt-get installed version
# because npm is not allowed to delete this shadow
$ l /usr/bin/npm
  /usr/bin/npm -> /etc/alternatives/npm*
$ rm /usr/bin/npm

# Upgrade npm
$ /etc/alternatives/npm install -g npm

 npm@1.1.39
 ...
 npm@1.4.3

# Re-enable ssl check
$ npm conf set strict-ssl true

# Verify that stuff works by doing an
# install of an example package (jshint)
# in a tmp dir
$ cd /tmp && mkdir foo123 && cd foo123
$ npm install jshint
 ...
 success
$ cd ~
$ rm -rf /tmp/foo123

# Install Grunt
$ npm install -g grunt-cli

 ...
 success
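
One risk in the manual steps above is that a failed install leaves strict-ssl disabled. The following is a rough sketch of a wrapper that always re-enables it. The function name `with_strict_ssl_disabled` and the `NPM` override variable are hypothetical; it uses `npm config set`, the long form of the `npm conf set` command shown in this comment.

```shell
#!/bin/sh
# NPM can be overridden to point at a specific binary,
# e.g. NPM=/etc/alternatives/npm (hypothetical convenience variable).
NPM=${NPM:-npm}

# with_strict_ssl_disabled CMD...: run CMD with npm's strict-ssl check
# turned off, then turn the check back on whether or not CMD succeeded.
with_strict_ssl_disabled() {
    "$NPM" config set strict-ssl false || return 1
    rc=0
    "$@" || rc=$?
    # Always restore the check, even after a failed command.
    "$NPM" config set strict-ssl true
    return "$rc"
}

# Example use (not executed here):
#   with_strict_ssl_disabled "$NPM" install -g npm
```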
Comment 3 Antoine "hashar" Musso (WMF) 2014-02-19 16:30:53 UTC
integration-slave01 received nodejs 0.10.x when that version was added to apt.wikimedia.org. It was later removed from there, but the instance never got cleaned up.

npm, I have no idea, probably similar.


I don't want the slaves to be tweaked manually; everything must be in puppet. So there are a few bugs that we should file, all related to updating packages in apt.wikimedia.org:

* nodejs 0.10.x (that is apparently a work in progress)
* npm 1.3.10 should be backported from Ubuntu Trusty
* grunt-cli needs to be packaged

Then we can update the list of packages in the operations/puppet.git file ./modules/contint/manifests/packages/labs.pp . It lists npm but not grunt-cli, since there is no package for it.

Does it sound right?
Comment 4 Antoine "hashar" Musso (WMF) 2014-02-24 11:08:40 UTC
Lowering priority and assigning back to Timo.  He applied a workaround.  We still have to file bugs as mentioned in comment #3.
Comment 5 Krinkle 2014-03-12 21:34:14 UTC
(In reply to Antoine "hashar" Musso from comment #3)
> integration-slave01 received nodejs 0.10.x when it got added to
> apt.wikimedia.org. It has been later removed but the instance never got
> cleaned up.
> 
> npm, I have no idea, probably similar.
> 
> 
> I don't want the slaves to be tweaked manually; everything must be in puppet.
> So there are a few bugs that we should file, all related to updating packages
> in apt.wikimedia.org:
> 
> * nodejs 0.10.x (that is apparently a work in progress)
> * npm 1.3.10 should be backported from Ubuntu Trusty
> * grunt-cli needs to be packaged
> 
> Then we can update the list of packages in the operations/puppet.git file
> ./modules/contint/manifests/packages/labs.pp . It lists npm but not grunt-cli,
> since there is no package for it.
> 
> Does it sound right?

Yes, except for grunt-cli needing to be packaged. We explicitly don't want to do that. Like the over 300 other arbitrary npm modules we fetch daily on the integration slaves based on what is listed in package.json in local repositories, this is just another package like that. We can and should (for consistency and to get the right version) install it via npm.

I'm sure there is a puppet syntax for ensuring that a certain shell command has been executed (e.g. based on a certain file existing), similar to how we use git::clone in some places and the puppet file{} syntax. Those aren't provisioned packages either; they are specified inline in our manifest and created by something other than a package (an rb template file, a git clone, or, in this case, an npm install).
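
The guard described here (run a command only when a certain file is missing) can be sketched in plain shell as follows. Puppet's exec resource offers this kind of check through its `creates` parameter; the helper name `run_unless_exists` is hypothetical, and the marker path in the usage comment is the grunt binary location reported in comment 12.

```shell
#!/bin/sh
# run_unless_exists MARKER CMD...: run CMD only when MARKER does not exist yet.
# This makes the step safe to repeat on every provisioning run.
run_unless_exists() {
    marker=$1
    shift
    if [ -e "$marker" ]; then
        echo "skip: $marker already exists"
        return 0
    fi
    "$@"
}

# Example use (not executed here):
#   run_unless_exists /usr/bin/grunt npm install -g grunt-cli
```

A file-based guard only shows that the command ran once; it does not check which version got installed.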
Comment 6 Antoine "hashar" Musso (WMF) 2014-03-19 10:49:25 UTC
For some reason I managed to get the pmtpa slave nodes back to nodejs 0.8.x, which breaks the VisualEditor npm jobs.

I also created two new slaves in eqiad (integration-slave1001 and integration-slave1002) and they come up with nodejs 0.8.x as well.

Will mail ops list to figure out how to get nodejs 0.10.x marked for install on those hosts.
Comment 7 Nemo 2014-03-19 10:59:55 UTC
(In reply to Antoine "hashar" Musso from comment #6)
> Will mail ops list to figure out how to get nodejs 0.10.x marked for install
> on those hosts.

I don't know if it's the same for you, but docs were wrong for me.
<https://www.mediawiki.org/w/index.php?title=Parsoid%2FSetup&diff=930615&oldid=930612>
Comment 8 Antoine "hashar" Musso (WMF) 2014-03-19 11:11:03 UTC
Mailed ops list.  The Parsoid and VisualEditor npm jobs are now failing and preventing changes from being merged until the SSL cert issue is properly fixed.
Comment 9 Antoine "hashar" Musso (WMF) 2014-03-19 11:11:57 UTC
(In reply to Nemo from comment #7)
> I don't know if it's the same for you, but docs were wrong for me.
> <https://www.mediawiki.org/w/index.php?title=Parsoid%2FSetup&diff=930615&oldid=930612>

The doc says to use a PPA which provides 0.10.x.  We do not use PPAs.
Comment 10 Antoine "hashar" Musso (WMF) 2014-03-19 14:02:48 UTC
I was a bit upset this morning. I have applied Timo's fix from comment #2 on all four instances:

 integration-slave02.pmtpa.wmflabs
 integration-slave03.pmtpa.wmflabs
 integration-slave1001.eqiad.wmflabs
 integration-slave1002.eqiad.wmflabs

Seems to work now.
Comment 11 James Forrester 2014-06-02 21:42:19 UTC
Has this recurred?
Comment 12 Krinkle 2014-06-02 22:41:22 UTC
https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&oldid=115053

> 20:08 hashar: Jenkins unpolled integration-slave1003 npm is outdated there and does not trust npmregistry.org ( bug 61508 )
> 22:37 Krinkle: Hack-patching integration-slave1003.eqiad.wmflabs per https://bugzilla.wikimedia.org/show_bug.cgi?id=61508#c2


krinkle at integration-slave1003.eqiad.wmflabs in ~
$ node --version
v0.8.2

$ npm --version
1.1.39

$ sudo -s
# export HOME=/root
# npm conf set strict-ssl false
# l /usr/bin/npm

  /usr/bin/npm -> /etc/alternatives/npm*

# rm /usr/bin/npm
# /etc/alternatives/npm install -g npm

  ...
  npm@1.4.13 /usr/lib/node_modules/npm

# cd /tmp && mkdir foo123 && cd foo123
# npm install jshint

  .. success ..
  jshint@2.5.1 node_modules/jshint

# l `which npm`

  /usr/bin/npm -> ../lib/node_modules/npm/bin/npm-cli.js*

# npm conf set strict-ssl true

# cd ~ && rm -rf /tmp/foo123/
# npm install -g grunt-cli

  .. success ..
  /usr/bin/grunt -> /usr/lib/node_modules/grunt-cli/bin/grunt
  grunt-cli@0.1.13 /usr/lib/node_modules/grunt-cli

# npm --version
1.4.13
# grunt --version
grunt-cli v0.1.13
Comment 13 Krinkle 2014-06-02 22:48:34 UTC
*** Bug 66048 has been marked as a duplicate of this bug. ***
Comment 14 Krinkle 2014-06-02 22:50:58 UTC
(In reply to Antoine "hashar" Musso from bug 66048 comment #1)
> Node marked offline on
> https://integration.wikimedia.org/ci/computer/integration-slave1003/

Brought back online.
Comment 15 Antoine "hashar" Musso (WMF) 2014-06-02 23:41:03 UTC
Thank you Timo for fixing up the installation on integration-slave1003.  I guess we can close this bug since you proposed to use node 0.10.x on bug 66056, which would definitely fix the issue.
Comment 16 Krinkle 2014-07-18 21:41:51 UTC
The new
Comment 17 Krinkle 2014-07-19 03:57:09 UTC
Existing instances have been patched, so marking this as fixed.

The fact that we need to puppetize the patches is a separate bug.

See:
* bug 68256
* bug 66056

And the patches are documented at:
https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup
