Commit graph

92 commits

Author SHA1 Message Date
Joseph Schorr
2d6a6a1f6c Add a timeout to various operations against etcd in the build manager when it cannot connect to etcd
This ensures that the build managers don't simply sit thrashing against a nonexistent cluster, which drives up the CPU on our production nodes and takes them out of service

Addresses https://jira.coreos.com/browse/QUAY-990
2018-07-08 12:25:33 +03:00
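Note on 2d6a6a1f6c: the idea of bounding an operation with a timeout, so that a caller fails fast instead of retrying against an unreachable cluster, can be sketched as follows. This is a minimal illustration and not the code from the commit. It assumes Python's standard asyncio rather than the build manager's actual etcd client, and `with_timeout`, `never_connects`, and `demo` are hypothetical names.

```python
import asyncio


async def with_timeout(awaitable, seconds, default=None):
    """Await `awaitable`, giving up after `seconds` and returning `default`."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        # The operation did not finish in time; report the failure to the
        # caller instead of blocking or spinning on it.
        return default


async def never_connects():
    """Hypothetical stand-in for a call against an unreachable cluster."""
    await asyncio.sleep(3600)
    return "connected"


def demo():
    # The stand-in call is cancelled after 50 ms and the default is returned.
    return asyncio.run(with_timeout(never_connects(), 0.05, default="timed out"))
```

Returning a default keeps the caller's control flow simple. Whether to return a sentinel or re-raise depends on how the caller wants to handle an unavailable cluster.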
Joseph Schorr
400a5db719 Add additional metrics on executor start and failure
This will allow us to register a pager if one of the executors starts failing consistently
2017-11-27 11:52:37 +02:00
Joseph Schorr
ddb1ed7441 Also delete the job key when expiring a job
Otherwise, we can't requeue the job
2017-10-11 15:55:35 -04:00
Joseph Schorr
c799367ac4 Make sure expired startup marks build jobs incomplete immediately
Currently, we wait for the job to expire, which can take a very long time. We also add even more logs, in an attempt to track down the root cause
2017-10-11 14:56:19 -04:00
EvB
cedce6f98b fix(buildman/ephemeral): remove exception log on noncritical error 2017-02-09 11:32:41 -08:00
Joseph Schorr
b407f88a26 Remove unnecessary CloudWatch metrics
They are spamming the API and costing us a lot of money
2017-02-01 13:08:21 -05:00
Evan Cordell
dd5f7cbe6c Fix the ephemeral build metrics 2016-12-13 18:28:04 -05:00
Charlton Austin
c6be12e31e Adding in a cancel method to the build component so we can properly clean up the job task. 2016-12-06 13:37:49 -05:00
Charlton Austin
0c7a2e4645 Removing realm key from etcd. 2016-12-02 11:37:56 -05:00
Charlton Austin
8ec14ac3bd Adding in a delete of the etcd key for cancelled jobs. 2016-12-01 16:03:54 -05:00
Jake Moshenko
f0ef4347e5 Make the redis client use AsyncWrapper and coroutines
Change all log messages to be synchronous
2016-11-18 15:59:14 -05:00
Joseph Schorr
ef41e57aad Add executor-specific setup time support
This will allow us to make the setup time TTL for k8s-based builds much lower (on the order of a minute), which means faster timeouts and fallbacks (which is a better user experience).
2016-11-07 15:45:15 -05:00
Joseph Schorr
9f9d32548b Standardize the internal error logs for better tracking 2016-10-31 13:47:24 -04:00
Charlton Austin
0c2fec8314 Fixing the build 2016-10-27 15:10:03 -04:00
Charlton Austin
2147005d2c Adding a method of cancelling a build based on etcd message. 2016-10-25 12:50:58 -04:00
Evan Cordell
3542255db8 buildman: let metric data live longer in etcd 2016-10-04 15:06:46 -04:00
Evan Cordell
943a20f042 buildman: linter fixes 2016-10-04 11:44:31 -04:00
Evan Cordell
f3091c6424 Fix the metrics 2016-10-03 17:53:40 -04:00
Evan Cordell
42ebb0a6c3 Record metrics in a separate etcd record 2016-10-03 16:11:37 -04:00
Evan Cordell
d99c206b47 Fix build time metric 2016-10-01 17:25:13 -04:00
Evan Cordell
07e23a34ed Fix metrics 2016-09-30 13:45:45 -04:00
Evan Cordell
68c5384473 Fixes prometheus start metric 2016-09-30 13:09:03 -04:00
Joseph Schorr
f50bb8a1ce Add missing call to set_phase when a build doesn't start
This change fixes the build manager ephemeral executor to tell the overall build server to call set_phase when a build never starts. Before this change, we'd properly adjust the queue item, but not the repo build row or the logs, which is why users just saw "Preparing Build Node", with no indication that the node failed to start.

Fixes #1904
2016-09-30 14:54:49 +02:00
Evan Cordell
832ee89923 Add duration metric collector decorator (#1885)
Track time-to-start for builders
Track time-to-build for builders
Track ec2 builder fallbacks
Track build time
2016-09-29 15:44:06 -04:00
josephschorr
ad4efba802 Merge pull request #1830 from coreos-inc/superuser-dashboard
Add prometheus stats to enable better dashboarding
2016-09-26 17:19:22 +02:00
Joseph Schorr
1571b2867a Add executor name to the build metric 2016-09-16 16:26:04 -04:00
Joseph Schorr
f9f60b9faf Fix some issues around state in the build managers
- Make sure to cleanup the job if the executor could not be started
- Change the setup leeway to further ensure there isn't any crossover between the queue item timing out and the cleanup of the jobs
- Make the lock used for marking jobs as internal error extremely long, but also based on the execution ID. This should ensure we don't get duplicates while allowing different executions to be handled properly.
- Make sure to invoke the callback update for the queue before we run off to etcd; should reduce certain timeouts

Hopefully Fixes #1836
2016-09-15 14:37:45 -04:00
Joseph Schorr
e67b95ae04 Change log level of an expected log message 2016-08-31 17:25:54 -04:00
Joseph Schorr
e17e0e4172 Add log for when the job key is written 2016-08-30 14:08:56 -04:00
Joseph Schorr
292abb5395 Better handling and logging of exceptions in build manager
Also increases the setup timeout for EC2
2016-08-30 13:52:36 -04:00
Joseph Schorr
bc670611ef Increase the timeout on the atomic lock
Some nodes were still performing the action twice when falling outside of the 30s window
2016-08-23 12:50:38 -04:00
Joseph Schorr
3112388004 Fix multiple reporting of incomplete 2016-08-17 16:01:28 -04:00
Joseph Schorr
742e153133 Fix watch of the jobs key in the build manager 2016-08-16 15:43:09 -04:00
Joseph Schorr
313d65a6a4 Make sure the etcd watch coroutines get called 2016-08-16 13:02:27 -04:00
Joseph Schorr
d78361b041 Cleanup old executions that never start
Fixes #1727
2016-08-15 16:54:02 -04:00
Joseph Schorr
c29f9ccc7f Fix TTL on heartbeat in etcd
Until now, once the heartbeat had expired, we would issue a negative TTL, which causes etcd to either raise an exception or simply ignore the expiration (depending on the version of etcd). This change ensures that once the key is expired, it is removed immediately by setting a TTL of 0. Also adds tests for this case and the normal expiration case.
2016-08-03 11:15:03 -04:00
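Note on c29f9ccc7f: the fix described above amounts to clamping the computed TTL at zero. A minimal sketch, assuming the expiration is tracked as a Unix timestamp in seconds; `heartbeat_ttl` is a hypothetical helper, not the function from the commit.

```python
import time


def heartbeat_ttl(expiration_ts, now=None):
    """Return the TTL, in whole seconds, to set on a heartbeat key.

    An expiration in the past yields 0 rather than a negative number,
    so an already-expired key is removed immediately.
    """
    now = time.time() if now is None else now
    remaining = int(expiration_ts - now)
    return max(0, remaining)
```

For example, `heartbeat_ttl(100, now=90)` gives 10, while `heartbeat_ttl(100, now=130)` gives 0 instead of -30.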
Joseph Schorr
428a7cb435 Fix decreased setup timeout on ephemeral build manager 2016-07-22 13:35:38 -04:00
Joseph Schorr
392242d20b Another fix for the record keeping in buildman
Adds some more mocked tests as well
2016-07-22 12:01:30 -04:00
Joseph Schorr
68baa51d55 Fix cross-manager handling of realm components 2016-07-21 15:47:25 -04:00
Joseph Schorr
4420b1bac9 Add temporary back-compat shims for the build manager 2016-07-20 13:41:01 -04:00
Joseph Schorr
2c1880b944 Bug fixes, refactoring and "new" tests for the build manager
- Fixes various bugs introduced in the most recent build system commit
- Refactors state management in the build manager to be cleaner and more contained
- Adds back in the mock-based tests, fixed to not use threads and adjusted for the refactoring
- Adds some more simplified unit tests around non-etcd related flows
2016-07-18 13:46:48 -04:00
Joseph Schorr
74b87fa813 Build manager cleanup and more logging 2016-07-14 14:33:14 -04:00
Joseph Schorr
811413fe9c Add multiple executor and whitelist support to build manager 2016-07-08 15:50:51 -04:00
Joseph Schorr
7471d0e35f Small code cleanup before whitelist addition 2016-07-08 15:50:51 -04:00
Colin Hom
bc13333f20 Kubernetes build worker 2016-07-08 15:50:51 -04:00
Joseph Schorr
713ba3abaf Further updates to the Prometheus client code 2016-07-01 14:16:51 -04:00
Joseph Schorr
773e73861f Change error into info in build manager
Fixes #1046
2015-12-09 14:30:14 -05:00
Joseph Schorr
edd9a03af5 Catch additional key not found exception
Fixes #806
2015-12-01 12:29:58 -05:00
Joseph Schorr
0f37e66cc8 Better error handling for the build manager
Fixes #604
2015-10-13 11:40:07 -04:00
Matt Jibson
bba1557437 Monitor queue adds and EC2 node starts
fixes #157
see #304
2015-09-18 16:21:16 -04:00