Evan Cordell
07e23a34ed
Fix metrics
2016-09-30 13:45:45 -04:00
Evan Cordell
68c5384473
Fixes prometheus start metric
2016-09-30 13:09:03 -04:00
josephschorr
fa4588c7d9
Merge pull request #1908 from coreos-inc/fix-build-phase
...
Add missing call to set_phase when a build doesn't start
2016-09-30 17:52:39 +02:00
josephschorr
0c2b4ed9c1
Merge pull request #1897 from coreos-inc/hash-executor-whitelist
...
Add hash-based staged rollout to build executors
2016-09-30 17:52:19 +02:00
Joseph Schorr
f50bb8a1ce
Add missing call to set_phase when a build doesn't start
...
This change fixes the build manager ephemeral executor to tell the overall build server to call set_phase when a build never starts. Before this change, we'd properly adjust the queue item, but not the repo build row or the logs, which is why users just saw "Preparing Build Node", with no indication that the node had failed to start.
Fixes #1904
2016-09-30 14:54:49 +02:00
Joseph Schorr
51a519f653
Add hash-based staged rollout to build executors
...
Fixes #1882
2016-09-29 22:48:42 +02:00
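Note on the commit above: the log does not say how the hash-based rollout decides which builds go to an executor. One common approach, sketched here purely as an assumption, is to hash a stable key (for example a namespace name) into a bucket from 0 to 99 and admit the key when its bucket falls below the configured percentage. The function name `in_rollout` and its parameters are hypothetical and not taken from the repository.

```python
import hashlib


def in_rollout(key, percentage):
    """Return True if `key` falls inside a staged rollout of `percentage` percent.

    Hypothetical helper: the key is hashed with SHA-256 so that the same key
    always lands in the same bucket, regardless of process or machine.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first four bytes of the digest as an unsigned integer, then
    # reduce it to a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percentage
```

Because the bucket depends only on the key, raising the percentage only ever adds keys to the rollout; keys already admitted stay admitted.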
Evan Cordell
832ee89923
Add duration metric collector decorator (#1885)
...
Track time-to-start for builders
Track time-to-build for builders
Track ec2 builder fallbacks
Track build time
2016-09-29 15:44:06 -04:00
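Note on the commit above: a duration-collecting decorator generally wraps a function, times the call, and hands the elapsed seconds to a metric. The sketch below illustrates that pattern only; `duration_collector`, the `record` callback, and the `labels` argument are hypothetical names, and the commit's actual decorator and its Prometheus wiring may differ.

```python
import time
from functools import wraps


def duration_collector(record, labels=()):
    """Build a decorator that reports how long each call takes.

    `record` is any callable accepting (seconds, labels); in a real system it
    would forward the value to a metrics client. Both names are assumptions.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Report the duration even when the wrapped call raises.
                record(time.monotonic() - start, labels)
        return wrapper
    return decorator
```

Reporting from the `finally` clause means failed calls are measured too, which matters for metrics such as time-to-start where failures are the interesting case.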
Brad Ison
593c3eb9c7
Set dnsPolicy to Default on k8s build jobs
...
This prevents the builder pods from having resolv.conf pointed at the
kube-dns service, which they won't have access to.
2016-09-29 11:22:11 -04:00
Brad Ison
631ad0422d
Default to 4GB memory for k8s builders
2016-09-29 11:20:49 -04:00
josephschorr
ad4efba802
Merge pull request #1830 from coreos-inc/superuser-dashboard
...
Add prometheus stats to enable better dashboarding
2016-09-26 17:19:22 +02:00
Joseph Schorr
1571b2867a
Add executor name to the build metric
2016-09-16 16:26:04 -04:00
Joseph Schorr
f9f60b9faf
Fix some issues around state in the build managers
...
- Make sure to cleanup the job if the executor could not be started
- Change the setup leeway to further ensure there isn't any crossover between the queue item timing out and the cleanup of the jobs
- Make the lock used for marking jobs as internal error extremely long, but also based on the execution ID. This should ensure we don't get duplicates while allowing different executions to be handled properly.
- Make sure to invoke the callback update for the queue before we run off to etcd; should reduce certain timeouts
Hopefully Fixes #1836
2016-09-15 14:37:45 -04:00
Brad Ison
2a1cf2bfd1
Always pull latest image in k8s builds
2016-09-08 15:00:12 -04:00
Joseph Schorr
e67b95ae04
Change log level of an expected log message
2016-08-31 17:25:54 -04:00
Joseph Schorr
e17e0e4172
Add log for when the job key is written
2016-08-30 14:08:56 -04:00
Joseph Schorr
292abb5395
Better handling and logging of exceptions in build manager
...
Also increases the setup timeout for EC2
2016-08-30 13:52:36 -04:00
Joseph Schorr
cd2d0341a7
Fix k8s builder to use the declared volume size
...
Fixes #1773
2016-08-29 15:16:28 -04:00
Joseph Schorr
bc670611ef
Increase the timeout on the atomic lock
...
Some nodes were still performing the action twice when falling outside of the 30s window
2016-08-23 12:50:38 -04:00
Joseph Schorr
3112388004
Fix multiple reporting of incomplete
2016-08-17 16:01:28 -04:00
Joseph Schorr
5e1a117ff3
Delete the job first to prevent Kubernetes from starting another pod
2016-08-16 16:33:43 -04:00
Joseph Schorr
742e153133
Fix watch of the jobs key in the build manager
2016-08-16 15:43:09 -04:00
Joseph Schorr
313d65a6a4
Make sure the etcd watch coroutines get called
2016-08-16 13:02:27 -04:00
Joseph Schorr
d78361b041
Cleanup old executions that never start
...
Fixes #1727
2016-08-15 16:54:02 -04:00
Joseph Schorr
c29f9ccc7f
Fix TTL on heartbeat in etcd
...
Until now, once the heartbeat had expired, we would issue a negative TTL, which causes etcd to either raise an exception or simply ignore the expiration (depending on the version of etcd). This change ensures that once the key is expired, it is removed immediately by setting a TTL of 0. Also adds tests for this case and the normal expiration case.
2016-08-03 11:15:03 -04:00
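Note on the commit above: the fix it describes amounts to clamping the computed TTL at zero before writing the key. A minimal sketch of that calculation, assuming the expiration is kept as a Unix timestamp; the function name `heartbeat_ttl` and its parameters are hypothetical and the etcd write itself is omitted.

```python
import time


def heartbeat_ttl(expiration_ts, now=None):
    """Return the TTL in whole seconds for a key expiring at `expiration_ts`.

    The result is never negative: an already-expired key gets a TTL of 0 so
    that it is removed immediately instead of being rejected or ignored.
    """
    now = time.time() if now is None else now
    return max(0, int(expiration_ts - now))
```

For example, `heartbeat_ttl(100, now=40)` yields 60, while `heartbeat_ttl(100, now=150)` yields 0 rather than -50.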
Joseph Schorr
428a7cb435
Fix decreased setup timeout on ephemeral build manager
2016-07-22 13:35:38 -04:00
Joseph Schorr
392242d20b
Another fix for the record keeping in buildman
...
Adds some more mocked tests as well
2016-07-22 12:01:30 -04:00
Joseph Schorr
68baa51d55
Fix cross-manager handling of realm components
2016-07-21 15:47:25 -04:00
Joseph Schorr
4420b1bac9
Add temporary back-compat shims for the build manager
2016-07-20 13:41:01 -04:00
Joseph Schorr
2c1880b944
Bug fixes, refactoring and "new" tests for the build manager
...
- Fixes various bugs introduced in the most recent build system commit
- Refactors state management in the build manager to be cleaner and more contained
- Adds back in the mock-based tests, fixed to not use threads and adjusted for the refactoring
- Adds some more simplified unit tests around non-etcd related flows
2016-07-18 13:46:48 -04:00
Joseph Schorr
74b87fa813
Build manager cleanup and more logging
2016-07-14 14:33:14 -04:00
Joseph Schorr
d8b72e8503
Switch to using a defined branch and not always pulling the VM image
2016-07-08 17:53:25 -04:00
Joseph Schorr
3d4af78f01
Fix label to never allow a space (which breaks Kubernetes)
2016-07-08 17:09:06 -04:00
Joseph Schorr
811413fe9c
Add multiple executor and whitelist support to build manager
2016-07-08 15:50:51 -04:00
Joseph Schorr
7471d0e35f
Small code cleanup before whitelist addition
2016-07-08 15:50:51 -04:00
Colin Hom
1e3351f3f4
local-docker.sh now accepts env vars
2016-07-08 15:50:51 -04:00
Colin Hom
bc13333f20
Kubernetes build worker
2016-07-08 15:50:51 -04:00
Joseph Schorr
713ba3abaf
Further updates to the Prometheus client code
2016-07-01 14:16:51 -04:00
Joseph Schorr
1173192739
Move channel back, as it is referenced by generate_cloud_config
2016-06-22 17:25:06 -04:00
Joseph Schorr
61695eb439
Allow the build node AMI to be overridden in config
2016-06-22 15:13:54 -04:00
Joseph Schorr
7292524d69
Add a cloud watch metric when we fail to start a build via EC2
...
Fixes #1555
2016-06-17 16:19:57 -04:00
Joseph Schorr
f9469a84b3
Make the size of the build node HDD configurable
...
Fixes #1520
2016-06-06 11:35:10 -04:00
Joseph Schorr
5262535945
Boto error_code is a string, not the HTTP status code
2015-12-23 15:12:01 -05:00
Joseph Schorr
773e73861f
Change error into info in build manager
...
Fixes #1046
2015-12-09 14:30:14 -05:00
josephschorr
c06e5cc9c7
Merge pull request #1002 from coreos-inc/buildertagexc
...
Add timeout and failure if an EC2 instance could not be found when ta…
2015-12-09 14:28:31 -05:00
Joseph Schorr
946e5fabc0
Add timeout and failure if an EC2 instance could not be found when tagging
...
Fixes #994
2015-12-09 14:28:19 -05:00
Joseph Schorr
edd9a03af5
Catch additional key not found exception
...
Fixes #806
2015-12-01 12:29:58 -05:00
Jimmy Zelinskie
46b2f10d7f
check for VPC subnet ID before using builder VPC
...
This means you can use legacy networking machines by simply changing the
instance type and removing the specified 'EC2_VPC_SUBNET_ID' from the
executor config.
2015-10-22 14:50:54 -04:00
Joseph Schorr
0f37e66cc8
Better error handling for the build manager
...
Fixes #604
2015-10-13 11:40:07 -04:00
Matt Jibson
bba1557437
Monitor queue adds and EC2 node starts
...
fixes #157
see #304
2015-09-18 16:21:16 -04:00
Joseph Schorr
04cc471585
Increase the HD size on the build nodes
...
Fixes #228
2015-07-14 15:20:17 +03:00