Commit graph

150 commits

Author SHA1 Message Date
Joseph Schorr
bc670611ef Increase the timeout on the atomic lock
Some nodes were still performing the action twice when falling outside of the 30s window
2016-08-23 12:50:38 -04:00
Joseph Schorr
3112388004 Fix multiple reporting of incomplete 2016-08-17 16:01:28 -04:00
Joseph Schorr
5e1a117ff3 Delete the job first to prevent Kubernetes from starting another pod 2016-08-16 16:33:43 -04:00
Joseph Schorr
742e153133 Fix watch of the jobs key in the build manager 2016-08-16 15:43:09 -04:00
Joseph Schorr
313d65a6a4 Make sure the etcd watch coroutines get called 2016-08-16 13:02:27 -04:00
Joseph Schorr
d78361b041 Cleanup old executions that never start
Fixes #1727
2016-08-15 16:54:02 -04:00
Joseph Schorr
c29f9ccc7f Fix TTL on heartbeat in etcd
Until now, once the heartbeat has expired, we would issue a TTL that is negative, which causes etcd to either raise an exception or simply ignore the expiration (depending on the version of etcd). This change ensures that once the key is expired, it is removed immediately via a set of a TTL of 0. Also adds tests for this case and the normal expiration case.
2016-08-03 11:15:03 -04:00
Joseph Schorr
428a7cb435 Fix decreased setup timeout on ephemeral build manager 2016-07-22 13:35:38 -04:00
Joseph Schorr
392242d20b Another fix for the record keeping in buildman
Adds some more mocked tests as well
2016-07-22 12:01:30 -04:00
Joseph Schorr
68baa51d55 Fix cross-manager handling of realm components 2016-07-21 15:47:25 -04:00
Joseph Schorr
4420b1bac9 Add temporary back-compat shims for the build manager 2016-07-20 13:41:01 -04:00
Joseph Schorr
2c1880b944 Bug fixes, refactoring and "new" tests for the build manager
- Fixes various bugs introduced in the most recent build system commit
- Refactors state management in the build manager to be cleaner and more contained
- Adds back in the mock-based tests, fixed to not use threads and adjusted for the refactoring
- Adds some more simplified unit tests around non-etch related flows
2016-07-18 13:46:48 -04:00
Joseph Schorr
74b87fa813 Build manager cleanup and more logging 2016-07-14 14:33:14 -04:00
Joseph Schorr
d8b72e8503 Switch to using a defined branch and not always pulling the VM image 2016-07-08 17:53:25 -04:00
Joseph Schorr
3d4af78f01 Fix label to never allow a space (which breaks Kubernetes) 2016-07-08 17:09:06 -04:00
Joseph Schorr
811413fe9c Add multiple executor and whitelist support to build manager 2016-07-08 15:50:51 -04:00
Joseph Schorr
7471d0e35f Small code cleanup before whitelist addition 2016-07-08 15:50:51 -04:00
Colin Hom
1e3351f3f4 local-docker.sh now accepts env vars 2016-07-08 15:50:51 -04:00
Colin Hom
bc13333f20 Kubernetes build worker 2016-07-08 15:50:51 -04:00
Joseph Schorr
713ba3abaf Further updates to the Prometheus client code 2016-07-01 14:16:51 -04:00
Joseph Schorr
1173192739 Move channel back, as it is referenced by generate_cloud_config 2016-06-22 17:25:06 -04:00
Joseph Schorr
61695eb439 Allow the build node AMI to be overridden in config 2016-06-22 15:13:54 -04:00
Joseph Schorr
7292524d69 Add a cloud watch metric when we fail to start a build via EC2
Fixes #1555
2016-06-17 16:19:57 -04:00
Joseph Schorr
f9469a84b3 Make the size of the build node HDD configurable
Fixes #1520
2016-06-06 11:35:10 -04:00
Joseph Schorr
5262535945 Boto error_code is a string, not the HTTP status code 2015-12-23 15:12:01 -05:00
Joseph Schorr
773e73861f Change error into info in build manager
Fixes #1046
2015-12-09 14:30:14 -05:00
josephschorr
c06e5cc9c7 Merge pull request #1002 from coreos-inc/buildertagexc
Add timeout and failure if an EC2 instance could not be found when ta…
2015-12-09 14:28:31 -05:00
Joseph Schorr
946e5fabc0 Add timeout and failure if an EC2 instance could not be found when tagging
Fixes #994
2015-12-09 14:28:19 -05:00
Joseph Schorr
edd9a03af5 Catch additional key not found exception
Fixes #806
2015-12-01 12:29:58 -05:00
Jimmy Zelinskie
46b2f10d7f check for VPC subnet ID before using builder VPC
This means you can use legacy networking machines by simply changing the
instance type and removing the specified 'EC2_VPC_SUBNET_ID' from the
executor config.
2015-10-22 14:50:54 -04:00
Joseph Schorr
0f37e66cc8 Better error handling for the build manager
Fixes #604
2015-10-13 11:40:07 -04:00
Matt Jibson
bba1557437 Monitor queue adds and EC2 node starts
fixes #157
see #304
2015-09-18 16:21:16 -04:00
Joseph Schorr
04cc471585 Increase the HD size on the build nodes
Fixes #228
2015-07-14 15:20:17 +03:00
Joseph Schorr
6655c7f745 Add exception handling that doesn't log the read-timeout exception
Note: This is a *hack* and needs to be replaced with proper code ASAP
2015-06-25 23:35:29 -04:00
Joseph Schorr
6e6610f31a Switch to a 30s maximum timeout 2015-06-25 23:08:49 -04:00
Joseph Schorr
ecebc06343 Update comment now that restarter is abstracted 2015-06-25 21:53:42 -04:00
Joseph Schorr
9f5f71398c Abstract out the concept of a restart function 2015-06-25 21:40:50 -04:00
Joseph Schorr
52fa9aad5b Fix etcd watching
Etcd can miss events on watches if they are occurring fast enough, so if we can get an exception indicating that we've missed an index, we reset the state of our local tracking structures by re-reading the *full* list and starting a new watch at HEAD
2015-06-25 21:22:39 -04:00
Joseph Schorr
b4c39e8ec0 Fix ephemeral build manager to ask for watches in index order with no gaps 2015-06-23 17:11:46 -04:00
Joseph Schorr
c2dc1c9b75 Handle case where etcd key is already removed on job complete 2015-06-17 15:02:58 -04:00
Jake Moshenko
c435f5c127 Add a comment about why we are taking a lock when terminating a builder machine. 2015-06-10 16:19:51 -04:00
Jake Moshenko
f767fc4d03 Track whether builders ever came online in etcd. Mark builds which never successfully heartbeated as incomplete. 2015-06-10 16:19:51 -04:00
Jake Moshenko
884fedd229 Improve the log messages in the buildman. 2015-06-10 16:19:51 -04:00
Jake Moshenko
d31e25d5cd Allow the individual build manager types to specify how long the queue should wait before retring a job that fails to schedule. 2015-06-10 16:19:50 -04:00
Joseph Schorr
f82831bff6 Log the etcd exception so we can debug this issue 2015-06-09 20:33:55 -04:00
Jake Moshenko
6eead7c860 Add logentries reporting to the ephemeral builders. 2015-03-27 15:28:08 -04:00
Jake Moshenko
0349f3f1a3 Handle the case where YAML config returns a list not a tuple. 2015-03-26 14:53:56 -04:00
Jimmy Zelinskie
8589871f43 buildman: rm unused imports 2015-03-09 13:04:16 -04:00
Jake Moshenko
5c68e52fce Really really fix the exception handling. 2015-02-27 17:33:46 -05:00
Jake Moshenko
cf5bc6f0be Properly catch multiple exceptions. 2015-02-27 17:32:10 -05:00
Jake Moshenko
857c3e2959 Start catching etcd key errors as well. 2015-02-27 17:10:15 -05:00
Joseph Schorr
4551b3a957 Remove the boto timeout set (doesn't work anyway) and add some better logging to the scheduler 2015-02-25 16:00:14 -05:00
Joseph Schorr
5dd78f76c7 Add additional logging, timeouts, and exception checks 2015-02-25 15:15:22 -05:00
Joseph Schorr
2eaec092f0 Handle the case where we cannot write the tags on the build nodes 2015-02-25 13:47:36 -05:00
Joseph Schorr
afe7e14254 Add better exception handling and logging to the ephemeral build manager 2015-02-25 12:09:14 -05:00
Joseph Schorr
524705b88c Get dashboard working and upgrade bootstrap. Note: the bootstrap fixes will be coming in the followup CL 2015-02-17 19:15:54 -05:00
Joseph Schorr
98b4f62ef7 Switch to using a squashed image for the build workers 2015-02-10 15:43:01 -05:00
Jake Moshenko
5b8d65991e Update the space on the builder nodes because its cheap. 2015-02-04 11:58:58 -05:00
Joseph Schorr
361fb33574 - Add a small build script
- Take in the build worker branch name from config
- Add additional logging (to be removed after we figure out the problem)
2015-02-03 12:48:41 -05:00
Jake Moshenko
2215ec6669 Associate a public IP with the network interfaces on our VPC instances. 2015-02-02 15:28:40 -05:00
Jake Moshenko
db8493f254 update the executor template to use VPC instances. 2015-02-02 14:55:34 -05:00
Jake Moshenko
a4b0c8698d Allow the key prefixes in etcd to be configurable. 2015-02-02 12:00:19 -05:00
Jake Moshenko
c308794063 Fix the enterprise manager to use the new coroutine based interface. 2015-01-29 10:56:18 -05:00
Jake Moshenko
ef0806bd9d Make the logs for the build manager more bearable. 2015-01-26 15:27:39 -05:00
Jake Moshenko
86852da4ba Catch exceptions when ELB times out a connection to etcd. 2015-01-23 11:29:38 -05:00
Jake Moshenko
265aeabf60 We need to tell the etcd client which protocol to use. 2015-01-22 16:59:04 -05:00
Jake Moshenko
f2471a86f6 Fix the python requirements. Add the ability to map in etcd client certs and ca. 2015-01-22 10:53:23 -05:00
Jake Moshenko
fc757fecad Tag the EC2 instances with the build uuid. 2015-01-05 15:35:14 -05:00
Jake Moshenko
8037962716 Change the severity of a log message which is actually expected in the happy case. 2015-01-05 14:44:54 -05:00
Jake Moshenko
f58b09a064 Remove the loop argument from the call to build_component_ready. 2015-01-05 13:08:25 -05:00
Jake Moshenko
320ae63ccd Handle the case where there are no realms registered. 2015-01-05 12:23:54 -05:00
Jake Moshenko
b33ee1a474 Register existing builders to watch their expirations. 2015-01-05 11:21:36 -05:00
Jake Moshenko
a9839021af When the etcd key tracking realms is first created the action is create, not set. 2014-12-31 11:46:02 -05:00
Jake Moshenko
cc70225043 Generalize the ephemeral build managers so that any manager may manage a builder spawned by any other manager. 2014-12-31 11:33:56 -05:00
Jake Moshenko
ec87e37d8c EC2 terminate_instances does not take a force flag. 2014-12-23 17:17:53 -05:00
Jake Moshenko
cece94e1da We want to terminate instances, not stop them. 2014-12-23 16:20:42 -05:00
Jake Moshenko
3ce64b4a7f We must yield from stop_builder. 2014-12-23 16:12:10 -05:00
Jake Moshenko
8e16fbf59b The root device on CoreOS is /dev/xvda. 2014-12-23 15:41:58 -05:00
Jake Moshenko
2f2a88825d Try using SSD for root volumes. 2014-12-23 15:35:21 -05:00
Jake Moshenko
723fb27671 Calls to the ec2 service must be async, and responses must be wrapped as well. 2014-12-23 14:54:58 -05:00
Jake Moshenko
2ed9b3d243 Disable the etcd timeout on watch calls to prevent them from disconnecting the client. 2014-12-23 14:54:34 -05:00
Jake Moshenko
4e22e22ba1 We have to serialize our build data before sending it to etc. 2014-12-23 14:09:04 -05:00
Jake Moshenko
709e571b78 Handle read timeouts from etcd when watching a key. 2014-12-23 12:13:49 -05:00
Jake Moshenko
055a6b0c37 Add a total maximum time that a machine is allowed to stick around before we terminate it more forcefully. 2014-12-23 11:18:10 -05:00
Jake Moshenko
34bf92673b Add support for adjusting etcd ttl on job_heartbeat. Switch the heartbeat method to a coroutine. 2014-12-22 17:24:44 -05:00
Jake Moshenko
2b6c2a2a50 Improve tests for the ephemeral build manager. 2014-12-22 16:22:07 -05:00
Jake Moshenko
e53b6b0e21 Merge remote-tracking branch 'origin/master' into ephemeral 2014-12-22 12:14:59 -05:00
Jake Moshenko
12ee8e0fc0 Switch a few of the buildman methods to coroutines in order to support network calls in methods. Add a test for the ephemeral build manager. 2014-12-22 12:14:16 -05:00
Jake Moshenko
a280bbcb6d Add tag metadata to the instances. 2014-12-16 15:17:39 -05:00
Jake Moshenko
1d68594dc2 Extract instance ids from the instance objects returned by boto. 2014-12-16 15:10:50 -05:00
Jake Moshenko
2d7e844753 First implementation of ephemeral build lifecycle manager. 2014-12-16 13:41:30 -05:00
Jimmy Zelinskie
33f12c58ba Add active worker count to buildmanager logs. 2014-12-16 13:37:40 -05:00
Jimmy Zelinskie
09cc4ba4c1 LOGGER -> logger.
While logger may be a global variable, it is not constant. Let the
linters complain!
2014-11-30 17:48:38 -05:00
Joseph Schorr
660a640de6 Better organize the source file structure of the build manager and change it to choose a lifecycle manager based on the config 2014-11-25 16:14:44 -05:00
Joseph Schorr
b8e873b00b Add support to the build system for tracking if/when the build manager crashes and make sure builds are restarted within a few minutes 2014-11-21 14:27:06 -05:00
Jimmy Zelinskie
d0763862b1 Simple code review changes.
I sneakily also added local-test.sh and renamed run-local to
local-run.sh.
2014-11-20 14:36:22 -05:00
Jimmy Zelinskie
6df6f28edf Lint BuildManager 2014-11-18 15:45:56 -05:00
Joseph Schorr
4322b5f81c Get the new build system working for enterprise 2014-11-13 19:41:17 -05:00
Joseph Schorr
f93c0a46e8 WIP: Get everything working except logging and job completion 2014-11-12 14:03:07 -05:00
Joseph Schorr
eacf3f01d2 WIP: Start implementation of the build manager/controller. This code is not yet working completely. 2014-11-11 18:23:15 -05:00