340 commits

Author SHA1 Message Date
Evan Cordell
42ebb0a6c3 Record metrics in a separate etcd record 2016-10-03 16:11:37 -04:00
Evan Cordell
d99c206b47 Fix build time metric 2016-10-01 17:25:13 -04:00
Brad Ison
d8aa22103e Add a dash to generated k8s job names 2016-10-01 14:02:28 -04:00
Jimmy Zelinskie
9b67fff78f buildman: spawn_notification w/ attrdict 2016-09-30 18:54:09 -04:00
Joseph Schorr
a8bc4bf697 Send the correct phase when setting the phase from job_complete 2016-09-30 21:26:45 +02:00
Evan Cordell
07e23a34ed Fix metrics 2016-09-30 13:45:45 -04:00
Evan Cordell
68c5384473 Fixes prometheus start metric 2016-09-30 13:09:03 -04:00
josephschorr
fa4588c7d9 Merge pull request #1908 from coreos-inc/fix-build-phase
Add missing call to set_phase when a build doesn't start
2016-09-30 17:52:39 +02:00
josephschorr
0c2b4ed9c1 Merge pull request #1897 from coreos-inc/hash-executor-whitelist
Add hash-based staged rollout to build executors
2016-09-30 17:52:19 +02:00
Joseph Schorr
f50bb8a1ce Add missing call to set_phase when a build doesn't start
This change fixes the build manager ephemeral executor to tell the overall build server to call set_phase when a build never starts. Before this change, we'd properly adjust the queue item, but not the repo build row or the logs, which is why users just saw "Preparing Build Node", with no indication that the node failed to start.

Fixes #1904
2016-09-30 14:54:49 +02:00
Joseph Schorr
51a519f653 Add hash-based staged rollout to build executors
Fixes #1882
2016-09-29 22:48:42 +02:00
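One common way to do a hash-based staged rollout is sketched below. The commit does not show its implementation, so the function name `in_rollout`, the choice of SHA-256, and the 100-bucket scheme are all assumptions made for illustration, not the executor's actual code.

```python
import hashlib


def in_rollout(namespace, percentage):
    """Return True if `namespace` lands in the first `percentage` of 100 buckets.

    Hypothetical helper: hashing the namespace gives a stable bucket, so the
    same namespace always gets the same answer for a given percentage.
    """
    digest = hashlib.sha256(namespace.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```

Because the bucket is derived only from the name, raising the percentage keeps every namespace that was already included and adds new ones.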
Evan Cordell
832ee89923 Add duration metric collector decorator (#1885)
Track time-to-start for builders
Track time-to-build for builders
Track ec2 builder fallbacks
Track build time
2016-09-29 15:44:06 -04:00
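A duration-collecting decorator of the kind this commit describes might look roughly like the sketch below. `DurationCollector` and `duration_collector` are hypothetical names, and the in-memory list stands in for whatever metrics backend is really used; the real decorator may well wrap coroutines rather than plain functions.

```python
import functools
import time


class DurationCollector:
    """Hypothetical in-memory collector standing in for a real metrics backend."""

    def __init__(self):
        self.samples = []  # list of (metric_name, seconds)

    def record(self, name, seconds):
        self.samples.append((name, seconds))


def duration_collector(collector, name):
    """Decorator that records how long each call to the wrapped function takes."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even when the wrapped call raises.
                collector.record(name, time.monotonic() - started)

        return wrapper

    return decorator
```

Applying `@duration_collector(collector, "build_time")` to a function would then add one sample per call.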
Brad Ison
593c3eb9c7 Set dnsPolicy to Default on k8s build jobs
This prevents the builder pods from having resolv.conf pointed at the
kube-dns service, which they won't have access to.
2016-09-29 11:22:11 -04:00
Brad Ison
631ad0422d Default to 4GB memory for k8s builders 2016-09-29 11:20:49 -04:00
Joseph Schorr
02b8afe127 Add labeling of built manifests with their build IDs
Also sends the digests to the notification

Fixes #593
2016-09-29 10:58:45 +02:00
josephschorr
ad4efba802 Merge pull request #1830 from coreos-inc/superuser-dashboard
Add prometheus stats to enable better dashboarding
2016-09-26 17:19:22 +02:00
Brad Ison
0fadc745cf Revert "Use Google public DNS in builder VMs"
This reverts commit a331eecd0f.
2016-09-20 12:06:19 -04:00
Joseph Schorr
1571b2867a Add executor name to the build metric 2016-09-16 16:26:04 -04:00
Joseph Schorr
f9f60b9faf Fix some issues around state in the build managers
- Make sure to cleanup the job if the executor could not be started
- Change the setup leeway to further ensure there isn't any crossover between the queue item timing out and the cleanup of the jobs
- Make the lock used for marking jobs as internal error extremely long, but also based on the execution ID. This should ensure we don't get duplicates while allowing different executions to be handled properly.
- Make sure to invoke the callback update for the queue before we run off to etcd; should reduce certain timeouts

Hopefully Fixes #1836
2016-09-15 14:37:45 -04:00
Brad Ison
a331eecd0f Use Google public DNS in builder VMs 2016-09-12 15:05:13 -04:00
Joseph Schorr
b5f9666a03 Add labels to the QEMU image with the CoreOS channel and version 2016-09-12 13:01:59 -04:00
Joseph Schorr
818ea38dac Add repo-specific reporting of repository builds 2016-09-09 15:36:54 -04:00
Brad Ison
2a1cf2bfd1 Always pull latest image in k8s builds 2016-09-08 15:00:12 -04:00
Joseph Schorr
e67b95ae04 Change log level of an expected log message 2016-08-31 17:25:54 -04:00
Brad Ison
6365b6dbfb Set defaults in qemu-coreos entrypoint 2016-08-31 15:49:21 -04:00
Joseph Schorr
9e6e3a6c94 Remove our names from the checked in keys
This means they won't go out in the QE binary, nor will they be viewable on the ephemeral build nodes

Longer term we should probably move these into the config dir
2016-08-30 18:02:05 -04:00
Joseph Schorr
e17e0e4172 Add log for when the job key is written 2016-08-30 14:08:56 -04:00
Joseph Schorr
2fe896ba6a Restore retries of jobs not started and add some leeway to the processing time 2016-08-30 13:57:26 -04:00
Joseph Schorr
292abb5395 Better handling and logging of exceptions in build manager
Also increases the setup timeout for EC2
2016-08-30 13:52:36 -04:00
Brad Ison
e5cc97d462 Time qemu-img resize in qemu-coreos startup 2016-08-29 15:23:30 -04:00
Joseph Schorr
cd2d0341a7 Fix k8s builder to use the declared volume size
Fixes #1773
2016-08-29 15:16:28 -04:00
Joseph Schorr
bc670611ef Increase the timeout on the atomic lock
Some nodes were still performing the action twice when falling outside of the 30s window
2016-08-23 12:50:38 -04:00
Joseph Schorr
3112388004 Fix multiple reporting of incomplete 2016-08-17 16:01:28 -04:00
Joseph Schorr
0b50928900 Fix build start check for the ephemeral case 2016-08-16 17:18:57 -04:00
Joseph Schorr
433b157531 Add extra check to ensure a build cannot be started without on_ready called 2016-08-16 16:38:48 -04:00
Joseph Schorr
5e1a117ff3 Delete the job first to prevent Kubernetes from starting another pod 2016-08-16 16:33:43 -04:00
Joseph Schorr
742e153133 Fix watch of the jobs key in the build manager 2016-08-16 15:43:09 -04:00
Joseph Schorr
313d65a6a4 Make sure the etcd watch coroutines get called 2016-08-16 13:02:27 -04:00
josephschorr
cddba20ffe Merge pull request #1731 from coreos-inc/k8s-cleanup
Cleanup old executions that never start
2016-08-15 17:00:13 -04:00
Joseph Schorr
d78361b041 Cleanup old executions that never start
Fixes #1727
2016-08-15 16:54:02 -04:00
Brad Ison
d37f32b9c7 Add bison's SSH key to builders 2016-08-15 15:53:26 -04:00
Joseph Schorr
acdfc9369d Allow the version of CoreOS to be specified when building QEMU image 2016-08-05 16:46:11 -04:00
Joseph Schorr
c29f9ccc7f Fix TTL on heartbeat in etcd
Until now, once the heartbeat has expired, we would issue a TTL that is negative, which causes etcd to either raise an exception or simply ignore the expiration (depending on the version of etcd). This change ensures that once the key is expired, it is removed immediately via a set of a TTL of 0. Also adds tests for this case and the normal expiration case.
2016-08-03 11:15:03 -04:00
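The TTL clamping described in this commit can be illustrated with the small sketch below. `heartbeat_ttl` is a hypothetical helper, not the build manager's function, and it only computes the value; the actual write to etcd is left out.

```python
import time


def heartbeat_ttl(expiration, now=None):
    """Seconds until `expiration` (a Unix timestamp), never negative.

    Hypothetical helper: an already-expired heartbeat yields 0, which the
    commit uses to have the key removed immediately instead of sending a
    negative TTL.
    """
    if now is None:
        now = time.time()
    return max(0, int(expiration - now))
```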
Joseph Schorr
428a7cb435 Fix decreased setup timeout on ephemeral build manager 2016-07-22 13:35:38 -04:00
Joseph Schorr
392242d20b Another fix for the record keeping in buildman
Adds some more mocked tests as well
2016-07-22 12:01:30 -04:00
Joseph Schorr
68baa51d55 Fix cross-manager handling of realm components 2016-07-21 15:47:25 -04:00
Joseph Schorr
4420b1bac9 Add temporary back-compat shims for the build manager 2016-07-20 13:41:01 -04:00
Joseph Schorr
2c1880b944 Bug fixes, refactoring and "new" tests for the build manager
- Fixes various bugs introduced in the most recent build system commit
- Refactors state management in the build manager to be cleaner and more contained
- Adds back in the mock-based tests, fixed to not use threads and adjusted for the refactoring
- Adds some more simplified unit tests around non-etcd-related flows
2016-07-18 13:46:48 -04:00
Joseph Schorr
74b87fa813 Build manager cleanup and more logging 2016-07-14 14:33:14 -04:00
Joseph Schorr
d8b72e8503 Switch to using a defined branch and not always pulling the VM image 2016-07-08 17:53:25 -04:00
Joseph Schorr
3d4af78f01 Fix label to never allow a space (which breaks Kubernetes) 2016-07-08 17:09:06 -04:00
Joseph Schorr
811413fe9c Add multiple executor and whitelist support to build manager 2016-07-08 15:50:51 -04:00
Joseph Schorr
7471d0e35f Small code cleanup before whitelist addition 2016-07-08 15:50:51 -04:00
Colin Hom
1e3351f3f4 local-docker.sh now accepts env vars 2016-07-08 15:50:51 -04:00
Colin Hom
bc13333f20 Kubernetes build worker 2016-07-08 15:50:51 -04:00
Joseph Schorr
713ba3abaf Further updates to the Prometheus client code 2016-07-01 14:16:51 -04:00
Matt Jibson
3d9acf2fff Use prometheus as a metric backend
This entails writing a metric aggregation program since each worker has its
own memory, and thus own metrics because of python gunicorn. The python
client is a simple wrapper that makes web requests to it.
2016-07-01 14:16:50 -04:00
Joseph Schorr
1173192739 Move channel back, as it is referenced by generate_cloud_config 2016-06-22 17:25:06 -04:00
Joseph Schorr
61695eb439 Allow the build node AMI to be overridden in config 2016-06-22 15:13:54 -04:00
josephschorr
20a6fdc73f Merge pull request #1557 from jzelinskie/buildargs
buildman: mark missing buildargs as failure
2016-06-20 14:40:17 -04:00
Jimmy Zelinskie
871c1634ed buildman: mark missing buildargs as failure 2016-06-17 18:33:54 -04:00
Joseph Schorr
7292524d69 Add a cloud watch metric when we fail to start a build via EC2
Fixes #1555
2016-06-17 16:19:57 -04:00
Jimmy Zelinskie
5298452fa7 builder cloudconfig: shutdown server after 3 hours (#1554) 2016-06-17 16:03:40 -04:00
Joseph Schorr
f9469a84b3 Make the size of the build node HDD configurable
Fixes #1520
2016-06-06 11:35:10 -04:00
Jimmy Zelinskie
7d356c451b buildman: fix misspell 2016-06-03 15:42:14 -04:00
Jimmy Zelinskie
44b56ae2cf queue: explicitly declare ordering requirement
This change defaults the ordering requirement of queue items to be off
and only enables it for the build manager. This should make the queries
for getting queueitems significantly faster for every other use case.
2016-05-27 14:44:30 -04:00
Jimmy Zelinskie
79aa78906a buildman: refresh and add Evan's key to builders 2016-05-24 14:05:39 -04:00
Joseph Schorr
5262535945 Boto error_code is a string, not the HTTP status code 2015-12-23 15:12:01 -05:00
Jimmy Zelinskie
601b99a083 buildman: add git checkout failure 2015-12-16 14:49:37 -05:00
Joseph Schorr
773e73861f Change error into info in build manager
Fixes #1046
2015-12-09 14:30:14 -05:00
josephschorr
c06e5cc9c7 Merge pull request #1002 from coreos-inc/buildertagexc
Add timeout and failure if an EC2 instance could not be found when ta…
2015-12-09 14:28:31 -05:00
Joseph Schorr
946e5fabc0 Add timeout and failure if an EC2 instance could not be found when tagging
Fixes #994
2015-12-09 14:28:19 -05:00
Joseph Schorr
edd9a03af5 Catch additional key not found exception
Fixes #806
2015-12-01 12:29:58 -05:00
Joseph Schorr
fbc4927544 Change to only exception logging internal errors on builds
Fixes #993
2015-11-30 14:30:55 -05:00
Jake Moshenko
c4b637521c Remove Matt Jibson's public key 2015-11-23 18:18:42 -05:00
Matt Jibson
2325328bbd Update mjibson ssh key 2015-11-06 15:34:52 -05:00
Jimmy Zelinskie
e973289397 Revert "Revert "Merge pull request #682 from jzelinskie/revertrevert""
This reverts commit 278bc736e3.
2015-10-23 15:26:33 -04:00
Jimmy Zelinskie
278bc736e3 Revert "Merge pull request #682 from jzelinskie/revertrevert"
This reverts commit 627ad25c9c, reversing
changes made to 31c392fecc.
2015-10-22 16:02:07 -04:00
Jimmy Zelinskie
46b2f10d7f check for VPC subnet ID before using builder VPC
This means you can use legacy networking machines by simply changing the
instance type and removing the specified 'EC2_VPC_SUBNET_ID' from the
executor config.
2015-10-22 14:50:54 -04:00
Jimmy Zelinskie
39cfe77d42 Revert "Merge pull request #557 from coreos-inc/revert-migration"
This reverts commit c4f938898a, reversing
changes made to 7ad2522dbe.
2015-10-21 15:29:57 -04:00
Joseph Schorr
0f37e66cc8 Better error handling for the build manager
Fixes #604
2015-10-13 11:40:07 -04:00
Matt Jibson
87cc3289a0 Remove transaction from metric reporting 2015-10-06 01:28:43 -04:00
Joseph Schorr
752d05dedb Add exception logging to the build manager
Fixes #547
2015-09-30 15:49:35 -04:00
Joseph Schorr
2d3092b826 Make build system resistant to Redis being broken
Fixes #549
2015-09-30 15:15:10 -04:00
Silas Sewell
9000169b53 Revert "Merge pull request #491 from jakedt/migratebackp2"
This reverts commit 7ad2522dbe, reversing
changes made to a0b191ffa1.
2015-09-28 16:09:22 -04:00
josephschorr
7ad2522dbe Merge pull request #491 from jakedt/migratebackp2
Migrate image data back phase 2
2015-09-26 15:11:46 -04:00
Matt Jibson
bba1557437 Monitor queue adds and EC2 node starts
fixes #157
see #304
2015-09-18 16:21:16 -04:00
Jake Moshenko
8baacd2741 Migrate old data to new locations, read only new. 2015-09-17 15:47:13 -04:00
Jimmy Zelinskie
cb6b6c4091 buildman: add silas keys to builders 2015-09-09 16:53:19 -04:00
Jimmy Zelinskie
0365831015 add barakmich, quentin, mjibson keys to builders
Fixes coreos-inc/quay-policies#38
2015-08-27 11:42:53 -04:00
Jimmy Zelinskie
239f76d39f Merge pull request #368 from coreos-inc/buildarchive
Allow builds to be started with an external archive URL
2015-08-17 17:09:14 -04:00
Joseph Schorr
f092c00621 Allow builds to be started with an external archive URL
Fixes #114
2015-08-17 17:01:49 -04:00
Matt Jibson
cfb6e884f2 Refactor metric collection
This change adds a generic queue onto which metrics can be pushed. A
separate module removes metrics from the queue and adds them to Cloudwatch.
Since these are now separate ideas, we can easily change the consumer from
Cloudwatch to anything else.

This change maintains near feature parity (the only change is there is now
just one queue instead of two - not a big deal).
2015-08-12 12:15:52 -04:00
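The producer/consumer split described in this commit could be sketched as follows. `MetricQueue`, `put`, and `drain` are hypothetical names chosen for illustration; the consumer is any callable, so a Cloudwatch sender or another backend could be passed in without changing the producer side.

```python
import queue


class MetricQueue:
    """Hypothetical generic queue that decouples metric producers from the sender."""

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, name, value, **dimensions):
        # Producers only enqueue; they know nothing about the backend.
        self._queue.put((name, value, dimensions))

    def drain(self, consumer):
        """Pass every queued metric to `consumer` and return how many were sent."""
        sent = 0
        while True:
            try:
                name, value, dimensions = self._queue.get_nowait()
            except queue.Empty:
                return sent
            consumer(name, value, dimensions)
            sent += 1
```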
Jake Moshenko
18100be481 Refactor the util directory to use subpackages. 2015-08-03 16:04:19 -04:00
Jimmy Zelinskie
7dbcbe4706 Merge pull request #234 from coreos-inc/morespace
Increase the HD size on the build nodes
2015-07-27 15:35:45 -04:00
Jake Moshenko
3efaa255e8 Accidental refactor, split out legacy.py into separate submodules and update all call sites. 2015-07-17 11:56:15 -04:00
Joseph Schorr
04cc471585 Increase the HD size on the build nodes
Fixes #228
2015-07-14 15:20:17 +03:00
Joseph Schorr
d842881608 Don't None the build_status, as it might still be used later 2015-07-14 12:49:03 +03:00
Joseph Schorr
e06435fee4 Record phase information and make better error messages on pull failure 2015-06-30 18:04:44 +03:00
Joseph Schorr
6655c7f745 Add exception handling that doesn't log the read-timeout exception
Note: This is a *hack* and needs to be replaced with proper code ASAP
2015-06-25 23:35:29 -04:00