josephschorr
0c2b4ed9c1
Merge pull request #1897 from coreos-inc/hash-executor-whitelist
...
Add hash-based staged rollout to build executors
2016-09-30 17:52:19 +02:00
Joseph Schorr
f50bb8a1ce
Add missing call to set_phase when a build doesn't start
...
This change fixes the build manager ephemeral executor to tell the overall build server to call set_phase when a build never starts. Before this change, we'd properly adjust the queue item, but not the repo build row or the logs, which is why users just saw "Preparing Build Node", with no indication that the node had failed to start.
Fixes #1904
2016-09-30 14:54:49 +02:00
Joseph Schorr
51a519f653
Add hash-based staged rollout to build executors
...
Fixes #1882
2016-09-29 22:48:42 +02:00
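The hash-based staged rollout named in the commit above can be illustrated with a minimal sketch. The log does not say what is hashed or how the threshold is configured, so the key (for example a namespace name), the function names, and the percentage parameter below are all assumptions made for illustration.

```python
import hashlib


def rollout_bucket(key, buckets=100):
    """Map a key to a stable bucket in [0, buckets).

    Hypothetical helper: a cryptographic hash is used only because it is
    stable across processes, unlike Python's built-in hash().
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_staged_rollout(key, percent):
    """Return True if the key falls inside the first `percent` buckets.

    Assumption: `percent` is an integer from 0 (nobody) to 100 (everybody).
    """
    return rollout_bucket(key) < percent
```

Because the bucket depends only on the key, a given key keeps the same answer as the percentage grows, so raising the threshold only ever adds keys to the rollout.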
Evan Cordell
832ee89923
Add duration metric collector decorator ( #1885 )
...
Track time-to-start for builders
Track time-to-build for builders
Track ec2 builder fallbacks
Track build time
2016-09-29 15:44:06 -04:00
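A duration-collecting decorator of the kind this commit describes might look roughly like the sketch below. The names `duration_collector` and `record` are hypothetical, and the real decorator presumably reports to the project's metric backend rather than to an arbitrary callable.

```python
import functools
import time


def duration_collector(record, name):
    """Build a decorator that reports how long the wrapped call took.

    Assumption: `record` is any callable accepting (metric_name, seconds).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Report the duration even when the wrapped call raises.
                record(name, time.monotonic() - start)
        return wrapper
    return decorator
```

Recording in a `finally` block means failed calls are timed as well, which matters if failures are what make a step slow.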
Brad Ison
593c3eb9c7
Set dnsPolicy to Default on k8s build jobs
...
This prevents the builder pods from having resolv.conf pointed at the
kube-dns service, which they won't have access to.
2016-09-29 11:22:11 -04:00
Brad Ison
631ad0422d
Default to 4GB memory for k8s builders
2016-09-29 11:20:49 -04:00
Joseph Schorr
02b8afe127
Add labeling of built manifests with their build IDs
...
Also sends the digests to the notification
Fixes #593
2016-09-29 10:58:45 +02:00
josephschorr
ad4efba802
Merge pull request #1830 from coreos-inc/superuser-dashboard
...
Add prometheus stats to enable better dashboarding
2016-09-26 17:19:22 +02:00
Brad Ison
0fadc745cf
Revert "Use Google public DNS in builder VMs"
...
This reverts commit a331eecd0f.
2016-09-20 12:06:19 -04:00
Joseph Schorr
1571b2867a
Add executor name to the build metric
2016-09-16 16:26:04 -04:00
Joseph Schorr
f9f60b9faf
Fix some issues around state in the build managers
...
- Make sure to cleanup the job if the executor could not be started
- Change the setup leeway to further ensure there isn't any crossover between the queue item timing out and the cleanup of the jobs
- Make the lock used for marking jobs as internal error extremely long, but also based on the execution ID. This should ensure we don't get duplicates while allowing different executions to be handled properly.
- Make sure to invoke the callback update for the queue before we run off to etcd; should reduce certain timeouts
Hopefully Fixes #1836
2016-09-15 14:37:45 -04:00
Brad Ison
a331eecd0f
Use Google public DNS in builder VMs
2016-09-12 15:05:13 -04:00
Joseph Schorr
b5f9666a03
Add labels to the QEMU image with the CoreOS channel and version
2016-09-12 13:01:59 -04:00
Joseph Schorr
818ea38dac
Add repo-specific reporting of repository builds
2016-09-09 15:36:54 -04:00
Brad Ison
2a1cf2bfd1
Always pull latest image in k8s builds
2016-09-08 15:00:12 -04:00
Joseph Schorr
e67b95ae04
Change log level of an expected log message
2016-08-31 17:25:54 -04:00
Brad Ison
6365b6dbfb
Set defaults in qemu-coreos entrypoint
2016-08-31 15:49:21 -04:00
Joseph Schorr
9e6e3a6c94
Remove our names from the checked in keys
...
This means they won't go out in the QE binary, nor will they be viewable on the ephemeral build nodes.
Longer term we should probably move these into the config dir.
2016-08-30 18:02:05 -04:00
Joseph Schorr
e17e0e4172
Add log for when the job key is written
2016-08-30 14:08:56 -04:00
Joseph Schorr
2fe896ba6a
Restore retries of jobs not started and add some leeway to the processing time
2016-08-30 13:57:26 -04:00
Joseph Schorr
292abb5395
Better handling and logging of exceptions in build manager
...
Also increases the setup timeout for EC2
2016-08-30 13:52:36 -04:00
Brad Ison
e5cc97d462
Time qemu-img resize in qemu-coreos startup
2016-08-29 15:23:30 -04:00
Joseph Schorr
cd2d0341a7
Fix k8s builder to use the declared volume size
...
Fixes #1773
2016-08-29 15:16:28 -04:00
Joseph Schorr
bc670611ef
Increase the timeout on the atomic lock
...
Some nodes were still performing the action twice when falling outside of the 30s window
2016-08-23 12:50:38 -04:00
Joseph Schorr
3112388004
Fix multiple reporting of incomplete
2016-08-17 16:01:28 -04:00
Joseph Schorr
0b50928900
Fix build start check for the ephemeral case
2016-08-16 17:18:57 -04:00
Joseph Schorr
433b157531
Add extra check to ensure a build cannot be started without on_ready called
2016-08-16 16:38:48 -04:00
Joseph Schorr
5e1a117ff3
Delete the job first to prevent Kubernetes from starting another pod
2016-08-16 16:33:43 -04:00
Joseph Schorr
742e153133
Fix watch of the jobs key in the build manager
2016-08-16 15:43:09 -04:00
Joseph Schorr
313d65a6a4
Make sure the etcd watch coroutines get called
2016-08-16 13:02:27 -04:00
josephschorr
cddba20ffe
Merge pull request #1731 from coreos-inc/k8s-cleanup
...
Cleanup old executions that never start
2016-08-15 17:00:13 -04:00
Joseph Schorr
d78361b041
Cleanup old executions that never start
...
Fixes #1727
2016-08-15 16:54:02 -04:00
Brad Ison
d37f32b9c7
Add bison's SSH key to builders
2016-08-15 15:53:26 -04:00
Joseph Schorr
acdfc9369d
Allow the version of CoreOS to be specified when building QEMU image
2016-08-05 16:46:11 -04:00
Joseph Schorr
c29f9ccc7f
Fix TTL on heartbeat in etcd
...
Until now, once the heartbeat has expired, we would issue a TTL that is negative, which causes etcd to either raise an exception or simply ignore the expiration (depending on the version of etcd). This change ensures that once the key is expired, it is removed immediately via a set of a TTL of 0. Also adds tests for this case and the normal expiration case.
2016-08-03 11:15:03 -04:00
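The TTL fix described in this commit amounts to clamping the computed value at zero. A minimal sketch, assuming the TTL is derived from an absolute expiration timestamp (the function and parameter names are hypothetical):

```python
def heartbeat_ttl(expires_at, now):
    """Return the TTL in whole seconds to set on the heartbeat key.

    Both arguments are timestamps in seconds. Once the expiration has
    passed, the result is 0 rather than a negative number, so the key
    is removed immediately.
    """
    return max(0, int(expires_at - now))
```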
Joseph Schorr
428a7cb435
Fix decreased setup timeout on ephemeral build manager
2016-07-22 13:35:38 -04:00
Joseph Schorr
392242d20b
Another fix for the record keeping in buildman
...
Adds some more mocked tests as well
2016-07-22 12:01:30 -04:00
Joseph Schorr
68baa51d55
Fix cross-manager handling of realm components
2016-07-21 15:47:25 -04:00
Joseph Schorr
4420b1bac9
Add temporary back-compat shims for the build manager
2016-07-20 13:41:01 -04:00
Joseph Schorr
2c1880b944
Bug fixes, refactoring and "new" tests for the build manager
...
- Fixes various bugs introduced in the most recent build system commit
- Refactors state management in the build manager to be cleaner and more contained
- Adds back in the mock-based tests, fixed to not use threads and adjusted for the refactoring
- Adds some more simplified unit tests around non-etcd related flows
2016-07-18 13:46:48 -04:00
Joseph Schorr
74b87fa813
Build manager cleanup and more logging
2016-07-14 14:33:14 -04:00
Joseph Schorr
d8b72e8503
Switch to using a defined branch and not always pulling the VM image
2016-07-08 17:53:25 -04:00
Joseph Schorr
3d4af78f01
Fix label to never allow a space (which breaks Kubernetes)
2016-07-08 17:09:06 -04:00
Joseph Schorr
811413fe9c
Add multiple executor and whitelist support to build manager
2016-07-08 15:50:51 -04:00
Joseph Schorr
7471d0e35f
Small code cleanup before whitelist addition
2016-07-08 15:50:51 -04:00
Colin Hom
1e3351f3f4
local-docker.sh now accepts env vars
2016-07-08 15:50:51 -04:00
Colin Hom
bc13333f20
Kubernetes build worker
2016-07-08 15:50:51 -04:00
Joseph Schorr
713ba3abaf
Further updates to the Prometheus client code
2016-07-01 14:16:51 -04:00
Matt Jibson
3d9acf2fff
Use prometheus as a metric backend
...
This entails writing a metric aggregation program since each worker has its
own memory, and thus own metrics because of python gunicorn. The python
client is a simple wrapper that makes web requests to it.
2016-07-01 14:16:50 -04:00
Joseph Schorr
1173192739
Move channel back, as it is referenced by generate_cloud_config
2016-06-22 17:25:06 -04:00
Joseph Schorr
61695eb439
Allow the build node AMI to be overridden in config
2016-06-22 15:13:54 -04:00
josephschorr
20a6fdc73f
Merge pull request #1557 from jzelinskie/buildargs
...
buildman: mark missing buildargs as failure
2016-06-20 14:40:17 -04:00
Jimmy Zelinskie
871c1634ed
buildman: mark missing buildargs as failure
2016-06-17 18:33:54 -04:00
Joseph Schorr
7292524d69
Add a cloud watch metric when we fail to start a build via EC2
...
Fixes #1555
2016-06-17 16:19:57 -04:00
Jimmy Zelinskie
5298452fa7
builder cloudconfig: shutdown server after 3 hours ( #1554 )
2016-06-17 16:03:40 -04:00
Joseph Schorr
f9469a84b3
Make the size of the build node HDD configurable
...
Fixes #1520
2016-06-06 11:35:10 -04:00
Jimmy Zelinskie
7d356c451b
buildman: fix misspell
2016-06-03 15:42:14 -04:00
Jimmy Zelinskie
44b56ae2cf
queue: explicitly declare ordering requirement
...
This change defaults the ordering requirement of queue items to be off
and only enables it for the build manager. This should make the queries
for getting queueitems significantly faster for every other use case.
2016-05-27 14:44:30 -04:00
Jimmy Zelinskie
79aa78906a
buildman: refresh and add Evan's key to builders
2016-05-24 14:05:39 -04:00
Joseph Schorr
5262535945
Boto error_code is a string, not the HTTP status code
2015-12-23 15:12:01 -05:00
Jimmy Zelinskie
601b99a083
buildman: add git checkout failure
2015-12-16 14:49:37 -05:00
Joseph Schorr
773e73861f
Change error into info in build manager
...
Fixes #1046
2015-12-09 14:30:14 -05:00
josephschorr
c06e5cc9c7
Merge pull request #1002 from coreos-inc/buildertagexc
...
Add timeout and failure if an EC2 instance could not be found when ta…
2015-12-09 14:28:31 -05:00
Joseph Schorr
946e5fabc0
Add timeout and failure if an EC2 instance could not be found when tagging
...
Fixes #994
2015-12-09 14:28:19 -05:00
Joseph Schorr
edd9a03af5
Catch additional key not found exception
...
Fixes #806
2015-12-01 12:29:58 -05:00
Joseph Schorr
fbc4927544
Change to only exception logging internal errors on builds
...
Fixes #993
2015-11-30 14:30:55 -05:00
Jake Moshenko
c4b637521c
Remove Matt Jibson's public key
2015-11-23 18:18:42 -05:00
Matt Jibson
2325328bbd
Update mjibson ssh key
2015-11-06 15:34:52 -05:00
Jimmy Zelinskie
e973289397
Revert "Revert "Merge pull request #682 from jzelinskie/revertrevert""
...
This reverts commit 278bc736e3.
2015-10-23 15:26:33 -04:00
Jimmy Zelinskie
278bc736e3
Revert "Merge pull request #682 from jzelinskie/revertrevert"
...
This reverts commit 627ad25c9c, reversing
changes made to 31c392fecc.
2015-10-22 16:02:07 -04:00
Jimmy Zelinskie
46b2f10d7f
check for VPC subnet ID before using builder VPC
...
This means you can use legacy networking machines by simply changing the
instance type and removing the specified 'EC2_VPC_SUBNET_ID' from the
executor config.
2015-10-22 14:50:54 -04:00
Jimmy Zelinskie
39cfe77d42
Revert "Merge pull request #557 from coreos-inc/revert-migration"
...
This reverts commit c4f938898a, reversing
changes made to 7ad2522dbe.
2015-10-21 15:29:57 -04:00
Joseph Schorr
0f37e66cc8
Better error handling for the build manager
...
Fixes #604
2015-10-13 11:40:07 -04:00
Matt Jibson
87cc3289a0
Remove transaction from metric reporting
2015-10-06 01:28:43 -04:00
Joseph Schorr
752d05dedb
Add exception logging to the build manager
...
Fixes #547
2015-09-30 15:49:35 -04:00
Joseph Schorr
2d3092b826
Make build system resistant to Redis being broken
...
Fixes #549
2015-09-30 15:15:10 -04:00
Silas Sewell
9000169b53
Revert "Merge pull request #491 from jakedt/migratebackp2"
...
This reverts commit 7ad2522dbe, reversing
changes made to a0b191ffa1.
2015-09-28 16:09:22 -04:00
josephschorr
7ad2522dbe
Merge pull request #491 from jakedt/migratebackp2
...
Migrate image data back phase 2
2015-09-26 15:11:46 -04:00
Matt Jibson
bba1557437
Monitor queue adds and EC2 node starts
...
fixes #157
see #304
2015-09-18 16:21:16 -04:00
Jake Moshenko
8baacd2741
Migrate old data to new locations, read only new.
2015-09-17 15:47:13 -04:00
Jimmy Zelinskie
cb6b6c4091
buildman: add silas keys to builders
2015-09-09 16:53:19 -04:00
Jimmy Zelinskie
0365831015
add barakmich, quentin, mjibson keys to builders
...
Fixes coreos-inc/quay-policies#38
2015-08-27 11:42:53 -04:00
Jimmy Zelinskie
239f76d39f
Merge pull request #368 from coreos-inc/buildarchive
...
Allow builds to be started with an external archive URL
2015-08-17 17:09:14 -04:00
Joseph Schorr
f092c00621
Allow builds to be started with an external archive URL
...
Fixes #114
2015-08-17 17:01:49 -04:00
Matt Jibson
cfb6e884f2
Refactor metric collection
...
This change adds a generic queue onto which metrics can be pushed. A
separate module removes metrics from the queue and adds them to Cloudwatch.
Since these are now separate ideas, we can easily change the consumer from
Cloudwatch to anything else.
This change maintains near feature parity (the only change is there is now
just one queue instead of two - not a big deal).
2015-08-12 12:15:52 -04:00
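The producer/consumer split described in this commit can be sketched as follows. The class and method names are hypothetical, and the sink is an arbitrary callable standing in for Cloudwatch or any other backend.

```python
import queue


class MetricQueue:
    """Generic queue onto which producers push metrics."""

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, name, value, **dimensions):
        # Producers only enqueue; they know nothing about the backend.
        self._queue.put((name, value, dimensions))

    def drain(self, sink, max_items=100):
        """Pass up to max_items queued metrics to sink; return the count sent."""
        sent = 0
        while sent < max_items:
            try:
                name, value, dimensions = self._queue.get_nowait()
            except queue.Empty:
                break
            sink(name, value, dimensions)
            sent += 1
        return sent
```

Swapping the consumer then means passing a different `sink`, with no change to the code that pushes metrics.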
Jake Moshenko
18100be481
Refactor the util directory to use subpackages.
2015-08-03 16:04:19 -04:00
Jimmy Zelinskie
7dbcbe4706
Merge pull request #234 from coreos-inc/morespace
...
Increase the HD size on the build nodes
2015-07-27 15:35:45 -04:00
Jake Moshenko
3efaa255e8
Accidental refactor, split out legacy.py into separate submodules and update all call sites.
2015-07-17 11:56:15 -04:00
Joseph Schorr
04cc471585
Increase the HD size on the build nodes
...
Fixes #228
2015-07-14 15:20:17 +03:00
Joseph Schorr
d842881608
Don't None the build_status, as it might still be used later
2015-07-14 12:49:03 +03:00
Joseph Schorr
e06435fee4
Record phase information and make better error messages on pull failure
2015-06-30 18:04:44 +03:00
Joseph Schorr
6655c7f745
Add exception handling that doesn't log the read-timeout exception
...
Note: This is a *hack* and needs to be replaced with proper code ASAP
2015-06-25 23:35:29 -04:00
Joseph Schorr
6e6610f31a
Switch to a 30s maximum timeout
2015-06-25 23:08:49 -04:00
Joseph Schorr
bead839abd
Make sure build components timeout if the initial connection fails
2015-06-25 22:13:01 -04:00
Joseph Schorr
ecebc06343
Update comment now that restarter is abstracted
2015-06-25 21:53:42 -04:00
Joseph Schorr
9f5f71398c
Abstract out the concept of a restart function
2015-06-25 21:40:50 -04:00
Joseph Schorr
52fa9aad5b
Fix etcd watching
...
Etcd can miss events on watches if they are occurring fast enough, so if we get an exception indicating that we've missed an index, we reset the state of our local tracking structures by re-reading the *full* list and starting a new watch at HEAD.
2015-06-25 21:22:39 -04:00
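The recovery strategy in this commit (re-read the full list, then watch again from HEAD) can be sketched without a real etcd client. Everything below is an assumption made for illustration: `EventIndexCleared` stands in for whatever exception the client raises when an index has been missed, and `client` is any object with hypothetical `read_all()` and `watch(index)` methods.

```python
class EventIndexCleared(Exception):
    """Hypothetical stand-in for the client's 'missed index' error."""


class WatchTracker:
    """Keeps a local copy of watched keys and resyncs when events are missed."""

    def __init__(self, client):
        self.client = client
        self.jobs = {}
        self.next_index = 0
        self.resync()

    def resync(self):
        # Re-read the *full* list and restart the watch just past HEAD.
        snapshot, head_index = self.client.read_all()
        self.jobs = dict(snapshot)
        self.next_index = head_index + 1

    def poll(self):
        try:
            event = self.client.watch(self.next_index)
        except EventIndexCleared:
            # Events were missed; incremental tracking can no longer be trusted.
            self.resync()
            return
        if event is None:
            return
        index, key, value = event
        if value is None:
            self.jobs.pop(key, None)  # A value of None means the key was deleted.
        else:
            self.jobs[key] = value
        # Ask for the next index in order, so no gaps are skipped.
        self.next_index = index + 1
```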
Jimmy Zelinskie
1195e3ec7c
buildman: rm coroutine decorator from subscribers
...
Python isn't able to figure out that these are generators and properly
handle them.
2015-06-24 17:38:29 -04:00
josephschorr
2ade08468d
Merge pull request #168 from coreos-inc/etcdindex
...
Fix ephemeral build manager to ask for watches in index order with no gaps
2015-06-23 17:12:18 -04:00
Joseph Schorr
b4c39e8ec0
Fix ephemeral build manager to ask for watches in index order with no gaps
2015-06-23 17:11:46 -04:00
Jimmy Zelinskie
18aa7b6c1e
buildcomponent: use consistent trollius imports
2015-06-23 17:03:26 -04:00
Jimmy Zelinskie
197f3b9b85
buildman: fix ER failing to heartbeat
2015-06-22 18:12:20 -04:00
Jimmy Zelinskie
82287926ab
Merge pull request #140 from coreos-inc/eventinfo
...
Add more build information to the events and have better messaging
2015-06-17 16:49:59 -04:00
Joseph Schorr
c2dc1c9b75
Handle case where etcd key is already removed on job complete
2015-06-17 15:02:58 -04:00
Jimmy Zelinskie
177b96e965
builder: add missing 'yield from' coroutine
2015-06-17 14:16:27 -04:00
Jimmy Zelinskie
59aba93514
builder: update heartbeat timestamp on log message
2015-06-17 14:16:27 -04:00
Joseph Schorr
9b974f6b80
Add more build information to the events and have better messaging
...
Fixes #79
2015-06-16 23:16:36 -04:00
Jake Moshenko
c435f5c127
Add a comment about why we are taking a lock when terminating a builder machine.
2015-06-10 16:19:51 -04:00
Jake Moshenko
f767fc4d03
Track whether builders ever came online in etcd. Mark builds which never successfully heartbeated as incomplete.
2015-06-10 16:19:51 -04:00
Jake Moshenko
79f1181a63
Switch build-scheduled to an official build phase.
2015-06-10 16:19:51 -04:00
Jake Moshenko
884fedd229
Improve the log messages in the buildman.
2015-06-10 16:19:51 -04:00
Jake Moshenko
d31e25d5cd
Allow the individual build manager types to specify how long the queue should wait before retrying a job that fails to schedule.
2015-06-10 16:19:50 -04:00
Jimmy Zelinskie
b7303665a2
Merge pull request #111 from coreos-inc/incompletefix
...
Requeue build jobs after the work check timeout + some additional padding.
2015-06-09 20:44:40 -04:00
Joseph Schorr
24ce0decd9
Requeue build jobs after the work check timeout + some additional padding. This ensures that if a build somehow gets wedged, other builds can continue to be picked up.
2015-06-09 20:43:48 -04:00
Joseph Schorr
f82831bff6
Log the etcd exception so we can debug this issue
2015-06-09 20:33:55 -04:00
Jimmy Zelinskie
7f4dd7d42f
triggers: backwards compatible schema for metadata
2015-06-02 16:05:17 -04:00
Jimmy Zelinskie
e01bdd4ab0
triggers: metadata.commit_sha -> metadata.commit
...
This resolves an issue where the custom-git trigger's public facing
schema was not the same as the internal metadata schema. Instead of
breaking users, we rework the internal metadata schema to be the same as
the custom-git JSON schema. This commit also updates everything that
used `metadata.commit_sha` including the test database.
2015-06-02 15:32:28 -04:00
Joseph Schorr
5589bfc6d5
- Have the heartbeat fail to update if the worker has timed out
...
- Add additional build component logging for tracking down problems in the future
2015-05-22 15:24:14 -04:00
Jimmy Zelinskie
db05db6295
cloudconfig: flatten logentries container
2015-05-20 16:34:16 -04:00
Joseph Schorr
598fc6ec46
Add the error code to the worker error logged to redis
2015-05-18 15:01:48 -04:00
Joseph Schorr
91b464d0de
Switch build manager to always just WARN on boto
2015-05-18 12:34:26 -04:00
Jimmy Zelinskie
86f400fdf5
buildman: fix btrfs mounting in worker cloudconfig
2015-05-13 17:40:35 -04:00
Jimmy Zelinskie
6a5cecebc5
buildman: create and mount btrfs volume for docker
...
There are numerous issues with overlayfs that actually aren't present with
btrfs. Btrfs seems to have long-running issues, but our builders are
ephemeral. Example issue: https://github.com/docker/docker/issues/10180
2015-05-12 17:42:34 -04:00
Jimmy Zelinskie
9f31bdd571
buildman: add new io.quay.builder.gitfailure error
2015-05-11 15:25:22 -04:00
Jimmy Zelinskie
15fdae6688
buildman: show base error for buildpack failures
...
Whereas before these were reserved only for S3 errors, users need these
specifics to debug custom-git configurations.
2015-05-11 14:18:48 -04:00
Joseph Schorr
31260d50f5
Rename the new images method to a slightly better name
2015-04-24 16:37:37 -04:00
Joseph Schorr
e70343d849
Faster cache lookup by removing a join with the ImagePlacementTable, removing the extra loop to add the locations and filtering the images looked up by the base image
2015-04-24 16:22:19 -04:00
Jimmy Zelinskie
02498d72ba
almost all PR discussion fixes
2015-04-21 18:04:25 -04:00
Jimmy Zelinskie
ba2cb08904
Merge branch 'master' into git
2015-04-16 17:38:35 -04:00
Jake Moshenko
b10fd4ff22
Tell the journal on the builders to listen on the proper socket.
2015-03-27 16:31:35 -04:00
Jake Moshenko
6eead7c860
Add logentries reporting to the ephemeral builders.
2015-03-27 15:28:08 -04:00
Jake Moshenko
0349f3f1a3
Handle the case where YAML config returns a list not a tuple.
2015-03-26 14:53:56 -04:00
Jimmy Zelinskie
cd1b003ca6
buildcomponent: handle builds without resource_key
2015-03-23 15:46:23 -04:00
Jimmy Zelinskie
d29c8d60c7
trigger: pass trigger into manual_start & handle_trigger_request
2015-03-23 12:14:47 -04:00
Jimmy Zelinskie
b851986cf5
add git_url to metadata, add git to buildargs
2015-03-19 18:09:27 -04:00
Jimmy Zelinskie
b35f6ed25c
buildman: add git_key buildconfig parameter
2015-03-16 13:18:18 -04:00
Jimmy Zelinskie
4c8814866c
buildman: add git_url to build_config
2015-03-13 14:58:05 -04:00
Jimmy Zelinskie
8589871f43
buildman: rm unused imports
2015-03-09 13:04:16 -04:00
Jake Moshenko
5c68e52fce
Really really fix the exception handling.
2015-02-27 17:33:46 -05:00
Jake Moshenko
cf5bc6f0be
Properly catch multiple exceptions.
2015-02-27 17:32:10 -05:00
Jake Moshenko
857c3e2959
Start catching etcd key errors as well.
2015-02-27 17:10:15 -05:00
Joseph Schorr
d973f9df45
Reenable metrics until we know they are the problem
2015-02-25 16:00:46 -05:00
Joseph Schorr
bdb84f1c20
Merge branch 'master' of github.com:coreos-inc/quay
2015-02-25 16:00:17 -05:00
Joseph Schorr
4551b3a957
Remove the boto timeout set (doesn't work anyway) and add some better logging to the scheduler
2015-02-25 16:00:14 -05:00
Jimmy Zelinskie
090a198afc
temporarily comment out metrics
2015-02-25 15:29:35 -05:00
Jimmy Zelinskie
db79ad2dde
unused import
2015-02-25 15:26:36 -05:00
Joseph Schorr
5dd78f76c7
Add additional logging, timeouts, and exception checks
2015-02-25 15:15:22 -05:00
Jimmy Zelinskie
328de0201f
Merge branch 'master' of github.com:coreos-inc/quay
2015-02-25 13:56:05 -05:00
Jimmy Zelinskie
346d6b933a
buildman: initialize queuemetrics asynchronously
2015-02-25 13:55:18 -05:00
Joseph Schorr
2eaec092f0
Handle the case where we cannot write the tags on the build nodes
2015-02-25 13:47:36 -05:00