Commit graph

80 commits

Author SHA1 Message Date
Charlton Austin
3fd8c8a60d feature(app.py): adding queue_metrics to queues
publishing queue metrics for SRE

[none]
2017-02-14 16:01:28 -05:00
Joseph Schorr
b407f88a26 Remove unnecessary CloudWatch metrics
They are spamming the API and costing us a lot of money
2017-02-01 13:08:21 -05:00
Joseph Schorr
71ec23b550 Switch QueueItem state_id to be unique after a backfill 2017-01-18 17:43:41 -05:00
Joseph Schorr
8c4e86f48b Change queue to use state-field for claiming items
Before this change, the queue code would check that none of the fields on the item to be claimed had changed between the time the item was selected and the time it was claimed. While this is a safe approach, it also causes quite a bit of lock contention in MySQL, because InnoDB will take a lock on *any* rows examined by the `where` clause of the `update`, even if they will ultimately be thrown out due to other clauses (See: http://dev.mysql.com/doc/refman/5.7/en/innodb-locks-set.html: "A ..., an UPDATE, ... generally set record locks on every index record that is scanned in the processing of the SQL statement. It does not matter whether there are WHERE conditions in the statement that would exclude the row. InnoDB does not remember the exact WHERE condition, but only knows which index ranges were scanned").

As a result, we want to minimize the number of fields accessed in the `where` clause on an update to the QueueItem row. To do so, we introduce a new `state_id` column, which is updated on *every change* to the QueueItem rows with a unique, random value. We can then have the queue item claiming code simply check that the `state_id` column has not changed between the retrieval and claiming steps. This minimizes the number of columns being checked to two (`id` and `state_id`), and thus should significantly reduce lock contention. Note that we cannot (yet) reduce to just a single `state_id` column (which should work in theory), because we need to maintain backwards compatibility with existing items in the QueueItem table, which will be given empty `state_id` values when the migration in this change runs.

Also adds a number of tests for other queue operations that we want to make sure operate correctly following this change.

[Delivers #133632501]
2017-01-17 13:29:26 -05:00
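Note on 8c4e86f48b: the claiming scheme described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the commit: it uses an in-memory SQLite database instead of MySQL/peewee, and the table layout, the `available` column, and the helper names (`make_db`, `put`, `select_candidate`, `try_claim`) are hypothetical. Only the `id` and `state_id` columns come from the message.

```python
import sqlite3
import uuid


def make_db():
    # Hypothetical, simplified stand-in for the QueueItem table.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE queueitem ("
        "id INTEGER PRIMARY KEY, body TEXT, available INTEGER, state_id TEXT)"
    )
    return db


def put(db, body):
    # Every write stamps the row with a fresh random state_id.
    cur = db.execute(
        "INSERT INTO queueitem (body, available, state_id) VALUES (?, 1, ?)",
        (body, str(uuid.uuid4())),
    )
    db.commit()
    return cur.lastrowid


def select_candidate(db):
    # Retrieval step: remember the state_id that was seen.
    return db.execute(
        "SELECT id, state_id, body FROM queueitem "
        "WHERE available = 1 ORDER BY id LIMIT 1"
    ).fetchone()


def try_claim(db, item_id, seen_state_id):
    # Claiming step: the WHERE clause touches only id and state_id.
    # If another consumer changed the row in between, state_id no longer
    # matches, no row is updated, and the claim fails.
    cur = db.execute(
        "UPDATE queueitem SET available = 0, state_id = ? "
        "WHERE id = ? AND state_id = ?",
        (str(uuid.uuid4()), item_id, seen_state_id),
    )
    db.commit()
    return cur.rowcount == 1
```

A second consumer that read the same candidate and calls `try_claim` with the stale `state_id` gets `False` back and can move on to another item.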
Joseph Schorr
19cb64df5d Remove unused class 2017-01-17 13:26:09 -05:00
Joseph Schorr
7f63cbd14f Remove FOR UPDATE in Queue cancel and complete
We have no need for them anymore and it should reduce lock contention a bit

Fixes #776
2017-01-17 13:26:09 -05:00
Joseph Schorr
1cbacbbb63 Add tool for handling abusing users 2017-01-13 14:42:03 -05:00
Jimmy Zelinskie
00eafff747 Merge pull request #2204 from jzelinskie/429builds
add rate limiting to build queues
2016-12-07 15:03:31 -05:00
Jimmy Zelinskie
ebbe58d311 replace prefix w/ canonical name list 2016-12-07 12:56:56 -05:00
Jimmy Zelinskie
c41de8ded6 build queue rate limiting: address PR comments 2016-12-06 20:40:54 -05:00
Jimmy Zelinskie
eb69abff8b build rate limiting: tests 2016-12-06 16:30:12 -05:00
Jimmy Zelinskie
57770493fa build rate limiting: use a rate 2016-12-06 16:30:12 -05:00
Jimmy Zelinskie
7877c6ab94 add rate limiting to build queues 2016-12-06 16:30:12 -05:00
Jake Moshenko
21e3001446 Add a bulk insert for queue and notifications.
Use it for Clair spawned notifications.
2016-12-06 14:00:16 -05:00
Jimmy Zelinskie
3a7119d499 Merge pull request #2209 from coreos-inc/clair-notification-read
Clair notification read and queue fixes
2016-12-05 19:36:59 -05:00
Joseph Schorr
97d150e281 Have QSS only add security scanner notifications once 2016-12-05 19:08:20 -05:00
Jake Moshenko
7c490b46c8 Only save dirty fields on Queue queries. 2016-12-05 18:12:14 -05:00
Charlton Austin
0a6322015c Fix the queue item delete. 2016-12-02 15:30:35 -05:00
Joseph Schorr
e29cb34336 Fix Set calls to gauges
Fixes #2150

The proper function is `Set` (not `set`), which was causing these gauges to not report to Prometheus
2016-11-21 15:27:17 -05:00
Joseph Schorr
73eb66eac5 Add support for deleting namespaces (users, organizations)
Fixes #102
Fixes #105
2016-10-21 15:41:09 -04:00
Jimmy Zelinskie
20ef43d5fb workers.queuecleanup: remove direct peewee usage 2016-10-20 13:46:00 -04:00
Jimmy Zelinskie
64d0c5b675 data.queue: fix race condition
It's possible that multiple consumers will acquire a queue item if they
race on an expired item. To mitigate this, we check that the
processing_expires time hasn't been changed since we last read.
2016-07-14 15:34:22 -04:00
Jimmy Zelinskie
609f4fccd8 data.queue: simplify put method 2016-07-14 15:34:22 -04:00
Joseph Schorr
713ba3abaf Further updates to the Prometheus client code 2016-07-01 14:16:51 -04:00
Jake Moshenko
668a8edc50 Refactor prometheus integration
Move prometheus to SaaS and make it a plugin
Move static callers to use metrics_queue plugin
Change local-docker to support different quay clone dirnames
Change prom_aggregator to use logrus
2016-07-01 14:16:50 -04:00
Matt Jibson
3d9acf2fff Use prometheus as a metric backend
This entails writing a metric aggregation program, because under gunicorn
each Python worker process has its own memory, and thus its own metrics.
The Python client is a simple wrapper that makes web requests to it.
2016-07-01 14:16:50 -04:00
Jimmy Zelinskie
1f488acf12 data.queue: move name matching clause 2016-05-31 15:44:11 -04:00
Jimmy Zelinskie
26300d3c8e data.queue: lint 2016-05-27 14:51:19 -04:00
Jimmy Zelinskie
8a5aa65d74 data.queue: limiting before order by rand 2016-05-27 14:44:30 -04:00
Jimmy Zelinskie
44b56ae2cf queue: explicitly declare ordering requirement
This change defaults the ordering requirement of queue items to be off
and only enables it for the build manager. This should make the queries
for getting queueitems significantly faster for every other use case.
2016-05-27 14:44:30 -04:00
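Note on 44b56ae2cf: one way to picture an ordering requirement that is off by default. This is a pure-Python sketch, not the query code from the commit; the function name `pick_item`, the `available_after` key, and the choice of a random pick when ordering is not required are all assumptions made for illustration.

```python
import random


def pick_item(available, ordering_required=False, rng=random):
    """Pick one queue item from a list of candidate dicts.

    ordering_required defaults to False, so callers must opt in
    (as the build manager would) to pay for sorted selection.
    """
    if not available:
        return None
    if ordering_required:
        # Opt-in path: oldest item first (hypothetical sort key).
        return min(available, key=lambda item: item["available_after"])
    # Default path: any available item will do, so no sort is needed.
    return rng.choice(available)
```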
Joseph Schorr
f498e92d58 Implement against new Clair paginated notification system 2016-02-25 15:58:42 -05:00
Joseph Schorr
01723d5546 Catch other cases where the queue item has been removed
Fixes #1096
2015-12-22 15:58:51 -05:00
Matt Jibson
a994b367da Refactor queue locking to not use select for update
The test suggests this works.

fixes #622
2015-11-03 11:32:28 -05:00
Matt Jibson
87cc3289a0 Remove transaction from metric reporting 2015-10-06 01:28:43 -04:00
Matt Jibson
4da66c1219 Move the metric put outside the transaction 2015-09-21 13:37:49 -04:00
Jimmy Zelinskie
2ff77df946 Merge pull request #518 from jzelinskie/fixmysqlssl
move UseThenDisconnect into queueworker
2015-09-21 13:35:35 -04:00
Jimmy Zelinskie
7c82e0b5b3 move UseThenDisconnect into queueworker
This makes the tests pass while maintaining the same behavior.
2015-09-21 13:34:12 -04:00
Jimmy Zelinskie
0de17627d5 Merge pull request #517 from jzelinskie/fixmysqlssl
close connections after getting queue metrics
2015-09-21 12:28:23 -04:00
Jimmy Zelinskie
98d6262a7f close connections after getting queue metrics 2015-09-21 12:21:39 -04:00
Matt Jibson
bba1557437 Monitor queue adds and EC2 node starts
fixes #157
see #304
2015-09-18 16:21:16 -04:00
Matt Jibson
39dc4c7d8d Monitor various sizes for queues
see #304
2015-09-14 15:57:08 -04:00
Joseph Schorr
96d5bbb155 Fix exceptions raised by the diffs worker
Fixes #465
2015-09-10 14:12:16 -04:00
Matt Jibson
fc671f3dde Fix test_queue.py tests
This restores the reporter class as was before the metrics changes.
2015-08-17 17:22:46 -04:00
Matt Jibson
cfb6e884f2 Refactor metric collection
This change adds a generic queue onto which metrics can be pushed. A
separate module removes metrics from the queue and adds them to Cloudwatch.
Since these are now separate ideas, we can easily change the consumer from
Cloudwatch to anything else.

This change maintains near feature parity (the only change is there is now
just one queue instead of two - not a big deal).
2015-08-12 12:15:52 -04:00
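Note on cfb6e884f2: the split between a generic metric queue and a separate consumer could look roughly like this. It is a sketch only; the class name `MetricQueue`, its `put` and `drain` methods, and the `sink` callback are hypothetical and stand in for whatever backend (Cloudwatch or anything else) actually receives the metrics.

```python
import queue


class MetricQueue:
    """Producers push metrics here without knowing who consumes them."""

    def __init__(self):
        self._q = queue.Queue()

    def put(self, name, value, **dimensions):
        # Producer side: enqueue and return immediately.
        self._q.put((name, value, dimensions))

    def drain(self, sink):
        # Consumer side: hand every queued metric to the sink callable
        # and return how many were sent. Swapping the backend means
        # passing a different sink; producers are unchanged.
        sent = 0
        while True:
            try:
                name, value, dimensions = self._q.get_nowait()
            except queue.Empty:
                return sent
            sink(name, value, dimensions)
            sent += 1
```

Because the producer only ever calls `put`, changing the consumer is a matter of passing a different `sink` to `drain`.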
Joseph Schorr
d480a204f5 Revert change to queue 2015-08-05 15:27:33 -04:00
Jake Moshenko
ed62339f89 Improve the performance of queue candidate queries. 2015-08-04 18:20:54 -04:00
Joseph Schorr
5f605b7cc8 Fix queue handling to remove the dependency from repobuild, and have a cancel method 2015-02-23 13:38:01 -05:00
Joseph Schorr
524705b88c Get dashboard working and upgrade bootstrap. Note: the bootstrap fixes will be coming in the followup CL 2015-02-17 19:15:54 -05:00
Joseph Schorr
f84d1bad45 Handle internal errors in a better fashion: If a build would be marked as internal error, only do so if there are retries remaining. Otherwise, we mark it as failed (since it won't be rebuilt anyway) 2015-02-12 16:19:44 -05:00
Jake Moshenko
ce7033489b Hopefully fix the deadlock in the queue. 2015-02-03 14:50:01 -05:00