This was causing problems for customers using georeplication over unstable storage engines
Also adds tests for stream_write and copy, to ensure we detect failure
Instead of having the Swift storage engine try to delete the empty chunk(s) synchronously, we simply queue them and have a worker come along after 30s to delete them. This has a few key benefits: it is async (it doesn't slow down the push code), it helps deal with Swift's eventual consistency (fewer retries are necessary), and it is generic enough for other storage engines to reuse if/when they need it.
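A minimal sketch of that queue-plus-worker pattern, assuming a storage engine exposing a `remove(location, path)` call; the queue, names, and delay handling here are illustrative, not Quay's actual work-queue API:

```python
import queue
import time

DELETION_DELAY_SECONDS = 30
cleanup_queue = queue.Queue()

def queue_chunk_cleanup(location, chunk_path):
    """Called from the push path; returns immediately instead of deleting inline."""
    cleanup_queue.put((time.time() + DELETION_DELAY_SECONDS, location, chunk_path))

def chunk_cleanup_worker(storage):
    """Background worker; the delay also gives Swift's eventual consistency time to settle."""
    while True:
        ready_at, location, path = cleanup_queue.get()
        wait = ready_at - time.time()
        if wait > 0:
            time.sleep(wait)
        try:
            # Any storage engine with a remove(location, path) call can reuse this.
            storage.remove(location, path)
        except Exception:
            pass  # chunk may already be gone; real code would retry or log
        cleanup_queue.task_done()

# Run in the background, e.g. threading.Thread(target=chunk_cleanup_worker, args=(storage,)).start()
```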
Amazon S3 does not allow for chunk sizes larger than 5 GB; we currently don't handle that case at all, which is why large uploads are failing. This change ensures that if a storage engine specifies a *maximum* chunk size, we write multiple chunks no larger than that size.
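A rough sketch of that chunking logic, assuming a hypothetical storage interface with a `maximum_chunk_size` attribute and an `upload_chunk(upload_id, offset, fileobj, length)` method (not the exact Quay code):

```python
def stream_upload_in_chunks(storage, upload_id, in_fp, total_size):
    """Write the stream as multiple chunks, none larger than the engine's maximum."""
    # With no declared maximum, everything goes up as a single chunk, as before.
    max_chunk = getattr(storage, 'maximum_chunk_size', None) or total_size
    offset = 0
    while offset < total_size:
        # For S3 this caps each chunk at 5 GB, so large uploads no longer fail.
        length = min(max_chunk, total_size - offset)
        storage.upload_chunk(upload_id, offset, in_fp, length)
        offset += length
    return offset
```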
If credentials are not specified, boto will fall back to reading them from the IAM role when running on an EC2 machine. This should be safe, since the validator will still ensure the credentials work even when they are not specified explicitly.
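For illustration, a sketch of that fallback using boto 2 style calls; the helper names are assumptions, not Quay's exact code:

```python
import boto

def connect_s3(s3_access_key=None, s3_secret_key=None):
    if s3_access_key and s3_secret_key:
        return boto.connect_s3(s3_access_key, s3_secret_key)
    # With no explicit keys, boto walks its credential chain, which on an EC2
    # machine includes the instance's IAM role.
    return boto.connect_s3()

def validate_s3_storage(conn, bucket_name):
    # The validator still exercises whatever credentials were resolved,
    # explicit or IAM-provided, so a misconfiguration is caught either way.
    conn.get_bucket(bucket_name)
```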
Fixes #1707
Move prometheus to SaaS and make it a plugin
Move static callers to use metrics_queue plugin
Change local-docker to support different quay clone dirnames
Change prom_aggregator to use logrus
This entails writing a metric aggregation program, since under gunicorn each Python worker has its own memory and therefore its own copy of the metrics. The Python client is a simple wrapper that makes web requests to the aggregator.
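A minimal sketch of such a client wrapper; the endpoint path and payload shape are illustrative assumptions, not the real prom_aggregator API:

```python
import requests

class MetricQueueClient:
    """Forwards metric updates over HTTP to the single aggregator process,
    which owns the Prometheus state shared across gunicorn workers."""

    def __init__(self, aggregator_url, timeout=1):
        self._url = aggregator_url
        self._timeout = timeout

    def _push(self, payload):
        try:
            requests.post(self._url + '/call', json=payload, timeout=self._timeout)
        except requests.RequestException:
            # Metrics are best-effort; a metrics failure must never break a request.
            pass

    def observe(self, name, value, labelvalues=None):
        self._push({'Name': name, 'Value': value, 'LabelValues': labelvalues or []})
```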
The problem only happens when a user has configured the new AWS Frankfurt
region for their S3 backend. It is the only region to require the new
v4 signature. All other regions support both v2 and v4. I'm not sure
which version is used by default on US Standard.
We could attempt to figure out where the bucket is hosted based on its
DNS resolution and auto-populate the host field that way. But I think
the amount of effort needed to make that work correctly outweighs the
benefit, given how simple the manual fix is.
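For reference, a sketch of what the Frankfurt case takes with boto 2; the `use-sigv4` config option and explicit host are boto's mechanism, but the surrounding function is hypothetical:

```python
import boto

def connect_frankfurt_bucket(access_key, secret_key, bucket_name):
    # Frankfurt (eu-central-1) only accepts Signature Version 4, so boto must
    # be told to use it and must be given the region-specific host explicitly;
    # this is why the host field ends up being user-configured.
    if not boto.config.has_section('s3'):
        boto.config.add_section('s3')
    boto.config.set('s3', 'use-sigv4', 'True')

    conn = boto.connect_s3(access_key, secret_key,
                           host='s3.eu-central-1.amazonaws.com')
    return conn.get_bucket(bucket_name)
```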
Fixes #863, fixes #764
The previous change to this file didn't raise the error up to stream_write,
so the complete_upload function still ran because the error only broke out
of the loop; complete_upload then errored because the upload had already
been cancelled. This is still better than what we had before, which was to
silently fail but report success (even internally to ourselves!) on a bad
image upload.
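An illustrative before/after of that control flow; the method names mirror the description above but this is not the exact code:

```python
# Before: a chunk failure only broke out of the loop, so complete_upload()
# still ran against an upload that had already been cancelled.
def stream_write_before(chunks, uploader):
    for chunk in chunks:
        try:
            uploader.upload_chunk(chunk)
        except IOError:
            uploader.cancel_upload()
            break                    # error swallowed here
    uploader.complete_upload()       # still runs after a failure

# After: the error is raised up to the caller, complete_upload() is never
# reached, and the failure is reported instead of silently recorded as success.
def stream_write_after(chunks, uploader):
    for chunk in chunks:
        try:
            uploader.upload_chunk(chunk)
        except IOError:
            uploader.cancel_upload()
            raise
    uploader.complete_upload()
```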
This means we discovered a bug where a user's image upload could fail, yet
Quay would still write that image to the repository, potentially writing
broken images to S3.