Optimize repository search by changing our lookup strategy

Prior to this change, repositories were looked up unfiltered in six different queries and then filtered using the permissions model, which issued a query per repository found, making search incredibly slow. Instead, we now look up a chunk of repositories unfiltered and then filter them via a single query to the database. By layering the filtering on top of the lookup, each as a query, we minimize the number of queries necessary without resorting to a super expensive join.

Other changes:
- Remove the 5 page pre-lookup on V1 search and simply report that one more page is available, until there isn't. While technically not correct, it is much more efficient, and no one should be using pagination with V1 search anyway.
- Remove the lookup for repos without entries in the RAC table. Instead, we now add a new RAC entry when the repository is created, dated *the day before* and with count 0, so that it is immediately searchable.
- Remove the lookup of results with a matching namespace; these aren't very relevant anyway, and it overly complicates sorting.
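The layered lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `repo` and `perm` tables, the `make_demo_db`, `filter_visible` and `search` helpers, and the use of `sqlite3` are all assumptions made so the example is self-contained.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE repo (id INTEGER PRIMARY KEY, name TEXT, is_public INTEGER);
        CREATE TABLE perm (repo_id INTEGER, username TEXT);
    """)
    conn.executemany("INSERT INTO repo VALUES (?, ?, ?)", [
        (1, "web-public", 1),
        (2, "web-private", 0),
        (3, "web-team", 0),
        (4, "db-public", 1),
    ])
    conn.execute("INSERT INTO perm VALUES (3, 'alice')")
    return conn


def filter_visible(conn, repo_ids, username):
    """Return the subset of repo_ids the user may read, using a single query."""
    if not repo_ids:
        return set()
    placeholders = ",".join("?" for _ in repo_ids)
    sql = (
        "SELECT id FROM repo WHERE id IN (%s) "
        "AND (is_public = 1 OR id IN "
        "(SELECT repo_id FROM perm WHERE username = ?))" % placeholders
    )
    # A username of None matches no permission rows, so only public repos pass.
    rows = conn.execute(sql, list(repo_ids) + [username])
    return {row[0] for row in rows}


def search(conn, query, username, limit, offset=0, chunk_size=50):
    """Look up unfiltered chunks, filtering each chunk with one extra query."""
    visible = []
    needed = offset + limit
    chunk_offset = 0
    while len(visible) < needed:
        chunk = [row[0] for row in conn.execute(
            "SELECT id FROM repo WHERE name LIKE ? ORDER BY id LIMIT ? OFFSET ?",
            ("%" + query + "%", chunk_size, chunk_offset))]
        if not chunk:
            break  # no more candidates to look at
        allowed = filter_visible(conn, chunk, username)
        # Preserve the lookup order while dropping unreadable repositories.
        visible.extend(repo_id for repo_id in chunk if repo_id in allowed)
        chunk_offset += chunk_size
    return visible[offset:needed]
```

In this sketch each chunk costs two queries (one lookup, one filter) regardless of how many repositories it contains, instead of one permission query per repository found.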
2017-02-27 17:56:44 -05:00
commit b5bb76cdea
parent b45dc07dce
9 changed files with 114 additions and 120 deletions
--- a/endpoints/v1/index.py
+++ b/endpoints/v1/index.py
@@ -312,33 +312,19 @@ def get_search():

 def _conduct_repo_search(username, query, limit=25, page=1):
  """ Finds matching repositories. """
-  only_public = username is None
-
-  def can_read(repo):
-    if repo.is_public:
-      return True
-
-    if only_public:
-      return False
-
-    return ReadRepositoryPermission(repo.namespace_user.username, repo.name).can()
-
-  # Note: We put a max 5 page limit here. The Docker CLI doesn't seem to use the
-  # pagination and most customers hitting the API should be using V2 catalog, so this
-  # is a safety net for our slow search below, since we have to use the slow approach
-  # of finding *all* the results, and then slicing in-memory, because this old API requires
-  # the *full* page count in the returned results.
-  _MAX_PAGE_COUNT = 5
-  page = min(page, _MAX_PAGE_COUNT)
+  # Note that we put a maximum limit of five pages here, because this API should only really ever
+  # be used by the Docker CLI, and it doesn't even paginate.
+  page = min(page, 5)
+  offset = (page - 1) * limit

  if query:
-    matching_repos = model.get_sorted_matching_repositories(query, only_public, can_read,
-                                                            limit=limit*_MAX_PAGE_COUNT)
+    matching_repos = model.get_sorted_matching_repositories(query, username, limit=limit+1,
+                                                            offset=offset)
  else:
    matching_repos = []

  results = []
-  for repo in matching_repos[(page - 1) * _MAX_PAGE_COUNT:limit]:
+  for repo in matching_repos[0:limit]:
    results.append({
      'name': repo.namespace_name + '/' + repo.name,
      'description': repo.description,
@@ -350,7 +336,7 @@ def _conduct_repo_search(username, query, limit=25, page=1):
  return {
    'query': query,
    'num_results': len(results),
-    'num_pages': (len(matching_repos) / limit) + 1,
+    'num_pages': page + 1 if len(matching_repos) > limit else page,
    'page': page,
    'page_size': limit,
    'results': results,
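
The `limit+1` lookup and the new `num_pages` expression in the hunks above amount to a look-ahead pagination scheme: fetch one row more than the page size and use its presence to decide whether another page exists. A standalone sketch of that idea follows; `paginate_with_lookahead` and `make_list_fetcher` are hypothetical names used only for this illustration, and the fetcher stands in for the real model call.

```python
def make_list_fetcher(items):
    """Return a fetch(limit, offset) callable backed by an in-memory list."""
    def fetch(limit, offset):
        return items[offset:offset + limit]
    return fetch


def paginate_with_lookahead(fetch, page, limit):
    """Build a page of results, advertising at most one further page."""
    offset = (page - 1) * limit
    # Ask for one extra row; it is never returned, only used as a signal.
    rows = fetch(limit=limit + 1, offset=offset)
    has_more = len(rows) > limit
    results = rows[:limit]
    return {
        'num_results': len(results),
        # Not the true page count: just "this page" or "this page plus one".
        'num_pages': page + 1 if has_more else page,
        'page': page,
        'page_size': limit,
        'results': results,
    }
```

The reported `num_pages` is deliberately an underestimate until the last page is reached, which trades accuracy for not having to count every matching row.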