Commit f35f89cd authored by David Read

Merge commit 'a2e85a91' into fix-travis

parents 397f982b a2e85a91
......@@ -12,14 +12,22 @@ ckanext-archiver
Overview
--------
The CKAN Archiver Extension will download CKAN resources, which can be offered to the user as a 'cached' copy. In addition it provides a 'Broken Links' report showing which resource URLs don't work.
The CKAN Archiver Extension will download all of a CKAN site's resources, for three purposes:
1. to offer each one to the user as a 'cached' copy, in case the link becomes broken
2. to tell the user (and publishers) if the link is broken, on both the dataset/resource and in a 'Broken Links' report
3. to let other extensions, such as ckanext-qa or ckanext-packagezip, analyse the downloaded file.
Compatibility: Requires CKAN version 2.1 or later
TODO:
* Link to the cached file from the dataset
* Link to the reports (including Broken Links) from the main nav
* Mark brokenness on the dataset & resource
* Mark brokenness on the dataset
Operation
---------
When a resource is archived, the information about the archival - if it failed, the filename on disk, file size etc - is stored in the Archival table. (In ckanext-archiver v0.1 it was stored in TaskStatus and on the Resource itself.)
When a resource is archived, the information about the archival - if it failed, the filename on disk, file size etc - is stored in the Archival table. (In ckanext-archiver v0.1 it was stored in TaskStatus and on the Resource itself.) This is added to the dataset during the package_show call (using a schema key), so the information is also available over the API.
Other extensions can subscribe to the archiver's ``IPipe`` interface to hear about datasets being archived. e.g. ckanext-qa will detect its file type and give it an openness score, or ckanext-packagezip will create a zip of the files in a dataset.
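As a rough illustration of reading the archival information that package_show exposes, the sketch below picks out the broken resources from a dataset dict. It assumes each resource may carry an ``archiver`` dict with ``is_broken`` and ``reason`` keys, as in ``Archival.as_dict()``; the function name and the sample data are hypothetical, and the code is written for Python 3.

```python
def broken_resources(package_dict):
    """Return (resource id, reason) pairs for resources marked as broken.

    Assumes each resource may carry an 'archiver' dict with 'is_broken'
    (True, False or None) and 'reason' keys.
    """
    broken = []
    for resource in package_dict.get('resources', []):
        archival = resource.get('archiver')
        if not archival:
            # Not archived yet, so nothing is known about this link
            continue
        if archival.get('is_broken') is True:
            broken.append((resource.get('id'), archival.get('reason', '')))
    return broken


# Hypothetical package_show result, trimmed to the relevant keys
sample_package = {
    'name': 'hospitals',
    'resources': [
        {'id': 'res-1', 'archiver': {'is_broken': False, 'reason': ''}},
        {'id': 'res-2', 'archiver': {'is_broken': True,
                                     'reason': 'Download failed'}},
        {'id': 'res-3'},  # no archival information yet
    ],
}

print(broken_resources(sample_package))
```

Resources whose ``is_broken`` is ``None`` (check not conclusive) are deliberately left out, so only definite failures are reported.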
......@@ -34,7 +42,6 @@ By default, two queues are used:
This means that the 'bulk' queue can happily run slowly, working through large quantities of resources, such as re-archiving every single resource once a week. Meanwhile, if a new resource is put into CKAN, it can be downloaded straight away via the 'priority' queue.
Compatibility: Requires CKAN version 2.1 or later (but can be easily adapted for older versions).
Installation
------------
......@@ -271,3 +278,24 @@ To run the tests:
3. From the CKAN root directory (not the extension root) do::
(pyenv)~/pyenv/src/ckan$ nosetests --ckan ../ckanext-archiver/tests/ --with-pylons=../ckanext-archiver/test-core.ini
Questions
---------
The archiver information is not appearing on the resource page
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check that it is appearing in the API for the dataset - see question below.
The archiver information is not appearing in the API (package_show)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
That is, when you browse to ``/api/action/package_show?id=<package_name>`` on your website, you don't see the ``archiver`` key at the dataset level or the resource level.
Check that the ``paster archiver update`` command completed ok. Check that ``paster celeryd2 run`` has done the archiving ok. Check that the dataset has at least one resource. If you have another extension with an IDatasetForm that customizes the form or schema, see the question below about this.
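The checks above can be partly automated. This is a minimal sketch, not part of the extension: ``package_show_url`` and ``diagnose_archiver_keys`` are hypothetical helper names, the site URL is made up, and the dict passed in is assumed to be the ``result`` value of the package_show response. It is written for Python 3.

```python
from urllib.parse import urlencode


def package_show_url(site_url, package_name):
    """Build the package_show URL for a dataset on the given site."""
    query = urlencode({'id': package_name})
    return '{}/api/action/package_show?{}'.format(site_url.rstrip('/'), query)


def diagnose_archiver_keys(package_dict):
    """Report where the 'archiver' key is present in a package_show result."""
    resources = package_dict.get('resources', [])
    return {
        'has_resources': bool(resources),
        'dataset_has_archiver': 'archiver' in package_dict,
        'resources_missing_archiver': [
            res.get('id') for res in resources if 'archiver' not in res],
    }


# Hypothetical response content, trimmed to the relevant keys
sample_result = {
    'name': 'hospitals',
    'resources': [
        {'id': 'res-1', 'archiver': {'is_broken': False}},
        {'id': 'res-2'},
    ],
}

print(package_show_url('http://localhost:5000/', 'hospitals'))
print(diagnose_archiver_keys(sample_result))
```

If ``has_resources`` is false there is nothing to archive; if the key is missing everywhere, the paster commands or a competing IDatasetForm are the more likely cause.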
My site has an IDatasetForm already - how can I include the archiver information?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have another extension with an IDatasetForm for customizing the dataset form/schema, then you can simply add to it the schema customizations from this module - see this module's plugins.py in the section for IDatasetForm.
from ckan.plugins import toolkit as tk


def archiver_resource_show(resource_id):
    data_dict = {'id': resource_id}
    # Action functions take (context, data_dict), so pass an empty context
    return tk.get_action('archiver_resource_show')({}, data_dict)


def archiver_is_resource_broken_html(resource):
    archival = resource.get('archiver')
    if not archival:
        return '<!-- No archival info for this resource -->'
    extra_vars = {'resource': resource}
    extra_vars.update(archival)
    return tk.literal(
        tk.render('archiver/is_resource_broken.html',
                  extra_vars=extra_vars))


def archiver_is_resource_cached_html(resource):
    archival = resource.get('archiver')
    if not archival:
        return '<!-- No archival info for this resource -->'
    extra_vars = {'resource': resource}
    extra_vars.update(archival)
    return tk.literal(
        tk.render('archiver/is_resource_cached.html',
                  extra_vars=extra_vars))


# Replacement for the core ckan helper 'format_resource_items'
# but with our own blacklist
def archiver_format_resource_items(items):
    blacklist = ['archiver']
    items_ = [item for item in items
              if item[0] not in blacklist]
    import ckan.lib.helpers as ckan_helpers
    return ckan_helpers.format_resource_items(items_)
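The filtering step in ``archiver_format_resource_items`` can be tried without CKAN. The sketch below reproduces only the blacklist filter on (key, value) pairs; ``drop_blacklisted`` and the sample resource are hypothetical, and the core helper's own formatting is not imitated here.

```python
def drop_blacklisted(items, blacklist=('archiver',)):
    """Return the (key, value) pairs whose key is not blacklisted."""
    return [item for item in items if item[0] not in blacklist]


# Hypothetical resource dict; sorted so the output order is predictable
resource = {'format': 'CSV', 'size': 7695, 'archiver': {'is_broken': False}}
print(drop_blacklisted(sorted(resource.items())))
```

Filtering before handing the items to the core helper keeps the nested ``archiver`` dict out of the 'Additional Information' table.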
import logging

import ckan.logic as logic
import ckan.plugins as p
from ckan import model

from ckanext.archiver.model import Archival, aggregate_archivals_for_a_dataset

NotFound = logic.NotFound
_get_or_bust = logic.get_or_bust

log = logging.getLogger(__name__)


@p.toolkit.side_effect_free
def archiver_resource_show(context, data_dict=None):
    '''Return the details of the archival of a resource

    :param id: the id of the resource
    :type id: string

    :rtype: dictionary
    '''
    id_ = _get_or_bust(data_dict, 'id')
    archival = Archival.get_for_resource(id_)
    if archival is None:
        raise NotFound
    archival_dict = archival.as_dict()
    p.toolkit.check_access('archiver_resource_show', context, data_dict)
    return archival_dict


@p.toolkit.side_effect_free
def archiver_dataset_show(context, data_dict=None):
    '''Return the details of the archival of a dataset, aggregated across its
    resources.

    :param id: the name or id of the dataset
    :type id: string

    :rtype: dictionary
    '''
    id_ = _get_or_bust(data_dict, 'id')
    dataset = model.Package.get(id_)
    if not dataset:
        raise NotFound
    archivals = Archival.get_for_package(dataset.id)
    archival_dict = aggregate_archivals_for_a_dataset(archivals)
    p.toolkit.check_access('archiver_dataset_show', context, data_dict)
    return archival_dict
import ckan.plugins as p


@p.toolkit.auth_allow_anonymous_access
def archiver_resource_show(context, data_dict):
    # anyone
    return {'success': True}


@p.toolkit.auth_allow_anonymous_access
def archiver_dataset_show(context, data_dict):
    # anyone
    return {'success': True}
import json
import uuid
from datetime import datetime
......
......@@ -102,7 +102,7 @@ class ArchiverPlugin(p.SingletonPlugin, toolkit.DefaultDatasetForm):
return schema
# this is a validator
# this is a validator/converter
def add_archival_information(key, data, errors, context):
archivals = Archival.get_for_package(data[('id',)])
# dataset
......
......@@ -15,7 +15,7 @@ import time
from requests.packages import urllib3
from ckan.lib.celery_app import celery
from ckan.lib.search.index import PackageSearchIndex
from ckan.plugins import toolkit
try:
from ckanext.archiver import settings
except ImportError:
......@@ -108,7 +108,9 @@ def update_package(ckan_ini_filepath, package_id, queue='bulk'):
Archive a package.
'''
from ckan import model
from ckan.logic import get_action
from ckan.plugins import toolkit
get_action = toolkit.get_action
load_config(ckan_ini_filepath)
register_translator()
......@@ -138,12 +140,21 @@ def update_package(ckan_ini_filepath, package_id, queue='bulk'):
# Refresh the index for this dataset, so that it contains the latest
# archive info
_update_search_index(package_id, log)
def _update_search_index(package_id, log):
'''
Tells CKAN to update its search index for a given package.
'''
from ckan import model
from ckan.lib.search.index import PackageSearchIndex
package_index = PackageSearchIndex()
# need to re-get the package to avoid using the cache
context_ = {'model': model, 'ignore_auth': True, 'session': model.Session,
'use_cache': False, 'validate': False}
package = get_action('package_show')(context_, {'id': package_id})
package = toolkit.get_action('package_show')(context_, {'id': package_id})
package_index.index_package(package, defer_commit=False)
log.info('Reindexed %s', package['name'])
def _update_resource(ckan_ini_filepath, resource_id, queue):
......@@ -174,8 +185,10 @@ def _update_resource(ckan_ini_filepath, resource_id, queue):
register_translator()
from ckan import model
from ckan.logic import get_action
from pylons import config
from ckan.plugins import toolkit
get_action = toolkit.get_action
assert is_id(resource_id), resource_id
context_ = {'model': model, 'ignore_auth': True, 'session': model.Session}
......
{#
Displays whether the resource is broken or not
Variables passed in include:
"resource": {}
and all the Archival.as_dict() info from the package_show's resource['archiver'] e.g.
"status_id": 0,
"status": "Archived successfully",
"is_broken": false,
"is_broken_printable": "Downloaded OK",
"reason": "",
"url_redirected_to": null,
# Details of last successful archival
"cache_filepath": "/tmp/archive/ad/ad30c8f3-b3c7-4d5c-928f-df89f2cd7855/hospitals",
"cache_url": "http://localhost:4050/ad/ad30c8f3-b3c7-4d5c-928f-df89f2cd7855/hospitals",
"size": "7695",
"mimetype": "text/html",
"hash": "5466f7a55a2fc24fab4466c84fcde73d6d31c82a",
# History
"first_failure": null,
"last_success": "2015-11-17T10:28:00.018577",
"failure_count": 0,
"created": "2015-11-16T18:15:14.391913",
"updated": "2015-11-17T10:28:00.018577",
"resource_timestamp": "2015-10-29T11:09:07.258784",
#}
<div class="archiver {% if is_broken %}link-broken{% elif is_broken == None %}link-not-sure{% else %}link-not-broken{% endif %}">
{%- if is_broken == True -%}
<span class="icon icon-exclamation-sign text-error"></span>
Link is broken<br>
{% if reason %}
- {{ reason }}<br>
{% endif %}
{% if failure_count == 1 %}
<span>This is a one-off failure</span><br>
{% else %}
<span>This resource has failed {{ failure_count }} times in a row since it first failed: {{ h.render_datetime(first_failure) }}</span><br>
{% endif %}
{% if last_success %}
<span>This resource was last ok: {{ h.render_datetime(last_success) }}</span><br>
{% else %}
<span>We do not have a past record of it working since the first check: {{ h.render_datetime(created) }}</span><br>
{% endif %}
{%- elif is_broken == None -%}
Link check is not conclusive<br>
{% if reason %}
- {{ reason }}<br>
{% endif %}
{%- else -%}
Link is ok<br>
{% if reason %}
- {{ reason }}<br>
{% endif %}
{%- endif -%}
{# doesn't work
{% if resource_timestamp != resource['revision_timestamp'] %}
This was tested with an older version of this resource. An update should occur soon.<!-- resource_timestamp {{resource_timestamp}} revision_timestamp {{resource['revision_timestamp']}}--> <br>
{% endif %}
#}
<span>Link checked: {{ h.render_datetime(updated) }}</span><br>
</div>
{#
Displays whether the resource is cached
Variables passed in include:
"resource": {}
and all the Archival.as_dict() info from the package_show's resource['archiver'] e.g.
"status_id": 0,
"status": "Archived successfully",
"is_broken": false,
"is_broken_printable": "Downloaded OK",
"reason": "",
"url_redirected_to": null,
# Details of last successful archival
"cache_filepath": "/tmp/archive/ad/ad30c8f3-b3c7-4d5c-928f-df89f2cd7855/hospitals",
"cache_url": "http://localhost:4050/ad/ad30c8f3-b3c7-4d5c-928f-df89f2cd7855/hospitals",
"size": "7695",
"mimetype": "text/html",
"hash": "5466f7a55a2fc24fab4466c84fcde73d6d31c82a",
# History
"first_failure": null,
"last_success": "2015-11-17T10:28:00.018577",
"failure_count": 0,
"created": "2015-11-16T18:15:14.391913",
"updated": "2015-11-17T10:28:00.018577",
"resource_timestamp": "2015-10-29T11:09:07.258784",
#}
<div class="archiver {% if cache_url %}link-cached{% else %}link-not-cached{% endif %}">
{%- if cache_url -%}
<a href="{{ cache_url }}">
<span class="icon icon-download-alt"></span>
Download cached copy
</a><br>
Size: {{ size|filesizeformat }} <br>
Cached on: {{ h.render_datetime(last_success) }}
{% if is_broken %}
(before it was broken)
{% endif %}<br>
{% else %}
No cached copy available<br>
{% endif %}
</div>
{% ckan_extends %}
{% block resource_additional_information_inner %}
{# This is copied from core ckan, but with the changes marked #}
<div class="module-content">
<h2>{{ _('Additional Information') }}</h2>
<table class="table table-striped table-bordered table-condensed" data-module="table-toggle-more">
<thead>
<tr>
<th scope="col">{{ _('Field') }}</th>
<th scope="col">{{ _('Value') }}</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">{{ _('Last updated') }}</th>
<td>{{ h.render_datetime(res.last_modified) or h.render_datetime(res.revision_timestamp) or h.render_datetime(res.created) or _('unknown') }}</td>
</tr>
<tr>
<th scope="row">{{ _('Created') }}</th>
<td>{{ h.render_datetime(res.created) or _('unknown') }}</td>
</tr>
<tr>
<th scope="row">{{ _('Format') }}</th>
<td>{{ res.mimetype_inner or res.mimetype or res.format or _('unknown') }}</td>
</tr>
<tr>
<th scope="row">{{ _('License') }}</th>
<td>{% snippet "snippets/license.html", pkg_dict=pkg, text_only=True %}</td>
</tr>
{# We replaced h.format_resource_items with h.archiver_format_resource_items so that we can hide the archiver key #}
{% for key, value in h.archiver_format_resource_items(res.items()) %}
<tr class="toggle-more"><th scope="row">{{ key }}</th><td>{{ value }}</td></tr>
{% endfor %}
</tbody>
</table>
</div>
{{ h.archiver_is_resource_broken_html(c.resource) }}<br>
{{ h.archiver_is_resource_cached_html(c.resource) }}
{% endblock %}