Commit d8729418 authored by David Read's avatar David Read

Merge pull request #15 from datagovuk/archiver-2.0

Archiver 2.0 (to go with QA 2.0)
parents b0b8aec0 42afda32
[report]
omit =
    */site-packages/*
    */python?.?/*
    ckan/*
*.pyc
*.egg-info
build
*.egg
.DS_Store
*.swp
# archiver settings should not be checked in - only its template
ckanext/archiver/settings.py
*~
.DS_Store
# vim
*.sw?
# emacs
*~
.ropeproject
node_modules
bower_components
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
sdist/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml
# Sphinx documentation
docs/_build/
language: python
python:
- "2.7"
env: PGVERSION=9.1
install:
- bash bin/travis-build.bash
- pip install coveralls
script: sh bin/travis-run.sh
after_success:
- coveralls
The MIT License (MIT)
Copyright (c) 2015 Open Knowledge & Crown Copyright
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
include README.rst
recursive-include ckanext/archiver *.html *.json *.js *.less *.css
CKAN Archiver Extension
=======================
.. You should enable this project on travis-ci.org and coveralls.io to make
   these badges work. The necessary Travis and Coverage config files have been
   generated for you.
**Status:** Production
**CKAN Version:** 1.5.1+
.. image:: https://travis-ci.org/datagovuk/ckanext-archiver.svg?branch=master
    :target: https://travis-ci.org/datagovuk/ckanext-archiver
Overview
--------
The CKAN Archiver Extension provides a set of Celery tasks for downloading and
saving CKAN resources. It can be configured to run automatically, saving any
new resources that are added to a CKAN instance (and saving any resources when
their URL is changed). It can also be run manually from the command line in
order to archive resources for specific datasets, or to archive all resources
in a CKAN instance.
The CKAN Archiver Extension will download all of a CKAN's resources, for three purposes:

1. offer it to the user as a 'cached' copy, in case the link becomes broken
2. tell the user (and publishers) if the link is broken, on both the dataset/resource page and in a 'Broken Links' report
3. allow the downloaded file to be analysed by other extensions, such as ckanext-qa or ckanext-packagezip
Demo:
.. image:: archiver_resource.png
    :alt: Broken link check info and a cached copy offered on resource

.. image:: archiver_report.png
    :alt: Broken link report
Compatibility: Requires CKAN version 2.1 or later
TODO:
* Show brokenness on the package page (not just the resources)
* Prettify the html bits
* Add brokenness to search facets using IFacet
Operation
---------
When a resource is archived, the information about the archival - whether it failed, the filename on disk, file size etc - is stored in the Archival table. (In ckanext-archiver v0.1 it was stored in TaskStatus and on the Resource itself.) This is added to the dataset during the package_show call (using a schema key), so the information is also available over the API.
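Client code reading the API can therefore pick the archival info out of a ``package_show`` response. A minimal sketch - the response below is trimmed, and the exact archiver fields shown (``is_broken``, ``reason``) are assumptions for illustration only:

```python
import json

# A trimmed, illustrative package_show response. Real responses contain
# many more fields; 'is_broken' and 'reason' are assumed field names.
raw = '''
{
  "result": {
    "name": "example-dataset",
    "archiver": {"is_broken": false},
    "resources": [
      {"url": "http://example.com/data.csv",
       "archiver": {"is_broken": true, "reason": "Download error"}}
    ]
  }
}
'''

def broken_resources(package_dict):
    """Return the URLs of resources whose last archival found a broken link."""
    return [res['url']
            for res in package_dict.get('resources', [])
            if res.get('archiver', {}).get('is_broken')]

pkg = json.loads(raw)['result']
assert broken_resources(pkg) == ['http://example.com/data.csv']
```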
Other extensions can subscribe to the archiver's ``IPipe`` interface to hear about datasets being archived. e.g. ckanext-qa will detect its file type and give it an openness score, or ckanext-packagezip will create a zip of the files in a dataset.
Archiver works on Celery queues, so when Archiver is notified of a dataset/resource being created or updated, it puts an 'update request' on a queue. Celery calls the Archiver 'update task' to do each archival. You can start Celery with multiple processes, to archive in parallel.
You can also trigger an archival using paster on the command-line.
By default, two queues are used:
1. 'bulk' for a regular archival of all the resources
2. 'priority' for when a user edits a one-off resource

This means that the 'bulk' queue can happily run slowly, archiving large quantities over a long period, such as re-archiving every single resource once a week. Meanwhile, if a new resource is put into CKAN then it can be downloaded straight away via the 'priority' queue.
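The effect of dedicating a worker to each queue can be sketched in plain Python (Celery's actual routing is configuration-driven; this just illustrates the idea that a 'priority' worker never waits behind the bulk backlog):

```python
from collections import deque

# Jobs queued by a weekly re-archival of everything, plus one job from a
# user editing a resource just now (names are made up for illustration).
bulk = deque(['bulk-res-%d' % i for i in range(1000)])
priority = deque(['user-edited-resource'])

def priority_worker_next():
    """A worker consuming only the 'priority' queue."""
    return priority.popleft() if priority else None

def bulk_worker_next():
    """A separate worker chews through the 'bulk' backlog at its own pace."""
    return bulk.popleft() if bulk else None

# The user-triggered job is picked up immediately, despite 1000 queued bulk jobs.
assert priority_worker_next() == 'user-edited-resource'
```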
Installation
------------

To install ckanext-archiver:

1. Activate your CKAN virtual environment, for example::

    . /usr/lib/ckan/default/bin/activate

2. Install the ckanext-archiver and ckanext-report Python packages into your virtual environment::

    pip install -e git+http://github.com/ckan/ckanext-archiver.git#egg=ckanext-archiver
    pip install -e git+http://github.com/datagovuk/ckanext-report.git#egg=ckanext-report

3. Install the archiver dependencies::

    pip install -r ckanext-archiver/requirements.txt

4. Now create the database tables::

    paster --plugin=ckanext-archiver archiver init --config=production.ini
    paster --plugin=ckanext-report report initdb --config=production.ini

5. Add ``archiver report`` to the ``ckan.plugins`` setting in your CKAN
   config file (by default the config file is located at
   ``/etc/ckan/default/production.ini``).

6. Install a Celery queue backend - see later section.

7. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu::

    sudo service apache2 reload
Upgrade from version 0.1 to 2.x
-------------------------------
NB If you are upgrading ckanext-archiver and use ckanext-qa too, then you will need to upgrade ckanext-qa to version 2.x at the same time.

NB Previously you needed both ckanext-archiver and ckanext-qa to see the broken link report. This functionality has now moved to ckanext-archiver, so now you only need ckanext-qa if you want the 5 stars of openness functionality.
1. Activate your CKAN virtual environment, for example::

    . /usr/lib/ckan/default/bin/activate

2. Install ckanext-report (if not already installed)::

    pip install -e git+http://github.com/datagovuk/ckanext-report.git#egg=ckanext-report
3. Add ``report`` to the ``ckan.plugins`` setting in your CKAN config file (it
should already have ``archiver``) (by default the config file is located at
``/etc/ckan/default/production.ini``).
4. Also in your CKAN config file, rename old config option keys if you have them:
* ``ckan.cache_url_root`` to ``ckanext-archiver.cache_url_root``
* ``ckanext.archiver.user_agent_string`` to ``ckanext-archiver.user_agent_string``
5. Upgrade the ckanext-archiver Python package::

    cd ckanext-archiver
    git pull
    python setup.py develop

6. Create the new database tables::

    paster --plugin=ckanext-archiver archiver init --config=production.ini

7. Ensure the archiver dependencies are installed::

    pip install -r requirements.txt

8. Install the developer dependencies, needed for the migration::

    pip install -r dev-requirements.txt

9. Migrate your database to the new Archiver tables::

    python ckanext/archiver/bin/migrate_task_status.py --write production.ini
Installing a Celery queue backend
---------------------------------
Archiver uses Celery to manage its 'queues'. You need to install a queue back-end, such as Redis or RabbitMQ.
Redis backend
-------------
Redis can be installed like this::

    sudo apt-get install redis-server

Install the python library into your python environment::

    /usr/lib/ckan/default/bin/pip install redis==2.10.1

It must then be configured in your CKAN config (e.g. production.ini) by inserting a new section, e.g. before ``[app:main]``::

    [app:celery]
    BROKER_BACKEND = redis
    BROKER_HOST = redis://localhost/1
    CELERY_RESULT_BACKEND = redis
    REDIS_HOST = 127.0.0.1
    REDIS_PORT = 6379
    REDIS_DB = 0
    REDIS_CONNECT_RETRY = True
Number of items in the queue 'bulk'::

    redis-cli -n 1 LLEN bulk

See item 0 in the queue (which is the last to go on the queue & last to be processed)::

    redis-cli -n 1 LINDEX bulk 0

To delete all the items on the queue::

    redis-cli -n 1 DEL bulk
Installing SNI support
----------------------
When archiving resources on servers which use HTTPS, you might encounter this error::

    requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

Whilst this could possibly be a problem with the server, it is most likely due to you needing to install SNI support on the machine that ckanext-archiver runs on. Server Name Indication (SNI) is for when a server has multiple SSL certificates, which is a relatively new feature in HTTPS. Making use of it requires a recent version of OpenSSL plus the corresponding python libraries.

If you have SNI support installed then this command should run without the above error::

    python -c 'import requests; requests.get("http://files.datapress.com")'

On Ubuntu 12.04 you can install SNI support by doing this::

    sudo apt-get install libffi-dev
    . /usr/lib/ckan/default/bin/activate
    pip install 'cryptography==0.9.3' pyOpenSSL ndg-httpsclient pyasn1

You should also check your OpenSSL version is greater than 1.0.0::

    python -c "import ssl; print ssl.OPENSSL_VERSION"
SNI was reportedly added in OpenSSL version 0.9.8j, but there are reported problems with 0.9.8y, 0.9.8zc & 0.9.8zg, so 1.0.0+ is recommended.
For more about enabling SNI in python requests see:
* https://stackoverflow.com/questions/18578439/using-requests-with-tls-doesnt-give-sni-support/18579484#18579484
* https://github.com/kennethreitz/requests/issues/2022
Config settings
---------------
1. Enabling Archiver to listen to resource changes

If you want the archiver to run automatically when a new CKAN resource is added, or the url of a resource is changed, then edit your CKAN config file (eg: development.ini) to enable the extension::

    ckan.plugins = archiver
If there are other plugins activated, add this to the list (each plugin should be separated with a space).

**Note:** You can still run the archiver manually (from the command line) on specific resources or on all resources in a CKAN instance without enabling the plugin. See section 'Using Archiver' for details.
2. Other CKAN config options

The following config variable should also be set in your CKAN config:

* ``ckan.site_url`` = URL to your CKAN instance

This is the URL that the archive process (in Celery) will use to access the CKAN API to update it about the cached URLs. If your internal network names your CKAN server differently, then specify this internal name in the config option ``ckan.site_url_internally``.
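For instance, if Celery reaches CKAN by a different hostname than the public one, the pair of options might look like this (both values are illustrative)::

    ckan.site_url = http://data.mysite.com
    ckan.site_url_internally = http://localhost:8080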
3. Additional Archiver settings

Add the settings to the CKAN config file:

* ``ckanext-archiver.archive_dir`` = path to the directory that archived files will be saved to (e.g. ``/www/resource_cache``)
* ``ckanext-archiver.cache_url_root`` = URL where you will be publicly serving the cached files stored locally at ``ckanext-archiver.archive_dir``, providing a full URL to each archived file
* ``ckanext-archiver.max_content_length`` = the maximum size (in bytes) of files to archive (default ``50000000``, i.e. 50 MB)
* ``ckanext-archiver.user_agent_string`` = identifies the archiver to the servers it archives from
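Putting the Archiver settings together, the relevant part of a CKAN config file might look like this (the directory, URL and user agent values are illustrative)::

    ckanext-archiver.archive_dir = /www/resource_cache
    ckanext-archiver.cache_url_root = http://mysite.com/resource_cache
    ckanext-archiver.max_content_length = 50000000
    ckanext-archiver.user_agent_string = mysite.com archiver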
4. Nightly report generation

Configure the reports to be generated each night using cron, e.g.::

    0 6 * * * www-data /usr/lib/ckan/default/bin/paster --plugin=ckanext-report report generate --config=/etc/ckan/default/production.ini

5. Your web server should serve the files from the archive_dir.
With nginx you insert a new ``location`` after the ckan one. e.g. here we have configured ``ckanext-archiver.archive_dir`` to ``/www/resource_cache`` and serve these files at location ``/resource_cache`` (i.e. ``http://mysite.com/resource_cache``)::

    server {
        # ckan
        location / {
            proxy_pass http://127.0.0.1:8080/;
            ...
        }
        # archived files
        location /resource_cache {
            root /www/resource_cache;
        }
    }
Legacy settings:

* ``ckanext-archiver.archive_dir``
* ``ckanext-archiver.max_content_length``
* ``ckanext-archiver.data_formats`` (space separated)
* ``ckanext.archiver.user_agent_string``
Older versions of ckanext-archiver put these settings in ckanext/archiver/settings.py as variables ARCHIVE_DIR and MAX_CONTENT_LENGTH, but this is deprecated as of ckanext-archiver 2.0.

There used to be an option DATA_FORMATS for filtering the resources archived, but that has been removed in ckanext-archiver v2.0: the extension is no longer only caching files but is seen as a broken link checker, which applies whatever the format.
Using Archiver
--------------

First, make sure that Celery is running for each queue. For test/local use, you can run::

    paster --plugin=ckanext-archiver celeryd2 run all -c development.ini

However in production you'd run the priority and bulk queues separately, or else the priority queue will not have any priority over the bulk queue. This can be done by running these two commands in separate terminals::

    paster --plugin=ckanext-archiver celeryd2 run priority -c production.ini
    paster --plugin=ckanext-archiver celeryd2 run bulk -c production.ini

For production use, we recommend setting up Celery to run with supervisord. For more information see:

* http://docs.ckan.org/en/latest/maintaining/background-tasks.html

An archival can be triggered by adding a dataset with a resource or updating a resource URL. Alternatively you can run::

    paster --plugin=ckanext-archiver archiver update [dataset] --queue=priority -c <path to CKAN config>

Here ``dataset`` is a CKAN dataset name or ID. If given, all resources for that dataset will be archived; if omitted, all resources for all datasets will be archived.

For a full list of manual commands run::

    paster --plugin=ckanext-archiver archiver --help

Once you've done some archiving you can generate a Broken Links report::

    paster --plugin=ckanext-report report generate broken-links --config=production.ini

And view it on your CKAN site at ``/report/broken-links``.

Testing
-------

To run the tests:

1. Activate your CKAN virtual environment, for example::

    . /usr/lib/ckan/default/bin/activate

2. If not done already, install the dev requirements::

    (pyenv)~/pyenv/src/ckan$ pip install -r ../ckanext-archiver/dev-requirements.txt

3. From the CKAN root directory (not the extension root) do::

    (pyenv)~/pyenv/src/ckan$ nosetests --ckan ../ckanext-archiver/tests/ --with-pylons=../ckanext-archiver/test-core.ini
Questions
---------
The archiver information is not appearing on the resource page
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Check that it is appearing in the API for the dataset - see question below.
The archiver information is not appearing in the API (package_show)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
i.e. if you browse this path on your website: `/api/action/package_show?id=<package_name>` then you don't see the `archiver` key at the dataset level or resource level.
Check the `paster archiver update` command completed ok. Check that the `paster celeryd2 run` has done the archiving ok. Check the dataset has at least one resource. Check that you have ``archiver`` in your ckan.plugins and have restarted CKAN.
'SSL handshake' error
~~~~~~~~~~~~~~~~~~~~~
When archiving resources on servers which use HTTPS, you might encounter this error::

    requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

This is probably because you don't have SNI support, which requires installing a recent OpenSSL and the related python libraries - see section "Installing SNI support".
#!/bin/bash
set -e
echo "This is travis-build.bash..."
echo "Installing the packages that CKAN requires..."
sudo apt-get update -qq
sudo apt-get install postgresql-$PGVERSION solr-jetty libcommons-fileupload-java:amd64=1.2.2-1
echo "Installing CKAN and its Python dependencies..."
git clone https://github.com/ckan/ckan
cd ckan
#export latest_ckan_release_branch=`git branch --all | grep remotes/origin/release-v | sort -r | sed 's/remotes\/origin\///g' | head -n 1`
#export ckan_branch=release-v2.2-dgu
export ckan_branch=master
echo "CKAN branch: $ckan_branch"
git checkout $ckan_branch
python setup.py develop
pip install -r requirements.txt --allow-all-external
pip install -r dev-requirements.txt --allow-all-external
cd -
echo "Creating the PostgreSQL user and database..."
sudo -u postgres psql -c "CREATE USER ckan_default WITH PASSWORD 'pass';"
sudo -u postgres psql -c 'CREATE DATABASE ckan_test WITH OWNER ckan_default;'
echo "Initialising the database..."
cd ckan
paster db init -c test-core.ini
cd -
echo "Installing dependency ckanext-report and its requirements..."
pip install -e git+https://github.com/datagovuk/ckanext-report.git#egg=ckanext-report
echo "Installing ckanext-archiver and its requirements..."
python setup.py develop
pip install -r requirements.txt
pip install -r dev-requirements.txt
echo "Moving test-core.ini into a subdir..."
mkdir subdir
mv test-core.ini subdir
echo "travis-build.bash is done."
#!/bin/sh -e
echo "NO_START=0\nJETTY_HOST=127.0.0.1\nJETTY_PORT=8983\nJAVA_HOME=$JAVA_HOME" | sudo tee /etc/default/jetty
sudo cp ckan/ckan/config/solr/schema.xml /etc/solr/conf/schema.xml
sudo service jetty restart
nosetests --nologcapture --with-pylons=subdir/test-core.ini --with-coverage --cover-package=ckanext.archiver --cover-inclusive --cover-erase --cover-tests
import os
def load_config(config_filepath):
import paste.deploy
config_abs_path = os.path.abspath(config_filepath)
conf = paste.deploy.appconfig('config:' + config_abs_path)
import ckan
ckan.config.environment.load_environment(conf.global_conf,
conf.local_conf)
def register_translator():
# Register a translator in this thread so that
# the _() functions in logic layer can work
from paste.registry import Registry
from pylons import translator
from ckan.lib.cli import MockTranslator
global registry
registry = Registry()
registry.prepare()
global translator_obj
translator_obj = MockTranslator()
registry.register(translator, translator_obj)
def get_resources(state='active', publisher_ref=None, resource_id=None,
dataset_name=None):
''' Returns all active resources, or filtered by the given criteria. '''
from ckan import model
resources = model.Session.query(model.Resource) \
.filter_by(state=state)
if hasattr(model, 'ResourceGroup'):
# earlier CKANs had ResourceGroup
resources = resources.join(model.ResourceGroup)
resources = resources \
.join(model.Package) \
.filter_by(state='active')
criteria = [state]
if publisher_ref:
publisher = model.Group.get(publisher_ref)
assert publisher
resources = resources.filter(model.Package.owner_org == publisher.id)
criteria.append('Publisher:%s' % publisher.name)
if dataset_name:
resources = resources.filter(model.Package.name == dataset_name)
criteria.append('Dataset:%s' % dataset_name)
if resource_id:
resources = resources.filter(model.Resource.id == resource_id)
criteria.append('Resource:%s' % resource_id)
resources = resources.all()
print '%i resources (%s)' % (len(resources), ' '.join(criteria))
return resources