Commit 378ac899 authored by David Read

Put Archiver data into the package_dict using after_show instead of IDatasetForm because you can only have one IDatasetForm - cannot have one for QA as well. Harmonize naming for config options while this is a major version change.
parent edeaaf5c
......@@ -95,29 +95,34 @@ NB Previously you needed both ckanext-archiver and ckanext-qa to see the broken
should already have ``archiver``) (by default the config file is located at
``/etc/ckan/default/production.ini``).
4. Upgrade the ckanext-archiver Python package::
4. Also in your CKAN config file, rename old config option keys if you have them:
* ``ckan.cache_url_root`` to ``ckanext-archiver.cache_url_root``
* ``ckanext.archiver.user_agent_string`` to ``ckanext-archiver.user_agent_string``
5. Upgrade the ckanext-archiver Python package::
cd ckanext-archiver
git pull
python setup.py develop
5. Create the new database tables::
6. Create the new database tables::
paster --plugin=ckanext-archiver archiver init --config=production.ini
6. Ensure the archiver dependencies are installed::
7. Ensure the archiver dependencies are installed::
pip install -r requirements.txt
7. Install the developer dependencies, needed for the migration::
8. Install the developer dependencies, needed for the migration::
pip install -r dev-requirements.txt
8. Migrate your database to the new Archiver tables::
9. Migrate your database to the new Archiver tables::
python ckanext/archiver/bin/migrate_task_status.py --write production.ini
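The config key renames in the upgrade steps above could be applied with a small script rather than by hand. This is only a sketch: the helper name `rename_config_keys` is hypothetical and not part of ckanext-archiver, and it assumes each option sits on its own `key = value` line in the ini file.

```python
import re

# Old key -> new key, as listed in the upgrade steps.
RENAMED_KEYS = {
    'ckan.cache_url_root': 'ckanext-archiver.cache_url_root',
    'ckanext.archiver.user_agent_string': 'ckanext-archiver.user_agent_string',
}


def rename_config_keys(ini_text, renamed=RENAMED_KEYS):
    """Return ini_text with old option keys replaced by their new names.

    Only lines that begin with an old key followed by '=' are changed,
    so comments and values that mention a key are left alone.
    """
    out = []
    for line in ini_text.splitlines():
        for old, new in renamed.items():
            match = re.match(r'^(\s*)%s(\s*=.*)$' % re.escape(old), line)
            if match:
                line = match.group(1) + new + match.group(2)
                break
        out.append(line)
    return '\n'.join(out)


if __name__ == '__main__':
    sample = ('ckan.site_url = http://example.com\n'
              'ckan.cache_url_root = http://example.com/cache/')
    print(rename_config_keys(sample))
```

Back up the config file first and review the output before replacing the original.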
Installing a Celery queue backend
---------------------------------
......@@ -164,9 +169,9 @@ When archiving resources on servers which use HTTPS, you might encounter this er
requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure
Whilst this could be a problem with the server, it is likely due to you needing to install SNI support on the machine that ckanext-archiver runs. Server Name Indication (SNL) is for when a server has multiple SSL certificates, which is a relatively new feature. This requires installing a recent version of OpenSSL plus the python libraries to make use of this feature..
Whilst this could be a problem with the server, it is most likely because you need to install SNI support on the machine that ckanext-archiver runs on. Server Name Indication (SNI) lets a server that hosts multiple SSL certificates present the right one, and is a relatively new feature in HTTPS. Using it requires a recent version of OpenSSL plus the Python libraries that make use of it.
If you have SNI support installed then this should run without the error::
If you have SNI support installed then this command should run without the above error::
python -c 'import requests; requests.get("https://files.datapress.com")'
......@@ -176,10 +181,12 @@ On Ubuntu 12.04 you can install SNI support by doing this::
. /usr/lib/ckan/default/bin/activate
pip install 'cryptography==0.9.3' pyOpenSSL ndg-httpsclient pyasn1
You should also check your OpenSSL version is greater than 1.0.0. Apparently SNI was added in 0.9.8j but apparently there are reported problems with 0.9.8y, 0.9.8zc & 0.9.8zg so 1.0.0+ is recommended.
You should also check your OpenSSL version is at least 1.0.0::
python -c "import ssl; print ssl.OPENSSL_VERSION"
SNI was reportedly added in OpenSSL 0.9.8j, but problems have been reported with 0.9.8y, 0.9.8zc and 0.9.8zg, so 1.0.0+ is recommended.
For more about enabling SNI in python requests see:
* https://stackoverflow.com/questions/18578439/using-requests-with-tls-doesnt-give-sni-support/18579484#18579484
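The OpenSSL version check above can also be done programmatically. The sketch below uses only the standard library `ssl` module. The helper names are hypothetical, and it reports what the standard library was built with, so it says nothing about whether `requests` is using pyOpenSSL instead.

```python
import ssl


def openssl_at_least(version_info, minimum=(1, 0, 0)):
    """True if the first three parts of version_info are >= minimum."""
    return tuple(version_info[:3]) >= tuple(minimum)


def sni_looks_supported():
    # ssl.HAS_SNI only exists on Python 2.7.9+ / 3.2+; a missing attribute
    # is treated here as "no SNI in the standard library".
    has_sni = getattr(ssl, 'HAS_SNI', False)
    return bool(has_sni) and openssl_at_least(ssl.OPENSSL_VERSION_INFO)


if __name__ == '__main__':
    print(ssl.OPENSSL_VERSION)
    print('SNI looks supported: %s' % sni_looks_supported())
```

A `False` result on an old Python 2.7 does not necessarily mean archiving will fail, since the `pip install` step above adds SNI support through pyOpenSSL rather than the standard library.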
......@@ -207,20 +214,19 @@ Config settings
The following config variable should also be set in your CKAN config:
* ckan.site_url: URL to your CKAN instance
* ``ckan.site_url`` = URL to your CKAN instance
This is the URL that the archive process (in Celery) will use to access the CKAN API to update it about the cached URLs. If your internal network names your CKAN server differently, then specify this internal name in config option: ckan.site_url_internally
This is the URL that the archive process (in Celery) will use to access the CKAN API to update it about the cached URLs. If your internal network names your CKAN server differently, then specify this internal name in config option: ``ckan.site_url_internally``
* ckan.cache_url_root: URL that will be prepended to the file path and saved against the CKAN resource,
providing a full URL to the archived file.
3. Additional Archiver settings
Add the settings to the CKAN config file:
* ckanext-archiver.archive_dir - path to the directory that archived files will be saved to (e.g. ``/www/resource_cache``)
* ckanext-archiver.max_content_length - the maximum size (in bytes) of files to archive (default ``50000000`` =50MB)
* ckanext.archiver.user_agent_string - identifies the archiver to servers it archives from
* ``ckanext-archiver.archive_dir`` = path to the directory that archived files will be saved to (e.g. ``/www/resource_cache``)
* ``ckanext-archiver.cache_url_root`` = URL where you will be publicly serving the cached files stored locally at ``ckanext-archiver.archive_dir``.
* ``ckanext-archiver.max_content_length`` = the maximum size (in bytes) of files to archive (default ``50000000`` =50MB)
* ``ckanext-archiver.user_agent_string`` = identifies the archiver to servers it archives from
4. Nightly report generation
......@@ -321,12 +327,7 @@ The archiver information is not appearing in the API (package_show)
i.e. if you browse this path on your website: `/api/action/package_show?id=<package_name>` then you don't see the `archiver` key at the dataset level or resource level.
Check the `paster archiver update` command completed ok. Check that the `paster celeryd2 run` has done the archiving ok. Check the dataset has at least one resource. If you have another extension with an IDatasetForm that customizes the form or schema, see the question below about this.
My site has an IDatasetForm already - how can I include the archiver information?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have another extension with an IDatasetForm for customizing the dataset form/schema, then you can simply add to it the schema customizations from this module - see this module's plugins.py in the section for IDatasetForm.
Check the `paster archiver update` command completed ok. Check that the `paster celeryd2 run` has done the archiving ok. Check the dataset has at least one resource. Check that you have ``archiver`` in your ckan.plugins and have restarted CKAN.
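To see which parts of a `package_show` response lack the `archiver` key, a small helper along these lines may help. It is a sketch only: the function name is hypothetical, and it assumes the usual CKAN action API envelope with the dataset under `result`.

```python
def find_missing_archiver_keys(package_show_response):
    """Return a list of places where the 'archiver' key is absent.

    package_show_response is the decoded JSON from
    /api/action/package_show?id=<package_name>.
    """
    pkg = package_show_response.get('result', {})
    missing = []
    if 'archiver' not in pkg:
        missing.append('dataset')
    for res in pkg.get('resources', []):
        if 'archiver' not in res:
            missing.append('resource %s' % res.get('id'))
    return missing
```

Note that with the `after_show` approach in this commit, a resource that has not been archived yet gets no `archiver` key, so a missing key on one resource is not by itself a sign of misconfiguration.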
'SSL handshake' error
~~~~~~~~~~~~~~~~~~~~~
......
......@@ -17,8 +17,8 @@ class Archiver(CkanCommand):
'''
Download and save copies of all package resources.
The result of each download attempt is saved to the CKAN task_status table, so the
information can be used later for QA analysis.
The result of each download attempt is saved to the CKAN task_status table,
so the information can be used later for QA analysis.
Usage:
......@@ -35,7 +35,8 @@ class Archiver(CkanCommand):
It does not change the cache_url etc. in the Resource
paster archiver clean-cached-resources
- Removes all cache_urls and other references to resource files on disk.
- Removes all cache_urls and other references to resource files on
disk.
paster archiver view [{dataset name/id}]
- Views archival info, in general and if you specify one, about
......@@ -123,7 +124,7 @@ class Archiver(CkanCommand):
def update(self):
from ckan import model
from ckanext.archiver import plugin
from ckanext.archiver import lib
packages = []
resources = []
if len(self.args) > 1:
......@@ -133,7 +134,7 @@ class Archiver(CkanCommand):
if group:
if group.is_organization:
packages.extend(
model.Session.query(model.Package)\
model.Session.query(model.Package)
.filter_by(owner_org=group.id))
else:
packages.extend(group.packages(with_private=True))
......@@ -192,13 +193,13 @@ class Archiver(CkanCommand):
if res.state == 'active']
self.log.info('Queuing dataset %s (%s resources)',
package.name, len(pkg_resources))
plugin.create_archiver_package_task(package, self.options.queue)
lib.create_archiver_package_task(package, self.options.queue)
time.sleep(0.1) # to try to avoid Redis getting overloaded
for resource in resources:
package = resource.resource_group.package
self.log.info('Queuing resource %s/%s', package.name, resource.id)
plugin.create_archiver_resource_task(resource, self.options.queue)
lib.create_archiver_resource_task(resource, self.options.queue)
time.sleep(0.05) # to try to avoid Redis getting overloaded
self.log.info('Completed queueing')
......@@ -315,7 +316,7 @@ class Archiver(CkanCommand):
continue
try:
s = os.stat(fp)
os.stat(fp)
except OSError:
perm_error += 1
writer.writerow([resource.id, fp.encode('utf-8'), "File not readable"])
......@@ -367,7 +368,7 @@ class Archiver(CkanCommand):
{'model': model, 'ignore_auth': True, 'defer_commit': True}, {}
)
site_url_base = config['ckan.cache_url_root'].rstrip('/')
site_url_base = config['ckanext-archiver.cache_url_root'].rstrip('/')
old_dir_regex = re.compile(r'(.*)/([a-f0-9\-]+)/([^/]*)$')
new_dir_regex = re.compile(r'(.*)/[a-f0-9]{2}/[a-f0-9\-]{36}/[^/]*$')
for resource in model.Session.query(model.Resource).\
......
......@@ -7,7 +7,7 @@ ARCHIVE_DIR = config.get('ckanext-archiver.archive_dir', '/tmp/archive')
MAX_CONTENT_LENGTH = int(config.get('ckanext-archiver.max_content_length',
50000000))
USER_AGENT_STRING = config.get('ckanext.archiver.user_agent_string', None)
USER_AGENT_STRING = config.get('ckanext-archiver.user_agent_string', None)
if not USER_AGENT_STRING:
USER_AGENT_STRING = '%s %s ckanext-archiver' % (
config.get('ckan.site_title', ''), config.get('ckan.site_url'))
import os
import logging

from ckan import model
from ckan.model.types import make_uuid
from ckan.lib.celery_app import celery

log = logging.getLogger(__name__)


def create_archiver_resource_task(resource, queue):
    from pylons import config
    if hasattr(model, 'ResourceGroup'):
        # earlier CKANs had ResourceGroup
        package = resource.resource_group.package
    else:
        package = resource.package
    task_id = '%s/%s/%s' % (package.name, resource.id[:4], make_uuid()[:4])
    ckan_ini_filepath = os.path.abspath(config['__file__'])
    celery.send_task('archiver.update_resource',
                     args=[ckan_ini_filepath, resource.id, queue],
                     task_id=task_id, queue=queue)
    log.debug('Archival of resource put into celery queue %s: %s/%s url=%r',
              queue, package.name, resource.id, resource.url)


def create_archiver_package_task(package, queue):
    from pylons import config
    task_id = '%s/%s' % (package.name, make_uuid()[:4])
    ckan_ini_filepath = os.path.abspath(config['__file__'])
    celery.send_task('archiver.update_package',
                     args=[ckan_ini_filepath, package.id, queue],
                     task_id=task_id, queue=queue)
    log.debug('Archival of package put into celery queue %s: %s',
              queue, package.name)
import os
import logging
from ckan import model
from ckan.model.types import make_uuid
from ckan import plugins as p
from ckan.lib.celery_app import celery
from ckanext.report.interfaces import IReport
from ckanext.archiver.interfaces import IPipe
from ckanext.archiver.logic import action, auth
from ckanext.archiver import helpers
from ckanext.archiver import lib
from ckanext.archiver.model import Archival, aggregate_archivals_for_a_dataset
log = logging.getLogger(__name__)
......@@ -27,7 +25,7 @@ class ArchiverPlugin(p.SingletonPlugin, p.toolkit.DefaultDatasetForm):
p.implements(p.IActions)
p.implements(p.IAuthFunctions)
p.implements(p.ITemplateHelpers)
#p.implements(p.IDatasetForm, inherit=True)
p.implements(p.IPackageController, inherit=True)
# IDomainObjectModification
......@@ -37,7 +35,7 @@ class ArchiverPlugin(p.SingletonPlugin, p.toolkit.DefaultDatasetForm):
log.debug('Notified of package event: %s %s', entity.id, operation)
create_archiver_package_task(entity, 'priority')
lib.create_archiver_package_task(entity, 'priority')
# IReport
......@@ -73,83 +71,31 @@ class ArchiverPlugin(p.SingletonPlugin, p.toolkit.DefaultDatasetForm):
in helpers.__dict__.items()
if callable(function) and name[0] != '_')
# IDatasetForm
def package_types(self):
return ['dataset']
def is_fallback(self):
# This is just a fallback, so a site-specific extension can have their
# own IDatasetForm for datasets, but they will lose the ability to
# see broken-link info on the dataset and in the API, unless they
# integrate the following schema changes in this IDataset form into
# their one.
return True
def update_package_schema(self):
schema = p.toolkit.DefaultDatasetForm.update_package_schema(self)
# don't save archiver info in the dataset, since it is stored in the
# archival table instead, and the value added into the package_show
# result in the show_package_schema
ignore = p.toolkit.get_validator('ignore')
schema['archiver'] = [ignore]
schema['resources']['archiver'] = [ignore]
return schema
def show_package_schema(self):
schema = p.toolkit.DefaultDatasetForm.show_package_schema(self)
schema['archiver'] = [add_archival_information]
return schema
# this is a validator/converter
def add_archival_information(key, data, errors, context):
archivals = Archival.get_for_package(data[('id',)])
# dataset
dataset_archival = aggregate_archivals_for_a_dataset(archivals)
data[key] = dataset_archival
# resources
# (insert archival info into resources here, rather than in a separate
# per-resource validator, because that would mean getting the archival info
# from the database again separately for each resource)
archivals_by_res_id = dict((a.resource_id, a) for a in archivals)
res_index = 0
while True:
res_id_key = ('resources', res_index, u'id')
if res_id_key not in data:
# no more resources
break
res_id = data[res_id_key]
archival = archivals_by_res_id.get(res_id)
if archival:
archival_dict = archival.as_dict()
del archival_dict['id']
del archival_dict['package_id']
del archival_dict['resource_id']
data[('resources', res_index, key[0])] = archival_dict
res_index += 1
def create_archiver_resource_task(resource, queue):
from pylons import config
if hasattr(model, 'ResourceGroup'):
# earlier CKANs had ResourceGroup
package = resource.resource_group.package
else:
package = resource.package
task_id = '%s/%s/%s' % (package.name, resource.id[:4], make_uuid()[:4])
ckan_ini_filepath = os.path.abspath(config['__file__'])
celery.send_task('archiver.update_resource', args=[ckan_ini_filepath, resource.id, queue],
task_id=task_id, queue=queue)
log.debug('Archival of resource put into celery queue %s: %s/%s url=%r', queue, package.name, resource.id, resource.url)
def create_archiver_package_task(package, queue):
from pylons import config
task_id = '%s/%s' % (package.name, make_uuid()[:4])
ckan_ini_filepath = os.path.abspath(config['__file__'])
celery.send_task('archiver.update_package', args=[ckan_ini_filepath, package.id, queue],
task_id=task_id, queue=queue)
log.debug('Archival of package put into celery queue %s: %s', queue, package.name)
    # IPackageController

    def after_show(self, context, pkg_dict):
        # Insert the archival info into the package_dict so that it is
        # available on the API.
        # When you edit the dataset, these values will not show in the form,
        # but they will be saved in the resources (not the dataset). I can't
        # see an easy way to stop this, but I think it is harmless. It will
        # get overwritten here when output again.
        archivals = Archival.get_for_package(pkg_dict['id'])
        if not archivals:
            return
        # dataset
        dataset_archival = aggregate_archivals_for_a_dataset(archivals)
        pkg_dict['archiver'] = dataset_archival
        # resources
        archivals_by_res_id = dict((a.resource_id, a) for a in archivals)
        for res in pkg_dict['resources']:
            archival = archivals_by_res_id.get(res['id'])
            if archival:
                archival_dict = archival.as_dict()
                del archival_dict['id']
                del archival_dict['package_id']
                del archival_dict['resource_id']
                res['archiver'] = archival_dict
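
The merge step in `after_show` can be tried out without a CKAN database by using plain dicts. In this sketch `add_archival_info` is a hypothetical stand-in: `archival_dicts` plays the role of the `Archival.as_dict()` results and `dataset_summary` the role of `aggregate_archivals_for_a_dataset(...)`. Field names such as `is_broken` are made up for illustration.

```python
def add_archival_info(pkg_dict, archival_dicts, dataset_summary):
    """Mimic the merge step of after_show using plain dicts.

    archival_dicts: list of dicts, each with at least 'id', 'package_id'
    and 'resource_id'.
    """
    if not archival_dicts:
        # Nothing archived yet: the package dict is left untouched.
        return pkg_dict
    pkg_dict['archiver'] = dataset_summary
    by_res_id = dict((a['resource_id'], a) for a in archival_dicts)
    for res in pkg_dict['resources']:
        archival = by_res_id.get(res['id'])
        if archival:
            info = dict(archival)  # copy so the input is not mutated
            for key in ('id', 'package_id', 'resource_id'):
                del info[key]
            res['archiver'] = info
    return pkg_dict


if __name__ == '__main__':
    pkg = {'id': 'p1', 'resources': [{'id': 'r1'}, {'id': 'r2'}]}
    archivals = [{'id': 'a1', 'package_id': 'p1', 'resource_id': 'r1',
                  'is_broken': False}]  # 'is_broken' is a hypothetical field
    print(add_archival_info(pkg, archivals, {'is_broken': False}))
```

Because the lookup is keyed by resource id, all the archival records for a dataset are fetched once and shared across its resources.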
class TestIPipePlugin(p.SingletonPlugin):
......
......@@ -108,7 +108,6 @@ def update_package(ckan_ini_filepath, package_id, queue='bulk'):
Archive a package.
'''
from ckan import model
from ckan.plugins import toolkit
get_action = toolkit.get_action
......@@ -133,7 +132,7 @@ def update_package(ckan_ini_filepath, package_id, queue='bulk'):
raise
# Any problem at all is logged and reraised so that celery can log it too
log.error('Error occurred during archiving package: %s\nPackage: %r %r',
e, package_id, package['name'] if package in dir() else '')
e, package_id, package['name'] if 'package' in dir() else '')
raise
notify_package(package, queue, ckan_ini_filepath)
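
The one-character-pair fix in the error handler above (`package in dir()` to `'package' in dir()`) is easy to miss, so here is a standalone sketch of the difference. Both function names are hypothetical, and the `fail_early` flag simulates an exception raised before `package` is assigned.

```python
def name_for_log_fixed(fail_early):
    try:
        if fail_early:
            raise ValueError('failed before package was assigned')
        package = {'name': 'example-dataset'}
        raise ValueError('failed after package was assigned')
    except ValueError:
        # Checks whether the *name* 'package' is bound in the local scope.
        return package['name'] if 'package' in dir() else ''


def name_for_log_buggy(fail_early):
    try:
        if fail_early:
            raise ValueError('failed before package was assigned')
        package = {'name': 'example-dataset'}
        raise ValueError('failed after package was assigned')
    except ValueError:
        # Evaluates the variable itself: raises NameError if it is unbound,
        # and otherwise tests whether the dict is one of the name strings.
        return package['name'] if package in dir() else ''
```

With the unquoted form the handler either raises a second exception, masking the original error, or never logs the package name at all.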
......@@ -217,7 +216,7 @@ def _update_resource(ckan_ini_filepath, resource_id, queue):
download_status_id = Status.by_text('Archived successfully')
context = {
'site_url': config.get('ckan.site_url_internally') or config['ckan.site_url'],
'cache_url_root': config.get('ckan.cache_url_root'),
'cache_url_root': config.get('ckanext-archiver.cache_url_root'),
}
try:
download_result = download(context, resource)
......@@ -438,7 +437,7 @@ def archive_resource(context, resource, log, result=None, url_timeout=30):
if not context.get('cache_url_root'):
log.warning('Not saved cache_url because no value for cache_url_root '
'in config')
raise ArchiveError('No value for cache_url_root in config')
raise ArchiveError('No value for ckanext-archiver.cache_url_root in config')
cache_url = urlparse.urljoin(context['cache_url_root'],
'%s/%s' % (relative_archive_path, file_name))
return {'cache_filepath': saved_file,
......
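
The `cache_url` construction in `archive_resource` relies on `urljoin`, which is sensitive to a trailing slash on `ckanext-archiver.cache_url_root`. The sketch below isolates that step. `build_cache_url` is a hypothetical helper, and the example paths are made up.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the diff


def build_cache_url(cache_url_root, relative_archive_path, file_name):
    """Join the configured root with the archived file's relative path."""
    if not cache_url_root:
        raise ValueError(
            'No value for ckanext-archiver.cache_url_root in config')
    return urljoin(cache_url_root,
                   '%s/%s' % (relative_archive_path, file_name))


if __name__ == '__main__':
    # With a trailing slash the last path segment of the root is kept.
    print(build_cache_url('http://example.com/cache/', 'ab/abcd', 'data.csv'))
    # Without it, urljoin replaces the last segment ('cache' is dropped).
    print(build_cache_url('http://example.com/cache', 'ab/abcd', 'data.csv'))
```

If the cached files are served from a sub-path, it is therefore safest to configure the root with a trailing slash.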
......@@ -378,7 +378,7 @@ class TestDownload(BaseCase):
config
cls.fake_context = {
'site_url': config.get('ckan.site_url_internally') or config['ckan.site_url'],
'cache_url_root': config.get('ckan.cache_url_root'),
'cache_url_root': config.get('ckanext-archiver.cache_url_root'),
}
def teardown(self):
......