.. You should enable this project on travis-ci.org and coveralls.io to make
   these badges work. The necessary Travis and Coverage config files have been
   generated for you.

.. image:: https://travis-ci.org/ckan/ckanext-archiver.svg?branch=master
    :target: https://travis-ci.org/ckan/ckanext-archiver

================
ckanext-archiver
================

Overview
--------

The CKAN Archiver Extension downloads all of a CKAN instance's resources, for three purposes:

1. to offer the user a 'cached' copy, in case the link becomes broken
2. to tell the user (and publishers) if the link is broken, on both the dataset/resource pages and in a 'Broken Links' report
3. to let other extensions analyse the downloaded file, such as ckanext-qa or ckanext-packagezip

Demo:

.. image:: archiver_resource.png
    :alt: Broken link check info and a cached copy offered on resource

.. image:: archiver_report.png
    :alt: Broken link report

Compatibility: Requires CKAN version 2.1 or later

TODO:

* Show brokenness on the package page (not just the resources)
* Prettify the html bits
* Add brokenness to search facets using IFacet

Operation
---------

When a resource is archived, the information about the archival - whether it failed, the filename on disk, the file size etc. - is stored in the Archival table. (In ckanext-archiver v0.1 it was stored in TaskStatus and on the Resource itself.) This is added to the dataset during the package_show call (using a schema key), so the information is also available over the API.
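
As a rough illustration of reading that information back over the API, the sketch below pulls the ``archiver`` key out of a ``package_show`` result at both the dataset and the resource level. The field names inside each ``archiver`` dict (``is_broken``, ``size``) are assumptions made up for this example, not a documented schema, and ``sample`` stands in for a real API response:

.. code-block:: python

    import json


    def archiver_info(package):
        """Collect the 'archiver' entries from a package_show result dict."""
        by_resource = {}
        for resource in package.get("resources", []):
            if "archiver" in resource:
                by_resource[resource["id"]] = resource["archiver"]
        return {"dataset": package.get("archiver"), "resources": by_resource}


    # Hypothetical "result" object, as might be returned by
    # /api/action/package_show?id=<package_name>. Inner field names are assumed.
    sample = json.loads("""
    {
      "name": "example-dataset",
      "archiver": {"is_broken": false},
      "resources": [
        {"id": "res-1", "archiver": {"is_broken": false, "size": 1024}},
        {"id": "res-2"}
      ]
    }
    """)

    info = archiver_info(sample)
    print(info["dataset"])
    print(sorted(info["resources"]))

In this sketch a resource without an ``archiver`` key (``res-2``) is simply left out of the result.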

Other extensions can subscribe to the archiver's ``IPipe`` interface to hear about datasets being archived. For example, ckanext-qa will detect each file's type and give it an openness score, and ckanext-packagezip will create a zip of the files in a dataset.

Archiver works on Celery queues, so when Archiver is notified of a dataset/resource being created or updated, it puts an 'update request' on a queue. Celery calls the Archiver 'update task' to do each archival. You can start Celery with multiple processes, to archive in parallel.

You can also trigger an archival using paster on the command-line.

By default, two queues are used:

1. 'bulk' for a regular archival of all the resources
2. 'priority' for one-off archivals, such as when a user edits a resource

This means that the 'bulk' queue can happily run slowly, working through large quantities of resources, such as re-archiving every single resource once a week. Meanwhile, if a new resource is put into CKAN then it can be downloaded straight away via the 'priority' queue.


Installation
------------

To install ckanext-archiver:

1. Activate your CKAN virtual environment, for example::

     . /usr/lib/ckan/default/bin/activate

2. Install the ckanext-archiver and ckanext-report Python packages into your virtual environment::

     pip install -e git+http://github.com/ckan/ckanext-archiver.git#egg=ckanext-archiver
     pip install -e git+http://github.com/datagovuk/ckanext-report.git#egg=ckanext-report

3. Install the archiver dependencies::

     pip install -r ckanext-archiver/requirements.txt

4. Now create the database tables::

     paster --plugin=ckanext-archiver archiver init --config=production.ini
     paster --plugin=ckanext-report report initdb --config=production.ini

5. Add ``archiver report`` to the ``ckan.plugins`` setting in your CKAN
   config file (by default the config file is located at
   ``/etc/ckan/default/production.ini``).

6. Install a Celery queue backend - see the 'Installing a Celery queue backend' section below.

7. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu::

     sudo service apache2 reload

Upgrade from version 0.1 to 2.x
-------------------------------

NB If you are upgrading ckanext-archiver and use ckanext-qa too, then you will need to upgrade ckanext-qa to version 2.x at the same time.

NB Previously you needed both ckanext-archiver and ckanext-qa to see the broken link report. This functionality has now moved to ckanext-archiver. So now you only need ckanext-qa if you want the 5 stars of openness functionality.

1. Activate your CKAN virtual environment, for example::

     . /usr/lib/ckan/default/bin/activate

2. Install ckanext-report (if not already installed)::

     pip install -e git+http://github.com/datagovuk/ckanext-report.git#egg=ckanext-report

3. Add ``report`` to the ``ckan.plugins`` setting in your CKAN config file (it
   should already have ``archiver``) (by default the config file is located at
   ``/etc/ckan/default/production.ini``).

4. Also in your CKAN config file, rename old config option keys if you have them:

     * ``ckan.cache_url_root`` to ``ckanext-archiver.cache_url_root``
     * ``ckanext.archiver.user_agent_string`` to ``ckanext-archiver.user_agent_string``

5. Upgrade the ckanext-archiver Python package::

     cd ckanext-archiver
     git pull
     python setup.py develop

6. Create the new database tables::

     paster --plugin=ckanext-archiver archiver init --config=production.ini

7. Ensure the archiver dependencies are installed::

     pip install -r requirements.txt

8. Install the developer dependencies, needed for the migration::

     pip install -r dev-requirements.txt

9. Migrate your database to the new Archiver tables::

     python ckanext/archiver/bin/migrate_task_status.py --write production.ini

Migrations post 2.0
-------------------

Over time it is possible that the database structure will change. In these cases you can use the migrate command to update the database schema::

    paster --plugin=ckanext-archiver archiver migrate -c <path to CKAN ini file>

This is only necessary if you update ckanext-archiver and already have the database tables in place.


Installing a Celery queue backend
---------------------------------

Archiver uses Celery to manage its 'queues'. You need to install a queue back-end, such as Redis or RabbitMQ.

Redis backend
~~~~~~~~~~~~~

Redis can be installed like this::

    sudo apt-get install redis-server

Install the Python library into your Python environment::

    /usr/lib/ckan/default/bin/pip install redis==2.10.1

It must then be configured in your CKAN config (e.g. production.ini) by inserting a new section, e.g. before ``[app:main]``::

    [app:celery]
    BROKER_BACKEND = redis
    BROKER_HOST = redis://localhost/1
    CELERY_RESULT_BACKEND = redis
    REDIS_HOST = 127.0.0.1
    REDIS_PORT = 6379
    REDIS_DB = 0
    REDIS_CONNECT_RETRY = True

Number of items in the queue 'bulk'::

    redis-cli -n 1 LLEN bulk

See item 0 in the queue (which is the last to go on the queue & last to be processed)::

    redis-cli -n 1 LINDEX bulk 0

To delete all the items on the queue::

    redis-cli -n 1 DEL bulk
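
The same checks could be scripted from Python. This is a minimal sketch, assuming a redis-py style client (anything with ``llen`` and ``lindex`` methods) connected to database 1 as in the commands above. The helper name ``queue_summary`` is made up for this example, and it assumes the 'priority' queue is stored under a list key of the same name, as 'bulk' is:

.. code-block:: python

    def queue_summary(client, queues=("bulk", "priority")):
        """Return {queue name: (length, newest item or None)} for each queue."""
        summary = {}
        for name in queues:
            length = client.llen(name)
            # Index 0 is the item most recently put on the queue.
            newest = client.lindex(name, 0) if length else None
            summary[name] = (length, newest)
        return summary


    # Example usage (needs the redis package and a running Redis server):
    #   import redis
    #   client = redis.StrictRedis(host="localhost", port=6379, db=1)
    #   print(queue_summary(client))

Passing the client in as an argument keeps the helper independent of how the connection is made, so it can be tried out against a stand-in object first.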

Installing SNI support
----------------------

When archiving resources on servers which use HTTPS, you might encounter this error::

    requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

Whilst this could possibly be a problem with the server, it is most likely because you need to install SNI support on the machine that ckanext-archiver runs on. Server Name Indication (SNI) is for when a server has multiple SSL certificates, which is a relatively new feature in HTTPS. Using it requires a recent version of OpenSSL plus the Python libraries that make use of this feature.

If you have SNI support installed then this command should run without the above error::

    python -c 'import requests; requests.get("https://files.datapress.com")'

On Ubuntu 12.04 you can install SNI support by doing this::

    sudo apt-get install libffi-dev
    . /usr/lib/ckan/default/bin/activate
    pip install 'cryptography==0.9.3' pyOpenSSL ndg-httpsclient pyasn1

You should also check that your OpenSSL version is at least 1.0.0::

    python -c "import ssl; print(ssl.OPENSSL_VERSION)"

SNI support was reportedly added in OpenSSL 0.9.8j, but there are reported problems with 0.9.8y, 0.9.8zc and 0.9.8zg, so 1.0.0 or later is recommended.
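
If you want to make that check in a script rather than by eye, something like the following sketch would do. It relies only on the standard library's ``ssl.OPENSSL_VERSION_INFO`` tuple. The function name and the ``version_info`` override (handy for trying out other versions) are inventions for this example:

.. code-block:: python

    import ssl


    def openssl_at_least(required=(1, 0, 0), version_info=None):
        """True if OpenSSL is at least `required` (major, minor, fix)."""
        if version_info is None:
            version_info = ssl.OPENSSL_VERSION_INFO
        # Compare only the first three fields: major, minor and fix.
        return tuple(version_info[:3]) >= tuple(required)


    if openssl_at_least():
        print("OpenSSL looks recent enough: " + ssl.OPENSSL_VERSION)
    else:
        print("OpenSSL is older than 1.0.0: " + ssl.OPENSSL_VERSION)

Note that this only checks the OpenSSL version that Python was built against. It does not check that the Python libraries listed above are installed.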

For more about enabling SNI in python requests see:

    * https://stackoverflow.com/questions/18578439/using-requests-with-tls-doesnt-give-sni-support/18579484#18579484
    * https://github.com/kennethreitz/requests/issues/2022


Config settings
---------------

1.  Enabling Archiver to listen to resource changes

    If you want the archiver to run automatically when a new CKAN resource is added, or the url of a resource is changed,
    then edit your CKAN config file (eg: development.ini) to enable the extension:

    ::

        ckan.plugins = archiver

    If there are other plugins activated, add this to the list (each plugin should be separated with a space).

    **Note:** You can still run the archiver manually (from the command line) on specific resources or on all resources
    in a CKAN instance without enabling the plugin. See section 'Using Archiver' for details.

2.  Other CKAN config options

    The following config variable should also be set in your CKAN config:

    * ``ckan.site_url`` = URL to your CKAN instance

    This is the URL that the archive process (in Celery) will use to access the CKAN API to update it about the cached URLs. If your internal network names your CKAN server differently, then specify this internal name in config option: ``ckan.site_url_internally``


3.  Additional Archiver settings

    Add these settings to the CKAN config file:

      * ``ckanext-archiver.archive_dir`` = path to the directory that archived files will be saved to (e.g. ``/www/resource_cache``)
      * ``ckanext-archiver.cache_url_root`` = URL where you will be publicly serving the cached files stored locally at ``ckanext-archiver.archive_dir``
      * ``ckanext-archiver.max_content_length`` = the maximum size (in bytes) of files to archive (default ``50000000`` = 50 MB)
      * ``ckanext-archiver.user_agent_string`` = identifies the archiver to servers it archives from

4.  Nightly report generation

    Configure the reports to be generated each night using cron. e.g.::

        0 6  * * *  www-data  /usr/lib/ckan/default/bin/paster --plugin=ckanext-report report generate --config=/etc/ckan/default/production.ini

5.  Your web server should serve the files from the archive_dir.

    With nginx you insert a new ``location`` after the ckan one. e.g. here we have configured ``ckanext-archiver.archive_dir`` to ``/www/resource_cache`` and serve these files at location ``/resource_cache`` (i.e. ``http://mysite.com/resource_cache`` )::

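        server {
            # ckan
            location / {
                proxy_pass http://127.0.0.1:8080/;
                ...
            }
            # archived files
            location /resource_cache {
                # nginx appends the request path to "root", so a request
                # for /resource_cache/x is served from /www/resource_cache/x
                root /www;
            }
        }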
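
To double-check the settings from items 2 and 3 above, a small script can read them back out of the config file. The following is only a sketch for Python 3. The ``SAMPLE`` text stands in for your real ini file, the user agent value and the ``missing_settings`` helper are made up for illustration, which options to treat as expected is a choice made for this example, and placing the options under ``[app:main]`` is assumed from the usual CKAN config layout:

.. code-block:: python

    import configparser

    # Options this example expects to find (an illustrative choice).
    EXPECTED = [
        "ckan.site_url",
        "ckanext-archiver.archive_dir",
        "ckanext-archiver.cache_url_root",
    ]

    SAMPLE = """
    [app:main]
    ckan.site_url = http://mysite.com
    ckan.plugins = archiver report
    ckanext-archiver.archive_dir = /www/resource_cache
    ckanext-archiver.cache_url_root = http://mysite.com/resource_cache
    ckanext-archiver.max_content_length = 50000000
    ckanext-archiver.user_agent_string = Example Archiver
    """.replace("\n    ", "\n")  # strip the indentation used in this listing


    def missing_settings(parser, section="app:main"):
        """Return the expected option names that are absent or empty."""
        return [key for key in EXPECTED
                if not parser.get(section, key, fallback="").strip()]


    # interpolation=None so that '%' in real CKAN config values is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(SAMPLE)

    print(missing_settings(parser))
    print(parser.getint("app:main", "ckanext-archiver.max_content_length"))

To run it against a real config file, replace ``parser.read_string(SAMPLE)`` with ``parser.read("/etc/ckan/default/production.ini")``.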

Legacy settings
~~~~~~~~~~~~~~~

Older versions of ckanext-archiver put these settings in
ckanext/archiver/settings.py as variables ARCHIVE_DIR and MAX_CONTENT_LENGTH
but this is no longer available.

There used to be an option DATA_FORMATS for filtering the resources
archived, but that has now been removed in ckanext-archiver v2.0, since it
is now not only caching files, but is seen as a broken link checker, which
applies whatever the format.


Using Archiver
--------------

First, make sure that Celery is running for each queue. For test/local use, you can run::

    paster --plugin=ckanext-archiver celeryd2 run all -c development.ini

However in production you'd run the priority and bulk queues separately, or else the priority queue will not have any priority over the bulk queue. This can be done by running these two commands in separate terminals::

    paster --plugin=ckanext-archiver celeryd2 run priority -c production.ini
    paster --plugin=ckanext-archiver celeryd2 run bulk -c production.ini

For production use, we recommend setting up Celery to run with supervisord.
For more information see:

* http://docs.ckan.org/en/latest/extensions.html#enabling-an-extension-with-background-tasks
* http://wiki.ckan.org/Writing_asynchronous_tasks

An archival can be triggered by adding a dataset with a resource or updating a resource URL. Alternatively you can run::

    paster --plugin=ckanext-archiver archiver update [dataset] --queue=priority -c <path to CKAN config>

Here ``dataset`` is a CKAN dataset name or ID, or you can omit it to archive all datasets.

For a full list of manual commands run::

    paster --plugin=ckanext-archiver archiver --help

Once you've done some archiving you can generate a Broken Links report::

    paster --plugin=ckanext-report report generate broken-links --config=production.ini

And view it on your CKAN site at ``/report/broken-links``.


Testing
-------

To run the tests:

1. Activate your CKAN virtual environment, for example::

     . /usr/lib/ckan/default/bin/activate

2. If not done already, install the dev requirements::

    (pyenv)~/pyenv/src/ckan$ pip install -r ../ckanext-archiver/dev-requirements.txt

3. From the CKAN root directory (not the extension root) do::

    (pyenv)~/pyenv/src/ckan$ nosetests --ckan ../ckanext-archiver/tests/ --with-pylons=../ckanext-archiver/test-core.ini


Building Debian package
-----------------------

NB this attempt at creating a Debian package is experimental. Important package dependencies have yet to be specified. The outstanding issue is that some dependencies do not exist as Debian packages (eg: messytables).

To build the debian package::

    cd ckanext-archiver; dpkg-buildpackage -us -uc -i -I -rfakeroot

To list the package contents::

    dpkg --contents ../python-ckanext-archiver_0.1-1_all.deb


Questions
---------

The archiver information is not appearing on the resource page
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check that it is appearing in the API for the dataset - see question below.

The archiver information is not appearing in the API (package_show)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

i.e. if you browse this path on your website: ``/api/action/package_show?id=<package_name>`` then you don't see the ``archiver`` key at the dataset level or resource level.

Check the ``paster archiver update`` command completed ok. Check that the ``paster celeryd2 run`` has done the archiving ok. Check the dataset has at least one resource. Check that you have ``archiver`` in your ckan.plugins and have restarted CKAN.

'SSL handshake' error
~~~~~~~~~~~~~~~~~~~~~

When archiving resources on servers which use HTTPS, you might encounter this error::

    requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

This is probably because you don't have SNI support, which requires a recent version of OpenSSL - see section "Installing SNI support".