dd-sword2

DANS SWORD v2 based deposit service

SYNOPSIS

dd-sword2 { server | check }

DESCRIPTION

Overview

Context

The dd-sword2 service is the DANS implementation of the SWORDv2 protocol. It is a rewrite in Java of the Scala-based project easy-sword2. Like its predecessor, it does not implement the full SWORDv2 specifications. Also, since the SWORDv2 specs leave various important issues up to the implementer, the service adds some features.

The best starting point for learning about dd-sword2 is this document. Where appropriate, this document contains references to the SWORDv2 specifications document. When reading the SWORDv2 docs, keep in mind that it is itself built on other specifications, and refers to those often, especially:

Purpose of the service

At the highest level dd-sword2 is a service that accepts ZIP packages that comply with the BagIt packaging format and produces a deposit directory for each.

Interfaces (provided)

The service has the following interfaces.

Overview

SWORDv2

  • Protocol type: HTTP
  • Internal or external: external
  • Purpose: depositing packages, submitting them for archival processing and tracking progress

Deposit directories

  • Protocol type: Shared filesystem
  • Internal or external: internal
  • Purpose: handing packages to the post-submission processing service and reporting back status changes written by that service to the deposit.properties files of the deposit directories

Admin console

  • Protocol type: HTTP
  • Internal or external: internal
  • Purpose: application monitoring and management

Interfaces (consumed)

The service consumes the following interfaces.

Auth delegate

  • Protocol type: HTTP Basic Auth
  • Internal or external: internal
  • Purpose: delegate authentication of a client for SWORD2

Authentication is discussed in more detail ot the User profiles & authentication page.

Processing

The following sections describe the interaction of a client with the SWORDv2 interface. The examples are curl commands. The meaning of the shell variables is as follows:

Variable Meaning
USER user name of sword client
PASSWORD password of the sword client
SWORD_BASE_URL the base URL of the SWORD service
(the same URL is configured in config.yml as sword2.baseUrl)
SWORD_COL_IRI the URL of the collection that the deposit is sent to
SWORD_EDIT_IRI the URL to send subsequent parts to in a continued deposit

Getting the service document

The service document is an XML document that lets the client discover the capabilities and the supported collections of the service. It can be retrieved with a simple GET request:

curl -X GET -u $USER:$PASSWORD $SWORD_BASE_URL/servicedocument

Pretty printing XML

Use xmllint to display XML output in a more readable format (the final dash is intentional!):

curl -X GET -u $USER:$PASSWORD $SWORD_BASE_URL/servicedocument | xmllint --format -

Creating and submitting a deposit

A deposit is created by binary file deposit. The other options that SWORDv2 specifies are currently not supported. Furthermore, the only packaging that is supported is http://purl.org/net/sword/package/BagIt. This means that:

  • the payload of the upload must be a ZIP file containing a bag;
  • the Packaging header must be set to http://purl.org/net/sword/package/BagIt.

It is furthermore mandatory to send along the Content-MD5 header. (Note that SWORD2 requires the content of this header to be a hex encoded MD5 digest, rather than the base64 encoded MD5 digest specified in RFC1864 about Content-MD5.)

If bag.zip is such a ZIP file, then it can be uploaded as follows:

export BAG=bag.zip
curl -X POST \
     -H 'Content-Type: application/zip' \
     -H 'Content-Disposition: attachment; filename=bag.zip' \
     -H "Content-MD5: $(md5 -q $BAG)" \ 
     -H 'Packaging: http://purl.org/net/sword/package/BagIt' \
     --data-binary @$BAG -u $USER:$PASSWORD $SWORD_COL_IRI

(The md5 command used above is the one from BSD and MacOS. You may have to get the correct output in a different way on other systems. On Linux you can use $(md5sum $BAG | cut -d ' ' -f).)

If the upload is successful, the client will receive a deposit receipt. This is an Atom Entry document that contains, among other things, the statement URL (Stat-IRI), which is the URL the client can use to track post-submision processing.

Continued deposit

If the bag to be uploaded is larger than 1G it is recommended to use a continued deposit. The client must split the ZIP file into chunks and send these in separate requests with the In-Progress header set to true for all chunks except the last. The names of the chunk files must be: the name of the complete ZIP file, extended with .n, where n is the sequence number.

(1) The first chunk is sent to the collection URL (Col-IRI in SWORD terms), (2) the subsequent chunks are sent to the SWORD "edit" URL (SE-IRI), which can be found in the deposit receipt of the first upload.

The client indicates that it will be sending more chunks by including the header In-Progress: true. Since the content of each separate chunk is not a valid ZIP file, the Content-Type must be set to application/octet-stream (which is a fancy way of saying the content consists of bytes).

If bag.zip.1, bag.zip.2 and bag.zip.3 are the chunks created by splitting bag.zip, they can be uploaded as follows:

Step (1)

export BAG=bag.zip
curl -X POST \
     -H 'Content-Type: application/octet-stream' \
     -H 'Content-Disposition: attachment; filename=bag.zip.1' \
     -H 'In-Progress: true' \
     -H "Content-MD5: $(md5 -q ${BAG}.1)" \ 
     -H 'Packaging: http://purl.org/net/sword/package/BagIt' \
     --data-binary @${BAG}.1 -u $USER:$PASSWORD $SWORD_COL_IRI

If the upload is successful the server will respond with a download receipt:

<entry xmlns="http://www.w3.org/2005/Atom">
    <generator uri="http://www.swordapp.org/" version="2.0"/>
    <id>https://swordserver.org/sword2/container/a5bb644a-78a3-47ae-907a-0bdf162a0cd4</id>
    <link href="https://swordserver.org/sword2/container/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="edit"/>
    <link href="https://swordserver.org/sword2/container/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="http://purl.org/net/sword/terms/add"/>
    <link href="https://swordserver.org/sword2/media/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="edit-media"/>
    <packaging xmlns="http://purl.org/net/sword/terms/">http://purl.org/net/sword/package/BagIt</packaging>
    <link href="https://swordserver.org/sword2/statement/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="http://purl.org/net/sword/terms/statement"
          type="application/atom+xml; type=feed"/>
    <treatment xmlns="http://purl.org/net/sword/terms/">[1] unpacking [2] verifying integrity [3] storing persistently</treatment>
    <verboseDescription xmlns="http://purl.org/net/sword/terms/">received successfully: bag.zip.1; MD5: 494dd614e36edf5c929403ed7625b157</verboseDescription>
</entry>

Step (2)

Parts 2 and 3 sent to the SWORD "edit" URL (SE-IRI). It can be retrieved from the deposit receipt by finding the link element with rel="edit". In the example this is https://swordserver.org/sword2/container/a5bb644a-78a3-47ae-907a-0bdf162a0cd4.

curl -X POST \
     -H 'Content-Type: application/octet-stream' \
     -H 'Content-Disposition: attachment; filename=bag.zip.2' \
     -H 'In-Progress: true' \
     -H "Content-MD5: $(md5 -q ${BAG}.2)" \ 
     -H 'Packaging: http://purl.org/net/sword/package/BagIt' \
     --data-binary @${BAG}.2 -u $USER:$PASSWORD $SWORD_EDIT_IRI

For the last part the In-Progress header is set to false.

curl -X POST \
     -H 'Content-Type: application/octet-stream' \
     -H 'Content-Disposition: attachment; filename=bag.zip.3' \
     -H 'In-Progress: false' \
     -H "Content-MD5: $(md5 -q ${BAG}.3)" \ 
     -H 'Packaging: http://purl.org/net/sword/package/BagIt' \
     --data-binary @${BAG}.3 -u $USER:$PASSWORD $SWORD_EDIT_IRI

After this, the client will have to wait for the server to process the deposit. It should track the progress until the the server has confirmed that the deposit was fully processed.

Finalizing a deposit

When the client sends (the first part of) a bag, the server creates a draft deposit directory. As long as the client is uploading parts of the deposit, the state of the deposit directory is DRAFT. When the last part has been received, the state becomes UPLOADED. An "uploaded" deposit is waiting for finalization. When a finalization worker becomes available it will:

  1. change the state of the deposit to FINALIZING;
  2. concatenate the parts uploaded in a continued deposit into one file in the order indicated by the sequences numbers at the endings of the file names (in a simple deposit this is skipped, of course);
  3. unzip the file (if this fails, the deposit becomes INVALID);
  4. validate that resulting directory complies with the BagIt specs (if this fails, the deposit becomes INVALID);
  5. change the state to SUBMITTED;
  6. move the deposit to the directory configured for the collection in sword2.collections.<collection>.deposits. (See config.yml.)

(Any error other than the upload not being a valid ZIP file or not being a valid bag, will cause the deposit to transition to a FAILED state. In other words,INVALID indicates faulty input by the client, FAILED means that the server was misconfigured or was experiencing other problems.)

After this dd-sword2, will not write to the deposit in any way. In other words, by moving the deposit tot he deposits directory dd-sword2 hands over further processing to a post-submission process. The only thing that dd-sword2 will continue to do is to serve the client the SWORD Statement whenever requested. This ensures the client can keep track of the deposit even after dd-sword2 is finished with it.

Tracking post-submission processing

As soon as the deposit exists, the client can track its state by downloading the SWORD statement from the Statement URL (Stat-IRI). This URL can be found in the deposit receipt in the link with the attribute rel="http://purl.org/net/sword/terms/statement".

A SWORD Statement is an Atom Feed document (the RDF/XML serialization is currently not supported by dd-sword2), for example:


<feed xmlns="http://www.w3.org/2005/Atom">
    <id>$SWORD_STAT_IRI</id>
    <link href="$SWORD_STAT_IRI" rel="self"/>
    <title type="text">Deposit a5bb644a-78a3-47ae-907a-0bdf162a0cd4</title>
    <author>
        <name>user</name>
    </author>
    <updated>2019-05-23T14:51:15.356Z</updated>
    <category term="ARCHIVED" scheme="http://purl.org/net/sword/terms/state" label="State"/>
    <entry>
        <content type="multipart/related" src="urn:uuid:a5bb644a-78a3-47ae-907a-0bdf162a0cd4"/>
        <id>urn:uuid:a5bb644a-78a3-47ae-907a-0bdf162a0cd4</id>
        <title type="text">Resource urn:uuid:a5bb644a-78a3-47ae-907a-0bdf162a0cd4</title>
        <summary type="text">Resource Part</summary>
        <updated>2019-05-23T14:51:22.342Z</updated>
        <link href="https://doi.org/10.5072/dans-Lwgy-zrn-jfyy" rel="self"/>
    </entry>
</feed>

The statement provides information about the deposit as it is processed by the server (dd-sword2 and any post-submission process), most importantly the current state of the deposit. This can be found in the <category> element with the scheme attribute set to http://purl.org/net/sword/terms/state. The term attribute of this element contains the current state. The table below lists the states implemented by dd-sword2 and their meaning. The post-submission process can set the state by updating the state.label and state.description properties in the deposit.properties file. The state labels it uses are transparent to dd-sword2. It may for example use ARCHIVED to indicate that post-submission processing has finished with successfully archiving the deposit.

State Description
DRAFT The deposit is being prepared by the depositor. It is not submitted to the archive yet
and still open for additional data.
UPLOADED The deposit is in the process of being submitted. It is waiting to be finalized. The data
is completely uploaded. It will automatically move to the next stage and the status will
be updated accordingly.
FINALIZING The deposit is in the process of being submitted. It is being checked for validity. It will
automatically move to the next stage and the status will be updated accordingly.
INVALID The deposit is not accepted by the archive as the submitted bag is not valid.
The description will detail what part of the bag is not according to specifications.
The depositor is asked to fix the bag and resubmit the deposit.
SUBMITTED The deposit is submitted for processing. dd-sword2 will not update it anymore
and limit itself to providing a Statement document on request.
FAILED An error occurred while processing the deposit

ARGUMENTS

    positional arguments:
    {server,check}         available commands

    named arguments:
    -h, --help             show this help message and exit
    -v, --version          show the application version and exit

EXAMPLES

Java client code examples are available in dd-dans-sword2-examples.

INSTALLATION

Currently this project is built as an RPM package for RHEL7/CentOS7 and later. The RPM will install the binaries to /opt/dans.knaw.nl/dd-sword2 and the configuration files to /etc/opt/dans.knaw.nl/dd-sword2.

For installation on systems that do no support RPM and/or systemd:

  1. Build the tarball (see next section).
  2. Extract it to some location on your system, for example /opt/dans.knaw.nl/dd-sword2.
  3. Start the service with the following command bash /opt/dans.knaw.nl/dd-sword2/bin/dd-sword2 server /opt/dans.knaw.nl/dd-sword2/cfg/config.yml

CONFIGURATION

This service can be configured by changing the settings in config.yml. See the comments in that file for more information.

BUILDING FROM SOURCE

Prerequisites:

  • Java 11 or higher
  • Maven 3.3.3 or higher
  • RPM

Steps:

git clone https://github.com/DANS-KNAW/dd-sword2.git
cd dd-sword2 
mvn clean install

If the rpm executable is found at /usr/local/bin/rpm, the build profile that includes the RPM packaging will be activated. If rpm is available, but at a different path, then activate it by using Maven's -P switch: mvn -Prpm install.

Alternatively, to build the tarball execute:

mvn clean install assembly:single