dd-dans-sword2-examples
Examples for creating a SWORD2 Java client to deposit datasets to a DANS Data Station or the DANS Vault as a Service.
SYNOPSIS
# Build the code
mvn clean install
# Validate a bag
./run-validation.sh https://demo.sword2.domain.datastations.nl/validate-dans-bag bag
# Deposit a bag in one chunk
./run-simple-deposit.sh https://demo.sword2.domain.datastations.nl/collection/1 myuser \
mypassword bag
# Deposit an update to a dataset in one chunk using the SWORD token of the targeted dataset
./run-simple-deposit.sh https://demo.sword2.domain.datastations.nl/collection/1 myuser \
mypassword bag urn:uuid:5c90d501-0bdf-4183-a96d-25fa6dea5489
# Deposit a bag in chunks of configurable size
./run-continued-deposit.sh https://demo.sword2.domain.datastations.nl/collection/1 myuser \
mypassword chunksize bag
# Deposit a sequence of bags, the first one being a new dataset, the others being updates to
# this dataset, each in one chunk
./run-sequence-simple-deposit.sh https://demo.sword2.domain.datastations.nl/collection/1 \
myuser mypassword bag1 bag2 bag3
# Deposit a sequence of bags, the first one being a new dataset, the others being updates to
# this dataset, each in chunks of configurable size
./run-sequence-continued-deposit.sh SequenceContinued https://demo.sword2.domain.datastations.nl/collection/1 \
myuser mypassword chunksize bag1 bag2 bag3
DESCRIPTION
This project contains two important resources for developers who are tasked with the creation or maintenance of a SWORD2 client that deposits datasets to one of the DANS Data Stations or the Vault as a Service:
- Example Java client code
- Examples of bags that conform to the DANS BagIt Profile v1 requirements (and, for illustration, some that violate some of the requirements).
Looking for legacy EASY SWORD2 examples?
This project contains examples for the SWORD2 interface of the new DANS Data Stations. For the legacy EASY SWORD2 service see easy-sword2-dans-examples.
Migrating from EASY SWORD2 to Data Station SWORD2
If you are an existing customer who is migrating an EASY SWORD2 client to a Data Station SWORD2 client, please read Migrating from EASY after you have read the current page.
Data Station vs Vault as a Service
Clients can deposit to either a Data Station or the Vault as a Service (VaaS). The protocol is largely the same. The differences are highlighted in the text with the notes (VaaS) and (Data Station).
SWORD2 in a nutshell
Depositing to the DANS Archive via SWORD2 is basically a two-phase process:
- Submitting a deposit for ingest.
- Tracking the state of the deposit as it goes through the ingest-flow, until it reaches PUBLISHED or ACCEPTED status.
The following diagram details this a bit further.
1. Client creates a deposit package (conforming to DANS BagIt Profile v1).
2. Client sends deposit package to SWORD2 Service, getting back a URL to track the deposit's state.
3. SWORD2 Service unzips and validates deposit.
4. Ingest Flow performs checks and transformations and creates a dataset in the Data Station Repository (Data Station) or submits an archival package directly to the DANS Data Vault (VaaS).
5. Ingest Flow reports back success or failure to SWORD2 Service.
During steps 3-5 the Client periodically checks the deposit state through the URL received in step 2. If the final state of PUBLISHED (Data Station) / ACCEPTED (VaaS) is reached, the process is concluded successfully. At this point the deposit has published a new dataset (or a new version of an existing dataset) in the Data Station repository (Data Station only) and submitted an archival package to the DANS Data Vault for processing.
Other outcomes may be INVALID (the bag was invalid according to the BagIt specs) or REJECTED (the additional requirements of DANS BagIt Profile v1 were not met). In case the server encountered an unknown error FAILED will be returned.
Getting started
The following is a step-by-step instruction on how to run a simple example using the Data Station's demo server.
Getting access to the demo server
Agreement
Before you can get access to the demo server, there must be a formal agreement between your organization and DANS. The following assumes that this agreement is in place. If it is not, please contact the Data Station Manager of the Data Station that you want to deposit to.
- From your Data Station Manager at DANS request access to the demo Data Station server. The Data Station Manager will provide the information necessary to connect (see note about the X-Authorization header below).
- Create an account in the demo Data Station (Data Station only).
- From your Data Station Manager at DANS request the account to be enabled for SWORD2 deposits (Data Station only).
X-Authorization header
In order to keep web crawlers from accessing the demo server, the demo server requires the X-Authorization header. This is in addition to the authentication required by the SWORD2 server. The value of the X-Authorization header is provided by the Data Station Manager. It is the same for all users of the demo server. Put the value in a file called x-auth-value.txt in the root of your clone of this project. The method Common.addXAuthorizationToRequest() will read the value from this file and add it to the request.
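The file-based lookup described in this note can be sketched roughly as follows. This is not the project's Common.addXAuthorizationToRequest() method. The class name XAuthSketch, its method names and the use of the JDK's java.net.http client are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of this project's Common class.
final class XAuthSketch {

    // Reads the shared value from the file and strips surrounding whitespace
    // (editors usually leave a trailing newline in x-auth-value.txt).
    static String readXAuthValue(Path file) {
        try {
            return Files.readString(file, StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + file, e);
        }
    }

    // Starts a request builder for the given URI with the X-Authorization header set.
    // Authentication headers for the SWORD2 server still have to be added separately.
    static HttpRequest.Builder withXAuthorization(URI uri, Path file) {
        return HttpRequest.newBuilder(uri)
            .header("X-Authorization", readXAuthValue(file));
    }
}
```

A caller would pass Path.of("x-auth-value.txt") when running from the root of the clone and continue configuring the returned builder.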
Configuring which notifications to receive
The Data Station repository (Dataverse) generates notifications for many events. Most of these can be muted. Log in via the user interface and open the account menu on the top right. Click on the Notifications item. The Notifications tab of your Account page will now be opened. Expand the header Notification settings and uncheck the notification types you do not wish to receive.
Depositing your first dataset
Running the SimpleDeposit example
- Clone and build this project:
  git clone https://github.com/DANS-KNAW/dd-dans-sword2-examples
  cd dd-dans-sword2-examples
  mvn clean install
- Execute the following command from the base directory of your clone:
  ./run-simple-deposit.sh https://demo.sword2.<domain>.datastations.nl/collection/1 <user> <password> <bag>
  Fill in:
  - for <domain> the name of the Data Station that you are depositing to, one of archaeology, ssh, lhms or pts;
  - for <user> your Data Station account name;
  - for <password> the password of your Data Station account;
  - for <bag> any of the bags in src/main/resources/example-bags, for example src/main/resources/example-bags/valid/audiences.
This will run the example program nl.knaw.dans.sword2examples.SimpleDeposit, which will copy the example bag to the folder target (the Maven build folder), zip it and send it to https://demo.sword2.<domain>.datastations.nl/collection/1, authenticating with the provided username and password using basic auth.
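To give an impression of what such a single-chunk deposit request looks like, here is a minimal sketch using the JDK's java.net.http API. It is not the code of SimpleDeposit. The class and method names are hypothetical, the file name bag.zip is illustrative, and the header set follows the general SWORD 2.0 profile (Content-Type, Content-Disposition, Content-MD5, Packaging, In-Progress). Which of these headers the DANS service actually requires is an assumption that should be checked against the example programs.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical sketch of a one-chunk SWORD2 deposit request; names are illustrative.
final class SimpleDepositSketch {

    // Builds the value of the HTTP Basic Authorization header.
    static String basicAuth(String user, String password) {
        byte[] credentials = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(credentials);
    }

    // Lower-case hexadecimal MD5 digest of the payload.
    static String md5Hex(byte[] data) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assembles (but does not send) a POST of the zipped bag to the collection URL.
    static HttpRequest buildDeposit(URI collection, String user, String password, byte[] zippedBag) {
        return HttpRequest.newBuilder(collection)
            .header("Authorization", basicAuth(user, password))
            .header("Content-Type", "application/zip")
            .header("Content-Disposition", "attachment; filename=bag.zip")
            .header("Content-MD5", md5Hex(zippedBag))
            .header("Packaging", "http://purl.org/net/sword/package/BagIt")
            .header("In-Progress", "false") // the whole bag is sent in this one request
            .POST(HttpRequest.BodyPublishers.ofByteArray(zippedBag))
            .build();
    }
}
```

The request can then be sent with java.net.http.HttpClient; on the demo server the X-Authorization header has to be added as well.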
Authenticating with X-Dataverse-key (Data Station only)
Instead of using username and password you can also authenticate using your API-token (also known as API-key). You can look up your current API-token in your account settings in the Data Station user interface. The API-token is specified using the header X-Dataverse-key. To pass it to the example programs using the run-*-deposit.sh scripts, specify the literal string API_KEY instead of your user name and the API-token instead of your password.
Note that using the API-token is the only way to authenticate if your Data Station account is using an external identity provider such as SURFconext or Google.
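The convention of passing API_KEY as the user name could be handled in client code along the following lines. This is a sketch under assumptions: the class AuthHeaderSketch and its method are hypothetical and do not reproduce how the example programs implement the switch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;

// Hypothetical helper that picks the authentication header from the script arguments.
final class AuthHeaderSketch {

    // Literal user name that signals "the second argument is an API-token".
    static final String API_KEY_MARKER = "API_KEY";

    // Returns the header name and value to add to every request.
    static Map.Entry<String, String> authHeader(String user, String secret) {
        if (API_KEY_MARKER.equals(user)) {
            // API-token authentication (Data Station only)
            return Map.entry("X-Dataverse-key", secret);
        }
        // Otherwise fall back to HTTP basic auth with user name and password
        String token = Base64.getEncoder()
            .encodeToString((user + ":" + secret).getBytes(StandardCharsets.UTF_8));
        return Map.entry("Authorization", "Basic " + token);
    }
}
```

Keeping the choice in one place means the rest of the client only deals with a header name and value, whichever authentication method is in use.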
Output analysis
In the introduction the SWORD2 deposit process is described in 5 steps. The response messages give some indication of how far the process has progressed. The output will take the following form, starting with the part of the response representing step 2. The UUID will of course be different.
SUCCESS. Deposit receipt follows:
<entry xmlns="http://www.w3.org/2005/Atom">
<generator uri="http://www.swordapp.org/" version="2.0" />
<id>https://demo.sword2.<domain>.datastations.nl/container/a5bb644a-78a3-47ae-907a-0bdf162a0cd4</id>
<link href="https://demo.sword2.<domain>.datastations.nl/container/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="edit" />
<link href="https://demo.sword2.<domain>.datastations.nl/container/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="http://purl.org/net/sword/terms/add" />
<link href="https://demo.sword2.<domain>.datastations.nl/media/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="edit-media" />
<packaging xmlns="http://purl.org/net/sword/terms/">http://purl.org/net/sword/package/BagIt</packaging>
<link href="https://demo.sword2.<domain>.datastations.nl/statement/a5bb644a-78a3-47ae-907a-0bdf162a0cd4" rel="http://purl.org/net/sword/terms/statement" type="application/atom+xml; type=feed" />
<treatment xmlns="http://purl.org/net/sword/terms/">[1] unpacking [2] verifying integrity [3] storing persistently</treatment>
<verboseDescription xmlns="http://purl.org/net/sword/terms/">received successfully: bag.zip; MD5: 494dd614e36edf5c929403ed7625b157</verboseDescription>
</entry>
Retrieving Statement IRI (Stat-IRI) from deposit receipt ...
Stat-IRI = https://demo.sword2.<domain>.datastations.nl/statement/a5bb644a-78a3-47ae-907a-0bdf162a0cd4
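Retrieving the Stat-IRI from a deposit receipt like the one above amounts to finding the Atom link whose rel attribute is http://purl.org/net/sword/terms/statement. A minimal sketch with the JDK's built-in DOM parser follows. The class name ReceiptSketch is hypothetical and the example programs may parse the receipt differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: extract the statement link from an Atom deposit receipt.
final class ReceiptSketch {

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";
    static final String STATEMENT_REL = "http://purl.org/net/sword/terms/statement";

    // Returns the href of the first statement link, or empty if the receipt has none.
    static Optional<String> statementIri(String receiptXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations; a receipt does not need them.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(receiptXml.getBytes(StandardCharsets.UTF_8)));
            NodeList links = doc.getElementsByTagNameNS(ATOM_NS, "link");
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                if (STATEMENT_REL.equals(link.getAttribute("rel"))) {
                    return Optional.of(link.getAttribute("href"));
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable deposit receipt", e);
        }
    }
}
```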
As the deposit is being processed by the server the client polls the Stat-IRI (SWORD2 Statement URI) to track the status of the deposit. During this stage steps 3 and 4 are performed.
Start polling Stat-IRI for the current status of the deposit, waiting 10 seconds before every request ...
Checking deposit status ... SUBMITTED
Checking deposit status ... SUBMITTED
Checking deposit status ... SUBMITTED
Checking deposit status ... SUBMITTED
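The polling behaviour visible in this output can be sketched as a simple loop that waits, fetches the state and stops on a final state. In the sketch below fetchState stands in for "GET the Stat-IRI and read the state from the statement". The class name, the attempt limit and the parameter names are assumptions for illustration, not the project's code.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical polling loop; fetchState is a placeholder for the real HTTP call.
final class PollingSketch {

    // States after which the deposit will not change any more.
    static final Set<String> FINAL_STATES =
        Set.of("INVALID", "REJECTED", "FAILED", "PUBLISHED", "ACCEPTED");

    // Waits delayMillis before every request and returns the first final state seen.
    static String pollUntilFinal(Supplier<String> fetchState, long delayMillis, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
            String state = fetchState.get();
            System.out.println("Checking deposit status ... " + state);
            if (FINAL_STATES.contains(state)) {
                return state;
            }
        }
        throw new IllegalStateException("No final state after " + maxAttempts + " attempts");
    }
}
```

With a delay of 10000 milliseconds this mirrors the 10 second interval mentioned in the output. The attempt limit is an addition so that a client does not poll forever.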
The 5th and final step of the process is represented by the following response messages.
Checking deposit status ... PUBLISHED
SUCCESS.
Dataset has been published as: <https://doi.org/doi:10.5072/DAR/MNGAHF>.
Dataset NBN: <https://www.persistent-identifier.nl?identifier=urn:nbn:nl:ui:13-d4cfb364-c6cc-4242-891a-e9e9673379bc>.
Bag ID for this version of the dataset: urn:uuid:ca145147-6d15-4c2b-abf0-fb1110271560
State description: The deposit was successfully ingested in the Data Station and will be automatically archived
Complete statement follows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/terms/" xmlns:ns3="http://purl.org/net/sword/">
<id>https://demo.sword2.<domain>.datastations.nl/statement/ca145147-6d15-4c2b-abf0-fb1110271560</id>
<link href="https://demo.sword2.<domain>.datastations.nl/statement/ca145147-6d15-4c2b-abf0-fb1110271560" rel="self"/>
<title type="text">Deposit ca145147-6d15-4c2b-abf0-fb1110271560</title>
<author>
<name>DANS SWORD2</name>
</author>
<updated>2023-02-18T12:03:55.966061+01:00</updated>
<entry>
<id>urn:uuid:ca145147-6d15-4c2b-abf0-fb1110271560</id>
<title type="text">Resource urn:uuid:ca145147-6d15-4c2b-abf0-fb1110271560</title>
<summary type="text">Resource Part</summary>
<content src="urn:uuid:ca145147-6d15-4c2b-abf0-fb1110271560" type="multipart/related"/>
<updated>2023-02-18T12:04:06.251424+01:00</updated>
<link href="https://doi.org/doi:10.5072/DAR/MNGAHF" rel="self"/>
<link href="https://www.persistent-identifier.nl?identifier=urn:nbn:nl:ui:13-d4cfb364-c6cc-4242-891a-e9e9673379bc" rel="self"/>
</entry>
<category label="State" scheme="http://purl.org/net/sword/terms/state" term="PUBLISHED">The deposit was successfully ingested in the Data Station and will be automatically archived</category>
</feed>
How to read this output?
- This confirms that the dataset was successfully published and is resolvable using the DOI URL https://doi.org/doi:10.5072/DAR/MNGAHF. (N.B. in the test environment the DOI will not actually resolve.) The DOI can be used to cite the dataset.
- The dataset has the URN:NBN identifier urn:nbn:nl:ui:13-d4cfb364-c6cc-4242-891a-e9e9673379bc. This can be used by the depositor to retrieve a summary from the DANS Data Vault for this dataset. The summary states for which dataset versions a long-term preservation copy exists in the Vault.
- The Bag ID for this version of the dataset is urn:uuid:ca145147-6d15-4c2b-abf0-fb1110271560. The Bag ID serves as a unique identifier for a long-term preservation package in the DANS Data Vault.
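The fields discussed above can be pulled out of the statement feed programmatically. The following sketch reads the state term and description from the category element with scheme http://purl.org/net/sword/terms/state, and collects the self links of the entries (the DOI and NBN URLs in the example output). The class StatementSketch and its field names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical value class holding the interesting parts of a SWORD2 statement.
final class StatementSketch {

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";
    static final String STATE_SCHEME = "http://purl.org/net/sword/terms/state";

    final String state;           // e.g. PUBLISHED; null if no state category was found
    final String description;     // text content of the state category
    final List<String> selfLinks; // href values of rel="self" links inside entries

    private StatementSketch(String state, String description, List<String> selfLinks) {
        this.state = state;
        this.description = description;
        this.selfLinks = selfLinks;
    }

    static StatementSketch parse(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(feedXml.getBytes(StandardCharsets.UTF_8)));

            String state = null;
            String description = null;
            NodeList categories = doc.getElementsByTagNameNS(ATOM_NS, "category");
            for (int i = 0; i < categories.getLength(); i++) {
                Element category = (Element) categories.item(i);
                if (STATE_SCHEME.equals(category.getAttribute("scheme"))) {
                    state = category.getAttribute("term");
                    description = category.getTextContent().trim();
                }
            }

            List<String> selfLinks = new ArrayList<>();
            NodeList entries = doc.getElementsByTagNameNS(ATOM_NS, "entry");
            for (int i = 0; i < entries.getLength(); i++) {
                NodeList links = ((Element) entries.item(i)).getElementsByTagNameNS(ATOM_NS, "link");
                for (int j = 0; j < links.getLength(); j++) {
                    Element link = (Element) links.item(j);
                    if ("self".equals(link.getAttribute("rel"))) {
                        selfLinks.add(link.getAttribute("href"));
                    }
                }
            }
            return new StatementSketch(state, description, selfLinks);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable statement", e);
        }
    }
}
```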
Output for a failed deposit
If you deposit a bag that does not comply with the DANS BagIt Profile v1 requirements, a state of REJECTED will be returned. For example, when you use the example bag in src/main/resources/example-bags/invalid/two-available-dates, the error will indicate that a second ddm:available element was found where a ddm:audience was expected:
Checking deposit status ... REJECTED
FAILURE. Complete statement follows:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/terms/" xmlns:ns3="http://purl.org/net/sword/">
<id>https://sword2.dar.dans.knaw.nl/statement/dde565d7-878a-4ae2-a607-0a3f55a22630</id>
<link href="https://sword2.dar.dans.knaw.nl/statement/dde565d7-878a-4ae2-a607-0a3f55a22630" rel="self"/>
<title type="text">Deposit dde565d7-878a-4ae2-a607-0a3f55a22630</title>
<author>
<name>DANS SWORD2</name>
</author>
<updated>2023-02-18T12:15:02.451030+01:00</updated>
<entry>
<id>urn:uuid:dde565d7-878a-4ae2-a607-0a3f55a22630</id>
<title type="text">Resource urn:uuid:dde565d7-878a-4ae2-a607-0a3f55a22630</title>
<summary type="text">Resource Part</summary>
<content src="urn:uuid:dde565d7-878a-4ae2-a607-0a3f55a22630" type="multipart/related"/>
<updated>2023-02-18T12:15:12.578076+01:00</updated>
</entry>
<category label="State" scheme="http://purl.org/net/sword/terms/state"
term="REJECTED">Rejected /var/opt/dans.knaw.nl/tmp/auto-ingest/inbox/dde565d7-878a-4ae2-a607-0a3f55a22630: Bag was not valid according to Profile Version 1.0.0. Violations: - [3.1.1] metadata/dataset.xml does not conform to dataset.xml:
- cvc-complex-type.2.4.a: Invalid content was found starting with element '{"http://schemas.dans.knaw.nl/dataset/ddm-v2/":available}'. One of '{"http://schemas.dans.knaw.nl/dataset/ddm-v2/":audience}' is expected.
</category>
</feed>
Statuses
The deposit will go through a number of statuses.
State | Description |
---|---|
DRAFT | The deposit is being prepared by the depositor. It is not submitted to the archive yet and still open for additional data. |
UPLOADED | The deposit is in the process of being submitted. It is waiting to be finalized. The data is completely uploaded. It will automatically move to the next stage and the status will be updated accordingly. |
FINALIZING | The deposit is in the process of being submitted. It is being checked for validity. It will automatically move to the next stage and the status will be updated accordingly. |
INVALID | The deposit is not accepted by the archive as the submitted bag is not valid. The description will detail what part of the bag is not according to specifications. The depositor is asked to fix the bag and resubmit the deposit. |
SUBMITTED | The deposit is submitted for processing. At this point the Ingest Flow is processing the deposit and will update the state when it finishes. |
FAILED | An error occurred while processing the deposit. |
REJECTED | The deposit does not meet the requirements of DANS BagIt Profile v1. The description will detail what part of the deposit is not according to specifications. The depositor is requested to fix and resubmit the deposit. |
PUBLISHED | The deposit is successfully published in the Data Station repository (Data Station). |
ACCEPTED | The deposit has been accepted for preservation in the DANS Data Vault (VaaS). |
If an error occurs the deposit will end up INVALID, REJECTED (client errors) or FAILED (server error). The text of the category element will contain details about the error.
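A client may find it convenient to model the states from the table as an enum that records whether a state is still in progress, a client error, a server error or a success. The following is one possible modelling. The names DepositState and Kind are assumptions and do not come from the example programs.

```java
// Hypothetical modelling of the deposit states listed in the table above.
enum DepositState {
    DRAFT(Kind.IN_PROGRESS),
    UPLOADED(Kind.IN_PROGRESS),
    FINALIZING(Kind.IN_PROGRESS),
    SUBMITTED(Kind.IN_PROGRESS),
    INVALID(Kind.CLIENT_ERROR),   // bag is not valid: fix and resubmit
    REJECTED(Kind.CLIENT_ERROR),  // DANS BagIt Profile v1 not met: fix and resubmit
    FAILED(Kind.SERVER_ERROR),    // error on the server side
    PUBLISHED(Kind.SUCCESS),      // Data Station
    ACCEPTED(Kind.SUCCESS);       // VaaS

    enum Kind { IN_PROGRESS, CLIENT_ERROR, SERVER_ERROR, SUCCESS }

    final Kind kind;

    DepositState(Kind kind) {
        this.kind = kind;
    }

    // A final state will not change any more, so polling can stop.
    boolean isFinal() {
        return kind != Kind.IN_PROGRESS;
    }
}
```

The term attribute of the state category can be converted with DepositState.valueOf(term). An unknown term then raises IllegalArgumentException, which makes protocol changes visible early.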
Next steps
Studying the example bags
After successfully depositing the first example to the demo repository you can start thinking about how to design your SWORD2 client. Depending on your source repository system this may take various shapes. In any case your code will need to assemble a bag conforming to DANS BagIt Profile v1. Some examples of such bags are included in the resources directory of this project.
Mapping rules
The contents of the bags you deposit are mapped to data files and metadata in Dataverse. The mapping rules are documented in the Ingest Flow Mapping Rules Google spreadsheet. If you are a DANS SWORD2 customer access will be granted on request.
Finding libraries and tools
- bagit-java: a Java library for working with bags. This is a DANS fork of a project started by Library of Congress, which is no longer maintained by them.
- bagit-python: a Python library and command line tool for working with bags, also by Library of Congress. This is still maintained by them.
- brew install bagit is still available on macOS to install an older version of bagit-java which contained a powerful command line interface, but is no longer maintained.
- xmllint: a tool to check that XML files conform to a given XML schema.
Abdera project retired
easy-sword2-dans-examples used the Apache Abdera library to parse Atom Entry and Feed documents. We have removed that dependency, because Abdera is no longer maintained and we do not recommend using unmaintained libraries.
End-point for DANS BagIt Profile validation
All bags that are deposited to a Data Station are validated by dd-validate-dans-bag to see if they conform to DANS BagIt Profile v1. To facilitate faster development in the demo environment this service can be invoked directly. The example program nl.knaw.dans.sword2examples.ValidateBag demonstrates how to call this API. A helper script to start this program is also provided, see run-validation.sh.
DO NOT make calling this API part of your production code!
Use of the validation API end-point is entirely optional during testing. The Ingest Flow will call the validation before further processing a deposit, so if it is not valid it will be rejected. That having been said, when writing the code that assembles the bag to be deposited, using the validation API end-point may shorten the Edit - Compile - Run cycle.
Testing different scenarios
This project contains four Java example programs which can be used as a guide to writing a custom client to deposit datasets using the SWORD2 protocol. The examples take one or more bags as input parameters. These bags may be directories or ZIP files. The code copies each bag to the target folder of the project, zips it (if necessary) and sends it to the specified SWORD2 service. The copying step has been built in because in some examples the bag must be modified before it is sent; this way we avoid changing the git working directory.
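The zipping step can be done with the JDK alone. The sketch below packs a bag directory into a ZIP file with the bag directory itself as the single top-level entry. That layout, the class name ZipBagSketch and the method name are assumptions. The Common class of this project has its own implementation, which may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of zipping a bag directory; not the project's Common class.
final class ZipBagSketch {

    // Writes all regular files under bagDir to zipFile and returns zipFile.
    // zipFile must be located outside bagDir, otherwise it would end up inside itself.
    static Path zipBag(Path bagDir, Path zipFile) {
        Path root = bagDir.toAbsolutePath().normalize();
        Path parent = root.getParent(); // entry names are made relative to this directory
        try (Stream<Path> walk = Files.walk(root);
             ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            List<Path> files = walk.filter(Files::isRegularFile)
                .sorted()
                .collect(Collectors.toList());
            for (Path file : files) {
                // ZIP entry names always use forward slashes, also on Windows
                String name = parent.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
            return zipFile;
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot zip " + bagDir, e);
        }
    }
}
```

Writing the ZIP file to the Maven target folder, as the examples do, keeps generated files out of the git working directory.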
- SimpleDeposit.java sends a zipped dataset in a single chunk and reports on the status.
- ContinuedDeposit.java sends a zipped bag in chunks of configurable size and reports on the status.
- SequenceSimpleDeposit.java calls the SimpleDeposit class multiple times to send multiple bags belonging to a sequence, the first bag being a new dataset and subsequent bags being updates (new versions) of this dataset.
- SequenceContinuedDeposit.java calls the ContinuedDeposit class multiple times to send multiple bags belonging to a sequence, the first bag being a new dataset and subsequent bags being updates (new versions) of this dataset.
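For the continued (chunked) variants, the core of the client logic is cutting the zipped bag into pieces of the configured size and telling the server whether more pieces will follow. The sketch below shows only that part. The class ChunkSketch is hypothetical, and the assumption that every chunk except the last is sent with the SWORD2 In-Progress header set to true follows the general SWORD 2.0 profile rather than anything stated on this page. See ContinuedDeposit.java for the actual requests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the chunking logic behind a continued deposit.
final class ChunkSketch {

    // Splits data into consecutive chunks of at most chunkSize bytes.
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int from = 0; from < data.length; from += chunkSize) {
            int to = Math.min(data.length, from + chunkSize);
            chunks.add(Arrays.copyOfRange(data, from, to));
        }
        return chunks;
    }

    // Value for the In-Progress header of chunk number index (0-based):
    // true for every chunk except the last one.
    static boolean inProgress(int index, int totalChunks) {
        return index < totalChunks - 1;
    }
}
```

For large bags a real client would read the chunks from the ZIP file as a stream instead of holding the whole file in memory.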
The Common.java class contains elements which are used by all the other classes. This includes parsing, zipping and sending of files.
The project root directory contains several helper scripts (run-*.sh) that can be used to invoke the Java programs. See SYNOPSIS. These scripts were developed to run in a bash or zsh shell, but should be easy to adapt for a different shell environment.
EXAMPLES
- Java Example programs.
- Example bags can be found in the resources directory.
BUILDING FROM SOURCE
Prerequisites:
- Java 11 or higher
- Maven 3.3.3 or higher
Steps:
git clone https://github.com/DANS-KNAW/dd-dans-sword2-examples.git
cd dd-dans-sword2-examples
mvn clean install