easy-fedora-to-bag

Retrieves a dataset from Fedora and transforms it to an AIP bag conforming to DANS-BagIt-Profile v0

SYNOPSIS

easy-fedora-to-bag {-d <dataset-id> | -i <dataset-ids-file>} -o <output-dir> [-s] [-l <log-file>] [-e | -p] -f { AIP | SIP } <transformation>

DESCRIPTION

Tool for exporting datasets from Fedora and constructing Archival/Submission Information Packages. An AIP is a DANS-V0 bag, a SIP is a directory with a bag and a deposit.properties file.

ARGUMENTS

 -d, --datasetId  <arg>       A single easy-dataset-id to be transformed. Use either this or the input-file
                              argument
 -e, --europeana              If provided, only the largest pdf/image will be selected as payload.
 -i, --input-file  <arg>      File containing a newline-separated list of easy-dataset-ids to be transformed.
                              Use either this or the dataset-id argument
 -l, --log-file  <arg>        The name of the logfile in csv format. If not provided a file
                              easy-fedora-to-bag-<timestamp>.csv will be created in the home-dir of the user.
                              (default = /home/vagrant/easy-fedora-to-bag-2020-02-02T20:20:02.000Z.csv)
 -p, --no-payload             If provided, no payload files will be exported, i.e. only the metadata is
                              present in the bag.
 -o, --output-dir  <arg>      Empty directory that will be created if it doesn't exist. Successful bags (or 
                              packages) will be moved to this directory.
 -f, --output-format  <arg>   Output format: AIP, SIP. 'SIP' is only implemented for simple, it creates the
                              bags one directory level deeper. easy-bag-to-deposit completes these sips with
                              deposit.properties
 -s, --strict                 If provided, the transformation will check whether the datasets adhere to the
                              requirements of the chosen transformation.
 -h, --help                   Show help message
 -v, --version                Show version of this program

trailing arguments:
 transformation (required)   The type of transformation used: simple, thematische-collectie,
                             original-versioned, fedora-versioned.

EXAMPLES

$ easy-fedora-to-bag -d easy-dataset:1001 -o ~/stagedAIPs -f AIP simple
    Creates a directory in '~/stagedAIPs'. This directory is an AIP bag that has the UUID as its directory name
    and contains all relevant information from 'easy-dataset:1001' using the 'simple' transformation.

$ easy-fedora-to-bag -d easy-dataset:1001 -s -o ~/stagedAIPs -f AIP simple
    easy-dataset:1001 is transformed according to the simple transformation, 
    but only if it fulfils the requirements. The AIP bag is generated in directory '~/stagedAIPs'.

$ easy-fedora-to-bag -s -i dataset_ids.txt -o ./stagedAIPs -l ./outputLogfile.csv -f AIP simple
    Creates a bag in './stagedAIPs' for each dataset in 'dataset_ids.txt' using the 'simple' transformation.
    If a dataset does not adhere to the 'simple' requirements, or is not deposited by 'testDepositor',
    it will not be considered and an explanation will be recorded in 'outputLogfile.csv'.

$ easy-fedora-to-bag -i dataset_ids.txt -f SIP -o ./stagedSIPs -e simple
    Creates bags for all dataset-ids in dataset_ids.txt using the 'simple' transformation.
    The payload consists of only one file, the largest PDF or image in the datasets.

$ easy-fedora-to-bag -i dataset_ids.txt fedora-versioned
    Dry run: each line in the log-file will contain one or more dataset IDs.
    The first dataset on a line is the first version of the remaining datasets on that line.

$ easy-fedora-to-bag -i dataset_ids.txt -f SIP -o ./stagedSIPs fedora-versioned
    Takes the output of a dry run to create simple bags.

RESULTING FILES

A <log-file> is generated in csv format with the following headers:

easy-dataset-id  input easy-dataset-id
UUID             UUID created for the resulting package of the specified output format
doi              doi as it appears in the EMD
depositor        EASY-User-Account of the depositor of the dataset
transformation   transformation used
comments         if the dataset does not conform to the transformation-requirements, it is chronicled here
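
As a rough illustration of how the log could be inspected afterwards, the Python sketch below lists the datasets that have something in the comments column. It assumes the file is comma-separated and starts with a header row using exactly the names above; the function name datasets_with_comments and the sample rows are made up for this example.

```python
import csv
import io


def datasets_with_comments(log_file):
    """Return (easy-dataset-id, comments) pairs for rows with a non-empty comment.

    Assumption: comma-separated, first row holds the header names listed above.
    """
    reader = csv.DictReader(log_file)
    return [
        (row["easy-dataset-id"], row["comments"])
        for row in reader
        if (row.get("comments") or "").strip()
    ]


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the data.
    sample = io.StringIO(
        "easy-dataset-id,UUID,doi,depositor,transformation,comments\n"
        "easy-dataset:1,,10.17026/example-1,user001,simple,example comment\n"
        "easy-dataset:2,00000000-0000-0000-0000-000000000000,"
        "10.17026/example-2,user001,simple,\n"
    )
    for dataset_id, comment in datasets_with_comments(sample):
        print(dataset_id, comment)
```

Whether a successfully transformed dataset always leaves the comments column empty is not stated here, so treat the result as a list of rows worth a look rather than a list of failures.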

For every dataset in the input a bag-directory is created (or more in case of a versioned transformation):

  • in case of AIP: <output-dir>/<bag-UUID>
  • in case of SIP: <output-dir>/<package-UUID>/<bag-UUID>
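
The two layouts differ only in depth, which the following Python sketch uses to collect the bag-directories from an output directory. It is an illustration, not part of the tool; the function name bag_directories is hypothetical, and it assumes the output directory contains nothing besides the generated bags or packages.

```python
from pathlib import Path


def bag_directories(output_dir, output_format):
    """Return the bag directories found under output_dir.

    AIP: <output-dir>/<bag-UUID>
    SIP: <output-dir>/<package-UUID>/<bag-UUID>
    """
    root = Path(output_dir)
    if output_format == "AIP":
        pattern = "*"      # bags sit directly in the output directory
    elif output_format == "SIP":
        pattern = "*/*"    # bags sit one directory level deeper
    else:
        raise ValueError(f"unknown output format: {output_format}")
    # Plain files (such as a deposit.properties in a package) are skipped.
    return sorted(p for p in root.glob(pattern) if p.is_dir())
```

For example, bag_directories("./stagedSIPs", "SIP") would return one path per bag, each nested inside its package directory.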

In case of problems, (possibly incomplete) bags or packages may be left behind in the directory configured as staging.dir in the application properties.

The bag-directories contain the transformed metadata and data, laid out according to the DANS-BagIt-Profile. The bag-info.txt file will contain at least:

EASY-User-Account: ...
Created: ...
Payload-Oxum: ...
Bagging-Date: ...
Bag-Size: ...
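
A quick way to check a generated bag for these entries is sketched below in Python. It treats bag-info.txt as plain "Key: value" lines, with indented lines continuing the previous value; the names parse_bag_info and missing_keys are made up for this example, and repeated keys simply overwrite each other in this simplified version.

```python
REQUIRED_KEYS = (
    "EASY-User-Account",
    "Created",
    "Payload-Oxum",
    "Bagging-Date",
    "Bag-Size",
)


def parse_bag_info(text):
    """Parse 'Key: value' lines into a dict (last occurrence of a key wins)."""
    info = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last_key is not None:
            # Indented line: continuation of the previous value.
            info[last_key] += " " + line.strip()
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line; skipped in this sketch
        last_key = key.strip()
        info[last_key] = value.strip()
    return info


def missing_keys(info):
    """Return the required keys that are absent from a parsed bag-info."""
    return [key for key in REQUIRED_KEYS if key not in info]
```

Only the first colon separates key from value, so timestamps and URNs in the value are kept intact.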

TRANSFORMATIONS

simple

A simple transformation transforms the dataset, with no consideration of other datasets, 'thematische collecties' or external storage.
With the option --strict the transformation will check that the input dataset conforms to the following requirements. The dataset

  • has a DANS-DOI
  • is PUBLISHED
  • is REQUEST_PERMISSION or OPEN_ACCESS
  • has no Jumpoff page (i.e. there is no dans-jumpoff object in Fedora with an isJumpoffPageFor relation to this dataset)
  • has no replaces or isVersionOf relation that references a DANS-DOI, DANS-URN or easy-dataset-id
  • is not a thematische collectie (i.e. the title does not contain thematische collectie)
  • is not in the vault already (i.e. check in easy-bag-index)
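
To make the list above concrete, here is a Python sketch that expresses the requirements as a series of checks. It does not reflect how the tool implements --strict: the DatasetFacts record and all of its field names are hypothetical, and matching the title case-insensitively is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DatasetFacts:
    """Hypothetical summary of one dataset; not a structure used by the tool."""
    has_dans_doi: bool
    state: str                     # e.g. "PUBLISHED"
    access_right: str              # e.g. "OPEN_ACCESS"
    has_jumpoff: bool
    references_dans_version: bool  # replaces/isVersionOf pointing at a DANS id
    title: str
    in_vault: bool


def strict_violations(d):
    """Return the reasons why d fails the requirements listed above."""
    allowed_access = ("REQUEST_PERMISSION", "OPEN_ACCESS")
    checks = [
        (d.has_dans_doi, "no DANS-DOI"),
        (d.state == "PUBLISHED", "not PUBLISHED"),
        (d.access_right in allowed_access, "access right not allowed"),
        (not d.has_jumpoff, "has a jumpoff page"),
        (not d.references_dans_version,
         "replaces/isVersionOf references a DANS id"),
        # Assumption: the title match ignores case.
        ("thematische collectie" not in d.title.lower(),
         "is a thematische collectie"),
        (not d.in_vault, "already in the vault"),
    ]
    return [reason for ok, reason in checks if not ok]
```

An empty result means the dataset would pass all seven checks in this simplified model.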

original-versioned

An original-versioned transformation transforms the dataset into two bags if an original folder exists and at least one file lies outside this folder(*). The first bag will contain the content of the original folder. The second bag will contain the accessible files from the original folder and all remaining files.

Output for a dataset that meets the conditions to produce two bags:

<output-dir>/<package-1-UUID>/<bag-1-UUID>/bag-info.txt
<output-dir>/<package-2-UUID>/<bag-2-UUID>/bag-info.txt

The bag-info.txt of the second bag will have additional content that refers to the first bag, for example:

Is-Version-Of: urn:uuid:<package-1-UUID>
Base-DOI: 10.17026/test-Iiib-z9p-4ywa
Base-URN: urn:nbn:nl:ui:13-00-1haq
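
The Is-Version-Of value can be traced back to the first package with a few lines of Python. This is a sketch under the assumption that the value always has the urn:uuid: form shown above; the function name base_package_uuid is made up for this example.

```python
import uuid

URN_PREFIX = "urn:uuid:"


def base_package_uuid(is_version_of):
    """Extract the UUID of the first package from an Is-Version-Of value."""
    if not is_version_of.startswith(URN_PREFIX):
        raise ValueError(f"not a urn:uuid value: {is_version_of!r}")
    # uuid.UUID raises ValueError if the remainder is not a valid UUID.
    return uuid.UUID(is_version_of[len(URN_PREFIX):])
```

Given the layout shown earlier, the first package would then be expected at <output-dir>/<package-1-UUID>, using the string form of the returned UUID.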

fedora-versioned

A fedora-versioned transformation takes several dataset-ids that are meant to be versions of each other, and creates a bag-sequence in the given order. Each line of the input-file should contain a list of dataset-ids, in the correct order, first version first.
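
The shape of such an input file can be sketched in Python as follows. The separator between the dataset-ids on a line is not specified here; this example assumes commas, matching the CSV output of a dry run, and the function name read_sequences is hypothetical.

```python
def read_sequences(lines):
    """Turn input lines into bag-sequences, first version first.

    Assumption: dataset-ids on one line are separated by commas.
    """
    sequences = []
    for line in lines:
        ids = [part.strip() for part in line.split(",") if part.strip()]
        if ids:  # skip blank lines
            sequences.append(ids)
    return sequences


if __name__ == "__main__":
    # Made-up ids, only to show the structure.
    example = ["easy-dataset:1,easy-dataset:2\n", "easy-dataset:3\n"]
    for sequence in read_sequences(example):
        first, later = sequence[0], sequence[1:]
        print(first, "is the first version of", later)
```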

Omitting the option output-dir implies a dry run. In that case the CSV file will have one bag-sequence per line. A sequence may contain datasets not in the input if referenced by datasets in the input. Datasets not in the input that reference a sequence but are not referenced by a sequence won't appear.

Except for the first dataset of each input line, bag-info.txt will have additional content (like for original-versioned) to refer to the package of the first dataset on the same input line.

INSTALLATION AND CONFIGURATION

Currently this project is built only as an RPM package for RHEL7/CentOS7 and later. The RPM will install the binaries to /opt/dans.knaw.nl/easy-fedora-to-bag and the configuration files to /etc/opt/dans.knaw.nl/easy-fedora-to-bag.

BUILDING FROM SOURCE

Prerequisites:

  • Java 8 or higher
  • Maven 3.3.3 or higher
  • RPM

Steps:

    git clone https://github.com/DANS-KNAW/easy-fedora-to-bag.git
    cd easy-fedora-to-bag
    mvn clean install

If the rpm executable is found at /usr/local/bin/rpm, the build profile that includes the RPM packaging will be activated. If rpm is available but at a different path, activate the profile with Maven's -P switch: mvn -Prpm install.