Development

This page contains information for developers about how to contribute to this project.

Setting up the development environment

Poetry initialization

The project uses poetry as its build system. If you don't have it installed yet, install poetry for your user with:

python3 -m pip install --user poetry

After poetry is installed, change directory to the project root and execute:

poetry install

This will install the project and its dependencies in the poetry virtual environment for the project.

Testing commands with poetry

After running poetry install, you can test the commands provided by the module by prepending poetry run to the command line, e.g.:

poetry run dv-banner list

The mapping from command to the function that implements it is defined in pyproject.toml, in the tool.poetry.scripts section.

For more information about how to use poetry, see the poetry documentation.

Debugging commands in PyCharm

Poetry provides no support for debugging. To debug a command in PyCharm, create a new run configuration by right-clicking on the corresponding entry-point script and selecting "Modify Run Configuration". In the dialog that appears, you can specify the command line parameters. After saving the configuration you can start it by clicking the green arrow or bug icon in the toolbar. (This should all be familiar to you if you have used an IDE before, but it is different from the way we work in Java projects, where the program is started with the scripts in dans-dev-tools and you then attach a debugger.)

Working directory

By default PyCharm will use the directory of the entry-point script as the working directory. This means that the configuration file (.dans-datastation-tools.yml) in the root of the project will not be found. Instead, a new configuration file will be created in the directory of the entry-point script. This may be confusing if you are not aware of it, because commands started with poetry run will still use the configuration file in the root of the project. To avoid this, change the working directory in the run configuration to the root of the project.

Adding to this documentation site

On occasion, you may want to add to this documentation site. It is important to test that your changes look good before you commit them. To do this, you can use the start-mkdocs.sh script in the dans-dev-tools project. (See dans-dev-tools: start-mkdocs.sh.)

start-mkdocs.sh

Edit the files in docs and browse to or refresh http://127.0.0.1:8080 to view your changes.

Note that here we are using a separate virtual environment. This way we don't get the dependencies for dans-datastation-tools and the doc site confused.

User interface

The user interface of dans-datastation-tools consists of a set of commands that can be executed from the command line. In order to make the user interface consistent, the following guidelines should be followed.

Command names

Each command targets a specific object type (e.g. a dataset or a file) and performs an action on that object (e.g. create or read). The general pattern for a command name is:

<object-type>-<action>

For example:

dv-dataset-publish # object-type: dv-dataset, action: publish

Subcommands

In some cases the action is a subcommand. This is an inconsistency at the moment. We may want to change this in the future, either by making all actions subcommands or by making all actions regular commands.

The following object types are currently supported: dans-bag, dv-dataset, dv-user, dv-banner, ingest-flow. This list may be extended in the future.

Command line parameters

There are two types of parameters:

  • Positional parameters
  • Named parameters

The input that identifies the object to perform the action on is always a positional parameter. This can be, for example, a dataset identifier or a file path, or a file containing a list of such identifiers to be processed in batch mode.

Named parameters are used to modify the action or provide additional input. They can be optional or required.
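As an illustration of these two parameter types, here is a minimal sketch using argparse from the Python standard library. The command name dv-example-action and all parameter names are made up for this example; they are not taken from an existing command.

```python
import argparse


def build_parser():
    # "dv-example-action" and every parameter below are hypothetical.
    parser = argparse.ArgumentParser(prog="dv-example-action")
    # Positional parameter: identifies the object, or a file listing
    # objects to be processed in batch mode.
    parser.add_argument(
        "pid_or_pids_file",
        help="dataset identifier, or a file with one identifier per line",
    )
    # Optional named parameter: modifies the action and has a default.
    parser.add_argument(
        "--wait-between-items",
        type=float,
        default=1.0,
        help="seconds to wait between items in batch mode",
    )
    # Required named parameter: additional input without a sensible default.
    parser.add_argument("--reason", required=True, help="reason for the action")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["doi:10.5072/FK2/EXAMPLE", "--reason", "demo"])
print(args.pid_or_pids_file, args.wait_between_items, args.reason)
```

Note that argparse turns the hyphens in a named parameter into underscores on the parsed result, so --wait-between-items is read as args.wait_between_items.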

Output

  • Output that is not too long should be sent to the standard output.
  • For commands that are likely to produce longer output, a named parameter --output-file should be included, with the default value - (meaning the standard output).
  • When batch-processing a list of objects, it is often useful to have a report of the results. This can be done with the --report-file parameter, with the default value - (meaning the standard output).
  • Messages about the status of the program should be sent to the standard error, for example error messages, progress messages, etc.
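The output guidelines above could be implemented along the following lines. This is only a sketch: the command name dv-example-list, the helper open_output and the placeholder data are assumptions made for this example, not existing project code.

```python
import argparse
import sys
from contextlib import contextmanager


@contextmanager
def open_output(path):
    """Yield sys.stdout when path is "-", otherwise a newly opened text file."""
    if path == "-":
        yield sys.stdout  # the standard output must not be closed
    else:
        with open(path, "w", encoding="utf-8") as stream:
            yield stream


def build_parser():
    parser = argparse.ArgumentParser(prog="dv-example-list")  # hypothetical command
    parser.add_argument("--output-file", default="-",
                        help="file to write the output to, - for standard output")
    parser.add_argument("--report-file", default="-",
                        help="file to write the batch report to, - for standard output")
    return parser


def run(argv):
    args = build_parser().parse_args(argv)
    items = ["item-1", "item-2"]  # placeholder data
    # Status messages go to the standard error, not the standard output.
    print(f"Processing {len(items)} items", file=sys.stderr)
    with open_output(args.output_file) as out:
        for item in items:
            out.write(f"{item}\n")
    with open_output(args.report_file) as report:
        report.write(f"processed: {len(items)}\n")


run([])  # with the defaults, output and report both go to the standard output
```

Because status messages go to the standard error, the standard output stays clean and can be piped to another program or redirected to a file.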

Implementation

Commands and entry-point scripts

Commands each have their own entry-point script in the root package datastation. They must all have a main function and a __main__ section that calls that function. The latter is needed so that you can debug the command in PyCharm.
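A minimal sketch of the shape described above is shown here. The body of main is a placeholder; a real entry-point script would parse its command-line parameters and delegate the actual work to one of the subpackages.

```python
def main():
    # Placeholder body: a real command would parse its parameters
    # and call into a subpackage here.
    print("Hello from a hypothetical entry-point script")


if __name__ == "__main__":
    # Lets the script be started directly, for example from a
    # PyCharm run configuration, without going through poetry.
    main()
```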

The main function is mapped to the command name in pyproject.toml, in the tool.poetry.scripts section. The name of the entry-point script is the same as the command name, with - replaced by _ and a .py extension added. For example, the dv-dataset-publish command is implemented (by the main function) in the dv_dataset_publish.py script.
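The naming rule can be written down as two small helper functions. These helpers are hypothetical and only serve to illustrate the convention. The module:function form of the entry-point reference is the format poetry uses in the tool.poetry.scripts section; that the module sits directly in the datastation package follows from the description above.

```python
def script_name(command):
    """Return the file name of the entry-point script for a command."""
    module = command.replace("-", "_")
    return f"{module}.py"


def entry_point(command, package="datastation"):
    """Return the module:function reference for a command's main function."""
    module = command.replace("-", "_")
    return f"{package}.{module}:main"


print(script_name("dv-dataset-publish"))
print(entry_point("dv-dataset-publish"))
```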

The entry-point scripts are not meant to be imported by other modules. Their only purpose is to provide a command-line interface, and they should do as little else as possible.

Subpackages

Most commands talk to a remote service, e.g. a Dataverse server. The code that talks to the remote service is in a dedicated subpackage, e.g. dataverse for the Dataverse server. There is also a common subpackage that contains functionality shared by all commands, as well as utilities that are not specific to a remote service.

Configuration

There is one configuration file, in YAML format, with a section for each targeted service. An object that needs a specific section takes a dictionary containing only that section as a parameter, e.g.:

from datastation.dataverse.dataverse_client import DataverseClient
from datastation.common.config import init

config = init()
dataverse_client = DataverseClient(config['dataverse'])

Note that DataverseClient does not know about the other sections in the configuration file. On the other hand, it does not receive each setting as a separate argument either, which avoids having to pass every parameter individually to the constructor of the client. This style is intended to strike a balance between the two extremes.
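To illustrate the receiving side of this pattern, here is a minimal sketch of a client that takes only its own section. ExampleClient and the keys server_url, api_token and service_baseurl are hypothetical; they are not the actual DataverseClient or the actual settings in the configuration file. The dictionary stands in for the parsed YAML file.

```python
# Stand-in for the parsed YAML configuration; all keys are made up.
config = {
    "dataverse": {
        "server_url": "https://dataverse.example.org",
        "api_token": "not-a-real-token",
    },
    "ingest_flow": {
        "service_baseurl": "http://localhost:20000",
    },
}


class ExampleClient:
    """Hypothetical client that only sees its own configuration section."""

    def __init__(self, section):
        # The client picks what it needs from its section, so the caller
        # does not have to pass every setting as a separate argument.
        self.server_url = section["server_url"]
        self.api_token = section["api_token"]


client = ExampleClient(config["dataverse"])
print(client.server_url)
```

If a new setting is added to the section, only the client changes; the code that constructs the client stays the same.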

Code style

Code formatting

Format the code with PyCharm's code formatter.

Unit tests

Unit tests should go under src/tests. The test files should be named test_<module>.py and the test classes should be named Test<Module/Class/Function>. There can be multiple test classes in a test file.
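A test file following these naming conventions might look like the sketch below. The function strip_doi_prefix is a made-up example; in the project, the function under test would be imported from its module instead of being defined in the test file. Plain assert statements are used on the assumption that the tests are run with a runner such as pytest.

```python
# Hypothetical function under test; normally imported from its own module.
def strip_doi_prefix(pid):
    """Return the identifier without a leading "doi:" prefix."""
    prefix = "doi:"
    return pid[len(prefix):] if pid.startswith(prefix) else pid


# Contents of a test file named after the module, e.g. test_<module>.py.
class TestStripDoiPrefix:
    def test_removes_prefix(self):
        assert strip_doi_prefix("doi:10.5072/FK2/EXAMPLE") == "10.5072/FK2/EXAMPLE"

    def test_leaves_other_identifiers_untouched(self):
        assert strip_doi_prefix("10.5072/FK2/EXAMPLE") == "10.5072/FK2/EXAMPLE"
```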

String interpolation

Use the following syntax for string interpolation:

name = "John"
f"Hello {name}"

Do not use the old % syntax, the .format() method, or string concatenation with + or +=. Use string interpolation instead.