zea.data.convert.utils

Functions

check_output_dir_ownership(folder, repo_id)

Raise if folder already contains data from a different dataset.

download_file(url, destination)

Download a file from a URL to a local path.

download_from_girder(collection_id, ...[, ...])

Download a dataset from the Girder server.

load_avi(file_path[, mode])

Load a .avi file and return a numpy array of frames.

require_output_dir_ownership(folder, repo_id)

Raise if folder does not contain a verified dataset card for repo_id.

sitk_load(filepath[, squeeze])

Load a NIfTI/medical image using SimpleITK and return the array and metadata.

unzip(src, dataset)

Check whether the data folder exists in src; otherwise unzip dataset.zip in src.

upload_dataset_to_hf(folder, repo_id, revision)

Upload a converted dataset to a Hugging Face Hub revision branch.

write_dataset_card(folder, card_content)

Write a Hugging Face dataset card (README.md) into folder.

zea.data.convert.utils.check_output_dir_ownership(folder, repo_id)[source]

Raise if folder already contains data from a different dataset.

The check is based on the zea_repo_id field written into the dataset card (README.md) by each converter. A directory is considered owned by a specific dataset when its README.md contains zea_repo_id: <repo_id>.

  • Empty or non-existent directory → passes (first-time run).

  • Directory with matching README.md → passes (re-run of same dataset).

  • Directory with mismatched README.md → raises FileExistsError.

  • Directory with HDF5 files but no README.md → raises FileExistsError.

Parameters:
  • folder (str | Path) – Output directory to inspect.

  • repo_id (str) – Expected dataset repository ID, e.g. "zeahub/picmus".

Raises:

FileExistsError – If the directory belongs to a different dataset, or contains HDF5 files but no README.md.

Return type:

None
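
The four cases listed above can be sketched with the standard library alone. This is a minimal illustration, not the zea implementation: the names read_repo_id and sketch_check_ownership are hypothetical, and both the line-based parsing of the zea_repo_id field and the recursive search for *.hdf5 files are assumptions.

```python
from pathlib import Path


def read_repo_id(readme):
    """Return the value of the zea_repo_id field in a dataset card, or None."""
    for line in Path(readme).read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "zea_repo_id":
            return value.strip()
    return None


def sketch_check_ownership(folder, repo_id):
    """Hypothetical re-implementation of the four documented cases."""
    folder = Path(folder)
    # Case 1: a missing or empty directory passes (first-time run).
    if not folder.is_dir() or not any(folder.iterdir()):
        return
    readme = folder / "README.md"
    if readme.is_file():
        found = read_repo_id(readme)
        # Case 2: a matching card passes (re-run of the same dataset).
        if found == repo_id:
            return
        # Case 3: the card names a different dataset (or no dataset at all).
        raise FileExistsError(f"{folder} belongs to {found!r}, not {repo_id!r}")
    # Case 4: HDF5 files without a card (recursive search is an assumption).
    if any(folder.rglob("*.hdf5")):
        raise FileExistsError(f"{folder} contains HDF5 files but no README.md")
```

A directory that holds only non-HDF5 files and no README.md passes in this sketch; the documentation above does not say how the real function treats that case.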

zea.data.convert.utils.download_file(url, destination)[source]

Download a file from a URL to a local path.

Skips the download if the file already exists at destination. Shows a tqdm progress bar based on the content-length header when available.

Uses the ZEA_DOWNLOAD_TIMEOUT environment variable (default 600 s) as the socket timeout.

Parameters:
  • url (str) – URL to download from.

  • destination (str | Path) – Full file path where the downloaded content will be saved. The parent directory is created if it does not exist.

Return type:

Path

Returns:

Path to the (possibly pre-existing) downloaded file.
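
The skip-if-present and timeout behavior described above might look roughly like the following standard-library sketch. It is not the zea implementation: sketch_download_file is a hypothetical name, the tqdm progress bar is omitted, and writing to a temporary ".part" file before renaming is an added design choice of this sketch so that an interrupted transfer is not mistaken for a finished one.

```python
import os
import shutil
import urllib.request
from pathlib import Path


def sketch_download_file(url, destination):
    """Download url to destination unless destination already exists."""
    destination = Path(destination)
    # Skip the download when the file is already present.
    if destination.exists():
        return destination
    # Create the parent directory if it does not exist.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Timeout in seconds, read from the environment (default 600 s).
    timeout = float(os.environ.get("ZEA_DOWNLOAD_TIMEOUT", "600"))
    # Assumption of this sketch: stream into a ".part" file, then rename.
    partial = destination.with_name(destination.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
    partial.replace(destination)
    return destination
```

Calling the function a second time with the same destination returns immediately without touching the network, which matches the "possibly pre-existing" wording of the return value.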

zea.data.convert.utils.download_from_girder(collection_id, destination, dataset_name, patients=None, top_folder_name='dataset')[source]

Download a dataset from the Girder server.

Navigates the Girder collection to find patient folders and downloads all files for each patient. Existing files are skipped.

Parameters:
  • collection_id (str) – Girder collection ID for the dataset.

  • destination (str | Path) – Directory where the dataset will be downloaded.

  • dataset_name (str) – Human-readable name used in log messages (e.g. "CAMUS" or "CETUS").

  • patients (list[int] | None) – Optional list of patient IDs to download. If None, all patients in the collection are downloaded.

  • top_folder_name (str) – Name of the top-level folder inside the collection that contains patient subfolders. Defaults to "dataset".

Return type:

Path

Returns:

Path to the downloaded dataset directory.

zea.data.convert.utils.load_avi(file_path, mode='L')[source]

Load a .avi file and return a numpy array of frames.

Parameters:
  • file_path (str) – The path to the video file.

  • mode (str, optional) – Color mode: "L" (grayscale) or "RGB". Defaults to "L".

Returns:

Array of frames with shape (num_frames, H, W) for grayscale or (num_frames, H, W, C) for RGB.

Return type:

numpy.ndarray

zea.data.convert.utils.require_output_dir_ownership(folder, repo_id)[source]

Raise if folder does not contain a verified dataset card for repo_id.

Used as a pre-flight check before uploading to the Hugging Face Hub to prevent accidentally uploading files from a different dataset.

Parameters:
  • folder (str | Path) – Directory to check.

  • repo_id (str) – Expected dataset repository ID, e.g. "zeahub/picmus".

Raises:
  • FileNotFoundError – If no README.md is found.

  • ValueError – If the README.md does not match repo_id.

Return type:

None
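
The two failure modes above can be illustrated with a short sketch. It is not the zea implementation: sketch_require_ownership is a hypothetical name, and the sketch assumes that a card "matches" when its zea_repo_id field equals repo_id, mirroring the field described for check_output_dir_ownership.

```python
from pathlib import Path


def sketch_require_ownership(folder, repo_id):
    """Raise unless folder/README.md declares zea_repo_id == repo_id."""
    readme = Path(folder) / "README.md"
    # No dataset card at all: nothing verifies what this folder contains.
    if not readme.is_file():
        raise FileNotFoundError(f"No README.md found in {folder}")
    found = None
    for line in readme.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "zea_repo_id":
            found = value.strip()
            break
    # A card exists but names a different (or no) repository.
    if found != repo_id:
        raise ValueError(f"README.md declares {found!r}, expected {repo_id!r}")
```

Unlike check_output_dir_ownership, an empty folder does not pass here: the pre-flight check requires the card to be present.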

zea.data.convert.utils.sitk_load(filepath, squeeze=False)[source]

Load a NIfTI/medical image using SimpleITK and return the array and metadata.

Parameters:
  • filepath (str | Path) – Path to the image file.

  • squeeze (bool) – If True, squeeze singleton dimensions from the array. Defaults to False.

Returns:

  • Image array. Shape depends on the input and the squeeze parameter.

  • Dictionary of metadata: origin, spacing, direction, size, dimension, and a metadata sub-dict with all image metadata keys.

Return type:

Tuple of (image array, metadata dictionary)
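
The structure of the returned metadata dictionary can be sketched as follows. This is an illustration only: sketch_collect_metadata is a hypothetical helper, the exact key names are assumed from the description above, and the function takes an already-loaded image object so that it runs without SimpleITK installed. The getter methods it calls (GetOrigin, GetSpacing, GetDirection, GetSize, GetDimension, GetMetaDataKeys, GetMetaData) are those of a SimpleITK Image.

```python
def sketch_collect_metadata(image):
    """Build a metadata dictionary from a SimpleITK-style image object."""
    return {
        "origin": image.GetOrigin(),
        "spacing": image.GetSpacing(),
        "direction": image.GetDirection(),
        "size": image.GetSize(),
        "dimension": image.GetDimension(),
        # Sub-dict holding every metadata key stored in the image header.
        "metadata": {key: image.GetMetaData(key) for key in image.GetMetaDataKeys()},
    }
```

With SimpleITK available, the image would typically come from SimpleITK.ReadImage(str(filepath)) and the array from SimpleITK.GetArrayFromImage(image). Note that the array's axis order is the reverse of GetSize(): a 3D image of size (x, y, z) yields an array of shape (z, y, x).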

zea.data.convert.utils.unzip(src, dataset)[source]

Check whether the data folder exists in src. If it does not, unzip dataset.zip in src.

Parameters:
  • src (str | Path) – The source directory containing the zip file or unzipped folder.

  • dataset (str) – The name of the dataset to unzip. Options are "picmus", "camus", "echonet", "echonetlvh".

Returns:

The path to the unzipped dataset directory.

Return type:

Path
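
The check-then-extract behavior can be sketched with zipfile. This is not the zea implementation: sketch_unzip is a hypothetical name, and the sketch assumes that the unzipped folder is src/<dataset> and the archive is src/<dataset>.zip, which the documentation above does not spell out.

```python
import zipfile
from pathlib import Path


def sketch_unzip(src, dataset):
    """Return src/<dataset>, extracting src/<dataset>.zip first if needed."""
    src = Path(src)
    target = src / dataset  # assumed name of the unzipped folder
    # Already extracted: nothing to do.
    if target.is_dir():
        return target
    archive = src / f"{dataset}.zip"  # assumed name of the archive
    if not archive.is_file():
        raise FileNotFoundError(f"Neither {target} nor {archive} exists")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target
```

Because the folder check comes first, the archive can be deleted after the first run without breaking later calls.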

zea.data.convert.utils.upload_dataset_to_hf(folder, repo_id, revision, file_glob='*.hdf5', commit_message=None)[source]

Upload a converted dataset to a Hugging Face Hub revision branch.

Upload to the main branch is intentionally blocked. After uploading to a named revision branch, verify the data manually and then merge the branch into main on the Hugging Face Hub.

Parameters:
  • folder (str | Path) – Root folder containing the files to upload.

  • repo_id (str) – Hugging Face Hub repository ID (e.g. "zeahub/picmus").

  • revision (str) – Target branch name. Must not be "main".

  • file_glob (str) – Glob pattern for files to include in the size summary. Defaults to "*.hdf5".

  • commit_message (str | None) – Commit message. Defaults to "Upload <repo_id> (zea format) to <revision>".

Raises:
  • ValueError – If revision is "main".

  • FileNotFoundError – If no files matching file_glob are found under folder.

Return type:

None
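
The documented guards (no upload to "main", at least one matching file, default commit message) can be sketched without contacting the Hub. This is an illustration, not the zea implementation: sketch_upload_preflight is a hypothetical helper, it performs no upload, and searching folder recursively for file_glob is an assumption.

```python
from pathlib import Path


def sketch_upload_preflight(folder, repo_id, revision, file_glob="*.hdf5", commit_message=None):
    """Validate upload arguments and return (files, total_bytes, commit_message)."""
    # Uploading straight to main is blocked; use a named revision branch.
    if revision == "main":
        raise ValueError("Refusing to upload to 'main'; use a revision branch")
    # Collect matching files for the size summary (recursive search assumed).
    files = sorted(p for p in Path(folder).rglob(file_glob) if p.is_file())
    if not files:
        raise FileNotFoundError(f"No files matching {file_glob!r} under {folder}")
    total_bytes = sum(p.stat().st_size for p in files)
    # Default commit message as documented.
    if commit_message is None:
        commit_message = f"Upload {repo_id} (zea format) to {revision}"
    return files, total_bytes, commit_message
```

For example, calling the helper with repo_id "zeahub/picmus" and revision "v1" on a folder containing HDF5 files yields the commit message "Upload zeahub/picmus (zea format) to v1".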

zea.data.convert.utils.write_dataset_card(folder, card_content)[source]

Write a Hugging Face dataset card (README.md) into folder.

Parameters:
  • folder (str | Path) – Directory where README.md will be written.

  • card_content (str) – Markdown content for the dataset card.

Return type:

Path

Returns:

Path to the written README.md file.
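
A minimal sketch of writing a card that carries the zea_repo_id field used by the ownership checks. It is not the zea implementation: sketch_write_dataset_card and EXAMPLE_CARD are hypothetical, creating the folder when it is missing is an assumption, and placing zea_repo_id in a YAML front-matter block is only one plausible layout.

```python
from pathlib import Path

# Hypothetical card content; the front-matter layout is an assumption.
EXAMPLE_CARD = "---\nzea_repo_id: zeahub/picmus\n---\n\n# PICMUS (zea format)\n"


def sketch_write_dataset_card(folder, card_content):
    """Write card_content to folder/README.md and return the file path."""
    folder = Path(folder)
    # Assumption: create the output folder if it does not exist yet.
    folder.mkdir(parents=True, exist_ok=True)
    readme = folder / "README.md"
    readme.write_text(card_content, encoding="utf-8")
    return readme
```

Writing the card before producing any HDF5 files means that a later re-run of the same converter finds a matching README.md in the output directory.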