zea.data.convert.utils
Functions
check_output_dir_ownership | Raise if folder already contains data from a different dataset.
download_file | Download a file from a URL to a local path.
download_from_girder | Download a dataset from the Girder server.
load_avi | Load a .avi file and return a numpy array of frames.
require_output_dir_ownership | Raise if folder does not contain a verified dataset card for repo_id.
sitk_load | Load a NIfTI/medical image using SimpleITK and return the array and metadata.
unzip | Check whether the data folder exists in src.
upload_dataset_to_hf | Upload a converted dataset to a HuggingFace Hub revision branch.
write_dataset_card | Write a HuggingFace dataset card (README.md) into folder.
- zea.data.convert.utils.check_output_dir_ownership(folder, repo_id)
Raise if folder already contains data from a different dataset.
The check is based on the zea_repo_id field written into the dataset card (README.md) by each converter. A directory is considered owned by a specific dataset when its README.md contains zea_repo_id: <repo_id>.
Empty or non-existent directory → passes (first-time run).
Directory with matching README.md → passes (re-run of same dataset).
Directory with mismatched README.md → raises FileExistsError.
Directory with HDF5 files but no README.md → raises FileExistsError.
- Parameters:
folder (str | Path) – Output directory to inspect.
repo_id (str) – Expected dataset repository ID, e.g. "zeahub/picmus".
- Raises:
FileExistsError – If the directory belongs to a different dataset.
- Return type:
None
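The four cases above can be restated as a small standalone sketch. It is not zea's implementation: the function names check_ownership_sketch and outcome, the exact-line match on the card, the .hdf5/.h5 suffix test, and the second repository ID "zeahub/camus" are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def check_ownership_sketch(folder, repo_id):
    """Illustrative restatement of the documented cases; not zea's implementation."""
    folder = Path(folder)
    readme = folder / "README.md"
    if readme.is_file():
        # Assumption: the card holds a line that reads exactly "zea_repo_id: <repo_id>".
        lines = readme.read_text(encoding="utf-8").splitlines()
        if any(line.strip() == f"zea_repo_id: {repo_id}" for line in lines):
            return  # matching card: re-run of the same dataset
        raise FileExistsError(f"{folder} belongs to a different dataset")
    # Assumption: HDF5 files are recognised by a .hdf5 or .h5 suffix.
    if folder.is_dir() and any(p.suffix in {".hdf5", ".h5"} for p in folder.rglob("*")):
        raise FileExistsError(f"{folder} holds HDF5 files but no README.md")
    # Empty or non-existent directory: first-time run, nothing to raise.


def outcome(folder, repo_id):
    """Return "passes" or "raises" so the cases can be compared side by side."""
    try:
        check_ownership_sketch(folder, repo_id)
        return "passes"
    except FileExistsError:
        return "raises"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fresh = outcome(root / "new_run", "zeahub/picmus")  # directory does not exist

    owned = root / "picmus"
    owned.mkdir()
    (owned / "README.md").write_text("zea_repo_id: zeahub/picmus\n", encoding="utf-8")
    rerun = outcome(owned, "zeahub/picmus")    # matching card
    mismatch = outcome(owned, "zeahub/camus")  # card names another dataset

    orphan = root / "orphan"
    orphan.mkdir()
    (orphan / "scan.hdf5").write_bytes(b"")
    no_card = outcome(orphan, "zeahub/picmus")  # HDF5 files, no card

print(fresh, rerun, mismatch, no_card)
```

A directory that holds only non-HDF5 files and no README.md is not covered by the documented cases; the sketch lets it pass, which may differ from the real function.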
- zea.data.convert.utils.download_file(url, destination)
Download a file from a URL to a local path.
Skips the download if the file already exists at destination. Shows a tqdm progress bar based on the content-length header when available.
Uses the ZEA_DOWNLOAD_TIMEOUT environment variable (default 600 s) as the socket timeout.
- Parameters:
url (str) – URL to download from.
destination (str | Path) – Full file path where the downloaded content will be saved. The parent directory is created if it does not exist.
- Return type:
Path
- Returns:
Path to the (possibly pre-existing) downloaded file.
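The skip-if-present behaviour and the timeout variable can be illustrated with a standard-library sketch. It is an assumption-laden stand-in, not zea's code: download_file_sketch is a hypothetical name, the tqdm progress bar is left out, and parsing ZEA_DOWNLOAD_TIMEOUT as a float is a guess at how the value is read.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file_sketch(url, destination):
    """Skip-if-present download using only the standard library (no progress bar)."""
    destination = Path(destination)
    if destination.exists():
        return destination  # already downloaded: nothing to do
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Variable name and 600 s default come from the text above; float parsing is an assumption.
    timeout = float(os.environ.get("ZEA_DOWNLOAD_TIMEOUT", "600"))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"ultrasound")
    target = Path(tmp) / "nested" / "copy.bin"

    # A file:// URL keeps the demonstration offline.
    first = download_file_sketch(source.as_uri(), target)
    payload = first.read_bytes()

    # Second call: the URL is never opened because the target already exists.
    second = download_file_sketch("http://invalid.invalid/never-fetched", target)
    same_path = first == second
```

The sketch does not clean up a partially written file when a download fails, so an interrupted transfer would be treated as complete on the next call.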
- zea.data.convert.utils.download_from_girder(collection_id, destination, dataset_name, patients=None, top_folder_name='dataset')
Download a dataset from the Girder server.
Navigates the Girder collection to find patient folders and downloads all files for each patient. Existing files are skipped.
- Parameters:
collection_id (str) – Girder collection ID for the dataset.
destination (str | Path) – Directory where the dataset will be downloaded.
dataset_name (str) – Human-readable name used in log messages (e.g. "CAMUS" or "CETUS").
patients (list[int] | None) – Optional list of patient IDs to download. If None, all patients in the collection are downloaded.
top_folder_name (str) – Name of the top-level folder inside the collection that contains patient subfolders. Defaults to "dataset".
- Return type:
Path
- Returns:
Path to the downloaded dataset directory.
- zea.data.convert.utils.load_avi(file_path, mode='L')
Load a .avi file and return a numpy array of frames.
- Parameters:
file_path (str) – The path to the video file.
mode (str, optional) – Color mode: "L" (grayscale) or "RGB". Defaults to "L".
- Returns:
Array of frames with shape (num_frames, H, W) or (num_frames, H, W, C).
- Return type:
numpy.ndarray
- zea.data.convert.utils.require_output_dir_ownership(folder, repo_id)
Raise if folder does not contain a verified dataset card for repo_id.
Used as a pre-flight check before uploading to HuggingFace Hub to prevent accidentally uploading files from a different dataset.
- Parameters:
folder (str | Path) – Directory to check.
repo_id (str) – Expected dataset repository ID, e.g. "zeahub/picmus".
- Raises:
FileNotFoundError – If no README.md is found.
ValueError – If the README.md does not match repo_id.
- Return type:
None
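Unlike the output-directory check, this pre-flight check treats a missing card as an error. The sketch below shows that difference under stated assumptions: require_ownership_sketch and error_name are hypothetical names, the exact-line match on the card is a guess at how verification works, and "zeahub/camus" is only an illustrative second repository ID.

```python
import tempfile
from pathlib import Path


def require_ownership_sketch(folder, repo_id):
    """Strict variant: a missing card is an error rather than a first-time run."""
    readme = Path(folder) / "README.md"
    if not readme.is_file():
        raise FileNotFoundError(f"no README.md in {folder}")
    # Assumption: the card holds a line that reads exactly "zea_repo_id: <repo_id>".
    lines = readme.read_text(encoding="utf-8").splitlines()
    if not any(line.strip() == f"zea_repo_id: {repo_id}" for line in lines):
        raise ValueError(f"README.md in {folder} does not match {repo_id}")


def error_name(folder, repo_id):
    """Return the name of the raised exception, or None when the check passes."""
    try:
        require_ownership_sketch(folder, repo_id)
        return None
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    before_card = error_name(folder, "zeahub/picmus")  # no README.md yet
    (folder / "README.md").write_text("zea_repo_id: zeahub/picmus\n", encoding="utf-8")
    matching = error_name(folder, "zeahub/picmus")     # card matches
    other_repo = error_name(folder, "zeahub/camus")    # card names another dataset

print(before_card, matching, other_repo)
```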
- zea.data.convert.utils.sitk_load(filepath, squeeze=False)
Load a NIfTI/medical image using SimpleITK and return the array and metadata.
- Parameters:
filepath (str | Path) – Path to the image file.
squeeze (bool) – If True, squeeze singleton dimensions from the array. Defaults to False.
- Returns:
Tuple of:
Image array. Shape depends on the input and the squeeze parameter.
Dictionary of metadata: origin, spacing, direction, size, dimension, and a metadata sub-dict with all image metadata keys.
- Return type:
tuple
- zea.data.convert.utils.unzip(src, dataset)
Check whether the data folder exists in src. If it does not, unzip dataset.zip in src.
- Parameters:
src (str | Path) – The source directory containing the zip file or unzipped folder.
dataset (str) – The name of the dataset to unzip. Options are "picmus", "camus", "echonet", "echonetlvh".
- Returns:
The path to the unzipped dataset directory.
- Return type:
Path
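The check-then-extract pattern described here can be sketched with the standard library. The layout is assumed, not taken from zea: the sketch supposes the archive is named <dataset>.zip and is extracted into src/<dataset>, and unzip_sketch is a hypothetical name. The real function may use different folder names per dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_sketch(src, dataset):
    """Return src/<dataset>, extracting src/<dataset>.zip first if the folder is missing."""
    src = Path(src)
    # Assumption: the extracted folder and the archive are both named after the dataset.
    target = src / dataset
    if target.is_dir():
        return target  # already unpacked: reuse it
    with zipfile.ZipFile(src / f"{dataset}.zip") as archive:
        archive.extractall(target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    # Build a tiny stand-in archive so the demonstration needs no real dataset.
    with zipfile.ZipFile(src / "picmus.zip", "w") as archive:
        archive.writestr("scan_0001.hdf5", b"")

    extracted = unzip_sketch(src, "picmus")
    names = sorted(p.name for p in extracted.iterdir())

    # With the folder in place, the archive is no longer needed.
    (src / "picmus.zip").unlink()
    reused = unzip_sketch(src, "picmus") == extracted
```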
- zea.data.convert.utils.upload_dataset_to_hf(folder, repo_id, revision, file_glob='*.hdf5', commit_message=None)
Upload a converted dataset to a HuggingFace Hub revision branch.
Upload to the main branch is intentionally blocked. After uploading to a named revision branch, verify the data manually and then merge the branch into main on the Hugging Face Hub.
- Parameters:
folder (str | Path) – Root folder containing the files to upload.
repo_id (str) – Hugging Face Hub repository ID (e.g. "zeahub/picmus").
revision (str) – Target branch name. Must not be "main".
file_glob (str) – Glob pattern for files to include in the size summary. Defaults to "*.hdf5".
commit_message (str | None) – Commit message. Defaults to "Upload <repo_id> (zea format) to <revision>".
- Raises:
ValueError – If revision is "main".
FileNotFoundError – If no files matching file_glob are found under folder.
- Return type:
None
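The documented guards (no upload to main, at least one matching file, default commit message) can be sketched without contacting the Hub. This is a dry-run illustration only: plan_upload_sketch is a hypothetical name, the recursive glob is an assumption, "v-test" is a made-up branch name, and no Hugging Face API is called.

```python
import tempfile
from pathlib import Path


def plan_upload_sketch(folder, repo_id, revision, file_glob="*.hdf5", commit_message=None):
    """Run the documented pre-upload checks and return what would be uploaded."""
    if revision == "main":
        raise ValueError("upload to 'main' is blocked; use a revision branch")
    # Assumption: the glob is applied recursively under folder.
    files = sorted(Path(folder).rglob(file_glob))
    if not files:
        raise FileNotFoundError(f"no files matching {file_glob!r} under {folder}")
    if commit_message is None:
        # Default message format as given in the parameter description above.
        commit_message = f"Upload {repo_id} (zea format) to {revision}"
    total_bytes = sum(f.stat().st_size for f in files)
    return {"files": files, "total_bytes": total_bytes, "commit_message": commit_message}


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "train").mkdir()
    (folder / "train" / "scan.hdf5").write_bytes(b"\x00" * 16)

    plan = plan_upload_sketch(folder, "zeahub/picmus", "v-test")
    message = plan["commit_message"]
    size = plan["total_bytes"]

    try:
        plan_upload_sketch(folder, "zeahub/picmus", "main")
        blocked = False
    except ValueError:
        blocked = True
```

Checking the revision before touching the filesystem means a mistaken call against main fails immediately, whatever the folder contains.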
- zea.data.convert.utils.write_dataset_card(folder, card_content)
Write a HuggingFace dataset card (README.md) into folder.
- Parameters:
folder (str | Path) – Directory where README.md will be written.
card_content (str) – Markdown content for the dataset card.
- Return type:
Path
- Returns:
Path to the written README.md file.
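A minimal sketch of writing a card that carries the zea_repo_id field used by the ownership checks. Several details are assumptions rather than documented behaviour: write_card_sketch is a hypothetical name, the field is placed in YAML front matter, and the folder is created if it is missing.

```python
import tempfile
from pathlib import Path


def write_card_sketch(folder, card_content):
    """Write card_content to folder/README.md and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # assumption: missing folders are created
    path = folder / "README.md"
    path.write_text(card_content, encoding="utf-8")
    return path


# Assumption: zea_repo_id sits in the YAML front matter of the card.
card = "---\nzea_repo_id: zeahub/picmus\n---\n# PICMUS (zea format)\n"

with tempfile.TemporaryDirectory() as tmp:
    written = write_card_sketch(Path(tmp) / "picmus", card)
    file_name = written.name
    has_field = "zea_repo_id: zeahub/picmus" in written.read_text(encoding="utf-8").splitlines()
```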