zea.Dataloader
- class zea.Dataloader(file_paths, key='data/image', batch_size=16, n_frames=1, shuffle=True, return_filename=False, seed=None, limit_n_samples=None, limit_n_frames=None, drop_remainder=False, image_size=None, resize_type=None, resize_axes=None, resize_kwargs=None, image_range=None, normalization_range=None, clip_image_range=False, assert_image_range=True, dataset_repetitions=None, cache=False, additional_axes_iter=None, sort_files=True, overlapping_blocks=False, augmentation=None, initial_frame_axis=0, insert_frame_axis=True, frame_index_stride=1, frame_axis=-1, validate=True, revision=None, prefetch=True, shard_index=None, num_shards=1, num_threads=16, prefetch_buffer_size=500, reshuffle_each_epoch=True, convert_to_tensor=True, **kwargs)
Bases: object
High-performance HDF5 dataloader built on Grain.
grain threads (N) → h5py (thread-local handles) → numpy → cpu tensor → user
The pipeline runs in NumPy, except for resizing, which runs on the selected backend. Everything stays on the CPU.
Loading a dataset performs the following steps in order:
1. Find all .hdf5 files in the directory(ies).
2. Load the data from each file using the specified key.
3. Apply the following transformations in order (if specified):
   - shuffle
   - shard
   - add channel dim
   - clip image range
   - assert image range
   - resize
   - repeat
   - batch
   - cast to float32
   - normalize
   - augmentation
   - convert_to_tensor
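The clip, assert, and normalize steps can be illustrated on a flat list of pixel values. This is a minimal sketch in plain Python, assuming normalization is a linear map from image_range to normalization_range; the helper names are hypothetical and this is not zea's implementation.

```python
def clip(values, image_range):
    """Clamp every value into [low, high]."""
    low, high = image_range
    return [min(max(v, low), high) for v in values]


def assert_in_range(values, image_range):
    """Raise if any value falls outside [low, high]."""
    low, high = image_range
    if any(v < low or v > high for v in values):
        raise ValueError(f"values outside image_range {image_range}")


def normalize(values, image_range, normalization_range):
    """Linearly map values from image_range to normalization_range (assumed)."""
    src_low, src_high = image_range
    dst_low, dst_high = normalization_range
    return [
        (v - src_low) / (src_high - src_low) * (dst_high - dst_low) + dst_low
        for v in values
    ]


pixels = [-80.0, -60.0, -30.0, 0.0]
clipped = clip(pixels, (-60, 0))        # -80.0 is clamped to -60
assert_in_range(clipped, (-60, 0))      # passes after clipping
normalized = normalize(clipped, (-60, 0), (0, 1))
print(normalized)  # [0.0, 0.0, 0.5, 1.0]
```

Without the clipping step, the value -80.0 would fail the range assertion, which is why the list above places clipping before the assertion.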
- Parameters:
file_paths (Union[List[str], str]) – Path(s) to directory(ies) and/or HDF5 file(s).
key (str) – HDF5 dataset key. Default is "data/image".
batch_size (int | None) – Batch size. Set to None to disable batching. Default is 16.
n_frames (int) – Number of consecutive frames per sample. Default is 1. When n_frames > 1, frames are grouped into blocks.
shuffle (bool) – Shuffle dataset each epoch. Default is True.
return_filename (bool) – Return filename metadata together with each sample. Default is False.
seed (int | None) – Random seed used for dataloader (e.g. shuffling). Default is None. If None, a random seed is generated.
limit_n_samples (int | None) – Limit total number of samples (useful for debugging). Default is None (no limit).
limit_n_frames (int | None) – Limit frames loaded per file to the first N frames. Default is None (no limit).
drop_remainder (bool) – Drop the final incomplete batch. Default is False.
image_size (tuple | None) – Target (height, width). Default is None (no resizing).
resize_type (str | None) – Resize strategy. One of "resize", "center_crop", "random_crop" or "crop_or_pad". Default is None, which resolves to "resize" when image_size is set.
resize_axes (tuple | None) – Axes to resize along, must have length 2 (height, width). Only needed when data has more than (h, w, c) dimensions. Axes are interpreted after frame-axis insertion/reordering. Default is None.
resize_kwargs (dict | None) – Extra keyword arguments passed to Resizer. Default is None.
image_range (tuple | None) – Source value range of images, e.g. (-60, 0). Used for clipping/asserting/normalization. Default is None.
normalization_range (tuple | None) – Target value range, e.g. (0, 1). If set, image_range must also be set. Default is None.
clip_image_range (bool) – Clip values to image_range before normalization. Default is False.
assert_image_range (bool) – Assert values stay within image_range. Default is True.
dataset_repetitions (int | None) – Repeat dataset this many times. Repetition happens after sharding. Default is None (no repetition).
cache (bool) – Cache loaded samples in RAM. Default is False. Note that with overlapping_blocks=True, the same frame can be part of multiple samples, so caching will consume more memory.
additional_axes_iter (tuple | None) – Additional axes to iterate over in addition to initial_frame_axis. Default is None.
sort_files (bool) – Sort files numerically before indexing. Default is True.
overlapping_blocks (bool) – If True, frame blocks overlap by n_frames - 1. Has no effect when n_frames == 1. Default is False.
augmentation (callable) – Callable applied to each batch after normalization. Default is None.
initial_frame_axis (int) – Axis in file data that represents frames. Default is 0.
insert_frame_axis (bool) – If True, keep per-frame samples and move/insert the frame dimension at frame_axis. If False, loaded frames are concatenated along frame_axis. Default is True.
frame_index_stride (int) – Step between selected frames in a block. Default is 1.
frame_axis (int) – Axis along which frames are stacked/placed in output. Default is -1.
validate (bool) – Validate discovered files against the zea format. Default is True.
revision (str | None) – HuggingFace revision (branch, tag, or commit hash) for hf:// paths. Defaults to None (uses the default branch, typically "main").
prefetch (bool) – Enable Grain prefetching for iteration. Default is True.
shard_index (int | None) – Shard index to select when num_shards > 1. Must satisfy 0 <= shard_index < num_shards. Default is None.
num_shards (int) – Total number of shards for distributed loading. Sharding happens before downstream transforms. Default is 1.
num_threads (int) – Number of Grain read threads (0 means main thread only). Default is 16.
prefetch_buffer_size (int) – Size of the Grain buffer for reading elements per Python process (not per thread). Useful when reading from a distributed file system. Default is 500.
reshuffle_each_epoch (bool) – Whether to reshuffle the dataset after each epoch. Default is True. Setting this to False can be useful for evaluation, or when using a persistent iterator between epochs with dataset_repetitions specifying the number of epochs.
convert_to_tensor (bool) – Whether to convert the data to a tensor (on CPU). Default is True.
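The interaction between n_frames and overlapping_blocks is easiest to see on frame indices. This is a minimal sketch, assuming that non-overlapping blocks start every n_frames frames, that overlapping blocks start one frame apart (so they share n_frames - 1 frames), and that a trailing incomplete block is dropped. frame_blocks is a hypothetical helper, not part of zea, and it ignores frame_index_stride.

```python
def frame_blocks(total_frames, n_frames, overlapping_blocks=False):
    """Return the frame indices that make up each sample (block)."""
    # Overlapping blocks advance by one frame; otherwise by a full block.
    step = 1 if overlapping_blocks else n_frames
    last_start = total_frames - n_frames
    return [
        list(range(start, start + n_frames))
        for start in range(0, last_start + 1, step)
    ]


disjoint = frame_blocks(6, 3)
overlapping = frame_blocks(6, 3, overlapping_blocks=True)
print(disjoint)     # [[0, 1, 2], [3, 4, 5]]
print(overlapping)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

In this sketch the overlapping case holds 12 frame slots for a file of 6 frames, which illustrates why cache=True costs more memory when overlapping_blocks=True.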
Example
    from zea import Dataloader

    loader = Dataloader(
        file_paths="/data/camus",
        key="data/image/values",
        batch_size=32,
        image_range=(-60, 0),
        normalization_range=(0, 1),
        image_size=(256, 256),
    )
    for batch in loader:
        ...  # batch.shape == (32, 256, 256, 1)
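For distributed loading, num_shards and shard_index give each worker its own part of the dataset. The sketch below shows one possible partitioning on sample indices, assuming a strided split. shard_indices is a hypothetical helper, and zea or Grain may assign samples to shards differently.

```python
def shard_indices(n_samples, shard_index, num_shards=1):
    """Return the sample indices owned by one shard (strided split, assumed)."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must satisfy 0 <= shard_index < num_shards")
    return list(range(shard_index, n_samples, num_shards))


shards = [shard_indices(10, i, num_shards=3) for i in range(3)]
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Whatever the scheme, the shards should be disjoint and together cover every sample, so each worker sees a different part of the data.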
- property dataset
The underlying grain.MapDataset.
- property shape
Output shape of one batch (or sample if unbatched).
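The leading dimension of a batch depends on batch_size and drop_remainder. This is a minimal sketch of the batch sizes seen in one epoch, assuming standard batching semantics where the final batch holds whatever samples are left over. batch_sizes is a hypothetical helper, not part of zea.

```python
def batch_sizes(n_samples, batch_size, drop_remainder=False):
    """Return the size of each batch in one pass over the dataset."""
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    # Keep the incomplete final batch unless drop_remainder is set.
    if remainder and not drop_remainder:
        sizes.append(remainder)
    return sizes


kept = batch_sizes(100, 32)
dropped = batch_sizes(100, 32, drop_remainder=True)
print(kept)     # [32, 32, 32, 4]
print(dropped)  # [32, 32, 32]
```

Under that assumption, code that relies on a fixed leading dimension would need drop_remainder=True.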