data.DataLoader¶
- class lucid.data.DataLoader(dataset: Dataset, batch_size: int = 1, shuffle: bool = False, collate_fn: Callable | None = None)¶
The DataLoader class provides an efficient and flexible way to iterate over a dataset in mini-batches. It supports shuffling, batching, and custom collation, making it essential for training deep learning models.
The DataLoader works with any dataset that inherits from Dataset, enabling seamless integration with custom datasets defined by the user.
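For instance, a minimal map-style dataset can be plugged straight into a DataLoader. The sketch below is illustrative only: CubeDataset is a hypothetical class, and it assumes that Dataset is exposed as lucid.data.Dataset and that subclasses implement __len__ and __getitem__, matching how the SquareDataset examples below are indexed.
import lucid.data as data

class CubeDataset(data.Dataset):
    # Hypothetical dataset: returns the cube of each index 0..9.
    def __len__(self) -> int:
        return 10

    def __getitem__(self, index: int) -> int:
        return index ** 3

loader = data.DataLoader(CubeDataset(), batch_size=4)
for batch in loader:
    print(batch)  # [0, 1, 8, 27], [64, 125, 216, 343], [512, 729]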
Class Signature¶
class DataLoader:
    def __init__(
        self,
        dataset: Dataset,
        batch_size: int = 1,
        shuffle: bool = False,
        collate_fn: Callable | None = None,
    ) -> None
Methods¶
Core Methods
def __len__(self) -> int
Returns the total number of batches in the DataLoader. This is the number of samples in the dataset divided by the batch size, rounded up so that a final partial batch (see the warning below) is also counted.
Returns:
int: The total number of batches.
Example:
dataset = SquareDataset() # Assume len(dataset) = 10
loader = DataLoader(dataset, batch_size=2)
print(len(loader)) # Output: 5
def __iter__(self) -> Iterator[list[Any]]
Returns an iterator that yields batches of data from the dataset. The iterator returns a list of samples for each batch.
Yields:
list[Any]: A batch of samples, each corresponding to a sample from the dataset.
Example:
dataset = SquareDataset() # Assume dataset returns squares of indices
loader = DataLoader(dataset, batch_size=2)
for batch in loader:
    print(batch)  # Output: [0, 1], [4, 9], ...
Special Methods
def __call__(self) -> Iterator[list[Any]]
An alternative way to obtain an iterator for the DataLoader. This allows for cleaner syntax when using DataLoader in loops.
Yields:
list[Any]: A batch of samples, each corresponding to a sample from the dataset.
Example:
dataset = SquareDataset()
loader = DataLoader(dataset, batch_size=3)
for batch in loader():  # Call syntax
    print(batch)
Parameters¶
__init__
Initializes the DataLoader object with the specified dataset, batch size, shuffle option, and optional collate function.
Parameters:
dataset (Dataset): The dataset to load data from. Must be an instance of a Dataset subclass.
batch_size (int, optional): The number of samples per batch. Defaults to 1.
shuffle (bool, optional): Whether to shuffle the data at the start of each epoch. Defaults to False.
collate_fn (Callable | None, optional): Function used to collate the samples of each batch; when None, each batch is yielded as a plain list. Defaults to None.
Raises:
TypeError: If dataset is not an instance of Dataset.
ValueError: If batch_size is not a positive integer.
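The sketch below illustrates the validation described above; the exact error messages are implementation details and are not shown here.
import lucid.data as data

dataset = data.SquareDataset()

try:
    data.DataLoader(dataset, batch_size=0)  # batch_size must be positive
except ValueError as err:
    print("ValueError:", err)

try:
    data.DataLoader([0, 1, 4], batch_size=2)  # plain list, not a Dataset
except TypeError as err:
    print("TypeError:", err)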
Examples¶
Loading a custom dataset
import lucid.data as data
dataset = data.SquareDataset()
loader = data.DataLoader(dataset, batch_size=2, shuffle=True)
for batch in loader:
    print(batch)  # Prints random batches of 2 samples from the dataset
Using DataLoader with ConcatDataset
import lucid.data as data
dataset1 = data.SquareDataset()
dataset2 = data.RandomNoiseDataset(10)
combined_dataset = data.ConcatDataset([dataset1, dataset2])
loader = data.DataLoader(combined_dataset, batch_size=3, shuffle=True)
for batch in loader:
    print(batch)  # Prints batches of 3 samples from the combined dataset
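Using a custom collate_fn
The collate_fn argument from the class signature can reshape each batch before it is yielded. The following is a minimal sketch, assuming collate_fn receives the list of samples for one batch and that its return value is yielded in place of that list.
import lucid.data as data

def sum_batch(samples: list[int]) -> int:
    # Assumed contract: receives the raw list of samples for one batch.
    return sum(samples)

dataset = data.SquareDataset()
loader = data.DataLoader(dataset, batch_size=2, collate_fn=sum_batch)

for batch in loader:
    print(batch)  # e.g. 1 (0 + 1), 13 (4 + 9), ...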
Tip
Shuffling data
When shuffle=True, the data order is randomized at the beginning of every epoch. This is useful for preventing the model from overfitting to the data order.
dataset = SquareDataset()
loader = DataLoader(dataset, batch_size=2, shuffle=True)
for epoch in range(2):
    print(f"Epoch {epoch+1}")
    for batch in loader:
        print(batch)  # Batches are randomly shuffled for each epoch
Warning
Batch size constraints
If the total number of samples is not a multiple of batch_size, the last batch will contain fewer samples than the batch size.
dataset = SquareDataset() # 10 samples in total
loader = DataLoader(dataset, batch_size=3)
for batch in loader:
    print(batch)  # Last batch will have fewer than 3 samples
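Since the partial batch is still yielded, it also counts toward len(loader), which rounds up as described for __len__ above:
print(len(loader))  # Output: 4 (three full batches of 3 plus one batch of 1)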