### What is your issue?
### Current state

Currently, xbatcher v0.3.0's `BatchGenerator` is an all-in-one class that does too many things, and more features are planned. The 400+ lines of code at https://github.com/xarray-contrib/xbatcher/blob/v0.3.0/xbatcher/generators.py are not easy to understand and contribute to without first spending a few hours on them. To make things more maintainable and future-proof, we might need a major refactor.
### Proposal

Split `BatchGenerator` into 2 (or more) subcomponents. Specifically:

- A `Slicer` that does the slicing/subsetting/cropping/tiling/chipping of a multi-dimensional `xarray` object.
- A `Batcher` that groups the pieces from the `Slicer` into batches of data.
These are the parameters from the current `BatchGenerator` that would be handled by each component (a rough sketch of the split follows the list):

`Slicer`:

- `input_dims`
- `input_overlap`

`Batcher`:

- `batch_dims`
- `concat_input_dims`
- `preload_batch`
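
As a rough illustration, here is a minimal sketch of how the two components could fit together. Only the names `Slicer` and `Batcher` and the parameters listed above come from this proposal; the method layout, the 1-D slicing logic, and the `batch_size` shortcut (standing in for `batch_dims`) are assumptions made for this sketch.

```python
import numpy as np
import xarray as xr

class Slicer:
    """Hypothetical: yields chips/tiles sliced from a multi-dimensional xarray object."""

    def __init__(self, ds, input_dims, input_overlap=None):
        self.ds = ds
        self.input_dims = input_dims
        self.input_overlap = input_overlap or {}

    def __iter__(self):
        # 1-D case only, for brevity: step through the dimension in strides of
        # (chip size - overlap), yielding one xr.Dataset window at a time.
        dim, size = next(iter(self.input_dims.items()))
        stride = size - self.input_overlap.get(dim, 0)
        for start in range(0, self.ds.sizes[dim] - size + 1, stride):
            yield self.ds.isel({dim: slice(start, start + size)})

class Batcher:
    """Hypothetical: groups chips from a Slicer into batches of data."""

    def __init__(self, chips, batch_size):
        self.chips = chips
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        for chip in self.chips:
            batch.append(chip)
            if len(batch) == self.batch_size:
                yield xr.concat(batch, dim="sample")
                batch = []
        if batch:  # final partial batch
            yield xr.concat(batch, dim="sample")

ds = xr.Dataset({"t": ("x", np.arange(100))})
for batch in Batcher(Slicer(ds, input_dims={"x": 10}), batch_size=4):
    print(dict(batch.sizes))  # first batch has sizes sample=4, x=10
```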
### Benefits

- A NaN checker could be inserted in between `Slicer` and `Batcher` (see the first sketch after this list)
- All the extra logic for deleting/adding extra dimensions can be done on the `Batcher` side, or in a step after the `Batcher`
- Allows for creating train/val/test splits after the `Slicer` but before the `Batcher`
  - Verde & Xbatcher -> Any connections / shared use? #78
  - Also, some people shuffle after getting slices of data, while others shuffle after batches are created, xref Add ability to shuffle (and reshuffle) batches #170
- Streaming data for performance reasons (see the second sketch after this list)
  - In torchdata, it is possible to have the `Slicer` run in parallel with the `Batcher`. E.g. with a `batch_size` of 128, the `Slicer` would load up to 128 chips and pass them on to the `Batcher`, which feeds them to the ML model while the next round of data processing happens, all without loading everything into memory.
  - https://github.com/orgs/xarray-contrib/projects/1
- Flexibility in which step to cache things at
  - In Cache batches #109, the proposal was to cache things after the `Batcher`, once the batches have already been generated. Sometimes though, people might want to treat `batch_size` as a hyperparameter in their ML experimentation, in which case the cache should happen after the `Slicer`.
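
To make the filtering and splitting benefits concrete, here is a sketch that reuses the hypothetical `Slicer`/`Batcher` from the proposal section; the helpers `drop_nan_chips` and `train_val_split` are illustrative assumptions, not proposed API.

```python
import random

def drop_nan_chips(chips):
    # NaN checker inserted between Slicer and Batcher: skip chips with missing data.
    for chip in chips:
        if not bool(chip.isnull().any().to_array().any()):
            yield chip

def train_val_split(chips, val_fraction=0.2, seed=42):
    # Split (and shuffle, xref #170) at the chip level, before batching.
    chips = list(chips)
    random.Random(seed).shuffle(chips)
    n_val = int(len(chips) * val_fraction)
    return chips[n_val:], chips[:n_val]

chips = drop_nan_chips(Slicer(ds, input_dims={"x": 10}))
train_chips, val_chips = train_val_split(chips)
# Caching could also happen here, after the Slicer, so that batch_size can be
# changed cheaply as a hyperparameter (xref #109).
train_batches = Batcher(train_chips, batch_size=4)
val_batches = Batcher(val_chips, batch_size=4)
```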
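And a sketch of the streaming idea. `IterableWrapper`, `sharding_filter`, `map`, and `batch` are existing torchdata datapipes; chaining them onto the hypothetical `Slicer` is an assumption.

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def chip_to_numpy(chip):
    # Materialize one chip as a numpy array (a named function so workers can pickle it).
    return chip.to_array().values

# Slice lazily, shard the chips across DataLoader workers, and group 128
# chips per batch; workers prepare the next batch while the model trains.
pipe = (
    IterableWrapper(Slicer(ds, input_dims={"x": 10}))
    .sharding_filter()
    .map(chip_to_numpy)
    .batch(batch_size=128)
)

for batch in DataLoader(pipe, batch_size=None, num_workers=2):
    ...  # feed the batch to the ML model
```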
### Cons

- May result in the current one-liner becoming a multi-liner (see the example below)
- Could lead to some backwards-incompatible/breaking changes
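
For example, the current one-liner (real v0.3.0 API) versus what the split might look like, using the hypothetical names from the sketches above:

```python
import xbatcher

# Today: one line.
bgen = xbatcher.BatchGenerator(ds, input_dims={"x": 10}, input_overlap={"x": 2})

# After the refactor: a (hypothetical) two-step pipeline.
chips = Slicer(ds, input_dims={"x": 10}, input_overlap={"x": 2})
bgen = Batcher(chips, batch_size=4)
```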