braindecode.models.PBT#

class braindecode.models.PBT(n_chans=None, n_outputs=None, n_times=None, chs_info=None, input_window_seconds=None, sfreq=None, d_input=64, d_model=128, n_blocks=4, num_heads=4, drop_prob=0.1, learnable_cls=True, bias_transformer=False, activation=<class 'torch.nn.modules.activation.GELU'>)[source]#

Patched Brain Transformer (PBT) model from Klein et al. (2025) [pbt].

Large Brain Model

This implementation is based on timonkl/PatchedBrainTransformer.

Figure: Patched Brain Transformer architecture.

PBT tokenizes EEG trials into per-channel patches, linearly projects each patch to the model embedding dimension, prepends a classification token, and adds channel-aware positional embeddings. The token sequence is processed by a Transformer encoder stack, and classification is performed from the classification token.
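
A minimal construction sketch using the signature above; the concrete values are illustrative only, and the hyperparameters shown match the documented defaults:

from braindecode.models import PBT

model = PBT(
    n_chans=22,      # EEG channels
    n_outputs=4,     # classes
    n_times=1000,    # time samples per trial
    d_input=64,      # patch size in samples (default)
    d_model=128,     # Transformer embedding dimension (default)
    n_blocks=4,      # encoder layers (default)
    num_heads=4,     # attention heads (default)
)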

Macro Components

  • PBT.tokenization (patch extraction)

    Operations. The pre-processed EEG signal \(X \in \mathbb{R}^{C \times T}\) (with \(C = \text{n_chans}\) and \(T = \text{n_times}\)) is divided into non-overlapping patches of size \(d_{\text{input}}\) along the time axis. This yields \(N\) patches in total, \(N = C \left\lfloor \frac{T}{D} \right\rfloor\) (where \(D = d_{\text{input}}\)). When time-shift augmentation is applied, \(N\) decreases to \(N = C \left\lfloor \frac{T - T_{\text{aug}}}{D} \right\rfloor\).

    Role. Tokenizes EEG trials into fixed-size, per-channel patches so the model remains adaptive to different numbers of channels and recording lengths. The patching process is inspired by Vision Transformers [visualtransformer] and adapted to the GPT-style sequence setting following [efficient-batchpacking].

  • PBT.patch_projection (patch embedding)

    Operations. The linear layer PBT.patch_projection maps the tokens from dimension \(d_{\text{input}}\) to the Transformer embedding dimension \(d_{\text{model}}\). Patches \(X_P\) are projected as \(X_E = X_P W_E^\top\), where \(W_E \in \mathbb{R}^{d_{\text{model}} \times D}\). In the default configuration, \(d_{\text{model}} = 2D\) with \(D = d_{\text{input}}\).

    Interpretability. Learns periodic structures similar to frequency filters in the first convolutional layers of CNNs (for example EEGNet). The learned filters frequently focus on the high-frequency range (20-40 Hz), which correlates with beta and gamma waves linked to higher concentration levels.

  • PBT.cls_token (classification token)

    Operations. A classification token \([c_{\text{ls}}] \in \mathbb{R}^{1 \times d_{\text{model}}}\) is prepended to the projected patch sequence \(X_E\). The CLS token can optionally be learnable (see learnable_cls).

    Role. Acts as a dedicated readout token that aggregates information through the Transformer encoder stack.

  • PBT.pos_embedding (positional embedding)

    Operations. Positional indices are generated by PBT.linear_projection, an instance of _ChannelEncoding, and mapped to vectors through a torch.nn.Embedding table. The embedding table \(W_{\text{pos}} \in \mathbb{R}^{(N+1) \times d_{\text{model}}}\) is added to the token sequence, yielding \(X_{\text{pos}} = [c_{\text{ls}}, X_E] + W_{\text{pos}}\).

    Role/Interpretability. Introduces spatial and temporal dependence to counter the position invariance of the Transformer encoder. The learned positional embedding exposes spatial relationships, often revealing a symmetric pattern in central regions (C1-C6) associated with the motor cortex.

  • PBT.transformer_encoder (sequence processing and attention)

    Operations. The token sequence passes through \(n_{\text{blocks}}\) Transformer encoder layers. Each block combines a Multi-Head Self-Attention (MHSA) module with num_heads attention heads and a Feed-Forward Network (FFN). Both sub-modules are wrapped in parallel residual connections with Layer Normalization inside the blocks, and dropout (drop_prob) is applied within the Transformer components.

    Role/Robustness. Self-attention enables every token to consider all others, capturing global temporal and spatial dependencies immediately and adaptively. This architecture accommodates arbitrary numbers of patches and channels, supporting pre-training across diverse datasets.

  • PBT.final_layer (readout)

    Operations. A linear layer operates on the processed CLS token only, and the model predicts class probabilities as \(y = \operatorname{softmax}([c_{\text{ls}}] W_{\text{class}}^\top + b_{\text{class}})\).

    Role. Performs the final classification from the information aggregated into the CLS token after the Transformer encoder stack. A plain-PyTorch sketch of the full patch-to-logits pipeline follows this list.
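
The following is a minimal plain-PyTorch sketch of the data flow described above (tokenize, project, prepend CLS, add positional embeddings, encode, read out). It is not the braindecode implementation: variable names and the use of torch.nn.TransformerEncoder are illustrative assumptions, the simple per-position embedding stands in for the channel-aware indices produced by _ChannelEncoding, and the block layout of the real model (e.g. its parallel residual connections) may differ.

import torch
import torch.nn as nn

B, C, T = 8, 22, 1000            # batch, n_chans, n_times
D, d_model = 64, 128             # d_input (patch size), embedding dimension
n_blocks, num_heads, n_outputs = 4, 4, 4

x = torch.randn(B, C, T)

# 1) Tokenization: C * floor(T / D) non-overlapping per-channel patches
n_per_chan = T // D
patches = x[:, :, : n_per_chan * D].reshape(B, C * n_per_chan, D)    # (B, N, D)

# 2) Patch embedding: linear projection from D to d_model
patch_projection = nn.Linear(D, d_model)
tokens = patch_projection(patches)                                   # (B, N, d_model)

# 3) Prepend a (learnable) classification token
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
tokens = torch.cat([cls_token.expand(B, -1, -1), tokens], dim=1)     # (B, N + 1, d_model)

# 4) Add positional embeddings (one vector per token position)
pos_embedding = nn.Embedding(tokens.shape[1], d_model)
tokens = tokens + pos_embedding(torch.arange(tokens.shape[1]))

# 5) Transformer encoder stack (MHSA + FFN blocks, GELU, dropout 0.1)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=num_heads, dropout=0.1,
    activation="gelu", batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_blocks)
tokens = encoder(tokens)

# 6) Readout: classify from the processed CLS token only
final_layer = nn.Linear(d_model, n_outputs)
logits = final_layer(tokens[:, 0])                                   # (B, n_outputs)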

Convolutional Details

PBT omits convolutional layers; equivalent feature extraction is carried out by the patch pipeline and attention stack.

  • Temporal. Tokenization slices the EEG into fixed windows of size \(D = d_{\text{input}}\) (for the default configuration, \(D = 64\) samples \(\approx 0.256\,\text{s}\) at \(250\,\text{Hz}\)), while PBT.patch_projection learns periodic patterns within each patch. The Transformer encoder then models long- and short-range temporal dependencies through self-attention (see the patch-budget sketch after this list).

  • Spatial. Patches are channel-specific, keeping the architecture adaptive to any electrode montage. Channel-aware positional encodings \(W_{\text{pos}}\) capture relationships between nearby sensors; learned embeddings often form symmetric motifs across motor cortex electrodes (C1–C6), and self-attention propagates information across all channels jointly.

  • Spectral. PBT.patch_projection acts similarly to the first convolutional layer in EEGNet, learning frequency-selective filters without an explicit Fourier transform. The highest-energy filters typically reside between \(20\) and \(40\,\text{Hz}\), aligning with beta/gamma rhythms tied to focused motor imagery.
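
As a quick check of the patch arithmetic stated above, assuming the default D = 64 samples and a 250 Hz recording with 22 channels and 1000 samples (the concrete numbers are only an example):

C, T, D, sfreq = 22, 1000, 64, 250.0     # n_chans, n_times, d_input, sampling rate

patch_duration_s = D / sfreq             # 64 / 250 = 0.256 s per patch
patches_per_channel = T // D             # floor(1000 / 64) = 15
N = C * patches_per_channel              # 22 * 15 = 330 patch tokens (plus 1 CLS token)

print(patch_duration_s, patches_per_channel, N)   # 0.256 15 330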

Attention / Sequential Modules

  • Attention Details. PBT.transformer_encoder stacks \(n_{\text{blocks}}\) Transformer encoder layers with Multi-Head Self-Attention. Every token attends to all others, enabling immediate global integration across time and channels and supporting heterogeneous datasets. Attention rollout visualisations highlight strong activations over motor cortex electrodes (C3, C4, Cz) during motor imagery decoding.
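
Attention rollout itself is a generic visualisation technique: residual-corrected, head-averaged attention maps are multiplied across layers. The sketch below does not use any braindecode API; it assumes the per-layer attention matrices have already been collected (e.g. via forward hooks on the encoder), and the helper name is hypothetical:

import torch

def attention_rollout(attentions):
    """attentions: list of head-averaged (n_tokens, n_tokens) attention maps, one per layer."""
    n_tokens = attentions[0].shape[-1]
    rollout = torch.eye(n_tokens)
    for attn in attentions:
        attn = attn + torch.eye(n_tokens)             # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)  # re-normalise rows
        rollout = attn @ rollout                      # accumulate across layers
    return rollout

# rollout[0, 1:] is the CLS token's accumulated attention over the patch tokens;
# reshaping it to (n_chans, patches_per_channel) gives a channel-by-time map.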

Warning

Important: Like the other Large Brain Models in Braindecode, PBT is designed for large-scale pre-training and fine-tuning. Training from scratch on small datasets may lead to suboptimal results. Cross-dataset pre-training and subsequent fine-tuning are recommended to leverage the full potential of this architecture.

Parameters:
  • n_chans (int) – Number of EEG channels.

  • n_outputs (int) – Number of outputs of the model. This is the number of classes in the case of classification.

  • n_times (int) – Number of time samples of the input window.

  • chs_info (list of dict) – Information about each individual EEG channel. This should be filled with info["chs"]. Refer to mne.Info for more details.

  • input_window_seconds (float) – Length of the input window in seconds.

  • sfreq (float) – Sampling frequency of the EEG recordings.

  • d_input (int, optional) – Size (in samples) of each patch (token) extracted along the time axis.

  • d_model (int, optional) – Transformer embedding dimensionality.

  • n_blocks (int, optional) – Number of Transformer encoder layers.

  • num_heads (int, optional) – Number of attention heads.

  • drop_prob (float, optional) – Dropout probability used in Transformer components.

  • learnable_cls (bool, optional) – Whether the classification token is learnable.

  • bias_transformer (bool, optional) – Whether to use bias in Transformer linear layers.

  • activation (nn.Module, optional) – Activation function class to use in Transformer feed-forward layers.

Raises:

ValueError – If some input signal-related parameters are not specified and cannot be inferred.

Notes

If some input signal-related parameters are not specified, there will be an attempt to infer them from the other parameters.
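
A small sketch of this inference, assuming the usual braindecode behaviour of deriving n_times from input_window_seconds and sfreq:

from braindecode.models import PBT

# n_times is omitted and inferred as input_window_seconds * sfreq
# (here 4.0 s * 250 Hz = 1000 samples).
model = PBT(n_chans=22, n_outputs=4, input_window_seconds=4.0, sfreq=250.0)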

Hugging Face Hub integration

When the optional huggingface_hub package is installed, all models automatically gain the ability to be pushed to and loaded from the Hugging Face Hub. Install with:

pip install braindecode[hug]

Pushing a model to the Hub:

from braindecode.models import EEGNetv4

# Train your model
model = EEGNetv4(n_chans=22, n_outputs=4, n_times=1000)
# ... training code ...

# Push to the Hub
model.push_to_hub(
    repo_id="username/my-eegnet-model", commit_message="Initial model upload"
)

Loading a model from the Hub:

from braindecode.models import EEGNetv4

# Load pretrained model
model = EEGNetv4.from_pretrained("username/my-eegnet-model")

The integration automatically handles EEG-specific parameters (n_chans, n_times, sfreq, chs_info, etc.) by saving them in a config file alongside the model weights. This ensures that loaded models are correctly configured for their original data specifications.

Important

Currently, only EEG-specific parameters (n_outputs, n_chans, n_times, input_window_seconds, sfreq, chs_info) are saved to the Hub. Model-specific parameters (e.g., dropout rates, activation functions, number of filters) are not preserved and will use their default values when loading from the Hub.

To use non-default model parameters, specify them explicitly when calling from_pretrained():

model = EEGNetv4.from_pretrained("user/model", drop_prob=0.3)

Full parameter serialization will be addressed in a future update.

References

[pbt]

Klein, T., Minakowski, P., & Sager, S. (2025). Flexible Patched Brain Transformer model for EEG decoding. Scientific Reports, 15(1), 1-12. https://www.nature.com/articles/s41598-025-86294-3

[visualtransformer]

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR).

[efficient-batchpacking]

Krell, M. M., Kosec, M., Perez, S. P., & Fitzgibbon, A. (2021). Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance. arXiv preprint arXiv:2107.02027.

Methods

forward(X)[source]#

The implementation follows the original code logic:

  • split input into windows of size (num_embeddings - 1) * d_input

  • for each window: reshape into tokens, map positional indices to embeddings, add cls token, run Transformer encoder and collect CLS outputs

  • aggregate CLS outputs across windows (if >1) and pass through final_layer

Parameters:

X (torch.Tensor) – Input tensor with shape (batch_size, n_chans, n_times)

Returns:

Output logits with shape (batch_size, n_outputs).

Return type:

torch.Tensor
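
A minimal forward-pass sketch under assumed input dimensions (22 channels, 1000 samples, 4 classes):

import torch
from braindecode.models import PBT

model = PBT(n_chans=22, n_outputs=4, n_times=1000)
X = torch.randn(8, 22, 1000)     # (batch_size, n_chans, n_times)
logits = model(X)                # expected shape: (8, 4) = (batch_size, n_outputs)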