braindecode.models.AttentionBaseNet
- class braindecode.models.AttentionBaseNet(n_times=None, n_chans=None, n_outputs=None, chs_info=None, sfreq=None, input_window_seconds=None, n_temporal_filters=40, temp_filter_length_inp=25, spatial_expansion=1, pool_length_inp=75, pool_stride_inp=15, drop_prob_inp=0.5, ch_dim=16, temp_filter_length=15, pool_length=8, pool_stride=8, drop_prob_attn=0.5, attention_mode=None, reduction_rate=4, use_mlp=False, freq_idx=0, n_codewords=4, kernel_size=9, activation=<class 'torch.nn.modules.activation.ELU'>, extra_params=False)
AttentionBaseNet from Wimpff M et al. (2023) [Martin2023].
Convolution Small Attention
Architectural Overview
AttentionBaseNet is a convolution-first network with a channel-attention stage. The end-to-end flow is:
(i) _FeatureExtractor learns a temporal filter bank and per-filter spatial projections (depthwise across electrodes), then condenses time by pooling;
(ii) Channel Expansion uses a 1x1 convolution to set the feature width;
(iii) _ChannelAttentionBlock refines features via depthwise–pointwise temporal convs and an optional channel-attention module (SE/CBAM/ECA/…);
(iv) Classifier flattens the sequence and applies a linear readout.
This design mirrors shallow CNN pipelines (EEGNet-style stem) but inserts a pluggable attention unit that re-weights channels (and optionally temporal positions) before classification.
Macro Components
_FeatureExtractor (Shallow conv stem → condensed feature map)
Operations.
- Temporal conv (torch.nn.Conv2d) with kernel (1, L_t) creates a learned FIR-like filter bank with n_temporal_filters maps.
- Depthwise spatial conv (torch.nn.Conv2d, groups=n_temporal_filters) with kernel (n_chans, 1) learns per-filter spatial projections over the full montage.
- BatchNorm → ELU → AvgPool → Dropout stabilize and downsample time.
- Output shape: (B, F2, 1, T₁) with F2 = n_temporal_filters x spatial_expansion.
Interpretability/robustness. Temporal kernels behave as analyzable FIR filters; the depthwise spatial step yields rhythm-specific topographies. Pooling acts as a local integrator that reduces variance on short EEG windows.
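The stem's three steps (temporal filtering, spatial projection, average pooling) can be illustrated on a toy signal. The following is a minimal pure-Python sketch, not the library implementation: the helper names fir_filter, spatial_projection and avg_pool are hypothetical, the kernel and weights are hand-picked rather than learned, and BatchNorm, ELU and Dropout are omitted.

```python
def fir_filter(signal, kernel):
    """Valid-mode 1-D cross-correlation, the operation a (1, L_t) conv kernel applies."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def spatial_projection(filtered, weights):
    """Weighted sum across electrodes at each time step, like an (n_chans, 1) kernel."""
    n_steps = len(filtered[0])
    return [
        sum(w * chan[t] for w, chan in zip(weights, filtered))
        for t in range(n_steps)
    ]


def avg_pool(signal, length, stride):
    """Average pooling over time with the given window length and stride."""
    return [
        sum(signal[i:i + length]) / length
        for i in range(0, len(signal) - length + 1, stride)
    ]


# Toy input: 3 electrodes, 10 samples each (illustrative values only).
eeg = [
    [float(t) for t in range(10)],     # slow ramp
    [1.0] * 10,                        # constant offset
    [(-1.0) ** t for t in range(10)],  # fast alternation
]
kernel = [0.5, 0.5]         # 2-tap moving average (a low-pass FIR filter)
weights = [1.0, 0.0, -1.0]  # contrast electrode 0 against electrode 2

filtered = [fir_filter(ch, kernel) for ch in eeg]
projected = spatial_projection(filtered, weights)
pooled = avg_pool(projected, length=3, stride=3)
print(pooled)
```

The moving-average kernel cancels the fast alternation on the third electrode, so the projection keeps only the smoothed ramp; pooling then reduces nine time steps to three.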
Channel Expansion
Operations.
- A 1x1 conv → BN → activation maps F2 → ch_dim without changing the temporal length T₁ (shape: (B, ch_dim, 1, T₁)). This sets the embedding width for the attention block.
_ChannelAttentionBlock (temporal refinement + channel attention)
Operations.
- Depthwise temporal conv (1, L_a) (groups=ch_dim) + pointwise 1x1, BN and activation → preserves shape (B, ch_dim, 1, T₁) while refining timing.
- Optional attention module (see Attention / Sequential Modules) applies channel reweighting (some variants also apply temporal gating).
- AvgPool (1, P₂) with stride (1, S₂) and Dropout → outputs (B, ch_dim, 1, T₂).
Role. Emphasizes informative channels (and, in certain modes, salient time steps) before the classifier; complements the convolutional priors with adaptive re-weighting.
Classifier (aggregation + readout)
Operations.
- torch.nn.Flatten → torch.nn.Linear from (B, ch_dim·T₂) to classes.
Convolutional Details
- Temporal (where time-domain patterns are learned).
Wide kernels in the stem ((1, L_t)) act as a learned filter bank for oscillatory bands/transients; the attention block’s depthwise temporal conv ((1, L_a)) sharpens short-term dynamics after downsampling. Pool sizes/strides (P₁, S₁ then P₂, S₂) set the token rate and effective temporal resolution.
- Spatial (how electrodes are processed).
A depthwise spatial conv with kernel (n_chans, 1) spans the full montage to learn per-temporal-filter spatial projections (no cross-filter mixing at this step), mirroring the interpretable spatial stage in shallow CNNs.
- Spectral (how frequency content is captured).
No explicit Fourier/wavelet transform is used in the stem; spectral selectivity emerges from learned temporal kernels. When attention_mode="fca", a frequency channel attention (DCT-based) summarizes frequencies to drive channel weights.
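To make the DCT-based summary concrete, the sketch below computes one DCT-II coefficient per channel along the time axis. It is an illustration under assumptions, not the library code: the function name dct_descriptor is hypothetical, the coefficient is left unnormalized, and how the "fca" mode scales or combines such descriptors is not stated in this page. With freq_idx=0 the cosine basis is constant, so the descriptor reduces to a plain sum over time, which is proportional to the global average used by SE-style attention.

```python
import math


def dct_descriptor(channel, freq_idx):
    """Unnormalized DCT-II coefficient of one channel's time course."""
    n = len(channel)
    return sum(
        x * math.cos(math.pi * freq_idx * (t + 0.5) / n)
        for t, x in enumerate(channel)
    )


# Toy feature map: 2 channels, 4 time steps (illustrative values only).
features = [
    [1.0, 2.0, 3.0, 4.0],  # trending channel
    [1.0, 1.0, 1.0, 1.0],  # constant channel
]

dc = [dct_descriptor(ch, freq_idx=0) for ch in features]
first = [dct_descriptor(ch, freq_idx=1) for ch in features]
print(dc, first)
```

A constant channel has no energy at freq_idx=1, while the trending channel does, so a non-zero index lets the channel descriptor respond to temporal structure that an average would discard.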
Attention / Sequential Modules
- Type. Channel attention chosen by attention_mode (SE, ECA, FCA, CBAM, CAT, GSoP, EncNet, GE, GCT, SRM, CATLite). Most operate purely on channels; CBAM/CAT additionally include temporal attention.
- Shapes. Input/Output around attention: (B, ch_dim, 1, T₁). Re-arrangements (if any) are internal to the module; the block returns the same shape before pooling.
- Role. Re-weights channels (and optionally time) to highlight informative sources and suppress distractors, improving SNR ahead of the linear head.
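As an illustration of channel re-weighting with unchanged shape, here is a minimal pure-Python sketch of squeeze-and-excitation style gating. It is a toy under assumptions rather than the module used by the library: the function name squeeze_excite and the weight matrices are hypothetical, the weights are hand-picked instead of learned, and the bottleneck width of ch_dim // reduction_rate is an assumed reading of reduction_rate.

```python
import math


def squeeze_excite(features, w_reduce, w_expand):
    """Gate each channel of a (ch_dim, T1) feature map; the shape is preserved."""
    # Squeeze: one descriptor per channel (global average over time).
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excite: bottleneck layer with ReLU, then one sigmoid gate per channel.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Re-weight: scale every time step of a channel by that channel's gate.
    reweighted = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return reweighted, gates


# Toy feature map with ch_dim=4 and T1=3; bottleneck width 4 // 4 = 1.
features = [
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],
    [4.0, 4.0, 4.0],
    [2.0, 2.0, 2.0],
]
w_reduce = [[0.25, 0.25, 0.25, 0.25]]       # 4 -> 1
w_expand = [[1.0], [0.0], [-1.0], [2.0]]    # 1 -> 4

reweighted, gates = squeeze_excite(features, w_reduce, w_expand)
print(gates)
```

Every gate lies strictly between 0 and 1, so channels are attenuated relative to each other rather than removed, and the output keeps the (ch_dim, T₁) layout expected by the pooling step that follows.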
If the input duration is too short for the configured kernels/pools, the implementation automatically rescales temporal lengths/strides downward (with a warning) to keep shapes valid and preserve the pipeline semantics.
Usage and Configuration
- n_temporal_filters, temp_filter_length_inp and spatial_expansion: control the capacity and the number of spatial projections in the stem.
- pool_length_inp, pool_stride_inp then pool_length, pool_stride: trade temporal resolution for compute; they determine the final sequence length T₂.
- ch_dim: width after the 1x1 expansion and the effective embedding size for attention.
- attention_mode + its specific hyperparameters (reduction_rate, kernel_size, seq_len, freq_idx, n_codewords, use_mlp): select and tune the reweighting mechanism.
- drop_prob_inp and drop_prob_attn: regularize stem and attention stages.
Training tips.
Start with moderate pooling (e.g., P₁=75, S₁=15) and ELU activations; enable attention only after the stem learns stable filters. For small datasets, prefer simpler modes ("se", "eca") before heavier ones ("gsop", "encnet").
- Parameters:
n_times (int) – Number of time samples of the input window.
n_chans (int) – Number of EEG channels.
n_outputs (int) – Number of outputs of the model. This is the number of classes in the case of classification.
chs_info (list of dict) – Information about each individual EEG channel. This should be filled with info["chs"]. Refer to mne.Info for more details.
sfreq (float) – Sampling frequency of the EEG recordings.
input_window_seconds (float) – Length of the input window in seconds.
n_temporal_filters (int, optional) – Number of temporal convolutional filters in the first layer. This defines the number of output channels after the temporal convolution. Default is 40.
temp_filter_length_inp (int, optional) – Length of the temporal convolution kernel in the input layer (the (1, L_t) kernel of the stem). Default is 25.
spatial_expansion (int, optional) – Multiplicative factor to expand the spatial dimensions. Used to increase the capacity of the model by expanding spatial features. Default is 1.
pool_length_inp (int, optional) – Length of the pooling window in the input layer. Determines how much temporal information is aggregated during pooling. Default is 75.
pool_stride_inp (int, optional) – Stride of the pooling operation in the input layer. Controls the downsampling factor in the temporal dimension. Default is 15.
drop_prob_inp (float, optional) – Dropout rate applied after the input layer. This is the probability of zeroing out elements during training to prevent overfitting. Default is 0.5.
ch_dim (int, optional) – Number of channels in the subsequent convolutional layers. This controls the feature width of the network after the initial layer. Default is 16.
temp_filter_length (int, default=15) – The length of the temporal filters in the convolutional layers.
pool_length (int, default=8) – The length of the window for the average pooling operation.
pool_stride (int, default=8) – The stride of the average pooling operation.
drop_prob_attn (float, default=0.5) – The dropout rate for regularization for the attention layer. Values should be between 0 and 1.
attention_mode (str, optional) – The type of attention mechanism to apply. If None, no attention is applied.
- "se" for Squeeze-and-excitation network
- "gsop" for Global Second-Order Pooling
- "fca" for Frequency Channel Attention Network
- "encnet" for context encoding module
- "eca" for Efficient channel attention for deep convolutional neural networks
- "ge" for Gather-Excite
- "gct" for Gated Channel Transformation
- "srm" for Style-based Recalibration Module
- "cbam" for Convolutional Block Attention Module
- "cat" for Learning to collaborate channel and temporal attention from multi-information fusion
- "catlite" for Learning to collaborate channel attention from multi-information fusion (lite version, cat w/o temporal attention)
reduction_rate (int, default=4) – The reduction rate used in the attention mechanism to reduce dimensionality and computational complexity.
use_mlp (bool, default=False) – Flag to indicate whether an MLP (Multi-Layer Perceptron) should be used within the attention mechanism for further processing.
freq_idx (int, default=0) – DCT index used in fca attention mechanism.
n_codewords (int, default=4) – The number of codewords (clusters) used in attention mechanisms that employ quantization or clustering strategies.
kernel_size (int, default=9) – The kernel size used in certain types of attention mechanisms for convolution operations.
activation (nn.Module, default=nn.ELU) – Activation function class to apply. Should be a PyTorch activation module class like nn.ReLU or nn.ELU. Default is nn.ELU.
extra_params (bool, default=False) – Flag to indicate whether additional, custom parameters should be passed to the attention mechanism.
- Raises:
ValueError – If some input signal-related parameters are not specified and cannot be inferred.
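The inference rule behind this error can be sketched as follows. This is a hedged illustration, not the library's code: the function name infer_n_times is hypothetical, and the relation n_times = input_window_seconds x sfreq (truncated to an integer) is an assumption based on the parameter descriptions above.

```python
def infer_n_times(n_times=None, sfreq=None, input_window_seconds=None):
    """Return the window length in samples, inferring it when possible."""
    if n_times is not None:
        return int(n_times)
    if sfreq is not None and input_window_seconds is not None:
        # Assumed relation: samples = seconds * samples-per-second.
        return int(input_window_seconds * sfreq)
    raise ValueError(
        "n_times is not specified and cannot be inferred; "
        "pass n_times, or both sfreq and input_window_seconds."
    )


print(infer_n_times(sfreq=250.0, input_window_seconds=4.0))
```

Passing only sfreq, or only input_window_seconds, leaves the window length undetermined, which is the situation the ValueError reports.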
Notes
- Sequence length after each stage is computed internally; the final classifier expects a flattened ch_dim x T₂ vector.
- Attention operates on the channel dimension by design; temporal gating exists only in specific variants (CBAM/CAT).
- The paper and original code with more details about the methodological choices are available in [Martin2023] and [MartinCode].
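The sequence lengths T₁ and T₂ follow from standard convolution and pooling arithmetic. The sketch below is an approximation under stated assumptions, not the internal computation: the helper names out_length and sequence_lengths are hypothetical, the padding of the stem's temporal convolution is not given in this page and is exposed as the stem_padding argument, and the attention block's temporal convolution is taken to preserve T₁ as described above.

```python
def out_length(length, kernel, stride=1, padding=0):
    """Output length of a 1-D conv or pool with dilation 1 (floor mode)."""
    return (length + 2 * padding - kernel) // stride + 1


def sequence_lengths(n_times, temp_filter_length_inp=25, pool_length_inp=75,
                     pool_stride_inp=15, pool_length=8, pool_stride=8,
                     ch_dim=16, stem_padding=0):
    """Trace the time axis through the stem and the attention block."""
    # Stem: temporal conv, then average pooling with (P1, S1).
    after_conv = out_length(n_times, temp_filter_length_inp, padding=stem_padding)
    t1 = out_length(after_conv, pool_length_inp, stride=pool_stride_inp)
    # Attention block: shape-preserving convs, then pooling with (P2, S2).
    t2 = out_length(t1, pool_length, stride=pool_stride)
    # Classifier input: flattened ch_dim x T2 vector.
    return {"T1": t1, "T2": t2, "flat_features": ch_dim * t2}


print(sequence_lengths(1000))
print(sequence_lengths(1000, stem_padding=12))
```

For a 1000-sample window with the default settings, both padding choices give T₂ = 7 and a 112-dimensional classifier input; only T₁ differs (61 without padding, 62 with a padding of 12).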
Added in version 0.9.
References
[Martin2023] Wimpff, M., Gizzi, L., Zerfowski, J. and Yang, B., 2023. EEG motor imagery decoding: A framework for comparative analysis with channel attention mechanisms. arXiv preprint arXiv:2310.11198.
[MartinCode] Wimpff, M., Gizzi, L., Zerfowski, J. and Yang, B. GitHub martinwimpff/channel-attention (accessed 2024-03-28)
Methods
- forward(x)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Parameters:
x – The description is missing.
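A usage sketch, assuming braindecode 0.9 or later and PyTorch are installed. The keyword arguments come from the signature documented above with illustrative values; the input layout (batch, n_chans, n_times) is an assumption not stated in this page, and run_demo is a hypothetical helper. The imports sit inside the function so that the configuration can be inspected without either package.

```python
# Keyword arguments from the documented signature; the values are illustrative.
config = dict(
    n_chans=22,
    n_outputs=4,
    n_times=1000,
    attention_mode="se",
    ch_dim=16,
    drop_prob_attn=0.5,
)


def run_demo(batch_size=8):
    """Build the model and run one forward pass on random data."""
    import torch
    from braindecode.models import AttentionBaseNet

    model = AttentionBaseNet(**config)
    model.eval()  # disable dropout for a deterministic forward pass
    # Assumed input layout: (batch, n_chans, n_times).
    x = torch.randn(batch_size, config["n_chans"], config["n_times"])
    with torch.no_grad():
        out = model(x)  # call the module instance so registered hooks run
    return out.shape
```

Calling run_demo() returns the output shape; the model is invoked as model(x) rather than model.forward(x), in line with the note above.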