Conversation

@jfsantos
Contributor

Optimizes grouped convolutions by pre-computing weight block indices and unrolling loops for common group counts. Also adds a performance benchmark for Conv1D and Conv1x1.

Other potential updates (still not implemented):

  • Store weights as block-diagonal sparse matrix, do one matmul instead of G matmuls
  • Templated convolutions with compile-time static shapes to leverage specific compile-time optimizations
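The block-diagonal idea in the first bullet can be sketched as follows. Names and memory layouts here are illustrative, not the repo's actual API: scattering the G per-group weight blocks onto the diagonal of a full (C_out × C_in) matrix lets a single dense matmul compute all groups at once, at the cost of multiplying the off-diagonal zeros.

```cpp
#include <cassert>
#include <vector>

// Naive dense matmul: (m x k) * (k x n) -> (m x n).
// A real implementation would call into BLAS here.
std::vector<float> matmul(const std::vector<float>& A, const std::vector<float>& B,
                          int m, int k, int n) {
    std::vector<float> C(m * n, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}

// Reference: grouped 1x1 conv as G small per-group products.
// w holds G blocks of shape (cout/G x cin/G), x is (cin x T).
std::vector<float> conv1x1_grouped(const std::vector<float>& w,
                                   const std::vector<float>& x,
                                   int cin, int cout, int G, int T) {
    int ci = cin / G, co = cout / G;
    std::vector<float> y(cout * T, 0.0f);
    for (int g = 0; g < G; ++g)
        for (int i = 0; i < co; ++i)
            for (int p = 0; p < ci; ++p)
                for (int t = 0; t < T; ++t)
                    y[(g * co + i) * T + t] +=
                        w[(g * co + i) * ci + p] * x[(g * ci + p) * T + t];
    return y;
}

// Same conv via one (cout x cin) block-diagonal matrix and a single matmul.
std::vector<float> conv1x1_blockdiag(const std::vector<float>& w,
                                     const std::vector<float>& x,
                                     int cin, int cout, int G, int T) {
    int ci = cin / G, co = cout / G;
    // Mostly zeros; only the G diagonal blocks are populated.
    std::vector<float> W(cout * cin, 0.0f);
    for (int g = 0; g < G; ++g)
        for (int i = 0; i < co; ++i)
            for (int p = 0; p < ci; ++p)
                W[(g * co + i) * cin + (g * ci + p)] = w[(g * co + i) * ci + p];
    return matmul(W, x, cout, cin, T);
}
```

The trade-off is visible in the structure: one big GEMM gives BLAS more work to schedule, but (G - 1)/G of the multiplications are against zeros, which is why this only pays off when the matmul is large enough to amortize them.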

João Felipe Santos and others added 4 commits January 28, 2026 12:49
…chmarking tool for convolution performance.
… for Conv1D

Conv1x1: Use explicit group loop with groups=1 fast path. For small channel
counts (2-8), this avoids the overhead of zero multiplications in block-diagonal
matrices that BLAS cannot optimize efficiently.

Conv1D: Keep block-diagonal approach (single matmul per kernel position) which
shows 1.5-1.9x speedup for grouped convolutions. The multiple kernel positions
amortize the overhead, making this approach beneficial.
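For reference, the Conv1x1 structure described above might look like the sketch below; names are hypothetical, and the inner `gemm` stands in for whatever BLAS call the implementation actually uses. With groups == 1 it degenerates to a single dense matmul with no group bookkeeping, and for G > 1 each group gets its own small matmul, so no zero blocks are ever touched.

```cpp
#include <cassert>
#include <vector>

// Stand-in for a BLAS gemm: accumulates (m x k) * (k x n) into C,
// which the caller must zero-initialize.
void gemm(const float* A, const float* B, float* C, int m, int k, int n) {
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

// Conv1x1 with an explicit group loop and a groups == 1 fast path.
// w: G blocks of (cout/G x cin/G), x: (cin x T), y: (cout x T), pre-zeroed.
void conv1x1(const float* w, const float* x, float* y,
             int cin, int cout, int G, int T) {
    if (G == 1) {
        // Fast path: one dense matmul, no per-group offsets.
        gemm(w, x, y, cout, cin, T);
        return;
    }
    int ci = cin / G, co = cout / G;
    for (int g = 0; g < G; ++g)
        // One small matmul per group; input/output/weight blocks are
        // contiguous, so each group is just a pointer offset.
        gemm(w + g * co * ci, x + g * ci * T, y + g * co * T, co, ci, T);
}
```

For small channel counts the per-group matmuls are tiny, which is exactly the regime where the commit message reports that the block-diagonal form loses: the zero multiplications dominate and BLAS cannot skip them.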

Removed pre-computed GroupBlock structs as they are no longer needed with
these simplified implementations.

Updated benchmark tool to test channels 2-8 for detailed comparison.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>