Cutlass
CUDA Templates for Linear Algebra Subroutines and Solvers
gemm Directory Reference

Files

file  clear_accumulators.h [code]
 Defines abstractions for efficiently clearing accumulator tiles.
 
file  device_gemm.h [code]
 Device-level GEMM implemented by more than one kernel.
 
file  device_gemm_traits.h [code]
 
file  dgemm_traits.h [code]
 Defines structural traits of double-precision GEMM.
 
file  fp16_sgemm_multiply_add.h [code]
 Template implementing matrix multiply-add operations on fragments.
 
file  fp16_sgemm_traits.h [code]
 Defines structural properties of single-precision GEMM where any of the inputs and outputs may be fp16 or fp32. The accumulator type stays fp32.
 
file  gemm.h [code]
 Implements a software-pipelined efficient GEMM.
 
file  gemm_config.h [code]
 Defines properties of GEMM computation that impose some constraints on caller.
 
file  gemm_coord.h [code]
 GemmCoord is a structure derived from Coord<4> that specifies a location within the coordinate system of a GEMM problem.
 
file  gemm_desc.h [code]
 Implements a software-pipelined efficient GEMM.
 
file  gemm_epilogue.h [code]
 Implements the epilogue phase of the GEMM kernel that efficiently updates global memory with the computed matrix product.
 
file  gemm_epilogue_traits.h [code]
 Defines structural properties of the GEMM epilogue.
 
file  gemm_global_stream.h [code]
 Implements efficient loading of the thread block-level tile from global memory and storing to shared memory.
 
file  gemm_global_tile.h [code]
 Defines iterators for efficiently loading and storing to global memory.
 
file  gemm_operand.h [code]
 Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear memory.
 
file  gemm_shared_stream.h [code]
 Defines abstractions for managing loading and storing fragments to shared memory in the efficient GEMM pipeline.
 
file  gemm_shared_tile.h [code]
 Defines iterators for efficiently loading and storing tiles to and from shared memory.
 
file  gemm_stream_pair.h [code]
 Defines a pair of GEMM tile streams.
 
file  gemm_traits.h [code]
 Defines structural properties of complete GEMM computation.
 
file  hgemm_global_tile.h [code]
 Tile traits used to construct global tile iterator for HGEMM. This is intended to partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate memory accesses larger than 16 bits.
 
file  hgemm_multiply_add.h [code]
 Specialization implementing multiply-add operation on half-precision floating point fragments.
 
file  hgemm_swizzle.h [code]
 Transposes a tile of 16b elements. Used by HGEMM to construct a K-strided layout in shared memory for multiplicands.
 
file  hgemm_traits.h [code]
 Defines structural properties of half-precision GEMM computation.
 
file  igemm_epilogue.h [code]
 Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and floating-point output matrix formats.
 
file  igemm_global_tile.h [code]
 Implements tile iterators to partition the thread block tile into 2D subtiles and efficiently load each. Applies a permute transformation to construct an 'interleaved K-strided' data layout in which 4-element dot products from the same K index are arranged in consecutive locations within shared memory.
 
file  igemm_multiply_add.h [code]
 Implements the matrix multiply-accumulate operation on 8-bit integer data using the DP4A instruction.
 
file  igemm_swizzle.h [code]
 Transposes a fragment of data containing packed 8-bit integer elements.
 
file  igemm_traits.h [code]
 Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed to be packed 8-bit integers, accumulators are assumed to be 32-bit signed integers, and output formats vary.
 
file  linear_scaling.h [code]
 Implements the BLAS linear scaling function alpha*AB + beta*C.
 
file  linear_scaling_device_ptr.h [code]
 Implements the BLAS linear scaling function alpha*AB + beta*C.
 
file  scalar_or_pointer.h [code]
 Implements the BLAS linear scaling function alpha*AB + beta*C.
 
file  sgemm_traits.h [code]
 Defines structural properties of single-precision GEMM.
 
file  thread_multiply_add.h [code]
 Template implementing matrix multiply-add operations on fragments.
 
file  gemm/threadblock_swizzle.h [code]
 Defines functors for mapping blockIdx to partitions of the GEMM computation.
 
file  wmma_gemm_epilogue_traits.h [code]
 Defines structural properties of WMMA GEMM's epilogue phase.
 
file  wmma_gemm_global_tile.h [code]
 Defines tile iterator traits for loading thread block-level tile from global memory.
 
file  wmma_gemm_multiply_add.h [code]
 Implements a warp-level matrix multiply-accumulate operation using the CUDA WMMA API.
 
file  wmma_gemm_shared_tile.h [code]
 Defines iterator traits for efficiently loading and storing fragment to and from shared memory, specialized for WMMA GEMM.
 
file  wmma_gemm_traits.h [code]
 Defines structural properties of GEMM targeting the WMMA API in CUDA.
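
Several of the files listed above revolve around the same computation: accumulators are cleared, products of A and B are accumulated, and the epilogue applies the linear scaling alpha*AB + beta*C before writing the result. As a rough illustration only, the following host-side C++ sketch walks through those three steps on row-major matrices. The names `LinearScalingSketch` and `reference_gemm` are hypothetical and are not part of CUTLASS; the code does not use or mirror the library's templates, tiling, or GPU execution.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical functor (not the CUTLASS class): alpha * accum + beta * c.
struct LinearScalingSketch {
    float alpha;
    float beta;
    float operator()(float accum, float c) const {
        return alpha * accum + beta * c;
    }
};

// Naive reference GEMM on row-major matrices (an assumption of this sketch):
// D (m x n) = alpha * A (m x k) * B (k x n) + beta * C (m x n).
std::vector<float> reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                                  const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  const std::vector<float>& c,
                                  LinearScalingSketch epilogue) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    assert(c.size() == m * n);

    std::vector<float> d(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float accum = 0.0f;  // start from a cleared accumulator
            for (std::size_t p = 0; p < k; ++p) {
                accum += a[i * k + p] * b[p * n + j];  // multiply-add
            }
            d[i * n + j] = epilogue(accum, c[i * n + j]);  // epilogue step
        }
    }
    return d;
}
```

Calling `reference_gemm(2, 2, 2, a, b, c, LinearScalingSketch{2.0f, 3.0f})` on small inputs gives a convenient baseline for checking the output of an optimized kernel.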