CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
File List
Here is a list of all files with brief descriptions:
 batched_reduction.h - Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C
 batched_reduction_traits.h - Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C
 clear_accumulators.h - Defines abstractions for efficiently clearing accumulator tiles
 complex.h
 convert.h - Defines conversion operations among Fragments of different base type
 coord.h - A Coord is a coordinate of arbitrary rank into a tensor or matrix
 core_io.h - Helpers for printing cutlass/core objects
 cutlass.h - Basic include for CUTLASS macros
 cutlass_math.h - Math utilities
 debug.h - Debugging and logging functionality
 device_gemm.h - Device-level GEMM implemented by more than one kernel
 device_gemm_traits.h
 dgemm_traits.h - Defines structural traits of double-precision GEMM
 fp16_sgemm_multiply_add.h - Template implementing matrix multiply-add operations on fragments
 fp16_sgemm_traits.h - Defines structural properties of single-precision GEMM in which any of the input/output matrices may be fp16 or fp32. The accumulator type stays in fp32
 fragment.h - Defines Fragment, a statically-sized array for storing parts of matrices within a thread's registers
 fragment_multiply_add.h - Defines multiply-add operations on fragments within a thread
 gemm.h - Implements a software-pipelined efficient GEMM
 gemm_config.h - Defines properties of GEMM computation that impose some constraints on caller
 gemm_coord.h - GemmCoord is a structure derived from Coord<4> that specifies a location within the coordinate system of a GEMM problem
 gemm_desc.h - Implements a software-pipelined efficient GEMM
 gemm_epilogue.h - Implements the epilogue phase of the GEMM kernel that efficiently updates global memory with the computed matrix product
 gemm_epilogue_traits.h - Defines structural properties of the GEMM epilogue
 gemm_global_stream.h - Implements efficient loading of the thread block-level tile from global memory and storing to shared memory
 gemm_global_tile.h - Defines iterators for efficiently loading and storing to global memory
 gemm_operand.h - Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear memory
 gemm_shared_stream.h - Defines abstractions for managing loading and storing fragments to shared memory in the efficient GEMM pipeline
 gemm_shared_tile.h - Defines iterators for efficiently loading and storing tiles to and from shared memory
 gemm_stream_pair.h - Defines a pair of GEMM tile streams
 gemm_traits.h - Defines structural properties of complete GEMM computation
 hgemm_global_tile.h - Tile traits used to construct global tile iterator for HGEMM. This is intended to partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate memory accesses larger than 16 bits
 hgemm_multiply_add.h - Specialization implementing multiply-add operation on half-precision floating point fragments
 hgemm_swizzle.h - Transposes a tile of 16b elements. Used by HGEMM to construct a K-strided layout in shared memory for multiplicands
 hgemm_traits.h - Defines structural properties of half-precision GEMM computation
 igemm_epilogue.h - Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and floating-point output matrix formats
 igemm_global_tile.h - Implements tile iterators to partition the thread block tile into 2D subtiles and efficiently load each. Applies permute transformation to construct 'interleaved K-strided' data layout in which 4-element dot products from the same K index are arranged in consecutive locations within shared memory
 igemm_multiply_add.h - Implements matrix multiply accumulate operation of 8-bit integer data using the DP4A instruction
 igemm_swizzle.h - Transposes a fragment of data containing packed 8-bit integer elements
 igemm_traits.h - Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed to be packed 8-bit integers, accumulators are assumed to be 32-bit signed integers, and output formats vary
 iterator_access.h - Free functions for loading and storing to implementations of tile iterator concepts
 kernel_launch.h - Defines structures and helpers to launch CUDA kernels within CUTLASS
 linear_scaling.h - Implements the BLAS linear scaling function alpha*AB + beta*C
 linear_scaling_device_ptr.h - Implements the BLAS linear scaling function alpha*AB + beta*C
 load_store.h - Defines abstractions for efficiently loading and storing vectors to memory
 matrix_traits.h - Defines properties of matrices used to denote layout and operands to GEMM kernels
 numeric_types.h
 pair.h - Defines a pair<>
 performance_tuning.h
 platform.h - C++ features that may be otherwise unimplemented for CUDA device functions
 predicate_vector.h - Defines container classes and iterators for managing a statically sized vector of boolean predicates
 reshape_tile.h - Defines a type for restructuring a tile
 scalar_or_pointer.h - Implements the BLAS linear scaling function alpha*AB + beta*C
 sgemm_traits.h - Defines structural properties of single-precision GEMM (see the usage sketch following this list)
 shape.h - Defines Shape implementing the Layout concept for representing a 4D hypercube of objects
 tensor_ref.h - Defines a structure containing strides, bounds, and a pointer to tensor data
 tensor_ref_collection.h - Introduces the TensorRefCollection concept and defines TensorRefBatch and TensorRefArray
 tensor_view.h - Defines a structure containing strides and a pointer to tensor data
 thread_multiply_add.h - Template implementing matrix multiply-add operations on fragments
 gemm/threadblock_swizzle.h - Defines functors for mapping blockIdx to partitions of the GEMM computation
 reduction/threadblock_swizzle.h - Defines functors for mapping blockIdx to partitions of the batched reduction computation
 tile_allocation.h - Defines a fragment based on a Shape<> template
 tile_coord.h - Defines a coordinate used for the CUTLASS 4-D tile structure
 tile_iterator.h - Defines the Tile Traits concept and iterators for loading and storing to tiles efficiently
 tile_stream.h - Implements the tile stream concept, composing an iterator with a transformation. Offers split-phase semantics, separating the initiation of an asynchronous memory operation from a fence forcing it to complete
 tile_traits_standard.h - Defines tile traits for several tile partitioning arrangements of threads expected to achieve efficient streaming performance
 vector.h - Defines a 1D vector of elements held in the registers of each thread
 wmma_gemm_epilogue_traits.h - Defines structural properties of WMMA GEMM's epilogue phase
 wmma_gemm_global_tile.h - Defines tile iterator traits for loading the thread block-level tile from global memory
 wmma_gemm_multiply_add.h - Implements warp-level matrix multiply-accumulate operation using the CUDA WMMA API
 wmma_gemm_shared_tile.h - Defines iterator traits for efficiently loading and storing fragments to and from shared memory, specialized for WMMA GEMM
 wmma_gemm_traits.h - Defines structural properties of GEMM targeting the WMMA API in CUDA
 wmma_matrix.h - Abstractions for loading and storing matrices using the CUDA WMMA API
 zip_fragment.h - Models a pair of fragments
 zip_tensor_ref.h - Defines a structure containing a pair of TensorRef-like objects
 zip_tile_iterator.h - Constructs an iterator that owns two tile iterator instances
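
The traits headers above compose into a complete GEMM kernel. As a rough illustration only, the following host-side sketch shows how a single-precision GEMM might be instantiated from sgemm_traits.h and gemm.h. The template parameters, the tile shape, and the Params::initialize argument order are assumptions modeled on the CUTLASS 1.x example code, not taken from this file list, and may need adjustment for a particular release.

    #include <cuda_runtime.h>
    #include "cutlass/gemm/gemm.h"
    #include "cutlass/gemm/sgemm_traits.h"

    // Launches a column-major SGEMM computing C = alpha * A * B + beta * C.
    // The layouts and the threadblock tile Shape<8, 128, 128> below are
    // assumptions based on the CUTLASS 1.x SgemmTraits defaults.
    cudaError_t run_sgemm(int M, int N, int K, float alpha,
                          float const *A, int lda,
                          float const *B, int ldb,
                          float beta,
                          float *C, int ldc) {

      // Structural properties of the single-precision GEMM (sgemm_traits.h).
      typedef cutlass::gemm::SgemmTraits<
          cutlass::MatrixLayout::kColumnMajor,   // layout of A
          cutlass::MatrixLayout::kColumnMajor,   // layout of B
          cutlass::Shape<8, 128, 128>            // threadblock tile
      > GemmTraits;

      // gemm.h supplies the software-pipelined kernel for these traits.
      typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

      // Bundle problem size, scalars, and operand pointers into kernel parameters.
      Gemm::Params params;
      int result = params.initialize(M, N, K, alpha,
                                     A, lda,
                                     B, ldb,
                                     beta,
                                     C, ldc,    // source accumulator matrix
                                     C, ldc);   // destination matrix
      if (result) {
        return cudaErrorInvalidValue;
      }

      // Launch the GEMM kernel and report any CUDA error.
      Gemm::launch(params);
      return cudaGetLastError();
    }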