CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
File | Description |
---|---|
batched_reduction.h | Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C |
batched_reduction_traits.h | Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C |
clear_accumulators.h | Defines abstractions for efficiently clearing accumulator tiles |
complex.h | |
convert.h | Defines conversion operations among Fragments of different base type |
coord.h | A Coord is a coordinate of arbitrary rank into a tensor or matrix |
core_io.h | Helpers for printing cutlass/core objects |
cutlass.h | Basic include for CUTLASS macros |
cutlass_math.h | Math utilities |
debug.h | Debugging and logging functionality |
device_gemm.h | Device-level GEMM implemented by more than one kernel |
device_gemm_traits.h | Defines structural traits of device-level GEMM |
dgemm_traits.h | Defines structural traits of double-precision GEMM |
fp16_sgemm_multiply_add.h | Template implementing matrix multiply-add operations on fragments |
fp16_sgemm_traits.h | Defines structural properties of single-precision GEMM in which any of the inputs/outputs may be fp16 or fp32. The accumulator type remains fp32 |
fragment.h | Defines Fragment, a statically-sized array for storing parts of matrices within a thread's registers |
fragment_multiply_add.h | Defines multiply-add operations on fragments within a thread |
gemm.h | Implements a software-pipelined efficient GEMM |
gemm_config.h | Defines properties of GEMM computation that impose some constraints on caller |
gemm_coord.h | GemmCoord is a structure derived from Coord<4> that specifies a location within the coordinate system of a GEMM problem |
gemm_desc.h | Implements a software-pipelined efficient GEMM |
gemm_epilogue.h | Implements the epilogue phase of the GEMM kernel that efficiently updates global memory with the computed matrix product |
gemm_epilogue_traits.h | Defines structural properties of the GEMM epilogue |
gemm_global_stream.h | Implements efficient loading of the thread block-level tile from global memory and storing to shared memory |
gemm_global_tile.h | Defines iterators for efficiently loading and storing to global memory |
gemm_operand.h | Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear memory |
gemm_shared_stream.h | Defines abstractions for managing loading and storing fragments to shared memory in the efficient GEMM pipeline |
gemm_shared_tile.h | Defines iterators for efficiently loading and storing tiles to and from shared memory |
gemm_stream_pair.h | Defines a pair of GEMM tile streams |
gemm_traits.h | Defines structural properties of complete GEMM computation |
hgemm_global_tile.h | Tile traits used to construct global tile iterator for HGEMM. This is intended to partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate memory accesses larger than 16 bits |
hgemm_multiply_add.h | Specialization implementing multiply-add operation on half-precision floating point fragments |
hgemm_swizzle.h | Transposes a tile of 16-bit elements. Used by HGEMM to construct a K-strided layout in shared memory for multiplicands |
hgemm_traits.h | Defines structural properties of half-precision GEMM computation |
igemm_epilogue.h | Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and floating-point output matrix formats |
igemm_global_tile.h | Implements tile iterators to partition the thread block tile into 2D subtiles and efficiently load each. Applies permute transformation to construct 'interleaved K-strided' data layout in which 4-element dot products from the same K index are arranged in consecutive locations within shared memory |
igemm_multiply_add.h | Implements matrix multiply accumulate operation of 8-bit integer data using DP4A instruction |
igemm_swizzle.h | Transposes a fragment of data containing packed 8-bit integer elements |
igemm_traits.h | Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed to be packed 8-bit integers, accumulators are assumed to be 32-bit signed integers, and output formats vary |
iterator_access.h | Free functions for loading and storing to implementations of tile iterator concepts |
kernel_launch.h | Defines structures and helpers to launch CUDA kernels within CUTLASS |
linear_scaling.h | Implements the BLAS linear scaling function alpha*AB + beta*C |
linear_scaling_device_ptr.h | Implements the BLAS linear scaling function alpha*AB + beta*C |
load_store.h | Defines abstractions for efficiently loading and storing vectors to memory |
matrix_traits.h | Defines properties of matrices used to denote layout and operands to GEMM kernels |
numeric_types.h | |
pair.h | Defines a pair<> |
performance_tuning.h | |
platform.h | C++ features that may be otherwise unimplemented for CUDA device functions |
predicate_vector.h | Defines container classes and iterators for managing a statically sized vector of boolean predicates |
reshape_tile.h | Defines a type for restructuring a tile |
scalar_or_pointer.h | Implements the BLAS linear scaling function alpha*AB + beta*C |
sgemm_traits.h | Defines structural properties of single-precision GEMM |
shape.h | Defines Shape implementing the Layout concept for representing a 4D hypercube of objects |
tensor_ref.h | Defines a structure containing strides, bounds, and a pointer to tensor data |
tensor_ref_collection.h | Introduces TensorRefCollection concept and defines TensorRefBatch and TensorRefArray |
tensor_view.h | Defines a structure containing strides and a pointer to tensor data |
thread_multiply_add.h | Template implementing matrix multiply-add operations on fragments |
gemm/threadblock_swizzle.h | Defines functors for mapping blockIdx to partitions of the GEMM computation |
reduction/threadblock_swizzle.h | Defines functors for mapping blockIdx to partitions of the batched reduction computation |
tile_allocation.h | Defines a fragment based on a Shape<> template |
tile_coord.h | Defines a coordinate used for the CUTLASS 4-D tile structure |
tile_iterator.h | Defines the Tile Traits concept and iterators for loading and storing to tiles efficiently |
tile_stream.h | Implements the tile stream concept, composing an iterator with a transformation. Offers split-phase semantics, separating the initiation of an asynchronous memory operation with a fence forcing it to complete |
tile_traits_standard.h | Defines tile traits for several tile partitioning arrangements of threads expected to achieve efficient streaming performance |
vector.h | Defines a 1D vector of elements held in the registers of each thread |
wmma_gemm_epilogue_traits.h | Defines structural properties of WMMA GEMM's epilogue phase |
wmma_gemm_global_tile.h | Defines tile iterator traits for loading thread block-level tile from global memory |
wmma_gemm_multiply_add.h | Implements warp-level matrix multiply-accumulate operation using CUDA WMMA API |
wmma_gemm_shared_tile.h | Defines iterator traits for efficiently loading and storing fragment to and from shared memory, specialized for WMMA GEMM |
wmma_gemm_traits.h | Defines structural properties of GEMM targeting the WMMA API in CUDA |
wmma_matrix.h | Abstractions for loading and storing matrices using the CUDA WMMA API |
zip_fragment.h | Models a pair of fragments |
zip_tensor_ref.h | Defines a structure containing a pair of TensorRef-like objects |
zip_tile_iterator.h | Constructs an iterator that owns two tile iterator instances |