Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1


Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions for allocating GPU-accessible memory. However, these API functions have long had a drawback: they aren't stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that make memory allocation and deallocation stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results, and we provide a code migration guide for modifying your existing applications. We also cover advanced topics that take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. All of this helps you improve performance within your existing applications.

The first pattern in the following code example is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can instead be allocated upfront and sized to the larger of the two sizes, as shown in the second pattern.
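Here is a minimal sketch of both patterns. kernelA, kernelB, and the launch configuration are hypothetical placeholders; only the cudaMalloc/cudaFree calls and their ordering behavior come from the discussion above.

```cpp
#include <algorithm>
#include <cuda_runtime.h>

__global__ void kernelA(void *data);   // hypothetical kernels standing in for
__global__ void kernelB(void *data);   // whatever work the application does

// Inefficient: the first cudaFree cannot release the memory until kernelA
// finishes, so it synchronizes the device between the two launches.
void twoAllocations(cudaStream_t stream, size_t sizeA, size_t sizeB) {
    void *ptrA, *ptrB;
    cudaMalloc(&ptrA, sizeA);
    kernelA<<<1024, 256, 0, stream>>>(ptrA);
    cudaFree(ptrA);                          // waits for kernelA to complete
    cudaMalloc(&ptrB, sizeB);
    kernelB<<<1024, 256, 0, stream>>>(ptrB);
    cudaFree(ptrB);
}

// More efficient: one upfront allocation sized to the larger of the two
// uses, so no free (and no synchronization) occurs between the kernels.
void oneAllocation(cudaStream_t stream, size_t sizeA, size_t sizeB) {
    void *ptr;
    cudaMalloc(&ptr, std::max(sizeA, sizeB));
    kernelA<<<1024, 256, 0, stream>>>(ptr);
    kernelB<<<1024, 256, 0, stream>>>(ptr);
    cudaFree(ptr);
}
```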


This increases code complexity in the application, because the memory management code is separated from the business logic. The problem is exacerbated when other libraries are involved, and it is far harder for the application to make efficient because it may not have full visibility or control over what the library is doing. To avoid this problem, the library would have to allocate memory when a function is invoked for the first time and never free it until the library is deinitialized. This not only increases code complexity, it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application the use of that memory. Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.


CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need to synchronize outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. It is now possible to manage memory at function scope, as in the example of a library function launching kernelA, sketched below.

All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses, on all streams, to that memory on the GPU.
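The following is a minimal sketch of that function-scope pattern. The library entry point libraryCall and the kernel kernelA are hypothetical names; cudaMallocAsync and cudaFreeAsync are the real CUDA 11.2+ APIs.

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(void *data);   // hypothetical kernel

// Memory managed entirely at function scope: the allocation, the work that
// uses it, and the free are all ordered on the caller's stream, so no
// device-wide synchronization is needed.
void libraryCall(cudaStream_t stream, size_t size) {
    void *ptr;
    cudaMallocAsync(&ptr, size, stream);     // allocation is stream ordered
    kernelA<<<1024, 256, 0, stream>>>(ptr);  // access ordered after the allocation
    cudaFreeAsync(ptr, stream);              // deallocation is stream ordered too
}
```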


In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB, in the appropriate stream order. The sketch after this paragraph shows several valid usages. Figure 1 shows the dependencies specified in that example: all kernels are ordered to execute after the allocation operation and to complete before the deallocation operation.

Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access, in the appropriate stream order. The CUDA driver uses memory pools to achieve this behavior of returning a pointer immediately.
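Here is a hedged sketch of such valid usages across multiple streams. The kernels, streams, and event wiring are assumptions for illustration; the ordering rules they express come from the text above.

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(void *data);   // hypothetical kernels
__global__ void kernelB(void *data);

void crossStreamUsage(cudaStream_t streamA, cudaStream_t streamB,
                      cudaStream_t streamC, size_t size) {
    void *ptr;
    cudaEvent_t event;
    cudaEventCreate(&event);

    // Allocation errors (for example, out of memory) are reported
    // synchronously via the return code, never asynchronously.
    cudaMallocAsync(&ptr, size, streamA);
    kernelA<<<1024, 256, 0, streamA>>>(ptr);

    // Order streamB's access after the allocation and kernelA on streamA.
    cudaEventRecord(event, streamA);
    cudaStreamWaitEvent(streamB, event, 0);
    kernelB<<<1024, 256, 0, streamB>>>(ptr);

    // Deallocation may happen on yet another stream, as long as it is
    // ordered after all accesses on all streams.
    cudaEventRecord(event, streamB);
    cudaStreamWaitEvent(streamC, event, 0);
    cudaFreeAsync(ptr, streamC);

    cudaEventDestroy(event);
}
```

Note that the same event can be re-recorded after each wait, because cudaStreamWaitEvent captures the event's state at the time of the call; later calls to cudaEventRecord do not affect an already-enqueued wait.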