Image & Memory

PyImageCUDA provides two image types and flexible memory management for GPU buffers.

Image Types

`Image` - Float32 Precision

Primary image type for all operations. Stores RGBA data in 32-bit floating point (0.0 to 1.0 range).

from pyimagecuda import Image

# Create a new image
img = Image(width=1920, height=1080)

Uninitialized Memory

A newly created image holds uninitialized GPU memory, so its contents are arbitrary. Initialize it before use, either with Fill.color() or by loading data into it.

    from pyimagecuda import Image, Fill

    img = Image(1920, 1080)
    Fill.color(img, (0, 0, 0, 0))  # Clear to transparent black; load() or another Fill function also works

When to use: All composition, effects, and color operations. This is your default choice.

`ImageU8` - 8-bit Precision

Storage-optimized type for loading/saving. Stores RGBA data as unsigned 8-bit integers (0 to 255 range).

from pyimagecuda import ImageU8

# Typically used internally by load/save
u8_img = ImageU8(width=1920, height=1080)

When to use: Rarely needed directly. load() and save() handle conversions automatically.
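The two types differ only in how each channel is encoded, and the mapping between the encodings is easy to sketch on the CPU. The helpers below are illustrative and not part of pyimagecuda; the clamping and round-to-nearest behavior are assumptions, since the library's exact conversion rule is not described here.

```python
def float_to_u8(value: float) -> int:
    """Map a normalized float channel (0.0 to 1.0) to an 8-bit value (0 to 255)."""
    clamped = min(max(value, 0.0), 1.0)  # assumption: out-of-range values are clamped
    return int(clamped * 255.0 + 0.5)    # assumption: round to nearest


def u8_to_float(value: int) -> float:
    """Map an 8-bit channel (0 to 255) back to the normalized float range."""
    return value / 255.0


rgba_f32 = (1.0, 0.5, 0.0, 1.0)
rgba_u8 = tuple(float_to_u8(c) for c in rgba_f32)
print(rgba_u8)  # (255, 128, 0, 255)
```

Every 8-bit value survives a round trip through float32 under this rule, which is why working in Image and converting only at load/save time loses nothing relative to the 8-bit source.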

Memory Management

PyImageCUDA offers three memory management approaches:

1. Automatic (Garbage Collection)

Simplest approach. Python's GC cleans up when images go out of scope.

from pyimagecuda import Image, Fill

img = Image(1920, 1080)
Fill.color(img, (1, 0, 0, 1))
# img will be freed automatically when no longer referenced

Use when: Writing simple scripts or prototypes.

2. Explicit with Context Managers

Immediate cleanup using with statements: the image is freed as soon as the block exits. Example (batch processing):

from pyimagecuda import load, save, Filter

for i in range(1000):
    with load(f"input_{i}.jpg") as img:
        Filter.gaussian_blur(img, radius=10)
        save(img, f"output_{i}.jpg")
    # Each image is freed before loading the next
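The with statement relies on Python's context-manager protocol. How pyimagecuda implements it is not shown here; the sketch below uses a hypothetical ScopedBuffer class (not part of the library) to illustrate the general pattern, in which __exit__ calls free() even when the block raises.

```python
class ScopedBuffer:
    """Hypothetical stand-in for a GPU image; not part of pyimagecuda."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.freed = False

    def free(self) -> None:
        # A real implementation would release VRAM here.
        self.freed = True

    def __enter__(self) -> "ScopedBuffer":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.free()
        return False  # do not swallow exceptions raised inside the block


with ScopedBuffer("frame") as buf:
    freed_inside = buf.freed  # still allocated inside the block
freed_after = buf.freed       # released as soon as the block exits

# The buffer is also released when the block fails.
failing = ScopedBuffer("failing")
try:
    with failing:
        raise ValueError("processing error")
except ValueError:
    pass

print(freed_inside, freed_after, failing.freed)  # False True True
```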

3. Manual Control

Explicit free() calls for maximum control.

from pyimagecuda import Image, Fill

img = Image(1920, 1080)
Fill.color(img, (1, 0, 0, 1))
img.free()  # Free immediately

Use when: You need precise control over when memory is released.

Handling Out of Memory

When GPU memory is exhausted, you'll see:

RuntimeError: CUDA malloc failed: out of memory

This is a clear signal that your GPU has run out of VRAM. Check your memory usage and consider freeing unused buffers or reducing workload size.
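One way to act on this error is to release buffers you no longer need and retry once. The helpers below are a hypothetical sketch, not part of pyimagecuda; they assume only that the failure surfaces as a RuntimeError whose message contains "out of memory", as in the message above. A fake allocator stands in for a real GPU call so the example runs anywhere.

```python
def is_cuda_oom(error: BaseException) -> bool:
    """Return True if the error looks like the CUDA out-of-memory RuntimeError."""
    return isinstance(error, RuntimeError) and "out of memory" in str(error).lower()


def run_with_oom_retry(operation, release_buffers):
    """Run operation(); on a CUDA OOM error, release buffers and retry once."""
    try:
        return operation()
    except RuntimeError as error:
        if not is_cuda_oom(error):
            raise  # unrelated failure: propagate unchanged
    release_buffers()
    return operation()


# Demo with a fake allocator so the sketch runs without a GPU.
state = {"cached_freed": False}


def fake_allocate():
    if not state["cached_freed"]:
        raise RuntimeError("CUDA malloc failed: out of memory")
    return "allocated"


def free_cached():
    state["cached_freed"] = True


result = run_with_oom_retry(fake_allocate, free_cached)
print(result)  # allocated
```

In real code, release_buffers would call free() on images that are no longer needed. If the retry also fails, the error propagates, which usually means the workload itself has to shrink.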

Buffer Reuse

All operations that create temporary buffers accept optional buffer parameters for zero-allocation workflows.

A buffer may be larger than the operation requires, but not smaller. When it is larger, the function adapts the buffer's logical dimensions to fit at no performance cost, and the capacity the buffer was created with stays unchanged.

Example: Gaussian Blur

from pyimagecuda import Image, Filter, load, save

src = Image(1920, 1080)
dst = Image(1920, 1080)
temp = Image(1920, 1080)

# Process 100 images reusing the same buffers
for i in range(100):
    load(f"input_{i}.jpg", f32_buffer=src)
    Filter.gaussian_blur(src, dst_buffer=dst, temp_buffer=temp)
    save(dst, f"output_{i}.jpg")

# Clean up once
src.free()
dst.free()
temp.free()

Benefits:

No repeated allocations
Consistent VRAM usage
Critical for video processing

Dynamic Buffer Sizing

Image buffers have a fixed capacity but adjustable logical dimensions.

from pyimagecuda import Image

# Create buffer with capacity for 1920×1080
img = Image(1920, 1080)

# Can logically resize within capacity
img.resize(1280, 720)

# Check capacity
max_pixels = img.get_max_capacity()
print(f"Capacity: {max_pixels:,} pixels")  # 2,073,600 pixels (1920×1080)
print(f"Current: {img.width}×{img.height}")  # 1280×720
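The split between capacity and logical size can be modeled on the CPU. The CapacityBuffer class below is a toy illustration, not pyimagecuda's implementation. It assumes the limit is checked against the total pixel count (consistent with get_max_capacity() reporting pixels) and that an oversized request raises ValueError; the library's actual check and error type are not specified here.

```python
class CapacityBuffer:
    """Toy model of a buffer with fixed capacity and adjustable logical size."""

    def __init__(self, width: int, height: int) -> None:
        self._capacity = width * height  # fixed at creation, in pixels
        self.width = width
        self.height = height

    def get_max_capacity(self) -> int:
        return self._capacity

    def resize(self, width: int, height: int) -> None:
        # Only the logical dimensions change; no memory is reallocated.
        if width * height > self._capacity:
            raise ValueError(
                f"{width}x{height} exceeds capacity of {self._capacity} pixels"
            )
        self.width = width
        self.height = height


buf = CapacityBuffer(1920, 1080)
buf.resize(1280, 720)
print(buf.get_max_capacity())  # 2073600
print(buf.width, buf.height)   # 1280 720
```

Under this model, resizing back up to 1920×1080 succeeds because the capacity never shrank, while a request for 3840×2160 is rejected.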

CUDA Interop (Zero-Copy)

PyImageCUDA images implement the __cuda_array_interface__ v3 protocol. This allows any library in the CUDA Python ecosystem (CuPy, PyTorch, Numba, RAPIDS, JAX with the CUDA backend, etc.) to read and write the image's GPU memory without any copies.

The memory layout exposed is (height, width, 4) interleaved RGBA, row-major contiguous:

Image → float32 (typestr <f4)
ImageU8 → uint8 (typestr |u1)
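For custom kernels or pointer arithmetic it helps to see what this layout means in bytes. The helper functions below are illustrative, not part of pyimagecuda; they compute the strides and byte offsets implied by a row-major contiguous (height, width, 4) array.

```python
CHANNELS = 4  # interleaved RGBA


def contiguous_strides(width: int, itemsize: int) -> tuple:
    """Byte strides for a row-major (height, width, 4) array."""
    return (width * CHANNELS * itemsize, CHANNELS * itemsize, itemsize)


def pixel_byte_offset(x: int, y: int, channel: int, width: int, itemsize: int) -> int:
    """Byte offset of one channel of pixel (x, y) from the start of the buffer."""
    return ((y * width + x) * CHANNELS + channel) * itemsize


# float32 image (itemsize 4), 1920 pixels wide
print(contiguous_strides(1920, 4))           # (30720, 16, 4)
print(pixel_byte_offset(0, 1, 0, 1920, 4))   # 30720: first pixel of the second row
print(pixel_byte_offset(10, 0, 3, 1920, 4))  # 172: alpha channel of pixel x=10, y=0

# uint8 image (itemsize 1), same width
print(contiguous_strides(1920, 1))           # (7680, 4, 1)
```

Each float32 pixel occupies 16 bytes, which is why a kernel can address the buffer as an array of float4 values.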

Memory ownership

The CUDA buffer is owned by the Image / ImageU8 instance. The image must stay alive while any external array views it, otherwise the view becomes a dangling pointer. Do not call image.free() while a CuPy / Torch tensor still references it.

Raw device pointer

Use the cuda_ptr property to retrieve the raw CUDA device pointer as a Python int. Useful for custom kernels or low-level interop.

from pyimagecuda import Image

img = Image(1920, 1080)
print(hex(img.cuda_ptr))  # e.g. 0x7f1234500000

CuPy

The to_cupy() helper wraps a pyimagecuda image as a zero-copy cupy.ndarray. CuPy is not a dependency of pyimagecuda; install it separately:

pip install cupy-cuda12x   # or cupy-cuda11x for CUDA 11

from pyimagecuda import Image, Fill, to_cupy, download
import cupy as cp

img = Image(512, 512)
Fill.color(img, (1, 0, 0, 1))

# Zero-copy view — shares the same GPU memory
arr = to_cupy(img)
assert arr.data.ptr == img.cuda_ptr
assert arr.shape == (512, 512, 4)
assert arr.dtype == cp.float32

# Modify via CuPy → changes are visible in the pyimagecuda image
arr[:, :, 1] = 1.0  # add green channel

# Download to verify
pixels = download(img)

Because __cuda_array_interface__ is honored, cp.asarray(img) works too:

arr = cp.asarray(img)  # zero-copy

Custom CUDA kernels (CuPy RawKernel)

You can run your own CUDA kernels directly on pyimagecuda buffers:

from pyimagecuda import Image, Fill
import cupy as cp

img = Image(512, 512)
Fill.color(img, (0.2, 0.4, 0.8, 1.0))

invert_rgb = cp.RawKernel(r'''
extern "C" __global__
void invert_rgb(float4* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 p = data[i];
    data[i] = make_float4(1.0f - p.x, 1.0f - p.y, 1.0f - p.z, p.w);
}
''', 'invert_rgb')

arr = cp.asarray(img)
n = img.width * img.height
invert_rgb((n // 256 + 1,), (256,), (arr, cp.int32(n)))  # cp.int32 matches the kernel's int parameter
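The launch needs enough 256-thread blocks to cover all n pixels. Ceiling division gives the exact count; the expression n // 256 + 1 also covers every pixel but launches one spare block whenever n is a multiple of 256, which the kernel's bounds check (i >= n) makes harmless. The helper name below is illustrative.

```python
def blocks_needed(n: int, threads_per_block: int = 256) -> int:
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


n = 512 * 512
print(blocks_needed(n))     # 1024
print(n // 256 + 1)         # 1025 (one spare block)
print(blocks_needed(1000))  # 4
```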

PyTorch

PyTorch consumes __cuda_array_interface__ via torch.as_tensor:

import torch
from pyimagecuda import Image, Fill

img = Image(512, 512)
Fill.color(img, (1, 0, 0, 1))

tensor = torch.as_tensor(img, device='cuda')  # zero-copy
# tensor.shape == torch.Size([512, 512, 4])
# tensor.dtype == torch.float32

Numba CUDA

from numba import cuda
from pyimagecuda import Image

img = Image(512, 512)
device_array = cuda.as_cuda_array(img)  # zero-copy

Best Practices

For Simple Scripts

# Just use automatic GC (process() stands in for your own operations)
img = load("input.jpg")
process(img)
save(img, "output.jpg")

For Batch Processing

# Use with statements
for file in files:
    with load(file) as img:
        process(img)
        save(img, output)

For Video/Real-time

# Reuse buffers explicitly
frame = Image(1920, 1080)
temp = Image(1920, 1080)

while video.has_frames():
    video.read_into(frame)
    process(frame, temp_buffer=temp)
    video.write(frame)

frame.free()
temp.free()

Memory Considerations

VRAM vs RAM:

Image(1920, 1080) uses about 33 MB of VRAM (1920 × 1080 pixels × 4 channels × 4 bytes)
The Python object itself uses less than 100 bytes of RAM
Python's GC tracks host-side objects and cannot see VRAM pressure, so use explicit memory management for large workloads
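The VRAM figure follows directly from the pixel format, so it is easy to estimate before allocating. The function below is an illustrative helper, not part of pyimagecuda, and it ignores any padding or alignment the CUDA allocator may add.

```python
def image_vram_bytes(width: int, height: int,
                     bytes_per_channel: int = 4, channels: int = 4) -> int:
    """Estimate the raw pixel storage for an RGBA image buffer."""
    return width * height * channels * bytes_per_channel


f32_bytes = image_vram_bytes(1920, 1080)                      # Image (float32)
u8_bytes = image_vram_bytes(1920, 1080, bytes_per_channel=1)  # ImageU8 (uint8)

print(f32_bytes, f"{f32_bytes / 2**20:.1f} MiB")  # 33177600 31.6 MiB
print(u8_bytes, f"{u8_bytes / 2**20:.1f} MiB")    # 8294400 7.9 MiB
```

Multiplying by the number of live buffers (source, destination, temporaries) gives a quick lower bound on the VRAM a pipeline needs.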