IO (Loading and Saving)

PyImageCUDA handles image loading and saving through pyvips, supporting all common formats with robust decoding.

Efficient IO Strategy

All image files are loaded/saved as uint8 (8-bit) data, then converted to/from float32 on the GPU. This minimizes disk I/O and CPU-GPU transfer overhead while maintaining float32 precision for all internal operations.

Loading Images

Basic Usage

from pyimagecuda import load

img = load("photo.jpg")
# Returns Image (float32) ready for processing

Supported formats: JPG, PNG, WEBP, HEIC, GIF, TIFF, BMP, and more.

Loading pipeline:

pyvips decodes image file to uint8 RGBA (CPU)
uint8 data uploaded to GPU (fast - 4 bytes per pixel)
GPU converts uint8 → float32 (instant kernel operation)

Buffer Reuse

Avoid repeated allocations by reusing buffers:

from pyimagecuda import Image, ImageU8, load

# Create reusable buffers
f32_buffer = Image(4096, 4096)      # Max capacity
u8_buffer = ImageU8(4096, 4096)     # Temporary conversion buffer

# Load multiple images reusing the same memory
for filename in image_files:
    load(filename, f32_buffer=f32_buffer, u8_buffer=u8_buffer)
    process(f32_buffer)
    save(f32_buffer, f"output_{filename}")

Benefits:

Zero allocations after first load
Constant VRAM usage
Critical for batch processing

How it works:

load() reads the image file
Decodes into u8_buffer (uint8 RGBA)
Converts to f32_buffer (float32 RGBA)
Buffer dimensions adjust automatically within capacity

Format Handling

PyImageCUDA automatically normalizes all formats to RGBA:

# Grayscale → RGBA
img = load("grayscale.png")  # 1 channel → 4 channels (R=G=B, A=255)

# RGB → RGBA
img = load("photo.jpg")      # 3 channels → 4 channels (A=255)

# RGBA → RGBA
img = load("logo.png")       # 4 channels → 4 channels (unchanged)

All operations work uniformly on RGBA float32 images.

Saving Images

Basic Usage

from pyimagecuda import save

save(img, "output.png")

Supported formats: JPG, PNG, WEBP, HEIC, TIFF, BMP.

Saving pipeline:

GPU converts float32 → uint8 (instant kernel operation)
uint8 data downloaded from GPU (fast - 4 bytes per pixel)
pyvips encodes uint8 RGBA to file (CPU)

Quality Control

For lossy formats, specify compression quality:

# JPEG (1-100, higher = better quality)
save(img, "photo.jpg", quality=95)

# WebP (1-100)
save(img, "photo.webp", quality=85)

# HEIC (1-100)
save(img, "photo.heic", quality=90)

Default: Maximum quality for all formats.

Buffer Reuse

Reuse temporary buffers for batch saving:

from pyimagecuda import Image, ImageU8, save

processed_images = [...]  # List of Image objects
u8_buffer = ImageU8(1920, 1080)

for i, img in enumerate(processed_images):
    save(img, f"output_{i}.jpg", u8_buffer=u8_buffer, quality=90)

How it works:

Converts float32 → uint8 into u8_buffer
Downloads from GPU to CPU
Encodes and writes to disk

NumPy Integration

PyImageCUDA provides native NumPy bridges for seamless interoperability with the Python ecosystem.

Works with OpenCV, Pillow, Matplotlib, and More!

Since OpenCV (cv2.imread()), Pillow (Image.open()), Matplotlib, and most Python image libraries return NumPy arrays, you can use from_numpy() and to_numpy() to work with them all.

Basic Usage

from pyimagecuda import from_numpy, to_numpy
import numpy as np

# NumPy → PyImageCUDA
np_array = np.random.rand(1080, 1920, 4).astype(np.float32)
img = from_numpy(np_array)

# PyImageCUDA → NumPy
result = to_numpy(img)  # Returns np.ndarray of shape (H, W, 4)

Supported Input Formats

from_numpy() automatically handles common array formats:

# Grayscale (H, W) → RGBA
gray = np.random.randint(0, 255, (1080, 1920), dtype=np.uint8)
img = from_numpy(gray)  # Expands to RGBA: R=G=B, A=255

# RGB (H, W, 3) → RGBA
rgb = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)
img = from_numpy(rgb)  # Adds alpha channel: A=255

# RGBA (H, W, 4) → RGBA
rgba = np.random.rand(1080, 1920, 4).astype(np.float32)
img = from_numpy(rgba)  # Direct upload

# Supported dtypes: uint8 (0-255) or float32 (0.0-1.0)

Conversion pipeline:

uint8 input: Uploads as 4 bytes/pixel → GPU converts to float32 (optimized)
float32 input: Direct upload as 16 bytes/pixel (no conversion needed)

OpenCV Integration

import cv2
from pyimagecuda import from_numpy, to_numpy, adjust_saturation

# Load image with OpenCV
cv_img = cv2.imread("photo.jpg")  # BGR uint8
cv_img = cv2.cvtColor(cv_img, cv2.COLOR_BGR2RGB)  # Convert BGR → RGB

# Process on GPU
gpu_img = from_numpy(cv_img)
adjust_saturation(gpu_img, 1.5)

# Back to OpenCV
result = to_numpy(gpu_img)
result = (result[:, :, :3] * 255).astype(np.uint8)  # Float32 → uint8, drop alpha
result = cv2.cvtColor(result, cv2.COLOR_RGB2BGR)  # RGB → BGR
cv2.imwrite("output.jpg", result)

Pillow Integration

from PIL import Image as PILImage
from pyimagecuda import from_numpy, to_numpy, blur
import numpy as np

# Load with Pillow
pil_img = PILImage.open("photo.jpg").convert("RGBA")
np_array = np.array(pil_img)

# Process on GPU
gpu_img = from_numpy(np_array)
blur(gpu_img, 10)

# Back to Pillow
result = to_numpy(gpu_img)
result = (result * 255).astype(np.uint8)  # Float32 → uint8
pil_result = PILImage.fromarray(result, mode="RGBA")
pil_result.save("output.png")

Matplotlib Integration

import matplotlib.pyplot as plt
from pyimagecuda import from_numpy, to_numpy, adjust_exposure

# Load from Matplotlib
img_array = plt.imread("photo.png")  # Returns float32 [0.0, 1.0]

# Process on GPU
gpu_img = from_numpy(img_array)
adjust_exposure(gpu_img, 0.5)

# Display result
result = to_numpy(gpu_img)
plt.imshow(result)
plt.show()

Buffer Reuse for Performance

Reuse buffers to eliminate allocations in tight loops:

from pyimagecuda import Image, ImageU8, from_numpy
import cv2

# Create reusable buffers
f32_buffer = Image(1920, 1080)
u8_buffer = ImageU8(1920, 1080)

# Process video frames
cap = cv2.VideoCapture("video.mp4")
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Reuses existing GPU memory
    from_numpy(frame, f32_buffer=f32_buffer, u8_buffer=u8_buffer)
    process(f32_buffer)

    result = to_numpy(f32_buffer)
    cv2.imshow("Processed", (result[:, :, :3] * 255).astype(np.uint8))

Benefits:

Zero GPU allocations after first frame
Constant VRAM usage
Critical for real-time video processing

Low-Level Operations

For advanced use cases, you can access the underlying conversion functions:

Manual Conversions

from pyimagecuda import Image, ImageU8, convert_u8_to_float, convert_float_to_u8

# uint8 → float32
u8_img = ImageU8(1920, 1080)
f32_img = Image(1920, 1080)
convert_u8_to_float(f32_img, u8_img)

# float32 → uint8
convert_float_to_u8(u8_img, f32_img)

Direct Upload/Download

from pyimagecuda import Image, upload, download

# Upload raw RGBA float32 bytes to GPU
img = Image(512, 512)
raw_data = bytes(512 * 512 * 16)  # 16 bytes per pixel (4 × float32)
upload(img, raw_data)

# Download from GPU to CPU
raw_data = download(img)  # Returns bytes

"Raw Data Types" upload() and download() transfer raw bytes without conversion. * If using Image, data must be float32 (16 bytes per pixel). * If using ImageU8, data must be uint8 (4 bytes per pixel).

Copy Between Buffers

from pyimagecuda import Image, copy

src = Image(1920, 1080)
dst = Image(1920, 1080)

copy(dst, src)  # GPU-to-GPU copy (very fast)

Best Practices

For Simple Scripts

# Just load and save
img = load("input.jpg")
process(img)
save(img, "output.png")

For NumPy/OpenCV/Pillow Workflows

import cv2
from pyimagecuda import from_numpy, to_numpy

# Load with your preferred library
frame = cv2.imread("photo.jpg")

# Process on GPU
gpu_img = from_numpy(frame)
process(gpu_img)

# Back to CPU
result = to_numpy(gpu_img)
cv2.imwrite("output.jpg", result)

For Batch Processing

# Reuse buffers
f32 = Image(4096, 4096)
u8 = ImageU8(4096, 4096)

for file in files:
    load(file, f32_buffer=f32, u8_buffer=u8)
    process(f32)
    save(f32, output_file, u8_buffer=u8)

For Video Processing

import cv2
from pyimagecuda import Image, ImageU8, from_numpy, to_numpy

# Fixed-size buffers for consistent frame sizes
frame_buffer = Image(1920, 1080)
u8_temp = ImageU8(1920, 1080)

cap = cv2.VideoCapture("video.mp4")
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    from_numpy(frame, f32_buffer=frame_buffer, u8_buffer=u8_temp)
    process(frame_buffer)

    result = to_numpy(frame_buffer)
    cv2.imshow("Output", (result * 255).astype(np.uint8))

Performance Notes

NumPy Integration:

uint8 arrays: Uploads 4 bytes/pixel → GPU converts to float32 (<1ms for 1920×1080)
float32 arrays: Direct upload 16 bytes/pixel (no conversion)
Download: Always float32 → 16 bytes/pixel transfer

File Loading:

pyvips decodes file to uint8 (CPU)
uint8 → GPU: 4 bytes/pixel (1920×1080 = ~8MB transfer)
GPU converts uint8 → float32: <1ms

File Saving:

GPU converts float32 → uint8: <1ms
uint8 → CPU: 4 bytes/pixel (1920×1080 = ~8MB transfer)
pyvips encodes uint8 to file (CPU)

Why this is fast:

CPU↔GPU transfers prefer uint8 when possible (4× smaller than float32)
Conversions happen on GPU (massively parallel)
NumPy bridge uses optimized upload/download paths

Tip: For maximum throughput with NumPy arrays, prefer uint8 input when possible, and use buffer reuse for batch processing.