Performance Benchmarks

PyImageCUDA is designed for GPU-accelerated image processing. These benchmarks compare its performance against industry-standard libraries (Pillow and OpenCV) across common operations.

Test Environment

All benchmarks were conducted on the following hardware:

CPU: AMD Ryzen 7 3700X
GPU: NVIDIA RTX 3070
RAM: DDR4
Storage: Standard SSD
OS: Windows 11
Driver: Latest NVIDIA drivers

Each test ran 50 iterations with a 1920×1080 (1080p) RGBA source image.

Understanding the Results

The benchmarks are divided into two categories:

Pure Algorithm (Compute Bound)

These tests measure raw processing speed without disk I/O. The image is loaded once, then the operation runs repeatedly in memory. This shows the true computational performance difference between libraries.

When this matters:

Batch processing multiple operations on the same image
Real-time video processing pipelines
Interactive applications with live previews
Server-side processing with images already in memory

End-to-End (Disk I/O + Encode)

These tests include the complete workflow: load image from disk → process → encode → save to disk. This represents real-world scenarios where you load and save files repeatedly.

When this matters:

Simple scripts that process individual files
One-off image transformations
Workflows where you must save intermediate results

Key Takeaway

PyImageCUDA excels when you can keep data on the GPU across multiple operations. For single operations with file I/O, the benefit is smaller due to CPU↔GPU transfer overhead.

Performance Highlights

PyImageCUDA delivers 10-400x faster performance in pure computation:

376x faster - Blend/Composite operations
260x faster - Arbitrary angle rotations
132x faster - Bilinear resizing
54x faster - Gaussian blur (heavy filters)
35x faster - Lanczos resizing (high-quality)

For end-to-end workflows with disk I/O, PyImageCUDA typically provides 1.5-2.7x speedup over CPU libraries.

Best Practices for Maximum Performance

✅ DO: Reuse Buffers

Pre-allocate buffers and reuse them across operations:

from pyimagecuda import Image, load, Transform, save, ImageU8

# Pre-allocate buffers
src = Image(1920, 1080)
dst = Image(1920, 1080)
u8_buffer = ImageU8(1920, 1080)

for file in image_files:
    load(file, f32_buffer=src, u8_buffer=u8_buffer)
    Transform.flip(src, direction='horizontal', dst_buffer=dst)
    save(dst, f"output_{file}", u8_buffer=u8_buffer)

# Clean up once at the end
src.free()
dst.free()
u8_buffer.free()

Result: 3-11x faster than allocating new buffers each time.

✅ DO: Chain Operations on GPU

Keep data on GPU for multiple operations:

from pyimagecuda import load, Filter, Adjust, save

img = load("photo.jpg")

# All operations run on GPU without CPU transfers
Filter.gaussian_blur(img, radius=10, dst_buffer=img)
Adjust.brightness(img, 0.2)
Adjust.contrast(img, 1.3)

save(img, "output.jpg")
img.free()

❌ DON'T: Use GPU for Single Simple Operations

For one-off crops or flips with file I/O, CPU libraries are competitive:

# This won't be significantly faster than Pillow/OpenCV
img = load("photo.jpg")
cropped = Transform.crop(img, 0, 0, 512, 512)
save(cropped, "output.jpg")

Use PyImageCUDA when building processing pipelines with multiple operations.

When to Use PyImageCUDA

✅ Perfect For:

Batch processing thousands of images
Real-time video processing (60+ FPS)
Complex multi-step pipelines
Interactive applications with live preview
Heavy filters (blur, drop shadows, etc.)

⚠️ Not Ideal For:

Single simple operations (crop, flip) on individual files
Small images (<500×500)
Environments without NVIDIA GPU
Scripts that rarely run

Detailed Results

Below are the complete benchmark results across 7 common image processing operations. Each operation was tested 50 times, and results show both pure algorithm performance and end-to-end performance including disk I/O.

Generated automatically by /benchmarks/benchmarks.py

PyImageCUDA Performance Report

Generated automatically by /benchmarks/benchmarks.py

Gaussian Blur Benchmark (1080p)

Config: Image: photo.jpg, Radius: 20, Iterations: 50

Pure Algorithm (Compute Bound)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA (Reuse)	0.91	1095.4	56.6x
PyImageCUDA (Alloc)	3.44	290.6	15.0x
OpenCV	47.17	21.2	1.1x
Pillow	51.68	19.4	1.0x

End-to-End (Disk I/O + Encode)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA E2E (Buffered)	79.21	12.6	2.7x
PyImageCUDA E2E	86.14	11.6	2.5x
OpenCV E2E	97.91	10.2	2.2x
Pillow E2E	217.36	4.6	1.0x

Blend Normal Benchmark (1080p)

Config: Image: photo.jpg, Iterations: 50

Pure Algorithm (Compute Bound)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA	0.27	3746.9	370.2x
Pillow	12.45	80.3	7.9x
OpenCV	98.80	10.1	1.0x

End-to-End (Disk I/O + Encode)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA E2E	162.79	6.1	2.0x
PyImageCUDA E2E (Buffered)	165.58	6.0	2.0x
OpenCV E2E	245.91	4.1	1.3x
Pillow E2E	326.25	3.1	1.0x

Resize Bilinear Benchmark (1080p -> 800x600)

Config: Image: photo.jpg, Target: 800x600, Interpolation: Bilinear, Iterations: 50

Pure Algorithm (Compute Bound)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA Bilinear (Reuse)	0.11	8730.9	172.6x
OpenCV Bilinear	0.48	2065.1	40.8x
PyImageCUDA Bilinear (Alloc)	0.75	1340.1	26.5x
Pillow Bilinear	19.77	50.6	1.0x

End-to-End (Disk I/O + Encode)

Library	Avg (ms)	FPS	Speedup
OpenCV Bilinear E2E	38.11	26.2	2.6x
PyImageCUDA Bilinear E2E (Buffered)	45.29	22.1	2.2x
PyImageCUDA Bilinear E2E	48.03	20.8	2.1x
Pillow Bilinear E2E	100.55	9.9	1.0x

Resize Lanczos Benchmark (1080p -> 800x600)

Config: Image: photo.jpg, Target: 800x600, Interpolation: Lanczos/Bicubic, Iterations: 50

Pure Algorithm (Compute Bound)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA Lanczos (Reuse)	0.50	1999.1	61.7x
PyImageCUDA Lanczos (Alloc)	1.09	915.4	28.3x
OpenCV Lanczos	3.36	297.7	9.2x
Pillow Lanczos	30.87	32.4	1.0x

End-to-End (Disk I/O + Encode)

Library	Avg (ms)	FPS	Speedup
OpenCV Lanczos E2E	45.54	22.0	2.4x
PyImageCUDA Lanczos E2E (Buffered)	52.51	19.0	2.1x
PyImageCUDA Lanczos E2E	53.69	18.6	2.0x
Pillow Lanczos E2E	107.78	9.3	1.0x

Rotate 35° Benchmark (1080p)

Config: Image: photo.jpg, Angle: 35°, Expand: True, Iterations: 50

Pure Algorithm (Compute Bound)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA (Reuse)	0.26	3827.8	329.3x
PyImageCUDA (Alloc)	3.05	327.4	28.2x
OpenCV	6.10	164.0	14.1x
Pillow	86.02	11.6	1.0x

End-to-End (Disk I/O + Encode)

Library	Avg (ms)	FPS	Speedup
OpenCV E2E	145.25	6.9	2.7x
PyImageCUDA E2E (Buffered)	147.43	6.8	2.7x
PyImageCUDA E2E	150.59	6.6	2.6x
Pillow E2E	398.69	2.5	1.0x

Flip Horizontal Benchmark (1080p)

Config: Image: photo.jpg, Direction: Horizontal, Iterations: 50

Pure Algorithm (Compute Bound)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA (Reuse)	0.17	5901.5	22.3x
PyImageCUDA (Alloc)	1.63	614.1	2.3x
OpenCV	3.01	332.4	1.3x
Pillow	3.78	264.6	1.0x

End-to-End (Disk I/O + Encode)

Library	Avg (ms)	FPS	Speedup
OpenCV E2E	120.68	8.3	2.5x
PyImageCUDA E2E (Buffered)	141.84	7.0	2.1x
PyImageCUDA E2E	145.26	6.9	2.1x
Pillow E2E	300.48	3.3	1.0x

Crop Center Benchmark (1080p → 512×512)

Config: Image: photo.jpg, Source: 1920×1080, Output: 512×512, Iterations: 50

Pure Algorithm (Compute Bound)

Library	Avg (ms)	FPS	Speedup
PyImageCUDA (Reuse)	0.05	19517.5	9.5x
Pillow	0.29	3493.2	1.7x
OpenCV	0.29	3475.1	1.7x
PyImageCUDA (Alloc)	0.49	2056.4	1.0x

End-to-End (Disk I/O + Encode)

Library	Avg (ms)	FPS	Speedup
OpenCV E2E	26.86	37.2	2.1x
PyImageCUDA E2E (Buffered)	35.04	28.5	1.6x
PyImageCUDA E2E	35.71	28.0	1.6x
Pillow E2E	57.40	17.4	1.0x