Performance Benchmarks
PyImageCUDA is designed for GPU-accelerated image processing. These benchmarks compare its performance against industry-standard libraries (Pillow and OpenCV) across common operations.
Test Environment
All benchmarks ran on the following hardware and software:
- CPU: AMD Ryzen 7 3700X
- GPU: NVIDIA RTX 3070
- RAM: DDR4
- Storage: Standard SSD
- OS: Windows 11
- Driver: Latest NVIDIA drivers
Each test ran 50 iterations with a 1920×1080 (1080p) RGBA source image.
Understanding the Results
The benchmarks are divided into two categories:
Pure Algorithm (Compute Bound)
These tests measure raw processing speed without disk I/O. The image is loaded once, then the operation runs repeatedly in memory. This shows the true computational performance difference between libraries.
When this matters:
- Batch processing multiple operations on the same image
- Real-time video processing pipelines
- Interactive applications with live previews
- Server-side processing with images already in memory
End-to-End (Disk I/O + Encode)
These tests include the complete workflow: load image from disk → process → encode → save to disk. This represents real-world scenarios where you load and save files repeatedly.
When this matters:
- Simple scripts that process individual files
- One-off image transformations
- Workflows where you must save intermediate results
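The two categories differ only in what sits inside the timed region. The following is a minimal sketch of that distinction, not the project's actual benchmark script: it uses stand-in CPU work (byte inversion and `zlib` compression) instead of PyImageCUDA calls, and `process`, `time_ms`, and `run_benchmark` are hypothetical names chosen for illustration.

```python
import os
import tempfile
import time
import zlib


def process(pixels: bytes) -> bytes:
    """Stand-in for an image operation: invert every byte."""
    return bytes(255 - b for b in pixels)


def time_ms(fn, iterations: int = 50) -> float:
    """Return the average wall-clock time of fn() in milliseconds."""
    fn()  # warm-up run, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) * 1000 / iterations


def run_benchmark(iterations: int = 50) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "src.bin")
        dst_path = os.path.join(tmp, "dst.bin")
        with open(src_path, "wb") as f:
            f.write(os.urandom(64 * 64 * 4))  # small fake RGBA image

        # Pure algorithm: load once, then time only the operation.
        with open(src_path, "rb") as f:
            pixels = f.read()
        pure_ms = time_ms(lambda: process(pixels), iterations)

        # End-to-end: load, process, encode, and save inside the timed region.
        def end_to_end():
            with open(src_path, "rb") as f:
                data = f.read()
            encoded = zlib.compress(process(data))
            with open(dst_path, "wb") as f:
                f.write(encoded)

        e2e_ms = time_ms(end_to_end, iterations)

    return {
        "pure_ms": pure_ms,
        "e2e_ms": e2e_ms,
        "pure_fps": 1000 / pure_ms,
        "e2e_fps": 1000 / e2e_ms,
    }


if __name__ == "__main__":
    print(run_benchmark())
```

When timing real GPU work, the device generally has to be synchronized before the clock stops, because CUDA kernel launches are asynchronous; whether a given library call already does this is something to check in its documentation.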
Key Takeaway
PyImageCUDA excels when you can keep data on the GPU across multiple operations. For single operations with file I/O, the benefit is smaller because disk I/O, encoding, and CPU↔GPU transfers dominate the total time.
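A rough way to see why chaining matters is to treat end-to-end time as a fixed per-file overhead plus one compute cost per operation. The sketch below applies that assumed additive model to the Gaussian blur rows from the tables in this document (PyImageCUDA with reused buffers versus Pillow). The model and the function names are illustrative assumptions, and the outputs are estimates, not measurements.

```python
# Gaussian blur averages from the benchmark tables (milliseconds).
PURE_MS = {"PyImageCUDA": 0.97, "Pillow": 52.67}    # pure algorithm
E2E_MS = {"PyImageCUDA": 83.72, "Pillow": 224.84}   # end-to-end


def fixed_overhead_ms(lib: str) -> float:
    """Estimate load + transfer + encode + save time as E2E minus compute."""
    return E2E_MS[lib] - PURE_MS[lib]


def estimated_total_ms(lib: str, n_ops: int) -> float:
    """Assumed model: one fixed overhead per file plus n_ops operations."""
    return fixed_overhead_ms(lib) + n_ops * PURE_MS[lib]


def estimated_speedup(n_ops: int) -> float:
    """Estimated PyImageCUDA speedup over Pillow for n_ops chained blurs."""
    return (estimated_total_ms("Pillow", n_ops)
            / estimated_total_ms("PyImageCUDA", n_ops))


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(f"{n:>2} chained blurs: ~{estimated_speedup(n):.1f}x vs Pillow")
```

Under this model the estimated advantage grows from about 2.7x for a single blur to roughly 12x for twenty chained blurs, because the fixed overhead is paid once while the per-operation saving accumulates.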
Performance Highlights
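PyImageCUDA delivers roughly 10-400x faster pure computation than the slowest entry in each test (the 1.0x baseline row in the detailed tables). Highlights, with the baseline library in parentheses:
- 376x faster - Blend/Composite operations (vs. OpenCV)
- 260x faster - Arbitrary angle rotations (vs. Pillow)
- 132x faster - Bilinear resizing (vs. Pillow)
- 54x faster - Gaussian blur (heavy filters) (vs. Pillow)
- 35x faster - Lanczos resizing (high-quality) (vs. Pillow)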
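For end-to-end workflows with disk I/O, PyImageCUDA typically provides a 1.5-2.7x speedup over Pillow. OpenCV is faster end-to-end in five of the seven tests (both resizes, rotate, flip, and crop); PyImageCUDA leads in Gaussian blur and blend.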
Best Practices for Maximum Performance
✅ DO: Reuse Buffers
Pre-allocate buffers and reuse them across operations:
from pyimagecuda import Image, load, Transform, save, ImageU8

# Pre-allocate buffers
src = Image(1920, 1080)
dst = Image(1920, 1080)
u8_buffer = ImageU8(1920, 1080)

# image_files: your list of input file names
for file in image_files:
    load(file, f32_buffer=src, u8_buffer=u8_buffer)
    Transform.flip(src, direction='horizontal', dst_buffer=dst)
    save(dst, f"output_{file}", u8_buffer=u8_buffer)

# Clean up once at the end
src.free()
dst.free()
u8_buffer.free()
Result: in the pure-algorithm benchmarks, reusing buffers is about 1.8x (Lanczos resize) to 13x (crop) faster than allocating new buffers on every call.
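The per-operation gain from buffer reuse can be read off the pure-algorithm tables by dividing the "Reuse" FPS by the "Alloc" FPS. The sketch below does that arithmetic with the figures published in this document; `reuse_gains` is a hypothetical helper name, and the blend test is omitted because it has no separate Reuse/Alloc rows.

```python
# (reuse_fps, alloc_fps) from the pure-algorithm tables in this document.
PURE_FPS = {
    "Gaussian blur": (1032.7, 286.4),
    "Resize bilinear": (7096.3, 1249.7),
    "Resize Lanczos": (1131.3, 617.8),
    "Rotate 35°": (3336.0, 301.9),
    "Flip horizontal": (5916.0, 577.5),
    "Crop center": (27037.3, 2029.6),
}


def reuse_gains() -> dict:
    """Return how many times faster the Reuse variant is than Alloc."""
    return {op: reuse / alloc for op, (reuse, alloc) in PURE_FPS.items()}


if __name__ == "__main__":
    # Print operations from smallest to largest gain.
    for op, gain in sorted(reuse_gains().items(), key=lambda kv: kv[1]):
        print(f"{op:<16} {gain:5.1f}x faster with reused buffers")
```

The gain is largest for cheap operations such as crop and flip, where allocation accounts for most of the time per call.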
✅ DO: Chain Operations on GPU
Keep data on GPU for multiple operations:
from pyimagecuda import load, Filter, Adjust, save
img = load("photo.jpg")
# All operations run on GPU without CPU transfers
Filter.gaussian_blur(img, radius=10, dst_buffer=img)
Adjust.brightness(img, 0.2)
Adjust.contrast(img, 1.3)
save(img, "output.jpg")
img.free()
❌ DON'T: Use GPU for Single Simple Operations
For one-off crops or flips with file I/O, CPU libraries are competitive, and OpenCV was faster in both end-to-end benchmarks:
from pyimagecuda import load, Transform, save

# This won't be significantly faster than Pillow/OpenCV
img = load("photo.jpg")
cropped = Transform.crop(img, 0, 0, 512, 512)
save(cropped, "output.jpg")
Use PyImageCUDA when building processing pipelines with multiple operations.
When to Use PyImageCUDA
✅ Perfect For:
- Batch processing thousands of images
- Real-time video processing (60+ FPS)
- Complex multi-step pipelines
- Interactive applications with live preview
- Heavy filters (blur, drop shadows, etc.)
⚠️ Not Ideal For:
- Single simple operations (crop, flip) on individual files
- Small images (<500×500)
- Environments without NVIDIA GPU
- Scripts that rarely run
Detailed Results
Below are the complete benchmark results across 7 common image processing operations. Each operation was tested 50 times, and results show both pure algorithm performance and end-to-end performance including disk I/O.
Generated automatically by /benchmarks/benchmarks.py
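The table columns are related by simple arithmetic: FPS is 1000 divided by the average time in milliseconds, and Speedup appears to be measured against the slowest row in each table (shown as 1.0x). The sketch below rebuilds a table from the published Lanczos averages. `build_table` and `to_markdown` are hypothetical helpers, not code from benchmarks.py, and the recomputed values differ slightly from the published ones because the published averages are rounded to two decimals.

```python
# Published pure-algorithm averages for the Lanczos resize test (ms).
LANCZOS_MS = {
    "PyImageCUDA Lanczos (Reuse)": 0.88,
    "PyImageCUDA Lanczos (Alloc)": 1.62,
    "OpenCV Lanczos": 4.16,
    "Pillow Lanczos": 31.15,
}


def build_table(avg_ms: dict) -> list:
    """Return (library, avg_ms, fps, speedup) rows, fastest first."""
    baseline = max(avg_ms.values())  # the slowest entry is the 1.0x baseline
    rows = []
    for lib, ms in sorted(avg_ms.items(), key=lambda kv: kv[1]):
        rows.append((lib, ms, 1000 / ms, baseline / ms))
    return rows


def to_markdown(rows: list) -> str:
    """Format rows in the same pipe-table layout used in this document."""
    lines = ["| Library | Avg (ms) | FPS | Speedup |", "|---|---|---|---|"]
    for lib, ms, fps, speedup in rows:
        lines.append(f"| {lib} | {ms:.2f} | {fps:.1f} | {speedup:.1f}x |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_markdown(build_table(LANCZOS_MS)))
```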
Gaussian Blur Benchmark (1080p)
Config: Image: photo.jpg, Radius: 20, Iterations: 50
Pure Algorithm (Compute Bound)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA (Reuse) | 0.97 | 1032.7 | 54.4x |
| PyImageCUDA (Alloc) | 3.49 | 286.4 | 15.1x |
| OpenCV | 50.85 | 19.7 | 1.0x |
| Pillow | 52.67 | 19.0 | 1.0x |
End-to-End (Disk I/O + Encode)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA E2E (Buffered) | 83.72 | 11.9 | 2.7x |
| PyImageCUDA E2E | 92.06 | 10.9 | 2.4x |
| OpenCV E2E | 100.19 | 10.0 | 2.2x |
| Pillow E2E | 224.84 | 4.4 | 1.0x |
Blend Normal Benchmark (1080p)
Config: Image: photo.jpg, Iterations: 50
Pure Algorithm (Compute Bound)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA | 0.27 | 3705.1 | 376.7x |
| Pillow | 12.55 | 79.7 | 8.1x |
| OpenCV | 101.67 | 9.8 | 1.0x |
End-to-End (Disk I/O + Encode)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA E2E (Buffered) | 153.80 | 6.5 | 2.2x |
| PyImageCUDA E2E | 159.17 | 6.3 | 2.1x |
| OpenCV E2E | 242.98 | 4.1 | 1.4x |
| Pillow E2E | 334.47 | 3.0 | 1.0x |
Resize Bilinear Benchmark (1080p → 800×600)
Config: Image: photo.jpg, Target: 800x600, Interpolation: Bilinear, Iterations: 50
Pure Algorithm (Compute Bound)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA Bilinear (Reuse) | 0.14 | 7096.3 | 132.9x |
| OpenCV Bilinear | 0.52 | 1940.0 | 36.3x |
| PyImageCUDA Bilinear (Alloc) | 0.80 | 1249.7 | 23.4x |
| Pillow Bilinear | 18.73 | 53.4 | 1.0x |
End-to-End (Disk I/O + Encode)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| OpenCV Bilinear E2E | 40.42 | 24.7 | 2.5x |
| PyImageCUDA Bilinear E2E (Buffered) | 49.53 | 20.2 | 2.0x |
| PyImageCUDA Bilinear E2E | 52.37 | 19.1 | 1.9x |
| Pillow Bilinear E2E | 99.91 | 10.0 | 1.0x |
Resize Lanczos Benchmark (1080p → 800×600)
Config: Image: photo.jpg, Target: 800x600, Interpolation: Lanczos/Bicubic, Iterations: 50
Pure Algorithm (Compute Bound)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA Lanczos (Reuse) | 0.88 | 1131.3 | 35.2x |
| PyImageCUDA Lanczos (Alloc) | 1.62 | 617.8 | 19.2x |
| OpenCV Lanczos | 4.16 | 240.2 | 7.5x |
| Pillow Lanczos | 31.15 | 32.1 | 1.0x |
End-to-End (Disk I/O + Encode)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| OpenCV Lanczos E2E | 44.42 | 22.5 | 2.5x |
| PyImageCUDA Lanczos E2E (Buffered) | 52.28 | 19.1 | 2.1x |
| PyImageCUDA Lanczos E2E | 54.43 | 18.4 | 2.0x |
| Pillow Lanczos E2E | 108.93 | 9.2 | 1.0x |
Rotate 35° Benchmark (1080p)
Config: Image: photo.jpg, Angle: 35°, Expand: True, Iterations: 50
Pure Algorithm (Compute Bound)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA (Reuse) | 0.30 | 3336.0 | 260.6x |
| PyImageCUDA (Alloc) | 3.31 | 301.9 | 23.6x |
| OpenCV | 5.98 | 167.2 | 13.1x |
| Pillow | 78.11 | 12.8 | 1.0x |
End-to-End (Disk I/O + Encode)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| OpenCV E2E | 156.55 | 6.4 | 2.8x |
| PyImageCUDA E2E | 162.23 | 6.2 | 2.7x |
| PyImageCUDA E2E (Buffered) | 162.33 | 6.2 | 2.7x |
| Pillow E2E | 432.76 | 2.3 | 1.0x |
Flip Horizontal Benchmark (1080p)
Config: Image: photo.jpg, Direction: Horizontal, Iterations: 50
Pure Algorithm (Compute Bound)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA (Reuse) | 0.17 | 5916.0 | 20.3x |
| PyImageCUDA (Alloc) | 1.73 | 577.5 | 2.0x |
| OpenCV | 3.12 | 320.6 | 1.1x |
| Pillow | 3.43 | 291.7 | 1.0x |
End-to-End (Disk I/O + Encode)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| OpenCV E2E | 126.09 | 7.9 | 2.6x |
| PyImageCUDA E2E (Buffered) | 149.56 | 6.7 | 2.2x |
| PyImageCUDA E2E | 150.95 | 6.6 | 2.2x |
| Pillow E2E | 324.85 | 3.1 | 1.0x |
Crop Center Benchmark (1080p → 512×512)
Config: Image: photo.jpg, Source: 1920×1080, Output: 512×512, Iterations: 50
Pure Algorithm (Compute Bound)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| PyImageCUDA (Reuse) | 0.04 | 27037.3 | 13.3x |
| OpenCV | 0.25 | 3987.1 | 2.0x |
| Pillow | 0.27 | 3692.7 | 1.8x |
| PyImageCUDA (Alloc) | 0.49 | 2029.6 | 1.0x |
End-to-End (Disk I/O + Encode)
| Library | Avg (ms) | FPS | Speedup |
|---|---|---|---|
| OpenCV E2E | 27.37 | 36.5 | 2.0x |
| PyImageCUDA E2E (Buffered) | 35.72 | 28.0 | 1.6x |
| PyImageCUDA E2E | 38.34 | 26.1 | 1.5x |
| Pillow E2E | 55.69 | 18.0 | 1.0x |