Making the import work at scale
A tester ran the import scan on 13,000 images and it took over an hour. That prompted two days of AI-assisted profiling and optimisation across the entire import pipeline. The scan is 825x faster, the import is 18x faster, and the library loads instantly.
Peter Knight
29 March 2026
MediaHub has a feature called "Scan & Import" that can search your site for existing images and import them into the media library. The idea is straightforward: you already have hundreds or thousands of images in your ProcessWire image fields, and you shouldn't have to re-upload them one by one.
I built the scan with an assumption: most sites would have somewhere between 100 and 500 images. For that scale, the original implementation worked fine. It wasn't fast, but it finished.
The tester with 13,000+ images
I recently asked a few people to test MediaHub against their own sites. I was expecting feedback about the UI, or edge cases in field types, or things I hadn't thought of. What I hadn't anticipated was one tester running the import scan on a site with 13,000 images. It took well over an hour.
It was a good example of how easy it is to make assumptions about how people will use your software. A site with 13,000 images felt unusual to me, but the truth is I have no idea what size sites people are truly running. And if one tester hit this, it's safe to assume more will.
Rather than just accept the limitation, I treated it as an opportunity for some focused testing, to see how far I could push the performance using my own AI-assisted development tools.
Setting up a test environment
I needed a reproducible, worst-case scenario to benchmark against. Prompting Cursor (an AI-powered code editor), I set up a dedicated local ProcessWire installation called "MediaHub Labs" and gave it the following instructions:
- Create a script to generate 25,000 different-sized JPEGs across several categories (architecture, cars, cities, food, fruit, landscapes, staff portraits, vegetables, countries, galleries)
- Use a neutral colour palette for the images and write the category name and filename on each one
- Create multiple pages on the test site with image fields and assign batches of images to their respective categories
- Run an initial scan across all 25,000 images to see how long the import takes and establish a baseline
- Identify bottlenecks: inefficient database queries, unnecessary image processing, or poorly thought-out ProcessWire API calls
- Optimise, re-run, record the results
- Repeat
I used the Tracy Debugger module with some custom profiling timers to record results after each change. Cursor would read the profiling output, identify the next bottleneck, and suggest targeted fixes. The whole optimise-measure-repeat loop ran through a single afternoon.
Although I only needed to improve scanning performance based on 13,000 images, I thought it would be useful to test with 25,000 and see how the scan performed at a larger scale. Initially, as expected, the scan timed out, crashed, and my MacBook sounded like it needed a new fan. But that was fine. We needed to establish a baseline and now we had something to work with. Cursor came back with suggestions for improvements and we started making much better progress.
As we progressed, scans that had previously crashed were completing in an hour. Over the next few hours of optimising and iterating, we eventually got the scan to finish in minutes.
I was hoping for some reasonable improvements but the end result surprised me. The scan went from timing out to completing in two seconds, an 825x improvement. But the scan was only the first bottleneck. The actual import of images into the library, and the library view itself, both had their own problems. This post covers all three.
Why it was slow
The scan does one conceptually simple thing: for each image field on your ProcessWire site, find every page that uses it, read each image's metadata (filename, dimensions, filesize, description, date added), and return the results as JSON for the admin UI to display.
The reality was more complicated. Two things were happening behind the scenes that turned a simple data-reading operation into something computationally expensive.
Variation counting
For every image, the scan called ProcessWire's getVariations() method to count how many resize and crop variations existed on disk. This performs a filesystem directory scan with regex matching against every file in the page's asset directory. With 80 images per page and 160+ files per directory (originals plus thumbnails), it was doing thousands of directory scans per field.
The variation count appeared as a column in the results table: a nice-to-have detail, poorly thought through, that turned out to be consuming 96% of the total scan time.
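To make the cost concrete, here's a rough sketch of the kind of work a variation count involves. This is illustrative, not ProcessWire's actual getVariations() code: the point is that every call reads the whole directory and regex-matches every entry in it.

```php
<?php
// Sketch of the per-image cost (illustrative, not ProcessWire's
// actual implementation). Variation files follow a pattern like
// basename.{width}x{height}[suffix].ext, so counting them means
// scanning the whole directory and regex-matching every file.
function countVariations(string $dir, string $basename, string $ext): int
{
    $count = 0;
    $pattern = '/^' . preg_quote($basename, '/')
             . '\.\d+x\d+\S*\.' . preg_quote($ext, '/') . '$/';
    foreach (scandir($dir) as $file) {
        if (preg_match($pattern, $file)) $count++;
    }
    return $count;
}
```

With 160+ files per directory and 80 images per page, that inner loop runs thousands of times per field, which is how a cosmetic column ends up dominating the scan.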
Full object hydration
Each image was loaded as a full ProcessWire Pageimage object. That means Wire infrastructure, hooks, access checks, and property magic, all to read a filename, a width, and a height: data that exists as plain values in the database.
What changed
Two rounds of optimisation, each targeting the dominant bottleneck at the time.
Round 1: drop the variation count
The variation count column was useful context but not essential information. Nobody decides whether to import an image based on how many crops exist. Removing getVariations() from the scan loop eliminated 96% of the processing time in one change.
3,000 images went from 25.6 seconds to 1.5 seconds. A 17x improvement.
Round 2: skip object hydration, skip thumbnail generation
With variation counting gone, the profiler showed two new bottlenecks: Pageimage object hydration at 50% and server-side thumbnail generation at 30%.
The fix was a hybrid approach. Page metadata (path, URL, template name) still comes from ProcessWire's standard find(), since that data requires hydrated Page objects to be accurate. But image field data (filenames, dimensions, filesize, descriptions, dates) now comes from findRaw(), which reads raw database values without constructing any ORM objects.
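In ProcessWire terms, the hybrid looks roughly like this. Treat it as a sketch: the field and column names are illustrative, and MediaHub's real selectors are more involved.

```php
<?php
// Simplified sketch of the hybrid lookup (field names are illustrative).

// Page-level metadata: a normal find(), because accurate paths and
// template names need hydrated Page objects.
$found = $pages->find("template=gallery, images.count>0");

foreach ($found as $page) {
    // Image-level data: findRaw() reads the field's database table
    // directly, so no Pageimage objects are ever constructed.
    $rows = $pages->findRaw("id=$page->id", [
        'images' => ['data', 'description', 'created', 'filedata'],
    ]);
    // 'data' holds the filename; 'filedata' is stored JSON that
    // includes dimensions, so no file needs to be opened to report
    // width and height. The plain arrays go straight to JSON output.
}
```

The design choice here is deliberate: findRaw() everywhere would be fastest, but page paths and template names can involve hooks and multi-language logic, so those stay on the slow-but-correct path.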
For thumbnails, the change was more drastic: stop generating them entirely. The scan results table now displays the original image file, scaled down to 72 pixels by CSS. No GD resizing, no filesystem checks, no thumbnail cache to manage. And it has a practical benefit: when you click a thumbnail to preview it, you see the full-resolution original instead of a blurry 160-pixel crop.
3,000 images went from 1.5 seconds to 31 milliseconds. Another 48x improvement on top of Round 1.
The numbers
Measured against a test site with 25,000 generated images across 316 pages and 10 image fields.
| Version | 3,000 images | 25,000 images | Speedup |
|---|---|---|---|
| Original | 25.6 seconds | Timeout / ~60 minutes | Baseline |
| After Round 1 | 1.5 seconds | ~3-5 minutes (projected) | 17x |
| After Round 2 | 31 milliseconds | ~2 seconds (measured) | 825x |
The 25,000-image result is a real measurement from the browser UI, not a projection. The Tracy debug bar showed individual AJAX requests completing in 223 to 644 milliseconds.
What about shared hosting?
Local development environments are fast: SSDs, plenty of RAM, and no one else competing for resources, which is not true of the average shared host. The question that matters is how this performs on the kind of hosting most ProcessWire sites actually run on.
To test this, I had Cursor build a profiling script that simulates shared hosting conditions by adding artificial delays: 0.8 milliseconds per filesystem I/O operation and 0.2 milliseconds per CPU-bound operation. These numbers approximate a budget shared host where disk access and CPU time are constrained by other tenants on the same machine.
Under those conditions, 3,000 images processed in 818 milliseconds. Extrapolated to 25,000 images, that's about 5 to 7 seconds of server processing time, plus a few hundred milliseconds of HTTP overhead per AJAX request.
For a more typical site with 500 to 5,000 images, even budget shared hosting should return results in one to two seconds. That's the difference between a feature that works and a feature that times out, annoys users and increases my support workload.
Client-side changes
Scanning 25,000 images is one problem. Displaying them in the results table is another. The original implementation rendered every result as a DOM element simultaneously, which caused noticeable browser lag on large result sets.
Again, having assumed most sites might contain 100 to 500 images, I hadn't considered what would happen when someone had 10,000-plus images and how poorly they would display in one continuous table without pagination. My bad.
The results table now paginates at 100 images per page with navigation controls at the bottom. Combined with loading="lazy" on image elements, the browser only loads and renders what's visible. Sorting and filtering reset to page one automatically.
What stayed the same
The scan still uses time-based batching for its AJAX requests. Each request has a 10-second time budget; when the budget runs out mid-page, the server returns what it has and the client fires the next request. This ensures the scan works within any server's execution limits without configuration, and it adapts automatically: faster servers process more images per request.
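The budgeting logic itself is small. Here's a minimal sketch of the idea; the function name and the stand-in "work" are mine, not MediaHub's actual code.

```php
<?php
// Sketch of time-budgeted batching: process items until the budget
// is spent, then tell the client where to resume.
function processBatch(array $items, int $offset, float $budgetSeconds): array
{
    $start = microtime(true);
    $results = [];
    $i = $offset;
    $total = count($items);

    while ($i < $total) {
        $results[] = strtoupper($items[$i]); // stand-in for per-image work
        $i++;
        if (microtime(true) - $start >= $budgetSeconds) {
            break; // budget spent: return a partial batch
        }
    }

    return [
        'results'    => $results,
        'nextOffset' => $i,           // client resumes here next request
        'done'       => $i >= $total,
    ];
}
```

Because the loop measures elapsed time rather than counting items, a fast server naturally packs more images into each request and a slow one sends smaller batches, with no configuration either way.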
Duplicate detection still works the same way, matching normalised filenames across fields to flag potential duplicates in the results. The 25,000-image test correctly identified 4,541 potential duplicates across the 10 fields.
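For the curious, filename-based matching boils down to a normalisation step something like this. It's a sketch; the module's actual rules may differ.

```php
<?php
// Sketch of filename normalisation for duplicate detection: strip the
// path, lowercase, and drop a ProcessWire-style variation suffix so
// that "Photo.260x260.jpg" and "photo.jpg" compare equal.
function normaliseFilename(string $path): string
{
    $name = strtolower(basename($path));
    // Remove a trailing .{width}x{height}[suffix] variation marker.
    return preg_replace('/\.\d+x\d+[a-z-]*(?=\.[a-z0-9]+$)/', '', $name);
}
```

Two images in different fields that normalise to the same string get flagged as potential duplicates in the results table.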
The import was slow too
With the scan finishing in seconds, I assumed the hard part was done. Then I clicked "Import Selected" on 100 images and watched it take nearly a minute.
The scan finds images. The import actually creates MediaHub assets from them: copying the file, creating a page, saving metadata, generating thumbnails. I'd optimised the scan extensively but hadn't once profiled the import itself. When I did, the numbers were not great.
For each of the 100 images, the import was taking about 525 milliseconds. Nearly 88% of that time was thumbnail generation: two GD resize operations per image (one for the grid view, one for the list view), creating the small preview cards you see in the library. At that rate, importing 25,000 images would take over three and a half hours on my local machine, and significantly longer on shared hosting.
Deferring the expensive work
The key insight was that thumbnails don't need to exist at import time. They're only needed when someone views the library. And most users won't browse through all 25,000 assets the moment they finish importing.
Three changes brought the import from 525ms per image down to 29ms:
- Defer thumbnail generation entirely. The two GD calls per image were removed from the import path. This single change eliminated 88% of the import time.
- Reduce database saves. The original code saved each asset page three times: once to create it, once to add the image file, and once to save metadata (alt text, description, MIME type). Setting metadata before the first save eliminated one round trip. Two saves remain because ProcessWire requires a saved page ID before files can be added to it.
- Cache source lookups. When importing multiple images from the same page and field (which is common in bulk import), the source page and field objects are now cached and reused instead of re-loaded per image.
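Put together, the slimmed-down import loop looks roughly like this. It's a sketch with illustrative names (template, field, and variable names are mine, not MediaHub's actual code):

```php
<?php
// Sketch of the optimised import loop (names are illustrative).
$sourceCache = [];

foreach ($selectedImages as $item) {
    // Cache source page/field lookups: bulk imports usually pull many
    // images from the same page and field.
    $key = $item['pageId'] . ':' . $item['fieldName'];
    if (!isset($sourceCache[$key])) {
        $sourceCache[$key] = $pages->get($item['pageId'])->get($item['fieldName']);
    }
    $source = $sourceCache[$key];

    // Save 1: metadata is set BEFORE the first save, so creating the
    // asset page and storing its metadata is a single round trip.
    $asset = new Page();
    $asset->template = 'media-asset';
    $asset->parent = $library;
    $asset->title = $item['description'] ?: $item['filename'];
    $asset->save();

    // Save 2: ProcessWire needs a saved page ID before files can be
    // attached, so adding the image costs one more save.
    $asset->media_image->add($item['filePath']);
    $asset->save();

    // No thumbnail generation here: that work is deferred until the
    // asset is first viewed in the library.
}
```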
On top of the server-side changes, the client-side batch size was increased from 5 to 25 images per AJAX request. Fewer HTTP round-trips, same total work. Each batch now completes in under 3 seconds, well within shared hosting timeout limits.
Import numbers
| Version | Per image | 100 images | Speedup |
|---|---|---|---|
| Original | 525ms | ~50 seconds | Baseline |
| Optimised | 29ms | ~3 seconds | 18x |
For 25,000 images, the projected import time is about 14 minutes locally and 30 to 50 minutes on shared hosting. That's down from an estimated 3.6 hours (local) and "functionally broken" (shared hosting, where it would time out repeatedly).
Making the library load instantly
There was still a problem. After importing thousands of images, navigating to the library was slow. Every tile on the page triggered a GD resize to generate a preview thumbnail. With 25 tiles per page, that's 25 calls to ProcessWire's image sizing engine, each one reading the full image from disk and writing a resized copy. For freshly imported assets with no existing thumbnails, this added several seconds per page load.
The fix uses the same principle as the scan: serve the original image and let CSS handle the sizing. A new function checks whether a thumbnail variation already exists on disk (a single filesystem glob, no GD). If a thumbnail is there, it returns it. If not, it returns the original image URL. The browser's object-fit: cover makes the image look identical to a proper thumbnail in the card layout.
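The lookup is cheap enough to sketch in a few lines. Paths and naming here are illustrative, not the module's actual function.

```php
<?php
// Sketch of the thumbnail-or-original lookup: one glob(), no GD.
// If a size variation already exists on disk we serve it; otherwise
// we fall back to the original and let CSS handle the sizing.
function previewUrl(string $dir, string $urlDir, string $filename): string
{
    $base = pathinfo($filename, PATHINFO_FILENAME);
    $ext  = pathinfo($filename, PATHINFO_EXTENSION);

    $matches = glob("$dir/$base.*x*.$ext"); // any existing variation file
    if ($matches) {
        return $urlDir . basename($matches[0]); // cheap filesystem hit
    }
    return $urlDir . $filename; // original; the browser scales it down
}
```

The card layout's `object-fit: cover` makes the scaled original visually indistinguishable from a real thumbnail, so the fallback costs nothing in appearance.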
In the background, a small JavaScript snippet fires after the page renders. It collects the IDs of any assets that were served as originals and sends them to a thumbnail-generation endpoint in batches. The thumbnails are created silently, with no visible delay. The next time the user visits that page, the smaller thumbnail files are served instead.
The result: the library page always loads in under a second, regardless of whether the assets were imported five minutes ago or five months ago. Paginating through thousands of freshly imported assets is smooth. No spinner, no waiting.
The full picture on shared hosting
Putting it all together for a site with 25,000 images on typical shared hosting:
| Step | Before | After |
|---|---|---|
| Scan | ~60 minutes / timeout | ~6-8 seconds |
| Import | Broken (timeouts) | ~30-50 minutes |
| Library view | ~11 seconds per page | Instant |
25,000 images is an extreme case. A more typical site with 1,000 to 5,000 images would scan in about a second and import in 5 to 15 minutes. The import shows an estimated time before you start and lets you filter by template or field (with image counts) so you can stagger imports if you prefer.
Download
These improvements are included in the latest MediaHub release, available from the downloads page. The full changelog is in the documentation.