Building a High-Concurrency AI Moderation Service with Spring WebFlux — From Concept to Production

As AI applications move from research into real-world production, one of the biggest challenges is how to wrap model inference into a high-concurrency, low-latency API service.

In this project, we implemented an NSFW image moderation API using Spring Boot + WebFlux + DJL (Deep Java Library), creating a moderation service with production-level scalability.


1. Why WebFlux?

1.1 The Limits of Spring MVC

Spring MVC is based on the Servlet model, where each request is tied to a dedicated thread.

In AI inference scenarios this is a poor fit:

- Inference calls are slow (hundreds of milliseconds to seconds on CPU), so each request thread sits blocked for the entire call.
- Under high concurrency the thread pool is quickly exhausted, and new requests queue up or are rejected.
- Every blocked thread still consumes stack memory and scheduler overhead while doing no useful work.

1.2 The Advantages of WebFlux

Spring WebFlux is built on Reactor’s reactive programming model and Netty’s non-blocking I/O.

Key benefits:

- A small, fixed number of event-loop threads can serve many concurrent connections, because threads are never parked waiting on I/O.
- Blocking work such as model inference can be isolated on a dedicated scheduler (Schedulers.boundedElastic()) without stalling the event loop.
- Reactor operators like timeout and onErrorResume make timeouts and fallbacks declarative.

Using Spring MVC is like a waiter who takes your order and then waits at your table until the food arrives. With Spring WebFlux, the waiter takes your order, the kitchen prepares it, and whichever waiter is free delivers it when it is ready, so the whole service runs more efficiently.
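In code, the contrast looks like this (a minimal sketch: the blocking variant and its moderationService helper are hypothetical, while the reactive one mirrors the controller shown in section 3.3):

// Blocking MVC style: the request thread is parked for the entire inference call.
@PostMapping("/check")
public ModerationResult checkBlocking(@RequestParam("file") MultipartFile file) throws IOException {
    return moderationService.classify(file.getBytes()); // thread blocked until inference finishes
}

// Reactive WebFlux style: the handler returns a Mono immediately;
// the response is written whenever inference completes, freeing the event loop.
@PostMapping("/check")
public Mono<ModerationResult> checkReactive(@RequestPart("file") FilePart file) {
    return service.classify(file); // no thread parked while waiting
}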


2. Project Structure

(Figure: Spring WebFlux AI Moderation Service project structure diagram)
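For reference, here is a typical layout reconstructed from the classes discussed below (the package structure is illustrative):

src/main/java/com/example/moderation/
├── config/NsfwModelConfig.java               // model loading + preprocessing
├── service/ModerationFluxService.java        // reactive inference pipeline
├── controller/ModerationFluxController.java  // multipart upload endpoint
└── model/ModerationResult.java               // response payload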


3. Core Implementation

3.1 Model Configuration (NsfwModelConfig.java)

Loads the NSFW model (ONNX format, via DJL's OnnxRuntime engine) and handles preprocessing.

@Configuration
public class NsfwModelConfig {

    // Example path; point this at your exported ONNX model file.
    private final Path modelPath = Paths.get("models/nsfw_model.onnx");

    // Assumed label set; adjust to match your model's output head.
    private static final List<String> LABELS =
            List.of("drawings", "hentai", "neutral", "porn", "sexy");

    // Identity-input / softmax-output translator: preprocessing is done
    // manually in preprocess(), so the model receives the NDList as-is.
    private final Translator<NDList, Classifications> translator = new Translator<>() {
        @Override
        public NDList processInput(TranslatorContext ctx, NDList input) {
            return input;
        }

        @Override
        public Classifications processOutput(TranslatorContext ctx, NDList output) {
            NDArray probabilities = output.singletonOrThrow().softmax(-1).squeeze();
            return new Classifications(LABELS, probabilities);
        }

        @Override
        public Batchifier getBatchifier() {
            return null; // preprocess() already adds the batch dimension
        }
    };

    @Bean
    public ZooModel<NDList, Classifications> nsfwModel() throws Exception {
        Criteria<NDList, Classifications> criteria = Criteria.builder()
                .setTypes(NDList.class, Classifications.class)
                .optModelPath(modelPath)
                .optTranslator(translator)
                .optEngine("OnnxRuntime")
                .build();

        return ModelZoo.loadModel(criteria);
    }

    // ImageNet-style normalization into an NCHW float tensor.
    public static NDArray preprocess(BufferedImage img, NDManager manager) {
        int width = 224;
        int height = 224;
        float[] mean = {0.485f, 0.456f, 0.406f};
        float[] std = {0.229f, 0.224f, 0.225f};

        ai.djl.modality.cv.Image djlImage = ImageFactory.getInstance().fromImage(img);
        ai.djl.modality.cv.Image scaled = djlImage.resize(width, height, true);
        BufferedImage buffered = (BufferedImage) scaled.getWrappedImage();

        int[] pixels = buffered.getRGB(0, 0, width, height, null, 0, width);
        float[] data = new float[3 * width * height];

        // Convert packed ARGB pixels (HWC) into planar CHW channels,
        // normalizing each channel with the ImageNet mean/std.
        for (int i = 0; i < pixels.length; i++) {
            int rgb = pixels[i];
            int r = (rgb >> 16) & 0xFF;
            int g = (rgb >> 8) & 0xFF;
            int b = rgb & 0xFF;

            data[i] = ((r / 255f) - mean[0]) / std[0];                      // R plane
            data[width * height + i] = ((g / 255f) - mean[1]) / std[1];     // G plane
            data[2 * width * height + i] = ((b / 255f) - mean[2]) / std[2]; // B plane
        }

        return manager.create(data, new Shape(1, 3, height, width)); // NCHW
    }
}

3.2 Reactive Service (ModerationFluxService.java)

Uses Mono.fromCallable with Schedulers.boundedElastic() to run the blocking inference call off the Netty event-loop threads.

@Service
public class ModerationFluxService {

    private final ZooModel<NDList, Classifications> model;

    public ModerationFluxService(ZooModel<NDList, Classifications> model) {
        this.model = model;
    }

    public Mono<ModerationResult> classify(FilePart file) {
        return DataBufferUtils.join(file.content())
                .flatMap(buffer -> Mono.fromCallable(() -> {
                    // asInputStream(true) releases the joined DataBuffer when
                    // the stream closes, avoiding a buffer leak.
                    try (InputStream in = buffer.asInputStream(true);
                         Predictor<NDList, Classifications> predictor = model.newPredictor();
                         NDManager manager = model.getNDManager().newSubManager()) {

                        BufferedImage img = ImageIO.read(in);
                        if (img == null) {
                            throw new IllegalArgumentException("Uploaded file is not a readable image");
                        }

                        // The sub-manager scopes the tensor's native memory to this request.
                        NDArray nd = NsfwModelConfig.preprocess(img, manager);
                        Classifications out = predictor.predict(new NDList(nd));

                        var best = out.best();
                        Map<String, Double> probs = new LinkedHashMap<>();
                        out.items().forEach(i -> probs.put(i.getClassName(), i.getProbability()));

                        return new ModerationResult(best.getClassName(), best.getProbability(), probs);
                    }
                })
                // subscribeOn must sit on the inner Mono so the blocking call
                // actually runs on the elastic pool, not on the Netty thread
                // that emitted the joined buffer.
                .subscribeOn(Schedulers.boundedElastic()))
                .timeout(Duration.ofSeconds(3)) // fail fast if inference stalls
                .onErrorResume(TimeoutException.class,
                        e -> Mono.just(ModerationResult.timeoutFallback()));
    }
}
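The ModerationResult type is not shown above; here is a minimal sketch consistent with how the service constructs it (field names are inferred, assuming a Java record):

public record ModerationResult(
        String label,                  // best class name
        double confidence,             // probability of the best class
        Map<String, Double> scores) {  // full class -> probability map

    // Fallback returned when inference exceeds the 3-second timeout.
    public static ModerationResult timeoutFallback() {
        return new ModerationResult("timeout", 0.0, Map.of());
    }
}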

3.3 Controller (ModerationFluxController.java)

Handles POST requests with image uploads.

@RestController
@RequestMapping("/flux/moderation")
@RequiredArgsConstructor
public class ModerationFluxController {

    private final ModerationFluxService service;

    @PostMapping(value = "/check", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    public Mono<ModerationResult> check(@RequestPart("file") FilePart file) {
        return service.classify(file);
    }
}
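To exercise the endpoint end to end, here is a minimal integration-test sketch using Spring's WebTestClient (test-image.jpg is a placeholder classpath resource):

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@AutoConfigureWebTestClient
class ModerationFluxControllerTest {

    @Autowired
    private WebTestClient client;

    @Test
    void classifiesUploadedImage() {
        MultipartBodyBuilder body = new MultipartBodyBuilder();
        body.part("file", new ClassPathResource("test-image.jpg"))
                .contentType(MediaType.IMAGE_JPEG);

        client.post().uri("/flux/moderation/check")
                .contentType(MediaType.MULTIPART_FORM_DATA)
                .body(BodyInserters.fromMultipartData(body.build()))
                .exchange()
                .expectStatus().isOk()
                .expectBody(ModerationResult.class);
    }
}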

4. Performance Benchmarking

We benchmarked both the blocking MVC and the WebFlux versions of the moderation API locally with JMeter at 200 concurrent users. The numbers below reflect performance on my development machine and highlight the current bottlenecks.

Test Summary

| Metric | MVC API | WebFlux API |
| --- | --- | --- |
| Requests | 1018 | 1340 |
| Errors | 0 (0.00%) | 0 (0.00%) |
| Average Response Time | 35,853 ms (~36s) | 26,957 ms (~27s) |
| Min Response Time | 3,232 ms | 340 ms |
| Max Response Time | 49,187 ms (~49s) | 141,574 ms (~141s) |
| Median Response Time | 36,556 ms | 23,713 ms |
| 90th Percentile | 42,631 ms | 54,292 ms |
| 95th Percentile | 44,824 ms | 64,867 ms |
| 99th Percentile | 46,654 ms | 86,554 ms |
| Throughput | 4.93 req/s | 3.31 req/s |
| Avg. Response Size | 4.8 KB | 3.2 KB |
| APDEX (500 ms / 1.5 s) | 0.000 | 0.002 |

Observations

- WebFlux processed more total requests (1340 vs 1018) with a lower median latency (23.7 s vs 36.6 s) and a far lower minimum (340 ms vs 3.2 s).
- MVC's latencies cluster tightly around its median, the signature of requests waiting in a fixed thread pool; WebFlux accepts every request immediately, so contention surfaces as a long tail (p99 ~87 s, max ~141 s).
- Neither API returned errors under 200 concurrent users; throughput in both cases is bounded by CPU inference, not the web layer.

Key Takeaway

This benchmark was run on my local development machine without GPU acceleration. While the absolute numbers look slow, WebFlux demonstrates better scalability than the blocking MVC version under the same conditions. In high-concurrency scenarios, WebFlux handles requests asynchronously, preventing thread pool exhaustion.

On local CPU tests, MVC delivers slightly higher throughput but suffers from rigid blocking latency. WebFlux, while showing long-tail effects, is better suited for scaling under true high-concurrency workloads, especially once GPU acceleration or batching is introduced.

Next steps: Optimizations such as GPU acceleration, inference batching, or distributed model serving will significantly improve throughput and latency in production environments.
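To make the batching idea concrete, here is a hypothetical Reactor micro-batching sketch (not part of the current service; predictBatch is a placeholder for a real batched predictor call):

public class BatchingSketch {

    record Request(byte[] imageBytes) {}

    // Requests from all callers are funneled through one sink.
    private final Sinks.Many<Request> sink =
            Sinks.many().multicast().onBackpressureBuffer();

    public void start() {
        sink.asFlux()
                // Group up to 16 requests, waiting at most 20 ms: a small
                // latency cost for much better accelerator utilization.
                .bufferTimeout(16, Duration.ofMillis(20))
                .flatMap(batch -> Mono.fromCallable(() -> predictBatch(batch))
                        .subscribeOn(Schedulers.boundedElastic()))
                .subscribe();
    }

    public void submit(Request r) {
        sink.tryEmitNext(r);
    }

    // Placeholder: a DJL predictor can accept a stacked NDArray of
    // shape (batchSize, 3, 224, 224) and return one result per row.
    private List<Classifications> predictBatch(List<Request> batch) {
        throw new UnsupportedOperationException("sketch only");
    }
}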


5. Conclusion

This project demonstrates how to build a production-ready AI moderation service:

- Spring WebFlux + Netty for non-blocking request handling under high concurrency
- DJL for running the NSFW model directly on the JVM, with explicit NCHW preprocessing
- Schedulers.boundedElastic() to keep blocking inference off the event loop
- timeout + onErrorResume for graceful degradation when inference stalls
- JMeter benchmarking to quantify behavior and locate the inference bottleneck

This is more than just a demo — it’s a practical AI engineering case study for bringing deep learning models into production systems.