Implementing a Python Inference Service with HTTP (FastAPI) and RPC (gRPC) Interfaces

Introduction

When deploying AI models in production, we often face a choice:

Should we expose the service via HTTP (REST/JSON) or RPC (gRPC)?

This tutorial walks through building a Python inference service with both FastAPI (HTTP) and gRPC, then compares their performance and usability.

In modern AI applications, it’s common to deploy machine learning models as services so that different systems and programming languages can use them easily. For example, a Java backend system might need to use a PyTorch-trained model. Deploying the model in a Python service and exposing an API allows the Java system to send input data and receive predictions.

In this project, we build a Python inference service that supports two interface types:

  1. HTTP Interface using FastAPI (supports JSON and multipart/form-data requests)
  2. gRPC Interface using Protocol Buffers (efficient binary RPC calls)

The result is a hands-on guide to machine learning API deployment in Python, showing how to serve AI models over both REST (FastAPI) and gRPC for production-grade inference services.


Project Setup: Loading ResNet18 and API Interfaces


Technology Stack for Python Inference Service (FastAPI + gRPC)

Technology | Purpose | Why Chosen
PyTorch / torchvision | Model loading & inference | Industry-standard deep learning framework; torchvision provides pretrained models and transforms.
FastAPI | HTTP service framework | Asynchronous, high-performance web framework with automatic OpenAPI documentation.
gRPC | High-performance RPC | Compact binary protocol over HTTP/2, language-neutral, ideal for low-latency microservices.
Pillow (PIL) | Image processing | Easy to load and manipulate images before feeding them to the model.
python-multipart | HTTP file uploads | Required for multipart/form-data handling in FastAPI.
NumPy (<2) | Tensor operation compatibility | Pinned below 2.0 for compatibility with PyTorch 2.x builds compiled against NumPy 1.x.

Project Structure for the Python Inference Service (FastAPI + gRPC)
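
A layout consistent with the module paths used in this tutorial looks roughly as follows (directory and file names beyond those referenced in the code are illustrative assumptions):

project-root/
├── fastapi_app/
│   ├── app.py                 # FastAPI entrypoint (HTTP endpoints)
│   └── infer.py               # model loading and prediction helpers
├── grpc_app/
│   ├── protos/
│   │   └── classify.proto     # Protocol Buffers service definition
│   ├── classify_pb2.py        # generated message classes
│   ├── classify_pb2_grpc.py   # generated service stubs
│   ├── infer.py               # model loading + predict_bytes()
│   └── server.py              # gRPC server entrypoint
└── weights/
    └── resnet18-f37072fd.pth  # local pretrained weights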


Building the HTTP Inference API with FastAPI

API Design

The HTTP service exposes two POST endpoints: /predict/file accepts an image upload via multipart/form-data, and /predict/b64 accepts a base64-encoded image in a JSON body. Both return the top-k predicted labels with their probabilities.

Core Inference Code (fastapi_app/infer.py)

import torch
from torchvision import models
from PIL import Image
from typing import List, Tuple

# Lazily-initialized singletons: the model and its preprocessing transform
_MODEL = None
_TRANSFORM = None

def _load_model():
    global _MODEL, _TRANSFORM
    if _MODEL is None:
        # Load ResNet18 weights from a local file instead of downloading at startup
        weights_path = "/path/to/resnet18-f37072fd.pth"
        _MODEL = models.resnet18()
        state_dict = torch.load(weights_path, map_location=torch.device("cpu"))
        _MODEL.load_state_dict(state_dict, strict=False)
        _MODEL.eval()

        # Reuse the official preprocessing pipeline (resize, center-crop, normalize)
        default_weights = models.ResNet18_Weights.DEFAULT
        _TRANSFORM = default_weights.transforms()
    return _MODEL, _TRANSFORM

def preprocess(img: Image.Image) -> torch.Tensor:
    _, transform = _load_model()
    return transform(img).unsqueeze(0)

@torch.inference_mode()
def predict(img: Image.Image, topk: int = 5) -> List[Tuple[str, float]]:
    model, _ = _load_model()
    x = preprocess(img)
    logits = model(x)
    probs = torch.softmax(logits, dim=1)[0]
    topk_probs, topk_idxs = torch.topk(probs, k=topk)
    # Map class indices back to human-readable ImageNet category names
    classes = models.ResNet18_Weights.DEFAULT.meta["categories"]
    return [(classes[int(i)], float(topk_probs[j])) for j, i in enumerate(topk_idxs)]
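
As a quick sanity check outside the web service, predict() can be called directly; the image path below is a placeholder:

from PIL import Image
from infer import predict

img = Image.open("/path/to/image.jpg").convert("RGB")
for label, prob in predict(img, topk=3):
    print(f"{label}: {prob:.4f}")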

FastAPI App Entrypoint (fastapi_app/app.py)

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from infer import predict
from PIL import Image
import base64, io

app = FastAPI(title="ResNet18 Inference Service")

class B64ImageRequest(BaseModel):
    image_base64: str
    topk: int = 5

@app.post("/predict/file")
async def predict_file(file: UploadFile = File(...), topk: int = 5):
    img = Image.open(file.file).convert("RGB")
    res = predict(img, topk)
    return JSONResponse({"topk": [{"label": l, "prob": p} for l, p in res]})

@app.post("/predict/b64")
async def predict_b64(req: B64ImageRequest):
    img = Image.open(io.BytesIO(base64.b64decode(req.image_base64))).convert("RGB")
    res = predict(img, req.topk)
    return JSONResponse({"topk": [{"label": l, "prob": p} for l, p in res]})
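
The app can be served with uvicorn, the standard ASGI server for FastAPI. A minimal launcher script (a hypothetical run_http.py placed next to app.py, assuming uvicorn is installed) looks like this; running uvicorn app:app --port 8000 from the fastapi_app directory works equally well:

import uvicorn

if __name__ == "__main__":
    # Serve the FastAPI app object defined in app.py on port 8000
    uvicorn.run("app:app", host="127.0.0.1", port=8000)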

Testing with curl

The file-upload endpoint is easy to test with curl:

curl -X POST "http://127.0.0.1:8000/predict/file?topk=5" \
  -F "file=@/path/to/image.jpg"

Implementing the gRPC Inference API with Protocol Buffers

Protocol Buffers Definition (grpc_app/protos/classify.proto)

syntax = "proto3";

package classify;

service Classifier {
  rpc Predict (ImageRequest) returns (PredictionResponse) {}
}

message ImageRequest {
  bytes image = 1;
  int32 topk = 2;
}

message Prediction {
  string label = 1;
  float prob = 2;
}

message PredictionResponse {
  repeated Prediction topk = 1;
}
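
The classify_pb2.py and classify_pb2_grpc.py modules imported by the server are generated from this file with grpcio-tools. A small generation script (paths here assume the project layout sketched earlier) might look like:

from grpc_tools import protoc

# Generate classify_pb2.py and classify_pb2_grpc.py into grpc_app/
protoc.main([
    "grpc_tools.protoc",
    "-Igrpc_app/protos",
    "--python_out=grpc_app",
    "--grpc_python_out=grpc_app",
    "grpc_app/protos/classify.proto",
])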

Inference Logic (grpc_app/infer.py)

The model loading and preprocessing logic is the same as in the FastAPI app, again using local weights; in addition, this module exposes a predict_bytes() helper that the gRPC server imports.
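
A minimal sketch of predict_bytes(), assuming the _load_model() and predict() helpers shown earlier are also defined in this module:

import io
from typing import List, Tuple

from PIL import Image

def predict_bytes(image_bytes: bytes, topk: int = 5) -> List[Tuple[str, float]]:
    # Decode the raw bytes received over gRPC into a PIL image, then reuse predict()
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    return predict(img, topk)  # predict() as defined in the FastAPI-side infer.py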

gRPC Server (grpc_app/server.py)

import grpc
from concurrent import futures
import classify_pb2, classify_pb2_grpc
from infer import predict_bytes

class ClassifierServicer(classify_pb2_grpc.ClassifierServicer):
    def Predict(self, request, context):
        # proto3 ints default to 0, so "or 5" falls back to top-5 when topk is unset
        preds = predict_bytes(request.image, topk=request.topk or 5)
        return classify_pb2.PredictionResponse(
            topk=[classify_pb2.Prediction(label=l, prob=p) for l, p in preds]
        )

def serve():
    # Thread pool sized for a few concurrent requests; tune max_workers for your workload
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    classify_pb2_grpc.add_ClassifierServicer_to_server(ClassifierServicer(), server)
    # Plaintext listener on port 50051; add TLS credentials for production deployments
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()

Testing gRPC Client

For testing, we implement a minimal gRPC client:

import grpc, classify_pb2, classify_pb2_grpc

with open("image.jpg", "rb") as f:
    data = f.read()

channel = grpc.insecure_channel("127.0.0.1:50051")
stub = classify_pb2_grpc.ClassifierStub(channel)
resp = stub.Predict(classify_pb2.ImageRequest(image=data, topk=5))
print(resp)
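
The fields of the response follow the .proto definition, so the top-k entries can be unpacked directly:

for pred in resp.topk:
    print(f"{pred.label}: {pred.prob:.4f}")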

HTTP vs gRPC Performance Comparison and Deployment Takeaways

Feature | HTTP (FastAPI) | gRPC
Protocol | JSON / multipart | Protobuf (binary)
Performance | Slower (text encoding, larger payloads) | Faster (compact binary, HTTP/2)
Debugging | Easier (curl, Postman) | Harder (needs generated stubs)
Cross-language | Excellent | Excellent
Complexity | Low | Medium (requires .proto and codegen)

Using local weights avoids external dependencies, improves startup time, and makes deployments more stable.

In production, the choice comes down to the trade-off between ease of integration and debugging (HTTP) and raw performance (gRPC); serving both interfaces, as this project does, keeps either option available to clients.