This demo showcases how to deploy a machine learning model using BentoML and perform load testing with Locust.
This guide uses uv, a fast, reliable, and feature-rich Python package installer and resolver, but you can use any other package manager you prefer.
- Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Create a new virtual environment and install dependencies:
# option 1: let uv manage the project environment for you
uv sync
# option 2: create a virtual environment and install from pyproject.toml directly
uv venv --python=3.11
uv pip install -r pyproject.toml
BentoML is an open-source platform for machine learning model serving. It simplifies the process of packaging, deploying, and managing machine learning models in production environments. Key features include:
- Model packaging and versioning
- API server generation
- Microservice architecture support
- Scalable model serving (on BentoCloud)
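To make the serving step below concrete, here is a minimal sketch of what a BentoML 1.2+ service class named Summarization might look like. This is not necessarily the demo's actual service.py; the transformers pipeline, model choice, and endpoint name are illustrative assumptions.

```python
# service.py -- a minimal sketch, not necessarily the demo's actual code.
# Assumes BentoML >= 1.2 and the transformers library; the model choice
# (sshleifer/distilbart-cnn-12-6) is illustrative only.
import bentoml


@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 60})
class Summarization:
    def __init__(self) -> None:
        # Load the model once per worker at startup.
        from transformers import pipeline

        self.pipeline = pipeline(
            "summarization", model="sshleifer/distilbart-cnn-12-6"
        )

    @bentoml.api
    def summarize(self, text: str) -> str:
        # BentoML exposes this method as POST /summarize on the service port.
        result = self.pipeline(text)
        return result[0]["summary_text"]
```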
Locust is an open-source load testing tool. It allows you to write Python code to define the behavior of your users and then swarm your system with millions of simultaneous users. Features include:
- Python-based test scripts
- Web-based UI for real-time test monitoring
- Distributed testing across multiple machines
- Customizable metrics and reporting
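For reference, a minimal locustfile.py for this demo might look like the sketch below. The /summarize endpoint and JSON payload are assumptions based on the service sketch above; adjust them to match your actual service.

```python
# locustfile.py -- a minimal sketch; adjust the endpoint and payload
# to match your actual BentoML service.
from locust import HttpUser, task, between


class SummarizationUser(HttpUser):
    # Wait 1-3 seconds between tasks to mimic real users.
    wait_time = between(1, 3)

    @task
    def summarize(self) -> None:
        # Assumes the service accepts a JSON body keyed by parameter name.
        self.client.post(
            "/summarize",
            json={"text": "Breaking news: a long article to be summarized ..."},
        )
```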
- Ensure your model is saved as a BentoML service (refer to the BentoML documentation for details; a minimal sketch is shown above).
- Run the BentoML service:
  uv run bentoml serve service:Summarization
- Your BentoML service should now be running on http://localhost:3000.
- Ensure you have the locustfile.py in your project directory (see the sketch above).
- Start Locust:
  uv run locust -f locustfile.py
- Open a web browser and go to http://localhost:8089.
- In the Locust web interface:
  - Set the number of users to simulate
  - Set the spawn rate (users started per second)
  - Enter the host URL of your BentoML server (e.g., http://localhost:3000)
  - Click "Start swarming"
- Monitor the performance metrics in real time through the Locust web interface.
After running your load tests, you can analyze the results in the Locust web interface. Look for metrics such as:
- Response times (average, median, 95th percentile)
- Requests per second
- Number of failures
Use these insights to optimize your BentoML service for better performance under load.
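If you also want to analyze results outside the web UI, Locust can export statistics to CSV (for example, by adding --csv results to the locust command), which you can then inspect programmatically. A minimal sketch, assuming a results_stats.csv produced by a recent Locust release (column names may differ in your version):

```python
# analyze_results.py -- a sketch assuming Locust was run with --csv=results,
# which writes results_stats.csv; column names match recent Locust releases.
import pandas as pd

stats = pd.read_csv("results_stats.csv")

# The "Aggregated" row summarizes all endpoints combined.
aggregated = stats[stats["Name"] == "Aggregated"].iloc[0]

print(f"Requests/s:    {aggregated['Requests/s']:.1f}")
print(f"Median (ms):   {aggregated['Median Response Time']}")
print(f"95th pct (ms): {aggregated['95%']}")
print(f"Failures:      {aggregated['Failure Count']}")
```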
Contributions to improve this demo are welcome! Please feel free to submit pull requests or open issues for any enhancements, bug fixes, or documentation improvements.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.