Model Deployment: Deploying NLP models in production environments (e.g. REST APIs, cloud services)¶
Deploying NLP models to production is an important step in the NLP development process. It means making the model easy to use and accessible to end users, typically by exposing it as a REST API or as a cloud service. It is also important to ensure that the deployment is secure and can handle high volumes of traffic.
In this section, we use a Llama 2-family model as an example to demonstrate how to deploy a large language model in a production environment with Ray Serve. Ray Serve is a high-performance, scalable, and easy-to-use framework for serving machine learning models in production. It is built on top of Ray, a distributed computing framework for Python that makes it easy to scale Python applications. Ray Serve lets you expose models as REST APIs or as cloud services, and it provides a number of features that simplify production deployments, including support for multiple ML frameworks, request batching, and fine-grained resource management.
Here is an example of how to deploy a Llama 2 model (TinyLlama-1.1B-Chat) as a REST API using Ray Serve.
Dependencies¶
The only new core library is ray[serve]. One peculiarity is that we install a specific version of transformers to be compatible with the model being deployed. Because ray already imports transformers by default, we need to pin the version up front.
!pip install transformers==4.39.2 ray[serve]
Define the model endpoint¶
Since we are using a pretrained Llama 2 model with some optimizations to fit it on the GPU, we install another set of dependencies.
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.39.2 trl==0.4.7
The basic pattern of model serving with Ray Serve is to define a deployment class that:
- loads the model to serve in __init__()
- allows making predictions through a mandatory __call__() method.
The deployment constraints themselves are passed through a decorator (those familiar with Ray Core will recognize this approach and the range of parameters that can be applied to Ray tasks or actors).
import requests
from starlette.requests import Request
from typing import Dict

from ray import serve

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Conversation,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the entire model on the GPU 0
device_map = {"": 0}

runtime_env = {"pip": ["accelerate==0.21.0", "peft==0.4.0", "bitsandbytes==0.40.2", "transformers==4.39.2", "trl==0.4.7"]}

# 1: Define a Ray Serve application.
@serve.deployment(
    num_replicas=1,
    ray_actor_options={
        "num_cpus": 1,
        "num_gpus": 1,
        "runtime_env": runtime_env
    }
)
class LLMChat:
    def __init__(self):
        # load model
        base_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            low_cpu_mem_usage=True,
            return_dict=True,
            torch_dtype=torch.float16,
            device_map=device_map,
            # device_map="auto",  # "balanced"
            offload_folder="./"
        )

        # load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"

        # attach the conversational pipeline to the deployment
        self.model = pipeline(task="conversational", model=base_model, tokenizer=tokenizer, max_length=100)

    def generate(self, message: str) -> str:
        # Create the input conversation from the user message
        conversation = Conversation([
            {"role": "system", "content": "You are a polite assistant. Answer user message only with a short and simple response in casual conversation."}
        ])
        conversation.add_message({"role": "user", "content": message})

        # Run inference
        messages = self.model(
            conversation, do_sample=True,
            temperature=0.9, top_k=50, top_p=0.95
        )

        # Post-process output to return only the generated text
        return messages[-1]["content"]

    async def __call__(self, http_request: Request) -> str:
        prompt: str = await http_request.json()
        return self.generate(prompt)
Among the many available parameters, the key one is often to define ray_actor_options with a runtime_env.
runtime_env = {"pip": ["accelerate==0.21.0", "peft==0.4.0", "bitsandbytes==0.40.2", "transformers==4.39.2", "trl==0.4.7"]}
The most important aspect in our case is being able to pin the specific dependencies required by the model we load.
This environment is then declared in the decorator, along with additional resource-management options: in our case one CPU and one GPU for the model (in production you can also request fractional resources; a sketch follows after the decorator below), as well as one replica. More replicas allow for higher request throughput in production.
@serve.deployment(
    num_replicas=1,
    ray_actor_options={
        "num_cpus": 1,
        "num_gpus": 1,
        "runtime_env": runtime_env
    }
)
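As an illustrative, hedged sketch (the class name and the fractions below are assumptions, not values used in this notebook), fractional resources let several replicas share one GPU:
from ray import serve
from starlette.requests import Request

# Sketch only: two replicas sharing a single GPU, each reserving half a CPU
# and half a GPU. The class and the fractions are illustrative assumptions.
@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0.5},
)
class SmallModelService:
    async def __call__(self, http_request: Request) -> str:
        return "ok"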
One can note that Ray Serve is not limited to serving ML models; in fact, any Python service can be exposed the same way, as the small sketch below illustrates.
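For example, here is a minimal, hedged sketch of a non-ML service deployed with the same pattern (EchoService and its behaviour are made up for illustration):
from ray import serve
from starlette.requests import Request

# Sketch only: a plain Python service deployed with the same pattern.
# "EchoService" is a hypothetical example, not part of the notebook above.
@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 1})
class EchoService:
    async def __call__(self, http_request: Request) -> str:
        payload: str = await http_request.json()
        return payload[::-1]  # echo the input back, reversed

echo_app = EchoService.bind()
# serve.run(echo_app, name="echo", route_prefix="/echo")  # would expose it next to other apps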
The next step is to bind the model to an app. Multiple models can be served by the same Ray Serve instance; the only limit is the computing resources available to Ray. In our case, a limited Colab notebook with a single GPU, so one model will be enough for now.
app = LLMChat.bind()
Serving the application¶
The app is then served on a specific route on a simple REST server.
serve.run(app, route_prefix="/")
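Earlier we noted that several models can be served side by side. As a hedged sketch (the application names, routes, and the second model are assumptions, not part of this notebook), each application gets its own name and route prefix:
# Sketch only: exposing two applications on the same Serve instance.
# "OtherModel", the application names and the routes are hypothetical,
# and this is not meant to replace the single-route setup used above.
chat_app = LLMChat.bind()
serve.run(chat_app, name="chat", route_prefix="/chat")

# other_app = OtherModel.bind()
# serve.run(other_app, name="colors", route_prefix="/colors")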
This code creates a REST API endpoint for TinyLlama-1.1B-Chat that you can call from any HTTP client to generate text from a given prompt. To generate text, make a POST request to http://127.0.0.1:8000/ with the prompt as JSON data, where the prompt is the text you want the conversation to start from.
import requests
prompt = "How are you?"
response = requests.post("http://127.0.0.1:8000/", json=prompt)
llm_response = response.text
print(llm_response)
Shutting down the server is done with a simple call (useful when debugging the app configuration).
serve.shutdown()
If you want your service to scale to handle high-volume traffic, you can use Ray Serve's autoscaling feature, which automatically adjusts the number of replicas up and down based on the request load.
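As a hedged sketch (the bounds and target below are illustrative assumptions, not tuned values), autoscaling is enabled by replacing num_replicas with an autoscaling_config in the deployment decorator:
from ray import serve
from starlette.requests import Request

# Sketch only: autoscale between 1 and 4 replicas based on in-flight requests.
# The bounds and target are illustrative assumptions; note that the target
# parameter name can differ between Ray versions.
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_num_ongoing_requests_per_replica": 2,
    },
    ray_actor_options={"num_cpus": 1, "num_gpus": 1},
)
class ScalableLLMChat:
    def __init__(self):
        # In a real deployment this would load the model, as in LLMChat above.
        pass

    async def __call__(self, http_request: Request) -> str:
        prompt: str = await http_request.json()
        return f"echo: {prompt}"  # placeholder for real generation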
Hosting a custom model¶
The same deployment pattern works for a custom fine-tuned model: only the model name (here a PEFT fine-tuned TinyLlama checkpoint), the generation settings, and the system prompt change.
import requests
from starlette.requests import Request
from typing import Dict

from ray import serve

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Conversation,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel

model_name = "gdupont/TinyLlama-1.1B-Chat-colors-v1.0_peft"

# Load the entire model on the GPU 0
device_map = {"": 0}

runtime_env = {"pip": ["accelerate==0.21.0", "peft==0.4.0", "bitsandbytes==0.40.2", "transformers==4.39.2", "trl==0.4.7"]}

# 1: Define a Ray Serve application.
@serve.deployment(
    num_replicas=1,
    ray_actor_options={
        "num_cpus": 1,
        "num_gpus": 1,
        "runtime_env": runtime_env
    }
)
class LLMChat:
    def __init__(self):
        # load model
        base_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            low_cpu_mem_usage=True,
            return_dict=True,
            torch_dtype=torch.float16,
            device_map=device_map,
            # device_map="auto",  # "balanced"
            offload_folder="./"
        )

        # load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"

        # attach the conversational pipeline to the deployment
        self.model = pipeline(task="conversational", model=base_model, tokenizer=tokenizer)

    def generate(self, message: str) -> str:
        # Create the input conversation from the user message
        conversation = Conversation([
            {
                "role": "system",
                "content": "You will be given a text describing a color. Only return the hex color code without any other information"
            }
        ])
        conversation.add_message({"role": "user", "content": message})

        # Run inference (greedy decoding, short output: just the hex code)
        messages = self.model(
            conversation, do_sample=False, max_new_tokens=6,
            # temperature=0.9, top_k=50, top_p=0.95
        )

        # Post-process output to return only the generated text
        return messages[-1]["content"]

    async def __call__(self, http_request: Request) -> str:
        prompt: str = await http_request.json()
        return self.generate(prompt)
app = LLMChat.bind()
serve.run(app, route_prefix="/")
import requests
prompt = "Golden yellow"
response = requests.post("http://127.0.0.1:8000/", json=prompt)
llm_response = response.text
print(llm_response)
Next steps¶
There are several ways to monitor a Ray Serve deployment, including:
- Ray Dashboard: Ray provides a built-in dashboard with real-time information about the health and performance of the cluster and its components, including the Ray Serve applications. You can access it by opening a web browser and navigating to http://<ray_head_node_ip>:8265.
- Logs: Ray Serve writes logs to the driver output and to log files on each node under the Ray session's logs directory. Inside a deployment you can also emit your own messages with Python's standard logging module, and they are captured alongside the Serve system logs.
- Metrics: Ray exposes built-in metrics, such as resource utilization and request latencies, in Prometheus format, which can be scraped and visualized with tools like Grafana, InfluxDB, or Datadog.
- Health checks: You can implement custom health checks to monitor the health of your deployment and take action if a component is not functioning correctly (a hedged sketch follows below).
By monitoring the deployment, you can ensure that it is running smoothly, identify and resolve any issues that arise, and make informed decisions about scaling and other operations.
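As a hedged sketch of the last point (the probe logic and option values are assumptions, not part of the notebook above), a deployment class can define a check_health() method that Ray Serve calls periodically; raising an exception marks the replica unhealthy so it gets restarted:
from ray import serve
from starlette.requests import Request

# Sketch only: a deployment-level health check. The probe is a hypothetical
# example; adapt it to whatever "healthy" means for your model.
@serve.deployment(
    num_replicas=1,
    health_check_period_s=30,   # how often Serve calls check_health()
    health_check_timeout_s=10,  # how long a check may take before it fails
)
class MonitoredService:
    def __init__(self):
        self.model_loaded = True  # would be set after loading the real model

    def check_health(self):
        # Raising here marks this replica unhealthy so Serve restarts it.
        if not self.model_loaded:
            raise RuntimeError("model is not loaded")

    async def __call__(self, http_request: Request) -> str:
        return "ok"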