BERT inference cost/performance analysis CPU vs GPU

Vincent Teyssier
3 min read · Apr 19, 2021


BERT is a fantastic model that can be fine-tuned in many ways to adapt to various NLP tasks such as sentiment analysis, personality classification, etc.

In order to feed BERT, you need to clean your text a bit, possibly run it through a stemmer, and tokenize it. This pre-processing takes computing power that quickly becomes resource consuming when running large inference jobs.
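Here is a minimal sketch of what that pipeline can look like, using the Hugging Face tokenizer and an optional NLTK stemmer. The cleaning rules and the 256-token cap are my assumptions for illustration, not the exact code used in this benchmark:

```python
import re

from nltk.stem import PorterStemmer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
stemmer = PorterStemmer()

def clean(text, stem=False):
    # Strip URLs, collapse whitespace, lowercase; stemming stays optional
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    if stem:
        text = " ".join(stemmer.stem(w) for w in text.split())
    return text

samples = ["I love long walks on rainy days!", "Check https://example.com now"]
# Texts below 255 characters fit comfortably under a 256-token cap
encoded = tokenizer([clean(s) for s in samples],
                    padding=True, truncation=True, max_length=256,
                    return_tensors="pt")
```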

There have been many comments on Stack Overflow arguing that a GPU is overkill for inference. In this article, I look at whether that is true, and what we can take away from this experiment in terms of cost optimization.

The model:

We use bert-base-uncased fine-tuned on the MBTI corpus. This is a dataset of roughly 8600 plain-text samples, each labelled with an MBTI (Myers-Briggs Type Indicator) type, a classic personality classification framework.
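As a rough sketch, loading such a fine-tuned checkpoint for 16-way MBTI classification could look like this (the checkpoint path is hypothetical):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical path to the bert-base-uncased checkpoint fine-tuned on the MBTI corpus
MODEL_DIR = "./bert-base-uncased-mbti"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR, num_labels=16)  # 16 MBTI types
model.eval()  # inference only, no gradients needed
```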

The data:

We will feed our model with 5000 files, each containing 1500 short texts (below 255 characters).

These files are in CSV format and contain a lot of metadata columns that are useless for this test; we simply filter them out when loading each CSV into a dataframe.
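Something along these lines (the "data/*.csv" path and the "text" column name are assumptions, adapt them to your own layout):

```python
import glob

import pandas as pd

# The 5000 input files; only the text column is kept, the metadata
# columns are dropped at load time via usecols.
files = sorted(glob.glob("data/*.csv"))

def load_texts(path):
    df = pd.read_csv(path, usecols=["text"])
    return df["text"].astype(str).tolist()
```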

The infrastructure:

For this test we will use Google Cloud Platform compute instances running Ubuntu 20.04, a clean install with just the required packages.

We will focus on computing power since we only do inference and don’t need large memory.

For the CPU tests I first tried an e2-highcpu-8 (8 vCPUs, 8 GB memory), then an e2-highcpu-16 (16 vCPUs, 64 GB memory) and an e2-highcpu-32 (32 vCPUs, 96 GB memory).

Then I tried an n1-standard-8 (8 vCPUs, 30 GB memory) with 1 x NVIDIA Tesla V100 GPU.

In both cases the memory is totally overkill.

Hourly costs, as of today, are as follows:

e2-highcpu-8           $0.197872 per hour, or $144.45 monthly
e2-highcpu-16          $0.395744 per hour, or $288.89 monthly
e2-highcpu-32          $0.791488 per hour, or $577.79 monthly
n1-standard-8 + V100   $2.004 per hour, or $1,462.99 monthly

Interesting to see the price difference: roughly 10x between the cheapest and the most expensive option.

Results:

The 8 vCPU machine produces consistent processing times between 69.4s and 70.5s for each file of 1500 samples.
Overall vCPU usage is around 75%.

The 16 vCPU machine produces consistent processing times between 40.80s and 43.34s for each file of 1500 samples.
Overall vCPU usage is around 60%.

The 32 vCPU machine produces consistent processing times between 35.73s and 40.21s for each file of 1500 samples.
Overall vCPU usage is around 40%.

What we see is that the more we scale, the less we utilize the full computing power, i.e. a lot of waste for little gain.

And finally, the GPU instance: output processing time is also consistent, between 2.16s and 3.28s. That's roughly 25x faster than our 8 vCPU instance, and about 15 times faster than the 32 vCPU instance.
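For reference, each per-file timing above comes from a loop of this shape, reusing the model, tokenizer and loader from the sketches earlier. Batch size and helper names are my assumptions, not the exact benchmark code:

```python
import time

import torch

# Run on the GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def classify_file(texts, batch_size=32):
    preds = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=256, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(**inputs).logits
        preds.extend(torch.argmax(logits, dim=1).cpu().tolist())
    return preds

start = time.perf_counter()
predictions = classify_file(load_texts(files[0]))  # one file of 1500 short texts
print(f"processed in {time.perf_counter() - start:.2f}s")
```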

If you had any doubt before, it is obvious that the time/cost ratio is best on the GPU instance. However, given the entry cost of a GPU, your use case needs to make sense from a value perspective.

Let’s do the math:

         Cost per hour   Cost per inference   Max inferences per hour
8 vCPU   $0.197872       $0.003848            51.42
16 vCPU  $0.395744       $0.004584            86.33
32 vCPU  $0.791488       $0.008266            95.74
1 GPU    $2.004          $0.001447            1384.61

So in our case, the break-even point is around 520 inferences per hour (1384.61 / (0.003848 / 0.001447), which reduces to the GPU hourly cost divided by the CPU cost per inference): if you run fewer inferences than that per hour, the 8 vCPU instance is cheaper; beyond it, the GPU instance wins.
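The same arithmetic in a few lines of Python, using the figures from the table above:

```python
# Break-even between the 8 vCPU instance and the GPU instance,
# using the measured figures from the table above.
CPU_COST_PER_HOUR = 0.197872
GPU_COST_PER_HOUR = 2.004
CPU_MAX_INFERENCES_PER_HOUR = 51.42
GPU_MAX_INFERENCES_PER_HOUR = 1384.61

cpu_cost_per_inference = CPU_COST_PER_HOUR / CPU_MAX_INFERENCES_PER_HOUR   # ~0.003848
gpu_cost_per_inference = GPU_COST_PER_HOUR / GPU_MAX_INFERENCES_PER_HOUR   # ~0.001447

# Inferences per hour above which the GPU instance beats the 8 vCPU instance
break_even = GPU_COST_PER_HOUR / cpu_cost_per_inference
print(f"break-even: {break_even:.2f} inferences per hour")   # ~520
```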

It is now up to you to do the calculation for your own use case, depending on the size of the data you infer over and the number of inferences you run per hour, but the method above should give you a good idea.
