When we start using numpy at the beginning of a project, we don’t think about performance: what matters first is getting a solution that works. Only later, with real data, when we watch the computer slow down, do we have to figure out how to avoid it. And then we remember the GPU inside our machine and ask: can it help?
Yes, it can. PyTorch can use the GPU and perform all computations on it. PyTorch works with tensor primitives that look much like numpy arrays. Unlike numpy arrays, however, tensors can be declared to reside on the GPU and are then computed there. Thanks to the large number of GPU cores working in parallel, computing performance increases dramatically.
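As a quick illustration (a minimal sketch, assuming PyTorch is installed and a CUDA-capable GPU is present), this is how a tensor can be placed on the GPU and computed there:

import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tensor created with device=... lives in that device's memory,
# and every operation on it runs there
x = torch.rand(3, 3, device=device)
y = x @ x            # matrix multiplication executed on the selected device
print(y.device)      # e.g. "cuda:0" when a GPU is used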
We will use the matrix multiplication algorithm as the example.
First, let’s make sure the multiplication is computed correctly.
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(a @ b)
Output:
[[19 22]
 [43 50]]
Everything is as expected.
Now tensors come into play. At this point, all the necessary modules and drivers have already been installed:
import torch

print("Cuda is available" if torch.cuda.is_available()
      else "Cpu only available")
Output:
Cuda is available
Let’s let PyTorch multiply the matrices:
ta = torch.from_numpy(a)
tb = torch.from_numpy(b)
print(torch.matmul(ta, tb))
Output:
tensor([[19, 22],
        [43, 50]])
The result is the same as with numpy.
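If you prefer to check the equality programmatically rather than by eye, one possible sketch (reusing a, b, ta and tb from above) is to convert the torch result back to a numpy array and compare element-wise:

# Compare the numpy product with the torch product converted back to numpy
print(np.array_equal(a @ b, torch.matmul(ta, tb).numpy()))   # expected: True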
Now let’s take larger matrices and see how long the multiplication takes in each of three modes:
1) PyTorch on the «cpu» device (numpy-like, CPU-only mode)
2) PyTorch on the «cuda» device (GPU mode)
3) numpy (CPU mode)
import datetime as dt

dtype = torch.float
M = 100

def calc(device):
    # Time torch.matmul on the given device (cpu or cuda)
    a = torch.rand(M, M, device=device, dtype=dtype)
    b = torch.rand(M, M, device=device, dtype=dtype)
    n1 = dt.datetime.now()
    torch.matmul(a, b).size()
    n2 = dt.datetime.now()
    print(device, '\t', M, n2 - n1)

# Time np.matmul on single-precision matrices of the same size
a = np.random.rand(M, M).astype('f')
b = np.random.rand(M, M).astype('f')
n1 = dt.datetime.now()
np.matmul(a, b).size
n2 = dt.datetime.now()
print("numpy", '\t', M, n2 - n1)

calc(torch.device("cpu"))
calc(torch.device("cuda"))
Output for M=100
numpy   100 0:00:00.000190
cpu     100 0:00:00.006382
cuda    100 0:00:00.017093
Output for M=1000
numpy   1000 0:00:00.014212
cpu     1000 0:00:00.064719
cuda    1000 0:00:00.014513
Output for M=10000
numpy   10000 0:00:12.415283
cpu     10000 0:00:12.241426
cuda    10000 0:00:00.036787
For small matrices (M=100) numpy looks like the performance champion: even torch in cpu mode loses to it, and the worst result is observed on the GPU.
For medium-sized matrices (M=1000) the results even out: numpy and torch in cuda mode are practically equal, while torch in cpu mode lags behind.
The GPU has not shown a clear advantage so far, but it makes a real breakthrough for large matrices (M=10000). Here the gap in execution speed is huge: 37 milliseconds for the GPU versus about 12 seconds for numpy.
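One caveat about the measurements above: CUDA kernels are launched asynchronously, and the very first GPU operation also pays a one-time initialization cost, so the small-matrix timings are somewhat unfair to the GPU. A rough sketch of a stricter measurement (the function name calc_cuda_fair is made up here, not part of the code above) could add a warm-up pass and torch.cuda.synchronize():

import datetime as dt
import torch

def calc_cuda_fair(M, dtype=torch.float):
    # Hypothetical variant of calc() that times only the GPU matmul itself
    device = torch.device("cuda")
    a = torch.rand(M, M, device=device, dtype=dtype)
    b = torch.rand(M, M, device=device, dtype=dtype)
    torch.matmul(a, b)          # warm-up: triggers CUDA initialization
    torch.cuda.synchronize()    # wait for all pending GPU work to finish
    n1 = dt.datetime.now()
    torch.matmul(a, b)
    torch.cuda.synchronize()    # make sure the multiplication has completed
    n2 = dt.datetime.now()
    print("cuda (sync)", '\t', M, n2 - n1)

Even with this stricter timing, the overall conclusion stays the same: the GPU only pulls ahead once the matrices are large enough to keep its many cores busy.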
Parallel computing is good
Final comments