Solve ‘MemoryError: Unable to allocate XXXGiB for an array’ by setting Overcommit and Swap
Problem Description
In my recent RL project, I need to create a multidimensional NumPy array for a Q-table:
self.qtable = np.zeros((2,2,2,2,2,2,2,2,2,2,10,2,10,2,100000))
However, because the array is so large, the terminal reports an error:
MemoryError: Unable to allocate 305. GiB for an array with shape (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 10, 2, 10, 2, 100000) and data type float64
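As a quick sanity check (a minimal sketch; the shape is copied from the line above), the requested size is simply the product of all dimensions times 8 bytes per float64 element:
import numpy as np
shape = (2,) * 10 + (10, 2, 10, 2, 100000)
n_bytes = np.prod(shape, dtype=np.int64) * np.dtype(np.float64).itemsize
print(n_bytes / 2**30)  # ~305.2 GiB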
Step 1: Overcommit handling
First, the error points to my Ubuntu system’s overcommit handling mode. The default mode is 0, which means (quoting the kernel documentation):
Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.
Since 305 GiB is much greater than my actual physical memory of 32 GiB, this obvious overcommit of address space is refused.
To check the current overcommit mode, run:
$ cat /proc/sys/vm/overcommit_memory
In my situation, the Q-table should stay sparse, so it will not actually occupy the 305 GiB it claims. It is therefore fine to allow the overcommit.
To enable the overcommit, run:
$ sudo -i
$ echo 1 > /proc/sys/vm/overcommit_memory
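Note that writing to /proc only changes the setting until the next reboot. A common way to make it persistent (assuming a standard Ubuntu setup) is via sysctl:
$ sudo sysctl vm.overcommit_memory=1
$ echo 'vm.overcommit_memory = 1' | sudo tee -a /etc/sysctl.conf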
Changing this to 1 means (again from the kernel documentation):
Always overcommit. Appropriate for some scientific applications. A classic example is code using sparse arrays and just relying on the virtual memory consisting almost entirely of zero pages.
Now the system will let you declare an arbitrarily large array without worrying about its size; physical memory pages are only allocated for the entries of the sparse array that are actually written.
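As a rough illustration (a minimal sketch, assuming overcommit mode 1 is already enabled on a Linux machine), the huge Q-table can be created while only the touched pages consume physical memory:
import resource
import numpy as np
# ~305 GiB of virtual address space; the zero pages are mapped lazily.
qtable = np.zeros((2,) * 10 + (10, 2, 10, 2, 100000), dtype=np.float64)
# Touch a single entry: only the corresponding page becomes resident.
qtable[(0,) * 10 + (3, 1, 5, 0, 42)] = 1.0
# Peak resident set size (reported in KiB on Linux) stays tiny compared to 305 GiB.
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)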
Step 2: Swap (virtual memory)
After setting the overcommit handling mode to 1, I can start training my Q-learning model. However, the ‘sparse’ array becomes denser and denser as training progresses, and it finally exhausts my 32 GiB of physical memory at around 4,750,000 epochs (total epochs: 8,000,000). As a result, the system has to terminate the training before it finishes. :(
I need more available memory! I realize that increasing the swap size could help. Swap is a special file (or partition) on your hard disk that the system can use as additional virtual memory.
The default size of the swap on my Ubuntu 20.04 system is 2 GiB, which is insufficient.
To extend the swap space:
- Turn off all swap processes:
$ sudo swapoff -a
- Resize the swap (in my situation, I estimate an additional 32 GiB of virtual memory should be sufficient; a faster alternative is noted after this list):
$ sudo dd if=/dev/zero of=/swapfile bs=1G count=32
Note that: if = input file, of = output file, bs = block size, count = number of blocks.
- Change its permission:
$ sudo chmod 600 /swapfile
(600 means that the owner has full read and write access to the file, while no other user can access it)
- Set the file as swap and activate it:
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
- Check the file /etc/fstab:
$ sudo vi /etc/fstab
Check whether the following line is there. If not, add it at the bottom:
/swapfile none swap sw 0 0
- Check the amount of swap available:
$ grep SwapTotal /proc/meminfo
Or:
$ free -t -m
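(Side note, not part of the original steps: on ext4, fallocate is usually a much faster way than dd to create the 32 GiB file; if mkswap or swapon later complains about holes in the file, fall back to dd.)
$ sudo fallocate -l 32G /swapfile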
After finishing these two steps, I can run my RL model without any memory-related errors.
References
- “Unable to allocate array with shape and data type.” https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type
- “Change swap size in Ubuntu 18.04 or newer.” https://bogdancornianu.com/change-swap-size-in-ubuntu/