Convert Mask from Numpy Array to RLE (Run-Length Encoding) for CVAT

March 4, 2025 Total Views for This Post:

What is RLE?

When reading a mask in the format of numpy.array line by line, you often encounter long sequences of consecutive 0 (or False) followed by sequences of 1 (or True). Based on this observation, we can first define a bounding box that encloses the mask. Then, we read the values line by line within this box and record the counts of consecutive values.

For example, consider the following mask enclosed in a box:

1 2	[[0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]

Reading this mask line by line, we get the sequence: 3*0, 6*1, 5*0, 7*1, 3*0. This sequence can be efficiently encoded as the RLE string: '3,6,5,7,3'.

If the mask starts with a 1 instead of a 0, we prepend a 0 to the RLE string to indicate that there are zero 0s at the beginning. This is the standard rule for converting a mask to RLE format.

Standard Implementation

def binary_image_mask_to_cvat_rle(mask: np.ndarray) -> dict:

    # Get box information
    yy, xx = np.nonzero(mask)
    top, left = np.min(yy), np.min(xx)
    bottom, right = np.max(yy), np.max(xx)
    height, width = bottom - top + 1, right - left + 1

    rle = []
    offset = 0 # How many consecutive '0's or '1's
    value = 0 # Are we counting '0' or '1'

    # Read line by line
    for y in range(top, top + height):
        for x in range(left, left + width):
            if mask[y][x] == value:
                offset += 1
            else:
                rle.append(offset) # Save to RLE
                offset = 1 # Reset 'offset'
                value = 1 - value # Flip 'value'
    if offset > 0:
        rle.append(offset)

    return {"rle": rle, "left": left, "top": top, "width": width, "height": height}

The code requires iterating over the entire box, resulting in a loop that runs height * width times. Converting a single mask to RLE takes about 0.02 seconds on my device. When processing masks in batches, you might need to wait for tens of seconds in the CVAT interface just for format conversion. We can replace the above method with a parallelizable approach.

Time-efficient Implementation

First, we need to determine which positions in the mask transition from 0 to 1 and which positions transition from 1 to 0.

bbox_mask = mask[top:bottom + 1, left:right + 1]
flat_mask = bbox_mask.flatten()  # transform the mask to a 1D array

diff = np.diff(flat_mask, prepend=0) # compute the differences
changes = np.where(np.isin(diff, [1, -1]))[0]   # find where the value changes

Due to the use of prepend=0, changes=0 if flat_mask[0]=1. This is great because the first element in changes will always stands for the number of 0s in the mask.

Except for the first element, which represents the count of 0s, the difference between each pair of adjacent elements in changes represents the count of consecutive 0s or 1s. The difference between the last element and the length of flat_mask is the last number to be added to the RLE.

rle = []
rle.append(changes[0])
for i in range(len(changes)-1):
    rle.append(changes[i+1] - changes[i])
rle.append(len(flat_mask)-changes[-1])

The time-efficient implementation:

def binary_image_mask_to_cvat_rle(mask: np.ndarray) -> dict:
    """
    Optimized version of converting binary image mask to CVAT RLE.
    """
    yy, xx = np.nonzero(mask)
    if len(yy) == 0: 
        return {"size": mask.shape, "counts": []}

    top, left = np.min(yy), np.min(xx)
    bottom, right = np.max(yy), np.max(xx)
    height, width = bottom - top + 1, right - left + 1

    bbox_mask = mask[top:bottom + 1, left:right + 1]
    flat_mask = bbox_mask.flatten()

    diff = np.diff(flat_mask, prepend=0) 
    changes = np.where(np.isin(diff, [1, -1]))[0] 

    rle = []
    rle.append(changes[0])
    for i in range(len(changes)-1):
        rle.append(changes[i+1] - changes[i])
    rle.append(len(flat_mask)-changes[-1])
    
    return {"rle": rle, "left": left, "top": top, "width": width, "height": height}