Data Format

MSCOCO

This style is used by the YouTube-VOS dataset and thus by the MaskTrackRCNN repo. Details about YouTube-VOS can be found in the other resources section. The labels are in JSON format, and this is what they look like:

    {
        "info" : info,
        "videos" : [video],
        "annotations" : [annotation],
        "categories" : [category],
    }
    video{
        "id" : int,
        "width" : int,
        "height" : int,
        "length" : int,
        "file_names" : [file_name],
    }
    annotation{
        "id" : int, 
        "video_id" : int, 
        "category_id" : int, 
        "segmentations" : [RLE or [polygon] or None], 
        "areas" : [float or None], 
        "bboxes" : [[x,y,width,height] or None], 
        "iscrowd" : 0 or 1,
    }
    category{
        "id" : int, 
        "name" : str, 
        "supercategory" : str,
    }

There are some important notes about this data format:

[1] The category id must start from 1

details

In /mmdet/datasets/kitti.py, the category id is loaded through

    def load_annotations(self, ann_file):
        self.ytvos = YTVOS(ann_file)
        self.cat_ids = self.ytvos.getCatIds()
        self.cat2label = {
            cat_id: i + 1
            for i, cat_id in enumerate(self.cat_ids)
        }

The i returned by enumerate starts from 0, so each cat_id is mapped to i + 1, which starts from 1.
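
For example, with the two categories used later in this write-up, the mapping produced by the snippet above would be as follows (a minimal illustration, not code from the repo):

    cat_ids = [1, 2]  # category ids read from the annotation JSON
    cat2label = {cat_id: i + 1 for i, cat_id in enumerate(cat_ids)}
    # cat2label == {1: 1, 2: 2}; labels start from 1 (in mmdet 1.x, label 0 is
    # typically treated as background)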

The following is the related code in ytvos.py:

    def getCatIds(self, catNms=[], supNms=[], catIds=[]):
        """
        filtering parameters. default skips that filter.
        :param catNms (str array)  : get cats for given cat names
        :param supNms (str array)  : get cats for given supercategory names
        :param catIds (int array)  : get cats for given cat ids
        :return: ids (int array)   : integer array of cat ids
        """
        catNms = catNms if _isArrayLike(catNms) else [catNms]
        supNms = supNms if _isArrayLike(supNms) else [supNms]
        catIds = catIds if _isArrayLike(catIds) else [catIds]

        if len(catNms) == len(supNms) == len(catIds) == 0:
            cats = self.dataset['categories']
        else:
            cats = self.dataset['categories']
            cats = cats if len(catNms) == 0 else [cat for cat in cats if cat['name']          in catNms]
            cats = cats if len(supNms) == 0 else [cat for cat in cats if cat['supercategory'] in supNms]
            cats = cats if len(catIds) == 0 else [cat for cat in cats if cat['id']            in catIds]
        ids = [cat['id'] for cat in cats]
        return ids
    def __init__(self, annotation_file=None):
        ...
        dataset = json.load(open(annotation_file, 'r'))
        self.dataset = dataset
        ...

---END DETAILS---

[2] For each annotation, len(bboxes) = len(segmentations) = len(areas) = the number of frames in that sequence. So if an instance does not appear in frame i, then bboxes[i], segmentations[i], and areas[i] are None, as in the sketch below.
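
For instance, an annotation for an instance that is visible only in the second of three frames would look roughly like this (an illustrative sketch with made-up values, written as a Python dict so the missing entries show up as None):

    annotation = {
        "id": 7,                    # global instance id
        "video_id": 1,
        "category_id": 1,
        "segmentations": [None, {"size": [375, 1242], "counts": "..."}, None],
        "areas": [None, 4809.0, None],
        "bboxes": [None, [1106, 176, 93, 142], None],
        "iscrowd": 0,
    }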

[3] The annotation id corresponds to the global instance id in other datasets. This means one instance id cannot be reused across different sequences/videos.

Kitti MOTS

Kitti MOTS is built on top of the Kitti Tracking 2012 dataset by adding mask annotations. To understand its data format, we need to be familiar with both.

Each video is a sequence of image frames. Kitti provides a txt label file for each sequence, and each row in that file has the following format:

# Values Name        Description
----------------------------------------------------------------------------
   1    frame        Frame within the sequence where the object appears
   1    track id     Unique tracking id of this object within this sequence
   1    type         Describes the type of object: 'Car', 'Van', 'Truck',
                     'Pedestrian', 'Person_sitting', 'Cyclist', 'Tram',
                     'Misc' or 'DontCare'
   1    truncated    Integer (0,1,2) indicating the level of truncation.
                     Note that this is in contrast to the object detection
                     benchmark where truncation is a float in [0,1].
   1    occluded     Integer (0,1,2,3) indicating occlusion state:
                     0 = fully visible, 1 = partly occluded
                     2 = largely occluded, 3 = unknown
   1    alpha        Observation angle of object, ranging [-pi..pi]
   4    bbox         2D bounding box of object in the image (0-based index):
                     contains left, top, right, bottom pixel coordinates
   3    dimensions   3D object dimensions: height, width, length (in meters)
   3    location     3D object location x,y,z in camera coordinates (in meters)
   1    rotation_y   Rotation ry around Y-axis in camera coordinates [-pi..pi]
   1    score        Only for results: Float, indicating confidence in
                     detection, needed for p/r curves, higher is better.
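
As a quick reference, a ground-truth row can be split into the fields above like so (a minimal parsing sketch; the trailing score column only exists in result files, and parse_kitti_tracking_line is just an illustrative name):

    def parse_kitti_tracking_line(line):
        # split one ground-truth row of a Kitti tracking label file into named fields
        f = line.split()
        return {
            'frame': int(f[0]),
            'track_id': int(f[1]),
            'type': f[2],                                # e.g. 'Car', 'Pedestrian', 'DontCare'
            'truncated': int(float(f[3])),
            'occluded': int(f[4]),
            'alpha': float(f[5]),
            'bbox': [float(v) for v in f[6:10]],         # left, top, right, bottom
            'dimensions': [float(v) for v in f[10:13]],  # height, width, length (m)
            'location': [float(v) for v in f[13:16]],    # x, y, z (m)
            'rotation_y': float(f[16]),
        }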

For Kitti MOTS, the mask labels are in either png or txt format. The txt format looks like:

# time_frame obj_id class_id img_height img_width rle
52 1005 1 375 1242 WSV:2d;1O10000O10000O1O100O100O1O100O1000000000000000O100O102N5K00O1O1N2O110OO2O001O1NTga3

# obj_id 10,000 denotes an ignore region and 0 is background
# class and instance id are calculated from obj_id:
class_id = obj_id // 1000
obj_instance_id = obj_id % 1000
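
For example, the obj_id 1005 in the line above decodes to class 1 (car) and instance 5. The rle column is a pycocotools-compatible compressed RLE string, which is what the conversion script below relies on; a small worked sketch:

    from pycocotools import mask as maskUtils

    obj_id = 1005
    class_id = obj_id // 1000         # -> 1, i.e. car in Kitti MOTS
    obj_instance_id = obj_id % 1000   # -> 5, the 5th car instance in this sequence
    # obj_id == 10000 would mark an ignore region (10000 // 1000 == 10)

    # decode the rle column into a binary mask, given the image size from the same line
    rle = {'size': [375, 1242],
           'counts': b'WSV:2d;1O10000O10000O1O100O100O1O100O1000000000000000O100O102N5K00O1O1N2O110OO2O001O1NTga3'}
    binary_mask = maskUtils.decode(rle)   # uint8 array of shape (375, 1242)
    area = float(maskUtils.area(rle))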

Dataset Conversion

In order to use another dataset with MaskTrackRCNN, we first need to convert it to MSCOCO style and then decide how to split it into train/val/test sets. How to do the split depends on whether the dataset has an official split requirement. Here are two code samples that convert Kitti MOTS to MSCOCO style and split it into train/test sets by sequence.

linkAndRenameFiles.py: prepares the data for mots2cocoVer2.py. It links the images from the per-sequence folders into one combined folder, renaming e.g. 0000/000000.png to 0000_000000.png (the %s_%06d.png naming that mots2cocoVer2.py expects).

import os

# link every frame <sequence>/<filename> into one combined folder as <sequence>_<filename>
cmd = 'ln -s %s/%s %s/%s_%s'
# path = '/home/liz220/Documents/dataset/MOTS/training/image_02'
# des = '/home/liz220/Documents/dataset/MOTS/training/image_combine'
path = '/home/liz220/Documents/dataset/MOTS/Annotations/training/instances'
des = '/home/liz220/Documents/dataset/MOTS/Annotations/training/instances_combine'

os.makedirs(des, exist_ok=True)  # make sure the destination folder exists

for root, dirs, files in os.walk(path):
    dirs.sort()
    for dirname in dirs:
        current_path = os.path.join(path, dirname)
        for subroot, _, subfiles in os.walk(current_path):
            for filename in subfiles:
                os.system(cmd % (current_path, filename, des, dirname, filename))


mots2cocoVer2.py: converts Kitti MOTS to MSCOCO-style per-image annotations (cocosplit_modified.py below then regroups them into the per-video format described above).

import argparse
import json
import sys
import os
import cv2
import numpy as np
from tqdm import tqdm
from PIL import Image
import imagesize
import random
from pycocotools import mask as maskUtils 


def parseNonemptyArgs():
    parser = argparse.ArgumentParser(description='Convert dataset')
    parser.add_argument(
        '--outdir', help="output dir for outputVer2.json files", default=".", type=str)
    parser.add_argument(
        '--mask_label_dir', help="data dir for mask annotations to be converted",
        default='/home/liz220/Documents/dataset/MOTS/Annotations/training/instances_txt', type=str)
    parser.add_argument(
        '--bbox_label_dir', help="data dir for ground truth bbox annotations to be converted",
        default='/home/liz220/Documents/dataset/MOTS/Annotations/training/label_02', type=str)
    parser.add_argument(
        '--img_label_dir', help="data dir for original images in %s_%06d.png naming format",
        default='/home/liz220/Documents/code/MaskTrackRCNN/data/MOTS/images/image_combine', type=str)
    return parser.parse_args()


def main():
    next_img_id = 0
    filename2imageid = {}
    next_instance_id = 0
    local2globalinstanceid = {}

    ann_dict = {}
    bbox_dict = {}
    images = []
    annotations = []

    source_label = {'Car': 1, 'Pedestrian': 2}
    to_dest_label = {1:2, 2:1}
    dest_label = {'person': 1, 'car': 2}

    args = parseNonemptyArgs()


    # build the filename-to-image-id map using all png files, in case some frames have no label
    for _, _, filenames in os.walk(args.img_label_dir):
        filenames.sort()
        for file_name in filenames:
            image_id = next_img_id
            filename2imageid[file_name] = next_img_id
            next_img_id+=1

            image = {}
            image['id'] = image_id
            image['width'], image['height']= list(map(int,imagesize.get(os.path.join(args.img_label_dir, file_name))))
            image['file_name'] = file_name
            image['seg_file_name'] = file_name
            images.append(image)

    for _, _, filenames in os.walk(args.mask_label_dir):
        filenames.sort()

        for filename in tqdm(filenames):
            print('Processed %s' % filename)

            sequence = filename[:-4]
            mask_label_file_path = os.path.join(args.mask_label_dir, filename)
            bboxs_label_file_path = os.path.join(args.bbox_label_dir, filename)

            # load bbox labels
            with open (bboxs_label_file_path,"r",encoding='utf-8') as bbox_source:
                # time_frame id class_id img_height img_width rle
                for line in bbox_source.readlines():
                    line = line.split(" ")
                    if(source_label.get(line[2],-1)==-1):
                        continue
                    frame_id, local_instance_id, category, left, top, right, bottom = int(line[0]), int(line[1]), to_dest_label[source_label[line[2]]], float(line[6]), float(line[7]), float(line[8]), float(line[9])
                    bbox_key = "%s_%04d_%04d"%(sequence,frame_id,local_instance_id)
                    bbox_dict[bbox_key] = [frame_id, local_instance_id, category, left, top, right, bottom]


            # load mask labels
            with open (mask_label_file_path,"r",encoding='utf-8') as mask_source:
                # time_frame id class_id img_height img_width rle
                for line in mask_source.readlines():
                    frame_id, obj_id, class_id, img_height, img_width, rle = line.strip("\n").split(" ")

                    assert(int(class_id) == int(obj_id) // 1000)
                    local_instance_id = int(obj_id) % 1000

                    ignored_region_id = 10000
                    if(int(obj_id) == ignored_region_id):
                        continue

                    file_name = "%s_%06d.png"%(sequence,int(frame_id))
                    image_id = filename2imageid.get(file_name, -1)
                    assert(image_id != -1)

                    instance_id_key = "%s_%04d"%(sequence,local_instance_id)
                    global_instance_id = local2globalinstanceid.get(instance_id_key,-1) 
                    if(global_instance_id == -1):
                        global_instance_id = next_instance_id
                        local2globalinstanceid[instance_id_key] = next_instance_id
                        next_instance_id+=1

                    bbox_key = "%s_%04d_%04d"%(sequence,int(frame_id),local_instance_id)
                    bbox_info = bbox_dict.get(bbox_key, None)
                    if bbox_info!=None:
                        _frame_id, _local_instance_id, category, left, top, right, bottom = bbox_info
                        assert(_frame_id == int(frame_id) and _local_instance_id == local_instance_id and category == to_dest_label[int(class_id)])

                        mask = {'size': [int(img_height), int(img_width)], 'counts': rle.encode(encoding='UTF-8')}
                        bbox = xyxy_to_xywh([left, top, right, bottom])
                        area = float(maskUtils.area(mask))

                        ann = {}

                        use_polygon = False
                        if use_polygon:
                            ground_truth_binary_mask = maskUtils.decode(mask)
                            contours, _ = cv2.findContours(ground_truth_binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_TC89_L1)

                            if contours == []:
                                print('Warning: empty contours.')
                                continue

                            new_contours = []
                            for x in range(np.shape(contours)[0]):
                                if np.size(contours[x]) < 6:
                                    print('3 points required for polygon')
                                    continue
                                new_contours.append( np.reshape(contours[x],(np.size(contours[x]))).tolist())

                            if new_contours == []:
                                continue

                            ann['segmentation'] = new_contours
                        else:
                            ann['segmentation'] = {'size': [int(img_height), int(img_width)], 'counts': rle}

                        # if random.random() < 0.01:  
                        #     displaybbox(maskUtils.decode(mask),[left, top, right, bottom],bbox_key,to_dest_label[int(class_id)])

                        ann['id'] = str(global_instance_id)
                        ann['image_id'] = image_id
                        ann['category_id'] = to_dest_label[int(class_id)]
                        ann['iscrowd'] = 1
                        ann['area'] = area # changed later
                        ann['bbox'] = bbox #xywh_box
                        annotations.append(ann)

        ann_dict['images'] = images
        categories = [{"id": dest_label[name], "name": name} for name in dest_label.keys()]
        ann_dict['categories'] = categories
        ann_dict['annotations'] = annotations
        saveAnnotationAsJson(ann_dict,args.outdir)


def saveAnnotationAsJson(ann_dict, out_dir):
    with open(os.path.join(out_dir, 'outputVer2.json'), 'w') as outfile:
        outfile.write(json.dumps(ann_dict))

def xyxy_to_xywh(xyxy_box):
    xmin, ymin, xmax, ymax = xyxy_box
    TO_REMOVE = 1
    xywh_box = (xmin, ymin, xmax - xmin + TO_REMOVE, ymax - ymin + TO_REMOVE)
    return xywh_box

def displaybbox(mask, xyxy_box, name, category):
    '''visualize the contours and bbox for testing purposs'''
    height, width = mask.shape
    mask*=255 # use the fact that rle bits are either 0 or 1
    canvas = np.asarray(Image.fromarray(mask.astype(np.uint8)))
    cv2.cvtColor(canvas, cv2.COLOR_GRAY2BGR)
    x1,y1,x2,y2 = list(map(int,xyxy_box))
    cv2.rectangle(canvas, (x1, y1), (x2, y2), 255 , 1)
    cv2.putText(canvas, str(category), (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, 255, 1)
    cv2.imwrite('%s.png'%(name),canvas)


if __name__ == '__main__':
    main()


We need to split the dataset into two sets: a train set and a validation set. If the dataset you are working on has an official split, you need to follow it. For Kitti tracking, we separate sequences [0, 1, 3, 4, 5, 9, 11, 12, 15, 17, 19, 20] for training and [2, 6, 7, 8, 10, 13, 14, 16, 18] for testing.

cocosplit_modified.py: splits the MSCOCO-style json file into train/val files, regrouping the per-image annotations into the per-video format described above.

import json
import argparse
import funcy
from tqdm import tqdm
import numpy as np
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser(description='Splits COCO annotations file into training and test sets.')
parser.add_argument('--annotations', metavar='coco_annotations', type=str, default="../MOTS_v2_rle.json",
                    help='Path to COCO annotations file.')
parser.add_argument('--train', type=str, default="../instances_train_sub.json", help='Where to store COCO training annotations')
parser.add_argument('--test', type=str, default="../instances_val_sub.json", help='Where to store COCO test annotations')
parser.add_argument('--having-annotations', dest='having_annotations', action='store_true',
                    help='Ignore all images without annotations. Keep only those with at least one annotation')
parser.add_argument('--remove-car-label', dest='remove_car_label', action='store_true',
                    help='Ignore all car annotations. Keep only those that are pedestrians')

args = parser.parse_args()

instance_id_counter = 0
def getNewInstanceId():
    global instance_id_counter
    current_id = instance_id_counter
    instance_id_counter += 1
    return current_id

def save_coco(file, videos, annotations, categories):
    with open(file, 'wt', encoding='UTF-8') as coco:
        json.dump({ 'videos': videos, 'annotations': annotations, 'categories': categories}, coco, indent=2, sort_keys=True)

def extract_video_sequence(images, sequence, sequence_id):
     imgs = funcy.lfilter(lambda a: int(a['file_name'][:4]) == sequence, images)
     ids = funcy.lmap(lambda i: i["id"], imgs)
     file_names = funcy.lmap(lambda i: i["file_name"], imgs)
     file_names.sort()
     ids.sort()
     return ids, {"id": sequence_id, "width":imgs[0]["width"], "height":imgs[0]["height"], "length": len(imgs), "file_names": file_names }

def main(args):
    with open(args.annotations, 'rt', encoding='UTF-8') as annotations:
        coco = json.load(annotations)
        images = coco['images']
        annotations = coco['annotations']
        categories = coco['categories']
        person_id, car_id = 1, 2 # hard-coded
        train_sequence_split = [0, 1, 4, 9, 11, 12, 13, 15, 17, 19]
        test_sequence_split = [2, 7, 10, 14, 16]


        # print(images[0])
        # {'id': 0, 'height': 375, 'width': 1242, 'file_name': '0000_000000.png', 'seg_file_name': '0000_000000.png'}
        # print(annotations[0])
        # {'id': 0, 'image_id': 0, 'segmentation': [[1139,..., 176]] or {'size': [int(img_height), int(img_width)], 'counts': rle)}, 'category_id': 1, 'iscrowd': 0, 'area': 4809.0, 'bbox': [1106, 176, 93, 142]}

        if args.remove_car_label:
            annotations = funcy.lremove(lambda i: int(i['category_id']) == car_id, annotations)

        if args.having_annotations:
            images_with_annotations = funcy.lmap(lambda a: int(a['image_id']), annotations)
            images = funcy.lremove(lambda i: i['id'] not in images_with_annotations, images)

        train_video = []
        train_ann = []

        for sequence_id, sequence in enumerate(tqdm(train_sequence_split),start=1):
            ids, video= extract_video_sequence(images,sequence, sequence_id)
            train_video.append(video)

            anns_in_sequence = funcy.lfilter(lambda a: int(a['image_id']) in ids, annotations)
            unique_ann_ids_in_sequence = set(funcy.lmap(lambda i: i["id"], anns_in_sequence))
            # print(len(unique_ann_ids_in_sequence),len(anns_in_sequence))

            for instance_id in unique_ann_ids_in_sequence:
                segmentations = []
                areas = []
                bboxes = []
                labels = []
                for id in ids:
                    a = funcy.lfilter(lambda a: int(a['image_id']) == int(id) and int(a['id']) == int(instance_id), anns_in_sequence)
                    if a == []:
                        segmentations.append(None)
                        areas.append(None)
                        bboxes.append(None)
                    else:
                        segmentations.append(a[0]["segmentation"])
                        areas.append(a[0]["area"])
                        bboxes.append(a[0]["bbox"])
                        labels.append(a[0]['category_id'])

                category_id = np.unique(labels)
                assert(len(category_id) == 1)

                train_ann.append({"id": getNewInstanceId(), "video_id": sequence_id, "category_id": int(category_id[0]), "segmentations": segmentations, "areas": areas, "bboxes": bboxes, "iscrowd": 0})      
        save_coco(args.train, train_video, train_ann, categories)   

        test_video = []
        test_ann = []
        for sequence_id, sequence in enumerate(tqdm(test_sequence_split),start=1):
            ids, video= extract_video_sequence(images,sequence, sequence_id)
            test_video.append(video)

            anns_in_sequence = funcy.lfilter(lambda a: int(a['image_id']) in ids, annotations)
            unique_ann_ids_in_sequence = set(funcy.lmap(lambda i: i["id"], anns_in_sequence))
            # print(len(unique_ann_ids_in_sequence),len(anns_in_sequence))

            for instance_id in unique_ann_ids_in_sequence:
                segmentations = []
                areas = []
                bboxes = []
                labels = []
                for id in ids:
                    a = funcy.lfilter(lambda a: int(a['image_id']) == int(id) and int(a['id']) == int(instance_id), anns_in_sequence)
                    if a == []:
                        segmentations.append(None)
                        areas.append(None)
                        bboxes.append(None)
                    else:
                        segmentations.append(a[0]["segmentation"])
                        areas.append(a[0]["area"])
                        bboxes.append(a[0]["bbox"])
                        labels.append(a[0]['category_id'])

                category_id = np.unique(labels)
                assert(len(category_id) == 1)

                test_ann.append({"id": getNewInstanceId(), "video_id": sequence_id, "category_id": int(category_id[0]), "segmentations": segmentations, "areas": areas, "bboxes": bboxes, "iscrowd": 0})
        save_coco(args.test, test_video, test_ann, categories)

        # print("Saved {} entries in {} and {} in {}".format(len(img_train), args.train, len(img_test), args.test))


if __name__ == "__main__":
    main(args)


Then you can put the dataset and labels in the data folder, following the MaskTrackRCNN repo's layout:

mmdetection
├── mmdet
├── tools
├── configs
├── data
│   ├── train
│   ├── val
│   ├── annotations
│   │   ├── instances_train_sub.json
│   │   ├── instances_val_sub.json

Images can take up a lot of space, so they are usually stored in the /data folder on the server, and we do not want to keep multiple copies of them. Instead, we can create soft links to the files/folders. The command is ln -s path/to/source path/to/dest.