How to get the most out of your data

by Todd Gleed | December 14, 2021 | Data | 10 min read

The data you use is the most important part of building a computer vision model for object detection. The goal of this tutorial is to show you a few ways to make sure you are getting the most out of the data you have. First, I’ll demonstrate a couple of techniques for stretching your data. Next, I’ll provide some tools for analyzing the distribution of your data. Finally, I’ll go over some scenarios you may run into when working with your data. Wherever possible I’ve included links to scripts hosted on GitHub that will help you manage your data. Keep in mind these are not polished tools; they are more like quick fixes that I have found useful recently.

Dataset Size

Collecting or acquiring data that is relevant to your use case can be difficult, time-consuming, and in some cases, impossible. However, we all know that the more examples of each target object you have available to use in training, the better. So, if you have fewer examples of the target objects than you would like for your training session, below are a couple of techniques for multiplying your data.


Data augmentation is the most common way to enhance your dataset. There is a lot of information on augmentation available online, so I’m not going to get deep into the technical aspects of it. The truth is that augmentation is rather straightforward. In fact, simple augmentations such as rotation, flipping, or resizing are built into common computer vision model training frameworks such as PyTorch and TensorFlow. These augmentations are applied randomly and on the fly during training. To perform more complex augmentations, such as changing color or saturation or adding noise, there are many third-party libraries to choose from.

These all have mostly the same capabilities; the differences are the language in which they are written and the processing method. Most of my experience is with Imgaug, so that is the library used in the provided sample code, but you can find documentation for any of them. Here is a link to the script that I use: https://github.com/alwaysai/data-management-tools/blob/main/augment_images.py

My colleague Lila has configured a few augmentation permutations and added some CLI functionality to simplify the process a little. After cloning the repository, run the script using the standard command, and include the required flag ‘--input_dir’ along with at least one augmentation flag. The following augmentation permutation flags are built into the script, but you are not limited to these:

  • --rotate_180
  • --darken - darkens to (0.7, 0.8)
  • --rotate_90_darken - darkens to (0.7, 0.8)
  • --rotate_180_darken - darkens to (0.7, 0.8)
  • --brighten
  • --rotate_brighten
  • --blur
  • --rotate_180_blur
  • --rotate_270_darken
  • --greyscale
  • --rotate_90_grayscale
  • --rotate_180_grayscale
  • --grayscale_darken - darkens to (0.7, 0.8)
  • --grayscale_brighten - brightens to (1.4, 1.6)
  • --grayscale_blur
  • --rotate_270_greyscale
  • --zoom - zooms the image to x=(0.6, 1.0), y=(0.6, 1.0)
  • --all - performs all of the above augmentations

For each augmentation flag that you include in the command, a new image will be generated, along with a new annotation file, in PascalVOC format. So, an example command that generates a brightened image as well as an image that has been rotated and turned grayscale would be:

python augment_images.py --input_dir path/to/dir --brighten --rotate_180_grayscale
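
Geometric augmentations such as rotation have to move the bounding boxes in the new annotation file as well as the pixels. Here is a minimal sketch of that bookkeeping for a 180-degree rotation, in plain Python. The function name rotate_box_180 and the (xmin, ymin, xmax, ymax) tuple layout are assumptions made for illustration; this is not code from augment_images.py.

```python
def rotate_box_180(box, img_w, img_h):
    """Map an (xmin, ymin, xmax, ymax) box onto an image rotated by 180 degrees."""
    xmin, ymin, xmax, ymax = box
    # A 180-degree rotation sends the point (x, y) to (img_w - x, img_h - y),
    # so the old max corner becomes the new min corner and vice versa.
    return (img_w - xmax, img_h - ymax, img_w - xmin, img_h - ymin)


# Hypothetical box on a 640x480 image.
rotated = rotate_box_180((10, 20, 110, 220), img_w=640, img_h=480)
print(rotated)  # (530, 260, 630, 460)
```

Rotating twice should return the original box, which is a cheap sanity check for this kind of transform.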

This is a fast way to increase the number of images and annotations in your dataset. If you apply the same transformations to every image, you can quickly increase your training data by 2, 6, or 10 times. But it is important to consider the value that each augmentation adds to the dataset. I vary the specific augmentations based on the available data, use case, and environment. If the model is to be used in the daytime or in well-lit environments, I will apply more light-based augmentations, such as brightening or saturation, based on my current images. We haven't built in saturation, but it is available in Imgaug.

I've included the default values for the flags that take them, but these can be changed in the code. For example, in the code for brighten, increasing the numbers will brighten the image further:

if brighten or aug_all: req.iaa.Multiply((1.4, 1.6))
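
To make the effect of that range concrete, here is a rough pure-Python sketch of a multiplicative brightness change: one factor is drawn from the range, every pixel value is scaled by it, and the result is clipped to 255. The function name brighten_pixels and the flat list of pixel values are assumptions for illustration; the script itself relies on Imgaug rather than code like this.

```python
import random


def brighten_pixels(pixels, factor_range=(1.4, 1.6), rng=None):
    """Scale 8-bit pixel values by one random factor drawn from factor_range."""
    rng = rng or random.Random()
    # A single factor is sampled for the whole image, not one per pixel.
    factor = rng.uniform(*factor_range)
    # Clip at 255 so bright pixels saturate instead of overflowing.
    return [min(255, int(round(p * factor))) for p in pixels]


# Hypothetical pixel values; a seeded generator keeps the run repeatable.
print(brighten_pixels([0, 50, 100, 200], rng=random.Random(0)))
```

Raising both numbers in the range pushes more pixels into the clipped region, which is why very high values wash out detail.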

Synthetic Data

Synthetic data is another way to increase the size of your dataset. There are a couple of different ways to incorporate synthetic data into your project. I have done a little experimentation with purely synthetic data, where you use video-game-style mesh renderings and generated backgrounds to create a dataset for object detection. There is a good Unity tutorial on this that I had moderate success with. The problem for me was the work involved in generating the assets, which takes a lot of time. This tutorial will show you how to do a different kind of synthetic data generation, where I use a few varied examples of objects to generate many more. It is also time-consuming but can really save you time when it comes to annotation.

The first step is to gather examples of your target objects. Like traditional datasets, try to capture images or videos from an environment that is like the environment in which you will be running your model. Try to vary lighting, positions, and placement until you feel you have captured all sides of each object. I have been experimenting with convenience store items, like soda and candy and chips, and feel like I have a good mix with about 20 images.

The next step is to remove the background from your objects. To be clear, the goal is a transparent background except for the target object. For this you can use any program you are comfortable with - Photoshop, PowerPoint, Preview. I have explored some online services for this as well. Below is an example of what you start with and what you end with. You need to save the final image as a .png in order to preserve the transparency.

Make sure to crop the image so that only the object is in frame. More specifically, crop the image as if you were drawing a bounding box around the object. Don’t leave extra space. We use this technique in the script for generating bounding boxes, so don’t skip this part.
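
If you want to check that a cutout is cropped tightly, the idea can be sketched as finding the smallest box that contains every non-transparent pixel. The example below works on a small alpha mask written as nested lists so that it runs without any image library. The function name tight_bbox and the mask layout are assumptions for illustration, not part of synth.py.

```python
def tight_bbox(alpha):
    """Return (xmin, ymin, xmax, ymax) around non-transparent pixels, or None."""
    rows = [y for y, row in enumerate(alpha) if any(a > 0 for a in row)]
    if not rows:
        return None  # fully transparent (or empty) image
    cols = [x for x in range(len(alpha[0])) if any(row[x] > 0 for row in alpha)]
    # xmax and ymax are exclusive, so width = xmax - xmin.
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


# Hypothetical 5x4 alpha channel: 0 is transparent, 255 is opaque.
alpha = [
    [0, 0, 0, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
print(tight_bbox(alpha))  # (1, 1, 4, 3)
```

If the box returned for your cutout is smaller than the image itself, there is still transparent padding to crop away.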

Access the project here. This project contains a sample .xml file that is used for the annotations, and the main script, synth.py. The script pastes your example objects onto background images of your choice. It is not an elegant script by any means, so I’ll walk through how to use it.

The script looks for a folder containing the images you want to use as backgrounds. These images should not contain any examples of the objects you intend to detect, because we are not going to be annotating them. Try to include images that are varied and represent your target environment. Point the path variable at this folder:

path = r"/Path/to/your/backgrounds/"

Next, create a folder that contains all the different objects you want to train your object detection model to detect. Each object needs its own folder within this Classes folder, so the folder structure should look like this: Classes>class_name>images. The folder name class_name should be unique for each object, as we use it as the label for the class when training. Then set the images variable to point to this folder. In this case the folder is in the same directory as the script, but you need to create it yourself.

images = glob.glob('./Classes/**/*/*.png', recursive=True)
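
Because the folder name doubles as the class label, the label can be read straight from each image path. The sketch below builds a throwaway Classes folder in a temporary directory and recovers the labels. The helper name label_for, the class names, and the simplified glob pattern are assumptions for illustration and may differ from what synth.py does.

```python
import glob
import os
import tempfile
from pathlib import Path


def label_for(image_path):
    """Use the name of the folder that holds the image as its class label."""
    return Path(image_path).parent.name


with tempfile.TemporaryDirectory() as root:
    # Build a hypothetical Classes>class_name>images layout with empty files.
    for class_name in ("soda", "chips"):
        class_dir = os.path.join(root, "Classes", class_name)
        os.makedirs(class_dir)
        open(os.path.join(class_dir, "example.png"), "wb").close()

    found = glob.glob(os.path.join(root, "Classes", "*", "*.png"))
    labels = sorted(label_for(p) for p in found)

print(labels)  # ['chips', 'soda']
```

A misspelled or duplicated folder name would silently become a separate class, so it is worth printing the labels once before generating a large dataset.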

When we run the script, it will generate images and their associated annotation files. For this to work properly, ensure that there is a folder called JPEGImages in the same directory as the script. In addition, if one does not exist, create a folder named Annotations and change the variable annotation_path to point to it.

Before running the script, set the variable numImages. The script will create this number of images by randomly selecting an image from the folder containing your backgrounds and then overlaying several images from the folders contained in your Classes folder. For each image, it generates a corresponding .xml file containing the annotations for that image in PascalVOC format. You will end up with two folders, Annotations and JPEGImages, which you can compress together and use for training with the alwaysAI Model Training Toolkit.
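
For readers who have not seen a PascalVOC annotation before, here is a minimal sketch that builds one with Python's standard library. It writes only the core fields (filename, size, and one object entry per box). The function name voc_annotation, the file name, and the box values are hypothetical, and synth.py starts from the sample .xml file rather than building the tree from scratch like this.

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, objects):
    """Build a minimal PascalVOC-style annotation and return it as a string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    # One <object> element per pasted item: its label and pixel bounding box.
    for label, (xmin, ymin, xmax, ymax) in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = voc_annotation("synth_0001.jpg", 640, 480,
                          [("soda", (100, 120, 180, 300))])
print(xml_text)
```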

Like I said, this is still in an experimental phase, and just a quick hack. I currently have the number of objects per image set to be random, with a maximum of 5. This can be hardcoded, increased, or decreased by setting the variable a. I also try to space out the placed images, because I have no mechanism to recognize whether there is already an image placed in a location and to adjust the bounding boxes accordingly. Without the logic to spread out the images, we might get an object completely occluded by another but retaining its bounding box, resulting in a mislabeled object. You can see an example of a generated image below.
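
One possible way to add the missing overlap check is rejection sampling: pick a random position, and keep it only if the new box does not intersect any box already placed. The sketch below is an alternative I am suggesting for illustration, not what synth.py currently does. The names overlaps and place_objects are hypothetical, and it assumes every object is smaller than the background.

```python
import random


def overlaps(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_objects(sizes, bg_w, bg_h, rng, max_tries=100):
    """Choose a non-overlapping position for each (width, height) object."""
    placed = []
    for w, h in sizes:
        for _ in range(max_tries):
            x = rng.randint(0, bg_w - w)
            y = rng.randint(0, bg_h - h)
            box = (x, y, x + w, y + h)
            if not any(overlaps(box, other) for other in placed):
                placed.append(box)
                break
        # If no free spot turns up within max_tries, the object is skipped.
    return placed


boxes = place_objects([(100, 200), (80, 80), (120, 60)], 640, 480,
                      random.Random(1))
print(boxes)
```

Skipping an object that cannot be placed keeps the annotations honest at the cost of slightly fewer objects per image.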

Dataset Analysis

The last couple of tools that I’ve included are useful for analyzing your dataset: the class balancer and a way to test your annotations. Both are relatively straightforward. Each of these assumes that you already have a dataset, and that it is in the format that we use in the alwaysAI Model Training Toolkit. Details here.

The script test_annotations.py does exactly what it says. It will plot the annotations contained in your Annotations folder directly on to your images. I use this as a sanity check. If you are running into challenges with performance training your model, or you have manually converted a dataset, or gotten a dataset from an unknown source, this script will help you confirm that the annotations you are using are accurate. In the past I have made mistakes with xmin and xmax, and run into an EXIF data rotation issue, and in both cases visualizing the bounding boxes helped to diagnose the problem.
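
Some of these mistakes can also be caught without plotting anything. The sketch below parses a PascalVOC annotation and reports boxes whose min and max are swapped, or that fall outside the image, which is one way a rotation mismatch can show up. It is a complementary check under my own assumptions, not part of test_annotations.py; the function name find_bad_boxes and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation: one good box, one swapped box, one out of bounds.
SAMPLE = """<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object><name>soda</name><bndbox>
    <xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox></object>
  <object><name>chips</name><bndbox>
    <xmin>300</xmin><ymin>50</ymin><xmax>200</xmax><ymax>150</ymax></bndbox></object>
  <object><name>candy</name><bndbox>
    <xmin>400</xmin><ymin>100</ymin><xmax>470</xmax><ymax>600</ymax></bndbox></object>
</annotation>"""


def find_bad_boxes(xml_text):
    """Return (label, problem) pairs for boxes that look wrong."""
    root = ET.fromstring(xml_text)
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    problems = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        xmin, ymin, xmax, ymax = (
            int(float(obj.findtext(f"bndbox/{tag}")))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        if xmin >= xmax or ymin >= ymax:
            problems.append((name, "min/max swapped or empty"))
        elif xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append((name, "outside image"))
    return problems


print(find_bad_boxes(SAMPLE))
# [('chips', 'min/max swapped or empty'), ('candy', 'outside image')]
```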

There are 3 flags to use with the command to run test_annotations.py: --input_dir, --output_dir, and -sample. Input and output are self-explanatory; just put the path to your dataset .zip for input and the path to a folder to use as output. Sample defines the rate at which to sample your images, because datasets can be large, and we don’t need to see every image to confirm the accuracy of the bounding boxes. The number after the -sample flag is the sample rate, meaning the value 10 will plot bounding boxes on every 10th image, and 1 will plot every image.

python test_annotations.py --input_dir path/to/dir --output_dir path/to/dir
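
The sample rate itself is simple to picture: keep every Nth file from the list. Here is a small sketch, assuming sampling starts at the first image (the script may start elsewhere); the function name sample_every and the file names are hypothetical.

```python
def sample_every(items, rate):
    """Keep every rate-th item, starting with the first one."""
    if rate < 1:
        raise ValueError("rate must be at least 1")
    return items[::rate]


# Hypothetical dataset of 25 images.
names = [f"img_{i:03d}.jpg" for i in range(25)]
print(sample_every(names, 10))  # ['img_000.jpg', 'img_010.jpg', 'img_020.jpg']
```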

The last script I want to talk about is class_balancer.py. This script analyzes your dataset and gives you the total number of examples of each class contained within. You will need to specify the input directory with the '--input-dir' flag; this should point to a directory that follows the alwaysAI dataset format but is not compressed. You can use the '--partition' flag to move the data you would like to hold out into a new directory; this directory will be called ‘holdout’, unless you specify a name with the '--ouput_dir' flag.

This script is an alwaysAI application script, so you will need to create a new alwaysAI project (see here) and replace app.py in the file alwaysai.app.json with class_balancer.py. Then run the app as normal with the command aai app start.

The result of running this project is three lists, the first two containing class names followed by a number, and the third a list of file names. The first list is the current distribution of classes. It counts and displays all the examples of each object contained in the dataset. The second list contains the optimal distribution possible based on the current images in your dataset, with optimal being the same number of examples of each class. The script has determined how close you can get to an optimal distribution, and which files you would need to delete to get there, which is the content of the third list.

Since data is hard to come by, you most likely don’t want to delete images, but rather use the top list as a guide for adding examples of classes to try to achieve a better distribution.
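
Here is a minimal sketch of that idea: count the examples of each class, then work out how many more of each you would need to collect to match the largest class. This is my own illustration and not the logic inside class_balancer.py; the function name class_deficits and the tiny dataset are hypothetical.

```python
from collections import Counter


def class_deficits(labels_per_image):
    """Count examples per class and how many each needs to match the largest."""
    counts = Counter(
        label for labels in labels_per_image.values() for label in labels
    )
    target = max(counts.values(), default=0)
    deficits = {label: target - n for label, n in counts.items()}
    return counts, deficits


# Hypothetical dataset: file name mapped to the labels annotated in it.
dataset = {
    "img_001.jpg": ["soda", "chips"],
    "img_002.jpg": ["soda"],
    "img_003.jpg": ["soda", "candy", "candy"],
}
counts, deficits = class_deficits(dataset)
print(dict(counts))  # {'soda': 3, 'chips': 1, 'candy': 2}
print(deficits)      # {'soda': 0, 'chips': 2, 'candy': 1}
```

The deficits give you a shopping list for new data, or for which classes to favor when generating augmented or synthetic images.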


Creating and managing datasets is hard, and often unwieldy. These tools can help you in your dataset generation process, but keep in mind they are in a primitive state. As they evolve we will bring them into the alwaysAI platform and continue to improve the Model Training experience.
