How to build a dataset for image classification
Your image dataset is your ML tool’s nutrition, so it’s critical to curate digestible data to maximize its performance. A high-quality training dataset enhances the accuracy and speed of your decision-making while lowering the burden on your organization’s resources.
If your training data is reliable, then your classifier will be firing on all cylinders. So let’s dig into the best practices you can adopt to create a powerful dataset for your deep learning model.
How to approach an image classification dataset: Thinking per "label"
The label structure you choose for your training dataset is like the skeletal system of your classifier.
Thus, the first thing to do is to clearly determine the labels you'll need based on your classification goals. Then, you can craft your image dataset accordingly.
In particular, you need to take into account 3 key aspects: the desired level of granularity within each label, the desired number of labels, and what parts of an image fall within the selected labels.
Let's take an example to make these points more concrete.
Let’s say you’re running a high-end automobile store and want to classify your online car inventory. Here are the questions to consider:
1. What is your desired level of granularity within each label?
Do you want to have a deeper layer of classification to detect not just the car brand, but specific models within each brand or models of different colors?
2. What is your desired number of labels for classification?
How many brands do you want your algorithm to classify? Porsche and Ferrari? Or Porsche, Ferrari, and Lamborghini? If you also want to classify the models of each car brand, how many of them do you want to include?
It is important to underline that your desired number of labels must be always greater than 1. Even when you're interested in classifying just Ferraris, you'll need to teach the model to label non-Ferrari cars as well.
3. Which part of the images do you want to be recognized within the selected label?
Do you want to train your dataset to exclusively tag as Ferraris full pictures of Ferrari models? Or do you want a broader filter that recognizes and tags as Ferraris photos featuring just a part of them (e.g. the headlight view)?
Clearly answering these questions is key when it comes to building a dataset for your classifier. Without a clear per label perspective, you may only be able to tap into a highly limited set of benefits from your model.
Indeed, your label definitions directly influence the number and variety of images needed for running a smoothly performing classifier. Let's see how and why in the next chapter.
How much training data do you need?
In general, when it comes to machine learning, the richer your dataset, the better your model performs.
In addition, the number of data points should be similar across classes in order to ensure the balancing of the dataset.
However, how you define your labels will impact the minimum requirements in terms of dataset size. In particular:
- A rule of thumb on our platform is to have a minimum number of 100 images per each class you want to detect. In many cases, however, more data per class is required to achieve high-performing systems. If you seek to classify a higher number of labels, then you must adjust your image dataset accordingly.
- If you’re aiming for greater granularity within a class, then you need a higher number of pictures. You need to ensure meeting the threshold of at least 100 images for each added sub-label.
- The more items (e.g. headlight view, the whole car, rearview, ...) you want to fit into a class, the higher the number of images you need to ensure your model performs optimally. Again, a healthy benchmark would be a minimum of 100 images per each item that you intend to fit into a label.
Before diving into the next chapter, it's important you remember that 100 images per class are just a rule of thumb that suggests a minimum amount of images for your dataset. Depending on your use-case, you might need more.
Unfortunately, there is no way to determine in advance the exact amount of images you'll need. Just use the highest amount of data available to you. Then, test your model performance and if it's not performing well you probably need more data.
Is your training image dataset diverse enough?
Logically, when you seek to increase the number of labels, their granularity, and items for classification in your model, the variety of your dataset must be higher. You need to include in your image dataset each element you want to take into account.
In addition, there is another, less obvious, factor to consider. This is intrinsic to the nature of the label you have chosen. Indeed, the more an object you want to classify appears in reality with different variations, the more diverse your image dataset should be since you need to take into account these differences.
Let’s follow up on the example of the automobile store owner who wants to classify different cars that fall within the Ferraris and Porsche brands.
Now, classifying them merely by sourcing images of red Ferraris and black Porsches in your dataset is clearly not enough. You need to take into account a number of different nuances that fall within the 2 classes.
In reality, these labels appear in different colors and models.
Thus, you need to collect images of Ferraris and Porsches in different colors for your training dataset. Otherwise, your model will fail to account for these color differences under the same target label. Even worse, your classifier will mislabel a black Ferrari as a Porsche.
Similarly, you must further diversify your dataset by including pictures of various models of Ferraris and Porsches, even if you're not interested specifically in classifying models as sub-labels.
Summing up in an example
The example below summarizes the concepts explained above. The dataset you'll need to create a performing model depends on your goal, the related labels, and their nature:
Beyond thinking per label: Features and quality of images
Now, you are familiar with the essential gameplan for structuring your image dataset according to your labels.
Next, you must be aware of the challenges that might arise when it comes to the features and quality of images used for your training model.
Challenges & best practices
Here are some common challenges to be mindful of while finalizing your training image dataset:
- Variable Viewpoints - Objects can be oriented in many ways with respect to the camera.
- Variable Scales - Objects may often appear to be different in size based on the perspective of the camera that captured them.
- Lack of Visibility - The objects being analyzed may be hidden from view with only a few portions of the object visible.
- Lighting conditions – Lighting effects make a tremendous impact on the pixel level.
The points above threaten the performance of your image classification model. Indeed, it might not ensure consistent and accurate predictions under different lighting conditions, viewpoints, shapes, etc.
So how can you build a constantly high-performing model? The answer is always the same: train it on more and diverse data.
In particular, you have to follow these practices to train and implement them effectively:
- Collect images of the object from different angles and perspectives.
- Gather images of the object in variable lighting conditions.
- Gather images with different object sizes and distances for greater variance.
- Ensure your future input images are clearly visible. Otherwise, train the model to classify objects that are partially visible by using low-visibility datapoints in your training dataset
Don't forget the technical features
Besides considering different conditions under which pictures can be taken, it is important to keep in mind some purely technical aspects. Indeed, the size and sharpness of images influence model performance as well.
You should follow these best practices:
Collect high-quality images - An image with low definition makes analyzing it more difficult for the model. Just like for the human eye, if a model wants to recognize something in a picture, it's easier if that picture is sharp.
Avoid images with excessive size: You should limit the data size of your images to avoid extensive upload times. Many AI models resize images to only 224x224 pixels. Thus, uploading large-sized picture files would take much more time without any benefit to the results.
The path forward
Now comes the exciting part! Once you have prepared a rich and diverse training dataset, the bulk of your workload is done. You can say goodbye to tedious manual labeling and launch your automated custom image classifier in less than one hour.