Object Detection and YOLO

Hello 👋👋, welcome to our blog post on object detection using YOLO (You Only Look Once). YOLO is considered one of the state-of-the-art (SOTA) algorithms for object detection. Here’s the original paper; it might seem difficult to read and understand, partly because of its format. So here’s our blog to help you out.

Let’s discuss object detection first, and then jump into the details of YOLO. In brief, object detection is a computer vision task that focuses on predicting the presence and location of objects in an image. That’s a bit too terse, so let’s dig deeper.

Object Detection

Object detection is a supervised computer vision task that involves predicting the bounding-box coordinates (in pixels) of each object in an image, along with the class it belongs to. When the goal is to locate just a single object in the image, the task is usually referred to as object localization.

How is training data labelled?

In object detection datasets, each data point specifies the following.

  • Path to Image
  • Presence of Object
  • X_min
  • Y_min
  • X_max
  • Y_max
  • Target Value

Have a look at the sample data point in the image below.

Sample Data point

From the dataset, we can conclude that object detection is a mix of classification and regression. We need to predict whether there is an object in the given image and, if so, predict its class and the pixel coordinates of its bounding box.

Now that we understand object detection well, let’s see how to implement it. There are different ways of doing it, of which YOLO is one. Others include sliding window detection.
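To make the label format concrete, here is a minimal sketch of one data point as a Python record. The field names simply mirror the list above, and the file path and numbers are made up for illustration; real datasets each have their own file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One labelled data point (hypothetical layout mirroring the list above)."""
    image_path: str
    has_object: bool
    x_min: Optional[int] = None   # left edge of the box, in pixels
    y_min: Optional[int] = None   # top edge of the box, in pixels
    x_max: Optional[int] = None   # right edge of the box, in pixels
    y_max: Optional[int] = None   # bottom edge of the box, in pixels
    target: Optional[str] = None  # class label of the object


# Made-up example values, purely for illustration.
sample = Annotation("images/dog_001.jpg", True, 48, 60, 210, 300, "dog")
box_width = sample.x_max - sample.x_min
box_height = sample.y_max - sample.y_min
print(sample.target, box_width, box_height)
```

The presence flag and the target are the classification part; the four coordinates are the regression part.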



Unlike other algorithms, YOLO performs detection in a single pass: it requires only one network evaluation per image. At the time of writing, YOLO has four versions, each improving on the performance of the previous one. In this article, we will cover the base model and mention the improvements made in later versions.

The YOLO network treats object detection as a regression problem. It divides the image into regions and predicts bounding boxes and class probabilities for each region.

Let’s explore YOLO like any other neural network. We will consider the following while exploring You Only Look Once 😜.

What’s YOLO?

“You Only Look Once” refers to our ability to detect objects in real life at first glance, the moment we take in a snapshot of our surroundings. Object detection algorithms like R-CNN use a pipeline of several separate stages to perform inference, which hurts speed when deployed in real time.

The idea behind YOLO is this: we divide the image into an S×S grid. If the center of an object falls into a particular grid cell, that cell is responsible for detecting the object. Each grid cell predicts a set of bounding boxes, along with a confidence value for each predicted box. The confidence should be zero if the box does not contain an object; otherwise, it should equal the IOU between the predicted box and the ground-truth box.
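Here is a small sketch of the "which grid cell is responsible?" step. The function name, the corner-format box, and the 448×448 image size in the example are our own assumptions for illustration.

```python
def responsible_cell(x_min, y_min, x_max, y_max, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell that contains the box center.

    Rows and columns are 0-indexed; S is the grid size.
    """
    center_x = (x_min + x_max) / 2.0
    center_y = (y_min + y_max) / 2.0
    # Scale the center to grid units, then clamp so a center lying exactly
    # on the right or bottom image border still maps to the last cell.
    col = min(int(center_x / img_w * S), S - 1)
    row = min(int(center_y / img_h * S), S - 1)
    return row, col


print(responsible_cell(100, 150, 300, 350, img_w=448, img_h=448))  # (3, 3)
```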

What’s IOU?

IOU stands for Intersection over Union: we take the area of the intersection of two bounding boxes and divide it by the area of their union. For a box that contains an object, this value serves as the target confidence for that grid cell’s prediction.
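A minimal sketch of the IOU computation, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the function name is our own.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes overlapping in a 5x5 square: 25 / (100 + 100 - 25)
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

Identical boxes give 1.0 and disjoint boxes give 0.0, so the value always lies between 0 and 1.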


What exactly do we predict?

Our network outputs an S × S × (B ∗ 5 + C) tensor, where B is the number of bounding boxes predicted per grid cell, each consisting of x, y, w, h and confidence, and C is the number of class probabilities.

  • (x, y) – center of the bounding box, relative to the bounds of the grid cell.
  • w, h – width and height, relative to the whole image.
  • confidence – IOU between the predicted and the ground-truth bounding box.
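To see where the numbers come from, here is a rough sketch using the Pascal VOC settings from the paper (S = 7, B = 2, C = 20). The ordering of values inside a cell's vector (boxes first, then class probabilities) and the helper name are assumptions made for illustration.

```python
S, B, C = 7, 2, 20     # grid size, boxes per cell, number of classes
depth = B * 5 + C      # 5 values per box (x, y, w, h, confidence) plus classes


def split_cell(cell_vector, B=2, C=20):
    """Split one grid cell's output into B boxes and C class probabilities."""
    assert len(cell_vector) == B * 5 + C
    boxes = [cell_vector[b * 5:(b + 1) * 5] for b in range(B)]
    class_probs = cell_vector[B * 5:]
    return boxes, class_probs


# A dummy all-zero prediction with the same shape as the network output.
prediction = [[[0.0] * depth for _ in range(S)] for _ in range(S)]
boxes, class_probs = split_cell(prediction[3][3])
print((S, S, depth), len(boxes), len(class_probs))  # (7, 7, 30) 2 20
```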

What if there are multiple detections for the same object?

Non-Max Suppression

Here’s the answer. If multiple bounding boxes are detected for the same object, we keep the one with the highest confidence and suppress the others that overlap it heavily, i.e. those whose IOU with the kept box is high.
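A minimal sketch of greedy non-max suppression for a single class. The 0.5 IOU threshold, the (confidence, box) tuple layout, and the function names are assumptions made for illustration, not values from the paper.

```python
def iou(a, b):
    """IOU of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the most confident boxes and drop heavy overlaps.

    detections: list of (confidence, (x_min, y_min, x_max, y_max)).
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)   # highest confidence still in play
        kept.append(best)
        # Drop every box that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.90, (10, 10, 110, 110)),
    (0.75, (15, 15, 115, 115)),    # near-duplicate of the first box
    (0.60, (200, 200, 300, 300)),  # a different object
]
print(non_max_suppression(detections))
```

In this example the second box is suppressed because it overlaps the first one heavily, while the third box survives because it does not overlap at all.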


The neural architecture of YOLO consists of 24 convolutional layers followed by 2 fully connected layers. The diagram below, from the YOLO paper, is self-explanatory: the two fully connected layers at the end output a 7×7×30 tensor, as the original YOLO implementation uses a 7×7 grid on the Pascal VOC dataset.

However, Fast YOLO (the lite version 😉) consists of 9 convolutional layers followed by 2 fully connected layers.

If you are not aware of how CNNs work, refer to our previous article on ConvNets.

Reference: YOLO Paper


YOLO uses Leaky ReLU in all layers except the final layer, which uses a linear activation function.

Leaky ReLU
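As a quick sketch, Leaky ReLU passes positive inputs through unchanged and scales negative inputs by a small slope. The slope of 0.1 is the value used in the YOLO paper; the function name and the scalar (non-vectorised) form are our own simplification.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: x for positive inputs, negative_slope * x otherwise."""
    return x if x > 0 else negative_slope * x


for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu(value))
```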

Loss Function

YOLO uses sum-squared error as its loss function, with a few tweaks to suit the kind of output the network produces. These are needed because most grid cells in an image contain no object; their confidence targets are zero, and their combined contribution can overpower the gradient from the few cells that do contain objects, which makes training unstable.

So we use two hyperparameters, λcoord and λnoobj: λcoord scales up the loss on bounding-box coordinates for cells that contain an object, and λnoobj scales down the confidence loss for boxes that do not. They are set to 5 and 0.5 respectively.

That’s not the only tweak. Sum-squared error weights the same absolute error equally in large and small boxes, although a small deviation matters much more for a small box than for a large one. So the loss uses the square roots of the width and height instead of the raw values.
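A tiny numeric sketch of why the square root helps. The widths (as fractions of the image width) and the 0.05 error are made-up values, and the helper name is hypothetical.

```python
import math


def size_error(w_true, w_pred, use_sqrt):
    """Squared error on a box width, with or without the square-root trick."""
    if use_sqrt:
        return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    return (w_true - w_pred) ** 2


# Same absolute error (0.05) on a large box and on a small box.
for name, w_true in (("large", 0.80), ("small", 0.10)):
    w_pred = w_true + 0.05
    plain = size_error(w_true, w_pred, use_sqrt=False)
    rooted = size_error(w_true, w_pred, use_sqrt=True)
    print(f"{name}: plain={plain:.6f} sqrt={rooted:.6f}")
```

Without the square root both boxes get the same penalty; with it, the small box is penalised several times more than the large one for the same absolute error.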

Loss function (Original Paper)
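For reference, here is our transcription of the loss from the original paper in LaTeX; please check it against the paper before relying on it. Hatted symbols are ground-truth values, C is the box confidence, p(c) is a class probability, and the indicator 𝟙 is 1 when an object appears in cell i (and box predictor j in that cell is responsible for it) and 0 otherwise.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```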

The coordinate and classification errors are only penalized when an object is present, i.e. when the predictor (grid cell) is responsible for the ground-truth bounding box.

Learning Rate

When trained on the Pascal VOC dataset, YOLO uses a learning rate of 0.01 for the first 75 epochs, 0.001 for the next 30 epochs, and 0.0001 for the final 30 epochs, for a total of 135 epochs, on a model whose convolutional layers are pre-trained on the ImageNet 1000-class dataset.
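The schedule above can be sketched as a simple step function. The function name and the 0-indexed epoch convention are our own assumptions, and the sketch covers only the three stages described here.

```python
def learning_rate(epoch):
    """Step schedule from the text; `epoch` is 0-indexed (0 to 134)."""
    if epoch < 0 or epoch >= 135:
        raise ValueError("epoch must be in the range 0..134")
    if epoch < 75:
        return 1e-2   # first 75 epochs
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs


print([learning_rate(e) for e in (0, 74, 75, 104, 105, 134)])
```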


Predictions in YOLO require only one network evaluation, unlike other SOTA object detection algorithms. This lets YOLO detect objects at 45 fps, while Fast YOLO runs at more than 150 fps.

Where’s the code?


Well, there are many implementations of YOLO out there on the internet, and guess what: the original YOLO work is open source 💕. You can also find many higher-level API implementations of YOLO on GitHub. Here’s the official website of YOLO, which guides you through using it in your own application.

What’s new in v2 and v3?

Let’s cover that in our next article. Until then, stay safe. Cheers..✌
