One of the big challenges with computer vision machine learning projects is the training process.
Why is it challenging?
Because historically it's been a very manual process that's time-consuming and costly. But I've got some good news for you: you might be able to automate your training process with a new tool called Autodistill. Check out this video to learn more.
"Here's a little computer vision application that's detecting surfers in the ocean.
Now, if you've got some experience writing computer vision applications, you're probably not all that impressed with this little application. I mean, it only took about 20 lines of code to write, which is trivial. But there is an aspect of these types of applications that can be time-consuming and costly. What I'm referring to is the process of training the computer vision algorithm to detect whatever it is you're interested in, such as surfers in this example.
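For context, here's a minimal sketch of what an application like this might look like. It assumes the Ultralytics YOLOv8 package, a trained weights file named best.pt, and a clip named surf.mp4; all of those file names are placeholders, not taken from the video.

```python
import cv2
from ultralytics import YOLO

# Load a trained detection model (file name is a placeholder).
model = YOLO("best.pt")

# Open a video of surfers (file name is a placeholder).
cap = cv2.VideoCapture("surf.mp4")

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of the video

    # Detect objects in the frame and draw the resulting boxes.
    results = model(frame)
    annotated = results[0].plot()

    cv2.imshow("Surfer detection", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```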
So, historically, the process you follow to train the computer vision algorithm entails collecting a bunch of images containing the objects you wish to detect; labeling the images, in other words, drawing bounding boxes around the objects you wish to detect and including a label for each object; and then feeding the labeled images into a training process. As an output, you get a weights file, or model, that you can then use in your computer vision application.
Now, as I mentioned a moment ago, this training process can be very expensive and time-consuming because, historically speaking, it has been a largely manual process. Well, that is, until recently.
You see, there are some new tools, which we'll look at in a moment, that allow you to largely automate the training process.
Ok, so what are these new tools I'm referring to?
Well, one of the tools is Grounding DINO, which I talked about in a prior video.
So, what exactly is Grounding DINO? Well, it's a foundational model that's a zero-shot object detector.
Ok, so what's a zero-shot object detector?
Well, it's essentially a model that allows you to perform object detection without the need to train the model.
So, for example, you can pass in a prompt like "all surfers" along with this image, and Grounding DINO will perform the object detection for you.
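If you want a feel for how that prompt-driven detection looks in code, here's a minimal sketch using the autodistill wrapper for Grounding DINO; the image path is a placeholder, and the prompt-to-label mapping mirrors the example from the video.

```python
from autodistill.detection import CaptionOntology
from autodistill_grounding_dino import GroundingDINO

# Map the natural-language prompt to the class name we want in the results.
model = GroundingDINO(ontology=CaptionOntology({"all surfers": "surfers"}))

# Run zero-shot detection on a single image (path is a placeholder).
detections = model.predict("surfers.jpeg")
print(detections)
```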
Ok, so you might be wondering: is this example computer vision application using Grounding DINO? Well, no, not directly, and the reason is that Grounding DINO is relatively slow and can't achieve real-time speeds.
Alright, so are we stuck doing manual training for applications like this that need real-time speeds?
Well, no. There is a way to automate the training process: combine a foundational model like Grounding DINO with a smaller, faster supervised model like YOLOv8.
Let me explain.
Now, as I said a moment ago, Grounding DINO works well; the only problem is that it's pretty slow. But what if we did the following?
What if we collected a bunch of images with surfers in them, and then, instead of manually labeling those images, we used Grounding DINO to label the training images, and then fed these labeled training images into the training process for YOLOv8?
So, with this type of flow, you get the best of both worlds: you eliminate the manual labeling process, and you still get a production model that offers real-time speeds.
Ok, so how do you achieve what I just outlined?
Well, this is where a relatively new package called Autodistill comes into play.
Autodistill uses big, slower foundational models to train smaller, faster supervised models.
Let me show you how you might use Autodistill in a Google Colab notebook; you'll find a link to it in the description below.
So, in this first cell, I'm running the nvidia-smi command to verify I've got access to a GPU. If you're running this notebook and you get a message saying the command is missing, follow the instructions here to change your runtime to include a GPU.
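For reference, that first cell is just the following shell command (in Colab, the leading exclamation mark runs it in a shell):

```python
# Verify that a GPU is attached to this runtime.
!nvidia-smi
```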
In the next cell, I'm storing the current working directory in a variable and creating videos and images directories, which we'll use in a moment.
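A minimal version of that setup cell might look like this; the directory names come from the video, while the variable name is my own.

```python
import os

# Remember the notebook's working directory.
HOME = os.getcwd()

# Create folders for the source videos and the extracted frames.
os.makedirs(os.path.join(HOME, "videos"), exist_ok=True)
os.makedirs(os.path.join(HOME, "images"), exist_ok=True)
```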
Now, in this cell, I'm installing the relevant autodistill packages.
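That install cell likely boils down to a single pip command along these lines, pulling in the core package plus the Grounding DINO and YOLOv8 plugins:

```python
# Install autodistill and the two model plugins used in this walkthrough.
!pip install -q autodistill autodistill-grounding-dino autodistill-yolov8
```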
Next, I'm downloading a few videos of surfers in the ocean.
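If you're reproducing this outside the notebook, any short clips of surfers will do. Here's a sketch with placeholder URLs; the actual video sources aren't shown in the transcript.

```python
import urllib.request

# Placeholder URLs -- substitute links to your own surfer clips.
VIDEO_URLS = [
    "https://example.com/surf_1.mp4",
    "https://example.com/surf_2.mp4",
]

for i, url in enumerate(VIDEO_URLS):
    urllib.request.urlretrieve(url, f"videos/surf_{i}.mp4")
```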
Ok, so what are we doing with these videos?
Well, down in this cell, I'm using cv2, the OpenCV library, to open the videos we just downloaded, and then saving every tenth frame of each video as a JPEG file.
We're going to use these JPEG files as our source training images.
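Here's a sketch of that frame-extraction cell, assuming the clips live in the videos/ directory and the frames are written to images/, as created earlier:

```python
import cv2
import glob

saved = 0
for video_path in glob.glob("videos/*.mp4"):
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of this video
        # Keep every tenth frame as a training image.
        if frame_idx % 10 == 0:
            cv2.imwrite(f"images/frame_{saved:05d}.jpeg", frame)
            saved += 1
        frame_idx += 1
    cap.release()

print(f"Saved {saved} training images")
```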
Now, in this next cell, we're using a foundational model to label our training images.
So, basically, on this line I'm creating an ontology, which is essentially a pairing of the prompt describing the thing you're looking for in the images, "all surfers" in this case, with the term "surfers", which is the actual label name, or class name, we'd like used for all the surfers that are found.
Now on this line, we're instantiating the foundational model, and then down here, we're initiating the labeling process for the training images we just created.
So, to summarize: this cell is using a relatively slow foundational model to label our training images, and the output of running this cell is a training dataset that we can feed into a smaller, faster supervised model, YOLOv8 in our case.
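Putting those pieces together, the labeling cell likely looks something like this; the folder names follow the directories created earlier, and the exact keyword arguments are assumptions based on autodistill's documented API.

```python
from autodistill.detection import CaptionOntology
from autodistill_grounding_dino import GroundingDINO

# Ontology: the prompt we search for, mapped to the class name we want.
ontology = CaptionOntology({"all surfers": "surfers"})

# Instantiate the foundational model with that ontology.
base_model = GroundingDINO(ontology=ontology)

# Label every image in images/ and write a YOLO-format dataset to dataset/.
base_model.label(
    input_folder="./images",
    output_folder="./dataset",
    extension=".jpeg",
)
```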
Now, this cell takes a bit of time to execute, so I'll go ahead and start it and then fast forward to the end.
Ok, in this next cell, we're essentially passing the labeled training dataset, which the last cell created, to YOLOv8's training process.
I'll go ahead and run this cell; it will take a bit of time to execute, so I'll fast forward to the end.
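For reference, the training step via autodistill's YOLOv8 plugin typically looks like the sketch below; the checkpoint and epoch count are example values, not taken from the video.

```python
from autodistill_yolov8 import YOLOv8

# Start from a small pretrained YOLOv8 checkpoint.
target_model = YOLOv8("yolov8n.pt")

# Train on the dataset that Grounding DINO just labeled.
target_model.train("./dataset/data.yaml", epochs=50)
```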
Ok, so at this point, we've completed the training, and we have a model file we can download and use in our computer vision applications.
So, I'll go ahead and download the weights file, move it into my project directory, and verify that the model file name in my code matches the file we just downloaded.
And now I'll go ahead and run this Python script, and check it out: the application is detecting surfers, and we didn't have to perform any manual labeling. Nice.
Now, it's not perfect. For example, not every surfer is detected on every frame, which is to be expected because our training set was pretty small. But these results could obviously be improved by performing additional training.
Hey, here at Mycelial, we're building development tools for machine learning on the edge. More specifically, we're building the data backbone for edge machine learning applications. If you're curious about what we're building, I'd encourage you to join our mailing list to learn more."