ZooScan - Part 4: Using the Swift Vision Framework to Classify Animals
Over the course of the past few posts (see the overview here), we’ve introduced the ZooScan app and developed its UI using SwiftUI. In this fourth part, we will focus on integrating the Swift Vision framework to classify animals based on images captured by the app.
Creating a Protocol to Define Image Classifiers #
The first step is defining a protocol for our animal classification model. By using a standardized interface, we can easily switch between different models in the future if needed. Here’s how we can define the protocol:
import Foundation
import UIKit

struct ClassificationResult: Equatable, Hashable {
    let label: String
    let confidence: Double
}

protocol ImageClassificationModel {
    func classify(image: UIImage) async -> ClassificationResult?
}
In the code snippet above, we define two components. The ClassificationResult struct represents the result of an image classification, containing a label and a confidence score. The ImageClassificationModel protocol declares a single method, classify(image:), which takes a UIImage and returns an optional ClassificationResult. The result is optional so that we can return nil when no classification could be made. This gives us a flexible way to swap in different image classification models in the future.
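To see what calling such a model looks like from the call site, here is a minimal usage sketch; the function and parameter names are just for illustration:

// A minimal usage sketch: `model` can be any type conforming to
// ImageClassificationModel, and `capturedImage` is a UIImage you already
// have. Both names are illustrative.
func describeClassification(of capturedImage: UIImage, using model: any ImageClassificationModel) async {
    if let result = await model.classify(image: capturedImage) {
        print("Classified as \(result.label) (confidence: \(result.confidence))")
    } else {
        print("No classification could be made")
    }
}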
Implementing a concrete model using the Swift Vision framework #
Using this protocol, we can now implement a concrete model that utilizes the Swift Vision framework for image classification. Here’s an example implementation:
import UIKit
import Vision

struct VisionClassificationModel: ImageClassificationModel {
    func classify(image: UIImage) async -> ClassificationResult? {
        var request = ClassifyImageRequest()
        request.cropAndScaleAction = .centerCrop

        guard let data = image.pngData() else {
            return nil
        }

        do {
            let results = try await request.perform(on: data)
                .filter { $0.hasMinimumPrecision(0.1, forRecall: 0.8) }
            guard let bestResult = results.first else {
                return nil
            }
            return ClassificationResult(label: bestResult.identifier, confidence: Double(bestResult.confidence))
        } catch {
            print("Error classifying image: \(error)")
        }
        return nil
    }
}
This shows how easy it is to implement a concrete model using the Vision framework. The VisionClassificationModel struct conforms to the ImageClassificationModel protocol and implements the classify(image:) method. Let’s break down the key parts of this implementation. The first thing we do is create a ClassifyImageRequest instance, which is part of the Vision framework:
var request = ClassifyImageRequest()
request.cropAndScaleAction = .centerCrop
The request is configured to center-crop and scale the image, ensuring that the most relevant part of the image is analyzed. This step prepares the image for classification. The underlying image models, most often convolutional neural networks (CNNs), are designed to work with images of a specific size and aspect ratio, often around 224x224 pixels. Most camera images are a lot larger than that, so the image needs to be resized and cropped to fit the model’s requirements. By centering the crop, we make sure the most important part of the image is retained, which works well for animal classification, where the subject is typically centered in the frame.
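We don’t have to implement any of this cropping ourselves, since the request handles it for us. But to make the idea more concrete, here is a rough sketch of what a centered square crop boils down to, assuming the UIImage has a backing CGImage:

// Illustration only (not used by the app): trim a UIImage to a centered
// square by cropping its underlying CGImage in pixel space.
func centerCroppedSquare(of image: UIImage) -> UIImage? {
    guard let cgImage = image.cgImage else { return nil }
    let side = min(cgImage.width, cgImage.height)
    let cropRect = CGRect(
        x: (cgImage.width - side) / 2,
        y: (cgImage.height - side) / 2,
        width: side,
        height: side
    )
    guard let cropped = cgImage.cropping(to: cropRect) else { return nil }
    return UIImage(cgImage: cropped, scale: image.scale, orientation: image.imageOrientation)
}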
After configuring the request, we need to convert the UIImage to PNG data:
guard let data = image.pngData() else {
    return nil
}
The ClassifyImageRequest can take the image data in various formats. To classify the image, we use an overload of the perform(on:) method that accepts image data. The pngData() method converts the UIImage to PNG format, which is suitable to pass to this method. Because pngData() returns an optional Data object, we use a guard statement to make sure the conversion succeeded. If it fails, we return nil, indicating that the classification could not be performed.
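As a side note, PNG encoding of a full camera photo can produce fairly large data. A sketch of an alternative is to use JPEG encoding instead, which is typically much smaller; this assumes that perform(on:) also accepts JPEG-encoded data, which is worth verifying on your target OS version:

// Possible alternative (assumption: the request also accepts JPEG data):
guard let data = image.jpegData(compressionQuality: 0.9) else {
    return nil
}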
To perform the classification, we call the perform(on:) method of the ClassifyImageRequest, passing in the image data:
let results = try await request.perform(on: data)
    .filter { $0.hasMinimumPrecision(0.1, forRecall: 0.8) }
This method is asynchronous, so we use await to wait for the results. The perform(on:) method returns an array of classification results, which we then filter to keep only classifications that meet a minimum precision and recall threshold. We’ll discuss what these terms mean in a future post; for now, think of the filter as a way to keep only the most relevant classifications. Note that perform(on:) is also a throwing method, so we wrap the call in a do-catch block to handle any errors that may occur during the classification process.
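If you want to experiment with these thresholds, a quick debugging sketch placed right after the filter call can help you see what gets through:

// Debugging sketch: log every observation that survives the filter,
// so you can tune the precision/recall thresholds by eye.
for observation in results {
    print("\(observation.identifier): \(observation.confidence)")
}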
Finally, we pick the first result from the filtered results, which is the best classification based on the confidence score:
guard let bestResult = results.first else {
    return nil
}
return ClassificationResult(label: bestResult.identifier, confidence: Double(bestResult.confidence))
We return a ClassificationResult containing the label and confidence score of the best classification. If no results are found, we return nil, indicating that the classification was not successful.
Adding the Classification Model to the AnimalStore #
With the VisionClassificationModel implemented, we can now integrate it into our app. We will use this model to classify images captured by the user and display the results in the UI. Go to AnimalStore.swift and change the top of the AnimalStore class to look like this:
import Foundation
import UIKit

@Observable
class AnimalStore {
    var animals: [ScannedAnimal] = []
    let classificationModel = VisionClassificationModel()
    ...
}
Here we create an instance of the VisionClassificationModel and store it in a property called classificationModel. To classify the images we capture, change the addAnimal(image:) method so that it runs the image through the classification model and creates a ScannedAnimal instance from the result. The updated method looks like this:
func addAnimal(image: UIImage) {
    Task {
        //let animal = ScannedAnimal(image: image, classification: "Unknown", confidence: 0, isLoved: false)
        guard let animalClassification = await classificationModel.classify(image: image) else {
            return
        }

        let description = descriptions[animalClassification.label] ?? "Description not available."
        let animal = ScannedAnimal(
            image: image,
            classification: animalClassification,
            description: description,
            isLoved: false
        )
        animals.insert(animal, at: 0)
    }
}
The updated method uses the classificationModel to classify the image asynchronously. If the classification succeeds, it creates a new ScannedAnimal instance with the classification result and inserts it at the front of the animals array. If the classification fails (i.e., returns nil), the method simply exits without modifying the array, so only successfully classified animals end up in the store. The work is wrapped in a Task because addAnimal(image:) itself is synchronous and we need an asynchronous context in which to await the classification. With this, we’ve successfully integrated the Vision framework into our ZooScan app, allowing us to classify animals based on images captured by the user. You can now run the app on a physical device and test the classification functionality.
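For reference, the ScannedAnimal model was introduced in the earlier posts of this series. Inferred purely from the initializer we just used, it looks roughly like the sketch below; the Identifiable conformance and the id property are assumptions, and the real definition may differ:

// Rough sketch of ScannedAnimal, inferred from the initializer above.
// The actual definition lives in an earlier post and may differ.
struct ScannedAnimal: Identifiable {
    let id = UUID()                          // assumption: handy for SwiftUI lists
    let image: UIImage
    let classification: ClassificationResult
    let description: String
    var isLoved: Bool
}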
Implementing a Dummy Model for the Simulator #
Unfortunately, the code we just implemented only works on a physical device; image classification with the Vision framework does not work in the simulator. When you run the app in the simulator and try to classify an image, you will get an error that looks something like this:
Error classifying image: internalError("Error Domain=NSOSStatusErrorDomain Code=-1 \"Failed to create espresso context.\" UserInfo={NSLocalizedDescription=Failed to create espresso context.}")
To keep the app testable in the simulator, we need to work around this limitation. We can implement a dummy classification model that simulates the behavior of an image classifier. The dummy model cycles through a fixed array of four animal names (giraffe, penguin, watusi, zebra), returning them in order and starting over when it reaches the end of the array. This way, we can test the app in the simulator without running into errors. Here’s how we can implement this dummy model:
import Foundation
import UIKit

class DummyClassificationModel: ImageClassificationModel {
    private let animals = ["giraffe", "penguin", "watusi", "zebra"]
    private var currentIndex = 0

    func classify(image: UIImage) async -> ClassificationResult? {
        let classification = ClassificationResult(
            label: animals[currentIndex],
            confidence: Double.random(in: 0.7...1.0)
        )
        // Move to the next index, wrapping back to 0 at the end of the array
        currentIndex = (currentIndex + 1) % animals.count
        return classification
    }
}
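To convince yourself of the cycling behavior, a small sketch like this (run from a unit test or any other async context) shows the labels coming back in order:

// The UIImage argument is ignored by the dummy model; any image will do.
let dummy = DummyClassificationModel()
let first = await dummy.classify(image: UIImage())   // label "giraffe"
let second = await dummy.classify(image: UIImage())  // label "penguin"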
The last thing we need to do is add the dummy model to the AnimalStore class. We can use a conditional compilation directive to check whether we are compiling for the simulator or for a physical device. If we are compiling for the simulator, we use the DummyClassificationModel; otherwise, we use the VisionClassificationModel:
#if targetEnvironment(simulator)
let classificationModel = DummyClassificationModel()
#else
let classificationModel = VisionClassificationModel()
#endif
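As a small design note: because both models conform to ImageClassificationModel, you could also type the property as the protocol, so that the rest of AnimalStore never depends on a concrete model. A sketch of that variation:

// Sketch of an alternative: store the model behind the protocol type.
#if targetEnvironment(simulator)
let classificationModel: any ImageClassificationModel = DummyClassificationModel()
#else
let classificationModel: any ImageClassificationModel = VisionClassificationModel()
#endif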
We’ve now finished a basic version of the ZooScan app that we can use in the simulator and on a physical device. You can see that Apple really made it simple for us to add complex functionality to our apps. The image classification only took us a few lines of code to add! Now it’s time to test the app. Take the app for a spin during your next visit to the zoo! Hope you enjoy!
Conclusion #
In this post, we have successfully integrated the Swift Vision framework into our ZooScan app to classify animals based on images captured by the user. We defined a protocol for image classification models, implemented a concrete model using the Vision framework, and integrated it into our app. Additionally, we created a dummy model to allow testing in the simulator. In the next post, we will look at how the app performs in real life and examine the type of model that the Vision framework uses under the hood. Stay tuned for more! 😃