Tutorial: Vision and Core ML - Live Object Detection in iOS 11
June 26th 2017, 1:46 pm

While WWDC 2017 may have been one of the most announcement-packed WWDCs ever, there's no doubt about which features Apple is most proud of and where it sees the biggest opportunities for the future: Augmented Reality using ARKit, Machine Learning with Core ML, and pairing these with Vision and Metal for AR and VR. My initial thought, and I'm sure many other viewers felt the same, was that this technology would be reserved for the experts and fans of really hard sums. But Apple's implementation is mind-blowingly simple, and can be added to almost any app in just a few minutes.

In this tutorial we'll build a Core ML and Vision framework app that detects objects in real time using your phone's camera, similar to the one demoed in WWDC session 506. To begin you'll need the Xcode 9 beta, and you'll need the iOS 11 beta installed on your testing iPhone or iPad, meaning for now this tutorial is just for registered Apple Developers. If you're not one right now, this will all be freely available in the autumn. If you are registered, head over to Apple's Developer Downloads page and grab what you need. Bear in mind that this tutorial is written against iOS 11 Beta 2, so things could change a little by the final release.

This tutorial assumes some basic iOS Swift programming knowledge already, but I've tried to keep it as beginner friendly as possible. Please drop me an email if there's anything unclear.


Setting up the project

For this demo we're going to do the object detection live, showing the camera and overlaying text on the screen telling us what Core ML can see. To begin, create a new project in the Xcode 9 beta and choose a Single View Application. Call your project Object Detector. Head to ViewController.swift and add a couple of @IBOutlets: one for a UIView we're calling cameraView - this will be your 'viewfinder' - and a second for a UILabel called textLayer - this will be our output of what Core ML has detected.
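In code, those two outlets are nothing more than the following (declared as optionals here to match the optional chaining used later; the connections themselves are made in the storyboard in a moment):

@IBOutlet weak var cameraView: UIView?
@IBOutlet weak var textLayer: UILabel?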

Open the Main storyboard and drag on a blank view. Resize it to fit the screen and pin it to the edges. Now drag a label onto the view. Make sure the label sits on the root view, as a sibling in front of the view you just added rather than a subview of it, or it will be obscured by the camera output. Position the label against the top layout guide and stretch it to the left and right margins. Delete the placeholder text, set the text colour to something bright so you'll see it over the image, and set the number of lines to 6. Finally, add suitable constraints to the label. Join up the IBOutlets you made earlier to the view and label.

To use the camera, we need to ask permission first, so head to Info.plist and add a key for 'Privacy - Camera Usage Description'. The string value can be anything you like for now; this is the message shown when asking permission for camera use at first launch.
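The system shows that permission prompt automatically the first time the capture session starts, so nothing else is needed. If you'd rather control when the prompt appears, you could optionally trigger it yourself with something like this (it needs AVFoundation imported, which we do in the next section anyway):

// Optional: ask for camera permission up front instead of waiting
// for the capture session to trigger the prompt on first run
AVCaptureDevice.requestAccess(for: .video) { granted in
    print(granted ? "Camera access granted" : "Camera access denied")
}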


Displaying the camera output

Now, to the code. First we need to create an AVCaptureSession: this provides us with the live camera view, and also feeds the data from the camera into Vision and Core ML. Head back to ViewController.swift and import AVFoundation and Vision. We need to set up a video capture delegate, so add AVCaptureVideoDataOutputSampleBufferDelegate to your class definition. Then add the following code between your IBOutlets and viewDidLoad.

// set up the handler for the captured images
private let visionSequenceHandler = VNSequenceRequestHandler()

// set up the camera preview layer
private lazy var cameraLayer: AVCaptureVideoPreviewLayer = AVCaptureVideoPreviewLayer(session: self.captureSession)

// set up the capture session
private lazy var captureSession: AVCaptureSession = {
    let session = AVCaptureSession()
    session.sessionPreset = AVCaptureSession.Preset.photo

    guard
        // set up the rear camera as the device to capture images from
        let backCamera = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
        let input = try? AVCaptureDeviceInput(device: backCamera)
        else {
            print("no camera is available.")
            return session
    }

    // add the rear camera as the capture device
    session.addInput(input)
    return session
}()


The code above sets up the camera preview layer, then defines the capture session, telling iOS which camera to use, and what quality we're looking for.
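If you want to sanity-check your setup so far, the top of ViewController.swift should look roughly like this sketch (the three properties from the block above go where the comment sits):

import UIKit
import AVFoundation
import Vision

class ViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {

    @IBOutlet weak var cameraView: UIView?
    @IBOutlet weak var textLayer: UILabel?

    // visionSequenceHandler, cameraLayer and captureSession from the block above live here

    // viewDidLoad and the delegate methods follow
}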

Now we've set that up, we need to add the camera preview layer to the display, and tell iOS to start capturing from the camera. Add the following code to viewDidLoad.

// add the camera preview
self.cameraView?.layer.addSublayer(self.cameraLayer)

// set up the delegate to handle the images to be fed to Core ML
let videoOutput = AVCaptureVideoDataOutput()

// we want to process the image buffer and ML off the main thread
videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "DispatchQueue"))

self.captureSession.addOutput(videoOutput)

// make the camera output fill the screen
cameraLayer.videoGravity = AVLayerVideoGravity.resizeAspectFill

// begin the session
self.captureSession.startRunning()


Then to make sure the viewfinder fills the screen, add a viewDidLayoutSubviews override.

override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()

    // make sure the layer is the correct size
    // (the layer lives inside cameraView, so we match its bounds rather than its frame)
    self.cameraLayer.frame = self.cameraView?.bounds ?? .zero
}


Set your target to run on your testing device (as this uses the camera, you'll just get a blank screen in the simulator) and hit Build and Run. After granting access to the camera, you'll be greeted with a full-screen view from your device's rear camera.

Missing something? Get the code up to this point on Github.


Adding Core ML Magic

First of all, we need a machine learning model. Fortunately that's as simple as downloading one from Apple's website. Head over there now and download the Inception v3 model, then add the file to your project.

We now need to get the data from the AVCaptureSession and feed it to our Inception v3 model. We do this using an AVCaptureVideoDataOutputSampleBufferDelegate method.

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {

    // Get the pixel buffer from the capture session
    guard let pixelBuffer: CVPixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    // load the Core ML model
    guard let visionModel: VNCoreMLModel = try? VNCoreMLModel(for: Inceptionv3().model) else { return }

    // set up the classification request
    let classificationRequest = VNCoreMLRequest(model: visionModel, completionHandler: handleClassification)

    // automatically resize the image from the pixel buffer to fit what the model needs
    classificationRequest.imageCropAndScaleOption = .centerCrop

    // perform the machine learning classification
    do {
        try self.visionSequenceHandler.perform([classificationRequest], on: pixelBuffer)
    } catch {
        print("Throws: \(error)")
    }
}


Before we handle the results Core ML gives us, here are two helper functions: one to update the UI, and one to map a Core ML result into something more readable.

func updateClassificationLabel(labelString: String) {
    // We processed the capture session and Core ML off the main thread, so the completion handler
    // was called on that same background thread. Remember to hop back to the main thread to update the UI.
    DispatchQueue.main.async {
        self.textLayer?.text = labelString
    }
}

func textForClassification(classification: VNClassificationObservation) -> String {
    // Map the VNClassificationObservation to a human-readable string
    let pc = Int(classification.confidence * 100)
    return "\(classification.identifier)\nConfidence: \(pc)%"
}


Finally, we handle the Core ML results themselves. Here we make a few quick checks to be sure everything is as we expect, then filter down to the confidence level and data that we want. For this example we just want to display every object we've found, so the only filtering we do is to discard results with a confidence below 20%. In the real world you'll probably want to look for specific objects, so use .filter here to narrow those down. More on that below.

func handleClassification(request: VNRequest, error: Error?) {
    guard let observations = request.results else {
        // Nothing has been returned, so clear the label.
        updateClassificationLabel(labelString: "")
        return
    }

    let classifications = observations.prefix(3) // take just the top 3 results, ignoring the rest
        .flatMap({ $0 as? VNClassificationObservation }) // discard anything that isn't a classification
        .filter({ $0.confidence > 0.2 }) // discard anything with less than 20% confidence
        .map(self.textForClassification) // get the text to display
        // Filter further here if you're looking for specific objects

    if !classifications.isEmpty {
        // update the label to display what we found
        updateClassificationLabel(labelString: classifications.joined(separator: "\n"))
    } else {
        // nothing matches our criteria, so clear the label
        updateClassificationLabel(labelString: "")
    }
}


We only looked at the top 3 results provided by the machine learning model, because for this project we just want the most probable matches. The observations are provided in decreasing order of confidence, so anything further down the list is likely to be a false positive. Click on the model in the sidebar and you'll see Apple lists a 'top-5 error' rate in the model description: the percentage of test images for which the correct label did not appear in the model's top five predictions. In real-world situations using a live camera, this error rate is likely to be higher.



Iterating through all of the model's guesses is going to use more resources, and is unlikely to yield many more true positives. That said, if you're looking for specific items and doing your own filtering, you may find better end results by looking at more of the model's output.




Where to go from here

If you are looking for specific objects, I'd suggest cutting the confidence threshold right down to 10% or even less, then adding an extra filter step with your own custom logic to narrow down what you're looking for, as in the sketch below.
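As a rough sketch, if you only cared about dogs, say, the pipeline inside handleClassification might become something like this (the 'dog' check is purely illustrative; swap in whatever identifiers matter to you):

let classifications = observations.prefix(10) // look a little deeper into the results
    .flatMap({ $0 as? VNClassificationObservation }) // discard anything that isn't a classification
    .filter({ $0.confidence > 0.1 }) // a much lower confidence cut-off
    .filter({ $0.identifier.lowercased().contains("dog") }) // your own custom logic goes here
    .map(self.textForClassification) // get the text to display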
I found the Inception v3 model to sometimes be too specific, and sometimes not specific enough. It all depends on the category of object you're looking for, and whether the model has been trained on that category. It's worth checking out other models to make sure you get the one that works best for your project. Apple provide some popular models ready to drop into your project; just change the model name in the line guard let visionModel: VNCoreMLModel = try? VNCoreMLModel(for: Inceptionv3().model) else { return } to match the name of the model you choose. If you find another model elsewhere that does the job, Apple provide Core ML Tools to convert machine learning models to the Core ML format.
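For example, if you dropped in Apple's ResNet50 model instead (assuming Xcode generates a class named Resnet50 from the .mlmodel file, since the class name is derived from the file name), the line would become:

guard let visionModel: VNCoreMLModel = try? VNCoreMLModel(for: Resnet50().model) else { return }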

I hope you found this quick introduction to Core ML and Vision helpful. If you've got anything to add, or any questions, please let me know. I'd also love to see any apps you make using these frameworks, so please drop me a link when yours is on the App Store.

Full tutorial code available on Github
