Computer Vision for Robotics
Computer vision gives robots the ability to understand their environment from camera images: detect obstacles, estimate depth, track motion, and recognize objects. It bridges raw pixel data and actionable spatial information.
Why It Matters
Cameras are cheap, information-dense sensors. A single RGB camera provides more data per frame than a LiDAR scan. Visual odometry, object detection, and depth estimation enable autonomous navigation, manipulation, and human-robot interaction — all from commodity hardware.
Pinhole Camera Model
The foundation of geometric vision. A 3D point (X, Y, Z) projects onto the image plane at (u, v):
u = fx · X/Z + cx
v = fy · Y/Z + cy

Camera matrix:

K = [fx  0  cx]
    [ 0  fy cy]
    [ 0   0  1]

fx, fy = focal lengths in pixels
cx, cy = principal point (usually near the image center)
```python
import numpy as np

# Intrinsics: focal length 500 px, principal point at the center of a 640×480 image
K = np.array([[500,   0, 320],
              [  0, 500, 240],
              [  0,   0,   1]], dtype=float)

# Project a 3D point (camera frame, metres) to a pixel
point_3d = np.array([1.0, 0.5, 3.0])  # X, Y, Z
pixel = K @ point_3d / point_3d[2]    # homogeneous projection, then divide by Z
print(f"Projects to pixel ({pixel[0]:.0f}, {pixel[1]:.0f})")
```

Lens distortion (radial and tangential) must be corrected for accurate geometry. OpenCV's undistort() handles this using calibration parameters.
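The projection can also be inverted: given a pixel and a known depth, the same intrinsics recover the 3D point. A minimal numpy sketch using the K defined above (the helper name backproject is illustrative, not an OpenCV function):

```python
import numpy as np

K = np.array([[500,   0, 320],
              [  0, 500, 240],
              [  0,   0,   1]], dtype=float)

def backproject(u, v, depth, K):
    """Pixel (u, v) plus known depth -> 3D point in the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.array([X, Y, depth])

point = backproject(486.67, 323.33, 3.0, K)  # inverts the projection example above
```

This is the core of turning a depth map into a point cloud: apply it to every pixel with a valid depth value.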
Depth Estimation
Stereo Vision
Two cameras separated by a known baseline. The horizontal pixel difference (disparity) between matched points gives depth:
depth = focal_length × baseline / disparity
Example: f = 500 pixels, B = 0.12 m, disparity = 50 pixels →
depth = 500 × 0.12 / 50 = 1.2 m
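Because disparity is quantized in pixels, depth resolution degrades quadratically with range: differentiating depth = f·B/d gives Δz ≈ z² · Δd / (f·B). A quick sketch with the numbers above (the one-pixel error step Δd = 1 is an assumption for illustration):

```python
f, B = 500.0, 0.12      # focal length (px) and baseline (m) from the example above

def depth_step(z, delta_d=1.0):
    """Depth change caused by a delta_d-pixel disparity error at range z (metres)."""
    return z * z * delta_d / (f * B)

print(depth_step(1.2))   # ~0.024 m of uncertainty at 1.2 m range
print(depth_step(10.0))  # ~1.67 m at 10 m -- why stereo range is baseline-limited
```

This is the quantitative version of the "baseline limits range" entry in the table below: doubling the baseline halves the depth error at a given range.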
```python
import cv2
import numpy as np

# OpenCV stereo block matching
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left_gray, right_gray)  # fixed-point: value = disparity × 16
disp = disparity.astype(np.float32) / 16.0         # convert to float pixels
valid = disp > 0                                   # non-positive values = no match
depth_map = np.zeros_like(disp)
depth_map[valid] = f * baseline / disp[valid]
```

Other Depth Methods
| Method | How | Pros | Cons |
|---|---|---|---|
| Stereo | Two cameras, triangulation | Passive, outdoor-ready | Needs texture, baseline limits range |
| Structured light | Project pattern, measure distortion | Very accurate indoors | Fails in sunlight (IR interference) |
| Time-of-flight | Measure light round-trip time | Per-pixel depth, fast | Short range (5-10m), noisy edges |
| Monocular CNN | Neural network estimates depth from single image | Only needs one camera | Less accurate, scale-ambiguous |
Feature Detection and Matching
Detect distinctive points (corners, blobs) that can be tracked across frames or matched between images.
```python
import cv2

# ORB: fast, rotation-invariant, binary descriptor
orb = cv2.ORB_create(nfeatures=500)
kp1, desc1 = orb.detectAndCompute(img1, None)
kp2, desc2 = orb.detectAndCompute(img2, None)

# Match binary descriptors with Hamming distance
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(desc1, desc2)
matches = sorted(matches, key=lambda m: m.distance)[:50]  # keep the 50 best
```

| Detector | Speed | Invariance | Use Case |
|---|---|---|---|
| ORB | Very fast | Rotation + scale | Real-time SLAM, embedded |
| SIFT | Slow | Rotation + scale + lighting | Offline matching, panoramas |
| FAST | Very fast | Minimal | Real-time corner detection |
| SuperPoint | Medium (CNN) | Learned invariance | State-of-the-art visual SLAM |
Visual Odometry
Estimate camera/robot motion from sequential images:
1. Detect features in frame N (ORB, FAST)
2. Track/match features to frame N+1
3. Compute Essential matrix E from matched points
4. Decompose E into rotation R and translation t
5. Scale translation using known world measurement or other sensor
6. Accumulate: T_total = T_total × [R|t]
Scale ambiguity: monocular VO can only recover motion up to an unknown scale factor. Solutions: known object size, IMU integration, or stereo camera.
ArUco Markers
Square fiducial markers with known size → precise 6-DOF pose estimation:
```python
import cv2

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(aruco_dict)
corners, ids, _ = detector.detectMarkers(image)

# Estimate pose (rotation + translation relative to camera).
# Note: deprecated in newer OpenCV releases in favour of cv2.solvePnP
# with the marker's corner coordinates as object points.
rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
    corners, marker_size_meters, camera_matrix, dist_coeffs)
```

Used for: drone landing pads, robot calibration, ground truth for testing, indoor localization.
Optical Flow
Estimate per-pixel motion between frames. Dense flow gives a motion vector for every pixel; sparse flow tracks specific features.
```python
import cv2

# Lucas-Kanade sparse optical flow
# p0 = points to track in the previous frame (e.g. from cv2.goodFeaturesToTrack)
p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
# p1 = where those points moved in the current frame; status flags failed tracks
# velocity = (p1 - p0) / dt
```

Used for: ground speed estimation (downward-facing camera on drones), motion segmentation, obstacle detection via motion divergence.
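For the ground-speed case, a downward-facing camera at known altitude converts pixel flow to metres per second via the pinhole model: one pixel of motion corresponds to z/f metres on flat ground. A numpy sketch (the altitude, frame rate, and point values are assumptions for illustration; ground_velocity is a hypothetical helper):

```python
import numpy as np

f = 500.0     # focal length, pixels
z = 2.0       # altitude above flat ground, metres (e.g. from a rangefinder)
dt = 1 / 30   # frame interval, seconds

def ground_velocity(p0, p1):
    """Median flow of tracked points -> ground velocity (m/s), pinhole model."""
    flow_px = np.median(p1 - p0, axis=0)   # median is robust to a few bad tracks
    return flow_px * z / (f * dt)

p0 = np.array([[100.0, 200.0], [300.0, 240.0]])
p1 = p0 + np.array([5.0, 0.0])             # 5 px shift per frame along image x
print(ground_velocity(p0, p1))             # -> [0.6, 0.0] m/s
```

Using the median rather than the mean keeps a handful of mistracked points (flagged by status above) from corrupting the estimate.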
Related
- SLAM — visual SLAM uses features, odometry, and loop closure
- Path Planning — vision provides obstacle maps for planning
- Sensor Fusion — visual odometry fused with IMU for robust estimation
- IMU and Sensors — camera complements inertial data
- Kinematics — visual servoing uses vision to guide robot arms