A visual reading of Perception-Based UAV Path Planning for Fruit Harvesting
Cover · 1 · The problem · 2 · Decoupled · 3 · Learning to plan · 4 · What worked · Terminology
A visual reading · v0.1 · 2026

Perception-
Based UAV
Path Planning

How do you fly a drone with an arm to a half-hidden orange in a fake tree, when you have nothing but a camera, a depth sensor, and onboard compute the size of a deck of cards?
Source · S. Kothiyal, JHU 2021
Advisor · Prof. Marin Kobilarov
Apparatus · DJI Matrice 100 + RealSense D435i
Outcome · 97/102 with the classical pipeline (84 clean + 13 after reset); 77.5% in sim with the learned one
FIG. 0 · Apparatus, exploded isometric view. DJI Matrice 100 quadrotor frame (~3 kg, 4 BLDC motors) · Intel NUC i7 onboard computer (Ubuntu 14.04, ROS) · RealSense D435i (RGB + active IR stereo, 640×480 @ 15 FPS, depth range 0.3–10 m) · fixed 3D-printed arm (constant r, no arm motion: ṙ = 0) · passive magnetic gripper (magnet + LSM303, grasp detected by magnetic-field jump) · OptiTrack markers for 6-camera mocap odometry · magnet-equipped orange as detachable target, repositioned each trial.

What this is

Siddharth Kothiyal's master's thesis at Johns Hopkins teaches a flying robot to pick fruit. It does this twice — once with a stack of classical computer-vision and optimal-control parts bolted together, and once with a single neural network trained to imitate an expert. The two attempts make a clean compare-and-contrast for the central question of modern robotics: where do you draw the line between the perception system and the planner?

I've redrawn the thesis from scratch as a four-chapter visual reader, so you can absorb the geometry, the network architecture, and the result of each ablation without re-reading 67 pages of LaTeX. The original is excellent; this is just an on-ramp.

For the LLM person

You will recognize most of the machinery here. The perception network is a U-Net trained on COCO-style masks. The end-to-end controller is a small ResNet that emits a discrete distribution over a binned action space — i.e. the robot version of a tokenized output head. The training procedure (DAgger) is the active-learning loop where you collect data exactly where the policy is wrong and fine-tune on it. The hard part isn't the ML; it's the geometry tying camera pixels to world coordinates. That's chapter 2.
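To make those two ideas concrete, here is a minimal sketch, not the thesis code: a "tokenized" action head that bins a continuous velocity command, and the DAgger loop that collects expert labels on exactly the states the learner visits. The 1-D dynamics, proportional-controller expert, and nearest-neighbor learner are toy stand-ins for the quadrotor, the MPC expert, and the ResNet.

```python
import numpy as np

# (1) Discrete action vocabulary: the "tokenized output head".
N_BINS, V_MAX = 21, 1.0
BIN_EDGES = np.linspace(-V_MAX, V_MAX, N_BINS)

def to_bin(v):
    """Continuous velocity -> nearest bin index (the 'token')."""
    return int(np.argmin(np.abs(BIN_EDGES - v)))

def from_bin(i):
    """Bin index -> continuous velocity."""
    return float(BIN_EDGES[i])

def expert(x, goal=0.0):
    """Toy expert: proportional controller toward the goal."""
    return float(np.clip(2.0 * (goal - x), -V_MAX, V_MAX))

class NearestNeighborPolicy:
    """Stand-in for the learned network: state -> action bin."""
    def __init__(self):
        self.X, self.y = [], []
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
    def act(self, x):
        if not self.X:                      # untrained: hover in place
            return to_bin(0.0)
        j = int(np.argmin([abs(x - xi) for xi in self.X]))
        return self.y[j]

def rollout(policy, x0, steps=20, dt=0.1):
    """Run the policy on toy 1-D dynamics, return visited states."""
    xs, x = [], x0
    for _ in range(steps):
        xs.append(x)
        x = x + dt * from_bin(policy.act(x))
    return xs

def dagger(iters=5, x0=1.0):
    policy, X, y = NearestNeighborPolicy(), [], []
    for _ in range(iters):
        visited = rollout(policy, x0)       # 1. roll out CURRENT policy
        X += visited                        # 2. expert labels its states
        y += [to_bin(expert(x)) for x in visited]
        policy.fit(X, y)                    # 3. retrain on the aggregate
    return policy

policy = dagger()
final_x = rollout(policy, x0=1.0)[-1]       # ends near the goal at 0
```

The point of step 2 is the one the chapter makes: plain behavior cloning only sees states the expert visits, while DAgger labels the states the *learner* drifts into, which is where the corrections are actually needed.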

Read in order

Each chapter takes ~10 minutes. Diagrams are inline SVG — open them in any browser, no build step.
