How do you fly a drone with an arm to a half-hidden orange in a fake tree, when you have nothing but a camera, a depth sensor, and onboard compute the size of a deck of cards?
Source · S. Kothiyal, JHU 2021
Advisor · Prof. Marin Kobilarov
Apparatus · DJI Matrice 100 + RealSense D435i
Outcome · 97/102 picks with the classical pipeline (84 clean + 13 after a reset); 77.5% success in sim with the learned one
What this is
Siddharth Kothiyal's master's thesis at Johns Hopkins teaches a flying robot to pick fruit. It does this twice — once with a stack of classical computer-vision and optimal-control parts bolted together, and once with a single neural network trained to imitate an expert. The two attempts make a clean compare-and-contrast for the central question of modern robotics: where do you draw the line between the perception system and the planner?
I've redrawn the thesis from scratch as a four-chapter visual reader, so you can absorb the geometry, the network architecture, and the result of each ablation without reading 67 pages of LaTeX. The original is excellent; this is just an on-ramp.
For the LLM person
You will recognize most of the machinery here. The perception network is a U-Net trained on COCO-style masks. The end-to-end controller is a small ResNet that emits a discrete distribution over a binned action space — i.e. the robot version of a tokenized output head. The training procedure (DAgger) is the active-learning loop where you collect data exactly where the policy is wrong and fine-tune on it. The hard part isn't the ML; it's the geometry tying camera pixels to world coordinates. That's chapter 2.
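If the binned-action idea is unfamiliar in a robotics setting, here is what that output head looks like in practice. This is a minimal PyTorch sketch, not the thesis's code: the ResNet-18 backbone, the four action dimensions, and the 21 bins per dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class BinnedActionPolicy(nn.Module):
    """A small ResNet backbone with one classification head per action
    dimension. Each continuous command (e.g. vx, vy, vz, yaw rate) is
    discretized into num_bins buckets, so the network predicts a
    categorical distribution per dimension: the 'tokenized output head'
    analogy from the text. Sizes here are illustrative assumptions."""

    def __init__(self, action_dims=4, num_bins=21):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d feature
        self.backbone = backbone
        self.heads = nn.ModuleList(
            [nn.Linear(512, num_bins) for _ in range(action_dims)]
        )

    def forward(self, rgb):                    # rgb: (B, 3, H, W)
        feat = self.backbone(rgb)              # (B, 512)
        return [head(feat) for head in self.heads]  # logits per dim

# Training is plain cross-entropy against the expert's binned action,
# one loss term per action dimension.
policy = BinnedActionPolicy()
logits = policy(torch.randn(2, 3, 224, 224))
expert_bins = torch.randint(0, 21, (2, 4))     # expert actions, binned
loss = sum(nn.functional.cross_entropy(l, expert_bins[:, i])
           for i, l in enumerate(logits))
```

The payoff of classifying over bins instead of regressing a continuous value is the same one language models get from a softmax head: when the expert data is multimodal, a categorical head can keep both modes instead of averaging them into an action the expert never took.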
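DAgger itself is short enough to sketch whole. Everything named here (`env`, `expert`, `policy.fit`) is a hypothetical stand-in; the load-bearing detail is that the *policy* drives while the *expert* labels.

```python
def dagger(policy, expert, env, rounds=10, episodes_per_round=20):
    """DAgger sketch: roll out the CURRENT policy, label every visited
    state with the EXPERT's action, then retrain on the aggregate.
    Data accumulates exactly where the policy actually takes itself,
    which is where plain behavior cloning goes off-distribution.
    `env`, `expert`, and `policy.fit` are hypothetical stand-ins."""
    dataset = []
    for _ in range(rounds):
        for _ in range(episodes_per_round):
            obs = env.reset()
            done = False
            while not done:
                action = policy.act(obs)                # policy drives...
                dataset.append((obs, expert.act(obs)))  # ...expert labels
                obs, done = env.step(action)
        policy.fit(dataset)                             # retrain on all data
    return policy
```

The canonical algorithm (Ross et al., 2011) also mixes expert actions back in with a decaying probability in early rounds; that schedule is often dropped in practice and is omitted here.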
Read in order
Each chapter takes ~10 minutes. Diagrams are inline SVG — open them in any browser, no build step.