CHAPTER 1 · The problem
A drone with an arm, a fake orange tree, a magnetic gripper, and the question every autonomous robot eventually has to answer: where is my goal, and how do I get there without running into anything?
§1.1 The scene
The lab at JHU is set up with an artificial orange tree (with detachable, magnet-equipped oranges that can be relocated between trials) and a quadcopter that starts 2–4 m away at a height of 0.75–2.5 m. The orange's location, and the angle from which it can be plucked, vary every run; the relative yaw between the drone and the best approach direction swings ±75°. The drone has to find the orange, decide which way to come in, fly there without crashing into the tree, and snap the magnet onto the orange.
§1.2 Anatomy of the apparatus
The drone is a DJI Matrice 100 (which still uses its built-in flight controller for low-level stabilisation), instrumented with three things the thesis cares about: a vision sensor, a compute brain, and an actuator.
The fixed arm is a modelling choice. Because ṙ = 0, the end-effector pose g_e(q) = g · φ(r) is just a fixed offset from the body frame. So every "where is the gripper" question collapses to "where is the body, plus a constant transform." This is what makes the rest of the math tractable in 67 pages.
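That collapse is easy to make concrete. A minimal numpy sketch, with a hypothetical gripper offset (0.3 m forward, 0.1 m below the body frame; illustrative numbers, not the thesis's):

```python
import numpy as np

# Hypothetical constant arm offset φ(r): 0.3 m forward, 0.1 m below the
# body frame, no relative rotation; constant precisely because ṙ = 0.
PHI = np.eye(4)
PHI[:3, 3] = [0.3, 0.0, -0.1]

def end_effector_pose(g):
    """g_e = g · φ(r): the body pose times a fixed offset."""
    return g @ PHI

# Body at the origin, yawed 90° about z: the gripper sits 0.3 m along
# the world y-axis (the body's forward direction) and 0.1 m down.
yaw = np.pi / 2
g = np.eye(4)
g[:3, :3] = [[np.cos(yaw), -np.sin(yaw), 0.0],
             [np.sin(yaw),  np.cos(yaw), 0.0],
             [0.0,          0.0,         1.0]]
g_e = end_effector_pose(g)
# g_e[:3, 3] ≈ [0, 0.3, -0.1]
```

One matrix multiply per query, no joint kinematics: that is the whole payoff of freezing the arm.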
§1.3 Writing the robot down
Before any planner can do anything useful, you have to commit to a vocabulary for the state. The thesis uses the standard rigid-body description for a quadrotor:
Position p in the world frame, orientation R as a rotation matrix, body-fixed linear velocity v, body-fixed angular velocity ω. The pair (R, p) packs into a single 4×4 pose matrix g ∈ SE(3):

g = ⎡ R  p ⎤
    ⎣ 0  1 ⎦

with 0 the 1×3 row of zeros.
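The same bookkeeping in code; a minimal numpy sketch (`pack` and `inverse` are illustrative helper names, not from the thesis):

```python
import numpy as np

def pack(R, p):
    """g = [R p; 0 1]: rotation and position in one 4×4 pose matrix."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g

def inverse(g):
    """Closed-form SE(3) inverse: g⁻¹ = [Rᵀ −Rᵀp; 0 1]."""
    R, p = g[:3, :3], g[:3, 3]
    return pack(R.T, -R.T @ p)

# Round trip: a pose composed with its inverse is the identity.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])  # 90° yaw about z
g = pack(R, np.array([1.0, 2.0, 0.5]))
# inverse(g) @ g ≈ np.eye(4)
```

The point of the 4×4 packing is that composition of poses becomes plain matrix multiplication, which is what makes "body pose times constant offset" in §1.2 a one-liner.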
The dynamics ẋ = f(x,u) are derived in Kobilarov 2014 — the upshot is that the system is differentially flat (you can recover the controls from a sufficiently smooth trajectory), which is what makes the polynomial-snap planner in Chapter 2 work.
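A minimal illustration of what flatness buys you: from a smooth position trajectory alone you can recover the collective thrust and the body z-axis, since the thrust vector must supply the commanded acceleration plus gravity. The mass and the polynomial below are assumed values for illustration, not numbers from the thesis:

```python
import numpy as np

GRAV = 9.81  # m/s², world z up
MASS = 2.4   # kg; roughly a Matrice 100 with battery (assumed value)

def flat_to_thrust(pos_coeffs, t):
    """Illustrative flatness map: from polynomial position trajectories,
    recover the collective thrust and the body z-axis at time t.
    pos_coeffs: (3, n) array, one row of coefficients per axis
    (highest degree first, np.polyval convention)."""
    acc = np.array([np.polyval(np.polyder(c, 2), t) for c in pos_coeffs])
    f_des = MASS * (acc + np.array([0.0, 0.0, GRAV]))  # net force needed
    thrust = float(np.linalg.norm(f_des))
    b3 = f_des / thrust  # body z-axis aligns with the net thrust direction
    return thrust, b3

# A gentle vertical climb: z(t) = 0.05·t² + 1.5, so a_z = 0.1 m/s².
coeffs = np.array([[0.0, 0.0, 1.0],    # x(t) = 1
                   [0.0, 0.0, 2.0],    # y(t) = 2
                   [0.05, 0.0, 1.5]])  # z(t)
thrust, b3 = flat_to_thrust(coeffs, t=0.0)
# thrust = m·(g + 0.1) ≈ 23.8 N, b3 = [0, 0, 1]
```

Attitude and angular rates follow from higher derivatives in the same way, which is why smoothness of the polynomial (the "snap" in the planner's name) matters.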
A planner that wanted full information would need three things up front: the desired final pose ḡ(t_f), a desired trajectory x̄(·), and the obstacle set E ⊂ ℝ³. None of these are available a priori. The drone gets a camera and a depth sensor and has to estimate them on the fly. That's the whole thesis.
Formally, the task is an optimal control problem (EQ 1.3 · OPTIMAL CONTROL): choose controls u(·) minimising a cost that penalises deviation from x̄(·),

    subject to  ẋ = f(x, u)                  (dynamics)
                g(t_f) = ḡ(t_f)              (reach the goal pose)
                A(q(t)) ∩ E = ∅ for all t    (avoid collisions)
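The collision constraint A(q(t)) ∩ E = ∅ is typically checked discretely along a candidate trajectory. A sketch under simplifying assumptions (bounding-sphere drone, point-cloud obstacles; the thesis's actual geometry for A(q) may differ):

```python
import numpy as np

DRONE_RADIUS = 0.5  # m; bounding sphere for body + arm (assumed value)

def collision_free(traj, obstacles, radius=DRONE_RADIUS):
    """Discrete check of A(q(t)) ∩ E = ∅: the drone is a bounding sphere,
    E a point cloud (e.g. depth-sensor returns from the tree).
    traj: (T, 3) sampled positions, obstacles: (M, 3) points."""
    if len(obstacles) == 0:
        return True
    diffs = traj[:, None, :] - obstacles[None, :, :]   # (T, M, 3)
    dists = np.linalg.norm(diffs, axis=-1)
    return bool(dists.min() > radius)

# A straight line through a foliage point fails; a path two metres to
# the side passes.
tree = np.array([[1.0, 0.0, 1.5]])
through = np.linspace([0.0, 0.0, 1.5], [2.0, 0.0, 1.5], 20)
around  = np.linspace([0.0, 2.0, 1.5], [2.0, 2.0, 1.5], 20)
# collision_free(through, tree) → False; collision_free(around, tree) → True
```

The catch the section sets up: E is not given, so in practice this check runs against whatever obstacle estimate the perception stack produces on the fly.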
§1.4 Two systems, one task
The thesis tackles the same problem with two architectures. Chapter 2 keeps perception and planning as separate modules that communicate through a clean interface (the goal pose). Chapter 3 fuses them: a single neural network looks at pixels and emits future waypoints directly.
Decoupled · Ch 2
Pros:
- Reliable when the goal is in view
- Each module independently testable
- No training data needed at the planner
Cons:
- Hard fail when the orange leaves view
- RANSAC plane fit oscillates near sparse foliage
- No notion of "exploration"

End-to-end · Ch 3
Pros:
- Learns latent visual cues for free
- Emergent exploration (circles the tree if no orange is visible)
- Can handle initial occlusion
Cons:
- Needs an expert teacher and lots of data
- Sim-to-real transfer degrades performance
- Fails opaquely
The decoupled vs end-to-end split here is the same axis as retrieval-augmented agents vs end-to-end vision-language-action (VLA) models. In retrieval, you commit to "the world is a vector store, the LLM is a planner over its outputs." In a VLA you let the model learn its own visual representations and emit actions directly. Both papers in this thesis are early-2020s analogues of that debate: Chapter 3 is closer to the modern VLA spirit; Chapter 2 is closer to the agent-with-tools approach.