A visual reading of Perception Based UAV Path Planning for Fruit Harvesting

CHAPTER 1 · The problem

A drone with an arm, a fake orange tree, a magnetic gripper, and the question every autonomous robot eventually has to answer: where is my goal, and how do I get there without running into anything?

§1.1 · The scene

The lab at JHU is set up with an artificial orange tree (one with detachable, magnet-equipped oranges that can be relocated between trials) and a quadcopter that starts 2–4 metres away at a height of 0.75–2.5 m. The orange's location, and the angle from which it can be plucked, varies every run — the relative yaw between drone and best approach swings ±75°. The drone has to find the orange, decide which way to come in, fly there without crashing into the tree, and snap the magnet onto the orange.

[FIG. 1.1 · TASK SETUP: diagram of the start annulus (2–4 m out, 0.75–2.5 m up), the artificial tree (trunk, branches, discrete leaf clusters), the desired trajectory x̄(·), and the ±75° approach-yaw range; orange visible but partly occluded at t₀.]
FIG. 1.1 Test setup, after Kothiyal §2.3.1. The drone starts somewhere on a 2–4 m annulus around the tree at 0.75–2.5 m altitude, with the orange visible somewhere on the foliage. The relative yaw between the drone's initial heading and the best approach to the orange ranges from −75° to +75°. In Chapter 2 the orange is always at least partially in view at t₀. In Chapter 3 the trained policy can also handle starting with the orange behind a leaf, behind the tree, or — emergently — not present at all.

§1.2 · Anatomy of the apparatus

The drone is a DJI Matrice 100 (which still uses its built-in flight controller for low-level stabilisation), instrumented with three things the thesis cares about: a vision sensor, a compute brain, and an actuator.

[FIG. 1.2 · APPARATUS, after §2.3.1: DJI Matrice 100 (4 BLDC rotors; DJI autopilot stabilises low-level flight) carrying an Intel NUC (ROS onboard, GPU inference offboard), a RealSense D435i (RGB + IR stereo depth, 0.3–10 m range), and a fixed 3D-printed arm (r constant, ṙ = 0) ending in a passive magnetic gripper with LSM303-based attachment detection.]
FIG. 1.2 The aerial manipulator. Note the arm joint is fixed (ṙ = 0) — the drone flies the gripper into position rather than moving an articulated arm. This simplification lets the trajectory generator treat the whole vehicle as a single rigid body.
Why this matters for the algorithm

The fixed arm is a modelling choice. Because ṙ = 0, the end-effector pose g_e(q) = g · φ(r) is just a fixed offset from the body frame. So every "where is the gripper" question collapses to "where is the body, plus a constant transform." This is what makes the rest of the math tractable in 67 pages.
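To make that collapse concrete, here is a minimal NumPy sketch. The 0.4 m offset along x_b is an invented placeholder (the thesis fixes its own φ(r)), but the shape of the computation is exactly the point: one constant 4×4 transform.

```python
import numpy as np

def se3(R, p):
    """Pack a rotation matrix R (3x3) and position p (3,) into a 4x4 pose."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g

# Hypothetical constant arm transform phi(r): the gripper sits 0.4 m ahead
# of the body origin along x_b with no relative rotation. The offset value
# is an assumption for illustration, not a number from the thesis.
PHI_R = se3(np.eye(3), np.array([0.4, 0.0, 0.0]))

def end_effector_pose(g_body):
    """g_e(q) = g · phi(r): with r fixed, this is one matrix multiply."""
    return g_body @ PHI_R
```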

§1.3 · Writing the robot down

Before any planner can do anything useful, you have to commit to a vocabulary for the state. The thesis uses the standard rigid-body description for a quadrotor:

x = (p, R, v, ω) ∈ ℝ³ × SO(3) × ℝ³ × ℝ³   (EQ 1.1)

Position p in the world frame, orientation R as a rotation matrix, body-fixed linear velocity v, body-fixed angular velocity ω. The pair (R, p) packs into a single 4×4 pose matrix g ∈ SE(3):

[FIG. 1.3 · STATE & DYNAMICS, after §1.2: the pose packs R (3×3 orientation) and p (3×1 position) into g = [R, p; 0 0 0 1] ∈ SE(3); kinematics ġ = g·V̂; dynamics M(q)V̇ + b(x) = B(q)·u; control input u ∈ ℝ⁴ (three body torques plus one thrust along the local ẑ_b); end-effector pose g_e(q) = g·φ(r), a fixed offset.]

The dynamics ẋ = f(x, u) are derived in Kobilarov 2014; the upshot is that the system is differentially flat (you can recover the controls from a sufficiently smooth trajectory), which is what makes the minimum-snap polynomial planner in Chapter 2 work.
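A one-step illustration of what flatness buys you, in the standard quadrotor form (this is the textbook computation, not Kobilarov's exact derivation, and MASS is an assumed placeholder): given the world-frame acceleration of a sufficiently smooth trajectory, the collective thrust and the body z-axis fall out algebraically.

```python
import numpy as np

G = 9.81    # gravity, m/s^2
MASS = 2.4  # assumed vehicle mass in kg; the real Matrice 100 figure may differ

def flat_outputs_to_thrust(accel_world):
    """Standard quadrotor flatness step: from the desired world-frame
    acceleration along a smooth trajectory, recover the collective thrust
    magnitude and the desired body z-axis. Full attitude and the body
    torques follow from higher derivatives of the trajectory."""
    f_vec = MASS * (accel_world + np.array([0.0, 0.0, G]))  # required force
    thrust = np.linalg.norm(f_vec)  # scalar thrust along z_b
    z_b = f_vec / thrust            # desired body z-axis in the world frame
    return thrust, z_b
```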

A planner that wanted full information would need three things up front: the desired final pose ḡ(t_f), a desired trajectory x̄(·), and the obstacle set E ⊂ ℝ³. None of these are available a priori. The drone gets a camera and a depth sensor and has to estimate them on the fly. That's the whole thesis.

min_{u(·)}  ∫_{t₀}^{t_f} ½‖x(t) − x̄(t)‖²_Q + ½‖u(t)‖²_R dt
subject to  ẋ = f(x, u)           (dynamics)
            A(q(t)) ∩ E = ∅ ∀t    (avoid collisions)   (EQ 1.3 · optimal control)
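As a sanity check on what EQ 1.3 is scoring, here is a minimal discretised version of the cost functional, a plain Riemann sum with the collision constraint left to the planners in later chapters. The array shapes and the dt argument are assumptions of this sketch:

```python
import numpy as np

def trajectory_cost(xs, us, x_bar, Q, R, dt):
    """Riemann-sum version of EQ 1.3's integrand:
    sum_k [ 1/2 (x_k - x̄_k)^T Q (x_k - x̄_k) + 1/2 u_k^T R u_k ] * dt,
    where xs is (N, n) states, us is (N, m) controls, x_bar is the
    (N, n) reference trajectory sampled at the same instants."""
    err = xs - x_bar
    state_term = 0.5 * np.einsum('ki,ij,kj->k', err, Q, err)
    ctrl_term = 0.5 * np.einsum('ki,ij,kj->k', us, R, us)
    return float(np.sum(state_term + ctrl_term) * dt)
```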

§1.4 · Two systems, one task

The thesis tackles the same problem with two architectures. Chapter 2 keeps perception and planning as separate modules that communicate through a clean interface (the goal pose). Chapter 3 fuses them: a single neural network looks at pixels and emits future waypoints directly.

[FIG. 1.4 · DECOUPLED VS END-TO-END PIPELINES. Chapter 2, decoupled visual servoing (the interface between modules is a single 6-DOF goal pose): RGB-D RealSense → U-Net mask → pinhole + depth → p_orange → RANSAC plane fit → best yaw → min-snap trajectory → PD tracker, replanning at 8–9 Hz while the orange remains visible. Chapter 3, end-to-end imitation (the network learns visual features that imply where to fly next): RGB + mask concatenated → ResNet8 → 1800 logits (3 waypoints × 6 DOF × 100 bins) → DDP smoothing against the dynamics → PD tracking → DJI autopilot, closed loop with three 0.5 s waypoint increments.]
FIG. 1.4 Two architectures, side by side. The decoupled stack passes only a goal pose between perception and planning — a clean, low-bandwidth interface that's easy to debug. The end-to-end stack lets the network learn its own internal representation, which means it can pick up cues the engineer never thought to expose, but you can't easily inspect what it's doing.
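To pin down what "1800 logits" means in the end-to-end stack: the head is a binned classification over 3 waypoints × 6 DOF × 100 bins, so decoding is an argmax per DOF followed by de-binning. A minimal sketch, with per-DOF ranges invented for illustration (the thesis fixes its own discretisation):

```python
import numpy as np

N_WAYPOINTS, N_DOF, N_BINS = 3, 6, 100  # 3 × 6 × 100 = 1800 logits

# Per-DOF (lo, hi) ranges used to map a bin index back to a continuous
# value. These bounds are placeholders, not the thesis's actual ranges.
DOF_RANGES = np.array([[-1.0, 1.0]] * N_DOF)

def decode_waypoints(logits):
    """Turn the network's 1800 logits into a (3, 6) waypoint array by
    taking the argmax bin per waypoint per DOF, then mapping each bin
    index to the centre of its interval."""
    bins = logits.reshape(N_WAYPOINTS, N_DOF, N_BINS).argmax(axis=-1)
    lo, hi = DOF_RANGES[:, 0], DOF_RANGES[:, 1]
    return lo + (bins + 0.5) / N_BINS * (hi - lo)
```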

Decoupled · Ch 2

  • Reliable when goal is in view
  • Each module independently testable
  • No training data needed at the planner
  • Hard fail when orange leaves view
  • RANSAC plane fit oscillates near sparse foliage
  • No notion of "exploration"

End-to-end · Ch 3

  • Learns latent visual cues for free
  • Emergent exploration (circles tree if no orange)
  • Can handle initial occlusion
  • Needs an expert teacher and lots of data
  • Sim-to-real degrades performance
  • Fails opaquely
For the LLM person

The decoupled vs end-to-end split here is the same axis as retrieval-augmented agents vs end-to-end VLA. In retrieval, you commit to "the world is a vector store, the LLM is a planner over its outputs." In a VLA you let the model learn its own visual representations and emit actions directly. Both papers in this thesis are early-2020s analogues of that debate. Chapter 3 is closer to the modern VLA spirit; Chapter 2 is closer to the agent-with-tools approach.
