CHAPTER 3: Learning to plan
Replace the brittle geometric handoff with a network that predicts short-horizon motion from pixels. Keep the old segmenter as attention. Train the rest by imitating a privileged optimal-control teacher.
§3.1 Why end-to-end
Chapter 2's pipeline has a clean failure mode: it only knows what the segmenter sees in the current frame. If the orange is hidden behind leaves or behind the tree at the start, the system has no target point, no fitted foliage plane, and no goal pose to hand to the planner.
Chapter 3 keeps the useful visual prior — "this white blob is the orange mask" — but removes the hand-built geometry between perception and planning. The network sees an RGB image plus the mask and predicts three future 6-DOF pose increments. The hope is that it learns the latent visual structure Chapter 2 never represented: where fruit tends to sit in the canopy, which side of the tree is likely clear, and when a search arc is better than a direct approach.
Reinforcement learning would let the drone try things and learn from reward, but real crashes are expensive and the sample efficiency is poor. Imitation learning is more practical here: generate expert trajectories, then train the network to mimic them. The catch is that the expert must be good. In simulation, the expert is DDP with full knowledge of the tree, orange, and drone state.
§3.2 The closed loop
This is not a full rewrite of the stack. The orange segmenter survives from Chapter 2 and becomes an attention channel. What disappears is the hand-engineered middle: no pinhole projection to a mean orange point, no RANSAC plane, no single goal pose contract.
§3.3 The network
The predictor is a small ResNet8, following Kaufmann et al.'s drone-racing architecture. It is deliberately small: one 32-channel convolution, two residual blocks, two fully connected layers, and an 1800-logit classification head.
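As a concrete reading of that description, here is a minimal sketch in PyTorch. The 4-channel input (RGB plus mask), the 32-channel stem, two residual blocks, two fully connected layers, and the 1800-logit head come from the text; kernel sizes, strides, and the hidden width are illustrative guesses.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain two-conv residual block; normalization omitted for brevity."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + y)  # skip connection

class WaypointNet(nn.Module):
    """RGB image + orange mask in, 18 x 100 waypoint logits out."""
    def __init__(self, bins: int = 100, axes: int = 6, horizon: int = 3):
        super().__init__()
        self.horizon, self.axes, self.bins = horizon, axes, bins
        self.stem = nn.Conv2d(4, 32, 5, stride=2, padding=2)  # 3 RGB + 1 mask
        self.blocks = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32, 128), nn.ReLU(),            # two FC layers
            nn.Linear(128, horizon * axes * bins),    # 3 * 6 * 100 = 1800 logits
        )

    def forward(self, x):
        z = self.pool(self.blocks(self.stem(x)))
        # One 100-way softmax per axis per waypoint at decode time.
        return self.head(z).view(-1, self.horizon * self.axes, self.bins)
```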
§3.4 Action as classification
The network does not regress to a continuous waypoint coordinate. Each axis is discretised into 100 bins, and the network does a softmax over those bins. Six axes times three waypoints gives 18 independent 100-class predictions: 1800 logits total.
This choice comes from Kim et al.'s eye-surgery work. A regression head with squared error can collapse to the mean of the dataset, especially when multiple trajectories are plausible. A classification head only has to put probability mass on the right bin.
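A sketch of what that scheme looks like in training code, assuming each axis value is clipped to a fixed range before binning; the range handling and PyTorch framing are assumptions, while the 100 bins and 18 independent heads come from the text.

```python
import torch
import torch.nn.functional as F

def encode(value: float, lo: float, hi: float, bins: int = 100) -> int:
    """Map a continuous axis value to a bin index in [0, bins)."""
    t = (value - lo) / (hi - lo)
    return int(min(max(t, 0.0), 1.0 - 1e-9) * bins)

def decode(index: int, lo: float, hi: float, bins: int = 100) -> float:
    """Map a bin index back to the centre of its bin."""
    return lo + (index + 0.5) / bins * (hi - lo)

def waypoint_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, 18, 100); targets: (B, 18) integer bin labels.
    18 independent 100-way cross-entropies, averaged."""
    return F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
```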
The thesis doubles the supervised dataset with a symmetry trick. Flip the camera image about the vertical axis, then mirror the future waypoints about the UAV's local x-z plane. A fruit on the top-left becomes a fruit on the top-right, and the action labels flip with it.
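A sketch of that augmentation, assuming a body frame with x forward, y left, z up, and waypoints stored as (x, y, z, roll, pitch, yaw) rows; under that convention the mirror negates y, roll, and yaw, but the exact signs depend on the thesis's frame definitions.

```python
import numpy as np

def mirror_example(image: np.ndarray, waypoints: np.ndarray):
    """image: (H, W, C); waypoints: (3, 6) rows of (x, y, z, roll, pitch, yaw)."""
    flipped = image[:, ::-1, :].copy()   # flip about the vertical image axis
    mirrored = waypoints.copy()
    mirrored[:, [1, 3, 5]] *= -1.0       # y, roll, yaw change sign in the mirror
    return flipped, mirrored
```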
This is the same broad move as tokenizing a continuous output. Decoder LLMs predict the next token over a fixed vocabulary; here the network predicts the next waypoint over a fixed bin grid. The useful analogy is not that this thesis is an LLM, but that classification gives the model a sharper training signal than direct continuous regression.
§3.5 The MPC teacher
You cannot train an imitation network without an expert. The thesis builds one inside Unity: with full ground-truth knowledge of the tree, the orange, and the drone, a Differential Dynamic Programming solver computes optimal trajectories.
DDP minimises a cost that balances state error, control effort, obstacle avoidance, and yaw alignment. The four terms, in plain language:
- dxfᵀ Qf dxf — final-state cost. Get to the orange.
- dxᵀ Q dx + uᵀ R u — running costs. Do not wander; do not burn unnecessary control.
- b · c(x) — obstacle cost. This is two costs under one name: a ground term that penalizes unsafe low altitude, and a tree term that treats the canopy as a cylinder and penalizes squared intersection depth. The gain b starts at 0, so the first solve ignores obstacles, and doubles after each convergence, hardening the penalty across solves.
- y(x) = g(1 − (R e₁) · d̂f) — yaw cost. Keep the drone's forward axis R e₁ pointed along d̂f, the unit direction to the orange. The thesis deliberately defines d̂f in the x-y plane only; including z made the solver dive at the start.
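Assembled, the cost takes the standard DDP shape. This reconstruction simply sums the four terms above over the horizon; the exact indexing and weights are the thesis's:

$$
J \;=\; dx_f^\top Q_f\, dx_f \;+\; \sum_{k=0}^{N-1} \Big( dx_k^\top Q\, dx_k + u_k^\top R\, u_k + b\, c(x_k) + y(x_k) \Big)
$$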
This is the expert's privileged information in concrete form: it knows the tree cylinder, the ground threshold, the orange position, and the drone state. The student never receives those variables. It only sees the RGB image and mask, then learns to imitate the actions this cost function would have produced.
This is supervised fine-tuning where the labels are produced offline by a privileged model. The teacher has access to the world; the student only has access to a camera. "Stronger" here means "has better inputs," not "has more parameters."
§3.6 DAgger
Pure imitation has a distribution-shift problem. At inference time the student drifts away from expert states, and then every subsequent input is a situation the student did not see during training.
DAgger fixes this by letting the student fly in simulation, asking the expert what it would have done from those student-visited states, adding the difficult image-waypoint pairs to the dataset, and retraining. In this thesis, an example is added when the prediction error exceeds ε; over time ε decays to 2.5 cm.
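A sketch of that loop, with the rollout, expert query, and retraining step passed in as callables since only the aggregation rule and the decaying threshold come from the text; the starting threshold and decay rate here are placeholders.

```python
import numpy as np

def dagger(policy, fly_student, expert_waypoints, train, dataset,
           rounds=10, eps=0.20, eps_min=0.025):
    for _ in range(rounds):
        # Student flies; keep the images it actually saw along the way.
        for image, state, student_wp in fly_student(policy):
            expert_wp = expert_waypoints(state)        # privileged DDP query
            # Aggregate only the hard cases: student far from expert.
            if np.linalg.norm(student_wp - expert_wp) > eps:
                dataset.append((image, expert_wp))
        policy = train(policy, dataset)                # retrain on the union
        eps = max(eps_min, eps * 0.5)                  # decays toward 2.5 cm
    return policy
```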
§3.7 Emergent exploration
The interesting result is not the average success rate. It is the policy that appears when the orange is not immediately visible.
The simulated training data contains reachable oranges. But when the trained network is tested on a tree with no orange present, it does not simply stall; it circles the tree. The thesis calls this emergent exploratory behavior. The loss never says "search," but the policy has learned that moving around a fruit-tree-shaped object is a useful way to make hidden fruit visible.
Simulation success by scenario: 85% head-on, 70% hard cases, 77.5% overall. The occlusion test is harsher: each 10% occlusion bin gets 25 runs, and the 50–60% bin drops to 40% success.
§3.8 Sim → real
When the same architecture is trained on real-world data and flown on the actual drone, it still does not match the baseline's grasp reliability. The thesis reports final pose error over 71 flights instead of a pick-success rate. Two gaps explain why:
- The real tree is visually messier than Unity: shadows, leaf texture, lighting changes, and depth noise.
- Real expert trajectories are hand-carried demonstrations, not DDP-optimal trajectories. They are noisier and less consistent.
The deployed real-world policy still leans on the segmentation network for safety. If the orange mask gets too large, the drone assumes it is too close to the tree and sets Δḡ = 0, hovering in place. If the mask goes to zero, the drone stops translating and commands a 10° yaw turn until the orange comes back into view. This is a learned planner with two hand-written guardrails.
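A sketch of those guardrails, assuming the mask arrives as a binary array and a command is a (translation increment, yaw increment) pair; the 10° yaw step is from the text, the mask-area threshold is a placeholder.

```python
import numpy as np

def apply_guardrails(mask: np.ndarray, predicted_delta: np.ndarray,
                     close_frac: float = 0.25):
    frac = float(mask.mean())                 # fraction of orange pixels
    if frac > close_frac:                     # mask too large: too close to tree
        return np.zeros(3), 0.0               # Δḡ = 0, hover in place
    if frac == 0.0:                           # mask empty: orange lost
        return np.zeros(3), np.deg2rad(10.0)  # stop translating, yaw 10°
    return predicted_delta, 0.0               # otherwise trust the network
```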
The per-waypoint error in centimetres is comparable, but that does not translate directly to mission success. Small waypoint errors compound through the closed loop, and the real tree is less forgiving than the simulator. The thesis points to Sim2Real transfer and better world models as the next fixes.