CHAPTER 3: Learning to plan
Replace the brittle geometric handoff with a network that predicts short-horizon motion from pixels. Keep the old segmenter as attention. Train the rest by imitating a privileged optimal-control teacher.
§3.1 Why end-to-end
Chapter 2's pipeline has a clean failure mode: it only knows what the segmenter sees in the current frame. If the orange is hidden behind leaves or behind the tree at the start, the system has no target point, no fitted foliage plane, and no goal pose to hand to the planner.
Chapter 3 keeps the useful visual prior — "this white blob is the orange mask" — but removes the hand-built geometry between perception and planning. The network sees an RGB image plus the mask and predicts three future 6-DOF pose increments. The hope is that it learns the latent visual structure Chapter 2 never represented: where fruit tends to sit in the canopy, which side of the tree is likely clear, and when a search arc is better than a direct approach.
Reinforcement learning would let the drone try things and learn from reward, but real crashes are expensive and the sample efficiency is poor. Imitation learning is more practical here: generate expert trajectories, then train the network to mimic them. The catch is that the expert must be good. In simulation, the expert is DDP with full knowledge of the tree, orange, and drone state.
§3.2 The closed loop
This is not a full rewrite of the stack. The orange segmenter survives from Chapter 2 and becomes an attention channel. What disappears is the hand-engineered middle: no pinhole projection to a mean orange point, no RANSAC plane, no single goal pose contract.
§3.3 The network
The predictor is a small ResNet8, following Kaufmann et al.'s drone-racing architecture. It is deliberately small: one 32-channel convolution, two residual blocks, two fully connected layers, and an 1800-logit classification head.
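As a concrete reading of that description, here is a minimal sketch in PyTorch. The 4-channel input (RGB plus mask), the 32-channel stem, two residual blocks, two fully connected layers, and the 1800-logit head come from the text; kernel sizes, strides, and the hidden width are illustrative guesses.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain two-conv residual block; normalization omitted for brevity."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + y)  # skip connection

class WaypointNet(nn.Module):
    """RGB image + orange mask in, 18 x 100 waypoint logits out."""
    def __init__(self, bins: int = 100, axes: int = 6, horizon: int = 3):
        super().__init__()
        self.horizon, self.axes, self.bins = horizon, axes, bins
        self.stem = nn.Conv2d(4, 32, 5, stride=2, padding=2)  # 3 RGB + 1 mask
        self.blocks = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32, 128), nn.ReLU(),            # two FC layers
            nn.Linear(128, horizon * axes * bins),    # 3 * 6 * 100 = 1800 logits
        )

    def forward(self, x):
        z = self.pool(self.blocks(self.stem(x)))
        # One 100-way softmax per axis per waypoint at decode time.
        return self.head(z).view(-1, self.horizon * self.axes, self.bins)
```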
§3.4 Action as classification
The network does not regress to a continuous waypoint coordinate. Each axis is discretised into 100 bins, and the network does a softmax over those bins. Six axes times three waypoints gives 18 independent 100-class predictions: 1800 logits total.
This choice comes from Kim et al.'s eye-surgery work. A regression head with squared error can collapse to the mean of the dataset, especially when multiple trajectories are plausible. A classification head only has to put probability mass on the right bin.
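A sketch of what that scheme looks like in training code, assuming each axis value is clipped to a fixed range before binning; the range handling and PyTorch framing are assumptions, while the 100 bins and 18 independent heads come from the text.

```python
import torch
import torch.nn.functional as F

def encode(value: float, lo: float, hi: float, bins: int = 100) -> int:
    """Map a continuous axis value to a bin index in [0, bins)."""
    t = (value - lo) / (hi - lo)
    return int(min(max(t, 0.0), 1.0 - 1e-9) * bins)

def decode(index: int, lo: float, hi: float, bins: int = 100) -> float:
    """Map a bin index back to the centre of its bin."""
    return lo + (index + 0.5) / bins * (hi - lo)

def waypoint_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, 18, 100); targets: (B, 18) integer bin labels.
    18 independent 100-way cross-entropies, averaged."""
    return F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
```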
The thesis doubles the supervised dataset with a symmetry trick. Flip the camera image about the vertical axis, then mirror the future waypoints about the UAV's local x-z plane. A fruit on the top-left becomes a fruit on the top-right, and the action labels flip with it.
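A sketch of that augmentation, assuming a body frame with x forward, y left, z up, and waypoints stored as (x, y, z, roll, pitch, yaw) rows; under that convention the mirror negates y, roll, and yaw, but the exact signs depend on the thesis's frame definitions.

```python
import numpy as np

def mirror_example(image: np.ndarray, waypoints: np.ndarray):
    """image: (H, W, C); waypoints: (3, 6) rows of (x, y, z, roll, pitch, yaw)."""
    flipped = image[:, ::-1, :].copy()   # flip about the vertical image axis
    mirrored = waypoints.copy()
    mirrored[:, [1, 3, 5]] *= -1.0       # y, roll, yaw change sign in the mirror
    return flipped, mirrored
```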
This is the same broad move as tokenizing a continuous output. Decoder LLMs predict the next token over a fixed vocabulary; here the network predicts the next waypoint over a fixed bin grid. The useful analogy is not that this thesis is an LLM, but that classification gives the model a sharper training signal than direct continuous regression.
§3.5 The MPC teacher
You cannot train an imitation network without an expert. The thesis builds one inside Unity: with full ground-truth knowledge of the tree, the orange, and the drone, a Differential Dynamic Programming solver computes optimal trajectories.
DDP minimises a cost that balances state error, control effort, obstacle avoidance, and yaw alignment. The four terms, in plain language:
- dxfᵀ Qf dxf — final-state cost. Get to the orange.
- dxᵀ Q dx + uᵀ R u — running costs. Do not wander; do not burn unnecessary control.
- b · c(x) — obstacle cost. This is two costs under one name: a ground term that penalizes unsafe low altitude, and a tree term that treats the canopy as a cylinder and penalizes squared intersection depth. The gain b starts at 0, so the first solve ignores obstacles, and doubles after each convergence, hardening the penalty across solves.
- y(x) = g(1 − (R e₁) · d̂f) — yaw cost. Keep the drone's forward axis R e₁ pointed along d̂f, the unit direction to the orange. The thesis deliberately defines d̂f in the x-y plane only; including z made the solver dive at the start.
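Assembled, the cost takes the standard DDP shape. This reconstruction simply sums the four terms above over the horizon; the exact indexing and weights are the thesis's:

$$
J \;=\; dx_f^\top Q_f\, dx_f \;+\; \sum_{k=0}^{N-1} \Big( dx_k^\top Q\, dx_k + u_k^\top R\, u_k + b\, c(x_k) + y(x_k) \Big)
$$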
This is the expert's privileged information in concrete form: it knows the tree cylinder, the ground threshold, the orange position, and the drone state. The student never receives those variables. It only sees the RGB image and mask, then learns to imitate the actions this cost function would have produced.
This is supervised fine-tuning where the labels are produced offline by a privileged model. The teacher has access to the world; the student only has access to a camera. "Stronger" here means "has better inputs," not "has more parameters."
§3.6 DAgger
Pure imitation has a distribution-shift problem. At inference time the student drifts away from expert states, and then every subsequent input is a situation the student did not see during training.
DAgger fixes this by letting the student fly in simulation, asking the expert what it would have done from those student-visited states, adding the difficult image-waypoint pairs to the dataset, and retraining. In this thesis, an example is added when the prediction error exceeds ε; over time ε decays to 2.5 cm.
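A sketch of that loop, with the rollout, expert query, and retraining step passed in as callables since only the aggregation rule and the decaying threshold come from the text; the starting threshold and decay rate here are placeholders.

```python
import numpy as np

def dagger(policy, fly_student, expert_waypoints, train, dataset,
           rounds=10, eps=0.20, eps_min=0.025):
    for _ in range(rounds):
        # Student flies; keep the images it actually saw along the way.
        for image, state, student_wp in fly_student(policy):
            expert_wp = expert_waypoints(state)        # privileged DDP query
            # Aggregate only the hard cases: student far from expert.
            if np.linalg.norm(student_wp - expert_wp) > eps:
                dataset.append((image, expert_wp))
        policy = train(policy, dataset)                # retrain on the union
        eps = max(eps_min, eps * 0.5)                  # decays toward 2.5 cm
    return policy
```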
§3.7 Emergent exploration
The interesting result is not the average success rate. It is the policy that appears when the orange is not immediately visible.
The simulated training data contains reachable oranges. But when the trained network is tested on a tree with no orange present, it does not simply stall; it circles the tree. The thesis calls this emergent exploratory behavior. The loss never says "search," but the policy has learned that moving around a fruit-tree-shaped object is a useful way to make hidden fruit visible.
Simulation success by scenario: 85% head-on, 70% hard cases, 77.5% overall. The occlusion test is harsher: each 10% occlusion bin gets 25 runs, and the 50–60% bin drops to 40% success.
§3.8 Sim → real
When the same architecture is trained on real-world data and flown on the actual drone, it still does not match the baseline's grasp reliability. The thesis reports final pose error over 71 flights instead of a pick-success rate. Two gaps explain why:
- The real tree is visually messier than Unity: shadows, leaf texture, lighting changes, and depth noise.
- Real expert trajectories are hand-carried demonstrations, not DDP-optimal trajectories. They are noisier and less consistent.
The deployed real-world policy still leans on the segmentation network for safety. If the orange mask gets too large, the drone assumes it is too close to the tree and sets Δḡ = 0, hovering in place. If the mask goes to zero, the drone stops translating and commands a 10° yaw turn until the orange comes back into view. This is a learned planner with two hand-written guardrails.
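A sketch of those guardrails, assuming the mask arrives as a binary array and a command is a (translation increment, yaw increment) pair; the 10° yaw step is from the text, the mask-area threshold is a placeholder.

```python
import numpy as np

def apply_guardrails(mask: np.ndarray, predicted_delta: np.ndarray,
                     close_frac: float = 0.25):
    frac = float(mask.mean())                 # fraction of orange pixels
    if frac > close_frac:                     # mask too large: too close to tree
        return np.zeros(3), 0.0               # Δḡ = 0, hover in place
    if frac == 0.0:                           # mask empty: orange lost
        return np.zeros(3), np.deg2rad(10.0)  # stop translating, yaw 10°
    return predicted_delta, 0.0               # otherwise trust the network
```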
The per-waypoint error in centimetres is comparable, but that does not translate directly to mission success. Small waypoint errors compound through the closed loop, and the real tree is less forgiving than the simulator. The thesis points to Sim2Real transfer and better world models as the next fixes.