A visual reading of Perception Based UAV Path Planning for Fruit Harvesting

CHAPTER 3 · Learning to plan

Replace the brittle geometric handoff with a network that predicts short-horizon motion from pixels. Keep the old segmenter as attention. Train the rest by imitating a privileged optimal-control teacher.

§3.1 · Why end-to-end

Chapter 2's pipeline has a clean failure mode: it only knows what the segmenter sees in the current frame. If the orange is hidden behind leaves or behind the tree at the start, the system has no target point, no fitted foliage plane, and no goal pose to hand to the planner.

Chapter 3 keeps the useful visual prior — "this white blob is the orange mask" — but removes the hand-built geometry between perception and planning. The network sees an RGB image plus the mask and predicts three future 6-DOF pose increments. The hope is that it learns the latent visual structure Chapter 2 never represented: where fruit tends to sit in the canopy, which side of the tree is likely clear, and when a search arc is better than a direct approach.

Two ways to teach a robot

Reinforcement learning would let the drone try things and learn from reward, but real crashes are expensive and the sample efficiency is poor. Imitation learning is more practical here: generate expert trajectories, then train the network to mimic them. The catch is that the expert must be good. In simulation, the expert is DDP with full knowledge of the tree, orange, and drone state.

§3.2 · The closed loop

FIG. 3.1 · END-TO-END PLANNING LOOP · §3.2: ResNet8 takes two inputs, the camera RGB image (3 ch) and the U-Net mask (1 ch), concatenated to 4 channels; the predictor outputs three pose deltas (Δḡ₁ at +0.5 s, Δḡ₂ at +1.0 s, Δḡ₃ at +1.5 s; 3 waypoints × 6 DoF × 100 bins = 1800 logits), the DDP smoother produces a feasible x*(t), and the PD tracker / DJI autopilot follows it; each new camera frame restarts the 1.5 s plan.
FIG. 3.1 The camera output forks. The mask branch goes through U-Net; the raw RGB goes straight ahead. They are concatenated into a 4-channel tensor that the ResNet8 predictor consumes — depth is not a direct input here (Chapter 2 used it only for the geometric projection). The predictor emits three pose increments, DDP turns them into a feasible reference trajectory, and the autopilot tracks it until the next image arrives.

This is not a full rewrite of the stack. The orange segmenter survives from Chapter 2 and becomes an attention channel. What disappears is the hand-engineered middle: no pinhole projection to a mean orange point, no RANSAC plane, no single goal pose contract.
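A minimal sketch of one pass through the Fig. 3.1 loop. The component objects (camera, unet, resnet8, ddp, tracker) and their method names are hypothetical stand-ins, not the thesis's actual code:

    import numpy as np

    def plan_step(camera, unet, resnet8, ddp, tracker, pose):
        """One pass of the Fig. 3.1 loop; a new camera frame restarts the 1.5 s plan."""
        rgb = camera.grab_frame()                   # H x W x 3
        mask = unet.segment(rgb)                    # H x W x 1 binary orange mask
        x = np.concatenate([rgb, mask], axis=-1)    # 4-channel tensor for the predictor
        deltas = resnet8.predict_deltas(x)          # (3, 6): pose increments at +0.5, +1.0, +1.5 s
        waypoints = [pose + d for d in deltas]      # increments applied to the current 6-DoF pose
        reference = ddp.smooth(pose, waypoints)     # dynamically feasible x*(t) over the 1.5 s horizon
        tracker.follow(reference)                   # PD tracker feeds the autopilot until the next frame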

§3.3 · The network

The predictor is a small ResNet8, following Kaufmann et al.'s drone-racing architecture. It is deliberately small: one 32-channel convolution, two residual blocks, two fully connected layers, and an 1800-logit classification head.

FIG. 3.2 · RESNET8 WAYPOINT PREDICTOR · §3.2.1.1: RGB + mask (4 channels) → 32-channel conv → residual block, 48 ch (C(R(C(R(x)))) + C(x)) → residual block, 72 ch → FC 8192 → FC 4096 → 1800 logits → softmax. Output layout: 3 waypoints × 6 DoF (x, y, z, yaw, pitch, roll) × 100 bins per DoF.
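A hedged PyTorch sketch of a predictor with the Fig. 3.2 shape (4-channel input, one 32-channel conv, residual blocks at 48 and 72 channels, FC 8192, FC 4096, 1800 logits). Kernel sizes, strides, and the exact residual wiring are assumptions, not the thesis's implementation:

    import torch.nn as nn

    class ResBlock(nn.Module):
        """Residual block after Fig. 3.2: C(R(C(R(x)))) + C(x); kernels and strides assumed."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.main = nn.Sequential(
                nn.ReLU(), nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.ReLU(), nn.Conv2d(c_out, c_out, 3, stride=1, padding=1),
            )
            self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

        def forward(self, x):
            return self.main(x) + self.skip(x)

    class ResNet8Predictor(nn.Module):
        """4-channel RGB + mask in, 1800 logits out (3 waypoints x 6 DoF x 100 bins)."""
        def __init__(self):
            super().__init__()
            self.stem = nn.Conv2d(4, 32, 5, stride=2, padding=2)   # the single 32-channel conv
            self.blocks = nn.Sequential(ResBlock(32, 48), ResBlock(48, 72))
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.LazyLinear(8192), nn.ReLU(),      # FC 8192; flattened size depends on resolution
                nn.Linear(8192, 4096), nn.ReLU(),    # FC 4096
                nn.Linear(4096, 1800),               # 18 independent 100-way classification heads
            )

        def forward(self, x):                        # x: (B, 4, H, W)
            logits = self.head(self.blocks(self.stem(x)))
            return logits.view(-1, 3, 6, 100)        # softmax / cross-entropy over the last dim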

§3.4 · Action as classification

The network does not regress to a continuous waypoint coordinate. Each axis is discretised into 100 bins, and the network does a softmax over those bins. Six axes times three waypoints gives 18 independent 100-class predictions: 1800 logits total.

This choice comes from Kim et al.'s eye-surgery work. A regression head with squared error can collapse to the mean of the dataset, especially when multiple trajectories are plausible. A classification head only has to put probability mass on the right bin.

FIG. 3.3 · CLASSIFICATION-BASED WAYPOINT GENERATION · §3.2.1.2: regression returns a number; the thesis predicts distributions over action bins. Panel A, continuous regression (one coordinate per DoF): squared error can average two valid paths into a mean prediction that commits to neither. Panel B, one waypoint as six 100-bin softmaxes: each axis (x, y, z, yaw, pitch, roll) is a 1D probability histogram over bins 0..99; with Δt = 0.5 s and N_w = 3 future waypoints, 3 waypoints × 6 DoF × 100 bins = 1800 logits, with p_bin = ((p_t − min) / (max − min)) × 100.
FIG. 3.3 Continuous coordinate regression versus binned classification. The bin index comes from p_bin = ((p_t − min) / (max − min)) × 100; at inference, the selected bin maps back to a coordinate.
FIG. 3.3B · BIN PROBABILITY CLOUD · after thesis Fig. 3.2: the network does not emit a point; it emits probability mass around each future waypoint (Δḡ₁ at +0.5 s, Δḡ₂ at +1.0 s, Δḡ₃ at +1.5 s). Each DoF is trained with cross-entropy against the expert's true bin; inference chooses a bin, then maps it back to a coordinate.
FIG. 3.3B The thesis figure is not just "100 bins." It is a probability cloud over discretised pose coordinates for each future pose increment. The red plus is the expert bin; opacity is predicted probability mass.
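A small pure-Python sketch of the bin mapping from Fig. 3.3, using p_bin = ((p_t − min)/(max − min)) × 100; the ±1 m axis range in the example is a made-up placeholder, not a thesis value:

    N_BINS = 100

    def to_bin(p, lo, hi):
        """Continuous coordinate to bin index, after the Fig. 3.3 formula."""
        idx = int((p - lo) / (hi - lo) * N_BINS)
        return min(max(idx, 0), N_BINS - 1)        # clamp to a valid bin

    def from_bin(idx, lo, hi):
        """Bin index back to a bin-centre coordinate, used after the argmax at inference."""
        return lo + (idx + 0.5) / N_BINS * (hi - lo)

    b = to_bin(0.37, -1.0, 1.0)    # assumed [-1 m, +1 m] range for an x offset: bin 68
    x = from_bin(b, -1.0, 1.0)     # 0.37 m, quantised to the bin centre

Each of the 18 axes (3 waypoints × 6 DoF) gets its own 100-way softmax, so the training target is simply the expert's bin index per axis.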
Dataset trick

The thesis doubles the supervised dataset with a symmetry trick. Flip the camera image about the vertical axis, then mirror the future waypoints about the UAV's local x-z plane. A fruit on the top-left becomes a fruit on the top-right, and the action labels flip with it.
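A sketch of that mirroring trick, assuming the label array is (3, 6) in (x, y, z, yaw, pitch, roll) order and that y, yaw, and roll are the components that change sign under a left-right mirror; the sign conventions are assumptions, not the thesis's:

    import numpy as np

    def mirror_example(rgb, mask, waypoints):
        """Flip the image about the vertical axis and mirror the (3, 6) waypoint labels
        about the UAV's local x-z plane (assumed: y lateral, yaw/roll flip sign)."""
        rgb_m = rgb[:, ::-1, :].copy()        # horizontal flip, H x W x 3
        mask_m = mask[:, ::-1].copy()
        wp_m = waypoints.copy()               # columns: x, y, z, yaw, pitch, roll
        wp_m[:, 1] *= -1.0                    # lateral offset mirrors
        wp_m[:, 3] *= -1.0                    # yaw mirrors
        wp_m[:, 5] *= -1.0                    # roll mirrors
        return rgb_m, mask_m, wp_m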

For the LLM person

This is the same broad move as tokenizing a continuous output. Decoder LLMs predict the next token over a fixed vocabulary; here the network predicts the next waypoint over a fixed bin grid. The useful analogy is not that this thesis is an LLM, but that classification gives the model a sharper training signal than direct continuous regression.

§3.5 · The MPC teacher

You cannot train an imitation network without an expert. The thesis builds one inside Unity: with full ground-truth knowledge of the tree, the orange, and the drone, a Differential Dynamic Programming solver computes optimal trajectories.

DDP balances state error, control effort, obstacle avoidance, and yaw alignment:

J = ½ dx_fᵀ Q_f dx_f + Σ_i dt [ ½ dx_iᵀ Q dx_i + ½ u_iᵀ R u_i + b·c(x_i) + y(x_i) ]   (EQ 3.1)

The four terms, in plain language:

  1. State error: ½ dxᵀ Q dx along the trajectory and ½ dx_fᵀ Q_f dx_f at the end, penalising deviation from the goal state.
  2. Control effort: ½ uᵀ R u, preferring small, smooth control inputs.
  3. Obstacle avoidance: b·c(x), a penalty for coming close to the tree, with the gain b ramped up over the solve.
  4. Yaw alignment: y(x), keeping the drone's heading aligned with the approach to the fruit.
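A numpy sketch of evaluating Eq. 3.1 for one candidate trajectory; the obstacle term c(x) and yaw term y(x) are passed in as callables, since this chapter summary describes them only in words:

    import numpy as np

    def trajectory_cost(xs, us, x_goal, Q, R, Qf, b, dt, obstacle_cost, yaw_cost):
        """Eq. 3.1: terminal state error plus running state error, control effort,
        obstacle penalty (weighted by gain b), and yaw alignment."""
        dxf = xs[-1] - x_goal
        J = 0.5 * dxf @ Qf @ dxf
        for x, u in zip(xs[:-1], us):             # xs has one more entry than us
            dx = x - x_goal
            J += dt * (0.5 * dx @ Q @ dx
                       + 0.5 * u @ R @ u
                       + b * obstacle_cost(x)
                       + yaw_cost(x))
        return J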

DDP detail

This is the expert's privileged information in concrete form: it knows the tree cylinder, the ground threshold, the orange position, and the drone state. The student never receives those variables. It only sees the RGB image and mask, then learns to imitate the actions this cost function would have produced.

FIG. 3.4 · TEACHER-STUDENT IMITATION · §3.3.2: the teacher (privileged DDP) receives the tree pose and cylinder radius, the orange position, and the drone's initial state, and optimises cost J (with the obstacle gain ramp) to produce expert trajectories in Unity, 300 trajectories × 500 steps = 150K samples; the student (ResNet8) receives only the RGB camera frame and the segmentation mask, with no tree or orange ground truth, learns by cross-entropy against the teacher's bins, predicts three future pose increments, and is deployed closed-loop on images.
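A sketch of one distillation step implied by Fig. 3.4: the student's 1800 logits are reshaped into 18 independent 100-way classifications and trained with cross-entropy against the teacher's bin labels. The batch layout, dtypes, and optimiser are assumptions:

    import torch.nn.functional as F

    def distill_step(model, batch, optimizer):
        """One supervised step: student logits vs privileged-DDP bin labels."""
        images = batch["rgb_mask"]            # (B, 4, H, W) RGB + U-Net mask
        target_bins = batch["expert_bins"]    # (B, 3, 6) long tensor of teacher bins
        logits = model(images)                # (B, 3, 6, 100)
        loss = F.cross_entropy(logits.reshape(-1, 100), target_bins.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()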
For the LLM person

This is supervised fine-tuning where the labels are produced offline by a privileged model. The teacher has access to the world; the student only has access to a camera. "Stronger" here means "has better inputs," not "has more parameters."

§3.6 · DAgger

Pure imitation has a distribution-shift problem. At inference time the student drifts away from expert states, and then every subsequent input is a situation the student did not see during training.

DAgger fixes this by letting the student fly in simulation, asking the expert what it would have done from those student-visited states, adding the difficult image-waypoint pairs to the dataset, and retraining. In this thesis, an example is added when the prediction error exceeds ε; over time ε decays to 2.5 cm.

FIG. 3.5 · DAGGER (DATASET AGGREGATION) LOOP · §3.3.4: 1, the student flies in simulation; 2, its predicted waypoints are compared with the expert's; 3, examples with distance d > ε are aggregated into the dataset; 4, retrain. Repeat while ε decays to its final value of 2.5 cm. The distance metric sums, over the future waypoints, the Euclidean position error plus a forward-axis rotation mismatch. The student's own mistakes become part of the next training set.
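A schematic of one DAgger round from Fig. 3.5. The distance metric follows the figure's description (summed position error plus a forward-axis rotation mismatch, approximated here by a yaw difference); the simulator and expert interfaces and the decay factor are placeholders:

    import numpy as np

    def waypoint_distance(pred_wp, expert_wp):
        """Sum over the 3 future waypoints: Euclidean position error plus a
        forward-axis rotation mismatch (approximated here by yaw difference)."""
        pos_err = np.linalg.norm(pred_wp[:, :3] - expert_wp[:, :3], axis=1).sum()
        rot_err = np.abs(pred_wp[:, 3] - expert_wp[:, 3]).sum()
        return pos_err + rot_err

    def dagger_round(student, expert, simulator, dataset, eps):
        """Roll out the student, relabel its states with the expert, keep the hard cases."""
        for rollout in simulator.fly(student):
            for obs, pred_wp, state in rollout:
                expert_wp = expert.query(state)           # privileged DDP answer
                if waypoint_distance(pred_wp, expert_wp) > eps:
                    dataset.append((obs, expert_wp))      # the student's mistake joins the data
        student.retrain(dataset)
        return max(eps * 0.5, 0.025)                      # assumed decay schedule toward 2.5 cm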

§3.7 · Emergent exploration

The interesting result is not the average success rate. It is the policy that appears when the orange is not immediately visible.

The simulated training data contains reachable oranges. But when the trained network is tested on a tree with no orange present, it does not simply stall; it circles the tree. The thesis calls this emergent exploratory behavior. The loss never says "search," but the policy has learned that moving around a fruit-tree-shaped object is a useful way to make hidden fruit visible.

FIG. 3.6 · EMERGENT EXPLORATION · top-down trajectories, after thesis Fig. 3.4: panel A, head-on with the orange visible, a direct commit from x₀; panel B, side-on with the fruit initially hidden, a search arc and then a commit; panel C, no orange present, the policy loops around the canopy.
FIG. 3.6 Head-on cases produce direct paths. Side-on or hidden-fruit cases produce search arcs. With no orange present, the policy still circles the tree, demonstrating the latent search behavior the thesis emphasizes.

Simulation success by scenario: 85% head-on, 70% hard cases, 77.5% overall. The occlusion test is harsher: each 10% occlusion bin gets 25 runs, and the 50–60% bin drops to 40% success.

FIG. 3.7 · SUCCESS VS OCCLUSION FRACTION · §3.3.6: 25 simulation runs per 10% occlusion bin, after thesis Fig. 3.5. Success by occlusion bin: 0–10%: 100 · 10–20%: 90 · 20–30%: 80 · 30–40%: 85 · 40–50%: 75 · 50–60%: 40.
FIG. 3.7 The trend is "more occlusion hurts," not a clean line. The 30–40% bin is slightly better than 20–30%, but the cliff at 50–60% is real: only 40% success.

§3.8 · Sim → real

When the same architecture is trained on real-world data and flown on the actual drone, it still does not match the baseline's grasp reliability. The thesis reports final pose error over 71 flights instead of a pick-success rate. Two things make the real setting harder:

  1. The real tree is visually messier than Unity: shadows, leaf texture, lighting changes, and depth noise.
  2. Real expert trajectories are hand-carried demonstrations, not DDP-optimal trajectories. They are noisier and less consistent.
Real-flight failsafes

The deployed real-world policy still leans on the segmentation network for safety. If the orange mask gets too large, the drone assumes it is too close to the tree and sets Δḡ = 0, hovering in place. If the mask goes to zero, the drone stops translating and commands a 10° yaw turn until the orange comes back into view. This is a learned planner with two hand-written guardrails.
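A sketch of those two guardrails layered on top of the learned planner; the mask-area threshold and the exact way the yaw command is injected are assumptions, while the zero-delta hover and the 10° yaw search come from the text above:

    import numpy as np

    def apply_failsafes(mask, deltas, close_area_frac=0.25):
        """Override the learned plan with the two hand-written guardrails.
        close_area_frac is an assumed threshold, not a thesis value."""
        area = mask.mean()                    # fraction of pixels that are orange
        if area > close_area_frac:
            return np.zeros_like(deltas)      # mask too large: assume too close, hover (delta-g = 0)
        if area == 0.0:
            search = np.zeros_like(deltas)
            search[:, 3] = np.deg2rad(10.0)   # orange lost: stop translating, command a 10 deg yaw
            return search
        return deltas                         # otherwise trust the learned waypoints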

FIG. 3.8 · SIM AND REAL WAYPOINT ERROR · §3.4.3: sim-trained (300 trajectories, 150K samples, 77.5% sim success): Δḡ₁ 5.1 cm, Δḡ₂ 4.3 cm, Δḡ₃ 4.1 cm; real-trained (314 hand-carried demos, 71 flights): Δḡ₁ 4.8 cm, Δḡ₂ 4.6 cm, Δḡ₃ 4.4 cm.

The per-waypoint error in centimetres is comparable, but that does not translate directly to mission success. Small waypoint errors compound through the closed loop, and the real tree is less forgiving than the simulator. The thesis points to Sim2Real transfer and better world models as the next fixes.
