A visual reading of Perception Based UAV Path Planning for Fruit Harvesting

CHAPTER 1 · The problem

A drone with an arm, a fake orange tree, a magnetic gripper, and the question every autonomous robot eventually has to answer: where is my goal, and how do I get there without running into anything?

§1.1 · The scene

The lab at JHU is set up with an artificial orange tree (one with detachable, magnet-equipped oranges that can be relocated between trials) and a quadcopter that starts 2–4 metres away at a height of 0.75–2.5 m. The orange's location, and the angle from which it can be plucked, varies every run — the relative yaw between drone and best approach swings ±75°. The drone has to find the orange, decide which way to come in, fly there without crashing into the tree, and snap the magnet onto the orange.

[FIG. 1.1 · TASK SETUP: diagram of the start annulus (2–4 m out, 0.75–2.5 m up), the artificial tree (trunk, branches, discrete leaf clusters), the desired trajectory x̄(·), and the ±75° approach-yaw range; orange visible but partly occluded at t₀.]
FIG. 1.1 Test setup, after Kothiyal §2.3.1. The drone starts somewhere on a 2–4 m annulus around the tree at 0.75–2.5 m altitude, with the orange visible somewhere on the foliage. The relative yaw between the drone's initial heading and the best approach to the orange ranges from −75° to +75°. In Chapter 2 the orange is always at least partially in view at t₀. In Chapter 3 the trained policy can also handle starting with the orange behind a leaf, behind the tree, or — emergently — not present at all.

§1.2 · Anatomy of the apparatus

The drone is a DJI Matrice 100 (which still uses its built-in flight controller for low-level stabilisation), instrumented with three things the thesis cares about: a vision sensor, a compute brain, and an actuator.

[FIG. 1.2 · APPARATUS, after §2.3.1: DJI Matrice 100 (4 BLDC rotors; DJI autopilot stabilises low-level flight) carrying an Intel NUC (ROS onboard, GPU inference offboard), a RealSense D435i (RGB + IR stereo depth, 0.3–10 m range), and a fixed 3D-printed arm (r constant, ṙ = 0) ending in a passive magnetic gripper with LSM303-based attachment detection.]
FIG. 1.2 The aerial manipulator. Note the arm joint is fixed (ṙ = 0) — the drone flies the gripper into position rather than moving an articulated arm. This simplification lets the trajectory generator treat the whole vehicle as a single rigid body.
Why this matters for the algorithm

The fixed arm is a modelling choice. Because ṙ = 0, the end-effector pose g_e(q) = g · φ(r) is just a fixed offset from the body frame. So every "where is the gripper" question collapses to "where is the body, plus a constant transform." This is what makes the rest of the math tractable in 67 pages.
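To make that collapse concrete, here is a minimal NumPy sketch. The 0.4 m offset along x_b is an invented placeholder (the thesis fixes its own φ(r)), but the shape of the computation is exactly the point: one constant 4×4 transform.

```python
import numpy as np

def se3(R, p):
    """Pack a rotation matrix R (3x3) and position p (3,) into a 4x4 pose."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = p
    return g

# Hypothetical constant arm transform phi(r): the gripper sits 0.4 m ahead
# of the body origin along x_b with no relative rotation. The offset value
# is an assumption for illustration, not a number from the thesis.
PHI_R = se3(np.eye(3), np.array([0.4, 0.0, 0.0]))

def end_effector_pose(g_body):
    """g_e(q) = g · phi(r): with r fixed, this is one matrix multiply."""
    return g_body @ PHI_R
```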

§1.3 · Writing the robot down

Before any planner can do anything useful, you have to commit to a vocabulary for the state. The thesis uses the standard rigid-body description for a quadrotor:

x = (p, R, v, ω) ∈ ℝ³ × SO(3) × ℝ³ × ℝ³   (EQ 1.1)

Position p in the world frame, orientation R as a rotation matrix, body-fixed linear velocity v, body-fixed angular velocity ω. The pair (R, p) packs into a single 4×4 pose matrix g ∈ SE(3):

[FIG. 1.3 · STATE & DYNAMICS, after §1.2: the pose packs R (3×3 orientation) and p (3×1 position) into g = [R, p; 0 0 0 1] ∈ SE(3); kinematics ġ = g·V̂; dynamics M(q)V̇ + b(x) = B(q)·u; control input u ∈ ℝ⁴ (three body torques plus one thrust along the local ẑ_b); end-effector pose g_e(q) = g·φ(r), a fixed offset.]

The dynamics ẋ = f(x, u) are derived in Kobilarov 2014; the upshot is that the system is differentially flat (you can recover the controls from a sufficiently smooth trajectory), which is what makes the minimum-snap polynomial planner in Chapter 2 work.
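A one-step illustration of what flatness buys you, in the standard quadrotor form (this is the textbook computation, not Kobilarov's exact derivation, and MASS is an assumed placeholder): given the world-frame acceleration of a sufficiently smooth trajectory, the collective thrust and the body z-axis fall out algebraically.

```python
import numpy as np

G = 9.81    # gravity, m/s^2
MASS = 2.4  # assumed vehicle mass in kg; the real Matrice 100 figure may differ

def flat_outputs_to_thrust(accel_world):
    """Standard quadrotor flatness step: from the desired world-frame
    acceleration along a smooth trajectory, recover the collective thrust
    magnitude and the desired body z-axis. Full attitude and the body
    torques follow from higher derivatives of the trajectory."""
    f_vec = MASS * (accel_world + np.array([0.0, 0.0, G]))  # required force
    thrust = np.linalg.norm(f_vec)  # scalar thrust along z_b
    z_b = f_vec / thrust            # desired body z-axis in the world frame
    return thrust, z_b
```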

A planner that wanted full information would need three things up front: the desired final pose ḡ(t_f), a desired trajectory x̄(·), and the obstacle set E ⊂ ℝ³. None of these are available a priori. The drone gets a camera and a depth sensor and has to estimate them on the fly. That's the whole thesis.

min_{u(·)}  ∫_{t₀}^{t_f} ½‖x(t) − x̄(t)‖²_Q + ½‖u(t)‖²_R dt
subject to  ẋ = f(x, u)           (dynamics)
            A(q(t)) ∩ E = ∅ ∀t    (avoid collisions)   (EQ 1.3 · optimal control)
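As a sanity check on what EQ 1.3 is scoring, here is a minimal discretised version of the cost functional, a plain Riemann sum with the collision constraint left to the planners in later chapters. The array shapes and the dt argument are assumptions of this sketch:

```python
import numpy as np

def trajectory_cost(xs, us, x_bar, Q, R, dt):
    """Riemann-sum version of EQ 1.3's integrand:
    sum_k [ 1/2 (x_k - x̄_k)^T Q (x_k - x̄_k) + 1/2 u_k^T R u_k ] * dt,
    where xs is (N, n) states, us is (N, m) controls, x_bar is the
    (N, n) reference trajectory sampled at the same instants."""
    err = xs - x_bar
    state_term = 0.5 * np.einsum('ki,ij,kj->k', err, Q, err)
    ctrl_term = 0.5 * np.einsum('ki,ij,kj->k', us, R, us)
    return float(np.sum(state_term + ctrl_term) * dt)
```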

§1.4 · Two systems, one task

The thesis tackles the same problem with two architectures. Chapter 2 keeps perception and planning as separate modules that communicate through a clean interface (the goal pose). Chapter 3 fuses them: a single neural network looks at pixels and emits future waypoints directly.

[FIG. 1.4 · DECOUPLED VS END-TO-END PIPELINES. Chapter 2, decoupled visual servoing (the interface between modules is a single 6-DOF goal pose): RGB-D RealSense → U-Net mask → pinhole + depth → p_orange → RANSAC plane fit → best yaw → min-snap trajectory → PD tracker, replanning at 8–9 Hz while the orange remains visible. Chapter 3, end-to-end imitation (the network learns visual features that imply where to fly next): RGB + mask concatenated → ResNet8 → 1800 logits (3 waypoints × 6 DOF × 100 bins) → DDP smoothing against the dynamics → PD tracking → DJI autopilot, closed loop with three 0.5 s waypoint increments.]
FIG. 1.4 Two architectures, side by side. The decoupled stack passes only a goal pose between perception and planning — a clean, low-bandwidth interface that's easy to debug. The end-to-end stack lets the network learn its own internal representation, which means it can pick up cues the engineer never thought to expose, but you can't easily inspect what it's doing.
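To pin down what "1800 logits" means in the end-to-end stack: the head is a binned classification over 3 waypoints × 6 DOF × 100 bins, so decoding is an argmax per DOF followed by de-binning. A minimal sketch, with per-DOF ranges invented for illustration (the thesis fixes its own discretisation):

```python
import numpy as np

N_WAYPOINTS, N_DOF, N_BINS = 3, 6, 100  # 3 × 6 × 100 = 1800 logits

# Per-DOF (lo, hi) ranges used to map a bin index back to a continuous
# value. These bounds are placeholders, not the thesis's actual ranges.
DOF_RANGES = np.array([[-1.0, 1.0]] * N_DOF)

def decode_waypoints(logits):
    """Turn the network's 1800 logits into a (3, 6) waypoint array by
    taking the argmax bin per waypoint per DOF, then mapping each bin
    index to the centre of its interval."""
    bins = logits.reshape(N_WAYPOINTS, N_DOF, N_BINS).argmax(axis=-1)
    lo, hi = DOF_RANGES[:, 0], DOF_RANGES[:, 1]
    return lo + (bins + 0.5) / N_BINS * (hi - lo)
```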

Decoupled · Ch 2

  • Reliable when goal is in view
  • Each module independently testable
  • No training data needed at the planner
  • Hard fail when orange leaves view
  • RANSAC plane fit oscillates near sparse foliage
  • No notion of "exploration"

End-to-end · Ch 3

  • Learns latent visual cues for free
  • Emergent exploration (circles tree if no orange)
  • Can handle initial occlusion
  • Needs an expert teacher and lots of data
  • Sim-to-real degrades performance
  • Fails opaquely
For the LLM person

The decoupled vs end-to-end split here is the same axis as retrieval-augmented agents vs end-to-end VLA. In retrieval, you commit to "the world is a vector store, the LLM is a planner over its outputs." In a VLA you let the model learn its own visual representations and emit actions directly. Both papers in this thesis are early-2020s analogues of that debate. Chapter 3 is closer to the modern VLA spirit; Chapter 2 is closer to the agent-with-tools approach.
