A visual reading of Perception-Based UAV Path Planning for Fruit Harvesting

Appendix A · Terminology

Every piece of jargon in the thesis, in plain English. Where it helps, I've added an LLM-side analogue — same idea, different vocabulary.

A · Hardware & flight

The physical apparatus and what each component is responsible for.

UAV Unmanned Aerial Vehicle Used throughout

Any flying robot without a human pilot on board. In this thesis it specifically means a quadcopter — four propellers, X-frame, hovering capable. "Drone" is the colloquial synonym.

Quadcopter / Quadrotor §1, §2.3.1

A four-rotor flying vehicle. Two rotors spin clockwise, two counter-clockwise; balancing their thrusts and torques produces movement in any direction. The DJI Matrice 100 used here is a development-platform quadrotor with an open SDK.

Aerial manipulator §1, §1.2

A flying robot with an arm. Combines the mobility of a UAV with the ability to interact with objects. In this thesis the "arm" is just a fixed rod with a gripper — it doesn't articulate, which simplifies the dynamics enormously.

End-effector §1.2 onward

The business end of a robot — whatever does the actual interaction with the world. Here it's the magnetic gripper at the tip of the arm. Mathematically it's a fixed pose offset from the drone body.

Gripper §2.3.1

The "hand." This thesis uses a passive magnetic gripper — no motors, just a strong magnet that latches when it gets close enough to the magnet on the orange. Cheap, reliable, no actuation complexity, but only works for objects you've prepared with magnets.

BLDC motor Brushless Direct Current §1.2 (implicit)

The motors that spin the rotors. Brushless = electronically commutated, more efficient than brushed motors. Drones use BLDCs for high power density and reliability.

IMU Inertial Measurement Unit §1.2 (implicit, in DJI autopilot)

A small chip combining accelerometers and gyroscopes. Reports linear acceleration and angular velocity. Used by the flight controller for attitude stabilisation; not directly used in this thesis, but it's running in the background to keep the drone level.

RGB-D camera §2.2 onward

A camera that returns both colour (RGB) and depth (D) for every pixel. The thesis uses the Intel RealSense D435i, which gets depth from active stereo: it projects an invisible IR pattern, sees how the pattern is distorted by the geometry, and triangulates.

Depth map §2.2.2

An image where each pixel is a distance, not a colour. Same dimensions as the RGB image, aligned per-pixel. Crucial for going from 2D pixel coordinates back to 3D world coordinates (see pinhole camera model).

Point cloud §2.2.2

A set of 3D points, usually obtained by projecting a depth map into space using camera intrinsics. The "shape of the world that the camera sees, in metres."
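That projection is short enough to write out. A minimal numpy sketch of depth-map back-projection, using generic intrinsics names (fx, fy, cx, cy) rather than the thesis's notation:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (metres) into an N x 3 point cloud in
    the camera frame via the pinhole model. Illustrative sketch; the
    parameter names are generic, not the thesis's symbols."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                           # undo the projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                       # drop zero-depth pixels
```

Each pixel (u, v) with depth z maps to ((u − cx)·z/fx, (v − cy)·z/fy, z) in the camera frame; the pixel at the principal point lands on the optical axis.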

OptiTrack / motion capture §2.3.1

An infrared multi-camera system that tracks reflective markers stuck on the drone, returning sub-millimetre position estimates. Standard in robotics labs as a "ground truth" odometry source. It's not how a real-world drone would localise itself outdoors — only used here because the lab environment allows it.

Odometry §2.3.1

The robot's estimate of its own position and orientation over time. Could come from wheel encoders (ground robots), GPS, IMU integration, visual-inertial estimation, or — as here — external mocap. The thesis uses OptiTrack because it's high-accuracy and the test environment is small enough.

ROS Robot Operating System §2.3.1

Not really an OS — a middleware framework where each algorithm runs as a "node," communicating via typed message topics. The thesis runs U-Net, projection, RANSAC, and trajectory generation as separate ROS nodes. ROS gives you free pub/sub, transforms, visualisation (RViz), and a huge ecosystem of pre-built nodes.

For the LLM person Think of it as a single-machine version of a service mesh: nodes publish to typed topics, others subscribe. The clean module boundaries in chapter 2 are largely because ROS encourages this style.

B · Computer vision

Tools for turning images into structured information.

Semantic segmentation §2.2.1

Per-pixel classification: every pixel in an image gets a label ("orange," "leaf," "sky"). Different from object detection, which gives bounding boxes; closer to "drawing inside the lines" of every object. The thesis only has one class (orange vs not-orange) so it's binary segmentation.

Mask §2.2.1

The output of a segmentation network — a same-shape array of 0s and 1s (or class labels) over the input image. Tells you which pixels belong to the object you care about.

CNN Convolutional Neural Network §1, §2

A neural network whose layers slide small filters across the input — the standard tool for image-shaped data. Filters detect patterns (edges, then textures, then objects) at increasing scales of abstraction. CNNs were the dominant vision architecture from ~2012 to ~2020 before transformers caught up.

U-Net §2.2.1

A specific CNN architecture for segmentation. Shape: a downsampling path (encoder) followed by a symmetric upsampling path (decoder), with skip connections jumping across that "U" shape. Skip connections feed early high-resolution features into the decoder so fine details aren't lost. Originally introduced for biomedical image segmentation in 2015.

Skip connection §2.2.1, §3.2.1.1

A direct wire that shortcuts past several layers of a network. Helps gradients flow during training (the original motivation in ResNet), and helps preserve information that would otherwise be destroyed by downsampling (the motivation in U-Net).

ResNet Residual Network §3.2.1.1

A CNN family that uses skip connections to make very deep networks trainable. Each "block" adds the input of the block to its output: y = F(x) + x. ResNet8 is just a tiny version with 8 layers — the thesis uses it because the prediction task has limited training data.

IoU Intersection over Union §2.2.1

The standard metric for segmentation quality. IoU = |A ∩ B| / |A ∪ B| — how much your predicted mask overlaps with the ground-truth mask, divided by their combined area. 1.0 = perfect, 0 = no overlap. The thesis hits 93.49%, which for a single-class network on a clean dataset is solid.
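The metric is a few lines of numpy; a minimal sketch:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks (arrays of 0/1)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # empty-vs-empty counts as perfect
```

A prediction that covers the ground truth plus an equal area of extra pixels scores 0.5, which is why IoU punishes over-segmentation as hard as under-segmentation.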

COCO Common Objects in Context §2.1

A widely-used image dataset with per-instance segmentation labels for 80 common object classes. Standard "pretraining" target for vision networks. Mentioned here because U-Nets are often trained on COCO before fine-tuning to a specific task.

Bounding box §2.1

A rectangle around an object in an image. Used in object detection (faster but less precise than segmentation). The thesis discusses bounding-box methods (R-CNN) only to explain why it picked segmentation instead.

C · 3D geometry & projection

How pixels relate to physical metres.

Pinhole camera model §2.2.2

An idealised camera where all light passes through a single infinitesimal hole and lands on a flat sensor at distance f (the focal length). This model is what lets you write down the equation that maps 3D world points to 2D pixels — the foundation of every projection-based 3D reconstruction.

Focal length (f) §2.2.2

Distance between the pinhole and the image plane. Bigger f = narrower field of view (telephoto). Smaller f = wider field of view (wide-angle). Specified by the camera manufacturer; the thesis uses the values RealSense reports.

Intrinsic matrix (M_int) §2.2.2 eq 2.2

A 3×4 matrix that encodes the camera's internal geometry: focal lengths f_x, f_y, principal point offsets o_x, o_y. Multiply a 3D point in camera frame by this matrix to get the (homogeneous) pixel coordinate. Stored once per camera; doesn't change as the camera moves.
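The projection the intrinsic matrix encodes, written out as a numpy sketch (generic pinhole symbols, not the thesis's exact eq 2.2):

```python
import numpy as np

def project(point_cam, fx, fy, cx, cy):
    """Map a 3D point in the camera frame to a pixel (u, v) via the
    pinhole model: scale by depth, shift by the principal point."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)

def project_matrix(point_cam, fx, fy, cx, cy):
    """Same map in matrix form: homogeneous pixel = M_int @ [x, y, z, 1]."""
    M_int = np.array([[fx, 0., cx, 0.],
                      [0., fy, cy, 0.],
                      [0., 0., 1., 0.]])
    uvw = M_int @ np.append(point_cam, 1.0)
    return (uvw[0] / uvw[2], uvw[1] / uvw[2])   # divide out depth
```

A point on the optical axis lands exactly on the principal point (cx, cy); the division by z is what makes distant objects smaller.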

Extrinsic matrix implied throughout §2

The 4×4 matrix that describes where the camera is and which way it points, relative to some world frame. Combined with the intrinsics, you can map any world point to a pixel. The thesis treats the camera-to-body extrinsic as a known constant (mounted in a fixed position on the drone).

Image plane §2.2.2

The flat surface inside the camera where the sensor lives. World points project onto it through the pinhole. Pixel coordinates (u, v) measure positions on this plane.

Camera frame §2.2.2

A coordinate system attached to the camera: origin at the pinhole, z-axis pointing forward (the way the camera is looking). All "depths" are along this z-axis.

Body frame / World frame §1.2, §2.2.2

"Body frame" rides with the drone — y-axis sideways, z-axis up, x-axis forward. "World frame" is fixed in the room. The same physical point has different coordinates in different frames; converting between them uses the rotation matrix R and position p — i.e. the pose g.

RANSAC Random Sample Consensus §2.2.2

A robust model-fitting algorithm. Repeatedly: pick a random small subset of the data, fit a model, count how many points agree (the "consensus"). The model with the most consensus wins, and outliers are ignored. The thesis uses RANSAC to fit a plane through the foliage point cloud — without RANSAC, branches sticking out would corrupt a least-squares fit.

Plane fitting §2.2.2

Finding the plane ax + by + cz + d = 0 that best passes through a set of 3D points. The plane's normal vector (a, b, c) is the direction perpendicular to it — in this thesis, the direction the drone should fly in to "punch through" the foliage at the orange.
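Putting RANSAC and plane fitting together — a minimal sketch of the idea, not the thesis's implementation (threshold and iteration count are illustrative):

```python
import numpy as np

def ransac_plane(points, iters=200, thresh=0.02, rng=None):
    """Fit a plane to a 3D point cloud with RANSAC: repeatedly fit an
    exact plane to 3 random points, keep the model with the most
    inliers (points within `thresh` metres)."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best = 0, None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)     # normal of the sampled plane
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                       # degenerate (collinear) sample
        n /= norm
        d = -n @ p0                        # plane: n . x + d = 0
        dist = np.abs(points @ n + d)      # point-to-plane distances
        inliers = (dist < thresh).sum()
        if inliers > best_inliers:
            best_inliers, best = inliers, (n, d)
    return best
```

A least-squares fit would be dragged toward the outliers (the protruding branches); RANSAC simply never lets them vote.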

SLAM Simultaneous Localisation And Mapping §1.1

The classical robotics problem of building a map of an unknown environment while figuring out where you are inside it. The thesis specifically avoids SLAM because it doesn't scale well to large environments (the map gets too big). The whole thesis can be read as "what if we didn't need a map at all?"

D · Control & planning

How the drone actually moves.

PD controller Proportional-Derivative §2.2.3

The simplest non-trivial controller. Output is k_p · error + k_d · derivative_of_error. The "P" term pushes towards the target; the "D" term damps oscillation. Used here as the inner loop tracking the reference trajectory. PID adds an "I" (integral) term for steady-state error — not used here.
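The whole law fits in one line; the gains below are arbitrary illustrative values, not the thesis's tuned ones:

```python
def pd_control(error, error_rate, kp=2.0, kd=0.5):
    """PD law: the P term pushes toward the target, the D term
    (acting on the error's rate of change) damps the approach."""
    return kp * error + kd * error_rate
```

With a negative error rate (already closing in on the target), the D term shrinks the command — that damping is what prevents overshoot and oscillation.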

MPC Model Predictive Control §2 intro, §3.5

Solve an optimal-control problem over a short future horizon, execute the first action, replan from the new state. Lets you handle constraints (obstacles, motor limits) explicitly inside the optimisation. The "expert" in chapter 3 is essentially an MPC running offline with full knowledge.

DDP Differential Dynamic Programming §3.3.2

A specific algorithm for solving non-linear optimal control. Cousin of LQR (Linear Quadratic Regulator) for non-linear dynamics. Iterative; converges quickly when initialised well. The thesis uses DDP both to compute expert trajectories for training (chapter 3) and to smooth the network's predicted waypoints into a feasible reference at inference time.

Trajectory generation §2.2.3

Computing a sequence of states the robot should visit over time. Distinct from trajectory tracking, which is the lower-level loop that actuates motors to follow that sequence.

Minimum-snap / polynomial-snap §2 intro, §2.2.3

A trajectory parameterised as a polynomial in time, chosen so that the fourth derivative of position (the "snap") is minimised. Smooths out trajectories so motors don't have to jerk. Standard for quadrotors thanks to a 2011 Mellinger & Kumar paper.

Differential flatness §2.2.3

A property some systems have where the entire state and control history can be recovered from a sufficiently smooth trajectory of certain "flat outputs" — for a quadrotor, position and yaw. This is what makes minimum-snap trajectories actually flyable: you can smoothly differentiate the path and read off motor commands.

Visual servoing §2 title, §2.2

A control technique where the controller's feedback comes directly from camera measurements, not from a separate state estimator. Two flavours: position-based (compute a 3D goal from images, then use standard control) and image-based (drive features in the image directly to target locations). The thesis uses position-based.

Closed-loop / open-loop throughout

Closed-loop = controller uses real-time feedback (e.g. camera) to correct itself. Open-loop = controller commits to a plan and executes it blindly. Both pipelines in this thesis are closed-loop, just with different feedback (decoupled: 6-DOF goal pose @ 8–9 Hz; end-to-end: full image-to-waypoints @ image rate).

Goal pose / waypoint throughout

"Goal pose" = the final 6-DOF target the drone is trying to reach. "Waypoint" = an intermediate pose along the way. Chapter 2 communicates one goal pose between modules; chapter 3 communicates three waypoints (Δḡ₁, Δḡ₂, Δḡ₃) at 0.5 s spacing.

Receding horizon §3 intro

The strategy where you plan far into the future, but only execute the near-future portion before re-planning from the new state. Both pipelines do this — in chapter 3 the horizon is 3 × 0.5 s = 1.5 s of waypoints, but only the first ~0.5 s is executed before the network re-runs.

E · Math notation

Decoder ring for the symbols in the thesis.

SO(3) Special Orthogonal group §1.2

The set of 3D rotations. Equivalent to the set of 3×3 orthogonal matrices with determinant 1. When you see R ∈ SO(3), read "R is a rotation."

SE(3) Special Euclidean group §1.2

The set of rigid-body transformations in 3D — rotation + translation. A 4×4 matrix combining a rotation R and a position p. When you see g ∈ SE(3), read "g is a 6-DOF pose."
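In code, a pose g ∈ SE(3) is just a 4×4 block matrix — rotation in the top-left, translation in the last column. A small numpy sketch:

```python
import numpy as np

def se3(R, p):
    """Pack a rotation R (3x3) and translation p (3,) into a 4x4
    rigid-body transform g in SE(3)."""
    g = np.eye(4)
    g[:3, :3], g[:3, 3] = R, p
    return g

# Example: a 90-degree yaw (rotation about z) plus a 1 m shift in x.
Rz = np.array([[0., -1., 0.],
               [1.,  0., 0.],
               [0.,  0., 1.]])
g = se3(Rz, [1., 0., 0.])
point_world = g @ np.array([1., 0., 0., 1.])   # homogeneous point: rotate, then translate
```

Applying g to the homogeneous point [1, 0, 0, 1] rotates it to (0, 1, 0) and shifts it to (1, 1, 0) — composition of poses is just matrix multiplication of these 4×4s.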

Rotation matrix R §1.2 onward

A 3×3 matrix that, when applied to a vector, rotates it. One of several equivalent ways to encode 3D orientation; the others are quaternions, Euler angles, axis-angle. Rotation matrices are nice because composition is just matrix multiplication.

Slerp Spherical Linear Interpolation §2.3.2 eq 2.4

The right way to interpolate between two rotations. Linear interpolation of rotation matrices doesn't preserve orthogonality; slerp moves along the great-circle arc on the unit quaternion sphere. Used here in the alpha-decay smoothing of the goal pose estimate.
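A minimal quaternion slerp (w, x, y, z convention), for intuition — in practice a library routine such as scipy's `Slerp` would be the safer choice:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0, q1:
    move along the great-circle arc at constant angular speed."""
    q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
    dot = q0 @ q1
    if dot < 0:                 # q and -q are the same rotation: take the short way
        q1, dot = -q1, -dot
    if dot > 0.9995:            # nearly parallel: fall back to normalised lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)      # angle between the quaternions
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

Halfway between the identity and a 90° yaw, slerp returns exactly the 45° yaw — naive linear interpolation of rotation matrices would return a non-orthogonal matrix instead.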

Yaw / Pitch / Roll throughout

The three independent rotations of a rigid body. Yaw = rotation about vertical axis (turning left/right). Pitch = rotation about sideways axis (nose up/down). Roll = rotation about forward axis (banking). For a quadrotor, yaw is independent of position; pitch and roll directly control horizontal motion.

Body-fixed velocity (v, ω) §1.2

"Body-fixed" means the velocity is expressed in the drone's own coordinate frame, not the world's. So v is "how fast forward the drone thinks it's moving," not "how fast east." Same for angular velocity ω.

Cost function J §1.2 eq 1.3, §3.3.2

The number being minimised by an optimal-control solver. Penalises deviation from desired state, control effort, obstacle collisions, etc. Each term has a weight (Q, R, b…) you tune to trade things off. Designing J well is most of the work in classical control.

Cross-entropy loss §3.2.1.2

The standard loss for classification — measures how surprised you are at the true label given the network's predicted distribution. Equivalent to maximum-likelihood under a categorical model. Used here on each of the 18 independent 100-class softmaxes.

Softmax §3.2.1.1

The function that turns a vector of logits into a probability distribution. Exponentiate each logit, normalise so they sum to 1. Used here at the network's output to give a proper probability over the 100 bins.
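Softmax plus the 18×100 logit layout from chapter 3, as a numpy sketch:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution along the last
    axis. Subtracting the max first is the standard stability trick."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# The "1800 logits" shape: 18 independent 100-way heads, each
# normalised separately along its own 100 bins.
probs = softmax(np.zeros((18, 100)))
```

With all-zero logits every head comes out uniform (each bin at probability 0.01), and each of the 18 rows sums to 1 independently.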

F · Learning

How the network in chapter 3 actually gets trained.

Imitation learning §3 intro

Training a policy by mimicking an expert. You collect (state, action) pairs from someone (or something) doing the task correctly, and train a network to map state → action. No exploration, no reward signal, no failed attempts. The catch is you need an expert.

For the LLM person Direct analogue of supervised fine-tuning. The "expert demonstrations" are your labels.

Behaviour cloning implied background

The simplest form of imitation learning: just train on (state, action) pairs as a regression/classification problem. Famously brittle because of distribution shift — the model's small errors at inference time compound, leading it into states the expert never showed it. DAgger is the classic fix.

DAgger Dataset Aggregation §3.3.4

An algorithm that fixes behaviour cloning's compounding-error problem. Procedure: roll the student out, query the expert for what it would do at each visited state, add those new (state, expert action) pairs to the dataset, retrain. Repeat until the student stops drifting. Ross, Gordon & Bagnell, 2011.

For the LLM person This is the active-learning loop in dataset construction — collect data exactly where the model is wrong, fine-tune on it. Modern teleop pipelines use the same shape.
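The loop is short enough to write down. A skeleton where `env_rollout`, `expert`, and `train` are placeholder callables (hypothetical names, not the thesis's code):

```python
def dagger(env_rollout, expert, train, policy, rounds=3):
    """DAgger skeleton: the student drives, the expert relabels every
    visited state, the aggregated dataset keeps growing, and the
    student is retrained on all of it each round."""
    dataset = []
    for _ in range(rounds):
        states = env_rollout(policy)                  # student's own rollout
        dataset += [(s, expert(s)) for s in states]   # expert relabels them
        policy = train(dataset)                       # retrain on the aggregate
    return policy
```

The key difference from plain behaviour cloning is the first line inside the loop: states come from the *student's* rollouts, so the dataset covers exactly the states the student actually drifts into.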

Reinforcement learning §3 intro, §3.1

Training a policy via trial-and-error from a reward signal. Mentioned in this thesis only to be rejected: real-world RL on a drone means real-world crashes. Imitation learning is preferred when a working expert exists.

Sim-to-real / Sim2Real §3.4, §4.3

The whole problem of transferring a policy trained in simulation to the real world. Hard because the simulator differs from reality in countless small ways (lighting, friction, sensor noise). Common fixes: domain randomisation (train across many randomised sims), domain adaptation (align features), photorealistic rendering. The thesis flags this as the dominant unsolved issue.

Data augmentation §3.2.1.3

Synthetically expanding the training set by applying transformations that should not change the label. The thesis flips images horizontally and mirrors the corresponding waypoint labels — effectively doubling the data for free.

Train / val / test split §3.2.1.3

The standard ML hygiene practice: train on most of the data, hold out a validation set for hyperparameter tuning, and a separate test set for final evaluation that nobody touches until the end. Thesis uses 70 / 15 / 15.

Regression vs classification §3.2.1.2

Two ways to make a network predict a number. Regression: predict the number directly, train with MSE. Classification: discretise the range into bins and predict which bin, train with cross-entropy. The thesis explicitly chooses classification because regression collapses to the mean of the dataset for this kind of multimodal target.
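What "discretise into bins" means concretely — the range and bin count below are illustrative, not the thesis's values:

```python
import numpy as np

def to_bin(x, lo=-1.0, hi=1.0, n_bins=100):
    """Discretise a continuous target into one of n_bins classes:
    the classification framing of a regression problem."""
    x = np.clip(x, lo, hi - 1e-9)          # keep the top edge in the last bin
    return int((x - lo) / (hi - lo) * n_bins)

def from_bin(k, lo=-1.0, hi=1.0, n_bins=100):
    """Map a predicted bin index back to its bin-centre value."""
    return lo + (k + 0.5) * (hi - lo) / n_bins
```

The round-trip error is at most half a bin width — the price paid for a loss (cross-entropy over bins) that can represent multimodal targets instead of collapsing to their mean.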

Logits §3.2.1.1

The raw outputs of a classification network before softmax. The "1800 logits" mentioned in chapter 3 means 1800 unnormalised scores, organised as 18 groups of 100, each group fed through softmax independently.

DreamerV2 §4.3

A 2020 model-based RL algorithm that learns a "world model" — a learned simulator — and plans inside it. Mentioned at the end as the suggested next step; its successor DreamerV3 and more recent world-model approaches build on the same idea.

G · Robotics concepts

Higher-level ideas the thesis sits inside.

Perception throughout

The part of a robot that turns sensor inputs into useful structured information about the world. Vision, depth, IMU integration, etc. In a decoupled architecture, perception's output is a clean data structure (a mask, a point cloud, a goal pose). In an end-to-end one, there's no externally-visible "perception output" — the network's internal activations are doing the job.

Planning throughout

The part of a robot that decides what to do given some understanding of the world. Trajectory generation, search, decision-making. The thesis is fundamentally about where to draw the boundary between perception and planning.

Control throughout

The lowest-level loop that turns a desired trajectory into actual motor commands. Always closed-loop. The thesis abstracts this away — the DJI autopilot does the inner control loop on its own.

Forward kinematics §1.2

Given the joint angles of a robot, compute the pose of its end-effector. Just trigonometry plus matrix multiplication. The thesis's φ(r) is the forward-kinematic map from arm joint angle r to gripper pose. Because the arm is fixed, φ(r) is a constant.

Latent representation §3 intro

The internal activations of a network — the "embedding space" where it represents the world after compressing the input. Hidden, but the structure that determines what the network can express. End-to-end policies care about latent representations; decoupled ones replace them with hand-engineered data structures.

Feature throughout

Vague but important word. Sometimes means "a measurable property of the input" (a corner, an edge), sometimes "an activation channel inside a network." When the thesis says "latent visual patterns," it means features the network has learned to extract that the engineer didn't explicitly design for.

Receding-horizon planning throughout

See receding horizon in §D. The receding-horizon idea is universal in real-time robotics: plan for the near future, act for a moment, replan with new information. The horizon length and replanning frequency are the two design knobs.
