APPENDIX A: Terminology
Every piece of jargon in the thesis, in plain English. Where it helps, I've added an LLM-side analogue — same idea, different vocabulary.
§A Hardware & flight
The physical apparatus and what each component is responsible for.
Any flying robot without a human pilot on board. In this thesis it specifically means a quadcopter — four propellers, X-frame, hovering capable. "Drone" is the colloquial synonym.
A four-rotor flying vehicle. Two rotors spin clockwise, two counter-clockwise; balancing their thrusts and torques produces movement in any direction. The DJI Matrice 100 used here is a development-platform quadrotor with an open SDK.
A flying robot with an arm. Combines the mobility of a UAV with the ability to interact with objects. In this thesis the "arm" is just a fixed rod with a gripper — it doesn't articulate, which simplifies the dynamics enormously.
The business end of a robot — whatever does the actual interaction with the world. Here it's the magnetic gripper at the tip of the arm. Mathematically it's a fixed pose offset from the drone body.
The "hand." This thesis uses a passive magnetic gripper — no motors, just a strong magnet that latches when it gets close enough to the magnet on the orange. Cheap, reliable, no actuation complexity, but only works for objects you've prepared with magnets.
The motors that spin the rotors. Brushless = electronically commutated, more efficient than brushed motors. Drones use BLDCs for high power density and reliability.
A small chip combining accelerometers and gyroscopes. Reports linear acceleration and angular velocity. Used by the flight controller for attitude stabilisation; not directly used in this thesis, but it's running in the background to keep the drone level.
A camera that returns both colour (RGB) and depth (D) for every pixel. The thesis uses the Intel RealSense D435i, which gets depth from active stereo: it projects an invisible IR pattern, sees how the pattern is distorted by the geometry, and triangulates.
An image where each pixel is a distance, not a colour. Same dimensions as the RGB image, aligned per-pixel. Crucial for going from 2D pixel coordinates back to 3D world coordinates (see pinhole camera model).
A set of 3D points, usually obtained by projecting a depth map into space using camera intrinsics. The "shape of the world that the camera sees, in metres."
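A minimal numpy sketch of that back-projection, assuming standard pinhole intrinsics (focal lengths fx, fy and principal point cx, cy) and a depth map already in metres; the function and variable names are mine, not the thesis's:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metres) into an N x 3 point cloud
    in the camera frame, using pinhole intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinate grid
    z = depth
    x = (u - cx) * z / fx                            # invert the pinhole projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no valid depth
```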
An infrared multi-camera system that tracks reflective markers stuck on the drone, returning sub-millimetre position estimates. Standard in robotics labs as a "ground truth" odometry source. It's not how a real-world drone would localise itself outdoors — only used here because the lab environment allows it.
The robot's estimate of its own position and orientation over time. Could come from wheel encoders (ground robots), GPS, IMU integration, visual-inertial estimation, or — as here — external mocap. The thesis uses Optitrack because it's high-accuracy and the test environment is small enough.
Not really an OS — a middleware framework where each algorithm runs as a "node," communicating via typed message topics. The thesis runs U-Net, projection, RANSAC, and trajectory generation as separate ROS nodes. ROS gives you free pub/sub, transforms, visualisation (RViz), and a huge ecosystem of pre-built nodes.
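For flavour, a bare-bones rospy node sketch; the topic names and the message flow here are illustrative, not the thesis's actual interfaces:

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import Image
from geometry_msgs.msg import PoseStamped

class SegmentationNode:
    def __init__(self):
        rospy.init_node("orange_segmentation")
        # Topic names are placeholders, not the thesis's real topics.
        self.pub = rospy.Publisher("/goal_pose", PoseStamped, queue_size=1)
        rospy.Subscriber("/camera/color/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        pose = PoseStamped()              # run segmentation, projection, RANSAC here
        pose.header.stamp = msg.header.stamp
        self.pub.publish(pose)            # downstream nodes subscribe to /goal_pose

if __name__ == "__main__":
    SegmentationNode()
    rospy.spin()
```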
§B Computer vision
Tools for turning images into structured information.
Per-pixel classification: every pixel in an image gets a label ("orange," "leaf," "sky"). Different from object detection, which gives bounding boxes; closer to "drawing inside the lines" of every object. The thesis only has one class (orange vs not-orange) so it's binary segmentation.
The output of a segmentation network — a same-shape array of 0s and 1s (or class labels) over the input image. Tells you which pixels belong to the object you care about.
A neural network whose layers slide small filters across the input — the standard tool for image-shaped data. Filters detect patterns (edges, then textures, then objects) at increasing scales of abstraction. CNNs were the dominant vision architecture from ~2012 to ~2020 before transformers caught up.
A specific CNN architecture for segmentation. Shape: a downsampling path (encoder) followed by a symmetric upsampling path (decoder), with skip connections jumping across that "U" shape. Skip connections feed early high-resolution features into the decoder so fine details aren't lost. Originally introduced for biomedical image segmentation in 2015.
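A toy PyTorch U-Net with a single down/up level, just to show the skip connection across the "U"; the thesis's actual network is deeper and its layer sizes are not reproduced here:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """One down / one up level; real U-Nets stack four or five."""
    def __init__(self):
        super().__init__()
        self.enc = conv_block(3, 16)        # encoder: high-resolution features
        self.down = nn.MaxPool2d(2)
        self.mid = conv_block(16, 32)       # bottleneck at half resolution
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = conv_block(32, 16)       # 32 = 16 upsampled + 16 skipped channels
        self.head = nn.Conv2d(16, 1, 1)     # one output channel: orange vs not-orange

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        u = self.up(m)
        u = torch.cat([u, e], dim=1)        # the skip connection across the "U"
        return self.head(self.dec(u))       # per-pixel logits; sigmoid gives the mask
```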
A direct wire that shortcuts past several layers of a network. Helps gradients flow during training (the original motivation in ResNet), and helps preserve information that would otherwise be destroyed by downsampling (the motivation in U-Net).
A CNN family that uses skip connections to make very deep networks trainable. Each "block" adds the input of the block to its output: y = F(x) + x. ResNet8 is just a tiny version with 8 layers — the thesis uses it because the prediction task has limited training data.
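The residual idea in code (a generic PyTorch block, not the thesis's ResNet8):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the skip connection lets gradients bypass F entirely."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # add the block's input back onto its output
```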
The standard metric for segmentation quality. IoU = |A ∩ B| / |A ∪ B| — how much your predicted mask overlaps with the ground-truth mask, divided by their combined area. 1.0 = perfect, 0 = no overlap. The thesis hits 93.49%, which for a single-class network on a clean dataset is solid.
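Computing IoU for two binary masks is a few lines of numpy (a generic sketch, not the thesis's evaluation code):

```python
import numpy as np

def iou(pred, truth):
    """IoU of two boolean masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0                                     # both masks empty: call it a perfect match
    return np.logical_and(pred, truth).sum() / union   # |A ∩ B| / |A ∪ B|
```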
A widely-used image dataset containing 80 common object classes with segmentation labels. Standard "pretraining" target for vision networks. Mentioned here because U-Nets are often trained on COCO before fine-tuning to a specific task.
A rectangle around an object in an image. Used in object detection (faster but less precise than segmentation). The thesis discusses bounding-box methods (R-CNN) only to explain why it picked segmentation instead.
§C 3D geometry & projection
How pixels relate to physical metres.
An idealised camera where all light passes through a single infinitesimal hole and lands on a flat sensor at distance f (the focal length). This model is what lets you write down the equation that maps 3D world points to 2D pixels — the foundation of every projection-based 3D reconstruction.
Distance between the pinhole and the image plane. Bigger f = narrower field of view (telephoto). Smaller f = wider field of view (wide-angle). Specified by the camera manufacturer; the thesis uses the values RealSense reports.
A 3×4 matrix that encodes the camera's internal geometry: focal lengths f_x, f_y, principal point offsets o_x, o_y. Multiply a 3D point in camera frame by this matrix to get the (homogeneous) pixel coordinate. Stored once per camera; doesn't change as the camera moves.
The 4×4 matrix that describes where the camera is and which way it points, relative to some world frame. Combined with the intrinsics, you can map any world point to a pixel. The thesis treats the camera-to-body extrinsic as a known constant (mounted in a fixed position on the drone).
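Putting intrinsics and extrinsics together, a hedged numpy sketch of mapping a world point to a pixel; the frame conventions below are one common choice, not necessarily the thesis's:

```python
import numpy as np

def project(p_world, g_world_cam, fx, fy, ox, oy):
    """Map a 3D world point to pixel coordinates (u, v).

    g_world_cam: 4x4 camera pose in the world frame (the extrinsic).
    fx, fy, ox, oy: pinhole intrinsics (focal lengths, principal point).
    """
    g_cam_world = np.linalg.inv(g_world_cam)        # world frame -> camera frame
    p_cam = g_cam_world @ np.append(p_world, 1.0)   # homogeneous coordinates
    x, y, z = p_cam[:3]
    u = fx * x / z + ox                             # pinhole projection onto the image plane
    v = fy * y / z + oy
    return u, v
```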
The flat surface inside the camera where the sensor lives. World points project onto it through the pinhole. Pixel coordinates (u, v) measure positions on this plane.
A coordinate system attached to the camera: origin at the pinhole, z-axis pointing forward (the way the camera is looking). All "depths" are along this z-axis.
"Body frame" rides with the drone — y-axis sideways, z-axis up, x-axis forward. "World frame" is fixed in the room. The same physical point has different coordinates in different frames; converting between them uses the rotation matrix R and position p — i.e. the pose g.
A robust model-fitting algorithm. Repeatedly: pick a random small subset of the data, fit a model, count how many points agree (the "consensus"). The model with the most consensus wins, and outliers are ignored. The thesis uses RANSAC to fit a plane through the foliage point cloud — without RANSAC, branches sticking out would corrupt a least-squares fit.
Finding the plane ax + by + cz + d = 0 that best passes through a set of 3D points. The plane's normal vector (a, b, c) is the direction perpendicular to it — in this thesis, the direction the drone should fly in to "punch through" the foliage at the orange.
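A compact sketch of the last two entries together: RANSAC plane fitting over a point cloud in numpy. The iteration count and inlier threshold are placeholders, not the thesis's values:

```python
import numpy as np

def ransac_plane(points, n_iters=200, threshold=0.02, seed=0):
    """Fit a plane to an N x 3 point cloud while ignoring outliers.

    Returns (normal, d) such that normal . p + d ~= 0 for inlier points p.
    threshold is the inlier distance in metres (a guess, not a tuned value).
    """
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = 0, None
    for _ in range(n_iters):
        a, b, c = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(b - a, c - a)            # plane through 3 random points
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                               # degenerate (collinear) sample
        normal /= norm
        d = -normal @ a
        dist = np.abs(points @ normal + d)         # point-to-plane distances
        inliers = (dist < threshold).sum()         # the "consensus" count
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane
```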
The classical robotics problem of building a map of an unknown environment while figuring out where you are inside it. The thesis specifically avoids SLAM because it doesn't scale well to large environments (the map gets too big). The whole thesis can be read as "what if we didn't need a map at all?"
§D Control & planning
How the drone actually moves.
The simplest non-trivial controller. Output is k_p · error + k_d · derivative_of_error. The "P" term pushes towards the target; the "D" term damps oscillation. Used here as the inner loop tracking the reference trajectory. PID adds an "I" (integral) term for steady-state error — not used here.
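In code, one axis of a PD law looks like this (the gains are illustrative, not the values tuned on the drone):

```python
def pd_step(target, state, state_prev, dt, kp=1.0, kd=0.2):
    """One axis of a PD controller. Returns the commanded correction."""
    error = target - state
    error_prev = target - state_prev
    d_error = (error - error_prev) / dt    # finite-difference derivative of the error
    return kp * error + kd * d_error       # P pushes toward the target, D damps oscillation
```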
Solve an optimal-control problem over a short future horizon, execute the first action, replan from the new state. Lets you handle constraints (obstacles, motor limits) explicitly inside the optimisation. The "expert" in chapter 3 is essentially an MPC running offline with full knowledge.
A specific algorithm for solving non-linear optimal control. Cousin of LQR (Linear Quadratic Regulator) for non-linear dynamics. Iterative; converges quickly when initialised well. The thesis uses DDP both to compute expert trajectories for training (chapter 3) and to smooth the network's predicted waypoints into a feasible reference at inference time.
Computing a sequence of states the robot should visit over time. Distinct from trajectory tracking, which is the lower-level loop that actuates motors to follow that sequence.
A trajectory parameterised as a polynomial in time, with coefficients chosen so that the integral of the squared fourth derivative of position (the "snap") is minimised. Smooths out trajectories so motors don't have to jerk. Standard for quadrotors thanks to a 2011 Mellinger & Kumar paper.
A property some systems have where the entire state and control history can be recovered from a sufficiently smooth trajectory of certain "flat outputs" — for a quadrotor, position and yaw. This is what makes minimum-snap trajectories actually flyable: you can smoothly differentiate the path and read off motor commands.
A control technique where the controller's feedback comes directly from camera measurements, not from a separate state estimator. Two flavours: position-based (compute a 3D goal from images, then use standard control) and image-based (drive features in the image directly to target locations). The thesis uses position-based.
Closed-loop = controller uses real-time feedback (e.g. camera) to correct itself. Open-loop = controller commits to a plan and executes it blindly. Both pipelines in this thesis are closed-loop, just with different feedback (decoupled: 6-DOF goal pose @ 8–9 Hz; end-to-end: full image-to-waypoints @ image rate).
"Goal pose" = the final 6-DOF target the drone is trying to reach. "Waypoint" = an intermediate pose along the way. Chapter 2 communicates one goal pose between modules; chapter 3 communicates three waypoints (Δḡ₁, Δḡ₂, Δḡ₃) at 0.5 s spacing.
The strategy where you plan far into the future, but only execute the near-future portion before re-planning from the new state. Both pipelines do this — in chapter 3 the horizon is 3 × 0.5 s = 1.5 s of waypoints, but only the first ~0.5 s is executed before the network re-runs.
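Schematically, with placeholder function names rather than anything from the thesis:

```python
# Every function here is a hypothetical placeholder; the loop structure is the point.
def receding_horizon_loop(get_image, predict_waypoints, track, replan_period=0.5):
    while True:
        image = get_image()                           # newest observation
        waypoints = predict_waypoints(image)          # plan e.g. 3 x 0.5 s into the future
        track(waypoints[0], duration=replan_period)   # execute only the first chunk...
        # ...then loop: replan from wherever the drone actually ended up
```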
§E Math notation
Decoder ring for the symbols in the thesis.
The set of 3D rotations. Equivalent to the set of 3×3 orthogonal matrices with determinant 1. When you see R ∈ SO(3), read "R is a rotation."
The set of rigid-body transformations in 3D — rotation + translation. A 4×4 matrix combining a rotation R and a position p. When you see g ∈ SE(3), read "g is a 6-DOF pose."
A 3×3 matrix that, when applied to a vector, rotates it. One of several equivalent ways to encode 3D orientation; the others are quaternions, Euler angles, axis-angle. Rotation matrices are nice because composition is just matrix multiplication.
The right way to interpolate between two rotations. Linear interpolation of rotation matrices doesn't preserve orthogonality; slerp moves along the great-circle arc on the unit quaternion sphere. Used here in the alpha-decay smoothing of the goal pose estimate.
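With scipy, slerp between two orientations looks like this (the blend factor is illustrative, not the thesis's alpha schedule):

```python
from scipy.spatial.transform import Rotation, Slerp

rots = Rotation.from_euler("z", [0, 90], degrees=True)   # current estimate and new measurement
slerp = Slerp([0.0, 1.0], rots)                          # interpolator along the great-circle arc
alpha = 0.2                                              # how far to move toward the measurement
r_smooth = slerp([alpha])[0]
print(r_smooth.as_euler("zyx", degrees=True))            # ~[18, 0, 0]: 20% of the way there
```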
The three independent rotations of a rigid body. Yaw = rotation about vertical axis (turning left/right). Pitch = rotation about sideways axis (nose up/down). Roll = rotation about forward axis (banking). For a quadrotor, yaw is independent of position; pitch and roll directly control horizontal motion.
"Body-fixed" means the velocity is expressed in the drone's own coordinate frame, not the world's. So v is "how fast forward the drone thinks it's moving," not "how fast east." Same for angular velocity ω.
The number being minimised by an optimal-control solver. Penalises deviation from desired state, control effort, obstacle collisions, etc. Each term has a weight (Q, R, b…) you tune to trade things off. Designing J well is most of the work in classical control.
The standard loss for classification — measures how surprised you are at the true label given the network's predicted distribution. Equivalent to maximum-likelihood under a categorical model. Used here on each of the 18 independent 100-class softmaxes.
The function that turns a vector of logits into a probability distribution. Exponentiate each logit, normalise so they sum to 1. Used here at the network's output to give a proper probability over the 100 bins.
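A hedged PyTorch sketch of how 1800 logits become 18 independent 100-way softmaxes trained with cross-entropy; the shapes are inferred from the description in chapter 3, not copied from its code:

```python
import torch
import torch.nn.functional as F

batch = 4
logits = torch.randn(batch, 1800)              # raw network outputs
logits = logits.view(batch, 18, 100)           # 18 independent 100-way classifications
targets = torch.randint(0, 100, (batch, 18))   # true bin index for each of the 18 outputs

# cross_entropy applies softmax internally; the class dimension must come second
loss = F.cross_entropy(logits.transpose(1, 2), targets)
probs = logits.softmax(dim=-1)                 # explicit softmax, if the distributions are needed
```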
§F Learning
How the network in chapter 3 actually gets trained.
Training a policy by mimicking an expert. You collect (state, action) pairs from someone (or something) doing the task correctly, and train a network to map state → action. No exploration, no reward signal, no failed attempts. The catch is you need an expert.
The simplest form of imitation learning: just train on (state, action) pairs as a regression/classification problem. Famously brittle because of distribution shift — the model's small errors at inference time compound, leading it into states the expert never showed it. DAgger is the classic fix.
An algorithm that fixes behaviour cloning's compounding-error problem. Procedure: roll the student out, query the expert for what it would do at each visited state, add those new (state, expert action) pairs to the dataset, retrain. Repeat until the student stops drifting. Ross, Gordon & Bagnell, 2011.
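The loop, schematically; every name here is a placeholder rather than the thesis's implementation:

```python
# Schematic DAgger loop; expert/student are hypothetical objects with fit/rollout/act methods.
def dagger(expert, student, initial_dataset, n_rounds=5):
    dataset = list(initial_dataset)                    # start from expert demonstrations
    for _ in range(n_rounds):
        student.fit(dataset)                           # behaviour cloning on everything so far
        for state in student.rollout():                # visit the states the *student* reaches
            dataset.append((state, expert.act(state))) # but label them with the expert's action
    return student
```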
Training a policy via trial-and-error from a reward signal. Mentioned in this thesis only to be rejected: real-world RL on a drone means real-world crashes. Imitation learning is preferred when a working expert exists.
The whole problem of transferring a policy trained in simulation to the real world. Hard because the simulator differs from reality in countless small ways (lighting, friction, sensor noise). Common fixes: domain randomisation (train across many randomised sims), domain adaptation (align features), photorealistic rendering. The thesis flags this as the dominant unsolved issue.
Synthetically expanding the training set by applying transformations that should not change the label. The thesis flips images horizontally and mirrors the corresponding waypoint labels — effectively doubling the data for free.
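A sketch of that flip, assuming waypoints are stored as (x forward, y left, z up) offsets so that mirroring the image means negating y; the thesis's actual conventions may differ:

```python
import numpy as np

def flip_example(image, waypoints):
    """Horizontal flip of an H x W x 3 image plus mirrored waypoint labels.

    waypoints: N x 3 array of relative positions; mirroring the image
    corresponds to negating the lateral (y) component under the assumed convention.
    """
    flipped_image = image[:, ::-1, :].copy()   # flip along the width axis
    flipped_wps = waypoints.copy()
    flipped_wps[:, 1] *= -1                    # mirror the lateral component
    return flipped_image, flipped_wps
```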
The standard ML hygiene practice: train on most of the data, hold out a validation set for hyperparameter tuning, and a separate test set for final evaluation that nobody touches until the end. Thesis uses 70 / 15 / 15.
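The split itself is just bookkeeping, for example:

```python
import numpy as np

def split_indices(n, seed=0):
    """70 / 15 / 15 train / validation / test split over n examples."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```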
Two ways to make a network predict a number. Regression: predict the number directly, train with MSE. Classification: discretise the range into bins and predict which bin, train with cross-entropy. The thesis explicitly chooses classification because regression collapses to the mean of the dataset for this kind of multimodal target.
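A small sketch of the discretisation, with a placeholder range; the thesis's actual bin edges are not reproduced here:

```python
import numpy as np

def to_bin(value, lo=-2.0, hi=2.0, n_bins=100):
    """Discretise a continuous target (e.g. a waypoint offset in metres) into a bin
    index for classification. The [-2, 2] m range is illustrative only."""
    edges = np.linspace(lo, hi, n_bins + 1)
    return int(np.clip(np.digitize(value, edges) - 1, 0, n_bins - 1))

def from_bin(index, lo=-2.0, hi=2.0, n_bins=100):
    """Map a predicted bin index back to its bin-centre value."""
    width = (hi - lo) / n_bins
    return lo + (index + 0.5) * width
```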
The raw outputs of a classification network before softmax. The "1800 logits" mentioned in chapter 3 means 1800 unnormalised scores, organised as 18 groups of 100, each group fed through softmax independently.
A 2020 model-based RL algorithm that learns a "world model" — a learned simulator — and plans inside it. Mentioned at the end as the suggested next step. Its successors as of 2026 include Dreamer V3, world-model VLAs, and physical foundation models.
§G Robotics concepts
Higher-level ideas the thesis sits inside.
The part of a robot that turns sensor inputs into useful structured information about the world. Vision, depth, IMU integration, etc. In a decoupled architecture, perception's output is a clean data structure (a mask, a point cloud, a goal pose). In an end-to-end one, there's no externally-visible "perception output" — the network's internal activations are doing the job.
The part of a robot that decides what to do given some understanding of the world. Trajectory generation, search, decision-making. The thesis is fundamentally about where to draw the boundary between perception and planning.
The lowest-level loop that turns a desired trajectory into actual motor commands. Always closed-loop. The thesis abstracts this away — the DJI autopilot does the inner control loop on its own.
Given the joint angles of a robot, compute the pose of its end-effector. Just trigonometry plus matrix multiplication. The thesis's φ(r) is the forward-kinematic map from arm joint angle r to gripper pose. Because the arm is fixed, φ(r) is a constant.
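In code, that constant map is a single matrix product (the 30 cm offset below is illustrative, not the thesis's measurement):

```python
import numpy as np

def gripper_pose(g_world_body, g_body_gripper):
    """Pose of the gripper in the world frame: compose the drone's pose with the
    fixed body-to-gripper offset (the constant phi(r) of the thesis)."""
    return g_world_body @ g_body_gripper   # 4x4 SE(3) matrices compose by multiplication

# Illustrative numbers: gripper 30 cm ahead of the body origin, no rotation.
g_body_gripper = np.eye(4)
g_body_gripper[0, 3] = 0.30
```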
The internal activations of a network — the "embedding space" where it represents the world after compressing the input. Hidden, but the structure that determines what the network can express. End-to-end policies care about latent representations; decoupled ones replace them with hand-engineered data structures.
Vague but important word. Sometimes means "a measurable property of the input" (a corner, an edge), sometimes "an activation channel inside a network." When the thesis says "latent visual patterns," it means features the network has learned to extract that the engineer didn't explicitly design for.
See receding horizon in §D. The receding-horizon idea is universal in real-time robotics: plan for the near future, act for a moment, replan with new information. The horizon length and replanning frequency are the two design knobs.