Five modules in a chain. Each one does one thing, hands off a clean data structure, and forgets the rest.
§2.1 Five blocks
Here's what happens every cycle, at 8–9 Hz:
FIG. 2.1 The visual-servoing loop. Each block is a separate ROS node. The handoff between (4) and (5) is the only place that perception and planning communicate — through a 6-DOF pose. This is the contract that the next chapter rewrites.
§2.2 Segmenting the orange
The first job is finding which pixels in the RGB image actually are the orange. The thesis uses a small fully-convolutional U-Net trained on hand-labelled images. It hits an Intersection-over-Union of 93.49% on held-out frames.
FIG. 2.2 A standard encoder-decoder fully-convolutional network with two skip connections. Channels: 128 → 256 → 512 → 512 ↘ 512 → 256 → 128. The skip connections feed early high-resolution features back into the decoder so fine details (the round edge of the orange) survive to the output mask.
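The thesis doesn't reproduce its training code here, but the shape of the network is easy to sketch. Below is a minimal, hypothetical PyTorch version: the channel widths and the two skip connections follow the figure, while the kernel sizes, pooling, upsampling mode, sigmoid head, and the name `OrangeSegNet` are assumptions for illustration (input height and width are assumed divisible by 8).

```python
# Hypothetical sketch of the Fig. 2.2 encoder-decoder; only the channel widths
# (128 -> 256 -> 512 -> 512 -> 512 -> 256 -> 128) and the two skips are from the figure.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class OrangeSegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 128)          # full resolution
        self.enc2 = conv_block(128, 256)        # 1/2 resolution
        self.enc3 = conv_block(256, 512)        # 1/4 resolution
        self.bottleneck = conv_block(512, 512)  # 1/8 resolution
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec3 = conv_block(512, 512)
        self.dec2 = conv_block(512 + 256, 256)  # skip connection from enc2
        self.dec1 = conv_block(256 + 128, 128)  # skip connection from enc1
        self.head = nn.Conv2d(128, 1, 1)        # per-pixel orange logit

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(self.up(b))
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return torch.sigmoid(self.head(d1))     # binary mask in [0, 1]
```

The two concatenations are the skip connections the caption describes: early, high-resolution features re-enter the decoder so the round edge of the orange survives all the downsampling.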
Engineering note
For the orange-picking task, this network is doing one thing: where is the orange in this image? Today you would start with a foundation segmenter or an open-vocabulary mask model before training a tiny U-Net from scratch. In 2021, a hand-labelled single-class U-Net was the straightforward engineering choice.
§2.3 From pixels to 3D
A 2D mask is not enough — the planner needs a position in metres. The thesis uses the standard pinhole camera model: every world point projects through a single point (the camera origin) onto a flat image plane at focal length f. Inverting that projection requires one extra piece of information: depth. That's why the camera is RGB-D.
FIG. 2.3 Single world point P projecting to a pixel (u,v). Going pixel → world is impossible without z_c — there's a one-parameter family of world points that project to the same pixel. The depth sensor pins down z_c, and the rest follows from the intrinsic matrix M_int. Every "orange pixel" gets projected; their mean is the goal position.
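Here's how small the back-projection step actually is once the depth image is aligned with the RGB frame. This is a sketch under the standard pinhole model, not the thesis's code; `fx, fy, cx, cy` are the entries of the intrinsic matrix M_int, and the mean-over-mask-pixels goal follows the caption.

```python
import numpy as np

def orange_goal_position(mask, depth, fx, fy, cx, cy):
    """Back-project every 'orange' pixel to 3D (camera frame) and average them.

    Inverts the pinhole projection: x = (u - cx) * z / fx, y = (v - cy) * z / fy,
    where fx, fy, cx, cy come from the intrinsic matrix M_int.
    mask:  (H, W) boolean segmentation output.
    depth: (H, W) aligned depth image in metres.
    """
    vs, us = np.nonzero(mask)          # pixel coordinates of the mask
    zs = depth[vs, us]
    valid = zs > 0                     # drop pixels with no depth return
    us, vs, zs = us[valid], vs[valid], zs[valid]
    pts = np.stack([(us - cx) * zs / fx,
                    (vs - cy) * zs / fy,
                    zs], axis=1)
    return pts.mean(axis=0)            # goal position in the camera frame
```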
§2.4 Where to attack from
Knowing where the orange is gives only half the goal pose. The other half is which way to come in: pluck it from the wrong angle and the gripper smacks into a leaf, or the rest of the foliage occludes the camera. The thesis's heuristic: approach perpendicular to the surface of the foliage around the orange. That's the angle with the most clear airspace.
To estimate that surface, the thesis grows a 3× box around the orange in the segmented image, projects all those pixels to 3D, and fits a plane through them with RANSAC. The plane's normal is the desired approach direction.
FIG. 2.4 The four-stage geometry that turns a binary mask into a goal pose. RANSAC is the right tool here because the projected foliage points are noisy (depth sensor errors, branches sticking out at odd angles) — a least-squares fit would be biased by the worst points. RANSAC discards them by definition.
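To make the "RANSAC discards them by definition" point concrete, here is a hedged numpy sketch of the plane fit: sample three points, count inliers within a distance threshold, keep the best consensus set, then least-squares-refit on the inliers only and read the approach direction off the normal. The iteration count, threshold, and sign convention are illustrative, not the thesis's values.

```python
import numpy as np

def ransac_plane(points, n_iters=200, inlier_thresh=0.03, rng=None):
    """Fit a plane to noisy 3D foliage points; returns (unit normal, centroid).

    points: (N, 3) array in the camera frame. inlier_thresh is in metres.
    Iteration count and threshold are illustrative, not the thesis's values.
    """
    if rng is None:
        rng = np.random.default_rng()
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:                 # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        inliers = np.abs((points - p0) @ n) < inlier_thresh
        if inliers.sum() > best_inliers.sum():       # keep the largest consensus set
            best_inliers = inliers
    # Least-squares refit on the inliers only; the outliers RANSAC rejected are gone.
    inlier_pts = points[best_inliers]
    centroid = inlier_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(inlier_pts - centroid)
    normal = vt[-1]                                  # direction of least variance
    if normal[2] > 0:                                # flip so it points back toward the camera
        normal = -normal
    return normal, centroid
```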
Failure mode
The RANSAC fit oscillates when the foliage around the orange is sparse. There aren't enough inliers to constrain the plane, so the normal swings as the drone moves — the goal pose flickers, the trajectory replans on every swing, and the drone overshoots. The thesis reports 5 outright failures and names two failure families: lost view during the maneuver, and sparse-foliage plane instability. It does not split those 5 failures numerically.
Alpha-decay smoothing
The patch for the flickering plane normal is to stop trusting each frame as a fresh truth. The system keeps a world-frame estimate of the orange pose, g_WO, by composing the odometry pose with the observed relative pose: g_WO = g_WQ · g_QO, where g_WQ is the drone's pose in the world frame (from odometry) and g_QO is the orange's pose relative to the drone (from perception). Since the orange is static, new observations should refine the same fixed world pose, not yank it around frame by frame.
FIG. 2.4B Alpha-decay is an exponential moving average over pose, with slerp for rotation. It smooths the RANSAC normal once the orange has been seen. It still cannot solve the harder Chapter 3 case where the orange is never visible at the start.
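A minimal sketch of what such a filter looks like: an exponential moving average over position, slerp over orientation, with the observation weight alpha decaying as more frames arrive. The quaternion convention (w, x, y, z), the decay schedule, and the class name are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    dot = np.dot(q0, q1)
    if dot < 0:                      # take the short way around
        q1, dot = -q1, -dot
    if dot > 0.9995:                 # nearly identical: fall back to linear blend
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

class AlphaDecayPoseFilter:
    """EMA over a pose (position + quaternion), with a decaying observation weight.

    alpha is how much each new observation is trusted; decaying it lets the
    estimate settle once the orange has been seen from several frames.
    The decay schedule here is an illustrative assumption.
    """
    def __init__(self, alpha0=0.5, decay=0.9, alpha_min=0.05):
        self.alpha, self.decay, self.alpha_min = alpha0, decay, alpha_min
        self.pos, self.quat = None, None

    def update(self, pos_obs, quat_obs):
        if self.pos is None:                       # first observation: take it as-is
            self.pos, self.quat = pos_obs, quat_obs
        else:
            a = self.alpha
            self.pos = (1 - a) * self.pos + a * pos_obs
            self.quat = slerp(self.quat, quat_obs, a)
            self.alpha = max(self.alpha * self.decay, self.alpha_min)
        return self.pos, self.quat
```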
§2.5 The two-step pick
One last engineering wrinkle: the gripper is bolted to the bottom of the drone. If the drone flies directly at the orange, the gripper occludes the camera before the magnet engages — and the moment the camera loses sight of the orange, the whole loop falls apart.
The thesis solves this with a two-step approach: first move below the orange (gripper out of view), then move up and forward to engage the magnet.
If the orange leaves view during either step, the drone returns to the last known good state where the mask was visible. If that reset fails more than two times, the run is counted as a failure. That reset rule is why the thesis separates clean successes from "success with reset" in Table 2.1.
FIG. 2.5 The drone first goes under the orange so the gripper falls below the camera's field of view. Then it rises and presses forward — the orange stays in frame the whole time, so the perception loop never breaks.
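The two-step maneuver plus the reset rule amounts to a small state machine. The sketch below is a hypothetical reconstruction of that control flow: the two steps and the more-than-two-resets-is-a-failure rule come from the text, while the state names and the choice to restart the maneuver from step one after a reset are assumptions.

```python
from enum import Enum, auto

class PickState(Enum):
    GO_BELOW = auto()    # step 1: move under the orange, gripper out of camera view
    ENGAGE = auto()      # step 2: move up and forward until the magnet engages
    RESET = auto()       # orange lost: fly back to the last state with a visible mask
    DONE = auto()
    FAILED = auto()

def step(state, orange_visible, at_below_waypoint, magnet_engaged, resets):
    """One tick of a hypothetical pick state machine for the §2.5 maneuver.

    The exact semantics of a 'failed reset' are simplified: here every loss of
    view counts toward the budget, and more than two counts the run as a failure.
    """
    if state in (PickState.GO_BELOW, PickState.ENGAGE) and not orange_visible:
        resets += 1
        return (PickState.FAILED if resets > 2 else PickState.RESET), resets
    if state == PickState.GO_BELOW and at_below_waypoint:
        return PickState.ENGAGE, resets
    if state == PickState.ENGAGE and magnet_engaged:
        return PickState.DONE, resets
    if state == PickState.RESET and orange_visible:
        return PickState.GO_BELOW, resets      # restart the maneuver from step 1
    return state, resets
```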
§2.6 102 runs
The system was tested on 102 runs against the artificial tree, with the orange repositioned between trials so the relative approach yaw varied across the full ±75° range.
FIG. 2.6 The decoupled pipeline is extremely reliable when the goal is in view: 84 clean picks, 13 more after a reset to the last-good state, and 5 failures. The thesis attributes the failures to lost view/reset breakdowns and sparse-foliage plane instability, without giving a count for each bucket.
But the headline number hides the real limitation. Every test started with the orange visible. The whole pipeline is built on the assumption that the segmentation network has something to segment. The moment the orange is occluded by a leaf at t₀, this system has no recourse — there's nothing to project to 3D, nothing to plan towards. Chapter 3 is the response to exactly that problem.