A visual reading of Perception Based UAV Path Planning for Fruit Harvesting

CHAPTER 2 · The decoupled pipeline

Five modules in a chain. Each one does one thing, hands off a clean data structure, and forgets the rest.

§2.1 Five blocks

Every cycle, this is what happens at 8–9 Hz:

FIG. 2.1 · DECOUPLED PIPELINE. The visual-servoing loop as five ROS processes: (1) RGB-D camera, 640×480 @ 15 FPS → (2) U-Net mask, IoU 93.49% → (3) 3D projection, pinhole + depth → (4) RANSAC yaw, 3× foliage box → (5) min-snap trajectory + PD tracker. The loop closes every ~110 ms and resets to the last good state if the orange leaves view. Data contracts: 1→2 RGB image plus aligned depth map from the RealSense D435i; 2→3 binary orange mask, each positive pixel a candidate point; 3→4 mean orange point plus the surrounding 3×-orange-size foliage point cloud; 4→5 goal pose ḡ = position plus best approach yaw.
FIG. 2.1 The visual-servoing loop. Each block is a separate ROS node. The handoff between (4) and (5) is the only place that perception and planning communicate — through a 6-DOF pose. This is the contract that the next chapter rewrites.
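The 4→5 handoff is small enough to write down. A minimal sketch of that contract in Python (hypothetical names; the real system passes a ROS pose message, not this class):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GoalPose:
    """The only data that crosses from perception (4) into planning (5).

    position_w: orange position in the world frame, metres.
    yaw: best approach yaw from the RANSAC foliage-plane normal, radians.
    """
    position_w: np.ndarray  # shape (3,)
    yaw: float

    def as_vector(self) -> np.ndarray:
        """Flatten to [x, y, z, yaw] for a trajectory planner."""
        return np.append(self.position_w, self.yaw)


goal = GoalPose(position_w=np.array([1.2, 0.4, 1.5]), yaw=0.3)
print(goal.as_vector())  # [1.2 0.4 1.5 0.3]
```

Keeping the contract this narrow is what makes the modules swappable, and it is exactly the interface Chapter 3 replaces.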

§2.2 Segmenting the orange

The first job is finding which pixels in the RGB image actually are the orange. The thesis uses a small fully-convolutional U-Net trained on hand-labelled images. It hits an Intersection-over-Union of 93.49% on held-out frames.
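IoU, the metric behind that 93.49%, is overlap over union on the binary masks. A quick numpy sketch (illustrative, not the thesis code):

```python
import numpy as np


def mask_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-Union between two boolean masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(intersection / union) if union else 1.0


pred = np.zeros((4, 4), dtype=bool)
truth = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:3] = True   # 4-pixel predicted blob
truth[1:3, 1:4] = True  # 6-pixel ground-truth blob; overlap is 4 pixels
print(mask_iou(pred, truth))  # 4 / 6 ≈ 0.667
```

IoU punishes both missed orange pixels and false positives, which matters here: a mask that bleeds into the foliage would drag the projected 3D mean off the fruit.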

FIG. 2.2 · U-NET ARCHITECTURE. RGB 640×480×3 → encoder (4 conv layers: 128, 256, 512, 512) → bottleneck → decoder (3 deconv layers: 512, 256, 128) → binary mask. Skip connections: L1↔L7 (high-res edges) and L2↔L6 (1×1 conv). Result: single-class orange segmenter, IoU 93.49%.
FIG. 2.2 A standard encoder-decoder fully-convolutional network with two skip connections. Channels: 128 → 256 → 512 → 512 ↘ 512 → 256 → 128. The skip connections feed early high-resolution features back into the decoder so fine details (the round edge of the orange) survive to the output mask.
Engineering note

For the orange-picking task, this network is doing one thing: where is the orange in this image? Today you would start with a foundation segmenter or an open-vocabulary mask model before training a tiny U-Net from scratch. In 2021, a hand-labelled single-class U-Net was the straightforward engineering choice.

§2.3 From pixels to 3D

A 2D mask is not enough — the planner needs a position in metres. The thesis uses the standard pinhole camera model: every world point projects through a single point (the camera origin) onto a flat image plane at focal length f. Inverting that projection requires one extra piece of information: depth. That's why the camera is RGB-D.

FIG. 2.3 · PINHOLE PROJECTION. Side cutaway: image plane at focal length f behind the pinhole, world point P = (x_c, y_c, z_c) on the orange surface, sample pixel (u, v). The intrinsic matrix M_int (3×4) applies to the homogeneous point [x_c y_c z_c 1]ᵀ:

  [ f_x  0    o_x  0 ]
  [ 0    f_y  o_y  0 ]
  [ 0    0    1    0 ]

Forward projection: [u·z_c, v·z_c, z_c]ᵀ = M_int · P̃, so u = f_x·x_c/z_c + o_x and v = f_y·y_c/z_c + o_y. The inverse needs depth: x_c = (u − o_x)·z_c/f_x, y_c = (v − o_y)·z_c/f_y, z_c = depth(u, v). Without z_c the ray has infinitely many 3D points.
FIG. 2.3 Single world point P projecting to a pixel (u,v). Going pixel → world is impossible without z_c — there's a one-parameter family of world points that project to the same pixel. The depth sensor pins down z_c, and the rest follows from the intrinsic matrix M_int. Every "orange pixel" gets projected; their mean is the goal position.
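The inverse-projection formulas translate directly to code. A numpy sketch of pixel → camera-frame point, plus the mean-of-orange-pixels goal position (the intrinsics here are made up, not the D435i calibration):

```python
import numpy as np


def backproject(u, v, depth, fx, fy, ox, oy):
    """Invert the pinhole projection for pixel (u, v) with known depth z_c."""
    z = depth
    x = (u - ox) * z / fx
    y = (v - oy) * z / fy
    return np.array([x, y, z])


# Illustrative intrinsics (focal lengths and principal point in pixels):
fx = fy = 600.0
ox, oy = 320.0, 240.0

# Every orange pixel gets projected; their mean is the goal position.
pixels = [(330, 250, 1.00), (334, 250, 1.02), (332, 254, 0.98)]  # (u, v, depth m)
points = np.stack([backproject(u, v, d, fx, fy, ox, oy) for u, v, d in pixels])
goal_position = points.mean(axis=0)
print(goal_position)
```

Averaging over all mask pixels, rather than backprojecting a single centroid pixel, makes the estimate robust to speckle in the depth map.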

§2.4 Where to attack from

Knowing where the orange is gives half the goal pose. The other half is which way to come in — pluck it from the wrong angle and the gripper smacks into a leaf, or the rest of the foliage occludes the camera. The thesis's heuristic: come in perpendicular to the foliage surface around the orange. That's the angle with the most clear airspace.

To estimate that surface, the thesis grows a 3× box around the orange in the segmented image, projects all those pixels to 3D, and fits a plane through them with RANSAC. The plane's normal is the desired approach direction.

FIG. 2.4 · APPROACH-ANGLE ESTIMATION. 1: mask a box 3× the orange's size; 2: project the surrounding foliage points to R³; 3: RANSAC plane fit, normal n = best yaw; 4: goal pose ḡ = (R_yaw, p_orange).
FIG. 2.4 The four-stage geometry that turns a binary mask into a goal pose. RANSAC is the right tool here because the projected foliage points are noisy (depth sensor errors, branches sticking out at odd angles) — a least-squares fit would be biased by the worst points. RANSAC discards them by definition.
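A minimal RANSAC plane fit needs only numpy. A sketch (parameter values like the 2 cm inlier threshold are illustrative, not the thesis's):

```python
import numpy as np


def ransac_plane(points, iters=200, thresh=0.02, rng=None):
    """Fit a plane to noisy 3D points; return (unit normal, point on plane).

    Classic RANSAC: sample 3 points, form a candidate plane, count inliers
    within `thresh` metres, keep the candidate with the most inliers.
    """
    rng = rng or np.random.default_rng(0)
    best, best_inliers = None, None
    for _ in range(iters):
        a, b, c = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(b - a, c - a)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue  # degenerate (near-collinear) sample
        n = n / norm
        dist = np.abs((points - a) @ n)  # point-to-plane distances
        inliers = dist < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best, best_inliers = (n, a), inliers
    return best


# Foliage wall near the z = 0 plane, plus a branch sticking out (outliers):
rng = np.random.default_rng(1)
wall = np.column_stack([rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200),
                        rng.normal(0, 0.005, 200)])
branch = np.column_stack([rng.uniform(-0.1, 0.1, 20), rng.uniform(-0.1, 0.1, 20),
                          rng.uniform(0.2, 0.6, 20)])
normal, _ = ransac_plane(np.vstack([wall, branch]))
print(np.abs(normal))  # ≈ [0, 0, 1]: approach along the wall normal
```

The branch points never make it into the winning consensus set, which is exactly the property a least-squares fit lacks.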
Failure mode

The RANSAC fit oscillates when the foliage around the orange is sparse. There aren't enough inliers to constrain the plane, so the normal swings as the drone moves — the goal pose flickers, the trajectory replans on every swing, and the drone overshoots. The thesis reports 5 outright failures and names two failure families: lost view during the maneuver, and sparse-foliage plane instability. It does not split those 5 failures numerically.

Alpha-decay smoothing

The patch for the flickering plane normal is to stop trusting each frame as a fresh truth. The system keeps a world-frame estimate of the orange pose, g_wo, by composing odometry with the observed relative pose: g_wo = g_wq · g_qo. Since the orange is static, new observations should refine the same fixed world pose, not yank it around frame by frame.

FIG. 2.4B · ALPHA-DECAY SMOOTHING. A world-frame estimate turns noisy relative observations into a stable target. The observed relative pose g_qo (from mask + RANSAC, noisy when foliage is sparse) composes with odometry g_wq (from OptiTrack): g_wo^obs = g_wq · g_qo^obs. Since the orange should be static, the update is p_est ← p_est + α(p_obs − p_est) and R_est ← slerp(R_obs R_est⁻¹, α) R_est, with α decaying toward α_min over a maximum number of steps: early frames move the estimate, later frames only nudge it.
FIG. 2.4B Alpha-decay is an exponential moving average over pose, with slerp for rotation. It smooths the RANSAC normal once the orange has been seen. It still cannot solve the harder Chapter 3 case where the orange is never visible at the start.
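The filter is a few lines once the observation is in the world frame. A sketch with a linearly decaying gain and a 1-D shortest-arc blend standing in for slerp on yaw (all parameter values are illustrative, not the thesis's):

```python
import numpy as np


class AlphaDecayFilter:
    """Exponential moving average over the goal pose with a decaying gain.

    Early observations move the estimate a lot; after `max_steps` updates
    the gain settles at `alpha_min` and new frames only nudge it.
    """

    def __init__(self, alpha0=0.9, alpha_min=0.05, max_steps=30):
        self.alpha0, self.alpha_min, self.max_steps = alpha0, alpha_min, max_steps
        self.step = 0
        self.p_est = None
        self.yaw_est = None

    def update(self, p_obs, yaw_obs):
        if self.p_est is None:  # first sighting: take the observation as-is
            self.p_est, self.yaw_est = np.asarray(p_obs, float), yaw_obs
            return self.p_est, self.yaw_est
        t = min(self.step / self.max_steps, 1.0)
        alpha = self.alpha0 + t * (self.alpha_min - self.alpha0)
        self.p_est = self.p_est + alpha * (np.asarray(p_obs) - self.p_est)
        # 1-D analogue of slerp: blend along the shortest angular arc.
        dyaw = (yaw_obs - self.yaw_est + np.pi) % (2 * np.pi) - np.pi
        self.yaw_est += alpha * dyaw
        self.step += 1
        return self.p_est, self.yaw_est


rng = np.random.default_rng(0)
f = AlphaDecayFilter()
for _ in range(50):  # static orange, noisy per-frame observations
    p, yaw = f.update(np.array([1.0, 0.0, 1.5]) + rng.normal(0, 0.02, 3), 0.3)
print(p.round(2), round(yaw, 2))
```

For full rotations the yaw blend becomes a quaternion slerp, but the decaying-gain structure is the same.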

§2.5 The two-step pick

One last engineering wrinkle: the gripper is bolted to the bottom of the drone. If the drone flies directly at the orange, the gripper occludes the camera before the magnet engages — and the moment the camera loses sight of the orange, the whole loop falls apart.

The thesis solves this with a two-step approach: first move below the orange (gripper out of view), then move up and forward to engage the magnet.

If the orange leaves view during either step, the drone returns to the last known good state where the mask was visible. If that reset fails more than twice, the run is counted as a failure. That reset rule is why the thesis separates clean successes from "success with reset" in Table 2.1.

FIG. 2.5 · TWO-STEP MANEUVER. Go below first so the magnetic gripper stays out of the camera frustum; a straight-in path risks self-occlusion. 1: start, orange visible; 2: below, gripper under the frustum (drop under the fruit); 3: engage, up + forward into magnet contact. The target stays in the camera view-cone throughout.
FIG. 2.5 The drone first goes under the orange so the gripper falls below the camera's field of view. Then it rises and presses forward — the orange stays in frame the whole time, so the perception loop never breaks.
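The two waypoints fall straight out of the goal pose. A sketch (the below/back offsets are illustrative, not the thesis's tuned values):

```python
import numpy as np


def two_step_waypoints(p_orange, yaw, below=0.25, back=0.30):
    """Waypoints for the pick: drop under the fruit, then rise into contact.

    Step 1 goes `below` metres under the orange and `back` metres short of
    it along the approach direction, keeping the gripper out of the camera
    frustum. Step 2 rises and presses forward until the magnet engages.
    """
    approach = np.array([np.cos(yaw), np.sin(yaw), 0.0])
    under = p_orange - below * np.array([0.0, 0.0, 1.0]) - back * approach
    engage = p_orange  # magnet contact at the fruit itself
    return under, engage


under, engage = two_step_waypoints(np.array([2.0, 0.0, 1.5]), yaw=0.0)
print(under, engage)  # under = [1.7, 0, 1.25], engage = [2, 0, 1.5]
```

Because both waypoints are recomputed from the filtered goal pose every cycle, the maneuver inherits the alpha-decay smoothing for free.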

§2.6 102 runs

The system was tested on 102 runs against the artificial tree, with the orange repositioned between trials so the relative approach yaw varied across the full ±75° range.

FIG. 2.6 · RESULTS (TABLE 2.1). Visual-servoing results over 102 real trials: clean success 84/102 (82.4%); success with reset 13/102 (12.7%); failure 5/102 (sparse foliage or lost view).
FIG. 2.6 The decoupled pipeline is extremely reliable when the goal is in view: 84 clean picks, 13 more after a reset to the last-good state, and 5 failures. The thesis attributes the failures to lost view/reset breakdowns and sparse-foliage plane instability, without giving a count for each bucket.

But the headline number hides the real limitation. Every test started with the orange visible. The whole pipeline is built on the assumption that the segmentation network has something to segment. The moment the orange is occluded by a leaf at t₀, this system has no recourse — there's nothing to project to 3D, nothing to plan towards. Chapter 3 is the response to exactly that problem.

PERCEPTION-BASED UAV PATH PLANNING CHAPTER 2 · DECOUPLED PIPELINE FIG. 2.1 — 2.6