CHAPTER 4
What worked, what didn't
The thesis is, at bottom, a clean two-arm experiment: each system wins where the other can't.
§4.1 The scoreboard
§4.2 The real trade-off
It's tempting to read this as "old method wins on the bench, new method has interesting potential" and shrug. The actual lesson is sharper.
The decoupled stack is the right answer for any task that fits inside its assumptions: the goal is in view, the foliage is dense enough for plane fitting, the environment is bounded. Inside that envelope it is essentially solved: 97/102 successful picks, with 84 clean and another 13 recovered by returning to a last-good pose.
The end-to-end network is the right answer for any task that doesn't fit inside those assumptions — initial occlusion, no map, having to search. It's worse on the easy task and better on the hard one. That's not failure; that's a different operating point on the same trade-off curve.
Decoupled pipelines optimise for verifiability — every module is testable, every interface inspectable. End-to-end networks optimise for capability in the open world — they pick up cues you didn't think to engineer. The choice between them isn't religious; it's a question of how rigid your environment actually is. For an artificial tree in a mocap room, decoupled wins. For a real orchard with wind, weather, and varied lighting, the learned approach is the only one with a path forward.
§4.3 What comes next
The thesis itself flags three directions:
- Use the baseline as a data engine. The decoupled pipeline can run thousands of trials with consistent labels — feed those into the end-to-end network as additional supervision. This is the same idea as bootstrapping a learned policy with a working classical controller.
- Sim-to-real transfer. Domain randomisation, photorealistic rendering, or feature-level alignment, so that the simulation-trained model survives the gap to real-world lighting and texture. Sadeghi et al. 2018 (RC-GAN) is the suggested starting point.
- Bigger / better world models. The thesis name-checks DreamerV2 — explicitly model the environment and let the network plan in latent space. That was the 2021 reading list; the 2026 equivalent is whatever VLA architecture is current (π0, OpenVLA, RT-2).
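The first direction — using the decoupled pipeline as a data engine — is just behaviour cloning, and the loop is short enough to sketch. Everything below (the proportional expert, the toy dynamics, the linear student) is an illustrative assumption standing in for the thesis's actual controller and network:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_controller(state, goal):
    # Stand-in for the decoupled pipeline: a proportional step toward the goal.
    return 0.5 * (goal - state)

# 1. Run the working classical controller many times, logging (state, action) pairs.
states, actions = [], []
for _ in range(200):
    state, goal = rng.uniform(-1, 1, 3), rng.uniform(-1, 1, 3)
    for _ in range(10):
        a = expert_controller(state, goal)
        states.append(np.concatenate([state, goal]))
        actions.append(a)
        state = state + a  # toy dynamics: the action is a position delta

X, Y = np.array(states), np.array(actions)

# 2. Fit a student policy to the logged pairs — linear least squares here,
#    where the thesis would do supervised pre-training of the end-to-end network.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def cloned_policy(state, goal):
    # 3. The cloned policy now maps (state, goal) -> action without the pipeline.
    return np.concatenate([state, goal]) @ W
```

Because the expert here happens to be linear, the student recovers it almost exactly; with a real pipeline as the expert, the same loop just feeds a bigger model.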
§4.4 What this thesis teaches you
Read this thesis as one well-executed instance of the perception-action loop debate. Not for the orange-picking specifically — that hardware is dated, the network architecture is dated, and a 2026 implementation would replace half the components. Read it because it does, on a small scale, the exact thing that's at the centre of modern robotics:
- Reduces a complex 3D physical task to a clean state-space + cost function (chapter 1).
- Engineers a working solution by stacking known computer-vision and control primitives (chapter 2).
- Replaces those primitives with a single learned function and shows what extra capability you buy with that move (chapter 3).
- Honestly reports where the learned approach loses on the test bench (chapter 4).
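The reduction in the first bullet is worth seeing in miniature. A toy version (the variable names and the quadratic form are my illustration, not the thesis's actual formulation): the whole picking task collapses to a state vector and a scalar cost to drive toward zero.

```python
import numpy as np

def cost(gripper_xyz, fruit_xyz):
    # Chapter-1-style reduction: the 3D physical task becomes
    # "minimise squared distance between gripper and fruit".
    return float(np.sum((gripper_xyz - fruit_xyz) ** 2))

fruit = np.array([0.4, 0.1, 0.9])
far   = cost(np.array([0.0, 0.0, 0.0]), fruit)   # start of the approach
near  = cost(np.array([0.4, 0.1, 0.85]), fruit)  # just below the fruit
```

Both chapter 2 and chapter 3 are then answers to the same question: what maps the current state to an action that decreases this cost?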
Every modern paper in robotic manipulation — π0, RT-2, OpenVLA, the Figure VLAs — is doing some version of this same comparison at much larger scale. If you understand the tension between chapters 2 and 3 of this thesis, you understand the central architectural axis those papers debate.
Pick one chapter to do, not just read. Easiest path given your background: open a Jupyter notebook, install lerobot from HuggingFace, and run a pretrained ALOHA policy in the bundled simulator. Then come back here and compare what's in the LeRobot codebase to chapter 3 of this thesis. Most of it will look familiar — they're solving the same problem at five years' distance.
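What you'll find at the core of that codebase is the classic evaluation rollout. A minimal sketch of that structure, with stub classes standing in for the simulator and the pretrained policy (these are placeholders, not LeRobot's actual classes or checkpoint names):

```python
import numpy as np

class StubEnv:
    """Minimal stand-in for a gym-style simulator (reset/step API)."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return np.zeros(4)                 # initial observation
    def step(self, action):
        self.t += 1
        obs = np.full(4, float(self.t))
        done = self.t >= 5                 # episode ends after 5 steps
        return obs, 0.0, done, {}          # obs, reward, done, info

class StubPolicy:
    """Stand-in for a pretrained policy: observation in, action out."""
    def select_action(self, obs):
        return -0.1 * obs

def rollout(env, policy, max_steps=50):
    # The loop at the heart of any eval script: observe, act, step, repeat.
    obs, steps, done = env.reset(), 0, False
    while not done and steps < max_steps:
        obs, reward, done, info = env.step(policy.select_action(obs))
        steps += 1
    return steps
```

Swap the stubs for a loaded checkpoint and the bundled simulator, and this is the loop you'll be reading — which is exactly why chapter 3 of the thesis will look familiar.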