NeRF is one of the most interesting technologies I’ve ever come across. It pulls off both 2D view transformation and 3D reconstruction simultaneously—all with just an MLP.
NeRF learns the color and opacity of every coordinate in a scene, conditioned on the viewing angle. Some people say NeRF’s training process is just overfitting. I’d say that’s both true and not quite true. The argument for overfitting comes from the fact that NeRF is designed to fit a specific scene rather than generalizing across multiple ones. But in my view, within the 3D world it has seen, NeRF actually achieves perfect generalization across all possible viewpoints.
Data
NeRF takes in a five-dimensional input—three dimensions representing the observation position and two for the viewing angle (camera parameters). The output consists of four dimensions: three for color and one for opacity.
For the highest quality training data, it's best if you can provide camera parameters with minimal error. But if you don't have access to them, no worries—you can use Colmap to match your images and infer the camera parameters automatically.
To prepare training data for NeRF, you need to capture images from different heights while circling around the target object. For example, if you're reconstructing an anime figurine, you should take shots at different levels—around the lower legs, waist, and neck. Additionally, you should capture extra images of hard-to-see areas to prevent texture loss.
If you're using Colmap, there’s one extra thing to keep in mind: Colmap relies on distinctive points in the images for matching. These points can be found at high-contrast edges, such as the border of a QR code, a sharp corner of a wall, or any region with prominent textures.
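If you'd rather script the Colmap step, the sketch below drives its standard sparse-reconstruction pipeline (feature extraction, matching, mapping) from Python. It assumes the colmap binary is installed and on your PATH; the directory names are just placeholders:

```python
import os
import subprocess

# Minimal sketch of COLMAP's sparse-reconstruction pipeline, driven from Python.
# Assumes the `colmap` binary is installed and on PATH; directory names are placeholders.
IMAGE_DIR = "images"     # the photos you captured
DATABASE = "colmap.db"   # COLMAP's feature/match database
SPARSE_DIR = "sparse"    # output: camera poses + sparse point cloud

os.makedirs(SPARSE_DIR, exist_ok=True)

# 1. Detect keypoints and descriptors in every image.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", DATABASE,
                "--image_path", IMAGE_DIR], check=True)

# 2. Match features across image pairs.
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", DATABASE], check=True)

# 3. Incremental structure-from-motion: recovers camera intrinsics and poses.
subprocess.run(["colmap", "mapper",
                "--database_path", DATABASE,
                "--image_path", IMAGE_DIR,
                "--output_path", SPARSE_DIR], check=True)
```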
With the right capture strategy, you should collect at least 200 images for a reasonably good NeRF reconstruction.
Method
NeRF’s network architecture is incredibly simple—just an MLP that maps a 5D input to a 4D output:
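$$(x, y, z, \theta, \phi) \;\longmapsto\; (r, g, b, \sigma)$$

As a rough sketch (the paper's full architecture also applies positional encoding and injects the viewing direction partway through the network), the mapping could look like this in PyTorch; TinyNeRF and its layer sizes are illustrative choices of mine, not the official model:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal sketch of the NeRF MLP: 5D input (x, y, z, theta, phi) -> 4D output (r, g, b, sigma).
    The real model adds positional encoding and feeds the viewing direction in later,
    but the core idea is just a stack of fully connected layers."""
    def __init__(self, hidden: int = 256, depth: int = 8):
        super().__init__()
        layers = [nn.Linear(5, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 4)  # (r, g, b, sigma)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.head(self.trunk(x))
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative density/opacity
        return torch.cat([rgb, sigma], dim=-1)
```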
Anyone unfamiliar with NeRF might assume that the input represents the camera’s viewpoint from the training data, while the output corresponds to each pixel’s RGB values—plus some mysterious σ whose meaning isn’t immediately clear. In reality, the camera viewpoint in the input doesn’t represent an observation angle in the traditional sense. Instead, it should be understood as a set of points along the viewing direction.
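To make that concrete, here is a minimal sketch of turning one camera ray into such a set of points; the ray origin and direction, the near/far depth bounds, and the sample count are assumed inputs:

```python
import torch

def sample_points_along_ray(origin, direction, near=2.0, far=6.0, n_samples=64):
    """Turn one camera ray into a batch of 3D query points.
    origin, direction: (3,) tensors; near/far: assumed depth bounds along the ray."""
    t = torch.linspace(near, far, n_samples)   # depths along the ray
    points = origin + t[:, None] * direction   # (n_samples, 3) query positions
    return points, t
```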
NeRF models the continuous distribution of color and opacity in the 3D space of interest. The observed color at a given viewpoint is then computed as the accumulated contribution of all the sampled points along the viewing direction, weighted by their opacity. This is represented by the equation:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \, \alpha_i \, \mathbf{c}_i$$

where:

$T_i$ represents how much light is transmitted past all points before point $i$, calculated as $T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$, and $\alpha_i$ represents the fraction of light contributed by point $i$, given by $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$. Here $\sigma_i$ is the opacity (density) predicted at sample $i$, $\mathbf{c}_i$ is its predicted color, and $\delta_i$ is the distance between adjacent samples along the ray.
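Below is a sketch of this discrete compositing step in code, given per-sample colors and densities from the MLP; the function and variable names are my own, not from the paper:

```python
import torch

def composite(rgb, sigma, t):
    """Discrete volume rendering along a single ray.
    rgb: (N, 3) sample colors, sigma: (N,) densities, t: (N,) sample depths."""
    delta = t[1:] - t[:-1]                               # distances between adjacent samples
    delta = torch.cat([delta, torch.tensor([1e10])])     # pad the last interval
    alpha = 1.0 - torch.exp(-sigma * delta)              # fraction of light each sample contributes
    # T_i: light surviving every sample before i (exclusive cumulative product of 1 - alpha)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha                              # per-sample contribution to the pixel
    color = (weights[:, None] * rgb).sum(dim=0)          # final rendered color for this ray
    return color, weights
```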
Since many sampled points fall in empty space with zero opacity, uniform sampling along the ray leads to wasted computation. To improve efficiency, NeRF treats the compositing weights from a coarse pass as a probability distribution over depth along the ray, then draws a second, finer set of samples concentrated where those weights are high and evaluates them with a fine network.
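Here is a simplified sketch of that importance-sampling step; the official implementation interpolates the CDF piecewise-linearly within bins, while this version just snaps to the nearest coarse depth:

```python
import torch

def importance_sample(t_coarse, weights, n_fine=128):
    """Draw extra depths where the coarse pass assigned high weight.
    t_coarse: (N,) coarse sample depths, weights: (N,) compositing weights from the coarse pass."""
    pdf = weights + 1e-5                 # avoid a degenerate all-zero distribution
    pdf = pdf / pdf.sum()
    cdf = torch.cumsum(pdf, dim=0)
    u = torch.rand(n_fine)               # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u)     # invert the CDF
    idx = idx.clamp(max=t_coarse.shape[0] - 1)
    return t_coarse[idx]                 # depths biased toward high-weight regions
```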
When constructing training data, NeRF extracts the incident angle and RGB value for each pixel from the input images and camera parameters. For every pixel, it samples a series of points along the corresponding ray direction and feeds them into an MLP to predict their RGB colors and opacity values. The final color for that pixel is then computed by integrating the contributions of all sampled points along the ray, and this predicted color is supervised by the ground-truth RGB value from the original image.
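Putting the pieces together, a single training step might look like the sketch below. It reuses the hypothetical TinyNeRF, sample_points_along_ray, and composite helpers from the earlier sketches and supervises one ray with an MSE loss against its ground-truth pixel:

```python
import torch

# Sketch of one training step, reusing the hypothetical helpers defined above.
model = TinyNeRF()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def train_step(ray_origin, ray_direction, target_rgb):
    points, t = sample_points_along_ray(ray_origin, ray_direction)
    # Convert the ray direction into the two viewing angles and attach them to every point,
    # forming the 5D input (x, y, z, theta, phi).
    theta = torch.atan2(ray_direction[1], ray_direction[0])
    phi = torch.acos(ray_direction[2] / ray_direction.norm())
    view = torch.stack([theta, phi]).expand(points.shape[0], 2)
    out = model(torch.cat([points, view], dim=-1))          # (N, 4): rgb + sigma
    pred_rgb, _ = composite(out[..., :3], out[..., 3], t)
    loss = torch.mean((pred_rgb - target_rgb) ** 2)         # MSE against the ground-truth pixel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```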
Since NeRF's input incorporates the viewing angle (incident direction), the same point can exhibit different colors when observed from different perspectives. This makes it possible to render view-dependent lighting effects, such as sunlight reflections on a glass surface.