GOEnFusion: Gradient Origin Encodings for 3D Forward Diffusion Models

University College London, Meta AI

GOEnFusion teaser figure

We propose GOEn (Gradient Origin Encoding), which encodes source views (observations) into arbitrary 3D Radiance-Field representations \(g(c, d)\) while maximising the transfer of source information. GOEn encodings can be used in the context of 3D forward diffusion models and for regression-based 3D reconstruction.

Abstract

The recently introduced Forward-Diffusion method makes it possible to train a 3D diffusion model using only 2D images for supervision. However, it does not easily generalise to different 3D representations and requires a computationally expensive auto-regressive sampling process to generate the underlying 3D scenes. In this paper, we propose GOEn: Gradient Origin Encoding (pronounced "gone"). GOEn can encode input images into any type of 3D representation without requiring a pre-trained image feature extractor. By design, it can handle single, multiple or no source views alike, and tries to maximise the information transfer from the views to the encodings. Our proposed GOEnFusion model pairs GOEn encodings with a realisation of the Forward-Diffusion model that addresses the limitations of the vanilla Forward-Diffusion realisation. We evaluate how much information the GOEn mechanism transfers to the encoded representations, and how well it captures the prior distribution over the underlying 3D scenes, through the lens of a partial AutoEncoder. Lastly, we evaluate the efficacy of the GOEnFusion model on the recently proposed OmniObject3D dataset, comparing against state-of-the-art Forward- and non-Forward-Diffusion models as well as other 3D generative models.


Method

GOEn mechanism conceptual illustration

Didactic illustration of the GOEn mechanism. We demonstrate the mechanism here using the Triplane representation for \(g(c, d)\), but note that it can be applied to other representations as well. The GOEn mechanism consists of two steps. First, we render the origin \(\zeta_0\) from the context poses \(\phi^\text{ctxt}\) into almost blank renders. Then, we compute the gradient of the MSE between the renders and the source views \(o^\text{ctxt}\) w.r.t. the origin \(\zeta_0\), which gives us the GOEn-encoded version \(\zeta_\text{enc}\).

We propose Gradient Origin Encodings (GOEns), where we define the encodings of the observations as the gradient of the log-likelihood of the observations under the differentiable \(\texttt{forward}\) operation. Without loss of generality, assuming that \(\zeta\) are the parameters of the function \(g\) (i.e. the features/weights of the 3D Radiance-Field), \(\mathcal{R}\) is the differentiable rendering (\(\texttt{forward}\)) functional and \(\zeta_0\) denotes the origin (zero parameters), we define the encodings \(\zeta_\text{enc}\) (fig. above) as follows: \begin{align} \zeta_\text{enc} &:= GOEn(g, \mathcal{R}, o^\text{ctxt}, \phi^\text{ctxt}) \nonumber \\ &\boxed{:= -\nabla_{\zeta_0}||o^\text{ctxt} - \mathcal{R}(g, \phi^\text{ctxt}; \zeta_0)||^2_2.} \label{eq:goen_enc} \end{align} As is common practice, we estimate the log-likelihood via the mean squared error, and note that the encoding function \(GOEn\) backpropagates through the differentiable forward functional \(\mathcal{R}\). By design, these encodings can handle single, multiple or no source views alike, and can be used with any 3D representation \(g\). We minimise the following loss function: \begin{align} &\mathcal{L}^\text{GOEn-MSE}(o^\text{ctxt}, \hat{o}^\text{ctxt}) := ||o^\text{ctxt} - \hat{o}^\text{ctxt} ||^2_2 \nonumber \\ & := ||o^\text{ctxt} - \mathcal{R}(g, \phi^\text{ctxt}; GOEn(g, \mathcal{R}, o^\text{ctxt}, \phi^\text{ctxt}))||^2_2, \end{align} to maximise the information content in the encodings \(\zeta_\text{enc}\). We essentially re-purpose the backward pass of the rendering function to encode the information in the source views \(o^\text{ctxt}\) into the parameters \(\zeta\) of the 3D scene representation.
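For concreteness, the following is a minimal PyTorch-style sketch of the GOEn encoding defined above. The differentiable renderer render (standing in for \(\mathcal{R}\)), the parameter shape of \(\zeta\) and the argument names are illustrative assumptions, not our exact implementation.

import torch

def goen_encode(render, param_shape, ctxt_views, ctxt_cameras, create_graph=False):
    """GOEn: zeta_enc = -grad_{zeta_0} || o_ctxt - R(g, phi_ctxt; zeta_0) ||_2^2.

    render       : assumed differentiable forward functional R, mapping
                   (representation parameters, cameras) -> rendered images
    param_shape  : shape of the representation parameters zeta
                   (e.g. triplane features, a feature-voxel grid, or MLP weights)
    ctxt_views   : source views o^ctxt
    ctxt_cameras : camera parameters phi^ctxt of the source views
    create_graph : keep the autograd graph so learnable parts of the
                   rendering pipeline can receive gradients through the encoding
    """
    # Step 1: render the origin zeta_0 (all-zero parameters) from the
    # context poses; this yields almost blank renders.
    zeta_0 = torch.zeros(param_shape, requires_grad=True)
    renders = render(zeta_0, ctxt_cameras)

    # Step 2: the negated gradient of the MSE w.r.t. the origin is the
    # GOEn-encoded representation zeta_enc.
    mse = ((ctxt_views - renders) ** 2).sum()
    (grad_zeta_0,) = torch.autograd.grad(mse, zeta_0, create_graph=create_graph)
    return -grad_zeta_0

Note that the encoding has no learnable parameters of its own, and when no source views are available the gradient term vanishes, so the encoding reduces to the origin \(\zeta_0\).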


Partial Auto-encoding experiments

In order to evaluate the information transfer from the source views to the encoded 3D representation, we train the standalone GOEn component on its own before using it in the context of 3D generation and 3D reconstruction. Given the dataset \(D = \lbrace(\mathcal{I}_i^j, \phi_i^j) | i \in [0, N] \text{ and } j \in [0, C]\rbrace\) of \(N\) 3D scenes, where each scene contains \(C\) images and camera parameters, we define the partial-autoencoder as a mechanism which encodes \(k\) source views and camera parameters of a certain 3D scene into the representation \(g\) (whose parameters are \(\zeta\)). The encoded scene representation should be such that views rendered from the same source cameras are as close as possible to the G.T. images \(\mathcal{I}\), i.e., the partial-autoencoder \(PA: \mathbb{R}^{h \times w \times c} \times \mathbb{R}^{4 \times 4} \rightarrow \mathbb{R}^{h \times w \times c}\) should minimise the following mean squared error objective: \begin{align} \mathcal{L}^\text{PA-MSE} &:= \mathbb{E}_{(\mathcal{I}, \phi) \sim D}\| \mathcal{I} - PA(\mathcal{I}, \phi) \|^2_2 \nonumber \\ \text{where, } \nonumber \\ PA(\mathcal{I}, \phi) &:= \mathcal{R}(g, \phi; GOEn(g, \mathcal{R}, \mathcal{I}, \phi)). \end{align} Since the auto-encoder is not tasked with full 3D reconstruction, we refer to this setting as "partial-autoencoding" instead of autoencoding. We evaluate the GOEn encodings on three different 3D Radiance-Field representations \(g\), namely Triplanes, Feature-Voxel grids, and MLPs.
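Continuing the sketch above, the partial-autoencoding objective simply re-renders the GOEn encoding from the same source cameras and compares against the ground-truth views; again, render, the tensor shapes and the helper names are assumptions for illustration only.

def pa_mse_loss(render, param_shape, images, cameras):
    """L^PA-MSE for one scene: PA(I, phi) = R(g, phi; GOEn(g, R, I, phi))."""
    # create_graph=True lets any learnable components of the rendering
    # pipeline be trained through the encoding step.
    zeta_enc = goen_encode(render, param_shape, images, cameras, create_graph=True)
    reconstructions = render(zeta_enc, cameras)       # PA(I, phi)
    return ((images - reconstructions) ** 2).mean()   # mean squared error

Averaging this loss over scenes drawn from \(D\), with \(k\) source views per scene, gives the partial-autoencoding objective \(\mathcal{L}^\text{PA-MSE}\) evaluated in the experiments below.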

Quantitative evaluation of Partial-AutoEncoding. PSNR (\(\uparrow\)), LPIPS (\(\downarrow\)) and SSIM (\(\uparrow\)) of GOEn on three different realisations of the 3D Radiance-Field \(g\). All metrics are evaluated on target views (different from the source views) against the G.T. mesh renders from the dataset. The Single-Scene-Overfitting (SSO) scores denote the case of individually fitting the representations to the 3D scenes.

Partial-AutoEncoding qualitative results figures

Qualitative evaluation of Partial-AutoEncoding. The rows MLP, Triplane and Voxel-grid show the target-view renders of the GOEn-encoded representations, respectively. The colour-coded columns demonstrate the effect of varying the number of source views \((1, 2, 3, 4)\) used in the GOEn encoding. The SSO column shows the target render of the single-scene-overfitted representation, while the G.T. column shows the mesh render from the dataset (repeated for clarity).


Qualitative Results

Qualitative samples of GOEn in the reconstruction setting. Here we show 360\(^{\circ}\) rotating videos of 3D scenes reconstructed from the OmniObject3D dataset by Triplane-GOEn. These samples are generated using \(4\) randomly chosen views of each individual object.

Qualitative samples of GOEn in the diffusion setting. Here we visualise 360\(^{\circ}\) rotating videos of 3D scenes generated by the GOEnFusion model trained on the OmniObject3D dataset. We also visualise the depth maps alongside the rendered videos to demonstrate the 3D consistency of the generated samples.

Qualitative samples of the Forward-Diffusion baseline model. Here we show videos of 3D scenes generated by the Forward-Diffusion baseline on the OmniObject3D dataset. We use their vanilla autoregressive sampling procedure to generate the 3D samples. Since the Forward-Diffusion model is based on the Pixel-NeRF architecture, and the OmniObject3D dataset has random camera locations, the generated videos do not form a smooth closed-loop trajectory around the object.

Qualitative samples of our non-forward baseline. Here we visualise 360\(^{\circ}\) rotating videos of 3D scenes generated by our non-forward, triplane-based diffusion model trained on the OmniObject3D dataset. We also visualise the depth maps alongside the rendered videos to demonstrate the 3D consistency of the generated samples. Since DiffTF's code was not available, we developed a similar non-forward diffusion baseline of our own for a fair comparison. In short, we first fit triplanes to the rendered views of the objects from OmniObject3D, and then train a DiT-based diffusion model on the fitted triplanes. We did not modify the base architecture of DiT in any way, nor did we apply any of the tricks from DiffTF.


BibTeX


@misc{karnewar2023goenfusion,
  title={{GOEnFusion}: Gradient Origin Encodings for 3D Forward Diffusion Models},
  author={Animesh Karnewar and Andrea Vedaldi and Niloy J. Mitra and David Novotny},
  year={2023},
  eprint={2312.08744},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgements

PRIME-EU logo

Animesh and Niloy were partially funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 956585. This research has also been supported by Meta AI and the UCL AI Centre. Finally, Animesh would like to thank Roman Shapovalov for the insightful discussions and help with the triplane fitting code.