We compare our approach with three state-of-the-art techniques, which represent the best reported results on monocular video-based 2D-to-3D estimation to date, including the deep feedforward 2D-to-3D network (Martinez et al.). To verify performance, we implemented three different prototypes that vary in the number of layers and levels, as shown in Table 10. Each row of the table corresponds to one prototype of the causal model, and N denotes its input frame number; in particular, the input frame numbers of our three prototypes exactly match the corresponding ones in Pavllo et al. The figure illustrates the input/output data flows across the different dilated convolution units: at level 0, the TCN units are arranged by layer along the y-axis, corresponding to those depicted in Fig. 3, while from level 1 onward, dilated convolution units of different scales are placed along the positive z-axis. As the level index grows, the number of dilated units decreases because the receptive field increases. For simplicity, we use a black/gray rectangle to denote the group of TCN units within a layer.
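The relationship between the number of dilated layers and the receptive field (and hence the input frame number N of each prototype) can be sketched with a small helper. This is an illustrative calculation, not the paper's code: it assumes kernel size 3 with dilation factors growing as powers of 3, the configuration used in Pavllo et al., whose models have receptive fields of 27, 81, and 243 frames.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of 1D dilated convolutions.

    Each layer with dilation d and kernel size k widens the
    receptive field by (k - 1) * d frames.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Exponentially growing dilations 3^0, 3^1, ... (assumed, following
# the temporal convolution design of Pavllo et al.)
for num_layers in (3, 4, 5):
    dilations = [3 ** i for i in range(num_layers)]
    n = receptive_field(kernel_size=3, dilations=dilations)
    print(f"{num_layers} dilated layers -> N = {n} input frames")
```

This shows why deeper prototypes need exponentially more input frames: with kernel size 3 and dilations 3^k, three, four, and five dilated layers yield receptive fields of 27, 81, and 243 frames respectively.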