Deep learning for depth estimation from a monocular video feed is a common strategy for obtaining rough 3D surface information when no RGB-D camera is available. Depth information is important in many domains,
such as object localization, tracking, and scene reconstruction from multiple camera views in robotics and industrial environments. The convolutional neural networks UpProjection, DORN, and Encoder/Decoder are
evaluated on hybrid training datasets enriched by CGI data. The UpProjection network achieves the highest accuracy, with relative deviations of 1.77% and 2.69% for the CAD-120 and SMV datasets, respectively.
It is shown that incorporating both front and side views improves the achievable depth estimation accuracy for human body images. With the incorporation of a second view, the error is reduced from 6.69% to
6.16%. For the target domain of this depth estimation, 3D human body reconstruction from aligned images in T-pose, plain silhouette reconstruction generally leads to acceptable results. Nevertheless,
additionally incorporating the rough depth approximation in the future could capture concave areas at the chest, breast, and buttocks, which are currently not handled by the silhouette reconstruction, and thus
yield more realistic 3D body models by utilizing the deep learning outcome in a hybrid approach.
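
To make the reported percentages concrete, the following is a minimal sketch of how a relative depth deviation might be computed. It assumes that "relative deviation" means the mean absolute relative error over pixels with valid ground truth; the abstract does not state the exact definition, and the function names `mean_relative_error` and `relative_reduction` are hypothetical, chosen only for illustration.

```python
def mean_relative_error(predicted, ground_truth):
    """Mean absolute relative depth error in percent.

    Assumption: pixels whose ground-truth depth is not positive are
    treated as invalid and skipped. This is an illustrative convention,
    not necessarily the one used in the evaluation described above.
    """
    total = 0.0
    count = 0
    for pred, true in zip(predicted, ground_truth):
        if true <= 0:
            continue  # no valid reference depth for this pixel
        total += abs(pred - true) / true
        count += 1
    if count == 0:
        raise ValueError("no valid ground-truth depth values")
    return 100.0 * total / count


def relative_reduction(before, after):
    """Relative improvement in percent when an error drops from before to after."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Toy depth values in metres, flattened to one dimension.
    predicted = [2.04, 0.98, 3.0]
    ground_truth = [2.0, 1.0, 3.0]
    print(f"relative error: {mean_relative_error(predicted, ground_truth):.2f}%")

    # Error values quoted in the abstract for one view versus two views.
    print(f"relative reduction: {relative_reduction(6.69, 6.16):.1f}%")
```

Under this assumed definition, the drop from 6.69% to 6.16% is an absolute improvement of 0.53 percentage points, or roughly 7.9% of the single-view error.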