Cloud photogrammetry with dense stereo for fisheye cameras

We present a novel approach for dense 3D cloud reconstruction above an area of 10 × 10 km² using two hemispheric sky imagers with fisheye lenses in a stereo setup. We examine an epipolar rectification model designed for fisheye cameras, which allows the use of efficient out-of-the-box dense matching algorithms designed for classical pinhole-type cameras to search for correspondence information at every pixel. The resulting dense point cloud allows us to recover a detailed and more complete cloud morphology than previous approaches that employed sparse feature-based stereo or assumed geometric constraints on the cloud field. Our approach is very efficient and can be fully automated. From the obtained 3D shapes, cloud dynamics, size, motion, type and spacing can be derived and used, for example, for radiation closure under cloudy conditions. Fisheye lenses follow a different projection function than classical pinhole-type cameras and provide a large field of view with a single image. However, the computation of dense 3D information is more complicated, and standard implementations for dense 3D stereo reconstruction cannot be easily applied. Together with an appropriate camera calibration, which includes the internal camera geometry as well as the global position and orientation of the stereo camera pair, we use the correspondence information from the stereo matching for a dense 3D stereo reconstruction of the clouds located around the cameras. We implement and evaluate the proposed approach using real-world data and present two case studies. In the first case, we validate the quality and accuracy of the method by comparing the stereo reconstruction of a stratocumulus layer with reflectivity observations measured by a cloud radar and the cloud-base height estimated from a lidar-ceilometer. The second case analyzes a rapid cumulus evolution in the presence of strong wind shear.

Dear Editors and Reviewers, thank you very much for your valuable comments on our paper. We addressed all the issues raised by the reviewers and modified/extended the paper as requested. Below, you can find the answers to the questions raised by the referees.
We also want to inform you that we added Mr. Martin Lennefer to the list of co-authors due to his intensive involvement in our work, in particular his technical expertise with the camera system and his contributions to the design and implementation of our field campaigns.

Stachniss and Clemens Simmer
Point-to-point response to: anonymous referee #1 from June 9th, 2016

Referee Comment:
"In many parts of the manuscript, rectification is referred to as the method that allows dense stereo matching. I find this misleading, because rectification is merely a transformation to translate the epipoles to infinity so that the epipolar lines are parallel in both images, hence the matching algorithm is less time consuming and more straightforward to design."

Authors Response:
We agree with the referee that this can be misleading. Our intention was to emphasize that epipolar rectification is usually a prerequisite for using efficient out-of-the-box dense stereo matching techniques, like the Semi-Global Matching employed in this paper. The plane-sweep algorithm, for example, also works on unrectified images.
To be more precise, we made changes at several places:
1) In the abstract, we point out that the rectification "allows the use of efficient out-of-the-box dense matching algorithms designed for classical pinhole-type cameras ..." (Page 1, Line 3).
2) At Page 2, Line 21, we added a clarification that we aim to use dense stereo methods designed for perspective cameras: "The main contribution of this paper is an approach to combine the large field of view of a fisheye camera with an efficient out-of-the-box dense stereo matching algorithm [...]"
3) At Page 2, Line 22, we added a statement to clarify that epipolar rectification is not a principal requirement for a dense stereo reconstruction: "Although epipolar rectification is not required for a dense reconstruction in principle, many dense stereo algorithms require rectified images because computation is greatly simplified."

Referee Comment:
"In Section 3.3, it is claimed that rectification allows to use the complete image content of a fisheye image. It is not clear to me why the whole content of rectified image can be used but the whole content of the non-rectified image cannot. Besides, the distortion (stretching) introduced by the rectification is likely to severely limit the use of data beyond a certain theta."

Authors Response:
We agree with the referee and we have corrected this issue. The image content of a non-rectified image can also be used, e.g. for feature-based matching.
However, the matching algorithm we use requires epipolar rectified images, which allow the use of out-of-the-box dense and efficient stereo methods, as we mention on Page 2, Lines 19-20. As we have fisheye images, we have to use a special rectification model and cannot use the classical perspective rectification via a homography (Page 6, Line 30): "In the frame of pinhole-type cameras, epipolar image rectification refers to the computation and application of a homography which maps epipolar lines (projections of epipolar planes on the image plane) to image rows. In the omnidirectional camera model, however, epipolar lines become epipolar curves due to the non-linear projection and thus cannot be mapped by a homography because of its line-preserving character. Therefore, we employ the rectification scheme following Abraham and Förstner (2005), which is sketched in Figure 4." The employed epipolar rectification keeps the complete image content, but also introduces some strong distortions at the image borders (Page 7, Line 4): "However, epipolar rectification leads to lower accuracies at the margins as the image is stretched in these areas, cf. Schneider et al. (2016)."

Point-to-point response to: anonymous referee #2 from August 5th, 2016

We thank the referee for the important and valuable comments on the paper. We used them to correct the paper for the revised version. Our responses to the referee's comments are as follows:

Referee Comment:
"Being not an expert in the field of cloud observation, I would be interested to read a bit more about the significance of the work: what does this technique offer, compared to the existing ones? Is it a better resolution/accuracy, a larger area, reduced cost, more completeness? One could also describe it in terms of requirements: we want to see ..., but existing techniques only give ... and therefore we develop a new system expecting to get ... and here we evaluate it"

Authors Response:
The problem is that the mentioned instruments simply do not have the spatial and/or temporal resolution for near-instantaneous scans; their representations of, e.g., boundary layer clouds, which are highly dynamic and comparatively small, are therefore quite limited. In the case of a cloud radar, they also lack the sensitivity to cover all parts of a cloud.
In order to better motivate our contribution, we reformulated the respective paragraph in the Introduction (Page 2, Lines 1-5): "Current ground-based cloud observations are made primarily with cloud radars, lidars, lidar-ceilometers and infrared and microwave radiometers, all of which usually only sense clouds along a pencil beam; they record the 3D cloud evolution at time resolutions during which clouds already change significantly. For instance, a cross-section scan of a cloud radar takes up to one minute with a beam width of about 0.6 degrees; moreover, its sensitivity does not allow it to detect the cloud boundaries. A lidar-ceilometer observes the cloud base height with high temporal resolution, but only as a zenith point measurement." We also added at Page 1, Line 27: "A more complete and consistent cloud shape can be used in radiative transfer applications where cloud geometry is modeled explicitly. Cloud evolution studies can benefit from the larger geometric data basis regarding segmentation and classification of individual clouds, tracking and visualization, making further analysis more effective."

Regarding the novelty and contribution of our approach in the field of cloud photogrammetry, we state on Page 2, Lines 17-24: "The main contribution of this paper is an approach to combine the large field of view of a fisheye camera with an efficient out-of-the-box dense stereo matching algorithm in order to obtain consistent and detailed cloud geometries above the area around the cameras. [...] In contrast to regular feature-based methods used in previous studies on cloud photogrammetry, dense methods seek a correspondence for every pixel in the stereo images, leading to a dense 3D point cloud. At the same time, dense stereo methods often impose spatial consistency constraints, which allows us to obtain more reliable correspondences in low-contrast image regions, which are typical for clouds, than sparse feature-based methods."

We also state on Page 3, Line 16, that up to now fisheye cameras have only been used to derive the cloud base height, but not to recover complete 3D cloud geometries of, e.g., convective boundary layer clouds: "Experiments involving sky imagers focused on the derivation of the cloud base height."

Referee Comment:
"There are several challenges being addressed, concerning the "difficult" geometry of fisheye lenses, the size of the setup (with a baseline of 300m), the use of automatic (dense) matching with "fuzzy" objects (clouds). All is well explained, but it is not always completely clear how it relates to the state of the art and where is the novelty -I suppose it is in the application of dense matching to clouds, but in that case the results could be analyzed a bit more exactly there.(Furthermore, I liked the used of stars in the exterior orientation)."

Authors Response:
We updated some passages in Sections 2 and 5 in order to put our methods and decisions better into the context of the cited methods.
The general contribution and novelty of this paper are stated in the introduction, which has also undergone some changes (Page 2, Lines 19-31).
We now also state on Page 3, Lines 27-31: "In contrast to previous studies, we used a dense stereo method to recover a dense 3D cloud geometry (Figure 1). Dense stereo methods obtain much more geometric information than feature-based methods, especially in image regions with low contrast, which is a general problem in cloud photogrammetry. This additional geometric information can prove beneficial in cloud evolution and radiation closure studies where the cloud geometry is modeled explicitly." The state of the art regarding stereo matching depends primarily on the reconstruction application (clouds in our case) and the number of cameras involved. Multi-view reconstruction, for example, offers many more constraints on the scene geometry than two cameras and makes visibility and radiometric consistency assumptions more feasible. In this field, feature-based methods (at least as a starting point) are the best choice, because each feature point can be refined towards a consistent maximum-likelihood estimate.
A two-camera stereo application has only the simple epipolar constraint, and hence, assuming a relatively small baseline compared to the whole 3D scene, further constraints do not have the same significance as in multi-view applications with a variety of viewpoints.
Another aspect is that clouds themselves are fuzzy objects and thus do not have clearly defined 3D boundaries that can be refined in the same way as solid object boundaries. These implications shift our focus from geometric accuracy towards geometric completeness, although we do not think that there is a significant trade-off (compare the statements about feature-based methods and dense stereo methods in Section 4 and in the introduction).

Referee Comment:
"Only in the evaluation section it becomes apparent what one had in mind concerning the size of the area to be measured: results are shown up to 4 km away from the cameras at two different heights (Fig 11) . "

Authors Response:
We updated the abstract and the introduction as well as the caption of Figure 1 to address this deficit: "We present a novel approach for dense 3D cloud reconstruction above an area of 10×10 km² using two hemispheric sky imagers with fisheye lenses in a stereo setup." and "Two cameras with a spatial displacement and simultaneous time of exposure provide the necessary information for a 3D reconstruction within an area of about 10×10 km² around the cameras."

Referee Comment:
"The accuracy gets rather poor at larger distances, which may be due to the baseline of only 300m (but a larger one might affect matching performance).
Some more discussion about this would be welcome."

Authors Response:
We revised Section 5.2 extensively and discussed the implications of a larger or shorter baseline and of different object distances on the reconstruction uncertainty and the matching performance. For example, on Page 13, Line 16, we now state: "However, larger baselines (or disparities) usually affect the stereo matching because increasing parts of the object might not be visible by both cameras. Additionally, the object will have a significantly different geometric appearance in each camera." We also refer to our statement in Section 5.1, where we discuss this issue (Page 12, Lines 13-16): "Compared to previous studies that mention a baseline between 500 m (Allmen and Kegelmeyer (1996)) and 900 m (Öktem et al. (2014)), this is a rather short distance and results in a higher geometric uncertainty of the estimated 3D points on the clouds, but reduces occlusions and enhances the ratio of mutually visible cloud regions in both images."

Authors Response:
Since the baseline is constant, the main area of reconstruction does not change.
We added the distance information to the figure captions, e.g. for Fig. 16.

Authors Response:
We added a reference to Heinrichs and Rodehorst (2006), who give a solution for perspective cameras. An adaption to omnidirectional cameras has, to the best of our knowledge, not been addressed yet and seems possible, but we did not investigate it. The revised text is on Page 13, Line 30: "A successful integration of a third camera into the dense stereo matching scheme including epipolar rectification is explained for perspective cameras in Heinrichs and Rodehorst (2006). An adaption to omnidirectional cameras has, to the best of our knowledge, not been addressed yet, but seems to be possible."

List of all relevant changes (pages and lines w.r.t. the marked-up text):

- Updated Abstract to clarify our approach (Page 1)
- Updated Introduction for a better explanation of current cloud remote sensing techniques (Page 2, Line 7)
- Updated Introduction to outline the novelty of our approach and its use (Page 2, Line 27)
- Added a sentence that epipolar rectification is not mandatory for dense stereo (Page 2, Line 32)
- Added reference in the Related Work section (Romps and Öktem (2015)) (Page 3, Line 29)
- Added reference in the Related Work section (Nguyen and Kleissl (2014)) (Page 4, Line 7)
- Added introduction of the working variables ϕ and θ (Page 5, Line 9)
- Replaced 'principal distance' by 'camera constant' for consistency (Page 5, Line 14)
- Replaced 'projection center' by 'principal point'; could be misleading (Page 5, Line 15)
- Added variable declaration of A1, A2 and A3 of the polynomial coefficients (Brown) (Page 5, Line 24)
- Added "distortion-corrected" to point out that x is not the same as x~ (Page 6, Line 1)
- Corrected an error in the upper formula (spherical coordinates) (Page 6, Line 3)
- Corrected the formula which was still using the distortion terms (Δu/Δv) (Page 6, Line 5)
- Added a general classification of epipolar rectification to Section 3.3 (Page 7, Line 23)
- Added the full notation of "atan2()" to the formula (Page 8, Line 7)
- Introduced the variable x''_V (Page 8, Line 8)
- Corrected the triangulation formula: ψ_R should be ψ_R' (Figure 5) (Page 9, Line 7)
- Sections 3.5.1 and 3.5.2 are now one level higher: 3.5 and 3.6 (Page 9, Line 10)
- Corrected an error in the equation for the coordinates of stars (sin(δ) to sin(θ))
In this work, we examine the epipolar rectification model, which allows the use of dense matching algorithms designed for classical perspective cameras to search for disparity information at every pixel. Together with an appropriate camera calibration, which includes internal camera geometry and global position and orientation of the stereo camera pair, we use the correspondence information from the stereo matching for a dense 3D stereo reconstruction of the clouds located around the cameras. From the obtained 3D shapes, cloud dynamics, size, motion, type and spacing can be derived and used for radiation closure under cloudy conditions. We implement and evaluate the proposed approach using real world data and present two case studies. In the first case, we validate the quality and accuracy of the method by comparing the stereo reconstruction of a stratocumulus layer with the reflectivity observations measured by a cloud radar and the cloud base height estimated from a Lidar-ceilometer. The second case analyzes a rapid cumulus evolution in the presence of strong wind shear.
3D stereo reconstruction is based on triangulation. Knowing the two camera orientations and the direction vector (baseline) between the cameras, each pair of corresponding image points can be back-projected into ray directions which intersect in the mapped 3D point. This requires accurately known parameters of the interior orientation, e.g. focal length and lens distortion, and also accurate knowledge of the exterior orientation, namely the position and orientation of the cameras in space, both of which need to be determined by a calibration procedure.

The paper is organized as follows. In Sec. 2, we discuss previous studies and their contributions to the field of cloud photogrammetry. In Sec. 3, we describe the fisheye camera model, the applied camera calibration techniques, as well as the epipolar rectification method and the triangulation. Sec. 4 introduces the employed dense stereo algorithm to obtain corresponding image points. Sec. 5 presents our stereo setup, a geometric uncertainty analysis and two stereo reconstruction case studies. One case shows a reconstructed stratocumulus layer which serves as a validation of the achieved geometric accuracy, including comparisons with lidar-ceilometer and cloud radar observations. The second case analyzes the cloud development under strong convection and wind shear and illustrates the quality of the cloud morphology reconstruction.

Camera Calibration and Stereo Reconstruction
In this section, we describe the projection model for fisheye cameras and formulate the geometric relationship between two sky imagers, which is required for 3D stereo reconstruction. We also introduce an epipolar image rectification scheme for fisheye stereo cameras that allows us to identify corresponding image points between the two images using a dense stereo matching algorithm. Finally, we describe the employed camera calibration methods and the triangulation of 3D points using corresponding image points in the epipolar rectified images.
The camera model contains a projection function, which should be close to the projection of light in the optics. The interior orientation consists of the camera calibration parameters of this model, describing the camera-specific projection on the image plane. The projection can be split up into a mapping of a 3D point P to a 2D point x on the model image plane, and a mapping of x to the actual pixel coordinates on the sensor plane (Figure 2). While most cameras follow the pinhole camera model (Sonka et al., 1999; Stockman and Shapiro, 2001), fisheye cameras have lenses with a different projection function and follow the omnidirectional camera model (Kannala and Brandt, 2006; Bakstein and Pajdla, 2002), visualized in Figure 2. Each projection ray passes through the projection center C and intersects the image sphere in the point x, which determines the ray direction. The ray direction x can be mapped to x on the image plane using, e.g., one of the projection functions r(θ) provided by Abraham and Förstner (2005). Each symmetric projection function r(θ) defines the distance between x and the principal point x_C as a function of the zenith angle θ between the incoming projection ray and the optical axis, as depicted in Figure 2 (a). Accordingly, the pixel coordinates of x on the image plane are a function of the azimuth angle ϕ and r(θ). The mapping of x in Cartesian image coordinates to pixel coordinates is usually described as an affine transformation which depends on the camera constant c, the shear factor s and the principal point x_C = (u_0, v_0) in pixel coordinates. Note that the origin of the sensor coordinate system lies in the upper left corner of the image, as depicted in Figure 2 (b).
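To make the two-stage mapping concrete, the following sketch projects a 3D point with the equidistant model r(θ) = cθ, one of the symmetric projection functions discussed by Abraham and Förstner (2005); the numeric parameter values are hypothetical and serve only as an illustration.

```python
import numpy as np

def project_fisheye(P, c=700.0, s=0.0, u0=1224.0, v0=1024.0):
    """Project a 3D point P (camera frame, z along the optical axis) to pixel
    coordinates, assuming the equidistant projection function r(theta) = c*theta."""
    X, Y, Z = P
    theta = np.arctan2(np.hypot(X, Y), Z)    # zenith angle w.r.t. the optical axis
    phi = np.arctan2(Y, X)                   # azimuth angle
    r = c * theta                            # distance to the principal point
    x, y = r * np.cos(phi), r * np.sin(phi)  # point on the model image plane
    # affine mapping to sensor coordinates with shear s and principal
    # point (u0, v0); the origin lies in the upper left corner of the image
    return x + s * y + u0, y + v0

print(project_fisheye(np.array([100.0, 50.0, 1500.0])))  # -> (u, v) in pixels
```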

The omnidirectional camera model refers to the local camera coordinate system with the projection center as the origin and the sensor plane defining its orientation. The exterior orientation of a camera, which is described by three rotation angles and three translation shifts, is given in a common world reference system Ω_W and allows us to derive the geometric relationship between two or more cameras. We choose one camera as the reference camera, which is considered the left camera, and the other as the right camera, which simplifies the following notation and avoids misconceptions. The choice of the reference camera has no impact on the reconstruction results. Figure 3 illustrates the principal stereo configuration with two hemispheric cameras, making the world reference system Ω_W and the two camera reference systems Ω_L and Ω_R explicit. Let C_L be the world coordinates of the left camera and P_L an object point in the left camera reference frame. The transformation of P_L into world coordinates then reads

P_W = R_z(γ_L) R_y(β_L) R_x(α_L) P_L + C_L.

Here α_L, β_L and γ_L are the Eulerian angles (roll, pitch, yaw) and R_x(α_L), R_y(β_L) and R_z(γ_L) the respective rotation matrices. Considering R_L and R_R as the rotation matrices and C_L and C_R as the world coordinates of the left and right camera, we obtain the relative orientation between the left and the right camera via a rotation matrix R and the baseline vector t. The determination of an accurate relative pose is crucial, as errors in the estimated exterior orientations may sum up to larger errors in the relative orientation, which compromises the image correspondence algorithm and the triangulation of 3D point coordinates, as investigated by Hirschmüller and Gehrig (2009).
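A small sketch of this bookkeeping, assuming camera-to-world rotations as in the transformation above (the exact composition convention of the paper's Eq. (7) may differ):

```python
import numpy as np

def euler_to_R(alpha, beta, gamma):
    """Rotation matrix from the Eulerian angles roll, pitch, yaw: R = Rz Ry Rx."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def relative_orientation(R_L, C_L, R_R, C_R):
    """Relative rotation R and baseline vector t of the right camera expressed
    in the left camera frame, assuming P_W = R_X P_X + C_X for X in {L, R}."""
    R = R_L.T @ R_R
    t = R_L.T @ (C_R - C_L)
    return R, t
```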
The two camera centers C_L, C_R and the object point P span the epipolar plane. This geometry can be expressed with the coplanarity equation, which holds when

x_R^T E x_L = 0,    (8)

where E is the essential matrix obtained by a matrix multiplication of R with the skew-symmetric matrix [t]_× of t.
Given two direction rays x_L and x_R of corresponding image points x_L and x_R in the left and the right camera images, E x_L defines the normal vector of the epipolar plane spanned by x_L and t, and hence requires its correspondence x_R to lie on the intersection circle between the epipolar plane and the image sphere. The same holds for E x_R and x_L. In case of deviations from this constraint, Eq. (8) defines the cosine of the angle between the two epipolar planes spanned by t together with E x_L and E x_R, respectively, which we use as an error measure to estimate E from image point correspondences. The essential matrix E can be expressed with five independent parameters (two for the baseline direction t and three for the rotation angles). We use an adapted version of the direct method of Longuet-Higgins (1981) to compute E with eight correspondences, which exploits that the eigenvalues of E are λ_1 = λ_2 = 1 and that we use spherically normalized ray directions.
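A sketch of the composition of E and of an angular consistency check in the spirit of the error measure above; the multiplication order of R and [t]_×, and the use of E and E^T for the two plane normals, are our assumptions, not the paper's exact Eq. (8):

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x, so that skew(t) @ v equals the cross product t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential(R, t):
    return skew(t) @ R   # assumed composition order

def epipolar_angle_deg(E, x_L, x_R):
    """Angle between the two epipolar planes implied by the rays x_L and x_R;
    it vanishes for a perfect correspondence."""
    n_L = E @ x_L        # plane normal associated with the left ray
    n_R = E.T @ x_R      # plane normal associated with the right ray
    c = abs(n_L @ n_R) / (np.linalg.norm(n_L) * np.linalg.norm(n_R))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```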
The estimated essential matrix E can be decomposed to obtain the relative rotation R and the baseline vector t, which can be used for triangulation and epipolar image rectification, as described in Sec. 3.3. However, a 3D reconstruction based on R and t alone takes place only in the coordinate system of the reference camera and only up to scale because of the scale ambiguity.
A meaningful reconstruction in world coordinates requires the absolute length of the baseline (distance between the cameras) and the absolute orientation of the reference camera. Sec. 3.5 and Sec. 3.6 present methods to estimate both the parameters of the exterior and of the interior orientation.
The following derivations with respect to β and ψ are only valid for an epipolar rectified image pair. For each real camera we can define a virtual camera (subscript V), such that the virtual cameras are in a canonical stereo setup, i.e. both have a common x-axis, the same orientation (R_{L,V} = R_{R,V} = I) and are only shifted along the virtual x-axis, t_V = (‖t‖, 0, 0). An object point P_V in the virtual world coordinate system is then defined by the angle β, which denotes the respective epipolar plane, and the two angles ψ_L and ψ_R that define the direction of the projection ray within the epipolar plane (Figure 4 (b)). Based on this geometry, we are able to define a rectification scheme (Figure 4 (d)) which covers the whole 3D space: the image rows correspond to the angle β, which represents the rotation of the epipolar plane, while the image columns represent the respective angles ψ_L and ψ_R in the rotated epipolar plane.
Since the matrix columns of R_V are the images of the base vectors e_i, the first column is the normalized baseline vector. We can freely choose the other two coordinate axes as long as they form an orthonormal system, because each realization aligns the x-axis with the baseline. This means that the rectification scheme is defined up to a rotation about the baseline t, which corresponds to a shift of the range of the angle β and a vertical translation in the rectified image. We define the virtual y-axis in the x-y-plane of the world coordinate system, which also determines the virtual z-axis.
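One possible construction of R_V following this recipe (a sketch; it assumes the baseline is not parallel to the world z-axis):

```python
import numpy as np

def virtual_rotation(t):
    """Rotation matrix R_V of the canonical virtual cameras: its first column is
    the normalized baseline, the virtual y-axis lies in the world x-y-plane."""
    ex = t / np.linalg.norm(t)
    ey = np.cross([0.0, 0.0, 1.0], ex)   # orthogonal to the world z-axis,
    ey /= np.linalg.norm(ey)             # hence inside the world x-y-plane
    ez = np.cross(ex, ey)                # completes the right-handed system
    return np.column_stack([ex, ey, ez])
```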

The reverse mapping, from rectified image coordinates to world coordinates, is given by Eq. (9).

Triangulation for 3D-Reconstruction
Having corresponding image points x_L and x_R identified in the image rows of the epipolar rectified images, x_L and x_R can be directly derived from x_{L,V} and x_{R,V} using the reverse mapping of the rectification scheme of Eq. (9). Due to the rectification, the ray directions are guaranteed to lie in the 3D epipolar plane and do intersect. Considering the geometry shown in Figure 5, we identify the relation s / sin(ψ_R + π/2) = b / sin(γ) from the law of sines. With γ = ψ_L − ψ_R and P = s · x_L we obtain the triangulated 3D point.
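A sketch of this triangulation in the virtual frame, assuming ψ is measured from the direction perpendicular to the baseline within the epipolar plane and β rotates the plane about the baseline (x-axis); the numbers in the example are made up:

```python
import numpy as np

def triangulate(psi_L, psi_R, beta, b=300.0):
    """3D point in the virtual (rectified) frame from the in-plane ray angles
    psi_L, psi_R and the epipolar plane angle beta (all in radians)."""
    gamma = psi_L - psi_R                                # parallax angle at P
    s = b * np.sin(psi_R + np.pi / 2) / np.sin(gamma)    # law of sines
    d = np.array([np.sin(psi_L), np.cos(psi_L), 0.0])    # left ray in the plane
    R_x = np.array([[1.0, 0.0, 0.0],                     # rotate the epipolar
                    [0.0, np.cos(beta), -np.sin(beta)],  # plane about the
                    [0.0, np.sin(beta), np.cos(beta)]])  # baseline by beta
    return s * (R_x @ d)

# a cloud point roughly 1.5 km above the midpoint of the baseline
print(triangulate(np.radians(5.71), np.radians(-5.71), 0.0))
```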

Calibration and Parameter Estimation
In the following, we present our procedure to estimate the parameters of the interior and exterior orientation introduced in Sec. 3.1 and Sec. 3.2.

3.5 Parameters of the Interior Orientation

For the estimation of the parameters of the interior orientation in Eq. (5) we employ a test field with markers that encode a geometric relationship. Such a test field can be a sophisticated setup in a laboratory (Seiz, 2003) or, as in our case, a pattern printed or fixed on a plane or inside an open half-cube as depicted in Figure 6. The calibration generally proceeds in two steps: The first step provides sample images of the pattern in different poses covering the field of view. For each image, an image processing routine detects and extracts the image coordinates of the pattern geometry. In the second step, the extracted image points are used to estimate the optimal parameters of the camera model with an adjustment procedure.
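The same two-step idea can be sketched with OpenCV's fisheye (equidistant) model as a stand-in for the calibration software used in this work; the pattern detection is assumed to have produced the 2D-3D correspondences already:

```python
import numpy as np
import cv2

def calibrate_fisheye(objpoints, imgpoints, image_size=(2448, 2048)):
    """objpoints: list of (1, N, 3) float64 arrays with the known 3D pattern
    coordinates per calibration image; imgpoints: matching (1, N, 2) detections.
    Returns the RMS reprojection error and the estimated camera parameters."""
    K = np.zeros((3, 3))     # camera matrix (camera constant, principal point)
    D = np.zeros((4, 1))     # polynomial distortion coefficients
    flags = cv2.fisheye.CALIB_RECOMPUTE_EXTRINSIC | cv2.fisheye.CALIB_FIX_SKEW
    # nonlinear adjustment: iteratively minimizes the reprojection error
    rms, K, D, rvecs, tvecs = cv2.fisheye.calibrate(
        objpoints, imgpoints, image_size, K, D, flags=flags,
        criteria=(cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-8))
    return rms, K, D
```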
We employ a software developed by Abraham and Hau (1997) that accepts input images of a calibration cube with a fixed white dotted pattern. Each inner cube side has a fingerprint pattern to make sure the detected dots are properly identified as lying on the x-, y- or z-plane, which determines their corresponding absolute 3D coordinates with respect to the cube reference system. The extraction stage results in a set of correspondences between 2D image points and 3D cube points, which are then used in a nonlinear bundle adjustment that iteratively minimizes the reprojection error between the observed image points and the reprojections of the 3D points of the pattern using the respective parameter estimation according to Eq. (5) and (6).

3.6 Parameters of the Exterior Orientation

First, we describe how to estimate the absolute location and orientation of each camera in the world reference system. This information is then used to derive a first estimate of the essential matrix E*, which is iteratively refined using point-feature correspondences obtained from the stereo images according to the epipolar constraint in Eq. (8).

Employing a satellite navigation system like GPS allows us to derive the geographic position of the cameras with an accuracy of about 2-3 m. Accuracies in the range of centimeters can be achieved by using additional correction information broadcast by terrestrial reference stations (D-GPS). The obtained coordinates can be mapped from the global reference system, e.g. WGS-84, to a local reference system using a suitable projection in order to get the exact baseline length and direction.
where R_abs can be parametrized as a unit quaternion or as an axis-angle representation. From the absolute location and orientation of the cameras we can derive the relative orientation using Eq. (7), and a first estimate E* of the essential matrix can be composed according to Eq. (8).
For a further refinement of the essential matrix E*, we collect SIFT point-feature correspondences (Lowe (2004)) that are consistent with the epipolar constraint in Eq. (8): For each detected feature in the left image, we select all features in the right image that are consistent with E* up to a predefined error threshold, e.g. ∠(E* x_L, E* x_R) < 2°, and then find the best match via K-Nearest-Neighbor using the SIFT feature descriptor. The same is done in the other direction, i.e. from the right to the left image, so that only mutually consistent matches are selected.
A couple of image pairs are enough to collect plenty of evenly distributed correspondences, as is shown in Figure 8. Since this set of correspondences will contain mismatches that would lead to a flawed refinement of E * , we use the robust parameter estimation technique RANSAC (Fischler and Bolles, 1981) to filter out those likely mismatches.
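A condensed sketch of this correspondence collection (the gating order is simplified: mutual best matches are found first and then filtered with the angular epipolar measure; `rays` is a hypothetical helper that inverts the fisheye projection of Sec. 3.1, and a RANSAC step would follow as described above):

```python
import numpy as np
import cv2

def epipolar_consistent_matches(img_L, img_R, E, rays, max_angle_deg=2.0):
    sift = cv2.SIFT_create()
    k_L, d_L = sift.detectAndCompute(img_L, None)
    k_R, d_R = sift.detectAndCompute(img_R, None)
    bf = cv2.BFMatcher(cv2.NORM_L2)
    lr = {m.queryIdx: m.trainIdx for m in bf.match(d_L, d_R)}  # left -> right
    rl = {m.queryIdx: m.trainIdx for m in bf.match(d_R, d_L)}  # right -> left
    good = []
    for i, j in lr.items():
        if rl.get(j) != i:           # keep mutually consistent matches only
            continue
        x_L, x_R = rays(k_L[i].pt), rays(k_R[j].pt)   # pixel -> unit ray
        n_L, n_R = E @ x_L, E.T @ x_R                 # epipolar plane normals
        c = abs(n_L @ n_R) / (np.linalg.norm(n_L) * np.linalg.norm(n_R))
        if np.degrees(np.arccos(np.clip(c, -1.0, 1.0))) < max_angle_deg:
            good.append((k_L[i].pt, k_R[j].pt))
    return good
```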

Finally, we employ a Levenberg-Marquardt minimization of the cost function in Eq. (11). Because the observations x_L and x_R are always subject to measurement errors, they will not lie exactly within an epipolar plane. x̂_L and x̂_R denote the estimated true locations of x_L and x_R that do lie exactly within an epipolar plane and at the same time are closest to the observations in an angular sense. As the estimations and the observations are unit vectors and lie on the image sphere, Eq. (11) formulates a meaningful angular error measure and its minimization provides an optimal maximum likelihood solution, see Oliensis (2002).
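A minimal sketch of such a refinement over the five parameters of E; for brevity it uses the algebraic epipolar residual x_R · (E x_L) instead of the paper's angular cost of Eq. (11):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_essential(p0, rays_L, rays_R):
    """rays_L, rays_R: (N, 3) arrays of corresponding unit ray directions,
    with N >= 5. p0: initial guess (rotation vector, baseline azimuth, elevation)."""
    def residuals(p):
        R = Rotation.from_rotvec(p[:3]).as_matrix()
        az, el = p[3], p[4]          # two parameters for the baseline direction
        t = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
        tx = np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])
        E = tx @ R
        return np.einsum('ij,ij->i', rays_R, rays_L @ E.T)  # x_R . (E x_L)
    return least_squares(residuals, p0, method='lm').x      # Levenberg-Marquardt
```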

Stereo Matching
To calculate the 3D information of a point P, we need to know the coordinates of the projected point on both image planes. Only if such correspondences are known can its 3D location be computed. The aim of stereo matching algorithms is to compute such correspondence information.
The visual appearance of a scene point P in each camera determines whether stereo matching is successful or not. Automatic stereo matching is likely to fail if there are occlusions, specular reflections, varying illumination or large scale and pose differences between the images, so that either corresponding object points are not visible in both images or differ significantly in their appearance with respect to shape and size. Also, objects may lack sufficient texture or contrast, or a unique surface with a consistent visual appearance when observed from different perspectives may not exist. Especially the latter poses a problem in cloud photogrammetry. Hence, depending on the cloud situation, stereo reconstruction has limitations.
In practice, one either aims at finding the correspondences between distinct points in the images (sparse stereo) or between all pixels (dense stereo). A good overview is given in Scharstein and Szeliski (2002).
We only employ sparse stereo during the estimation of the essential matrix (Figure 8), as described in Sec. 3.6.

The correspondence information is stored in the so-called disparity map D, which contains for each pixel in the rectified reference image the horizontal sub-pixel distance d to its corresponding image point, shifted within the same row in the other image, see Figure 9. Hence, for the two corresponding points x_{L,V} in the left and x_{R,V} in the right image, we have for each pixel position in the disparity map D(x_{L,V}) = |u_{L,V} − u_{R,V}|. Since the image columns of the rectified images represent the angles ψ_L and ψ_R, the disparity is directly related to the parallax angle γ = ψ_L − ψ_R.

In our current approach, we rely on a dense matching algorithm that is based on the Semi-Global Matching (SGM) proposed by Hirschmüller (2005) and is called Semi-Global Block-Matching (SGBM). It produces accurate results while being deterministic and computationally efficient. In this work, we use the implementation provided in OpenCV.
For a detailed algorithmic description, we refer to the original paper by Hirschmüller (2005) and to the OpenCV documentation. Here, we present only a short summary. The basic problem of finding an optimal disparity map can be formalized as an energy minimization problem involving an energy functional and an appropriate minimization technique (Scharstein and Szeliski, 2002). A good disparity map should satisfy at least the following two aspects:

1. Two corresponding pixels should have similar intensity or structural values (data consistency).

2. Neighboring pixels with similar intensity or structural values should have similar disparity values (smoothness assumption).

Both aspects can be combined to form the global energy E(D) = E_data + E_smooth and are realized in SGBM as follows:
To achieve data consistency (aspect 1), E_data(x) is computed for each pixel x independently using a window-based similarity measure such as the sum of absolute/squared differences or normalized cross-correlation (NCC), yielding one matching cost per pixel for each valid disparity value d ∈ [d_min, d_max]. Note that using a larger window will smooth the disparity map, since small details have a smaller influence on the measure. This causes fine structures to disappear, but also reduces errors caused by image noise. However, relying only on a minimum in E_data will cause mismatches, especially in regions with low contrast or repeating patterns, which results in a flawed disparity map. This can be countered by introducing an additional smoothness term E_smooth that penalizes larger disparity differences of neighboring image points across eight linear paths. Figure 9 illustrates the computation of E_data using epipolar rectified images and the final disparity map after also incorporating E_smooth. One can clearly see the difference in the disparity values between clouds that are closer (high disparity) and more distant clouds (smaller disparity).
In our application, we use a window size of 11×11 pixels. To achieve a successful matching in larger low-contrast regions and to reduce the variability in the reconstruction due to the noisy image signal, we furthermore scale the input images to one quarter of their size. This causes an oversmoothing near cloud boundaries, but in this way we obtain smoother cloud surfaces.
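These settings translate directly to OpenCV's SGBM implementation; the sketch below mirrors the described configuration (the file names are placeholders, and the disparity search range and penalty weights are illustrative choices, not values from the paper):

```python
import cv2

# epipolar rectified fisheye images, downscaled to quarter size as described
img_L = cv2.resize(cv2.imread('left_rectified.png', cv2.IMREAD_GRAYSCALE),
                   None, fx=0.25, fy=0.25, interpolation=cv2.INTER_AREA)
img_R = cv2.resize(cv2.imread('right_rectified.png', cv2.IMREAD_GRAYSCALE),
                   None, fx=0.25, fy=0.25, interpolation=cv2.INTER_AREA)

block = 11                           # 11x11 matching window
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,              # valid disparity range, multiple of 16
    blockSize=block,
    P1=8 * block * block,            # penalty for small disparity changes
    P2=32 * block * block,           # penalty for larger disparity jumps
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2)

# OpenCV returns fixed-point disparities scaled by a factor of 16
disparity = sgbm.compute(img_L, img_R).astype('float32') / 16.0
```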

Stereo Setup and Results
In this section, we present our stereo setup deployed at the Forschungszentrum Jülich GmbH, Germany. We also give a geometric uncertainty analysis of the current setup and discuss common error sources and how they affect the calibration and reconstruction results, with a focus on asynchronous recordings. Finally, we present the experimental results of two case studies. The first case shows that our dense stereo approach is able to achieve a geometric accuracy that is comparable with those of previous studies using sparse stereo methods, like Seiz (2003) and Öktem et al. (2014). The second case illustrates the capability of our approach to successfully reconstruct the complex 3D cloud structure and dynamics of convective clouds.
Both cameras are IDS network cameras of type uEye GigE UI-2280SE with a 2/3" CCD sensor of 2448×2048 pixels and are equipped with a Fujinon FE185C057HA-1 C-Mount fisheye adapter, providing a 185° field of view, and a fixed focus. The cameras are mounted in a box and point towards the sky. An acrylic glass dome protects the cameras against environmental effects. A power supply and a fan distribute heat to prevent the condensation of water on the glass dome. Each camera is connected to a small computer that hosts a self-developed camera control application based on the IDS C++ SDK (IDS, 2013), which allows us to control the cameras remotely, e.g. for scheduled recordings with settings such as exposure time, recording interval (e.g. 15 seconds) or modes like long exposure (night mode) or High Dynamic Range (HDR). The images are currently saved locally and transferred if needed. Synchronization is done by frequent requests to a local NTP service.

Fisheye lenses cover a substantially larger field of view than normal perspective lenses, but at a reduced effective angular resolution. As a consequence, the stereo depth resolution is lower for fisheye lenses compared to perspective ones. This drawback limits the effective range for a high quality reconstruction, especially for distant clouds. A comparison of our fisheye cameras with one of the wide-angle cameras from Öktem et al. (2014) highlights the differences: While our camera has a field of view (FOV) of about 180° and the circular view field covers a 3.5-Megapixel region (of a 5-Megapixel sensor), the wide-angle camera has a FOV of 67° and takes 1-Megapixel images from a 5-Megapixel sensor. To compare the view fields, the respective solid angles Ω_fish for the sky imager and Ω_wa for the wide-angle camera, given in steradians, must be derived. Assuming a field of view of 180° for the sky imager and 67° for the wide-angle camera leads to Ω_fish = 6.28 sr and Ω_wa = 1.04 sr.
Furthermore, the solid angle per pixel is 6.28/(3.5·10⁶) = 1.8·10⁻⁶ sr for the fisheye and 1.04/(1.0·10⁶) = 1.04·10⁻⁶ sr for the wide-angle camera, resulting in a 43 % lower spatial resolution of the fisheye camera. Using the full resolution of the wide-angle camera with 5 Megapixels, the ratio would be 11 %. Hence, one would have to use a sensor with an almost 10 times higher resolution to compete with the wide-angle camera in this respect. Depending on the fisheye projection function and the location in the image, the spatial resolution will of course vary due to the different degrees of distortion.
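The quoted values follow from the solid angle of a spherical cap, Ω = 2π(1 − cos(FOV/2)), which reproduces the numbers above (a quick check, not code from the paper):

```python
import numpy as np

def cap_solid_angle(fov_deg):
    """Solid angle of a circular field of view (spherical cap), in steradians."""
    return 2 * np.pi * (1 - np.cos(np.radians(fov_deg) / 2))

omega_fish = cap_solid_angle(180.0)   # ~6.28 sr
omega_wa = cap_solid_angle(67.0)      # ~1.04 sr
per_px_fish = omega_fish / 3.5e6      # ~1.8e-6 sr per pixel
per_px_wa = omega_wa / 1.0e6          # ~1.04e-6 sr per pixel
print(omega_fish, omega_wa, 1 - per_px_wa / per_px_fish)  # resolution gap ~0.42
```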
Next, we demonstrate the spatial variability of the 3D reconstruction error based on our real stereo setup by assuming a constant error in the disparity map. Based on the distance between our cameras, we can estimate the uncertainty of the reconstruction using the triangulation method presented in Sec. 3.4. Figure 11 shows the reconstruction error for a virtual cloud layer at 1500 m and 3000 m height, assuming an error of 1 pixel in the disparity map after the matching phase, which corresponds to a directional error of approximately 0.1°. The values represent absolute errors within the epipolar plane. Therefore, depending on the horizontal distance to the cameras, the error has a larger vertical component (small distance) or a larger horizontal component (larger distance). The error grows in two cases: first, with increasing distance to the cameras, and second, with increasing collinearity of the object point and the camera centers. In both cases, the angle γ between the projection rays becomes very small, yielding larger triangulation errors. Thus, the sky imagers do not provide a hemispheric 3D reconstruction with homogeneous accuracy. However, the accuracy can be ameliorated by employing a third camera in a triangle configuration.
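This propagation can be reproduced with the triangulation relation of Sec. 3.4 by perturbing one ray by the assumed 0.1° (a sketch with made-up example angles, using the same angle convention as the triangulation sketch above):

```python
import numpy as np

def range_error(psi_L_deg, psi_R_deg, b=300.0, dpsi_deg=0.1):
    """Change of the triangulated range s when the right ray is off by dpsi,
    which roughly corresponds to a one-pixel disparity error."""
    def s(psi_R):
        gamma = np.radians(psi_L_deg) - psi_R
        return b * np.sin(psi_R + np.pi / 2) / np.sin(gamma)
    psi_R = np.radians(psi_R_deg)
    return abs(s(psi_R + np.radians(dpsi_deg)) - s(psi_R))

print(range_error(5.71, -5.71))   # point ~1500 m above the baseline: ~14 m
print(range_error(70.0, 68.0))    # distant point with small gamma: ~150 m
```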
The imaging process of a sensor adds a random noise signal, which can be limited but not avoided. In principle, this also affects the parameter estimation, because both localization and measurement are disturbed. Given a large number of measurements for the calibration, the signal noise can be compensated in a maximum likelihood estimation, as the redundancy is high. The stereo matching is also affected by the noisy image signal and causes a disturbed 3D reconstruction.

In the following analysis, we investigate the effects of an asynchronous recording of the stereo images during the observation of a dynamic cloud scene.
Despite frequent requests to an NTP service, we sometimes experience asynchronous system times on the local computers in the range of a few seconds. Consequently, the whole cloud scene shifts between the single shots and thus causes a displacement in the images, which leads to a biased or flawed disparity map. We investigate the effects of a cloud field displacement of Δ = ±15 m along the baseline (x-direction) and perpendicular to the baseline (y-direction) in a virtual sky imager setup together with a virtual cloud layer at 3 km, using the 3D rendering software Blender (Blender Foundation, 2016). The virtual sky imagers have identical internal camera geometries comparable to the real ones, and a relative pose of R_L = R_R = I and t = (300, 0, 0) m. Figure 13 shows the cross-sections of the respective reconstructions along the baseline (x-axis) and perpendicular to it (y-axis).

Evaluation of the 3D Reconstruction
The following two case studies are designed to evaluate our approach using observations from a lidar-ceilometer and a cloud radar, and to show its capability to capture the complex 3D shapes and dynamics of convective clouds.

After triangulation of the 3D point cloud, we create a cloud surface mesh using methods from the open-source Point Cloud Library (Point Cloud Library, 2016).

Analysis of the 3D Reconstruction of Stratocumulus-layer Clouds
We present two cases with stratocumulus clouds, which we use to evaluate our results by comparing them against observations from the lidar-ceilometer and the cloud radar.
Figure 14 shows the results from the 11th of August, 2014, at 14:12:00 UTC, right after a shower moved over the JOYCE site with a trailing stratocumulus layer that was also captured by the cloud radar. The mean cloud-base height from the reconstruction is 2881 m, while the lidar-ceilometer observations result in 2897 m. Note also that the cloud radar is, due to its measuring frequency in the microwave region, not as sensitive as the lidar-ceilometer or the camera to the visible outer cloud boundaries, with the consequence that especially the boundaries of a cloud might not be detectable by the radar, but by the camera or the lidar.
The second case was recorded on the 5th of August, 2014, and is interesting for two reasons: First, a typical boundary layer evolution was observed, starting with small to medium size cumulus convection along with a steady rise of the cloud base. Second, from early noon on, the cloud scenery shifted towards stratocumulus layers, resulting in less convection and thus again providing good conditions for a comparison against observations from the lidar-ceilometer and the cloud radar. However, differences between 100 and 150 m are noticeable: the lidar-ceilometer observations report a mean cloud-base height of 2922 m, compared with the height of 2766 m derived from the stereo reconstruction. See Sec. 5.3.3 for a discussion of the potential reasons for this discrepancy.

Analysis of the 3D Reconstruction of Cumulus Clouds
The 24th of July, 2014 showed strong convection with rapid cloud development and decline, providing excellent conditions to apply our dense stereo reconstruction and to capture the complex cloud shapes and dynamics. Figure 16 shows a cumulus mediocris forming approximately 3 km away from the stereo camera pair. One can observe the convective updraft and the rising cumulus turret. While reaching a height of about 4000 m, the turret enters the higher wind field, resulting in a skewed shape of the cloud due to the wind shear. Figure 17 shows a cross-section of the reconstructed 3D cloud surface at 12:07:00 UTC together with the cloud base measured with the lidar-ceilometer. Figure 18 shows another example of a smaller convective cloud, but also with a rather complex morphology, from 11:28:00 UTC to 11:32:00 UTC on the same day. One observes how the developing convective turret covers parts of the cloud, which results from its increasingly concave shape. The temporally closest cloud base height values from the lidar-ceilometer are between 12:14:38 UTC and 12:17:08 UTC and report 1487 m on average. Figure 19 shows a cross-section of the reconstruction, similar to Figure 17.

Discussion of the Experimental Results
Our cloud reconstructions show an overall good agreement with the cloud base height observations from the lidar-ceilometer and the cloud radar. Mean near-zenith cloud base heights for the stratocumulus cases are within 1 % (11th of August) and 5 % (5th of August) of the lidar-ceilometer mean values, and the stereo method is able to capture the geometric shapes of the cloud bases as in Figure 15. A possible explanation for the deviation occurring on the 5th of August is a shift between the computer system times, as described in Sec. 5.2. Although errors in the orientation parameters cannot be excluded, the relative orientation estimation shows a standard deviation of 0.04°, which is comparably accurate. Moreover, a bias in the computer system times could be observed at times, which supports this assumption.
Based on the observations from the nearby wind lidar, which reported a wind speed of about 5 m s⁻¹ in a direction almost collinear to the baseline, a time difference of about +3 seconds between the left and the right camera would lead to an effectively changed baseline and thus a scaled reconstruction. The results from the 24th of July show that the dense stereo method is able to almost fully reconstruct the visible outer shape of a convective cloud. In both cases, the concave and increasingly skewed shape of the cloud is nicely captured by the dense stereo method, as illustrated in the cross-sections in Figure 17 and Figure 19. Also, the cloud base is clearly visible and matches the lidar-ceilometer value of 1495 m quite well, considering that temporally close measurements are only available from 12:12:07 UTC to 12:17:08 UTC, ranging from 1448 m to 1585 m.

Conclusions
In this paper, we investigate the reconstruction of the 3D geometry of clouds from fisheye cameras using dense stereo approaches from photogrammetry. We present a complete approach for stereo cloud photogrammetry using hemispheric sky imagers. Our approach combines calibration, epipolar rectification, and block-based correspondence search for a dense fisheye stereo reconstruction of clouds. We show that cloud photogrammetry is able to compute the cloud envelope geometry and demonstrate the potential of such methods for the analysis of detailed cloud morphologies. By applying an epipolar rectification together with a dense (semi-)global stereo matching algorithm instead of using the original images, we are able to compute cloud shapes that are more complete and contiguous than reconstructions relying on regular feature-based methods. Once the cameras are calibrated, the method can be fully automated to deliver real-time information about the cloud scene.
Acknowledgements. This study has been performed within the framework of the High Definition Clouds and Precipitation for advancing Climate Prediction (HD(CP)²) research initiative.

Figure 1. We employ two hemispheric sky imagers (left) in a stereo setup with a baseline of 300 m to derive dense and detailed geometric information such as height and morphology of clouds within a range of about 5 km using the simultaneously recorded image pairs. The derived heights (right) of an exemplarily recorded cloud (middle) using the fisheye images, as shown in Figure 4 (c), of the two sky imagers.

Figure 3. The two hemispheric cameras are located at C_L and C_R with independent orientations in the world coordinate system Ω_W. The projections x_L and x_R of a 3D point P on the image hemisphere in each camera system Ω_L and Ω_R span, together with the baseline t, the epipolar plane and can be used to reconstruct P via triangulation.

Figure 9. Illustration of dense stereo matching using epipolar rectified images. The correspondence information is stored in a disparity map.