In February 2017, together with World Wildlife Fund, ArtScience Museum and Google Zoo, MediaMonks launched a large-scale mixed reality experience "Into The Wild" to help people in Singapore experience the devastating effects of deforestation and learn more about some of the world’s most endangered species and their habitats.

It ran on the world’s first Tango-enabled smartphone, the Lenovo Phab 2 Pro, and guided visitors through personalised digital adventures, which started with AR on the ground floor of the exhibition space before transitioning to full VR.

The end of the experience shifts back to AR, where users go up to the fourth floor for an experience that includes planting a virtual tree.

The experience transformed over 1,000 square meters of the Singapore ArtScience Museum into a virtual, interactive rainforest, making it the largest AR experience in the world and the second-ever AR museum experience developed using Google Tango.

And it wasn’t easy. From a technical perspective, we faced the massive challenge of how to accurately and smoothly map a virtual rainforest onto a physical and dynamic museum space, making sure the walls aligned with trees, corridors with the forest’s paths, and that we worked our way around the museum’s existing exhibitions and staging. 

So how did we do it?

To start with, if you’re augmenting the real world with virtual objects, it’s important that the device rendering your view (such as a smartphone, monitor, CAVE or head-mounted device) knows exactly where it is in the real world.

For this, a device needs to know its position and orientation in a three-dimensional space.

In the case of Tango, where the augmentation happens on a camera feed, the position and orientation of the rendering device need to be in real-world coordinates. Proper augmented reality is possible only if the position and orientation of a Tango device are reported accurately and quickly enough.

The fact that Google Tango does this for you is very cool, because it allows developers to anchor augmentations to real-world locations within their own virtual world. This is different from Snapchat-like AR, which, for example, adds bunny ears to your head.

With real world bound augmentations, you can potentially create shared AR experiences that revolve around and involve landmarks.

In this case, it allowed us to transform the ArtScience Museum into a lush virtual rainforest. From the user’s perspective, exploring the rainforest becomes as natural as exploring the museum itself, because every corridor or obstacle in the virtual world matches a corridor or obstacle in the real one.

Google Tango coordinates

We used Unity3D to create our virtual world. To begin, we assured our Unity developers that they wouldn’t have to worry about alignment and were free to design the virtual world using whichever position or orientation they liked, as long as it was true to scale.

Developers familiar with Geographic Information Systems (GIS) know there are a lot of coordinate reference systems out there, called "datums". Historically, a lot of institutes developed their own, but with the introduction of GPS, the US developed WGS84, which is the one most often used in commercial devices.

The great thing about the WGS84 earth-centred frame is that it is Cartesian, measured in meters, and uses the centre of the earth as its origin. This is important because, in a properly mapped environment, Google Tango can report its exact position and orientation on the globe in this frame.

Google Tango calls these ecef (earth-centred, earth-fixed) coordinates, so we’ll call them ecef too.
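For readers more used to latitude and longitude, the relationship to ecef can be sketched with the standard WGS84 ellipsoid formulas. This is a Python illustration only: Tango reports ecef directly, the function name `geodetic_to_ecef` is made up, and the Singapore latitude and longitude below are rough assumed values rather than surveyed ones.

```python
import math

# WGS84 ellipsoid constants (standard published values).
WGS84_A = 6378137.0                    # semi-major axis in metres
WGS84_F = 1.0 / 298.257223563          # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)   # first eccentricity squared


def geodetic_to_ecef(lat_deg, lon_deg, height_m=0.0):
    """Convert WGS84 latitude/longitude/height to ecef x, y, z in metres."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude.
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - WGS84_E2) + height_m) * math.sin(lat)
    return x, y, z


# Approximate (assumed) location in Singapore; not a surveyed value.
x, y, z = geodetic_to_ecef(1.2863, 103.8593, 0.0)
print(f"{x:.1f} {y:.1f} {z:.1f}")
```

The printed values should land in the same range as the Singapore ecef coordinate quoted later in this article, which is a handy sanity check when handling numbers this large.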

Determining the correct approach

The next step is to ensure our Unity world overlaps with the real world so we can achieve augmented reality. Two approaches to solve this come to mind.

  1. Transform (move+rotate) the Unity world to sit on top of the ecef coordinates of the museum.
  2. Transform (move+rotate) the ecef Tango device coordinates into Unity world coordinates.

The approaches are 80 percent the same, as in both cases you have to calculate the transformation from virtual (Unity) to real (ecef). The difference, however, lies in whether you transport the virtual world onto the real one (approach one), or whether you transport the real camera onto the virtual world (approach two).

To determine which approach is best, we had to see what these coordinates look like in a real use case. Here are some examples of how Unity coordinates look:

Object A: [10.000, 63.250, -11.990]

Object B: [-92.231, 33.253, -62.123]

By contrast, below are two examples of how ecef coordinates look:

Hilversum MediaMonks HQ 2nd floor near the elevator: [3899095.5399920414; 353426.87901774078; 5018270.6428830456]

Singapore ArtScience Museum in front of cashier shop: [-1527424.0031555446; 6190898.8392925877; 142221.77658961274]

Obviously, the ecef coordinates are quite large numbers. In fact, it’s clear that single-precision floating-point numbers (or floats) are going to have a lot of trouble with these.

Without going too much into detail about floats, it’s important to note that combining numbers around 10^-6 with numbers around 10^6 in the same calculation means that you lose a significant amount of accuracy.

In addition, there's no way of getting around the fact that a lot of 3D programming is done in the range of 10^-3 to 10^3 (think of transformation, model, view, or projection matrices).

To understand this further, I recommend watching this video as it demonstrates this point perfectly. It shows a fighter jet taking off from around the origin [0; 0; 0] with a camera following it and, as its own position gets larger and larger (as well as the camera’s position), the floating point calculations become less and less accurate.

Imagine then what the error would be if the coordinates of your camera are like the ecef coordinates shown above? You would be combining fine scaled rotation values with very large position values. The error in the result will be enormous.
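To make this concrete, here is a small Python sketch that rounds one of the ecef values above to single precision. Python's own floats are doubles, so a `struct` round-trip is used to emulate a 32-bit float; the helper name `to_float32` is made up for the example.

```python
import struct


def to_float32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Y component of the Singapore ecef coordinate quoted above.
ecef_y = 6190898.8392925877
stored = to_float32(ecef_y)
print(stored)                  # what a 32-bit float can actually hold
print(abs(stored - ecef_y))    # rounding error in metres

# A 10 cm step is lost entirely at this magnitude...
print(to_float32(ecef_y + 0.10) == stored)

# ...but survives when the value is close to the origin.
local = 12.8392925877
print(to_float32(local + 0.10) == to_float32(local))
```

At this magnitude neighbouring 32-bit floats are half a metre apart, so a device moving by a few centimetres may not register at all, which is far too coarse for aligning virtual trees with real walls.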

AR isn’t quite as fun if the augmentation isn’t done accurately.

Add to this the fact that Unity is hard-coded to work with floats (rather than double-precision floating-point numbers, or doubles) and that we can't afford any large errors in AR, and it becomes clear that approach one is unfeasible: the camera needs to stay relatively close to the origin to avoid precision errors.

So we proceed with approach two: transforming the ecef Tango device coordinates into Unity world coordinates.

Find the transformation

Transformations between coordinate systems in 3D graphics usually entail finding translation (positional), orientation and scaling values.

Each of these three concepts acts in 3D space, so each must be described for all three axes (x, y and z). This gives us nine values to find.

The number of unknowns is a hint of how many equations you need to solve for them. This is important when determining how many real-world coordinates we need to anchor our virtual world to.

Our initial idea was to create a transformation that would deal with all three concepts (translation, rotation, and scale). However, due to difficulties, and the fact we were able to design our virtual world true to scale, we decided to drop accounting for scale and focus on translation and rotation only.

This meant that, effectively, we only needed to find six unknowns.

Calculate the transformation matrix

At this point in solving this challenge, we're down to finding a transformation matrix that only accounts for location and rotation. Luckily, this problem has been solved a million times already by Computer Science students.

If you simply search Google, you’ll find countless examples of how to transport a rigid body from one coordinate system to another. This is one example that will get you 90 percent of the way there.

Finding a transformation matrix revolves around minimising the sum of squares error between two sets of data points. The following method is tailored for this problem since it deals with rotation and translation separately.

Conceptually, we approach this by picking a point in the real world, and we say that that point corresponds to another point in the virtual world.

Basically, the worlds are anchored to each other on that point. However, as you can imagine, choosing a single point as an anchor still allows the worlds to pivot around the anchor, in which case they will be misaligned most of the time.

Therefore, to place the virtual world squarely on top of the real world, you need more than one anchor. You need at least as many equations as the number of unknowns you’re trying to find.

Equations can be derived from known pairs (in this case, ecef and Unity coordinates). Each pair yields three equations, one per axis, so three pairs (or anchors) that do not lie on a single line are enough to find a full 3D transformation matrix.

The idea is that you choose N points (at least three) in the real world and find their ecef coordinates. Then you go into the virtual world and place a point at each corresponding virtual location (so N in total). For the museum project, we used 10 easy-to-find landmarks at the base of each pillar inside the museum.

Above: Two ecef coordinates we measured in the real world. 
Below: Their virtual world counterparts.

For this, we used a third-party library called Math.Net that allows us to do linear algebra with doubles. You only have to run this code once at the start of the program. 

The result is that we now have 10 ecef coordinates and 10 Unity space coordinates arranged in a circle, giving us 10 pairs of coordinates. The next step is to apply the steps discussed in this article and find a transformation matrix that allows us to transform a point from ecef space to Unity space.
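The code used in the project is not reproduced here, so the following is only a sketch of one common way to do this step: the SVD-based least-squares fit often called the Kabsch algorithm, which handles rotation and translation separately. It is written in Python with NumPy rather than C# with Math.Net, the anchors are synthetic, and the function name `fit_rigid_transform` is made up for illustration. It also assumes both point sets already share the same handedness.

```python
import numpy as np


def fit_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that dst ~= R @ src + t.

    src, dst: (N, 3) arrays of matching points, N >= 3 and not collinear.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    h = (src - src_centroid).T @ (dst - dst_centroid)
    u, _, vt = np.linalg.svd(h)
    # Guard against a reflection so that det(R) is +1.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = dst_centroid - rotation @ src_centroid
    return rotation, translation


# Synthetic check with made-up numbers: ten anchors on a circle in "Unity"
# space, pushed through a known rotation and a large ecef-sized offset.
angles = np.linspace(0.0, 2.0 * np.pi, 10, endpoint=False)
unity_pts = np.column_stack([20.0 * np.cos(angles),
                             np.linspace(0.0, 3.0, 10),
                             20.0 * np.sin(angles)])
theta = np.radians(37.0)
true_rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta), np.cos(theta), 0.0],
                     [0.0, 0.0, 1.0]])
offset = np.array([-1527424.0, 6190898.8, 142221.8])
ecef_pts = unity_pts @ true_rot.T + offset

# Fit ecef -> Unity in doubles, then map the ecef anchors back.
rot, trans = fit_rigid_transform(ecef_pts, unity_pts)
recovered = ecef_pts @ rot.T + trans
print(np.abs(recovered - unity_pts).max())
```

Centring both point sets before the SVD keeps the rotation estimate working on small numbers, and doing everything in 64-bit floats keeps the large ecef offset from swamping the result.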

We ran into a few problems while implementing this. For example, Unity uses a left-handed coordinate system, while ecef is right-handed. The article we referenced above also used row-major ordering, while the library used column-major ordering.

This changes how matrices are filled and transposed, and the order in which they are multiplied. We eventually overcame all of these problems through careful reasoning, and by not trying to take too many steps at the same time.
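Both pitfalls can be shown with tiny matrices. The sketch below is plain Python with made-up helper names (`matmul`, `transpose`, `det3`); it is not the project's code, and mirroring the Z axis is just one assumed way to switch handedness.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def det3(m):
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# A 90 degree rotation about Y, written for column vectors (v' = M v).
rotate_y_90 = [[0.0, 0.0, 1.0],
               [0.0, 1.0, 0.0],
               [-1.0, 0.0, 0.0]]
# Mirroring Z turns a right-handed frame into a left-handed one.
mirror_z = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, -1.0]]

print(det3(rotate_y_90))  # a rotation has determinant +1
print(det3(mirror_z))     # a handedness flip has determinant -1

# Column-vector convention: mirror first, then rotate.
column_result = matmul(matmul(rotate_y_90, mirror_z), [[1.0], [2.0], [3.0]])

# Row-vector convention: transpose each matrix and reverse the order.
row_result = matmul([[1.0, 2.0, 3.0]],
                    matmul(transpose(mirror_z), transpose(rotate_y_90)))

print(transpose(column_result) == row_result)
```

Because the handedness flip has determinant -1, no pure rotation can absorb it, so it has to be applied as its own explicit step on one side of the fit.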

Apply the transformation

Following the previous step, we have a transformation matrix we could call ecefTunity (or unityTecef, depending on how you calculated the matrix). Transforming a point from ecef space to Unity space then comes down to a single matrix multiplication.
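As a minimal sketch, assuming the matrix is a 4x4 homogeneous transform stored as rows and applied to column vectors, this step could look like the following in Python. The names `transform_point` and `unity_T_ecef` and all the numbers are illustrative, not taken from the project, and handedness is ignored here.

```python
def transform_point(matrix, point):
    """Apply a 4x4 homogeneous transform (column-vector convention) to a 3D point."""
    x, y, z = point
    return tuple(row[0] * x + row[1] * y + row[2] * z + row[3]
                 for row in matrix[:3])


# Made-up example: a 90 degree rotation about Z plus a translation that
# brings a museum-sized ecef neighbourhood close to the Unity origin.
unity_T_ecef = [
    [0.0, -1.0, 0.0, 6190898.0],
    [1.0, 0.0, 0.0, 1527424.0],
    [0.0, 0.0, 1.0, -142221.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Python floats are doubles, so the large ecef values keep their precision
# until the small Unity-space result has been computed.
device_ecef = (-1527421.5, 6190900.25, 142222.0)
unity_position = transform_point(unity_T_ecef, device_ecef)
print(unity_position)
```

The important design point is to do this multiplication in doubles and only convert to Unity's floats afterwards, once the numbers are small again.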

With this, we can complete the alignment. Since Tango reports the device's position in ecef coordinates, we can easily calculate the corresponding Unity coordinates. Effectively, we update the Unity camera using this transformation with every Tango update we receive.

What’s more, for every virtual tree planted, a real tree was planted in Rimbang Baling, one of the last pristine rainforests in Sumatra where the endangered Sumatran tiger lives. 5000 new trees were pledged in the project’s first month.

I hope by sharing this we can inspire the imagination of current and aspiring developers to build even more exciting AR/VR experiences that map to the real world. Go forth and conquer!