Thursday, August 1, 2013

Recipe Time: Matrix Transpose and Cross Product with Float32x4

This post has simple code recipes for transposing a 4x4 matrix and taking the cross product of two 3D vectors using methods on the Float32x4 class.

In this week's Dart SDK, new methods were added to the Float32x4 class that make it possible to transpose a matrix entirely with Float32x4 operations. I should note that this may not actually be the fastest choice: on my machine it is faster to use a Float32List.view on the Float32x4List and swap the floats directly. The following code snippet transposes the 4x4 matrix stored in a Float32x4List:
// Transposes the 4x4 matrix stored in [m] in place.
void transpose(Float32x4List m) {
  var m0 = m[0];
  var m1 = m[1];
  var m2 = m[2];
  var m3 = m[3];
  // Interleave the X/Y and Z/W lanes of each pair of rows.
  var t0 = m0.interleaveXY(m1);
  var t1 = m2.interleaveXY(m3);
  var t2 = m0.interleaveZW(m1);
  var t3 = m2.interleaveZW(m3);
  // Interleave the pairs to produce the transposed rows.
  m[0] = t0.interleaveXYPairs(t1);
  m[1] = t0.interleaveZWPairs(t1);
  m[2] = t2.interleaveXYPairs(t3);
  m[3] = t2.interleaveZWPairs(t3);
}
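
For comparison, here is a minimal sketch of the scalar alternative mentioned above: view the same buffer as a Float32List and swap elements across the diagonal. This helper is my own illustration, not an SDK method.

import 'dart:typed_data';

// Transposes the 4x4 matrix stored in [m] by viewing its buffer as
// 16 individual floats and swapping elements across the diagonal.
void transposeScalar(Float32x4List m) {
  var f = new Float32List.view(m.buffer, m.offsetInBytes, 16);
  for (int row = 0; row < 4; row++) {
    for (int col = row + 1; col < 4; col++) {
      var tmp = f[row * 4 + col];
      f[row * 4 + col] = f[col * 4 + row];
      f[col * 4 + row] = tmp;
    }
  }
}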
In last week's Dart SDK, the complete set of shuffle methods was added to the Float32x4 class. The following code snippet takes the cross product of two 3D vectors:

// Computes the cross product of the 3D vectors stored in the x, y, and z
// lanes of [a] and [b]. The w lanes cancel out, so the result's w lane is 0.0.
Float32x4 cross(Float32x4 a, Float32x4 b) {
  var t0 = a.yzxw;
  var t1 = b.zxyw;
  var l = t0 * t1;
  t0 = a.zxyw;
  t1 = b.yzxw;
  var r = t0 * t1;
  return l - r;
}
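
A quick usage sketch of my own (assuming Float32x4 is imported from dart:typed_data): crossing the x and y unit vectors should give the z unit vector.

import 'dart:typed_data';

void main() {
  var x = new Float32x4(1.0, 0.0, 0.0, 0.0);
  var y = new Float32x4(0.0, 1.0, 0.0, 0.0);
  var z = cross(x, y);
  print('${z.x}, ${z.y}, ${z.z}, ${z.w}');  // 0.0, 0.0, 1.0, 0.0
}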

Tuesday, July 30, 2013

SIMD on the Web July 2013 Update (ARM, JavaScript, and AVX-512)

Accessing the SIMD instruction sets available in both desktop and mobile CPUs can greatly speed up 3D graphics (and other) applications. Until recently, web programmers have been unable to access this part of the CPU, losing out on performance and battery savings. In the six months since I gave my talk, Bringing SIMD to the Web via Dart, a lot has happened. Read on for a complete status update on SIMD on the web, including the latest on ARM, JavaScript, and AVX-512 support.



A demonstration of the speed gain you get when using Float32x4 for 3D games can be seen in Web Languages and VMs: Fast Code is Always in Fashion. Now on to the status update:

1. Status of Dart implementation

Dart's implementation of the Float32x4, Uint32x4, and Float32x4List types is complete. The API may change slightly in the future but any changes will be minor and easy to adapt to.

2. Status of Dart acceleration

Dart fully accelerates Float32x4, Uint32x4, and Float32x4List types on IA32, X64, and ARM (with NEON) CPUs. Thanks to Zachary Anderson for the ARM implementation.

3. Status of dart2js support

The Dart to JavaScript compiler fully supports the Float32x4, Uint32x4, and Float32x4List types. When compiled to JavaScript, the types are implemented in software and do not give speedups. A flag will be introduced allowing your program to detect when SIMD will be slow. If your program is executing via JavaScript, or on an ARM CPU without NEON support (most current generation smartphones have NEON, but older ones may not), it is recommended that you use non-SIMD code paths.

4. What about JavaScript acceleration?

I've proposed these types for ECMAScript 7 and created a polyfill for those interested in what the API will look like. Time will tell how serious the ECMAScript committee is about SIMD and how quickly the JavaScript engines can add support for it.

5. What about AVX and AVX-512?

For those of you who don't follow the latest CPU instruction sets, AVX is the successor to SSE and has 256-bit wide registers (YMM). AVX-512 is a follow-up to AVX that adds 512-bit wide registers (ZMM) and doubles the number of register names available (32 instead of 16). Exciting stuff. AVX exists in the wild, and I plan on implementing Float32x8 later this year. AVX-512 was only just announced and no chips support it (yet); once AVX-512 becomes closer to reality, Dart will get a Float32x16 type.

Monday, July 29, 2013

Multiple render targets and deferred rendering with WebGL

Last week I added support to Spectre for rendering into multiple render targets using the WEBGL_draw_buffers extension. To demonstrate this new functionality, I've created the simplest possible deferred rendering setup. The example focuses just on the mechanics of rendering to multiple targets and performing a fullscreen pass, leaving fancy shader techniques for another post.

I render the scene into four textures, one for each of the red, green, blue, and alpha channels.


For this pass, I used the following fragment shader:

#extension GL_EXT_draw_buffers : require
precision mediump float;

/// The diffuse sampler.
uniform sampler2D diffuse;

/// The texture coordinate of the vertex.
varying vec2 texCoord;

void main()
{
  vec4 c = texture2D(diffuse, texCoord);
  gl_FragData[0] = vec4(c.r, 0.0, 0.0, 0.0);
  gl_FragData[1] = vec4(0.0, c.g, 0.0, 0.0);
  gl_FragData[2] = vec4(0.0, 0.0, c.b, 0.0);
  gl_FragData[3] = vec4(0.0, 0.0, 0.0, c.a);
}

Note that in order to access the gl_FragData variables you must include the following line in your fragment shader:

#extension GL_EXT_draw_buffers : require
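
The extension also has to be requested on the WebGL side before a shader using gl_FragData will work (Spectre presumably does this for you when it sets up the render target). Here is a minimal sketch, assuming you are driving WebGL directly from Dart; the canvas id and setup code are my own example.

import 'dart:html';

void main() {
  var canvas = querySelector('#backBuffer') as CanvasElement;
  var gl = canvas.getContext3d();
  // getExtension returns null when the extension is unsupported.
  var ext = gl.getExtension('WEBGL_draw_buffers');
  if (ext == null) {
    print('WEBGL_draw_buffers not supported; fall back to a single target.');
    return;
  }
  // 0x8824 is MAX_DRAW_BUFFERS_WEBGL from the extension specification.
  var maxDrawBuffers = gl.getParameter(0x8824);
  print('Can render into $maxDrawBuffers color targets at once.');
}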

The deferred rendering is done as a fullscreen post process pass. A fullscreen quad is rendered and the fragment shader reads from the red, green, blue, and alpha textures and merges them together for display.


For this pass, I used the following fragment shader:

precision mediump float;

varying vec2 samplePoint;
uniform sampler2D sourceR;
uniform sampler2D sourceG;
uniform sampler2D sourceB;
uniform sampler2D sourceA;

void main() {
  float r = texture2D(sourceR, samplePoint).r;
  float g = texture2D(sourceG, samplePoint).g;
  float b = texture2D(sourceB, samplePoint).b;
  float a = texture2D(sourceA, samplePoint).a;
  gl_FragColor = vec4(r, g, b, a);
}

Before rendering the scene pass, you must create a render target and configure it with four color targets. Spectre makes this easy:

Texture2D redColorBuffer;
Texture2D greenColorBuffer;
Texture2D blueColorBuffer;
Texture2D alphaColorBuffer;
Texture2D depthBuffer;
RenderTarget renderTarget;

int offscreenWidth = 1024;
int offscreenHeight = 1024;

// Create color buffers.
redColorBuffer = new Texture2D('redColorBuffer', graphicsDevice);
redColorBuffer.uploadPixelArray(offscreenWidth, offscreenHeight, null);
greenColorBuffer = new Texture2D('greenColorBuffer', graphicsDevice);
greenColorBuffer.uploadPixelArray(offscreenWidth, offscreenHeight, null);
blueColorBuffer = new Texture2D('blueColorBuffer', graphicsDevice);
blueColorBuffer.uploadPixelArray(offscreenWidth, offscreenHeight, null);
alphaColorBuffer = new Texture2D('alphaColorBuffer', graphicsDevice);
alphaColorBuffer.uploadPixelArray(offscreenWidth, offscreenHeight, null);
// Create depth buffer.
depthBuffer = new Texture2D('depthBuffer', graphicsDevice);
depthBuffer.pixelFormat = PixelFormat.Depth;
depthBuffer.pixelDataType = DataType.Uint32;
depthBuffer.uploadPixelArray(offscreenWidth, offscreenHeight, null);
// Create render target.
renderTarget = new RenderTarget('renderTarget', graphicsDevice);
// Use color buffers.
renderTarget.setColorTarget(0, redColorBuffer);
renderTarget.setColorTarget(1, greenColorBuffer);
renderTarget.setColorTarget(2, blueColorBuffer);
renderTarget.setColorTarget(3, alphaColorBuffer);
// Use depth buffer.
renderTarget.setDepthTarget(depthBuffer);
// Verify that it's renderable.
if (!renderTarget.isRenderable) {
  throw new UnsupportedError('Render target is not renderable: '
                             '${renderTarget.statusCode}');
}

The following code configures Spectre to render into the offscreen render targets:

// Set the render target to be the offscreen buffer.
graphicsContext.setRenderTarget(renderTarget);

// Set the viewport (the 2D area of the render target to render onto).
graphicsContext.setViewport(offscreenViewport);
// Clear it.
graphicsContext.clearColorBuffer(0.0, 0.0, 0.0, 0.0);
graphicsContext.clearDepthBuffer(1.0);

After rendering the scene into the offscreen render targets, the following code configures Spectre to render into the WebGL front buffer for display:

// Use the system provided render target by switching to
// RenderTarget.systemRenderTarget.
graphicsContext.setRenderTarget(RenderTarget.systemRenderTarget);

The final step is to render a fullscreen quad with the above shader.

Thursday, July 25, 2013

Using the new vector_math_lists library

The latest vector_math release (v 1.3.5) includes a new library for managing lists of vectors. Introducing:

import 'package:vector_math/vector_math_lists.dart';

The library introduces three lists (one for each of Vector2, Vector3, and Vector4) that have the following interface:

class VectorNList {
  VectorNList(int length, [int offset = 0, int stride = 0]);
  VectorNList.fromList(List<VectorN> list, [int offset = 0, int stride = 0]);
  VectorNList.view(Float32List buffer, [int offset = 0, int stride = 0]);

  /// Copy the element at [index] into [vector]. Zero allocations.
  void load(int index, VectorN vector);
  /// Replace the element at [index] with [vector]. Zero allocations.
  void store(int index, VectorN vector);

  /// Read the element at [index]. Allocates a new VectorN.
  VectorN operator[](int index);
  /// Replace the element at [index] with [vector]. Zero allocations.
  void operator[]=(int index, VectorN vector);
}

Under the hood each VectorNList is backed by a Float32List. If you need to, you can pass in your own storage by using the view constructor.

When constructing a vector list you can specify an offset and a stride. The offset is the index in the underlying Float32List at which the first vector begins. The stride is the number of floats between the start of one stored vector and the start of the next. By default the stride is the length of the vector, but you can specify a larger value.

The following are three examples showing how to use a vector list:

Example #1: Construct a Vector2 list of 10 elements, storing them offset by 1 in the Float32List.

/// Construct a new Vector2 list with 10 items. Index 0 is offset by 1 float.
Vector2List list = new Vector2List(10, 1);
// Store vector (1.0, 2.0) into index 0.
list[0] = new Vector2(1.0, 2.0);
// Verify that list[0] is the vector we just stored.
relativeTest(list[0].x, 1.0);
relativeTest(list[0].y, 2.0);
// Verify that the vector list starts at offset 1.
relativeTest(list.buffer[0], 0.0);  // unset
relativeTest(list.buffer[1], 1.0);
relativeTest(list.buffer[2], 2.0);
relativeTest(list.buffer[3], 0.0);  // unset

Example #2: Construct a Vector2 list view on top of buffer. The view starts offset by 1 and has a stride of 3, meaning there will be a gap of 1 float between each Vector2.

Float32List buffer = new Float32List(8);
Vector2List list = new Vector2List.view(buffer, 1, 3);
// The list length should be (8 - 1) ~/ 3 == 2.
expect(list.length, 2);
list[0] = new Vector2(1.0, 2.0);
list[1] = new Vector2(3.0, 4.0);
expect(buffer[0], 0.0);
expect(buffer[1], 1.0);
expect(buffer[2], 2.0);
expect(buffer[3], 0.0);
expect(buffer[4], 3.0);
expect(buffer[5], 4.0);
expect(buffer[6], 0.0);
expect(buffer[7], 0.0);

Example #3: Construct a Vector2 list from an existing list of Vector2. The interesting bit here is that the offset and stride specified in the fromList constructor control how the data is copied into the new backing Float32List.

List input = new List(3);
input[0] = new Vector2(1.0, 2.0);
input[1] = new Vector2(3.0, 4.0);
input[2] = new Vector2(5.0, 6.0);
Vector2List list = new Vector2List.fromList(input, 2, 5);
expect(list.buffer.length, 17);
expect(list.buffer[0], 0.0);
expect(list.buffer[1], 0.0);
expect(list.buffer[2], 1.0);
expect(list.buffer[3], 2.0);
expect(list.buffer[4], 0.0);
expect(list.buffer[5], 0.0);
expect(list.buffer[6], 0.0);
expect(list.buffer[7], 3.0);
expect(list.buffer[8], 4.0);
expect(list.buffer[9], 0.0);
expect(list.buffer[10], 0.0);
expect(list.buffer[11], 0.0);
expect(list.buffer[12], 5.0);
expect(list.buffer[13], 6.0);
expect(list.buffer[14], 0.0);
expect(list.buffer[15], 0.0);
expect(list.buffer[16], 0.0);

It is possible to use the lists without allocating any memory by using the load and store methods instead of the index operators.
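
For example, here is a rough sketch of a zero-allocation loop that scales every vector in a Vector2List in place by reusing a single temporary Vector2 (the scale call is just an illustration; any in-place Vector2 method works):

import 'package:vector_math/vector_math.dart';
import 'package:vector_math/vector_math_lists.dart';

void scaleAll(Vector2List list, double factor) {
  // One temporary vector, reused for every element: no per-iteration allocation.
  var temp = new Vector2.zero();
  for (int i = 0; i < list.length; i++) {
    list.load(i, temp);   // Copy element i into temp.
    temp.scale(factor);   // Modify temp in place.
    list.store(i, temp);  // Write temp back into element i.
  }
}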

Tuesday, July 23, 2013

Convergence on vector_math

The Dart game developer community has spoken on its preferred math library. The latest high-profile users of the vector_math library are Three.dart and Box2D.dart. They join the many games and libraries already using vector_math.

Thanks to Anders Forsell for finishing the port of Three.dart to vector_math.

Thanks to Laszlo Korte for finishing the port of Box2D.dart to vector_math.

Be sure to upgrade your pub dependency to the latest version:

dependencies:
  vector_math: '>= 1.3.5'

Switching back to Blogger

To make posting easier I've switched back to Blogger. I hope to restore the missing posts.

For now, take a look at how you can solve a 4x4 linear system using vector_math:

import 'package:vector_math/vector_math.dart';

void solve() {
  var A = new Matrix4(2.0, 12.0, 8.0, 8.0,
                      20.0, 24.0, 26.0, 4.0,
                      8.0, 4.0, 60.0, 12.0,
                      16.0, 16.0, 14.0, 64.0);
  var b = new Vector4(32.0, 64.0, 72.0, 8.0);
  var x = new Vector4.zero();

  // Solve A*x = b without computing the inverse of A.
  Matrix4.solve(A, x, b);
}
Thanks to Laszlo Korte for providing this feature!

Monday, November 19, 2012

OpenGL Mouse Picking

Many games require the user to click on an object in the view. For example, the user must click on an infantry unit before giving it attack orders. If the game is 3D, figuring out which game object was clicked requires many steps. Read on to learn how to go from the window coordinates of the mouse cursor to a world space ray that can be tested against your game objects.

Below is a diagram of the camera view. The camera captures a pyramid-like view of the world. I’m simplifying here, but any object within the camera’s pyramid and between the near and far planes will be rendered. Similarly, any object the user could click on will also be within this pyramid.


The camera transformation matrix is constructed from two transformations. The first is the view transformation, which specifies where the camera is located and the direction it is pointed in. The second is the projection transformation, which describes the view pyramid, specifically, how to transform coordinates inside the pyramid into the unit cube. The two most common projections are orthographic and perspective. After the projection is applied, any vertex within the camera’s view pyramid will be mapped between (-1,-1,-1) and (1,1,1), a cube with each edge 2 units long:


After coordinates are mapped to the unit cube they are mapped to the viewport, or window coordinates. Typically, the viewport is the output window of the application, defined as (0,0)...(width, height).

The final camera transformation is constructed with a simple matrix multiplication:

CameraMatrix = ProjectionMatrix*ViewMatrix;
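
As a concrete sketch using the current vector_math API (which uses Matrix4/Vector3 rather than the older mat4/vec3 names used below), the camera matrix could be built like this; the field of view, aspect ratio, eye position, and plane distances are made-up values:

import 'package:vector_math/vector_math.dart';

Matrix4 buildCameraMatrix() {
  // Perspective projection: 90 degree vertical field of view,
  // 16:9 aspect ratio, near plane at 1.0, far plane at 100.0.
  var projection = makePerspectiveMatrix(radians(90.0), 16.0 / 9.0, 1.0, 100.0);
  // View transformation: camera at (0, 2, 10) looking at the origin, +Y up.
  var view = makeViewMatrix(new Vector3(0.0, 2.0, 10.0),
                            new Vector3.zero(),
                            new Vector3(0.0, 1.0, 0.0));
  // CameraMatrix = ProjectionMatrix * ViewMatrix.
  return projection * view;
}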

So when a vertex is rendered it is transformed from model space into world space, then view space, then clip space, and finally into viewport space. Mouse picking goes the other direction: from viewport space to clip space, then view space, and finally into world space. Once you have the mouse coordinates in world space, they are ready to be intersected with your game objects.

Using Dart Vector Math, there are two functions you will need to go from mouse coordinates to a world space ray. The first is unproject and the second is pickRay. The unproject function goes from window space all the way to world space:

bool unproject(mat4 cameraMatrix,
               num viewportX, num viewportWidth,
               num viewportY, num viewportHeight,
               num pickX, num pickY, num pickZ,
               vec3 pickWorld);

The first parameter is the cameraMatrix defined above, followed by a description of the viewport. pickX, pickY, and pickZ are the mouse cursor coordinates. pickWorld is the output.

The natural question at this point is: how do you determine pickZ? The mouse cursor has no depth, so there is no Z component. If pickZ == 0.0 the point is on the near plane, and if pickZ == 1.0 it is on the far plane. Any value between 0.0 and 1.0 is linearly interpolated between the near and far planes.

The pickRay function takes care of all of this for you:

bool pickRay(mat4 cameraMatrix,
             num viewportX, num viewportWidth,
             num viewportY, num viewportHeight,
             num pickX, num pickY,
             vec3 rayNear, vec3 rayFar);

The input parameters are almost the same as unproject, but it doesn’t take a Z coordinate and has two outputs: rayNear and rayFar, which are the 3D world space points of the mouse cursor at the near and far planes respectively. Internally, pickRay calls unproject twice, first with pickZ == 0.0 and then with pickZ == 1.0.

Once you have rayNear and rayFar you have a line from the camera into the scene, intersecting the far plane where the mouse cursor points. You can now test your entire game scene against this ray, keeping the closest object hit.
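
To make that last step concrete, here is a rough sketch of a click handler that picks against a unit sphere at the origin. The 640x480 viewport, the Y flip, the vec3.zero() constructor, and the hand-rolled sphere test are all my own assumptions for illustration:

void onMouseClick(mat4 cameraMatrix, num mouseX, num mouseY) {
  var rayNear = new vec3.zero();
  var rayFar = new vec3.zero();
  // 640x480 viewport at (0, 0). Window coordinates have their origin in the
  // bottom-left corner, so the mouse Y coordinate is flipped.
  pickRay(cameraMatrix, 0, 640, 0, 480, mouseX, 480 - mouseY, rayNear, rayFar);
  if (hitsSphere(rayNear, rayFar, 0.0, 0.0, 0.0, 1.0)) {
    print('Clicked the unit sphere at the origin.');
  }
}

// True if the ray from rayNear through rayFar passes within [radius] of the
// point (cx, cy, cz). A hand-rolled closest-point test, not a library call.
bool hitsSphere(vec3 rayNear, vec3 rayFar,
                double cx, double cy, double cz, double radius) {
  // Ray direction (not normalized) and vector from the ray origin to the center.
  double dx = rayFar.x - rayNear.x, dy = rayFar.y - rayNear.y, dz = rayFar.z - rayNear.z;
  double mx = cx - rayNear.x, my = cy - rayNear.y, mz = cz - rayNear.z;
  // Parameter of the point on the ray closest to the sphere center.
  double t = (mx * dx + my * dy + mz * dz) / (dx * dx + dy * dy + dz * dz);
  // Squared distance from that closest point to the sphere center.
  double px = rayNear.x + dx * t, py = rayNear.y + dy * t, pz = rayNear.z + dz * t;
  double d2 = (cx - px) * (cx - px) + (cy - py) * (cy - py) + (cz - pz) * (cz - pz);
  return d2 <= radius * radius;
}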