Spamegg's personal page
I participated in 2025’s Google Summer of Code, to contribute to Cyfra.
Cyfra is a DSL (in Scala 3) and runtime to do general programming on the GPU. It compiles its DSL to SPIR-V assembly and runs it via Vulkan runtime on GPUs.
My major contributions (merged into dev
) can be found at:
My minor contributions (merged into main
):
Add fs2 integration to Cyfra, so that fs2 streams can be executed on the GPU via a pipeline.
Create a Cyfra interpreter (to run on the CPU) that can simulate Cyfra GPU programs for debugging and profiling.
I would like to create a GPU programming course suitable for beginners using Cyfra.
During the bonding period, in order to get familiar with Cyfra, I made some code refactors and additions:
During testing I was able to discover some bugs in the Cyfra compiler, which were fixed.
Much of these changes, especially involving GMem
, became legacy and were archived,
as the Cyfra runtime was redesigned from the ground up.
Once the coding period started, I worked on how to run an fs2
Stream
on Cyfra.
Fs2 has a Pipe
type:
type Pipe[F[_], -I, +O] = Stream[F, I] => Stream[F, O]
which can then be used on a Stream
with
.through
.
The analogy is obvious: a stream runs through a pipe, gets transformed to a new stream.
This is very similar to a .map
method, but instead of a function that transforms
individual elements, it works with a function that transforms the whole stream.
The idea is to create a sort of “Cyfra pipe”. So the computations happen on the Cyfra side, executed on the GPU. Then the stream goes back to fs2 / Scala land.
To get a better idea, look at this basic unit test:
test("fs2 through gPipeMap, just ints"):
val inSeq = (0 until 256).toSeq // just some numbers
val stream = Stream.emits(inSeq) // fs2 stream
val pipe = gPipeMap[Pure, Int32, Int](_ + 1) // Cyfra pipe, done on GPU
val result = stream.through(pipe).compile.toList // resulting fs2 stream
val expected = inSeq.map(_ + 1) // same, on Scala/CPU, for comparison
result
.zip(expected)
.foreach: (res, exp) =>
assert(res == exp, s"Expected $exp, got $res")
Initially I had a working version that used the older Cyfra runtime (now archived).
The older runtime’s GMem
was not polymorphic, it needed separate classes for each type.
Which meant implementing the pipe multiple times, for each data type. For example:
// using the now-legacy, older runtime
extension (stream: Stream[Pure, Int]) // not generic
def gPipeInt(fn: Int32 => Int32)(using GContext): Stream[Pure, Int] =
val gf: GFunction[Empty, Int32, Int32] = GFunction(fn)
stream
.chunkMin(256)
.flatMap: chunk =>
val gmem = IntMem(chunk.toArray) // non-polymorphic memory
val res = gmem.map(gf).asInstanceOf[IntMem].toArray
Stream.emits(res)
We would have to do one of these for every pair of Stream[F, O1] => Stream[F, O2]
that we wanted to have. It was obvious that the runtime needed to be redesigned.
The main challenge was the changing Cyfra Runtime API that was in development in parallel to my project, and adapting to these changes (as an API user). The runtime is even more complicated than my projects, so I won’t explain it much here.
Another challenge was the high conceptual difficulty of the problem. The solution required high level, abstract, imaginative thinking. Without some small concrete details to grab onto, I struggled a lot with this.
In particular:
java.nio.ByteBuffer
.I tried some approaches using Match types but could not make it work. Then I tried a typeclass approach.
After thinking a lot and getting stuck for some time, eventually this idea emerged:
trait Bridge[CyfraType <: Value: FromExpr: Tag, ScalaType: ClassTag]:
def toByteBuffer(inBuf: ByteBuffer, chunk: Chunk[ScalaType]): ByteBuffer
def fromByteBuffer(outBuf: ByteBuffer, arr: Array[ScalaType]): Array[ScalaType]
Here Chunk
refers to fs2.Chunk
,
a lower-level data structure that fs2.Stream
is built on.
This bridge lets us convert data back and forth between Cyfra and Scala types, so data can be passed around in buffers.
We create
given
instances
of these bridges to be used implicitly. That’s how Scala handles type classes.
For example, we have these few basic bridges between Cyfra and Scala types:
// left: Cyfra type, right: Scala type
given Bridge[Int32, Int]:
def toByteBuffer(inBuf: ByteBuffer, chunk: Chunk[Int]): ByteBuffer =
inBuf.asIntBuffer().put(chunk.toArray[Int]).flip()
inBuf
def fromByteBuffer(outBuf: ByteBuffer, arr: Array[Int]): Array[Int] =
outBuf.asIntBuffer().get(arr).flip()
arr
given Bridge[Float32, Float]:
// ...
given Bridge[Vec4[Float32], fRGBA]:
// ...
given Bridge[GBoolean, Boolean]:
// ...
These bridges allow us to take a chunk of data from an fs2 Stream
,
correctly measure the size we need to allocate for a ByteBuffer
,
and after the computation is done, put it back into an fs2 Stream
,
making sure everything type-checks correctly.
This approach is much more generic; if someone wants a pipe for a Stream[F, O]
,
all they have to implement is a given instance between the two appropriate types
(instead of implementing the whole pipe, like before in the legacy version).
Making these type bridges even more polymorphic / generic would be very useful.
The bridge should be able to work with more general data structures than just fs2.Chunk
.
After this big obstacle, running an fs2 stream through a Cyfra pipe still required a lot of work, but it was fairly straightforward. The type signature looks very scary, but it’s actually not that bad:
object GPipe:
def gPipeMap[
F[_],
C1 <: Value: FromExpr: Tag,
C2 <: Value: FromExpr: Tag,
S1: ClassTag,
S2: ClassTag
](f: C1 => C2)(
using cr: CyfraRuntime,
bridge1: Bridge[C1, S1],
bridge2: Bridge[C2, S2]
): Pipe[F, S1, S2]
There is a lot more involved, but the core part of the code looks a bit like this:
stream
.chunkMin(params.inSize)
.flatMap: chunk =>
bridge1.toByteBuffer(inBuf, chunk)
region.runUnsafe(
init = ProgramLayout(
in = GBuffer[C1](inBuf),
out = GBuffer[C2](outBuf)
),
onDone = layout => layout.out.read(outBuf)
)
val arr = new Array[S2](params.inSize)
val result = bridge2.fromByteBuffer(outBuf, arr)
Stream.emits(result)
As you can see, we are using the type bridges to put the data in / out of buffers.
We start with a Stream
on the fs2 side, and we end up with another Stream
.
The region
, ProgramLayout
and GBuffer
refer to parts of Cyfra’s API
for writing GPU programs, using the redesigned runtime.
Here we are still using the type bridges from before, but the discussion will focus more on how to stitch together multiple GPU programs in Cyfra.
In Scala land, the signature looks like this:
object GPipe:
// ...
def gPipeFilter[
F[_],
C <: Value: Tag: FromExpr,
S: ClassTag
](predicate: C => GBoolean)(
using cr: CyfraRuntime,
bridge: Bridge[C, S]
): Pipe[F, S, S] =
(stream: Stream[F, S]) => ???
Implementing a filter method on the GPU, in parallel, is quite tricky. Thankfully there is good literature on the subject (although only in CUDA C/C++): Parallel prefix sum
For example, let’s say we want to filter even numbers:
List(1, 2, 4, 3, 3, 1, 4, 8, 2, 5, 7) // filters to:
List(., 2, 4, ., ., ., 4, 8, 2, ., .) // compacts to:
List(2, 4, 4, 8, 2)
On the CPU/JVM we can do this easily, but on the GPU it’s not so simple. The GPU works on large fixed blocks in parallel, not sequentially.
The literature guides us to the following approach:
List(1, 2, 4, 3, 3, 1, 4, 8, 2, 5, 7) // maps to:
List(0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0)
List(1, 2, 4, 3, 3, 1, 4, 8, 2, 5, 7) // maps to:
List(0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0) // scans to:
List(0, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5) // these are inclusive prefix sums
List(1, 2, 4, 3, 3, 1, 4, 8, 2, 5, 7) // original
// X X X X X // filtered elements
List(0, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5) // (inclusive) prefix sum
List(1, 2, 4, 3, 3, 1, 4, 8, 2, 5, 7) // original
// X X X X X // filtered elements
List(0, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5) // prefix sums (inclusive),
// 0 1 2 3 4 // minus 1
List(2, 4, 4, 8, 2)
// 0 1 2 3 4 the indices equal: prefix sums - 1
The initial portion is just like the map from before.
We apply predicate
to Cyfra values; but instead of GBoolean
we’ll return Int32
.
This is what Cyfra’s GProgram
API looks like; it needs definitions of memory layouts,
the input and output buffers, their sizes as parameters, the size of a
workgroup
(static or dynamic), and so on.
Then we provide the code in Cyfra land that is executed for each invocation in parallel:
val predicateProgram = GProgram[Params, Layout](
layout = params =>
Layout(
in = GBuffer[C](params.inSize), // collection to be filtered
out = GBuffer[Int32](params.inSize) // predicate results, as 0/1
),
dispatch = (layout, params) => GProgram.StaticDispatch((params.inSize, 1, 1)),
): layout =>
val invocId = GIO.invocationId // each "GPU thread" / array index
val element = GIO.read[C](layout.in, invocId) // read input at that index
val result = when(predicate(element))(1: Int32).otherwise(0)
GIO.write[Int32](layout.out, invocId, result) // write result to output buffer at index
Here when
and .otherwise
are the if/else expressions of Cyfra’s DSL.
You can read more on that in the Interpreter project below.
Then this GPU program is handed to an execution handler to be performed.
This is one of the hardest parts of the project! It consists of two parts, both of which are recursive: upsweep and downsweep. I will illustrate with a small example of an array with 8 elements. For simplicity, each array position holds a value of 1:
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
value | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Since n
and n-1
phases.
The final result of the prefix sum should be: 1, 2, 3, 4, 5, 6, 7, 8
.
We do some additions on subintervals recursively. The mid point of an interval gets added to its end point:
interval size / index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
phase 1 | 🡦 | 🡣 | 🡦 | 🡣 | 🡦 | 🡣 | 🡦 | 🡣 |
4 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 |
phase 2 | 🡦 | 🡣 | 🡦 | 🡣 | ||||
8 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 4 |
phase 3 | 🡦 | 🡣 | ||||||
result | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 8 |
Notice the number of additions in each phase follows the pattern
The GPU program looks something like this:
val upsweep = GProgram[Params, ScanLayout](
layout = params =>
ScanLayout(
ints = GBuffer[Int32](params.inSize),
intervalSize = GUniform(ScanArgs(params.intervalSize)),
),
dispatch = (layout, params) =>
GProgram.StaticDispatch((params.inSize / params.intervalSize, 1, 1)),
): layout =>
val ScanArgs(size) = layout.intervalSize.read
val invocId = GIO.invocationId
val root = invocId * size
val mid = root + (size / 2) - 1
val end = root + size - 1
val oldValue = GIO.read[Int32](layout.ints, end)
val addValue = GIO.read[Int32](layout.ints, mid)
val newValue = oldValue + addValue
GIO.write[Int32](layout.ints, end, newValue)
Notice the dispatch size (which controls the number of invocations) is changed by a factor of the interval size (as in the table). The end point value is added to the mid point value, and the sum is written back.
One new thing here is the GUniform
type: this represents a small part of a computation
that does not change or depend on invocation (so it is “uniform” for all invocations).
So they are great for passing parameters and using global configuration.
The GPU can use its memory for such values to be accessed easily and fast.
Vulkan provides (read-only) uniform buffers for this:
Vulkan Buffers
We use this to have access to the interval size during the program,
so we can calculate the correct addresses / indices in the target (prefix sum) array.
Downsweep starts from the result of upsweep. Interval sizes go back in reverse. The end-point of an interval gets added to the mid point of the interval next to it:
interval size / index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
4 | 1 | 2 | 1 | 4 | 1 | 2 | 1 | 8 |
phase 1 | 🡦 | 🡣 | ||||||
2 | 1 | 2 | 1 | 4 | 1 | 6 | 1 | 8 |
phase 2 | 🡦 | 🡣 | 🡦 | 🡣 | 🡦 | 🡣 | ||
result | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
For example, in phase 1, the intervals are 1, 2, 1, 4
and 1, 2, 1, 8
.
The end point of the first, 4, gets added to the mid point of the second, 2, for total 6.
The numbers of additions in phases follow the pattern
The GPU program looks similar, with the logic slightly adjusted:
val downsweep = GProgram[Params, ScanLayout](
layout = params =>
ScanLayout(
ints = GBuffer[Int32](params.inSize),
intervalSize = GUniform(ScanArgs(params.intervalSize)),
),
dispatch = (layout, params) =>
GProgram.StaticDispatch((params.inSize / params.intervalSize, 1, 1)),
): layout =>
val ScanArgs(size) = layout.intervalSize.read
val invocId = GIO.invocationId
val end = invocId * size - 1
val mid = end + (size / 2)
val oldValue = GIO.read[Int32](layout.ints, mid)
val addValue = when(end > 0)(GIO.read[Int32](layout.ints, end)).otherwise(0)
val newValue = oldValue + addValue
GIO.write[Int32](layout.ints, mid, newValue)
Here we need to stitch many program layouts together:
C
, to which we applied the predicate,We also need to pass the total prefix sum as a parameter, to determine the size of the compacted stream in the layout.
TODO
I would like to finish off the filter method, and implement more of fs2’s streaming API on Cyfra, adding more methods. This can be very tricky as we saw with the filter example.
Currently we are mostly dealing with “pure” fs2 streams without side effects.
In the future we should find a way to do the side effects of the effect type F[_]
.
For example, how can we print some text on the GPU? Is it even possible?
Some side effects might be impossible, but a partial solution might work.
We would like to generalize Cyfra pipes to any sort of data, not just fs2 Stream
s.
That way Cyfra can be used with many types of streaming data.
Getting good performance will be tricky; we need to minimize the CPU<->GPU back and forth. To chain multiple stream operations, we need to keep the data on the GPU. Cyfra’s new runtime API has the right building blocks to facilitate this.
My other project is to make an interpreter for Cyfra that can run on the CPU and simulate GPU computations.
Since making an interpreter is a well-understood problem, and the GPU was not involved here, this project was easier and went more smoothly.
It evolved gradually in stages:
Expression.scala
)gio/GIO.scala
)Cyfra’s DSL has a lot of Expression
types!
def simOne(e: Expression[?]) = e match
case CurrentElem(tid: Int) => ???
case AggregateElem(tid: Int) => ???
case Negate(a) => ???
case Sum(a, b) => ???
case Diff(a, b) => ???
case Mul(a, b) => ???
case Div(a, b) => ???
case Mod(a, b) => ???
case ScalarProd(a, b) => ???
case DotProd(a, b) => ???
case BitwiseAnd(a, b) => ???
case BitwiseOr(a, b) => ???
case BitwiseXor(a, b) => ???
case BitwiseNot(a) => ???
case ShiftLeft(a, by) => ???
case ShiftRight(a, by) => ???
case GreaterThan(a, b) => ???
case LessThan(a, b) => ???
case GreaterThanEqual(a, b) => ???
case LessThanEqual(a, b) => ???
case Equal(a, b) => ???
case And(a, b) => ???
case Or(a, b) => ???
case Not(a) => ???
case ExtractScalar(a, i) => ???
case ToFloat32(a) => ???
case ToInt32(a) => ???
case ToUInt32(a) => ???
case ConstFloat32(value) => ???
case ConstInt32(value) => ???
case ConstUInt32(value) => ???
case ConstGB(value) => ???
case ComposeVec2(a, b) => ???
case ComposeVec3(a, b, c) => ???
case ComposeVec4(a, b, c, d) => ???
case value: FloatType => ???
case value: IntType => ???
case value: UIntType => ???
case GBoolean(source) => ???
case Vec2(tree) => ???
case Vec3(tree) => ???
case Vec4(tree) => ???
case ReadBuffer(buffer, index) => ???
case ReadUniform(uniform) => ???
case WhenExpr(when, thenCode, otherConds, otherCaseCodes, otherwise) => ???
case ExtFunctionCall(fn, args) => ???
case FunctionCall(fn, body, args) => ???
case InvocationId => ???
case Pass(value) => ???
case Dynamic(source) => ???
case e: GArrayElem[?] => ???
case e: FoldSeq[?, ?] => ???
case e: ComposeStruct[?] => ???
case e: GetField[?, ?] => ???
Most of these are self-explanatory: doing basic arithmetic, logic, vector algebra etc.
At the end, they turn into Float | Int | Boolean
or
Vector[Float] | Vector[Int] | Vector[Boolean]
.
Let’s call this type Result
.
ReadBuffer
refers to each invocation on the GPU reading a different buffer address,
whereas ReadUniform
refers to the “uniform” section of GPU memory
that does not depend on the invocation id.
Since we don’t have access to the GPU here, we’ll fake the buffers with some arrays:
case class SimData(
bufMap: Map[GBuffer[?], Array[Result]],
uniMap: Map[GUniform[?], Result]
)
The most interesting is WhenExpr
, which is like an if / else if / else
chain;
this is where some invocations will execute a branch, while others will remain idle.
The naive approach is to simply follow the recursive structure of the DSL:
case Sum(a, b) => simOne(a) + simOne(b)
This runs into the obvious problem of stack overflow due to recursion.
The typical way to overcome stack overflow in cases like this is to rewrite it
using tail recursion.
This can be difficult to do when the traversed structure isn’t a simple list.
In our case we have an AST.
Each expression on the tree has an id number called treeid
.
One common approach is to turn the tree into a list, so that it is topologically sorted. This way, when an expression appears somewhere in the list, every sub-expression that it needs will already have been calculated, because they are earlier in the list:
List(..., a, b, ..., Sum(a, b), ...) // a,b are guaranteed to appear earlier than the Sum
So we can evaluate the list in order, cache earlier results in a map, then look them up. Since it’s a simple list, it can be made tail-recursive very easily!
@annotation.tailrec
def simIterate(exprs: List[Expression[?]], cache: Map[TreeId, Result]): Result
But what about the topological sort? It’s a non-trivial algorithm.
Thankfully, I did not have to implement that from scratch!
Cyfra’s compiler
already has a topological sorter called buildBlock
:
def buildBlock(tree: Expression[?], providedExprIds: Set[Int] = Set.empty): List[Expression[?]]
It consumes an Expression
and returns a topo-sorted list of its entire AST.
Normally it is used to compile Cyfra to SPIR-V, but now it can pull double duty!
Then simOne
can use it with simIterate
:
def simOne(e: Expression[?]) = simIterate(buildBlock(e), Map())
In contrast, there are very few GIO[T]
types:
case Pure(value) => ???
case WriteBuffer(buffer, index, value) => ???
case WriteUniform(uniform, value) => ???
case FlatMap(gio, next) => ???
case Repeat(n, f) => ???
In fact, FlatMap
and Repeat
just sequence the other operations,
and Pure
just simulates a side-effect-free value.
So the only important ones here are the two write operations.
This is mainly because, the monad is supposed to model the side effecting operations (like writing data). You might be thinking that reading is also a side effect, but here by side effect we mean “altering the state of the outside world”, in other words, writing.
So we do the side effect and write data to the (fake) buffer.
def interpretWriteBuffer(gio: WriteBuffer[?], data: SimData): (Result, SimData) = gio match
case WriteBuffer(buffer, index, value) =>
val index = Simulate.sim(index) // get the write index
val writeVal = Simulate.sim(value, indexSc) // get the value to be written
val newData = data.write(buffer, index, writeVal) // write value to (fake) buffer
(valueToWrite, newData) // return value and updated data
WriteUniform
is similar.
Now we need to mimic the GPU’s behavior of progressing hundreds of invocations together along a computation. Going back to our sum example, you can think of earlier sub-expressions evaluating to different results on different invocations (because they will read values from different buffer addresses):
Expr | invoc0 | invoc1 | invoc2 | … |
---|---|---|---|---|
a |
2 | -3 | 7 | … |
b |
1 | -2 | 5 | … |
Sum(a, b) |
3 | -5 | 12 | … |
So we need a cache of previous results for each invocation. This naturally lends itself to a map of maps:
type Cache = Map[TreeId, Result]
type Records = Map[InvocId, Cache]
So now we need to look up results for each invocation, use them to compute, then record the new results for each invocation, and keep moving along (the code here is a bit simplified for demonstration purposes):
def simOne(e: Expression[?], records: Records): Records = e match
// all the expression cases... here's one example
case Sum(a, b) =>
records.map: (invocId, cache) =>
val aResult = cache(a.treeid) // look up previous results, per invocation
val bResult = cache(b.treeid)
val result = aResult + bResult // sum the results
invocId -> cache.updated(e.treeid, result)
// ...
@annotation.tailrec
def simIterate(blocks: List[Expression[?]], records: Records): Records = blocks match
case head :: next =>
val newRecords = simOne(head, records)
simIterate(next, newRecords)
case Nil => records
Very similar, with a little redesign. Here we are starting to keep track of writes, which will be explained later below (again, the code here is a bit simplified for demonstration purposes):
def interpretWriteBuffer(gio: WriteBuffer[?], records: Records, data: SimData): (Records, SimData) = gio match
case WriteBuffer(buffer, index, value) =>
val indices = Simulate.sim(index, records) // get index to write for each invoc
val values = Simulate.sim(value, records) // get value to write for each invoc
val newData = data.write(buffer, indices, values) // write all data
val writes = indices.map: (invocId, index) => // track writes
invocId -> (buffer, index, values(invocId))
val newRecords = records.addWrites(writes) // add writes to records
(newRecords, newData)
It is useful for debugging and profiling purposes to track reads and writes.
For this, we will have to do some redesigning.
We have to expand our Records
to cache not just Result
s but reads/writes as well:
enum Read:
case ReadBuf(id: TreeId, buffer: GBuffer[?], index: Int, value: Result)
case ReadUni(id: TreeId, uniform: GUniform[?], value: Result)
enum Write:
case WriteBuf(buffer: GBuffer[?], index: Int, value: Result)
case WriteUni(uni: GUniform[?], value: Result)
// each invocation has its own Record instance
case class Record(cache: Cache, writes: List[Write], reads: List[Read])
type Records = Map[InvocId, Record] // used to be just Cache, now full Record
Along with our simulation data, the stuff we have to keep track of is growing.
For technical reasons, we also need to track the Results
of the last expression.
So let’s collect them in one place and pass it around in the code:
type Results = Map[InvocId, Result] // for each invoc, the result of last expr evaluated
case class SimContext(results: Results, records: Records, data: SimData)
We update the code we had before accordingly.
Now we consume and return SimContext
instead.
For example (again, simplified):
def simOne(e: Expression[?], sc: SimContext): SimContext = e match
// ... other cases
case ReadBuffer(buffer, index) => simReadBuffer(e, sc) // reads will be tracked here
case ReadUniform(uniform) => simReadUniform(e, sc) // reads will be tracked here
// ... other cases
@annotation.tailrec
def simIterate(blocks: List[Expression[?]], sc: SimContext): SimContext = blocks match
case head :: next =>
val newSc = simOne(head, sc) // reads tracked here if needed
simIterate(next, newSc)
case Nil => sc
Similarly for the interpreter and writes.
Now we want to check if reads / writes of many invocations happen on contiguous addresses. This type of analysis can be very useful for performance profiling. If invocations read/write contiguously on the GPU memory, the performance is higher. If they are jumping over a bunch of memory addresses, performance will be lower.
We need to track some read / write profiles. So, a bit more redesign.
Add one more field to our SimContext
:
case class SimContext(results: Results, records: Records, data: SimData, profiles: List[Coalesce])
The addresses are contiguous if, when ordered from smallest to largest, they increase by 1 at each step:
List(23, 24, 25, 26, 27)
Another way to say this is: the number of elements equals the max minus the min plus 1:
List(23, 24, 25, 26, 27) // 5 elements, 27 - 23 + 1 = 5
If they jumped over an address, then it’s not the case:
// jump over 26
List(23, 24, 25, 27, 28) // 5 elements, 28 - 23 + 1 = 6 != 5
But is this logic correct? We are assuming that the addresses are distinct. If two invocations try to write to the same address, like a race condition, then this could give us a false positive:
// 25 listed twice, but also jump over 26
List(23, 24, 25, 25, 27) // 5 elements, 27 - 23 + 1 = 5
So we also need to check for race conditions: addresses should be distinct for writes. Reading from the same address should be OK. Then we implement this logic to check addresses for races and coalescence:
enum Profile:
case ReadProfile(id: TreeId, addresses: Seq[Int])
case WriteProfile(buffer: GBuffer[?], addresses: Seq[Int])
enum Coalesce:
case RaceCondition(profile: Profile)
case Coalesced(startAddress: Int, endAddress: Int, profile: Profile)
case NotCoalesced(profile: Profile)
object Coalesce:
def apply(addresses: Seq[Int], profile: Profile): Coalesce =
val size = addresses.length
val distinct = addresses.distinct.length == size
val (start, end) = (addresses.min, addresses.max)
val coalesced = end - start + 1 == size
profile match
case WriteProfile(_, _) =>
if !distinct then RaceCondition(profile)
else if coalesced then Coalesced(start, end, profile)
else NotCoalesced(profile)
case ReadProfile(_, _, _) =>
if coalesced && distinct then Coalesced(start, end, profile)
else NotCoalesced(profile)
Then we add some logic to both the simulator and the interpreter so that they can create and add instances of these profiles. For example:
def simReadBuffer(e: ReadBuffer[?], sc: SimContext): SimContext =
val SimContext(results, records, data, profs) = sc // pattern matching to de-structure
e match
case ReadBuffer(buffer, index) =>
// get read addresses, read the values, record the read operations
val indices = records.view.mapValues(_.cache(index.tree.treeid)).toMap
val readValues = indices.view.mapValues(i => data.lookup(buffer, i)).toMap
val newRecords = records.map: (invocId, record) =>
val read = ReadBuf(e.treeid, buffer, indices(invocId), readValues(invocId))
invocId -> record.addRead(read)
// check if the read addresses coalesced or not
val addresses = indices.values.toSeq
val profile = ReadProfile(e.treeid, addresses)
val coalesce = Coalesce(addresses, profile)
SimContext(readValues, newRecords, data, coalesce :: profs) // return new context
Similar for writes and uniforms.
Recall the if / else if / else expressions of the DSL:
case class WhenExpr[T <: Value: Tag](
when: GBoolean, // if
thenCode: Scope[T], // then
otherConds: List[Scope[GBoolean]], // list of else if's
otherCaseCodes: List[Scope[T]], // list of then's
otherwise: Scope[T], // else
) extends Expression[T]
Invocations can evaluate when
differently from each other.
Those that evaluate true
will follow that branch and evaluate thenCode
,
but others will have to idle and wait.
They perform a “noop” in that case.
Then the waiting invocations will try the next “else if” branch, and so on.
Here’s a conceptual example with 4 invocations and 5 logic branches.
Invocation 1 enters the first branch immediately, the others wait.
Then invocations 0 and 2 evaluate true
in the first else if
branch.
Invocation 3 never evaluates true
,
it keeps waiting all the way until the final else
(here named otherwise
):
inv 0 | inv 1 | inv 2 | inv 3 | |
---|---|---|---|---|
when | noop | enter | noop | noop |
else1 | enter | noop | enter | noop |
else2 | noop | noop | noop | noop |
else3 | noop | noop | noop | noop |
otherwise | noop | noop | noop | enter |
It is helpful to track the periods of idleness, which expression it happens on,
for which invocation, and for how long (and redesign our Record
to include this):
case class Record(cache: Cache, writes: List[Write], reads: List[Read], idles: List[Idle])
case class Idle(treeid: TreeId, invocId: InvocId, length: Int)
We add the necessary logic to the simulator, where WhenExpr
is simulated.
The code is quite complicated so I’ll skip it here.
There are some missing Expression
types, such as external function calls,
GSeq
s, and so on, which are not simulated properly yet.
The interpreter treats invocations as one linear sequence; in the future it should accomodate more general concepts such as workgroups and dimensions.
I’d like to generalize the interpreter to GProgram
s that build on and go beyond GIO
s.
The biggest struggle was the unbearable summer heat.
Putting that aside, the biggest issue was the unknown nature of the project. A lot of things had to be figured out and required creativity. Some of this was very abstract and general; I struggled not knowing what to do. My mentor was very helpful in this regard (thank you so much again!). He gave me small tasks to get me started and ideas to get me unstuck.
I also struggled with the time pressure and the deadline. Not because I was behind in my work, but from fear of failure, due to the unknown factors of the problems mentioned above. It was very stressful and I was very anxious. I realized I worried needlessly, and I need to learn not to worry so much. Paradoxically, I ended up being ahead of schedule 😆