\[ \lVert (x_1, x_2, \ldots, x_n)^T \rVert_1 = |x_1| + |x_2| + \ldots + |x_n| \]

The solution was actually quite straightforward: the isosurfaces of the \(L_1\) norm in \(\mathbb{R}^n\) look something like this, with each color corresponding to a particular isosurface:

I realized that to uniformly sample this red surface, I could sample \(n\) iid exponential random variables \( \{X_i \sim \text{Exp}(\lambda =1)\}_{i = 1 \ldots n} \) and then normalize the resulting vector \( X = (X_1, \ldots, X_n)^T \) using its \(L_1\) norm. This works because the pdf of \(X\) is

\[\begin{aligned} f_X(x) &= f_{X_1}(x_1) f_{X_2}(x_2) \ldots f_{X_n}(x_n) \\ &= e^{-x_1} \ldots e^{-x_n} \\ &= e^{-(x_1 + x_2 + \ldots + x_n)} \\ &= e^{-\lVert x \rVert_1}\end{aligned}\]

This means the probability of sampling a given point depends only on its \(L_1\) norm, so re-scaling all vectors to have unit norm yields a uniform sampling on the unit norm surface. This looks something like this:
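A minimal numpy sketch of this procedure (the function name is my own):

```python
import numpy as np

def sample_l1_surface(n, n_samples, rng=None):
    """Uniformly sample the positive face of the unit L1 sphere in R^n."""
    rng = np.random.default_rng() if rng is None else rng
    # n iid Exp(1) draws per sample: the joint pdf depends only on the L1 norm
    x = rng.exponential(scale=1.0, size=(n_samples, n))
    # rescale every vector to unit L1 norm
    return x / np.abs(x).sum(axis=1, keepdims=True)
```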

This got me thinking: can this be used to sample arbitrary surfaces? Essentially the same technique can be used to uniformly sample the n-sphere \(S^n\) embedded in \(\mathbb{R}^{n+1}\) by instead using the \(L_2\) norm and sampling from a Normal random variable (you use the squared \(L_2\) norm but it still uses the symmetry of the exponential sampling). The main difference is that the \(L_1\) sampling was done for strictly positive values while this \(L_2\) norm sampling can be done for both positive and negative values.
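The same idea for the sphere, sketched with Gaussians in place of exponentials:

```python
import numpy as np

def sample_sphere(n, n_samples, rng=None):
    """Uniformly sample S^n embedded in R^(n+1) by normalizing Gaussian vectors."""
    rng = np.random.default_rng() if rng is None else rng
    # the joint pdf is a function of the squared L2 norm alone
    x = rng.standard_normal(size=(n_samples, n + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```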

First we focus on sampling vectors in all four quadrants.

Generalizing slightly, we could let our surface we want to sample be a Quadric, a surface defined as the solutions of the quadratic form

\[ R = y^T A y + b^T y\]

where \( A \) is a real symmetric matrix, \(b\) is a vector, and \(R\) is some positive scalar. Then, we want our joint random variable to have the pdf

\[ f_Y(y) = \frac1{Z} \exp(-(y^T A y + b^T y)) \]

Where \(Z \) is our normalization constant to make \(f_Y\) a pdf.

We can compute \(Z\) using the eigendecomposition of \(A\), \(A = Q \Lambda Q^T \) where \( \Lambda \) is a diagonal matrix with entries \( \lambda_i \) and \(Q\) is an orthogonal matrix ( \( Q^T Q = Q Q^T = I_{n\times n} \) ).

We calculate

\[\begin{aligned}Z = \int_{\mathbb{R}^n} f_Y(y) dy &= \int_{\mathbb{R}^n} \exp(-y^T A y - b^T y)dy \\ &= \int_{\mathbb{R}^n} \exp(-y^T Q \Lambda Q^T y - b^T Q Q^T y)dy \\&= \int_{\mathbb{R}^n} \exp(-(Q^T y)^T \Lambda (Q^T y) - (Q^T b)^T (Q^T y))dy \\ &= \int_{\mathbb{R}^n} \exp(-x^T \Lambda x - (Q^T b)^T x) dx \\ &= \int_{\mathbb{R}^n} \exp(\sum_{i=1}^n -x_i^2 \lambda_i - (Q^T b)_i x_i) dx \\ &= \int_{\mathbb{R}^n} \prod_{i=1}^n \exp(-x_i^2 \lambda_i - (Q^T b)_i x_i)dx \end{aligned}\]

Here we reach a point where we'd like to exchange our product sign and our integral using Fubini's theorem, but this integral may actually diverge: if any of the eigenvalues \( \lambda_i \) is not strictly positive, our quadric surface has unbounded area. Thus we are forced to impose the constraint that all the \(\lambda_i > 0\). Applying Fubini's theorem, we get

\[\begin{aligned}Z &= \prod_{i=1}^n \int_{\mathbb{R}} \exp(-x^2 \lambda_i - (Q^T b)_i x) dx \\&= \prod_{i=1}^n \sqrt{\frac{\pi}{\lambda_i}} \exp(\frac{(Q^T b)_i^2}{4\lambda_i}) \\&= \sqrt{\pi^n \det(A)^{-1}} \exp(\sum_{i=1}^n \frac{(Q^T b)_i^2}{4\lambda_i})\end{aligned}\]

This yields a well-defined pdf. Notice that this normalization constant grows unbounded as any of the \( \lambda_i \rightarrow 0 \).

Notice that above, we decomposed \(f_Y\) in terms of a product of \(n\) independent functions.

We use these as our independent random variables which we will sample. Note that these are expressed in the eigenbasis of \(A\), and thus to obtain our final PDF we will have to apply the inverse of our \(Q^T\) isometry that we used in our integration.

Each \( f_i \) is defined as follows:

\[\begin{aligned}f_i(x) &= \sqrt{\frac{\lambda_i}{\pi}} \exp(-\frac{(Q^T b)_i^2}{4\lambda_i}) \exp(-x^2 \lambda_i - (Q^T b)_i x) \\&= \sqrt{\frac{\lambda_i}{\pi}} \exp(-\frac{(Q^T b)_i^2}{4\lambda_i} - x^2 \lambda_i - (Q^T b)_i x)\end{aligned}\]

Identifying the proper terms, we can express the CDF \(F_i\) as

\[ F_i(x) = \frac{1}{2} \left(\Phi\left(\frac{2 \lambda_i x+ (Q^T b)_i}{2 \sqrt{\lambda_i}}\right)+1\right)\]

Where \( \Phi \) is the error function \( \Phi(x) = \frac{2}{\sqrt{\pi }}\int _0^x e^{-t^2} dt \).

We then calculate the inverse CDF \( F_i^{-1} \) directly as

\[ F_i^{-1}(x) = \frac{2 \sqrt{\lambda_i} \Phi^{-1}(2 x-1)- (Q^T b)_i}{2 \lambda_i} \]
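Since the \(\Phi\) defined above is the error function, `scipy.special.erfinv` gives \(\Phi^{-1}\) directly. A sketch of sampling a single eigen-coordinate (each \(f_i\) is just a Gaussian with mean \(-(Q^T b)_i / (2\lambda_i)\) and variance \(1/(2\lambda_i)\)):

```python
import numpy as np
from scipy.special import erfinv

def sample_f_i(lam_i, c_i, n_samples, rng=None):
    """Inverse-CDF sampling of f_i, where c_i stands for (Q^T b)_i:
    F_i^{-1}(u) = (2 sqrt(lam) erfinv(2u - 1) - c) / (2 lam)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=n_samples)
    return (2 * np.sqrt(lam_i) * erfinv(2 * u - 1) - c_i) / (2 * lam_i)
```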

Using the inverse CDF method of sampling, we then can sample \(n\) iid uniform random variables \( U_i \). With the sample values \( \{u_1, \ldots u_n \} \), we compute the vector

\[ x = (F_1^{-1}(u_1), \ldots, F_n^{-1}(u_n))^T \]

Note that we need to perform our rescaling to constrain \( x \) to lie on the surface in our eigenbasis. We compute

\[ x' = \frac{R}{x^T \Lambda x + (Q^T b)^T x} x \]

and finally

\[ y = Qx' \]

Where \( Q \) is applied in order to shift back to the principal axes of our quadric surface. This procedure will yield samples \( y \) that lie uniformly on our surface.
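Putting the steps together, a numpy sketch of the whole procedure (function and variable names are my own, and the rescaling step follows the formula above verbatim):

```python
import numpy as np
from scipy.special import erfinv

def sample_quadric(A, b, R, n_samples, rng=None):
    """Sample the quadric y^T A y + b^T y = R for symmetric positive-definite A."""
    rng = np.random.default_rng() if rng is None else rng
    lam, Q = np.linalg.eigh(A)           # A = Q diag(lam) Q^T
    assert np.all(lam > 0), "all eigenvalues must be strictly positive"
    c = Q.T @ b                          # the vector (Q^T b)
    u = rng.uniform(size=(n_samples, len(lam)))
    x = (2 * np.sqrt(lam) * erfinv(2 * u - 1) - c) / (2 * lam)  # F_i^{-1}(u_i)
    # rescale in the eigenbasis: x' = R / (x^T Lambda x + (Q^T b)^T x) * x
    vals = np.einsum('ij,j,ij->i', x, lam, x) + x @ c
    xp = (R / vals)[:, None] * x
    return xp @ Q.T                      # y = Q x', one sample per row
```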

The black vector here denotes the subspace orthogonal to the tangent space of the submanifold, and the red is the tangent space of the submanifold.

If we remove our positive-definite condition on our metric, however, we obtain a much more peculiar structure.

For our manifold \(M\), say we chose our metric \(\eta\) to be the standard Lorentzian metric used in Minkowski spacetime: \[ ds^2 = dt^2 - dx^2 - dy^2 - dz^2\] We no longer have our positive-definite condition. Let's choose some \(3\)-dimensional embedded hypersurface \(S\), and consider its tangent space \(T_p S\) at a point \(p \in S\). Our embedding allows us to identify \(T_p S\) as a vector subspace of \(T_p M\), the tangent space at \(p\) in the original manifold.

So what do different potential configurations of \(T_p S\) look like? We know they will form a \(3\)-dimensional subspace. For illustration's sake, we move to a 2-dimensional analogue, suppressing the \(z\) coordinate. We first look at the case when \(T_p S\) is directly orthogonal to the local \(t\) coordinate:

Just like in the positive-definite case, our "normal" subspace is independent of the subspace of \(T_p S\), shown in red. We have some new structure, however. The blue cone shown (which was previously just the origin) is the space of "null vectors", or vectors \(X\) for which \(\eta(X, X) = 0\). Interestingly, even though we have a different structure for these null vectors, we still end up with the same kind of orthogonality we would have previously. When we start to "tilt" \(T_p S\), the way in which our normal subspace changes is quite different:

Here we see that as the subspace tilts, the normal tilts *towards* the subspace, and when the subspace reaches the cone, they coincide. This means that when \(T_p S\) intersects this null cone, the normal subspace to \(T_p S\) is its intersection with the null cone!

We can pull this back to a property of \(S\), defining \(S\) as a *null hypersurface* if at every \(p \in S\), the normal subspace to \(T_p S\) lies in the null cone.

Physically, our manifold is spacetime, with time having the positive sign in our metric. Particles are curves in spacetime with velocities that lie somewhere on the interior or the surface of the null cone in every tangent space. A tangent vector lying *on* the null cone corresponds to a velocity of a particle traveling at the speed of light.

Imagine converting a point-like piece of matter to photons with them scattering in all directions. The path these photons take will trace out a surface in our spacetime in the exact shape of the null cone, meaning this surface is a null hypersurface! In general these surfaces are very useful for performing computations and describing certain submanifolds of spacetime.


*This is the result of the second project of CS194-26 from Fall 2020*

We compute the magnitude of the gradient by first finding the gradient in the x and y directions, `dx` and `dy` respectively, then computing `(dx**2 + dy**2)**0.5` to get the true magnitude.
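As a sketch, assuming simple finite-difference filters (the project's exact filters may differ):

```python
import numpy as np
from scipy.signal import convolve2d

def gradient_magnitude(im):
    """Gradient magnitude from finite-difference convolutions in x and y."""
    dx = convolve2d(im, np.array([[1, -1]]), mode='same', boundary='symm')
    dy = convolve2d(im, np.array([[1], [-1]]), mode='same', boundary='symm')
    return (dx**2 + dy**2)**0.5
```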

For the following image:

The squared magnitude of the gradient is shown below.

The problem with this specific method is that we get a fair amount of noise: we want the edges of the image, but the gradient still responds strongly in finely textured regions such as the grass.

If we look at a histogram of the squared gradient we see that there are a fair number of small values.

So we can simply threshold these out and we get something slightly cleaner:

Another way to get the true edges would be to first denoise the image using a gaussian filter:

And then compute the gradient on this resulting image:

Here we can see this gives us a better representation of edges at the expense of granularity.

Since this is just two repeated convolutions we can simply convolve the filters in order to get a single filter which computes the x and y gaussian gradients.
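This works because convolution is associative: pre-convolving the gaussian with the finite-difference filter yields a single derivative-of-gaussian filter. A sketch, with a hypothetical `gaussian_kernel` helper:

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(ksize, sigma):
    """Normalized 2D gaussian kernel (outer product of a 1D gaussian)."""
    ax = np.arange(ksize) - ksize // 2
    g = np.exp(-ax**2 / (2 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

# fold the x finite-difference into the gaussian: one derivative-of-gaussian filter
g = gaussian_kernel(5, 1.0)
dog_x = convolve2d(g, np.array([[1, -1]]))
```

Convolving an image with `dog_x` gives the same result as blurring first and differentiating second.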

And we can see that this gives us the same result as before

Now we'll demonstrate how we can use the image gradient for straightening the following image:

Instead of looking at the magnitude of the gradient, we will look at the *angle* of it, given by `theta = arctan(dy/dx)`.

This will tell us what the angle of the gradient is in a given region and thus what a given edge is like (we can do this in addition to filtering out low magnitude gradients to prioritize edges).

Here is the `theta` for the above image and the distribution of angles.

We can see it's just slightly off at 0 and at `pi/2` and `-pi/2`.

We want to maximize how close this is to those angles, so we count the number of points with angles whose absolute value is close to `0` or `pi/2`.

Rotating between -15 and 15 degrees, we get the following percentage of angles in the given threshold:

We then choose the image with the highest such amount and look at that specific rotation:
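The scoring step could look like this (the filter, threshold, and tolerance here are my guesses rather than the project's exact values):

```python
import numpy as np
from scipy import ndimage

def straightness_score(im, mag_frac=0.2, tol=0.05):
    """Fraction of strong-gradient pixels whose angle is near 0 or pi/2."""
    dx = ndimage.sobel(im, axis=1)
    dy = ndimage.sobel(im, axis=0)
    mag = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    # gradient sign doesn't matter for edge orientation: fold into [-pi/2, pi/2)
    theta = np.mod(theta + np.pi / 2, np.pi) - np.pi / 2
    strong = theta[mag > mag_frac * mag.max()]
    return np.mean((np.abs(strong) < tol) |
                   (np.abs(np.abs(strong) - np.pi / 2) < tol))
```

The chosen rotation is then simply the angle in [-15, 15] degrees that maximizes this score.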

Rotating the color version of the image instead, we get:

Nice! This works well for images with a fair number of horizontal and vertical edges such as:

And

This does break, however, for images with less common or pronounced vertical and horizontal features such as

The rotations are likely thrown off by the strong gradients and angles shown on the kids in the foreground even though we can see the wall is slanted in the background.

Now let's play with some filters. First let's try to sharpen this image:

We know we can filter out high frequencies if we use a gaussian kernel such as the 5x5 one below:

Which gives us a blurred version of the image.

If we subtract this blurred version from the original image, we are left with mostly high frequencies in our image:

Hence adding it back with some scaling factor will give us a "sharper" version of the image, in which the higher frequencies make the image seem sharper.

The sharpened image is on the left in this image.

We can also combine this into a single filter by combining the unit impulse filter with the gaussian to get the laplacian:

And we see that using just this filter gives us the same result. Sweet!
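The single filter follows from linearity of convolution: \((1+\alpha)\,\delta - \alpha\,g\). A sketch (kernel size, sigma, and alpha below are placeholder choices):

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(ksize, sigma):
    """Normalized 2D gaussian kernel (outer product of a 1D gaussian)."""
    ax = np.arange(ksize) - ksize // 2
    g = np.exp(-ax**2 / (2 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def unsharp_filter(ksize, sigma, alpha):
    """(1 + alpha) * unit impulse - alpha * gaussian, as a single kernel."""
    g = gaussian_kernel(ksize, sigma)
    impulse = np.zeros_like(g)
    impulse[ksize // 2, ksize // 2] = 1.0
    return (1 + alpha) * impulse - alpha * g
```

Convolving with this one kernel matches blurring, subtracting, and adding back with the same `alpha`.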

Applied per-channel to the color image, it looks a little strange but generally seems to work:

Here it is applied to a Henri Cartier-Bresson shot. We can see a significant effect of sharpening.

If, however, we try to sharpen after blurring, it doesn't have much effect.

This is because there is little in the way of high frequencies to add back in, as the gaussian has already filtered many of them out.

We can use our techniques of high and low frequency manipulation to make images which appear to be different depending on how far away they are seen.

Consider the following aligned images.

We can overlay these by taking the high frequencies from one and the low from the other using our gaussian filter.

Combining these images linearly we get

Which, viewed large, looks like a cat, but at a small view looks like our good friend Derek.
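The overlay itself is only a few lines; a sketch using `scipy.ndimage.gaussian_filter` as the low-pass (the cutoff `sigma` is a knob to tune per image pair):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hybrid_image(im_far, im_near, sigma=4.0):
    """Low frequencies of im_far plus high frequencies of im_near."""
    low = gaussian_filter(im_far, sigma)
    high = im_near - gaussian_filter(im_near, sigma)
    return low + high
```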

I for some reason felt compelled to combine the following two images of Rick and Alyosha.

Which yields the following beauty:

That when seen from afar looks just like Rick!

We can look at how these images are represented in the frequency domain using the log absolute value of the fourier transform:

If we look at the frequency representation of the combined image, we see that the higher frequencies (those further from the center) are far more like Alyosha's frequencies than Rick's.

It might be difficult to see, but the lower frequencies are also more similar to Rick's.

The bottom two images are the fourier transforms of the respectively filtered images, which aren't particularly illuminating (I think the artifacting in Rick's image is from the weird way I cropped it).

In quarantine I have missed going to one of my favourite restaurants in Berkeley, the infamous pizza place Jupiter. The following images are dedicated to the BBQ Chicken pizza at Jupiter.

Oh wow look a cool planet!

Oh wow look a tasty pizza!

This works great in both above examples as we can discern the low frequency image by its low frequencies and the high frequency image by its high frequencies.

If we can't do this, we don't get the same effect.

Consider my following attempt to mix a snowy and leafy forest:


None of them really look like much, as we can't discern the leafy forest by its low frequencies or the snowy one by its high frequencies.

We can also use progressively higher variance gaussians to get a perspective of different levels of frequencies in an image.

For example, the following image has low frequencies that look like Lincoln while its high and medium frequencies look like a lady in front of a mural.

The Gaussian and Laplacian stacks show this perfectly, with only Lincoln present in the lowest frequencies, the blocks of the mural in the high frequencies, and the woman primarily in the middle frequencies.

This can also demonstrate the effect on our friend Rickyosha. We see mostly the texture of Alyosha in the highest frequencies and only the shape of Rick in the lowest frequencies (note that you may have to open this image separately as it may be too small to properly see the high frequencies in the bottom left).

Oh, you didn't think I had forgotten about Jupiter yet, did you? Let's blend these two images cleanly using more of our frequency techniques!

Here we'll use the following mask (just a simple linear step function) and pass it through a Gaussian stack as above.

We'll also compute the laplacian stacks for both the Jupiter and pizza to get different bands of frequencies.

We will then combine these with the gaussian-stacked mask to get a wide blend for low frequencies and a closer blend for high frequencies!

If we sum up all these blended images, we get the following... tasty... tasty... image...
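A sketch of the stack-based blend (the number of levels and sigma are placeholder choices; summing a Laplacian stack plus its final low-pass level recovers the original image, which is what makes the per-band blend work):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_stack(im, levels, sigma=2.0):
    """Repeatedly blurred copies of im, without downsampling."""
    stack = [im]
    for _ in range(levels):
        stack.append(gaussian_filter(stack[-1], sigma))
    return stack

def laplacian_stack(im, levels, sigma=2.0):
    # band-pass layers plus the final low-pass residual; summing recovers im
    g = gaussian_stack(im, levels, sigma)
    return [g[i] - g[i + 1] for i in range(levels)] + [g[-1]]

def blend(im1, im2, mask, levels=4, sigma=2.0):
    """Blend each frequency band with a progressively smoother mask."""
    gm = gaussian_stack(mask, levels, sigma)
    l1 = laplacian_stack(im1, levels, sigma)
    l2 = laplacian_stack(im2, levels, sigma)
    return sum(gm[i] * l1[i] + (1 - gm[i]) * l2[i] for i in range(levels + 1))
```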

We can simply apply this per channel to get an even more.. oh.. mmmm..

Could we also use this technique to combine our snowy/leafy example which failed above?

Wow, looks awesome! We aren't limited to just a vertical mask though. Let's try something fancy: let's put that pepperoni-looking spot on Jupiter on our pepperoni pizza using the following mask:

Looks nice and smooth

I definitely learned the utility of gaussian filters as low-pass filters. I previously thought the most reasonable way to do such a thing would be to clip the fft of an image, but this works shockingly well and produces some pretty cool results!

Firstly, for a positive random variable \(X\) and any \(a > 0\), Markov's inequality states that

\[ Pr(X \geq a) \leq \frac{\mathbb{E}[X]}{a}\] or \[ a Pr(X \geq a) \leq \mathbb{E}[X] \]

This appears fairly arbitrary, but if we visualize the distribution of \(X\) in the right way, this inequality appears totally natural.

For all values \(x\) which \(X\) takes on, construct a block of height \(x\) and width \(Pr(X = x)\) and stack them all next to each other: For example, the given distribution results in the adjacent "bar graph"

Now we can see that the total area of these rectangles is the following: \[1 \cdot \frac1{5} + 2 \cdot \frac1{5} + 3 \cdot \frac2{5} + 4 \cdot \frac1{5} \] Which we see is just \(\mathbb{E}[X]\)!

Hence the total area of this graph is simply \(\mathbb{E}[X]\). We can also see what \(a Pr(X \geq a)\) represents for, say, \(a = 3\)

Just looking at the graph now, we see that no matter *how* we choose \(a\), the value \(aPr(X \geq a)\) will always be the area of some rectangle that lies within these blocks, and hence its area must be less than the total area of the blocks, \(\mathbb{E}[X]\)! Thus, Markov's inequality holds.
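We can check this numerically for the example distribution (\(X\) takes the values 1, 2, 3, 3, 4, each with probability 1/5):

```python
import numpy as np

# the example distribution: outcomes and their probabilities
values = np.array([1, 2, 3, 3, 4])
probs = np.full(5, 1 / 5)
ex = float(values @ probs)  # total area of the blocks: E[X] = 13/5

def markov_lhs(a):
    """a * Pr(X >= a): the area of the inscribed rectangle."""
    return a * probs[values >= a].sum()
```

For every choice of `a`, `markov_lhs(a)` stays below `ex`, exactly as the picture predicts.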

This demonstrates how important it can be to have the right way of looking at something. The key here is scaling the size of the x-axis to align with the probabilities, while letting the y-axis take on the actual values of our random variable.

This happens to be an extremely good idea in theoretical statistics and is used to do powerful things, like unify discrete and continuous probability. By looking at real-valued random variables as real-valued functions on the sample space \[f: \Omega \rightarrow \mathbb{R} \] and using the probability of events, \(Pr(A)\) for \(A \subseteq \Omega\), as a notion of volume in \(\Omega\), we get a similar graph as the one we constructed above (depending on how we order the elements of \(\Omega\)).

Then the expected value of that random variable \(f\) becomes the 'area under the graph of \(f\)', which requires a surprising amount of math and leads to the Lebesgue Integral. If you're interested, I highly recommend reading this, which gives a great formal introduction to theoretical statistics and requires little more than high school calculus.

A lot of students in office hours seemed to be unsure of what a Modspace is and exactly what \(\mod m\) means (is it a function?) or how \(3 \equiv 18 \mod 15\). To start out, we need to talk about what a Modspace is. (Note: super mathy things that you should read about if you're interested are written in *italics*, ex: *tropical math*).

Consider the integers, \(\mathbb{Z}\). For integers, \(+\) is some operation called an 'addition' that takes in two numbers and adds them together. There is an additive identity \(0\) defined to be the number such that \(a + 0 = a\) and each number \(a\) has an additive inverse \(-a\), defined to be the number such that \(a + (-a) = 0\).

There's also some operation \(\times\) called a 'multiplication', for which there is a multiplicative identity \(1\) such that \(1 \times a = a\) for any number \(a\), and which distributes over addition, meaning \[c \times (a + b) = c \times a + c \times b \]

Super mathy people, called *Algebraists*, call structures with these properties *commutative rings*. It's a mathy structure to describe some collection of things that you can add and multiply, like we're used to. Modspaces define one of these structures on some smaller subset of \(\mathbb{Z}\). What happens if we take this structure and constrict its elements down from \(\{\dots -2, -1, 0, 1, 2 \dots \}\) to something like \(\{0, 1, \dots m-1\}\)? Well, we need to redefine addition and multiplication to be compatible with this change, and modspaces choose a very particular way of doing this.

When we write \(3 \equiv 18 \mod 15\), we don't mean that \(3 = 18\) in the integers. What we mean is that in the Modspace with \(m=15\), \(3\) is *equivalent* to \(18\). We can talk about defining an *equivalence relation* for \(\mod 15\), called \(\sim\), which behaves like \(=\) in that it's reflexive (\( a \sim a\)), symmetric (\(a \sim b \Longleftrightarrow b \sim a\)), and transitive (\(a \sim b \land b \sim c \Rightarrow a \sim c \)).

In the case of Modspaces, we define \(a\) and \(b\) to be *equivalent* (\(a \sim b\)) if \(a=b + k \times m \) for some \(k \in \mathbb{Z} \). This intuitively means that \(a \sim b\) if \(a\) and \(b\) are equal up to adding multiples of \(m\). Note that \(a,b,k,m\) are integers, \(=\) is the normal equality in the integers, and \(+\) and \(\times \) are the normal operations from the integers, so this equivalence relation is defined in terms of operations on the integers.

What Modspaces do is take all of the integers and 'glue together' those which are equivalent with respect to \(\sim \). We bundle all these up into a single element of the Modspace, called an *equivalence class*. These classes are denoted as \([a]\), and can be vaguely understood as the collection of elements equivalent to it (\([a] \approx \{b \mid a \sim b\})\). Another way to understand this is as following: let's deal with \(\mod 15\) and imagine we draw out the number line in this specific way

What we are doing when defining this equivalence class is collapsing these lines along the y-axis. This looks like the following:

Now that we have a set of elements (all the \([a]\) for \(a \in \{0, 1, \dots, m-1\}\)) we want to turn it into one of those mathy structures like the integers. In order to do so, we need to define an 'addition' and 'multiplication' that exhibit the properties we require.

To keep clear which multiplication/addition is which, I'll use the symbols \(\otimes\) and \(\oplus\) to denote the multiplication and addition respectively in \(\mod m\). We define these in the obvious way: $$ [a] \oplus [b] = [a + b] $$ $$ [a] \otimes [b] = [a \times b]$$

To add or multiply any two elements of the Modspace, we simply look at the equivalence class of their integer sum/product. Simple, right? (A good exercise is to check that our addition/multiplication \(\oplus\) and \(\otimes\) are *well-defined*, that is, you get the same result regardless of which representatives of the equivalence classes you choose).

Going back to this example, let's try to understand exactly what's happening: $$3 \equiv 18 \mod 15$$

Firstly, \(3, 18 \mod 15\) represent the equivalence classes \([3]\) and \([18]\). Since \([3] = [3 + k \times 15]\) for any \(k\), taking \(k=1\) gives \(3 + 1 \times 15 = 18\), so \([3] = [18]\) and \(3 \equiv 18 \mod 15\). We can also use our picture of the number line to see that 3 and 18 both get 'squished' down to the same number:

You can also understand how our definition of \(\oplus\) and \(\otimes\) works by looking at how an expression such as \(8 + 14 \equiv 7 \mod 15\). We simply jump along the number line (like we used to back in elementary school), and see that we end up landing on the equivalence class corresponding to \(7\):
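In code, the whole story is just Python's `%` operator, which picks the representative in \(\{0, 1, \dots, m-1\}\):

```python
m = 15

def equiv(a, b):
    """a ~ b in mod m: a = b + k*m for some integer k."""
    return (a - b) % m == 0

def mod_add(a, b):
    """[a] + [b]: add as integers, then reduce to the representative in {0, ..., m-1}."""
    return (a + b) % m
```

Well-definedness shows up as `mod_add` giving the same answer no matter which representatives you feed it.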

Not so bad, right?

A common practice in graph algorithms is to use a standard node-edge based model for computation. While this has many different representations, the model of computational steps stays very similar across them. I stumbled upon this PhD showcase from SC18 entitled "Linear Algebra is the Right Way to Think About Graphs". It argues that, instead of a vertex-edge based representation of graph computations, linear algebra-based methods are more concise, expressible, portable, and potentially more performant than traditional models.

Each step of a graph traversal can be cleverly represented as a sparse matrix-vector multiplication, where the vector represents nodes to be traversed and the matrix represents the edges in the graph:

*by Carl Yang (UC Davis; LBNL)*

Here, each \((row, column)\) entry in the matrix represents an edge from \(column \rightarrow row\).
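A toy sketch of BFS in this style with `scipy.sparse` (using a dense frontier vector for simplicity; real implementations use SpMSpV and smarter masking):

```python
import numpy as np
import scipy.sparse as sp

def bfs_spmv(adj, source):
    """BFS where each frontier expansion is one sparse matrix-vector product.
    adj[row, col] != 0 encodes an edge col -> row, matching the convention above."""
    n = adj.shape[0]
    frontier = np.zeros(n)
    frontier[source] = 1
    depth = np.full(n, -1)
    depth[source] = 0
    visited = frontier > 0
    d = 0
    while frontier.any():
        d += 1
        frontier = adj @ frontier   # one SpMV expands the whole frontier
        frontier[visited] = 0       # mask out already-visited vertices
        newly = frontier > 0
        depth[newly] = d
        visited |= newly
    return depth
```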

By combining this concept with the Push-Pull variant of BFS presented in Direction-Optimizing Breadth-First Search, the potential for fast, parallel BFS algorithms arises. One nice side-effect of this model is that Push and Pull are duals of each other: in Push, column \(i\) holds the nodes which vertex \(i\) goes to, and row \(j\) holds the nodes which go into vertex \(j\). Thus, to implement Pull, we need only swap the row-column representation and run the same algorithm!

This formulation does require some additional optimizations relative to the standard formulations, however. As the place we start from is a single point in a massive vector, we have both a sparse matrix and a sparse vector. Since modern SpMV algorithms are not optimized for a sparse vector, what we need is a sparse matrix sparse vector multiply (SpMSpV) algorithm, which Fast Sparse Matrix and Sparse Vector Multiplication Algorithm on the GPU implements quite cleverly and in a similar vein to many SpMV algorithms.

Additionally, once a vertex is reached, there is no need to explore it. Using masking techniques which skip some areas of computation, available in hardware-level SpMV techniques or in algorithmic tweaks, we can further reduce the amount of work required for the BFS in the later stages of its search and approach the efficiency of standard graph computation methods.

The performance of these linear algebra single-GPU based algorithms is mixed. They rival the 2-CPU performance of similar algorithms, but do not currently scale to multiple GPUs or multiple machines. In order for these linear algebra-based graph algorithms to be truly useful, they must be able to scale to multiple GPUs and multiple systems. There are GPU-based alternatives, however, such as Gunrock, which scale to multiple GPUs and multiple machines. Although these do not process data in a linear algebraic way, they are still quite fast. The linear algebra techniques still achieve nearly the performance of libraries such as Gunrock in most areas though:

The main advantage of centering an algorithm around these common linear algebra routines is that if a fast vendor library or a new algorithm arises for the routine which a graph algorithm depends on, the graph algorithm naturally runs faster. The portability of these methods is their true value, and if they can be parallelized from a single shared-memory machine (like a GPU) to distributed-memory systems (like a top500 supercomputer), they will see their maximum potential.
