I recently came across Markov's inequality as it's formulated in measure theory and was struck by how intuitive it is. This formulation turns out to be equivalent to the one taught in most undergraduate probability courses, and it helped me a lot in understanding the inequality.

First, for a nonnegative random variable \(X\) and any \(a > 0\), Markov's inequality states that

\[ Pr(X \geq a) \leq \frac{\mathbb{E}[X]}{a}\] or \[ a Pr(X \geq a) \leq \mathbb{E}[X] \]
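As a quick sanity check, we can estimate both sides of the inequality by sampling. This is just a sketch using Python's standard library; the exponential distribution (a nonnegative random variable with mean 1) is an arbitrary choice of example, not anything specific to the inequality.

```python
import random

# Sample a nonnegative random variable: Exponential(1), so E[X] = 1.
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # empirical E[X]

# For several thresholds a, check a * Pr(X >= a) <= E[X] on the
# empirical distribution (where the inequality holds exactly).
for a in (0.5, 1.0, 2.0, 4.0):
    tail = sum(1 for x in samples if x >= a) / len(samples)
    assert a * tail <= mean
    print(f"a = {a}: a * Pr(X >= a) = {a * tail:.3f} <= E[X] = {mean:.3f}")
```

Note that the inequality holds exactly for the empirical distribution, not just in the limit: every sample counted in the tail contributes at least \(a\) to the mean.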

This appears fairly arbitrary, but if we visualize the distribution of \(X\) in the right way, this inequality appears totally natural.

For each value \(x\) that \(X\) takes on, construct a block of height \(x\) and width \(Pr(X = x)\), and stack them side by side. For example, a distribution with \(Pr(X = 1) = Pr(X = 2) = Pr(X = 4) = \frac1{5}\) and \(Pr(X = 3) = \frac2{5}\) results in the adjacent "bar graph"

Now the total area of these rectangles is \[1 \cdot \frac1{5} + 2 \cdot \frac1{5} + 3 \cdot \frac2{5} + 4 \cdot \frac1{5} = \frac{13}{5},\] which is just \(\mathbb{E}[X]\)!
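We can verify this block-area computation directly. Here's a small sketch in Python, using exact fractions and the example distribution above:

```python
from fractions import Fraction

# The example distribution: Pr(X = x) for each value x that X takes on.
dist = {1: Fraction(1, 5), 2: Fraction(1, 5), 3: Fraction(2, 5), 4: Fraction(1, 5)}

# Total area of the blocks = sum over x of (height x) * (width Pr(X = x)),
# which is exactly the definition of E[X].
expected = sum(x * p for x, p in dist.items())
print(expected)  # → 13/5
```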

Hence the total area of this graph is simply \(\mathbb{E}[X]\). We can also see what \(a Pr(X \geq a)\) represents for, say, \(a = 3\): it is the area of a rectangle of height \(3\) and width \(Pr(X \geq 3) = \frac{3}{5}\).

Just looking at the graph now, we see that no matter *how* we choose \(a\), the value \(a Pr(X \geq a)\) is always the area of some rectangle that lies within these blocks: the rectangle has height \(a\) and spans exactly the blocks whose heights are at least \(a\). Hence its area must be at most the total area of the blocks, \(\mathbb{E}[X]\)! Thus, Markov's inequality holds.
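This rectangle argument can also be checked for every threshold on the example distribution. A sketch, again with exact fractions:

```python
from fractions import Fraction

# The same example distribution as before.
dist = {1: Fraction(1, 5), 2: Fraction(1, 5), 3: Fraction(2, 5), 4: Fraction(1, 5)}
total_area = sum(x * p for x, p in dist.items())  # E[X] = 13/5

# For each threshold a, the rectangle of height a and width Pr(X >= a)
# fits inside the blocks, so its area is at most the total area.
for a in (1, 2, 3, 4):
    width = sum(p for x, p in dist.items() if x >= a)  # Pr(X >= a)
    rect = a * width
    assert rect <= total_area
    print(f"a = {a}: a * Pr(X >= a) = {rect} <= E[X] = {total_area}")
```

For \(a = 3\) this prints the rectangle area \(\frac{9}{5}\), matching the height-\(3\), width-\(\frac{3}{5}\) rectangle described above.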

This demonstrates how important it can be to have the right way of looking at something. The key here is letting the x-axis measure probability, while the y-axis takes on the actual values of our random variable.

This happens to be an extremely good idea in theoretical statistics and is used to do powerful things, like unify discrete and continuous probability. By viewing real-valued random variables as real-valued functions on the sample space \[f: \Omega \rightarrow \mathbb{R} \] and using the probability of events, \(Pr(A)\) for \(A \subseteq \Omega\), as a notion of volume in \(\Omega\), we get a graph similar to the one we constructed above (depending on how we order the elements of \(\Omega\)).
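To make this concrete, here is a sketch of that viewpoint on a finite sample space. The outcome names and the particular function \(f\) are made up for illustration; \(f\) is chosen to reproduce the earlier example distribution, taking the value \(3\) on two outcomes.

```python
from fractions import Fraction

# A hypothetical sample space of 5 equally likely outcomes,
# and a random variable f: Omega -> R defined pointwise on it.
omega = ["w1", "w2", "w3", "w4", "w5"]
pr = {w: Fraction(1, 5) for w in omega}            # "volume" of each outcome
f = {"w1": 1, "w2": 2, "w3": 3, "w4": 3, "w5": 4}  # f hits 3 on two outcomes

# E[f] as the "area under the graph of f" over Omega...
area = sum(f[w] * pr[w] for w in omega)

# ...agrees with the usual value-times-probability sum over the values of f.
by_value = sum(x * sum(pr[w] for w in omega if f[w] == x) for x in set(f.values()))
assert area == by_value
print(area)  # → 13/5
```

Grouping outcomes by the value \(f\) takes is exactly the passage from the function-on-\(\Omega\) picture back to the bar graph we drew earlier.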

Then the expected value of that random variable \(f\) becomes the 'area under the graph of \(f\)', which requires a surprising amount of math and leads to the Lebesgue integral. If you're interested, I highly recommend reading this, which gives a great formal introduction to theoretical statistics and requires little more than high school calculus.