The Boltzmann distribution, also known as the Gibbs or Maxwell-Boltzmann distribution, is a key concept in statistical mechanics. It describes the probability distribution of particles across energy states in a system at equilibrium. In effect, it relates microscopic particle properties to macroscopic observables, enabling the analysis of complex ensembles.
In a different post I showed a thermodynamics-inspired derivation of the distribution. In this post, I’ll be showing two derivations that are more statistics inspired. They are:
- An information theory based approach
- A combinatorics based approach
This notebook assumes that you’ve taken one or two statistics courses. As long as you’ve done that, you shouldn’t run into anything that you can’t figure out with a quick Google search or by reading the pt1 post. If you see anything wrong, please send me an email at contactme@harrisonsantiago.com
Information Theory Based Approach
For this section we will use Shannon entropy, defined as:
\[ H = -\sum_x p(x) \ln p(x) \]
where \(p(x)\) is the probability of some event.
In the first post, we learned how the principle of maximizing entropy is fundamental in statistical mechanics. Let’s now provide some intuition behind its use here.
Entropy, in the context of information theory and statistical inference, serves as a fundamental measure of uncertainty or randomness within a probability distribution. Conceptually, it quantifies the average amount of information conveyed by each event in a distribution. A higher entropy value indicates a more uniform and less predictable distribution, characterized by greater uncertainty about the outcomes. Conversely, lower entropy suggests a more concentrated and predictable distribution, where certain outcomes are significantly more likely than others; such a distribution implies some additional knowledge or bias, which lowers the uncertainty and thus the entropy. If we used a six-sided die as an example, a truly random die would have higher entropy than one we know to be heavily weighted.
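To make this concrete, here is a minimal Python sketch (the loaded-die weights are made up for illustration) comparing the Shannon entropy of a fair six-sided die with that of a heavily weighted one:

```python
import numpy as np

def shannon_entropy(p):
    """H = -sum p ln p, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

fair_die = np.ones(6) / 6                             # uniform: maximum uncertainty
weighted_die = [0.75, 0.05, 0.05, 0.05, 0.05, 0.05]   # hypothetical loaded die

print(shannon_entropy(fair_die))      # ~1.792 (= ln 6), the maximum for 6 outcomes
print(shannon_entropy(weighted_die))  # ~0.965, lower because the die is predictable
```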
The Principle of Indifference posits that in the absence of any distinguishing information or rational basis for preferring one outcome over another, one should assign equal probabilities to all possible outcomes. The maximum entropy principle extends the concept of indifference to more complex scenarios involving partial information or constraints. It asserts that among all probability distributions satisfying a given set of constraints, one should select the distribution with the highest entropy. This principle provides a systematic method for incorporating known information into a probabilistic model while maintaining maximum uncertainty about unknown aspects.
By maximizing entropy subject to constraints, we obtain a distribution that respects all available information without introducing extraneous assumptions or structure. This approach ensures that the resulting probability distribution is as unbiased and broadly applicable as possible, given the constraints of the problem at hand. In other words, we’re being as non-committal as possible about the things we don’t know, while fully accounting for the things we do know.
So how can we use this to derive the Boltzmann distribution?
Setup
Let’s consider a system with a set of states \(i\), each characterized by some general property \(x_i\). This could be energy (\(E_i\)), but it could also be any other quantity of interest (e.g., volume, magnetization, particle number). It’s important to note that here \(i\) labels one microstate of the system and \(x_i\) is the value of that property in the microstate. In thermodynamics \(x_i\) would typically represent the energy of a specific configuration of particles (or a single particle if there is only one).
We want to find the probability distribution \(p_i\) over these states, where \(p_i\) refers to the probability of a particular microstate.
We have two main constraints:
- Normalization: \(\sum_i p_i = 1\)
- Fixed average state: \(\bar{x} = \sum_i p_i x_i\)
Let’s break down each constraint and explore its significance:
Normalization Constraint: \(\sum_i p_i = 1\)
- The sum of all probabilities equaling 1 is a fundamental property of any probability distribution. In the context of least biased inference, this constraint is telling us that we know with certainty that the system exists in some state, but without additional information, we don’t know which one.
Fixed Average State Constraint: \(\bar{x} = \sum_i p_i x_i\)
- This constraint embodies our knowledge about the system’s state. In a physical system, this often corresponds to a fixed average energy, which tells us the temperature. In this context it is particularly important, as it maintains conservation of energy. Beyond this, the average state is often something we can measure or estimate from macroscopic properties of the system. This constraint, therefore, connects our microscopic description to macroscopic, measurable quantities.
In the context of least biased inference, by maximizing the entropy subject to these constraints, we’re saying: “Given that we know the system exists in some state, and we know its average energy, what’s the least biased way to assign probabilities to all the possible energy states?”
So how do we actually maximize our entropy, \(H\), subject to our constraints? This can be done using the method of Lagrange multipliers. We construct the Lagrangian \(L\):
\[ L = -\sum_i p_i \ln p_i - \lambda\left(\sum_i p_i - 1\right) - \beta\left(\sum_i p_i x_i - \bar{x}\right) \]
where \(\lambda\) and \(\beta\) are Lagrange multipliers.
To find the maximum, we differentiate \(L\) with respect to each \(p_i\) and set the derivative to zero:
\[ \frac{\partial L}{\partial p_i} = -\ln p_i - 1 - \lambda - \beta x_i = 0 \]
Now, let’s solve this equation for \(p_i\):
\[ -\ln p_i = 1 + \lambda + \beta x_i \]
\[ p_i = \exp(-(1 + \lambda + \beta x_i)) \]
\[ p_i = \exp(-(1 + \lambda)) \cdot \exp(-\beta x_i) \]
If we define \(Z = \exp(1 + \lambda)\), then we have:
\[ p_i = \frac{1}{Z} \exp(-\beta x_i) \]
which is our Boltzmann distribution when \(\beta = \frac{1}{k_B T}\).
At this point, \(Z\) is just a constant that ensures normalization. But by using the normalization constraint, we know that the probabilities must sum to 1 and can say:
\[ \sum_i \frac{1}{Z} e^{-\beta x_i} = 1 \]
or
\[ \frac{1}{Z} \sum_i e^{-\beta x_i} = 1 \]
Solving for Z:
\[ Z = \sum_i e^{-\beta x_i} \]
And putting it together, we can say that
\[ p_i = \frac{e^{-\beta x_i}}{\sum_{j=1}^{M} e^{-\beta x_j}} \]
where \(M\) is the number of accessible states.
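As a sanity check on the derivation, here is a small Python sketch that maximizes the entropy subject to the two constraints numerically (using scipy’s generic SLSQP optimizer, with made-up states \(x_i\) and a made-up target average \(\bar{x}\)) and compares the result to the analytic Boltzmann form:

```python
import numpy as np
from scipy.optimize import brentq, minimize

# Hypothetical discrete states x_i and target average x_bar (both made up).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x_bar = 1.2

def neg_entropy(p):
    """Negative Shannon entropy; minimizing this maximizes H."""
    p = np.clip(p, 1e-12, None)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},        # normalization
    {"type": "eq", "fun": lambda p: np.sum(p * x) - x_bar},  # fixed average state
]
p0 = np.ones_like(x) / len(x)  # start from the uniform (indifferent) distribution
result = minimize(neg_entropy, p0, method="SLSQP",
                  bounds=[(0.0, 1.0)] * len(x), constraints=constraints)
p_numeric = result.x

# Analytic Boltzmann form: solve for the beta that reproduces x_bar, then compare.
def mean_x(beta):
    w = np.exp(-beta * x)
    return np.sum(x * w) / np.sum(w)

beta = brentq(lambda b: mean_x(b) - x_bar, -20.0, 20.0)
p_boltzmann = np.exp(-beta * x) / np.sum(np.exp(-beta * x))

print("numerical max-ent:", np.round(p_numeric, 4))
print("Boltzmann form:   ", np.round(p_boltzmann, 4))
```

The two rows should agree to within the optimizer’s tolerance.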
This is a new expression for \(Z\), but now it’s in a form that we can actually calculate if we know the energy levels and \(\beta\). Looking at this, the formulation for \(Z\) may seem a little pointless; after all, it is just the sum of the Boltzmann factors we already wrote down. However, in practice it is useful: because the sum is dominated by the low-energy terms (for positive \(\beta\)), we can readily approximate \(Z\) provided we can adequately measure the \(x_i\) at lower energies.
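As a rough illustration of that truncation idea (the evenly spaced energy levels and the value of \(\beta\) here are arbitrary), a partial sum over only the lowest-energy states already captures most of \(Z\):

```python
import numpy as np

beta = 2.0
energies = np.arange(0.0, 50.0, 0.5)         # hypothetical energy levels
boltzmann_factors = np.exp(-beta * energies)

Z_full = boltzmann_factors.sum()
for cutoff in (3, 5, 10):
    Z_partial = boltzmann_factors[:cutoff].sum()   # keep only the lowest states
    print(f"lowest {cutoff:2d} states capture {Z_partial / Z_full:.4f} of Z")
```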
The goal of this derivation was to bridge the gap between statistical mechanics and information theory, demonstrating that the Boltzmann distribution is not merely a thermal phenomenon, but a fundamental principle of constrained uncertainty. Furthermore, the parameter \(\beta\) (inverse temperature) emerges from the math, not from thermodynamic considerations. Since \(\beta = \frac{\partial H}{\partial \bar{x}}\), we see that it represents the sensitivity of the entropy to changes in the constrained variable. We can interpret this as saying \(\beta\) is related to the amount of information provided by knowing the energy of a state. A higher \(|\beta|\) means that knowing the energy of a state provides more information about its probability. This results in a more sharply peaked distribution around the average energy.
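We can also check the \(\beta = \frac{\partial H}{\partial \bar{x}}\) relationship numerically with a quick finite-difference sketch (the energy levels below are arbitrary):

```python
import numpy as np

x = np.array([0.0, 0.7, 1.5, 2.2, 3.0])   # made-up state values

def boltzmann_stats(beta):
    """Return (x_bar, H) for the Boltzmann distribution at this beta."""
    p = np.exp(-beta * x)
    p /= p.sum()
    x_bar = p @ x
    H = -np.sum(p * np.log(p))
    return x_bar, H

# Nudge beta slightly and compare the finite difference dH/dx_bar with beta.
(x1, H1), (x2, H2) = boltzmann_stats(1.000), boltzmann_stats(1.001)
print("finite-difference dH/dx_bar:", (H2 - H1) / (x2 - x1))  # ~1.0
print("beta used:", 1.0)
```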
Combinatorics Based Approach
While the information theory based approach may be more useful in an applied setting, a combinatorics approach is much closer to how Ludwig Boltzmann originally derived the distribution. Boltzmann’s insight was to consider an isolated system with a fixed number of particles \(N\) and total energy \(E\), and to enumerate all possible microstates consistent with these macroscopic constraints. Here, each accessible microstate is assumed to be equally probable, a principle known as the postulate of equal a priori probabilities (closely related to the ergodic hypothesis). The central quantity in this approach is the multiplicity \(\Omega(E,N,V)\), which represents the number of microstates available to the system at a given energy \(E\), particle number \(N\), and volume \(V\). The entropy of the system is then defined as \(S = k_B \ln \Omega\), where \(k_B\) is Boltzmann’s constant. Now let us see it in action.
This setup is going to have slightly different notation, because I find it easier to track in this context. Let’s consider a large isolated system with total energy \(\mathbf{E}\). If we divide this system into \(N\) identical subsystems (most often particles), it stands to reason that each subsystem can be in various energy states, \(E_i\). As is often done, let’s assume the energy levels are discrete and labeled \(E_0, E_1, E_2, \ldots\). Under these criteria a microstate is a specific configuration of all the particles, while a macrostate is defined by the number of particles at each energy level, \(n_0, n_1, n_2, \ldots\).
Similar to our earlier work, we will have two constraints. (1) The total number of particles is fixed: \(N = \sum_i n_i\). (2) The total energy of the system is fixed: \(\mathbf{E} = \sum_i n_i E_i\).
From here it is trivial to say that for a given macrostate, the number of microstates (or the number of ways to distribute \(N\) subsystems among the energy levels), \(W\), is given by:
\[ W = \frac{N!}{\prod_i n_i!} \]
Here we note that the particles are treated as distinguishable. If they are not, we should divide by \(N!\), since permuting identical particles does not lead to a new state.
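Here is a small sketch (using a made-up macrostate of \(N = 5\) distinguishable particles across three levels) that evaluates \(W\) from the formula and confirms it against a brute-force enumeration of microstates:

```python
from collections import Counter
from itertools import product
from math import factorial

# Hypothetical macrostate: occupation numbers (n_0, n_1, n_2) = (3, 1, 1).
occupations = [3, 1, 1]
N = sum(occupations)

W = factorial(N)
for n in occupations:
    W //= factorial(n)
print("W from the formula:", W)   # 20

# Brute force: assign each of the N labelled particles to one of the 3 levels
# and count the assignments that realize exactly this macrostate.
count = sum(
    1
    for assignment in product(range(3), repeat=N)
    if [Counter(assignment)[level] for level in range(3)] == occupations
)
print("W by enumeration:", count)  # 20
```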
As we’ve seen before, entropy is defined as \(H = -\sum_x p(x) \ln p(x)\). However, we’re going to take advantage of the fact that all of our microstates are equiprobable to reduce this formula, and also change the notation to make it more thermodynamics friendly. Here, we’ll say that entropy is defined as \(S = k_B \ln W\), where \(k_B\) is Boltzmann’s constant. For large \(N\) we can use Stirling’s approximation (\(\ln N! \approx N \ln N - N\)) and apply it to \(W\):
\[ \ln W = \ln\left(\frac{N!}{\prod_i n_i!}\right) \approx N \ln N - N - \sum_i (n_i \ln n_i - n_i) \]
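A quick numerical sketch (the occupation numbers are made up) shows how good this approximation already is for moderately large \(N\), using `lgamma` to get exact log-factorials without overflow:

```python
import numpy as np
from math import lgamma

occupations = np.array([400, 300, 200, 100])   # hypothetical n_i
N = occupations.sum()

# Exact ln W = ln N! - sum ln n_i!, via lgamma(n + 1) = ln(n!)
ln_W_exact = lgamma(N + 1) - sum(lgamma(n + 1) for n in occupations)

# Stirling's approximation applied term by term, as in the equation above
ln_W_stirling = (N * np.log(N) - N
                 - np.sum(occupations * np.log(occupations) - occupations))

print(ln_W_exact, ln_W_stirling)   # agree to within about 1% here
```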
Now, we want to use our idea of entropy to find the most probable distribution, which means maximizing \(S\) subject to the constraints. This is equivalent to maximizing \(ln(W)\). Similarly to the earlier derivation we are going to use the method of Lagrange multipliers. We construct our Lagrangian:
\[ L = \ln W - \alpha\left(\sum_i n_i - N\right) - \beta\left(\sum_i n_i E_i - \mathbf{E}\right) \]
Then to maximize \(L\), we need \(\frac{\partial L}{\partial n_i} = 0\) for all \(i\). So:
\[ \frac{\partial L}{\partial n_i} = \frac{\partial \ln W}{\partial n_i} - \alpha - \beta E_i = 0 \]
From our earlier approximation (treating the terms involving only the fixed \(N\) as constant), we see \(\frac{\partial \ln W}{\partial n_i} = -\ln n_i\). Substituting this back in we get:
\[ -\ln n_i - \alpha - \beta E_i = 0 \]
or
\[ \ln n_i + \alpha + \beta E_i = 0 \]
To solve for \(n_i\), we exponentiate both sides:
\[ e^{\ln n_i + \alpha + \beta E_i} = e^0 \]
This simplifies to:
\[ n_i \, e^{\alpha} \, e^{\beta E_i} = 1 \]
\[ n_i = e^{-\alpha} \, e^{-\beta E_i} \]
With \(\beta = (k_B T)^{-1}\), this is an example of the famous Boltzmann factor derived in 1868. Let’s define \(A = e^{-\alpha}\) for simplicity. Then:
\[ n_i = A \, e^{-\beta E_i} \]
This gives us the number of particles in energy state \(E_i\) as a function of that energy state and our Lagrange multipliers. But how can we turn this into a probability distribution? Well, we can remember from our constraint that the total number of particles is \(N = \sum_i n_i\). Dividing \(n_i\) by \(N\), we see that the probability of a particle occupying a state with energy \(E_i\) is
\[ p(E_i) = \frac{e^{-\beta E_i}}{\sum_{j=1}^{M} e^{-\beta E_j}} \]
where \(M\) is the number of accessible states.
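As a final sketch (with hypothetical energy levels and an arbitrary \(\beta\)), we can evaluate \(p(E_i)\) and the corresponding expected occupation numbers \(n_i = N\,p(E_i)\):

```python
import numpy as np

beta = 1.5
N = 1_000_000
energy_levels = np.array([0.0, 1.0, 2.0, 3.0])   # made-up E_i

weights = np.exp(-beta * energy_levels)
p = weights / weights.sum()        # Boltzmann distribution p(E_i)
n = N * p                          # most probable occupation numbers

print(np.round(p, 4))
print(np.round(n).astype(int), "sum =", int(round(n.sum())))
```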