The distribution of a variable (i.e. how frequently each of its values occurs) is a very useful tool for understanding the behavior of a dataset. In my case, I was exploring this Reddit dataset and trying to see how the upvotes of different submissions are distributed.

Although gnuplot is a great tool for visualizing data, plotting the distribution of a variable from an external file is not very intuitive. To plot the distribution on linear axes we can use the **smooth frequency** option:

    # Configure the output
    set terminal png
    set output "dist_linear_axes.png"

    # Define a bin() function to aggregate close x-values.
    bin_width = 5
    bin(x) = bin_width * floor(x / bin_width)

    # Apply the bin() function and save result to "temp.dat"
    set table "temp.dat"
    plot "data.dat" using (bin($1)):(1.0)
    unset table

    # Plot the result using linear scale.
    set xlabel "Number of Upvotes"
    set ylabel "Count"
    set nokey
    plot "temp.dat" using 1:2 smooth frequency with points
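To see what the bin() function in this script does to individual values, it can be evaluated directly with gnuplot's print command. This is only a minimal sketch: the inputs 2, 4 and 7 are arbitrary values chosen for illustration, not taken from the dataset.

```gnuplot
# Same definitions as in the script above.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Arbitrary sample inputs (assumed for illustration only).
print bin(2)   # 2 falls into the bin that starts at 0
print bin(4)   # 4 falls into the same bin as 2
print bin(7)   # 7 falls into the next bin, which starts at 5
```

Each value is mapped to the lower edge of its bin, so every point in the same bin ends up with an identical x value that smooth frequency can then sum over.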

First I define a bin() function to group close x values. In this example I have used a bin width of 5. That means, for example, that values 2 and 4 will be grouped into the same bin. Intuitively, the next step would be to use the plot command as follows:

    plot "data.dat" using (bin($1)):(1.0) smooth frequency with points

However, this did not work as expected for me. The reason seems to be that gnuplot first groups the x values and only afterwards applies the bin() function. Instead, we want bin() to be applied before smooth frequency groups the x values. To achieve this, I applied the bin() function first and saved the results to a temporary file ("temp.dat") using the set table command. The final result looks like this:

Distribution in linear axes (it does not look very nice!)

As we can see, the result does not look very good. Most of the points are clustered at the bottom of the plot. The reason is that the x and y values span a large range, which does not allow us to see the shape of the distribution. When variables span a large range of values, it is often useful to use log-scale axes.

## Log scale axes

We can tell gnuplot to use log-scale axes with the **set logscale** command:

    set logscale xy

However, when plotting distributions we run into a small problem: gnuplot applies the log function to the x and y values before applying the **smooth frequency** option. Since in our case the y values are all 1 and log(1) = 0, the count for each bin will also be zero.
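If the logarithm is indeed applied to each y value before the summation, as described above, the effect on a single bin can be written out as follows. Here n is just a symbol for the number of points that fall into the bin:

```latex
\[
\sum_{i=1}^{n} \log(1) = n \cdot 0 = 0
\qquad \text{whereas we want} \qquad
\log\!\left(\sum_{i=1}^{n} 1\right) = \log(n).
\]
```

In other words, the sum has to be taken first and the log applied afterwards.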

We can solve this problem by using a second temporary file. First we save the result of the smooth frequency option, and then we tell gnuplot to use the log scale:

    # Configure the output
    set terminal png
    set output "dist_log_axes.png"

    # Define a bin() function to aggregate close x-values.
    bin_width = 5
    bin(x) = bin_width * floor(x / bin_width)

    # Apply the bin() function and save result to "temp.dat"
    set table "temp.dat"
    plot "data.dat" using (bin($1)):(1.0)
    unset table

    # Apply smooth frequency and save result to "temp2.dat"
    set table "temp2.dat"
    plot "temp.dat" using 1:2 smooth frequency with points
    unset table

    # Plot the result using log scale.
    set xlabel "Number of Upvotes"
    set ylabel "Count"
    set nokey
    set logscale xy
    plot "temp2.dat" using 1:2 with points

And here is the resulting plot:

Distribution using log axes.