The distribution of a variable (i.e., how frequently each of its values occurs) is a very useful tool for understanding the behavior of a dataset. In my case, I was exploring this Reddit dataset and trying to see how the upvotes of different submissions are distributed.
Although gnuplot is a great tool for visualizing data, plotting the distribution of a variable read from an external file is not very intuitive. To plot the distribution on linear axes we can use the smooth frequency option:
# Configure the output
set terminal png
set output "dist_linear_axes.png"
# Define a bin() function to aggregate close x-values.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)
# Apply the bin() function and save result to "temp.dat"
set table "temp.dat"
plot "data.dat" using (bin($1)):(1.0)
unset table
# Plot the result using linear scale.
set xlabel "Number of Upvotes"
set ylabel "Count"
unset key
plot "temp.dat" using 1:2 smooth frequency with points
First I define a bin() function to group nearby x values. In this example I have used a bin width of 5, which means, for example, that the values 2 and 4 fall into the same bin. Intuitively, the next step would be to use the plot command as follows:
plot "data.dat" using (bin($1)):(1.0) smooth frequency with points
However, this will not work as expected. The reason is that gnuplot will first group the x values and only afterwards apply the bin() function. Instead, we want to apply the bin() function before smooth frequency groups the x values. To achieve this, I apply the bin() function and save the results to a temporary file ("temp.dat") using the set table command. The final result looks like this:
Distribution in linear axes (it does not look very nice!)
As we can see, the result does not look very good. Most of the points are clustered at the bottom of the plot. The reason is that the x and y values span a large range, which does not allow us to see the shape of the distribution. Often, when variables span a large range of values, it is useful to use log-scale axes.
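To check how the bin() function from the script above groups values, it can help to print a few results at the gnuplot prompt. This is only a small sketch: the sample values are chosen for illustration, and they are written as floating-point numbers because that is how gnuplot reads data columns.

```gnuplot
# Same definitions as in the script above.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Print the bin assigned to a few illustrative values.
# 2 and 4 land in bin 0, 5 starts the next bin, 12 lands in bin 10.
print bin(2.0), bin(4.0), bin(5.0), bin(12.0)
```

Each bin is labeled by its lower edge, so with a bin width of 5 the bins start at 0, 5, 10, and so on.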
Log scale axes
We can tell gnuplot to use log-scale axes with the set logscale command:
set logscale xy
However, in this case (plotting distributions) we have a small problem: gnuplot applies the log function to the x and y values before applying the smooth frequency option. Since in our case the y values are all 1 and log(1) = 0, the count for each bin will also be zero.
We can solve this problem by using a second temporary file. First we save the results of the smooth frequency option, and then we tell gnuplot to use the log scale:
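The arithmetic behind this can be checked directly at the gnuplot prompt. The sketch below assumes a hypothetical bin that contains three points. It only illustrates the order-of-operations issue and says nothing about gnuplot's internals.

```gnuplot
# Three points fall in one bin, each contributing a y value of 1.

# Taking the logarithm first and summing afterwards gives zero:
print log10(1.0) + log10(1.0) + log10(1.0)

# Summing first and taking the logarithm afterwards gives log10(3),
# which is the value we want to see on a log-scale y axis:
print log10(1.0 + 1.0 + 1.0)
```

The first expression is zero no matter how many points are in the bin, while the second grows with the count.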
# Configure the output
set terminal png
set output "dist_log_axes.png"
# Define a bin() function to aggregate close x-values.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)
# Apply the bin() function and save result to "temp.dat"
set table "temp.dat"
plot "data.dat" using (bin($1)):(1.0)
unset table
# Apply smooth frequency and save result to "temp2.dat"
set table "temp2.dat"
plot "temp.dat" using 1:2 smooth frequency with points
unset table
# Plot the result using log scale.
set xlabel "Number of Upvotes"
set ylabel "Count"
unset key
set logscale xy
plot "temp2.dat" using 1:2 with points
And here is the resulting plot:
Distribution using log axes.
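If leaving temporary files on disk is undesirable, the same two-step approach can be sketched with in-memory datablocks. This assumes a gnuplot 5 release in which set table accepts a datablock name; the names $binned and $counts are my own choice, and I have not run this variant on the dataset used above.

```gnuplot
# Sketch only: assumes a gnuplot version where "set table" accepts a datablock.
set terminal png
set output "dist_log_axes.png"

# Define a bin() function to aggregate close x-values.
bin_width = 5
bin(x) = bin_width * floor(x / bin_width)

# Apply the bin() function and keep the result in the datablock $binned.
set table $binned
plot "data.dat" using (bin($1)):(1.0)
unset table

# Apply smooth frequency and keep the counts in the datablock $counts.
set table $counts
plot $binned using 1:2 smooth frequency with points
unset table

# Plot the counts using log scale.
set xlabel "Number of Upvotes"
set ylabel "Count"
unset key
set logscale xy
plot $counts using 1:2 with points
```

The steps are the same as in the file-based script. Only the places where the intermediate results are stored change.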