One of the most powerful ways we convey the results of data science is visualization, from simple Excel graphs to advanced displays like network diagrams and bespoke visuals. What most people outside the data science community don’t realize is just how much artistry is involved in creating some of those visualizations, from the impact of color schemes on perception in geographic mapping to the layout algorithms and data filtering used in network visualizations. Given the rising use of networks to understand everything from social media to semantic graphs, just how much of an impact do our layout algorithms and filtering decisions have on the final images we see?
Network visualizations are at once beautiful and informative, helping us make sense of macro- through micro-level patterns in the vast connected ecosystems that define the world around us. Yet, like any form of data visualization, network visualization does not capture the sum total reality of our data so much as it constructs one possible reality.
When we think of scientific visualization, we assume the images we see present the single “truth” of a dataset, without realizing that any given dataset can tell many different stories depending on the questions we ask of it and the filters we apply to answer those questions.
The myriad possible filters we apply to a graph to reduce its dimensionality; the layout algorithm that places its nodes in space; clustering algorithms like modularity that group nodes by “similarity,” and the definition of “similarity” we hand to them; node sizing algorithms like PageRank; the color scheme; and the random seeds used by many of these algorithms, which ensure each run yields a different image: all of these conspire to ensure that a single dataset can yield a nearly infinite number of possible visualizations.
How does this process play out in a real world visualization task?
In April 2016 my open data GDELT Project began recording the list of hyperlinks found in the body of each worldwide online news article it monitors. Not all news articles contain links, but many link to external websites such as the homepages of organizations being mentioned in the article or other news outlets from which specific story elements were sourced. These external sources of information provide powerful insights into which websites each news outlet considers worthy of mentioning, in much the same way that the references in an academic paper offer insights into the works believed most relevant and reputable by each field.
As of last month, GDELT’s link database had recorded more than 1.78 billion outlinks from more than 304 million articles. After collapsing each URL to its root domain and connecting each news outlet with the unique list of external domains it has linked to over the last three years (along with the number of days it published at least one article linking to each domain), the final dataset comprises just over 30 million distinct pairings of news outlets and external websites, including links to other news outlets.
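The collapsing-and-counting step can be sketched in a few lines of Python. This is an illustrative sketch, not GDELT’s actual pipeline: the `root_domain` helper and the sample records are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse

def root_domain(url):
    """Collapse a full URL to its host. A production pipeline would use a
    public-suffix list so that domains like bbc.co.uk are handled correctly."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Hypothetical (outlet, outlink URL, publication day) records standing in
# for GDELT's 1.78 billion recorded outlinks.
records = [
    ("cnn.com", "https://www.nytimes.com/2019/01/01/world.html", "2019-01-01"),
    ("cnn.com", "https://www.nytimes.com/2019/01/05/us.html", "2019-01-05"),
    ("cnn.com", "https://www.un.org/report.pdf", "2019-01-01"),
]

# For each (outlet, external domain) pair, count the distinct days on which
# the outlet published at least one article linking to that domain.
days_linked = defaultdict(set)
for outlet, link, day in records:
    days_linked[(outlet, root_domain(link))].add(day)

edge_weights = {pair: len(d) for pair, d in days_linked.items()}
```

Counting distinct days, rather than raw link counts, dampens the effect of a single day on which one outlet linked to another hundreds of times.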
This link dataset is a classic network graph that can be readily visualized using off-the-shelf visualization packages like the open-source Gephi.
However, its size and density mean that the graph must be filtered to the subset of greatest methodological interest, while the edge count must be lowered to reduce the graph to its most “significant” edges.
Instead of asking which sites a given news outlet links to, a far more interesting question, in light of the current interest in combating “fake news,” is to restrict the analysis to only links between news outlets and to compile a list of the top news outlets that link to a given other news outlet. In other words, for a news outlet like CNN, what are the top other news outlets around the world that link most heavily to CNN as a source in their own reporting? Much like academic citation networks convey authority, the linking behavior of news outlets can similarly convey a proxy of “news authoritativeness.”
Thus, the 30-million-edge graph was methodologically inverted and only edges connecting news outlets were retained. A link from a CNN article to a UN report would be discarded, but a link from a CNN article to a New York Times article would be preserved.
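That filtering and inversion can be sketched as follows. The `edges` and `news_outlets` values here are tiny hypothetical stand-ins for the real data, not the actual GDELT tables:

```python
from collections import defaultdict

# Hypothetical weighted edges: (source outlet, linked domain) -> days linked.
edges = {
    ("cnn.com", "nytimes.com"): 210,
    ("cnn.com", "un.org"): 15,        # link to a non-news site: discarded
    ("bbc.co.uk", "cnn.com"): 87,
}

# The set of domains treated as news outlets (hypothetical).
news_outlets = {"cnn.com", "nytimes.com", "bbc.co.uk"}

# Retain only outlet-to-outlet edges, then invert the perspective:
# for each target outlet, which outlets link *to* it, and on how many days?
inbound = defaultdict(dict)
for (src, dst), days in edges.items():
    if dst in news_outlets:
        inbound[dst][src] = days
```

The inversion matters because the question being asked is about inbound authority (who cites CNN?) rather than outbound behavior (whom does CNN cite?).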
As an initial visualization, the top 30 news outlets linking to each news outlet on at least 30 days were extracted, and a random subset of 10,000 edges was used to form a new graph. The nodes were positioned using the OpenOrd layout algorithm and colored by Blondel et al.’s modularity, with coloration selected by Gephi’s built-in palette generator.
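A minimal sketch of that extraction step, using hypothetical sample data rather than the real 30-million-edge table (the actual filtering was performed on the full GDELT dataset, not with this code):

```python
import heapq
import random

# Hypothetical inbound-link table: target outlet -> {source outlet: days linked}.
inbound = {
    "cnn.com": {"bbc.co.uk": 87, "smallblog.example": 4, "nytimes.com": 150},
    "nytimes.com": {"cnn.com": 210, "bbc.co.uk": 95},
}

def top_linkers(inbound, k=30, min_days=30):
    """For each target outlet, keep its top-k inbound linkers that linked
    to it on at least min_days distinct days."""
    kept = []
    for target, sources in inbound.items():
        qualified = [(s, d) for s, d in sources.items() if d >= min_days]
        for s, d in heapq.nlargest(k, qualified, key=lambda t: t[1]):
            kept.append((s, target, d))
    return kept

edges = top_linkers(inbound)

# Take a random subset of up to 10,000 of the surviving edges.
# Note that the sample itself is a random choice: a different seed
# would put a different subset of edges into the final graph.
random.seed(42)
sample = random.sample(edges, min(10_000, len(edges)))
```

Both thresholds (top 30 linkers, 30-day minimum) and the 10,000-edge sample are themselves editorial decisions that shape which structure the final image can show.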
The final image is seen below.
OpenOrd, like many layout algorithms, uses a random seed, meaning it will yield a slightly different result each time it is run. This is an important distinction that is lost on many unfamiliar with network visualizations: there is no single “truth” to the visualized structure of a graph. Every rendering of a graph will present it in a slightly different way. Researchers frequently run a layout algorithm multiple times until they find a presentation that either looks the “best” or does the best job of visually separating the clusters of greatest importance to their analysis.
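The role of the seed can be demonstrated with plain Python. The sketch below uses a random initial node placement, the starting state from which force-directed algorithms like OpenOrd iterate, rather than OpenOrd itself:

```python
import random

def random_initial_layout(nodes, seed):
    """Assign each node a random (x, y) starting position. Force-directed
    layouts refine positions like these, so the seed that generates them
    helps determine the final picture."""
    rng = random.Random(seed)
    return {n: (rng.uniform(-1, 1), rng.uniform(-1, 1)) for n in nodes}

nodes = ["cnn.com", "nytimes.com", "bbc.co.uk"]
run_a = random_initial_layout(nodes, seed=1)
run_b = random_initial_layout(nodes, seed=2)
run_c = random_initial_layout(nodes, seed=1)
# Same seed reproduces the same layout; a different seed gives a different one.
```

Fixing the seed is how a researcher makes a particular rendering reproducible; leaving it unfixed is why every run of the same graph looks slightly different.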
What happens if we change the background color from white to black, while otherwise leaving the graph exactly as-is?