Why Data Visualization Is Equal Parts Data Art And Data Science

Visualization is one of the most powerful ways we convey the results of data science, from simple Excel graphs to advanced displays like network diagrams and bespoke visuals. What most outside the data science community don’t realize is just how much artistry is involved in creating some of those visualizations, from the impact of color schemes on perception in geographic mapping to the layout algorithms and data filtering used in network visualizations. Given the rising use of networks to understand everything from social media to semantic graphs, just how much of an impact do our layout algorithms and filtering decisions have on the final images we see?

Network visualizations are at once beautiful and informative, helping us make sense of the macro through micro patterns in the vast connected ecosystems that define the world around us. Yet, like any form of data visualization, network visualization does not capture the sum total reality of our data so much as it constructs one possible reality.

When we think of scientific visualization, we think that the images we see present to us the one single “truth” of a dataset, without realizing that any given dataset can tell many different stories depending on the questions we ask of it and the filters we apply to answer those questions.

The myriad possible filters we apply to a graph to reduce its dimensionality, the layout algorithm that places the nodes in space, the clustering algorithms like modularity that group nodes by “similarity,” the definition of “similarity” that we hand to those clustering algorithms, the node sizing algorithms like PageRank, the color scheme and the random seeds used by many algorithms that ensure each run yields a very different image: all of these conspire to ensure that a single dataset can yield a nearly infinite number of possible visualizations.

How does this process play out in a real world visualization task?

In April 2016 my open data GDELT Project began recording the list of hyperlinks found in the body of each worldwide online news article it monitors. Not all news articles contain links, but many link to external websites such as the homepages of organizations being mentioned in the article or other news outlets from which specific story elements were sourced. These external sources of information provide powerful insights into which websites each news outlet considers worthy of mentioning, in much the same way that the references in an academic paper offer insights into the works believed most relevant and reputable by each field.

As of last month, GDELT’s link database had recorded more than 1.78 billion outlinks from more than 304 million articles. Collapsing each URL to its root domain and connecting each news outlet with the unique list of external domains it has linked to over the last three years and the number of days it published at least one article linking to that domain, the final dataset is composed of just over 30 million distinct pairings of news outlets and external websites, including links to other news outlets.
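The collapsing step described above can be sketched roughly as follows. This is a minimal illustration, not GDELT's actual pipeline: the record layout `(outlet, date, url)`, the function names and the sample rows are all hypothetical, and the root-domain rule simply keeps the last two hostname labels, which mishandles suffixes like `.co.uk` (a production system would likely consult the Public Suffix List).

```python
from collections import defaultdict
from urllib.parse import urlparse


def root_domain(url):
    """Naive root-domain collapse: keep the last two labels of the hostname.

    Assumption: multi-part suffixes such as .co.uk are not handled here.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def build_edges(records):
    """records: iterable of (outlet_domain, date_string, outlink_url).

    Returns {(outlet, linked_domain): number_of_distinct_days}.
    """
    days = defaultdict(set)
    for outlet, day, url in records:
        target = root_domain(url)
        # Assumption: links back to the outlet's own domain are not "external".
        if target and target != outlet:
            days[(outlet, target)].add(day)
    return {pair: len(seen) for pair, seen in days.items()}


# Hypothetical sample rows, for illustration only.
records = [
    ("cnn.com", "2018-01-01", "https://www.nytimes.com/a"),
    ("cnn.com", "2018-01-01", "https://nytimes.com/b"),
    ("cnn.com", "2018-01-02", "https://www.nytimes.com/c"),
    ("cnn.com", "2018-01-02", "https://www.un.org/report"),
]
edges = build_edges(records)
```

Counting distinct days rather than raw links means a single article stuffed with links to one site does not inflate the strength of that pairing.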

This link dataset is a classic network graph that can be readily visualized using off the shelf visualization packages like the open source Gephi.

However, its size and density mean that the graph must first be filtered to the subset of nodes of greatest methodological interest and then thinned to only its most “significant” edges.

Instead of focusing on which sites a given news outlet links to, a far more interesting question in light of the current interest in combating “fake news” is to restrict the analysis to only links between news outlets and to compile a list of the top news outlets that link to a given other news outlet. In other words, for a news outlet like CNN, what are the top other news outlets around the world that link most heavily to CNN as a source in their own reporting? Much like academic citation networks convey authority, the linking behavior of news outlets can similarly convey a proxy of “news authoritativeness.”

Thus, the 30 million edge graph was methodologically inverted and only edges connecting news outlets were retained. A link from a CNN article to a UN report would be discarded, but a link from a CNN article to a New York Times article would be preserved.
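A rough sketch of that inversion and filtering step is below. It assumes the edges are held as a dictionary keyed by `(source, target)` and that a set of known news outlet domains is available; both the data structure and the sample values are hypothetical.

```python
def inlink_graph(edges, news_outlets):
    """Keep only outlet-to-outlet edges and index them by the linked-to outlet.

    edges: {(source, target): days_linked}
    Returns {target: {source: days_linked}}.
    """
    inlinks = {}
    for (source, target), days in edges.items():
        if source in news_outlets and target in news_outlets:
            inlinks.setdefault(target, {})[source] = days
    return inlinks


# Hypothetical values, for illustration only.
news_outlets = {"cnn.com", "nytimes.com", "bbc.co.uk"}
edges = {
    ("cnn.com", "un.org"): 12,       # discarded: un.org is not a news outlet
    ("cnn.com", "nytimes.com"): 40,  # kept
    ("bbc.co.uk", "cnn.com"): 55,    # kept
}
inlinks = inlink_graph(edges, news_outlets)
```

Indexing by target rather than source is what turns "who does CNN link to" into "who links to CNN".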

As an initial visualization, the top 30 news outlets linking to each news outlet on at least 30 days were extracted and a random subset of 10,000 edges was used to form a new graph. The nodes were positioned using the OpenOrd layout algorithm and colored by Blondel et al.’s modularity, with coloration selected by Gephi’s built-in palette generator.

The final image is seen below.

GDELT GKG 2016-2018 Outlink Graph (Top 30 / Random 10,000) White Background
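The selection behind this first graph might look something like the sketch below. It assumes the inverted graph is stored as `{target: {source: days}}`; the function names, the alphabetical tie-breaking rule and the seed parameter are illustrative choices, not details given in the original analysis.

```python
import random


def top_inlinkers(inlinks, top_n=30, min_days=30):
    """For each outlet, keep its top_n inlinking outlets with >= min_days of links."""
    edges = []
    for target, sources in inlinks.items():
        strong = [(s, d) for s, d in sources.items() if d >= min_days]
        # Strongest first; ties broken alphabetically (an arbitrary choice).
        strong.sort(key=lambda sd: (-sd[1], sd[0]))
        edges.extend((s, target, d) for s, d in strong[:top_n])
    return edges


def sample_edges(edges, k=10_000, seed=None):
    """Random subset of k edges; a fixed seed makes the subset repeatable."""
    if len(edges) <= k:
        return list(edges)
    return random.Random(seed).sample(edges, k)


# Hypothetical values, for illustration only.
inlinks = {"cnn.com": {"a.com": 50, "b.com": 10, "c.com": 31, "d.com": 90}}
kept = top_inlinkers(inlinks, top_n=2, min_days=30)
```

Note that without a fixed seed the random subset itself changes on every run, so two renderings can differ before the layout algorithm is even invoked.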


OpenOrd, like many layout algorithms, utilizes random seeds, meaning it will yield a slightly different result each time it is run. This is an important distinction that is lost on many unfamiliar with network visualizations: there is no single “truth” to the visualized structure of a graph. Every rendering of a graph will present it in a slightly different way. Researchers frequently run a layout algorithm multiple times until they find a presentation that either looks the “best” or does the best job of visually separating the clusters of greatest importance to their analysis.
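The effect of the seed can be illustrated with a toy force-directed layout. To be clear, this is not OpenOrd: it is a small Fruchterman-Reingold-style sketch written only to show how the random starting positions carry through to the final coordinates. The function name, constants and example graph are all assumptions.

```python
import math
import random


def force_layout(nodes, edges, seed, iterations=200):
    """Tiny force-directed layout: all nodes repel, linked nodes attract."""
    rng = random.Random(seed)
    # The seed only controls these starting positions.
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # ideal distance between nodes
    for step in range(iterations):
        temp = 0.1 * (1.0 - step / iterations)  # cooling caps each move
        disp = {n: [0.0, 0.0] for n in nodes}
        for i, a in enumerate(nodes):  # repulsion between every pair
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[a][0] += dx / dist * push
                disp[a][1] += dy / dist * push
                disp[b][0] -= dx / dist * push
                disp[b][1] -= dy / dist * push
        for a, b in edges:  # attraction along edges
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            disp[a][0] -= dx / dist * pull
            disp[a][1] -= dy / dist * pull
            disp[b][0] += dx / dist * pull
            disp[b][1] += dy / dist * pull
        for n in nodes:  # move each node, limited by the temperature
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            move = min(length, temp)
            pos[n][0] += disp[n][0] / length * move
            pos[n][1] += disp[n][1] / length * move
    return {n: (p[0], p[1]) for n, p in pos.items()}


# Two triangles joined by a single bridge edge (hypothetical example graph).
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
run_one = force_layout(nodes, edges, seed=1)
run_two = force_layout(nodes, edges, seed=2)
```

The two runs place the same six nodes at different coordinates even though the graph and every parameter other than the seed are identical, while repeating a run with the same seed reproduces the same picture.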

What happens if we change the background color from white to black, while otherwise leaving the graph exactly as-is?

GDELT GKG 2016-2018 Outlink Graph (Top 30 / Random 10,000) Black Background

GDELT GKG 2016-2018 Outlink Graph (Top 30 Reciprocal / Top 30,000)


This graph shows a far more centralized structure, with a center core of tightly connected outlets around which the rest of the media ecosystem revolves.

This paints a very different picture of our global media structure, shifting from the earlier diffuse, dense collective to a galaxy-like mass of small clusters orbiting a central core of outlets of international stature. Much of this comes from our use of the strongest edges rather than random edges, reminding us of the critical impact our sampling decisions have on the final structure we see.
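The difference between the two sampling strategies can be sketched as follows, assuming edges are held as `(source, target, days)` tuples. The function names and sample edges are hypothetical.

```python
import heapq
import random


def strongest_edges(edges, k):
    """Keep the k edges with the highest strength (third tuple element)."""
    return heapq.nlargest(k, edges, key=lambda e: e[2])


def random_edges(edges, k, seed=None):
    """Keep k edges chosen uniformly at random, ignoring strength."""
    return random.Random(seed).sample(edges, min(k, len(edges)))


# Hypothetical edges: two heavy links into one hub plus several weak links.
edges = [
    ("a.com", "hub.com", 900),
    ("b.com", "hub.com", 850),
    ("c.com", "d.com", 31),
    ("e.com", "f.com", 30),
    ("g.com", "h.com", 32),
]
core = strongest_edges(edges, 2)
mixed = random_edges(edges, 2, seed=7)
```

Selecting by strength will always favor the heavily cited hubs, whereas a uniform random draw gives every weak peripheral link the same chance of appearing as the strongest one.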

Adjusting the thickness of each edge based on edge strength makes the central core less prominent and instead emphasizes the isolated nature of the myriad smaller clusters around the periphery.

How much of an impact does the layout algorithm have on our understanding of the structure of a graph?

Here we reduce the graph to the top five inlink outlets by news outlet and display the top 50,000 strongest connections using the same OpenOrd algorithm used in all of the above graphs.

GDELT GKG 2016-2018 Outlink Graph (Top 5 / Top 50,000) Using OpenOrd


The result is a very diffuse structure like the earlier renderings, with multiple cores, a complex interconnected region on the left and numerous other clusters.

In contrast, the image below shows the results of running the exact same graph through the Force Atlas 2 algorithm instead. This image looks like an entirely different graph, with the entire network extending outwards from a central core.

GDELT GKG 2016-2018 Outlink Graph (Top 5 / Top 50,000) Using Force Atlas 2


Here’s another comparison, this time limiting to just those news outlets indexed in Google News circa mid-2017 and similarly limiting to the top 5 inlink domains by outlet and displaying the top 50,000 strongest connections.

The OpenOrd layout shows a diffuse structure.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 50,000) Using OpenOrd


The Force Atlas 2 layout, on the other hand, once again centralizes the graph structure.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 50,000) Using Force Atlas 2


Sometimes the centralized perspective of Force Atlas 2 can be helpful in drawing attention to the centralized clustering of a graph.

Here is the same graph as above, reduced from the top 50,000 strongest connections down to the top 10,000.

The OpenOrd layout predictably shows a fairly diffuse structure, though it helpfully captures the graph’s dual center.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 10,000) Using OpenOrd


The Force Atlas 2 version collapses this center but makes it more apparent that the entire graph revolves around a complex center.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 10,000) Using Force Atlas 2


Graphs are most commonly displayed as edge visualizations in which the connections between nodes are the focal point of the image. This tends to be the norm in domains where it is the connectivity structure that is of greatest interest. In other domains it is the nodes themselves that are of primary importance, with the graph structure used only to position them in space according to their relatedness.

The image below shows a traditional OpenOrd layout of the graph of news outlets based in the United States, using their top 10 reciprocal edges and limiting to the top 30,000 strongest edges.

GDELT GKG 2016-2018 Outlink Graph (US News Outlets / Top 10 Reciprocal / Top 30,000) Using Edge Weights And OpenOrd Edge Focus
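The article does not spell out how "reciprocal" edges were defined. One plausible reading, sketched below purely as an assumption, is that an edge is kept only when both outlets link to each other, with the weaker direction taken as the strength of the pair. The function name and sample values are hypothetical.

```python
def reciprocal_edges(edges):
    """Keep only pairs that link to each other in both directions.

    edges: {(source, target): days_linked}
    Returns {(a, b): strength} with a < b, where strength is the weaker
    of the two directions (an assumed definition, not the article's).
    """
    mutual = {}
    for (a, b), days in edges.items():
        if a < b and (b, a) in edges:
            mutual[(a, b)] = min(days, edges[(b, a)])
    return mutual


# Hypothetical values, for illustration only.
edges = {
    ("a.com", "b.com"): 40,
    ("b.com", "a.com"): 35,
    ("a.com", "c.com"): 90,  # one-way link, so it is dropped
}
mutual = reciprocal_edges(edges)
```

Under this reading, a heavily cited outlet that never links back to its citers would lose those edges, which is yet another filtering decision that shapes the final image.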


The image below shows the exact same graph, but with the edges hidden to show only the nodes.

This actually makes many of the peripheral clusters clearer and draws the macro structure of the graph into starker focus. For high density graphs like this, node visualizations can often make it easier to understand graph structure without the burden of tens of thousands of distracting spaghetti lines crisscrossing the image.

Putting this all together, we see just how much of an impact our algorithmic and methodological decisions have on the final visual representation we receive of a given dataset. Every visualization on this page displays the very same dataset but offers different perspectives by filtering it in different ways and using different algorithmic and visual selections.

In the end, perhaps the biggest takeaway is the reminder that the incredible imagery that emerges from our vast datasets is equal parts data art and data science, constructing rather than reflecting reality.

Original post: https://www.forbes.com/sites/kalevleetaru/2019/02/24/why-data-visualization-is-equal-parts-data-art-and-data-science/
