Ethics500


What is (bio/medical) ethics about? Where does it stand? And where is it going?

Ethics500 is an algorithmic answer to these questions.

It is a semantic network map, displaying the 500 most influential keywords used in academic articles related to ethics indexed in PubMed in a given timeframe. The connections are computed as co-occurrences (how often two keywords are used together in the same paper). The clusters are computed using modularity class, i.e. a measure of the strength of the division of a network into clusters.

In Ethics500 you therefore have an interactive (i.e. searchable, scrollable, zoomable, and ‘clusterable’) representation of keywords, connections, and clusters in (bio/medical) ethics.

Nerdy methodological details

Data source

PubMed, queried as follows through TopicTracker:

"YYYY/MM/DD"[Date - Publication] : "YYYY/MM/DD"[Date - Publication] AND *ethic*[TiAb]

Analysis

The process described here has been part of TopicTracker's code since version 1.4.

Keywords are extracted from the dataframe as a list of lists (i.e. each item of the outer list is itself a list, containing the keywords from one paper) and cast to lowercase.
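A minimal sketch of this step, assuming the papers sit in a pandas dataframe with a `keywords` column (the dataframe layout is my assumption, not necessarily TopicTracker’s):

```python
import pandas as pd

# Hypothetical toy dataframe; in TopicTracker this comes from the PubMed query
df = pd.DataFrame({
    "keywords": [["Ethics", "Informed Consent"], ["ETHICS", "Autonomy", "consent"]],
})

# One inner list per paper, everything lowercased
keywords = [[kw.lower() for kw in paper] for paper in df["keywords"]]
print(keywords)  # [['ethics', 'informed consent'], ['ethics', 'autonomy', 'consent']]
```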

The list of lists is flattened in order to calculate the absolute frequency of each keyword. Only the top 500 are kept for the analysis (although this value is arbitrary and can be changed by the user when running the analysis).
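Continuing the sketch from above, the flattening and ranking boils down to a Counter (TOP_N = 500 being the arbitrary default just mentioned):

```python
from collections import Counter
from itertools import chain

TOP_N = 500  # arbitrary default; adjustable when running the analysis

counts = Counter(chain.from_iterable(keywords))  # flatten the list of lists
top_keywords = {kw for kw, _ in counts.most_common(TOP_N)}
```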

The keywords that are not going to be included in the analysis are removed in place from the list of lists, which is then used to calculate a co-occurrence matrix.
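A sketch of the filtering and counting, still working on the `keywords` list from above. Storing the matrix as a Counter of unordered pairs, and counting each pair at most once per paper, are my assumptions:

```python
from collections import Counter
from itertools import combinations

# Drop non-top keywords in place: slice assignment mutates each
# inner list rather than rebinding it
for paper in keywords:
    paper[:] = [kw for kw in paper if kw in top_keywords]

# Co-occurrence counts: one unordered keyword pair per paper,
# deduplicated via set() so repeats within a paper count once
cooccurrence = Counter()
for paper in keywords:
    for a, b in combinations(sorted(set(paper)), 2):
        cooccurrence[(a, b)] += 1
```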

Gephi requires 2 tables to produce a visualization: nodes and edges.

The nodes table contains the keywords themselves (as both ID and label), the absolute frequency, and the normalized frequency. The normalized frequency is calculated as (instances of keyword) / (number of papers in the corpus).

The edges table contains the connections between keywords. It lists the source node, the destination node, the absolute frequency of that co-occurrence, and its normalized frequency. The normalized frequency is calculated as (instances of the co-occurrence) / (number of papers in the corpus with 2 or more keywords).
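Putting the two tables together, here is a sketch that writes Gephi-ready CSVs. The Id/Label and Source/Target headers are what Gephi's spreadsheet import recognizes; the names of the frequency columns are my own, and counting the ≥2-keyword papers after filtering is an assumption:

```python
import csv

n_papers = len(keywords)
n_multi = sum(1 for paper in keywords if len(paper) >= 2)

# Nodes table: keyword as both Id and Label, plus the two frequencies
with open("nodes.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["Id", "Label", "frequency", "normalized_frequency"])
    for kw in top_keywords:
        w.writerow([kw, kw, counts[kw], counts[kw] / n_papers])

# Edges table: one row per co-occurring keyword pair
with open("edges.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["Source", "Target", "frequency", "normalized_frequency"])
    for (a, b), n in cooccurrence.items():
        w.writerow([a, b, n, n / n_multi])
```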

Visualization

The visualization is prepared in Gephi, “the leading visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free”.

In a new workspace I import first the nodes, then the edges (as undirected connections).

I calculate the modularity (using a resolution of 0.5; arbitrary, but it yields meaningful clustering IMHO) and use the modularity class to determine the colour partitioning of the nodes (with a colour-blind-friendly palette); a Python sketch of the same clustering follows the references below. For more info about modularity and resolution, see:

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008

R. Lambiotte, J.-C. Delvenne, M. Barahona, Laplacian Dynamics and Multiscale Modular Structure in Networks, 2009
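Gephi computes this from its Statistics panel, but if you want to sanity-check the clustering outside Gephi, networkx ships an implementation of the same Louvain method. A sketch, reusing the co-occurrence counts from above (note that networkx’s resolution convention may not match Gephi’s exactly, so the clusters can differ):

```python
import networkx as nx

# Rebuild the graph from the co-occurrence counts computed earlier
G = nx.Graph()
for (a, b), n in cooccurrence.items():
    G.add_edge(a, b, weight=n)

# Louvain communities at resolution 0.5, mirroring the Gephi setting
communities = nx.community.louvain_communities(
    G, weight="weight", resolution=0.5, seed=42
)
print(f"{len(communities)} clusters found")
```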

The modularity class is also what determines the layout (Circle Pack layout, using modularity class as the first and only hierarchy). For the node size I use the node's weight, i.e. its normalized frequency, rendered on a size scale of 10–50.

After some minor positioning tweaks (expansion, label adjust, pruning edges below a given threshold), the semantic network map is ready and can be exported both as a static PDF and as an interactive Sigma.js template.

Bragging

Once it's all done, you automatically acquire bragging rights: you can bother your colleagues by spamming them with the link, or ask your boss to print the PDF as a big fat poster to hang in some meeting room.