Abstract

Given the sheer amount of news sources in the digital age (e.g., newspapers, blogs, social media) it has become difficult to determine where news is first introduced and how it diffuses across sources. RNewsflow provides tools for analyzing content homogeneity and diffusion patterns using computational text analysis. The content of news messages is compared using techniques from the field of information retrieval, similar to plagiarism detection. By using a sliding window approach to only compare messages within a given time distance, many sources can be compared over long periods of time. Furthermore, the package introduces an approach for analyzing the news similarity data as a network, and includes various functions to analyze and visualize this network.

Introduction

The news diffusion process in the digital age involves many interdependent sources, ranging from news agencies and traditional newspapers to blogs and people on social media [@meraz11; @paterson05; @pew10]. We offer the RNewsflow R package as a toolkit to analyze the homogeneity and diffusion of news content using computational text analysis. This analysis consists of two steps. First, techniques from the field of information retrieval are used to measure the similarity of news messages (e.g., addressing the same event, containing identical phrases). Second, the temporal order in which messages are published is used to analyze consistent patterns in who follows whom.

The main contribution of this package lies in the specialized application of document similarity measures for the purpose of comparing news messages. News is a special type of information in the sense that it has a time dimension—it quickly loses its relevance. Therefore, we are often only interested in the similarity of news documents that occured within a short time distance. By restricting document comparisons to a given time distance, it becomes possible to compare many documents over a long period of time, but available document comparison software generally does not have this feature. We therefore offer the newsflow_compare function which compares documents using a sliding window over time. In addition, this package offers tools to aggregate, analyze and visualize the document similarity data. By covering all steps of the text and data analysis in a single R package, it becomes easier to develop and share methods, and analyses become more transparent and replicable.

The primary intended audience of this package is scholars and professionals in fields where the impact of news on society is a prime factor, such as journalism, political communication and public relations [@baum08;@boczkowski07;@ragas14]. To what extent the content of certain sources is homogeneous or diverse has implications for central theories of media effects, such as agenda-setting and the spiral of silence [@bennett08;@blumler99]. Identifying patterns in how news travels from the initial source to the eventual audience is important for understanding who the most influential “gatekeepers” are [@shoemaker09]. Furthermore, the document similarity data enables one to study news values [@galtung65] by analyzing what elements of news predict their diffusion rate and patterns.

The package is designed to appeal to scholars and professionals without prior experience in computational text analysis. This vignette covers the entire chain from processing raw data—written text with source and date information—to analyzing and visualizing the output. It points to relevant software within and outside of R for pre-processing written texts, and demonstrates how to use the core functions of this package. For more advanced users there are additional functions and parameters to support versatility. RNewsflow is completely open-source to promote active involvement in developing and evaluating the methodology. The source code is available on Github—https://github.com/kasperwelbers/RNewsflow. All data used in this vignette is included in the package for easy replication.

The structure of this vignette is as follows. The first part discusses the data preparation. There are several choices to be made here that determine on what grounds the content of documents is compared. The second part shows how the core function of this packages, newsflow_compare, is used to calculate document similarities for many documents over time. The third part demonstrates functions for exploring, aggregating and visualizing the document similarity data. Finally, conclusions regarding the current version of the package and future directions are discussed.

Preparing the data

To analyze content homogeneity and news diffusion using computational text analysis, we need to know who said what at what time. Thus, our data consists of messages in text form, including meta information for the source and publication date of each message. The texts furthermore need to be pre-processed and represented as a document-term matrix (DTM). This section first discusses techniques and refers to existing software to pre-process texts and create the DTM, and reflects on how certain choices influence the analysis. Second, it shows how the source and date information should be organized. Third, it discusses several ways to filter and weight the DTM based on word statistics.

Pre-processing texts and creating the DTM

A DTM is a matrix in which rows represent documents, columns represent terms, and cells indicate how often each term occurred in each document. This is referred to as a bag of words representation of texts, since we ignore the order of words. This simplified representation makes analysis much easier and less computationally demanding, and as research has shown: “a simple list of words, which we call unigrams, is often sufficient to convey the general meaning of a text'' [@grimmer13: 6].

As input for this package, the DTM can either be in the dfm class of the quanteda package [@quanteda] or in the DocumentTermMarix class of the tm package [@feinerer08]. We recommend using quanteda, because the dfm class can contain the document meta data (called docvars in quanteda), and is generally more efficient and versatile. For detailed instructions you can visit the quanteda website. Here we will demonstrate one of the easiest ways to create a DTM with the meta data included, based on a data.frame. We first create a corpus with the corpus function, and then process this corpus into a DTM with the dfm function.

library(quanteda)
d = data.frame(id = c(1,2,3),
               text = c('Socrates is human', 'Humans are mortal', 'Therefore, Socrates is mortal'),
               author = c('Aristotle','Aristotle','Aristotle'),
               stringsAsFactors = F)

corp = corpus(d, docid_field = 'id', text_field='text')
dtm = dfm(corp)

dtm
## Document-feature matrix of: 3 documents, 8 features (54.2% sparse).
## 3 x 8 sparse Matrix of class "dfm"
##     features
## docs socrates is human humans are mortal therefore ,
##    1        1  1     1      0   0      0         0 0
##    2        0  0     0      1   1      1         0 0
##    3        1  1     0      0   0      1         1 1
docvars(dtm)
##      author
## 1 Aristotle
## 2 Aristotle
## 3 Aristotle

Based on the representation of texts in the DTM, the similarity of documents can be calculated as the similarity of row vectors. However, as seen in the example, this approach has certain pitfalls if we simply look at all words in their literal form. For instance, the words "human” and “humans” are given separate columns, despite having largely the same meaning. As a result, this similarity of the first two texts is not recognized. Also, the words “are”, and “is” do not have any substantial meaning, so they are not informative and can be misleading for the calculation of document similarity. There are various preprocessing techniques to filter and transform words that can be used to mend these issues. In addition, we can use these techniques to steer on what grounds documents are compared.

[lemma]: Stemming and lemmatization are both techniques for reducing words to their root, or more specifically their stem and lemma. This is used to group different forms of the same word together. Without going into specifics, lemmatization is a much more computationally demanding approach, but generally gives better results. Especially for richly inflicted languages such as German or Dutch it is highly recommended to use lemmatization instead of stemming.

[pos]: Part-of-speech tagging is a technique that identifies types of words, such as verbs, nouns and adjectives.

All mentioned preprocessing techniques except for lemmatization and part-of-speech tagging can be used in quanteda (as arguments in the dfm function). To use lemmatization or part-of-speech tagging there are several free to use grammar parsers, such as CoreNLP for English [@corenlp] and Frog for Dutch [@bosch07]. For more information about preprocessing, please consult [@welbers17].

For this vignette, data is used that has been preprocessed with the Dutch grammar parser Frog. The data is a sample of data used in a study on the influence of a Dutch news agency on the print and online editions of Dutch newspapers in political news coverage [@welbers18]. The terms have been lemmatized, and only the nouns and proper names of the headline and first five sentences (that generally contain the essential who, what and where of news) are used. By focusing on these elements, the analysis in this study focused on whether documents address the same events. This data is also made available in the package as demo data, in which the actual nouns and proper names have been substituted with indexed word types (e.g., the third organization in the vocabulary is referred to as organization.3, the 20th noun as noun.20).

library(RNewsflow)

dtm = rnewsflow_dfm ## copy the demo data
dtm
## Document-feature matrix of: 1,754 documents, 6,968 features (99.6% sparse).

To compare news across media and over time, it is necessary to have meta data about the date of a document, and convenient to have meta data about the source of a document. In the demo data, this is included as the docvars of the dfm. Note that if the DocumentTermClass class of the tm package is used, the meta data has to be managed manually, as a data.frame in which the rows (i.e. documents) match the rows (i.e. documents) in the DTM.

head(docvars(dtm), 3)
##   docname_   docid_ segid_                date     source sourcetype
## 1 35573532 35573532      1 2013-06-01 06:00:00 Print NP 2   Print NP
## 2 35573536 35573536      1 2013-06-01 06:00:00 Print NP 2   Print NP
## 3 35573539 35573539      1 2013-06-01 06:00:00 Print NP 2   Print NP

Using word statistics to filter and weight the DTM

As a final step in the data preparation, we can filter and weight words based on word statistics, such as how often a word occured. Since we are analyzing news diffusion, a particularly interesting characteristic of words is the distribution of their use over time. To focus the comparison of documents on words that indicate new events, we can filter out words that are evenly used over time. To calculate this, we provide the term_day_dist function.

tdd = term_day_dist(dtm)
tail(tdd, 3)
##              term freq doc.freq days.n days.pct days.entropy days.entropy.norm
## 6966 location.320    2        2      2   0.0667         2.00            0.0667
## 6967    noun.3667    4        4      2   0.0667         1.75            0.0585
## 6968    noun.1021    7        7      3   0.1000         2.94            0.0981

Of particular interest is the days.entropy score, which is the entropy of the distribution of words over days. This tells us whether the occurrence of a word over time is evenly distributed (high entropy) or concentrated (low entropy).[autoboiler] The maximum value for entropy is the total number of days (in case of a uniform distribution). The days.entropy.norm score normalizes the entropy by dividing by the number of days. By selecting the terms with low entropy scores, the DTM can be filtered by using the selected terms as column values.

[autoboiler]: Note that this is also a good automatic approach for filtering out stopwords, boilerplate words, and word forms such as articles and common verbs.

select_terms = tdd$term[tdd$days.entropy.norm <= 0.3]
dtm = dtm[,select_terms]

In addition to deleting terms, we should also weight terms. Turney explains that ''The idea of weighting is to give more weight to surprising events and less weight to expected events'', which is important because ''surprising events, if shared by two vectors, are more discriminative of the similarity between the vectors than less surprising events'' [@turney02: 156]. Thus, we want to give more weight to rare words than common words. A classic weighting scheme and recommended standard in information retrieval is the term-frequency inverse document frequency (tf.idf) [@sparck72,@monroe08]. This and other weighting schemes can easily be applied using the quanteda package, for instance using the dfm_tfidf function.

dtm = quanteda::dfm_tfidf(dtm)

Calculating document similarities

Given a DTM with document variables (see ?quanteda::docvars) that indicate publication time, the document similarities over time can be calculated with the newsflow_compare function. The calculation of document similarities is performed using a vector space model [@salton75;@salton03] approach, but with a sliding window over time to only compare documents that occur within a given time distance. Any other columns (in our case “source” and “sourcetype”) will automatically be included as document meta information in the output.

Furthermore, three parameters are of particular importance. The hour_window parameter determines the time window in hours within which each document is compared to other documents. The argument is a vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(0, 36) will compare each document to all documents within the next 36 hours. The measure parameter, which determines what measure for similarity is used, defaults to cosine similarity[alternativemeasure]. This is a commonly used measure, which indicates similarity as a score between 0 (no similarity) and 1 (identical)[cosine_exception]. The min_similarity parameter is used to ignore all document pairs below a certain similarity score. In the current example we use a minimum similarity of 0.4, because a validity test in the paper from which the current data is taken found this to be a good threshold for finding documents that address the same events. Whether or not a threshold should be used and what the value should be depends on the goal of the analysis and the data. We recommend manual validation to verify that the similarity threshold matches human interpretation of whether or not two documents are similar, given a pre-defined interpretation of similarity (e.g., same event, same theme).

[alternativemeasure]: Alternatively, the current version also supports overlap_pct, which os a measure giving the percentage of overlapping term occurences. This is an asymmetrical measures: the overlap percentage of document 1 to document 2 can be different from the overlap percentage of document 2 to document 1. When using an asymmetrical measure, the direction should be carefully taken into account in the analysis.

[cosine_exception]: If the DTM contains negative values, the cosine similarity score can range from -1 to 1. Note that for the other similarity measures there can be no negative values in the DTM.

g = newsflow_compare(dtm, date_var='date',
                     hour_window = c(0,36), 
                     min_similarity = 0.4)

The output g is a network, or graph, in the format of the igraph package [@csardi06}. The vertices (or nodes) of this network represent documents, and the date and source of each document are stored as vertex attributes. The edges (or ties) represent the similarity of documents, and the similarity score and time difference are stored as edge attributes. To avoid confusion, keep in mind that from hereon when we talk about vertices or a vertex we are talking about documents, and that edges represent document pairs. An advantage of using a network format is that it combines this data in an efficient way, without copying the document meta information for each edge. This network forms the basis for all the analysis functions offered in this package[external_sim].

[external_sim]: If data about document similarities is imported, then the document_network function can be used to create this network. This way the functions of this package for aggregating and visualizing the network can still be used.

A full understanding of the igraph package is not required to use the current package, but one does need some basic understanding of the functions for viewing and extracting the document/vertex and edge attributes. First, vertex and edge attributes cannot directly be extracted using the $ symbol, but require the functions V() and E() to be used, for vertex and edge attributes, respectively. These can be used as follows.

vertex_sourcetype = V(g)$sourcetype
edge_hourdiff = E(g)$hourdiff

head(vertex_sourcetype)
## [1] "Online NP"  "Newsagency" "Newsagency" "Newsagency" "Print NP"  
## [6] "Newsagency"
head(edge_hourdiff)
## [1]  0.05  1.10  1.32  1.25 10.50  1.12

Alternatively, all vertex and edge attributes can be viewed or extracted with the get.data.frame function.

v = as_data_frame(g, 'vertices')
e = as_data_frame(g, 'edges')

head(v[,c('name','date','source','sourcetype')],3)
##              name                date      source sourcetype
## 37538427 37538427 2013-06-20 11:27:00 Online NP 1  Online NP
## 97361198 97361198 2013-06-02 14:13:00  Newsagency Newsagency
## 97360803 97360803 2013-06-03 16:54:00  Newsagency Newsagency
head(e,3)    
##       from       to weight hourdiff
## 1 37538427 37599339      1     0.05
## 2 97361198 35677816      1     1.10
## 3 97360803 35734376      1     1.32

The weight attribute of the edges represents the similarity score. The hourdiff attribute represents the time difference in hours between two documents, where the value indicates how long the to article was published after the from article. A histogram can provide a good first indication of this data.

hist(E(g)$hourdiff, main='Time distance of document pairs', 
     xlab = 'Time difference in hours', breaks = 150, right=F)

plot of chunk unnamed-chunk-11

In the histogram we see that most document pairs with a similarity score above the threshold are about an hour apart (1 on the x-axis). This is mainly because the online newspapers often follow the news agency within a very short amount of time. As time distance increases, the number of document pairs decreases, which makes sense because news gets old fast, so news diffusion should occur within a limited time after publication.

Tailoring the document comparison window

If news diffuses from one source to another, then the time difference cannot be zero, since the source that follows needs time to edit and publish the news. This delay period can also differ between sources. Websites can adopt news within minutes, but newspapers have a long time between pressing and publishing the newspaper, meaning that there is a period of several hours before publication during which influence is not possible. Thus, we have to adjust the window for document pairs. To make it more convenient to adjust and inspect the window settings for different sources, we offer the filter_window and show_window functions.

The filter_window function can be used to filter the document pairs (i.e. edges) using the hour_window parameter, which works identical to the hour_window parameter in the newsflow_compare function. In addition, the from_vertices and to_vertices parameters can be used to select the vertices (i.e. documents) for which this filter is applied. This makes it easy to tailor the window for different sources, or source types. For the current data, we first set the minimum time distance for all document pairs to 0.1 hours. Then, we set a minimum time distance of 6 hours for document pairs where the to document is from a print newspaper.

# set window for all vertices
g = filter_window(g,  hour_window = c(0.1, 36))

# set window for print newspapers
g = filter_window(g,  hour_window = c(6, 36), 
           to_vertices = V(g)$sourcetype == 'Print NP')

For all sources the window has now first been adjusted so that a document can only match a document that occured at least 0.1 hours later. For print newspapers, this is then set to 6 hours. With the show_window function we can view the actual window in the data. This function aggregates the edges for all combinations of attributes specified in from_attribute and to_attribute, and shows the minimum and maximum hour difference for each combination. We set the to_attribute parameter to “source”, and leave the from_attribute parameter empty. This way, we get the window for edges from any document towards each source.

show_window(g, to_attribute = 'source')
##           from          to window.left window.right
## 1 any document  Newsagency       0.150         35.4
## 2 any document Online NP 1       0.100         35.8
## 3 any document Online NP 2       0.117         36.0
## 4 any document  Print NP 1       7.017         36.0
## 5 any document  Print NP 2       7.350         35.1

We see here that edges towards “Print NP 2” have a time distance of at least 7.4 hours and at most 35.1 hours. This falls within the defined window of at least 6 and at most 36 hours.

Analyzing the document similarity network

Before we aggregate the network, it can be informative to look at the individual sub-components. If a threshold for document similarity is used, then there should be multiple disconnected components of documents that are only similar to each other. With the current data, these components tend to reflect documents that address the same or related events. Decomposing the network can be done with the decompose() function from the igraph package.

g_subcomps = decompose(g)

The demo data with the current settings has 681 sub-components. To visualize these components, we offer the document_network_plot function. This function draws a network where nodes (i.e. documents) are positioned based on their date (x-axis) and source (y-axis).

gs = g_subcomps[[55]] # select the second sub-component
document_network_plot(gs)

plot of chunk unnamed-chunk-15

The visualization shows that a news message was first published by the news agency on June 5th at 9:21 AM. Soon after this messages was adopted by two online newspapers, and the next day a print newspaper followed. The grayscale and width of the edges also show that the online newspaper messages were more similar (i.e. thick and black) to the news agency message compared to the print newspaper message.

By default, the “source” attribute is used for the y-axis, but this can be changed to other document attributes using the source_attribute parameters. If a DTM is also provided, the visualization will also include a word cloud with the most frequent words of these documents.

document_network_plot(gs, source_attribute = 'sourcetype', 
                      dtm=dtm)

plot of chunk unnamed-chunk-16

These visualizations and the corresponding subcomponents help us to qualitatively investigate specific cases. This also helps to evaluate whether the document similarity measures are valid given the goal of the analysis. Furthermore, they illustrate how we can analyze homogeneity and news diffusion patterns. For each source we can count what proportion of its publications is similar to earlier publications by specific other sources. We can also analyze the average time between publications.

Another useful application is that we can use them to see whether certain transformations of the network might be required. Depending on the purpose of the analysis it can be relevant to add or delete certain edges. For instance, in the previous visualizations we see that the print newspaper message matched both the newsagency message and two online newspaper messages. If we are specifically interested in who the original source of the message is, then it makes sense to only count the edge to the newsagency. Here we demonstrate the only_first_match function, which transforms the network so that a document only has an edge to the earliest dated document it matches within the specified time window[duplicate].

[duplicate]: If there are multiple earliest dated documents (that is, having the same publication date) then edges to all earliest dated documents are kept.

gs_onlyfirst = only_first_match(gs)
document_network_plot(gs_onlyfirst)

plot of chunk unnamed-chunk-17

Aggregating the document similarity network

This package offers the network_aggregate function as a versatile way to aggregate the edges of the document similarity network based on the vertex attributes (i.e. the document meta information). The first argument is the network (in the igraph class). The second argument, for the by parameter, is a character vector to indicate one or more vertex attributes based on which the edges are aggregated. Optionally, the by characteristics can also be specified separately for by_from and by_to. This gives flexible control over the data, for instance to aggregate from sources to sourcetypes, or to aggregate scores for each source per month.

By default, the function returns the number of edges, as well as the number of nodes that is connected for both the from and to group. The number of nodes that is connected is only relevant if a threshold for similarity (edge weight) is used, so that whether or not an edge exists indicates whether or not two documents are similar. In addition, if an edge_attribute is given, this attribute will be aggregated using the function specified in agg_FUN. For the following example we include this to analyze the median of the hourdiff attribute.

g_agg = network_aggregate(g, by='source', 
                          edge_attribute='hourdiff', 
                          agg_FUN=median)

e = as_data_frame(g_agg, 'edges')
head(e)
##          from          to edges agg.hourdiff from.V from.Vprop to.V to.Vprop
## 1  Newsagency  Newsagency   115         8.47     95     0.1591   99   0.1658
## 2  Print NP 1  Newsagency    19         8.33     15     0.1020   18   0.0302
## 3 Online NP 1  Newsagency   105         4.82     82     0.1775   94   0.1575
## 4 Online NP 2  Newsagency    68         6.71     59     0.1415   62   0.1039
## 5  Print NP 2  Newsagency    10        10.75     10     0.0763   10   0.0168
## 6  Newsagency Online NP 1   435         1.27    381     0.6382  407   0.8810

In the edges of the aggregated network there are six scores for each edge. The edges attribute counts the number of edges from the from group to the to group. For example, we see that Newsagency documents have 434 edges to later published Online NP 1 documents. The agg.hourdiff attribute shows that the median of the hourdiff attribute of these 434 edges is 1.27 (1 hour and 16 minutes). In addition to the edges, we can look at the number of messages (i.e. vertices) in the from group that matched with at least one message in the to group. This is given by the from.V attribute, which shows here that 380 Newsagency documents matched with a later published Online NP 1 document[lowerthanedges]. This is also given as the proportion of all vertices/documents in the from group, as the from.Vprop attribute. Substantially, the from.Vprop score thus indicates that 63.65% of political news messages in Newsagency is similar or identical to later published Online NP 1 messages.

Alternatively, we can also look at the to.V and to.Vprop scores to focus on the number and proportion of messages in the to group that match with at least one message in the from group. Here we see, for instance, that 87.88% of political news messages in Online NP 1 is similar to or identical to previously published messages Newsagency messages. The from.Vprop and to.Vprop scores thus give different and complementary measures for the influence of Newsagency on Online NP 1.

[lowerthanedges]: Note that the edges score is always equal to or higher than the from.matched score, since one document can match with multiple other documents.

Inspecting and visualizing results

The final step is to interpret the aggregated network data, or to prepare this data for use in a different analysis. We already saw that the network data can be transformed to a common data.frame with the as_data_frame function. Alternatively, igraph has the as_adjacency_matrix function to return the values for one edge attribute as an adjacency matrix, with the vertices in the rows and columns. This is a good way to present the scores of the aggregated network.

adj_m = as_adjacency_matrix(g_agg, attr= 'to.Vprop', sparse = FALSE)
round(adj_m, 2) # round on 2 decimals
##             Newsagency Print NP 1 Online NP 1 Online NP 2 Print NP 2
## Newsagency        0.17       0.24        0.88        0.82       0.34
## Print NP 1        0.03       0.03        0.05        0.04       0.02
## Online NP 1       0.16       0.22        0.11        0.33       0.31
## Online NP 2       0.10       0.18        0.26        0.08       0.31
## Print NP 2        0.02       0.01        0.03        0.07       0.01

Alternatively, an intuitive way to present the aggregated network is by visualizing it. For this we can use the visualization features of the igraph package. To make this more convenient, we also offer the directed_network_plot function. This is a wrapper for the plot.igraph function, which makes it easier to use different edge weight attributes and to use an edge threshold, and has default plotting parameters to work well for directed graphs with edge labels. For this example we use the to.Vprop edge attribute with a threshold of 0.2.

directed_network_plot(g_agg, weight_var = 'to.Vprop',
                       weight_thres = 0.2)

plot of chunk unnamed-chunk-20

For illustration, we can now see how the results change if we transform the network with the only_first_match function, which only counts edges to the first source that published a document.

g2 = only_first_match(g)
g2_agg = network_aggregate(g2, by='source', 
                           edge_attribute='hourdiff', 
                           agg_FUN=median)

directed_network_plot(g2_agg, weight_var = 'to.Vprop',
                       weight_thres = 0.2)

plot of chunk unnamed-chunk-21

The first network is much more dense compared to the second. In particular, we see stronger edges between the print and online editions of the newspapers. In the second network only the edges from the news agency towards the print and online newspapers remain, meaning that edges between newspapers all score less than 0.2. This implies that many of the edges between newspapers that could be observed in the first network resulted from cases where both newspapers adopt the same news agency articles.

Note that the second network is not better per se. It is possible that the initial source of a message is not the direct source. For example, a blog might not have access to the news agency feed, and therefore only receive news agency messages if they are published by another source. Thus, the most suitable approach depends on the purpose of the analysis. One of the goals of creating this package is to facilitate a platform for scholars and professionals to develop best practices for different occasions.

Alternative applications of this package

There are several alternative applications of the functions offered in this package that are not covered in this vignette. Here we briefly point out some of the more useful alternatives.

In the network_aggregate function it is possible to use different vertex attributes to aggregate the edges for from and to nodes. A particularly interesting application of this feature is to use the publication date in the aggregation. For instance, with the following settings, we can get the proportion of matched documents per day. Here we also demonstrate the return_df parameter in the network_aggregate function. This way the results are returned as a single data.frame in which relevant edge and vertex information is merged.

V(g)$day = format(as.Date(V(g)$date), '%Y-%m-%d')
agg_perday = network_aggregate(g, 
              by_from='sourcetype', by_to=c('sourcetype', 'day'), 
              edge_attribute='hourdiff', agg_FUN=median, 
              return_df=TRUE)

head(agg_perday[agg_perday$to.sourcetype == 'Online NP',  
                c('from.sourcetype', 'to.sourcetype', 'to.day','to.Vprop')])
##    from.sourcetype to.sourcetype     to.day to.Vprop
## 66      Newsagency     Online NP 2013-06-01    0.789
## 67       Online NP     Online NP 2013-06-01    0.368
## 68        Print NP     Online NP 2013-06-01    0.263
## 69      Newsagency     Online NP 2013-06-02    0.778
## 70       Online NP     Online NP 2013-06-02    0.111
## 71      Newsagency     Online NP 2013-06-03    0.864

Looking at the to.Vprop score, we see that on 2013-06-01, 78.95% of Online NP messages match with previously published Newsagency messages[notfrom]. This way, the aggregated document similarity data can be analyzed as a time-series. For instance, to analyze whether seasonal effects or extra-media developments such as election campaigns affect content homogeneity or intermedia dynamics. The return_df feature is convenient for this purpose, because it directly matches all the vertex and edge attributes (as opposed to the as_data_frame function).

[notfrom]: The from.Vprop cannot be interpreted in the same way, because it gives the proportion of all messages in the from group. If one is interested in the from.Vprop score per day, the by.from and by.to arguments in network_aggregate need to be switched. Note that one should not simply aggregate both by.from and by.to by date, because then only documents that both occured on this date will be aggregated.

Another useful application of this feature is to only aggregate the by_to group, by using the document name in the by_from argument. This way, the data in the by_to group is aggregated for each individual message. In the following example we use this to see for each individual message whether it matches with later published messages in each sourcetype. This could for instance be used to analyze whether certain document level variables (e.g., news factors, sensationalism) make a message more likely to be adopted by other news outlets. We set the edge_attribute to “weight” and agg_FUN to max, so that for each document we can see how strong the strongest match with each source was.

agg_perdoc = network_aggregate(g, 
                  by_from='name', by_to='sourcetype', 
                  edge_attribute='weight', agg_FUN=max,
                  return_df=TRUE)
docXsource = xtabs(agg.weight ~ from.name + to.sourcetype, 
                   agg_perdoc, sparse = FALSE)
head(docXsource)
##            to.sourcetype
## from.name   Newsagency Online NP Print NP
##   147908507      0.400     0.425    0.000
##   148662598      0.000     0.767    0.000
##   150454094      0.520     0.528    0.000
##   150454102      0.000     0.895    0.000
##   150454110      0.000     0.402    0.000
##   150454167      0.000     0.996    0.000

Finally, note that we have now only compared documents with future documents. We thereby focus the analysis on news diffusion. To focus on content homogeneity, each document can be compared to both past and future documents. By measuring content homogeneity aggregated over time, patterns such as longitudinal trends can be analyzed.

Conclusion and future improvements

We have demonstrated how the RNewsflow package can be used to perform a many-to-many comparison of documents. The primary focus and most important feature of this package is the newsflow_compare function. This function compares all documents that occur within a given time distance, which makes it computationally feasible for longitudinal data. Using this data, we can analyze to what extent different sources publish the same content and whether there are consistent patterns in who follows whom. The secondary focus of this package is to provide functions to organize, visualize and analyze the document similarity data within R. By enabling both the document comparison and analysis to be performed within R, this type of analysis becomes more accesible to scholars less versed in computational text analysis, and it becomes easier to share and replicate methods at a very detailed level.

The data input required for this analysis consists solely of textual documents and their corresponding publication date and source. Since no human coding is required, the package enables large scale comparative and longitudinal studies. Although the demonstration in this vignette used a moderate sized dataset, the newsflow_compare function can handle much larger data and is fast, thanks to a dedicated Rcpp (c++) implementation of sparse matrix multiplication for comparing documents within the given date window.

The validity of the method presented here relies on various factors; most importantly the type of data, the pre-processing and DTM preparation steps and the similarity threshold. It is thus recommended to use a gold standard, based on human interpretation, to test its validity. In a recent empirical study (author citation, forthcoming) we obtained good performance for whether documents addressed the same events and whether they contained identical phrases by determining thresholds based on a manually coded gold standard. Also, by using a news website that consistently referred to a news agency by name, we confirmed that we could reliably identify news agency influence for this format. Naturally, there are always limitations to the accuracy with which influence in news diffusion can be measured using only content analysis. However, we generally do not have access to other reliable information. The advantage of a content analysis based approach is that it can be used on a large scale and over long periods of time, without relying on often equally inaccurate self-reports of news workers. The current package contributes to the toolkit for this type of analysis.

Our goal is to continue developing this package as a specialized toolkit for analyzing the homogeneity and diffusion of news content. First of all, additional approaches for measuring whether documents are related will be added. Currently only a vector space model approach for calculating document similarity is implemented. For future versions alternative approaches such as language modeling will also be explored. In particular, we want to add measures to express the relation of documents over time in terms of probability and information gain. This would also allow us to define a more formal criterion for whether or not a relation exists, other than using a constant threshold for document similarity. Secondly, new methods for analyzing and visualizing the network data will be explored. In particular, methods will be implemented for analyzing patterns beyond dyadic ties between news outlets, building on techniques from the field of network analysis. To promote the involvement of other scholars and professionals in this development, the package is published entirely open-source. The source code is hosted on GitHub—https://github.com/kasperwelbers/RNewsflow.

Practical code example

library(RNewsflow)
library(quanteda)

# Prepare DTM
dtm = rnewsflow_dfm  ## copy demo data

tdd = term_day_dist(dtm)
dtm = dtm[,tdd$term[tdd$days.entropy.norm <= 0.3]]

dtm = dfm_tfidf(dtm)

# Prepare document similarity network
g = newsflow_compare(dtm, hour_window = c(-0.1,60), 
                     min_similarity = 0.4)
g = filter_window(g, hour_window = c(6, 36), 
                  to_vertices = V(g)$sourcetype == 'Print NP')
show_window(g, to_attribute = 'source')

g_subcomps = decompose(g)
document_network_plot(g_subcomps[[55]], dtm=dtm)

g_agg = network_aggregate(g, by='source', 
                          edge_attribute='hourdiff', agg_FUN=median)

as_adjacency_matrix(g_agg, attr='to.Vprop')
directed_network_plot(g_agg, weight_var = 'to.Vprop', 
                      weight_thres=0.2)

g2 = only_first_match(g)
g2_agg = network_aggregate(g2, by='source', 
                           edge_attribute='hourdiff', agg_FUN=median)

as_adjacency_matrix(g2_agg, attr='to.Vprop')
directed_network_plot(g2_agg, weight_var = 'to.Vprop', 
                      weight_thres=0.2)

References