The aim of the rmdfiltr word count filter is to provide a more accurate estimate of the number of words in a document than can be gleaned from the R Markdown source document. Output from (inline) R chunks, as well as formatted citations and references, cannot enter the word count when only the source document is analyzed. Hence, the word count filter is applied after the document has been knitted and while it is being processed by pandoc. At this stage, the document is represented as an abstract syntax tree (AST), a semantic nested list, and can be manipulated by applying so-called filters.
One of the filters that is applied to R Markdown documents by default is pandoc-citeproc, which formats citations and inserts references. To obtain an accurate estimate, the word count filter should therefore be applied after pandoc-citeproc has been applied. Because the default pandoc-citeproc is always applied last, it is necessary to disable it by adding the following to the document's YAML front matter:
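A minimal sketch of such a front matter entry, assuming the top-level `citeproc` field that R Markdown reads from the YAML header:

```yaml
# Disable the default pandoc-citeproc run so it can be applied manually,
# before the word count filter
citeproc: no
```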
To manually apply pandoc-citeproc and subsequently the rmdfiltr word count filter, add the pandoc arguments to the output format of your R Markdown document as pandoc_args. Each filter function returns a vector of command-line arguments; it takes previous arguments as args and appends to them. Hence, the calls to add filters can be nested:
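A short sketch of the nesting, assuming the rmdfiltr functions `add_citeproc_filter()` and `add_wordcount_filter()` and a working pandoc installation. The variable names are only illustrative, and the returned paths depend on the local setup:

```r
library("rmdfiltr")

# Start without prior arguments: returns the arguments that apply the
# citation filter
citeproc_args <- add_citeproc_filter(args = NULL)
citeproc_args

# Pass the result on as `args`, so that the word count filter is added
# after the citation filter
all_args <- add_wordcount_filter(add_citeproc_filter(args = NULL))
all_args
```

Because each call only appends to what it receives, the order of nesting determines the order in which pandoc applies the filters.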
#> "--filter"
#> "/Applications/RStudio.app/Contents/MacOS/pandoc/pandoc-citeproc"

#> "--filter"
#> "/Applications/RStudio.app/Contents/MacOS/pandoc/pandoc-citeproc"
#> "--lua-filter"
#> "/private/var/folders/nv/mz4ffsbn045101ngdd_mx0th0000gn/T/Rtmpp9G8mG/Rinsta4a16c970fe/rmdfiltr/wordcount.lua"
When adding the filters to pandoc_args, the R code needs to be preceded by !expr to declare it as an expression that is to be evaluated.
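Put together, the front matter might look like the following sketch. The choice of `html_document` is an assumption made for illustration; any output format that accepts `pandoc_args` should work the same way:

```yaml
# Disable the default citation processing
citeproc: no
output:
  html_document:
    # !expr makes R Markdown evaluate the R code instead of passing it on as text
    pandoc_args: !expr rmdfiltr::add_wordcount_filter(rmdfiltr::add_citeproc_filter(args = NULL))
```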
The word count filter reports the word counts in the console or, when knitting in RStudio, in the R Markdown tab.
285 words in text body
23 words in reference section
Although word counting appears to be a trivial matter, the counts of different methods often disagree. The magnitude of those disagreements depends on the complexity of the document.
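To illustrate how such disagreements can arise, here is a small base R sketch that counts the words of one sentence with two simple rules. Both rules and the example sentence are made up for illustration; they are not the rules used by the rmdfiltr filter or by any of the other methods discussed here.

```r
# Hypothetical example sentence with a citation, a hyphenated word,
# and an abbreviation
text <- "Stahl & Aust (2018) report a well-fitting model, e.g., in Study 1."

# Rule 1: every whitespace-separated token counts as a word
n_whitespace <- lengths(strsplit(text, "\\s+"))

# Rule 2: every run of letters or digits counts as a word
n_alnum <- lengths(gregexpr("[[:alnum:]]+", text))

# The two rules treat "&", "well-fitting", and "e.g." differently
c(whitespace = n_whitespace, alnum = n_alnum)
```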
To get a feeling for the performance of the word count filter, I briefly compared the estimates for two documents across several common methods. The first document, a paper by Stahl & Aust (2018), is rather simple, consisting only of text with citations and a reference section. The second document is more complicated: it contains math, code, verbatim output, etc.
The word counts for the text body do not include tables or images (or their captions) or the reference section (excluding these required some manual labor in Word, Pages, and wordcounter.net).
Overall, all methods provide similar estimates for the text body of the simple document. Although the document contains a considerable number of citations, the wordcountaddin, which is applied to the R Markdown source file before pandoc-citeproc, provides a good estimate. As expected, there is less agreement on the word count for the shorter and more complex document. In particular, the texcount word count is off: it displayed several errors related to the displayed R code and verbatim output. I think the errors may have caused texcount to ignore some parts of the document, which is probably the reason for its low word count for the text body. Similarly, the wordcountaddin cannot count the verbatim output.
The patterns for the reference sections of the simple and complex documents are comparable. Pages and texcount count more words than Word, wordcounter.net, and the rmdfiltr word count filter. I suspect the difference is due to how the methods handle the URLs in the references. The wordcountaddin cannot provide a word count for reference sections.
Overall, I’m fairly happy with the performance of the rmdfiltr filter. The word counts are quite similar to those of the majority of the other methods. I’m sure the filter can be improved (and I’ll gladly take any suggestions), but I think in its current form it is a decent solution.
Stahl, C., & Aust, F. (2018). Evaluative conditioning as memory-based judgment. Social Psychological Bulletin, 13(3), Article e28589. https://doi.org/10.5964/spb.v13i3.28589