This follows on from a previous post describing the initial steps of a systematic quantitative literature review, and how Python and R can make your life much easier by automating keyword extraction and cross-checking.
In this post, I'll explain (briefly) the rest of the process, and share the code and packages we used along the way.
We left off last time having designed and optimized the search keywords for finding relevant articles. This is great and all, but there's another crucial step in the systematic review method - cross-checking the reference lists of all articles in your database to see if there are any highly-cited papers you have missed in your searching.
I'll be honest, the idea of doing this manually was an unpleasant thought. So, I broke it down into a few steps.
Step one - extract reference lists from PDFs
Well, it turns out step one was the hardest. It's relatively simple to extract keywords from PDFs of articles, but reference lists turned out to be a whole other problem because their formatting varies so much between journals. My attempts to modify the keyword-extraction Python script from the previous step got me nowhere.
It also greatly surprised me that none of the existing reference manager software was able to extract the reference lists. The closest I found was in my old favourite Qiqqa (which I love so much that one day I might do a post about it), but it still didn't quite get me what I needed - a csv or similar of every reference from each paper.
Actual photo of me after spending days trying to solve this problem
In the end, I found what I needed in the slightly obscure software package ParsCit. This package takes articles and extracts the reference lists as XML files - you can read about it here. But a word of warning - ParsCit was hard to install. Even the website says "You should have a working knowledge of UNIX and some experience in installing UNIX tools."
If you make it through the minefield of installing ParsCit and getting it working, you'll need to first convert your PDFs to text. You can do this easily with a shell script:
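#!/bin/bash
# Convert every PDF in the current directory to a text file with the same base name
for file in *.pdf; do
    name=${file##*/}       # strip any leading path
    outfile=${name%.*}     # strip the .pdf extension
    # Quote the variables so filenames containing spaces still work
    pdftotext "$file" "$outfile.txt"
done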
Then, just run ParsCit from the command line:
citeExtract.pl [-m mode] [-i <inputType>] <filename> [outfile]
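If you have a whole folder of text files, you'll probably want to loop over them rather than run ParsCit by hand each time. Here's a minimal sketch of how that might look. It isn't the exact script we used: it assumes `citeExtract.pl` is on your PATH and that the `extract_citations` mode is the one you want (check the ParsCit README for your version), and `PARSCIT_CMD` and `extract_refs` are names I've made up for illustration.

```shell
#!/bin/bash
# Sketch: run ParsCit on every .txt file in a directory, writing one XML file per article.
# PARSCIT_CMD and extract_refs are hypothetical names; adjust them to your install.
PARSCIT_CMD="${PARSCIT_CMD:-citeExtract.pl}"

extract_refs() {
    local dir="${1:-.}"
    local file
    for file in "$dir"/*.txt; do
        # If there are no .txt files the glob stays literal, so skip it
        [ -e "$file" ] || continue
        # Follows the usage line: citeExtract.pl [-m mode] <filename> [outfile]
        # The mode name is an assumption; confirm it against your ParsCit version
        "$PARSCIT_CMD" -m extract_citations "$file" "${file%.txt}.xml"
    done
}

# Example: extract_refs /path/to/text/files
```

Each `article.txt` should end up with a matching `article.xml` next to it, which is the layout the next step expects.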
Step two - cross-reference XML files
At this point, you should have a directory with each article's reference list converted to an individual XML file. What we want to do is compile the reference lists from each XML file into one document, and count how many times each unique reference appears. Luckily, this is also simple and can be done from a shell script.
If you run the following shell script from within the folder where the XML files are stored, it will loop through all files in the directory, pull out every cited title, and count the number of occurrences of each. The output is a text file (text.txt) which contains a list of all unique references sorted in descending order of occurrence. Neat, huh?
cat *.xml | grep -E "<title>" | sed "s/<title>//g;s/<\/title>//g" | sort | uniq -c | sort -k1nr > text.txt
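One thing to watch for: the same paper can be cited with slightly different capitalisation or punctuation, and exact matching will count those as separate references. A possible variation, sketched below, lowercases each title and strips punctuation before counting. Like the one-liner above, it assumes each `<title>` element sits on a single line, and `count_titles` is just a name I've picked for the example.

```shell
#!/bin/bash
# Sketch: count cited titles across all XML files in a directory,
# ignoring differences in case and punctuation.
# count_titles is a hypothetical helper, not part of ParsCit.
count_titles() {
    local dir="${1:-.}"
    cat "$dir"/*.xml \
        | grep -o '<title>[^<]*</title>' \
        | sed 's/<title>//; s/<\/title>//' \
        | tr '[:upper:]' '[:lower:]' \
        | tr -d '[:punct:]' \
        | tr -s ' ' \
        | sed 's/^ *//; s/ *$//' \
        | sort | uniq -c | sort -k1,1nr
}

# Example: count_titles . > text.txt
```

Stripping punctuation also removes hyphens, so "cross-checking" becomes "crosschecking". That's harmless for counting, but it does mean the titles in the output are no longer quite how they were printed.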
And that's pretty much it! From there you can find if there are any frequently-cited papers that you have missed during your searches.
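If you keep a plain list of the titles already in your database, you can let the shell do the comparison too. This is only a sketch under some assumptions: `known_titles.txt` is a hypothetical file with one title per line, `missing_titles` is a made-up function name, and titles are matched exactly apart from case, so near-duplicates will still need checking by eye.

```shell
#!/bin/bash
# Sketch: list frequently cited titles that are not yet in your own database.
# Arguments: counts file (output of uniq -c), known-titles file, minimum citation count.
# missing_titles and known_titles.txt are hypothetical names for illustration.
missing_titles() {
    local counts="$1" known="$2" min="${3:-2}"
    awk -v min="$min" '
        # First file: remember every title you already have (case-insensitive)
        FILENAME == ARGV[1] { known[tolower($0)] = 1; next }
        # Second file: lines look like "   5 some title"
        {
            n = $1
            title = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", title)   # drop the leading count
            if (n + 0 >= min + 0 && !(tolower(title) in known))
                print n "\t" title
        }
    ' "$known" "$counts"
}

# Example: missing_titles text.txt known_titles.txt 3
```

The third argument is the cut-off for "frequently cited". It defaults to 2 here, but the right threshold will depend on how many articles are in your review.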
As usual, all of the code discussed here is available on GitHub. Let me know if you have any comments, questions or suggestions!