The ArXiv and me (Part 1)

The ArXiv is a popular pre-print article server for physics, mathematics, and computer science (and other subjects) hosted by Cornell University. It is a fairly common practice for academics to upload a preliminary version of their articles (or other works) to the ArXiv to make them publicly available before they are formally published in a journal. (The process of publication is often lengthy, and many consider it best to make the article available in advance, even though it probably has not yet been peer-reviewed.)

At present, there are around 1.4 million articles hosted on the ArXiv, and more are added every day. (Should you wish to see a visual representation of the articles on the ArXiv, which I assume you do, you should visit paperscape.) In the sub-topics that I watch, roughly 4 to 10 new papers are added per day, and these topics are not amongst the most active on the ArXiv. The problem, then, is to filter the daily uploads to find the papers that are likely to be of interest to me. Luckily, at about the same time that I started to think about an automated solution, I discovered that the ArXiv provides some tools that can help.

My go-to language for automating things is Python, which has many tools for retrieving and processing web data. Building on an example provided in the ArXiv’s API (Application Programming Interface) documentation, I decided to use the feedparser library to gather and process the ArXiv’s daily RSS feed. The basic code is as follows.

import feedparser
url = 'http://arxiv.org/rss/math'
feed = feedparser.parse(url)

Once the feed has been retrieved, we must extract the data that we need. My original approach was to use Python’s namedtuple to store the data. A namedtuple gives a nice class-like object with named attributes (in this case authors, title, id, and abstract), while remaining a relatively lightweight data structure. It comes from the ‘collections’ module in Python’s standard library, where many other useful data-structure types can be found.

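from collections import namedtuple

# Lightweight record holding the fields we care about for each paper.
ArxivEntry = namedtuple(
    'ArxivEntry',
    ('authors',
     'title',
     'id',
     'abstract')
    )

# Build one record per feed entry; the feed's 'summary' field is stored as the abstract.
entries = [
    ArxivEntry(
        entry.authors,
        entry.title,
        entry.id,
        entry.summary)
    for entry in feed.entries]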
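To get a feel for what a namedtuple offers, here is a small self-contained sketch. It uses a made-up entry in place of real feed data, so every field value below is a placeholder rather than anything retrieved from the ArXiv.

```python
from collections import namedtuple

ArxivEntry = namedtuple("ArxivEntry", ("authors", "title", "id", "abstract"))

# A hypothetical entry; these values are placeholders, not real feed data.
sample = ArxivEntry(
    authors="A. Author, B. Author",
    title="An example title",
    id="http://arxiv.org/abs/0000.00000",
    abstract="An example abstract.",
)

# Fields can be read by name or by position.
print(sample.title)
print(sample[1])

# A namedtuple is still a tuple, so it can be unpacked.
authors, title, entry_id, abstract = sample

# It is also immutable: _replace returns a new record and leaves the old one alone.
updated = sample._replace(title="A revised title")
```

Access by name keeps the later filtering code readable, while the tuple behaviour means the records remain cheap to create and easy to unpack.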

Now comes the tricky part of picking out those entries that may be of interest. The method that I chose for this task was a simple keyword filter on the abstract of each entry. I built a list of keywords, called keywords, from articles that I have read in the past, and kept each entry whose abstract contains at least one of them. The match is a plain, case-sensitive substring test.

keywords = [ ... ] # too many to list here. 
accepted = [entry for entry in entries 
            if any(kw in entry.abstract 
                   for kw in keywords)]
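The sketch below runs the same kind of filter on a few invented entries, to show which abstracts pass. The keywords, the entries, and the helper name matches are all hypothetical, and the ignore_case option is a variation I am assuming for illustration, not part of the script above.

```python
from collections import namedtuple

ArxivEntry = namedtuple("ArxivEntry", ("authors", "title", "id", "abstract"))

# Hypothetical keywords and entries, invented purely for illustration.
keywords = ["random walk", "lattice"]
entries = [
    ArxivEntry("A. Author", "First", "id-1", "We study a random walk on a graph."),
    ArxivEntry("B. Author", "Second", "id-2", "Lattice models and their limits."),
    ArxivEntry("C. Author", "Third", "id-3", "A note on graph colourings."),
]


def matches(abstract, keywords, ignore_case=False):
    """Return True if any keyword occurs as a substring of the abstract."""
    if ignore_case:
        abstract = abstract.lower()
        keywords = [kw.lower() for kw in keywords]
    return any(kw in abstract for kw in keywords)


# Case-sensitive substring matching, as in the filter above.
accepted = [e for e in entries if matches(e.abstract, keywords)]

# Assumed variation: ignore case, so "Lattice" also matches "lattice".
accepted_relaxed = [
    e for e in entries if matches(e.abstract, keywords, ignore_case=True)
]

print([e.id for e in accepted])
print([e.id for e in accepted_relaxed])
```

With the case-sensitive test, the second entry is missed because its abstract begins with a capitalised "Lattice"; this is one of the ways a plain substring filter can overlook a relevant paper.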

Now that I had the vital parts of the problem solved, I added some code to write all the accepted articles into a separate file for each day, and set the script to run on my Raspberry Pi at 6 am every day using a cron job.
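The file-writing code is not shown here, so the following is only a minimal sketch of how that step might look. The function name write_digest, the file-name pattern, and the plain-text layout are my assumptions, as are the paths in the crontab comment.

```python
import datetime
from collections import namedtuple
from pathlib import Path

ArxivEntry = namedtuple("ArxivEntry", ("authors", "title", "id", "abstract"))


def write_digest(accepted, directory=".", today=None):
    """Write the accepted entries to a text file named after the date.

    The name 'arxiv-YYYY-MM-DD.txt' and the layout are assumptions made
    for this sketch, not the format used by the original script.
    """
    today = today or datetime.date.today()
    path = Path(directory) / f"arxiv-{today.isoformat()}.txt"
    with path.open("w", encoding="utf-8") as handle:
        for entry in accepted:
            # One block per paper: title, identifier, abstract, blank line.
            handle.write(f"{entry.title}\n{entry.id}\n{entry.abstract}\n\n")
    return path


# A crontab line of the following form runs a script every day at 06:00
# (the interpreter and script paths are placeholders):
#   0 6 * * * /usr/bin/python3 /home/pi/arxiv_filter.py
```

Naming each file after the date means a rerun on the same day overwrites that day's file instead of appending duplicates.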

So far it has selected some 30 articles, though not all have been exactly to my taste. The script is also rather simplistic, and does not allow for easy modification of the filtering method. I have started working on a new and improved set of tools for filtering ArXiv entries, which will eventually allow me to customise and experiment with the filtering method without major rewrites to my code.