Meanwhile, Andrej Karpathy has a project, arxiv-sanity-preserver, which contains the source code for Arxiv Sanity Preserver. Its functions include fetching data from the arXiv API, downloading the PDFs, processing them, and analyzing them.
So, based on these two projects, I want to build an autocomplete server trained on arXiv papers. The project is here.
I use the arxiv-sanity code to download the PDFs and convert them to text files, and then follow the torch-rnn-server pipeline to train the model. Nothing fancy :)
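To give a rough idea of the first step, here is a minimal sketch of querying the arXiv API and pulling the PDF links out of the Atom feed it returns. This is not the arxiv-sanity code itself; the function names `build_query_url`, `pdf_links`, and `fetch_feed` are made up for illustration, and the category `cat:cs.LG` in the usage note is just an example query.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# The arXiv API returns an Atom feed, so every tag lives in this namespace.
ATOM_NS = "{http://www.w3.org/2005/Atom}"
API_BASE = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10):
    """Build an arXiv API query URL (hypothetical helper)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return API_BASE + "?" + urllib.parse.urlencode(params)


def pdf_links(atom_xml):
    """Return one PDF link per entry found in an Atom feed string."""
    root = ET.fromstring(atom_xml)
    links = []
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            # arXiv marks the PDF link with title="pdf".
            if link.get("title") == "pdf" or link.get("type") == "application/pdf":
                links.append(link.get("href"))
                break
    return links


def fetch_feed(url):
    """Download the feed as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

With network access, something like `pdf_links(fetch_feed(build_query_url("cat:cs.LG", max_results=5)))` should give a short list of PDF URLs to download. Converting the PDFs to text is a separate step that this sketch does not cover.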
However, the results are not satisfying, which is also mentioned in the original torch-rnn-server blog post. My guess is that the data needs more cleaning; the size of the corpus and the type of model being trained also matter.
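As an illustration of what "more cleaning" could look like, here is a small sketch that rejoins words hyphenated across line breaks and drops lines that are mostly digits and symbols (typical leftovers from equations and tables in PDF-extracted text). I have not run this on the corpus; the function name `clean_text` and the `min_alpha_ratio` threshold of 0.6 are assumptions picked for the example, not tuned values.

```python
import re


def clean_text(raw, min_alpha_ratio=0.6):
    """Roughly clean text extracted from a PDF (hypothetical helper)."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)

    kept = []
    for line in text.splitlines():
        # Collapse runs of whitespace and trim the line.
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue
        # Keep the line only if enough of it is letters or spaces.
        alpha = sum(ch.isalpha() or ch == " " for ch in line)
        if alpha / len(line) >= min_alpha_ratio:
            kept.append(line)
    return "\n".join(kept)
```

A filter like this is crude: it would also throw away legitimate lines that happen to be dense with numbers, so the threshold would need checking against real samples.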