# PaperQA2
PaperQA2 is a package for doing high-accuracy retrieval augmented generation (RAG) on PDFs, text files, Microsoft Office documents, and source code files,
with a focus on the scientific literature.
See our 2024 paper for examples of PaperQA2's superhuman performance in scientific tasks like
question answering, summarization, and contradiction detection.
<!--TOC-->
---
**Table of Contents**

- Quickstart
  - Example Output
- What is PaperQA2
  - PaperQA2 vs PaperQA
  - PaperQA2 Goes CalVer in December 2025
  - What's New in Version 5 (aka PaperQA2)?
  - What's New in December 2025?
  - PaperQA2 Algorithm
- Installation
- CLI Usage
  - Bundled Settings
  - Rate Limits
- Library Usage
  - Agentic Adding/Querying Documents
  - Manual (No Agent) Adding/Querying Documents
  - Async
  - Choosing Model
    - Locally Hosted
  - Embedding Model
    - Specifying the Embedding Model
    - Local Embedding Models (Sentence Transformers)
  - Adjusting number of sources
  - Using Code or HTML
  - Multimodal Support
  - Using External DB/Vector DB and Caching
  - Creating Index
    - Manifest Files
  - Reusing Index
  - Using Clients Directly
- Settings Cheatsheet
- Where do I get papers?
- Callbacks
- Caching Embeddings
- Customizing Prompts
  - Pre and Post Prompts
- FAQ
  - How come I get different results than your papers?
  - How is this different from LlamaIndex or LangChain?
  - Can I save or load?
- Reproduction
- Citation
---
<!--TOC-->
## Quickstart
In this example we take a folder of research paper PDFs,
magically get their metadata (including citation counts with a retraction check),
then parse and cache the PDFs into a full-text search index,
and finally answer the user's question with an LLM agent.
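The flow above can be sketched with the high-level `ask` helper. This is a sketch, not a verbatim recipe: `my_papers` is a placeholder directory of PDFs, an `OPENAI_API_KEY` is assumed for the default models, and attribute names on the returned response may vary across versions.

```python
from paperqa import Settings, ask

# Assumptions: PDFs live in ./my_papers and OPENAI_API_KEY is set
# for the default OpenAI LLM and embedding models.
answer_response = ask(
    "Has anyone designed neural networks that compute with proteins or DNA?",
    settings=Settings(paper_directory="my_papers"),
)
# Print the cited answer (attribute path may differ by version).
print(answer_response.session.answer)
```

The same question can be asked from the command line with `pqa ask "..."` run inside the paper directory.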
### Example Output
Question: Has anyone designed neural networks that compute with proteins or DNA?
The claim that neural networks have been designed to compute with DNA is supported by multiple sources.
The work by Qian, Winfree, and Bruck demonstrates the use of DNA strand displacement cascades
to construct neural network components, such as artificial neurons and associative memories,
using a DNA-based system (Qian2011Neural pages 1-2, Qian2011Neural pages 15-16, Qian2011Neural pages 54-56).
This research includes the implementation of a 3-bit XOR gate and a four-neuron Hopfield associative memory,
showcasing the potential of DNA for neural network computation.
Additionally, the application of deep learning techniques to genomics,
which involves computing with DNA sequences, is well-documented.
Studies have applied convolutional neural networks (CNNs) to predict genomic features such as
transcription factor binding and DNA accessibility (Eraslan2019Deep pages 4-5, Eraslan2019Deep pages 5-6).
These models leverage DNA sequences as input data,
effectively using neural networks to compute with DNA.
While the provided excerpts do not explicitly mention protein-based neural network computation,
they do highlight the use of neural networks in tasks related to protein sequences,
such as predicting DNA-protein binding (Zeng2016Convolutional pages 1-2).
However, the primary focus remains on DNA-based computation.
## What is PaperQA2
PaperQA2 is engineered to be the best agentic RAG model for working with scientific papers.
Here are some features:
- A simple interface to get good answers with grounded responses containing in-text citations.
- State-of-the-art implementation including document metadata-awareness
  in embeddings and LLM-based re-ranking and contextual summarization (RCS).
- Support for agentic RAG, where a language agent can iteratively refine queries and answers.
- Automatic redundant fetching of paper metadata,
  including citation and journal quality data from multiple providers.
- A usable full-text search engine for a local repository of PDF/text files.
- A robust interface for customization, with default support for all [LiteLLM][LiteLLM providers] models.
[LiteLLM providers]: https://docs.litellm.ai/docs/providers
[LiteLLM general docs]: https://docs.litellm.ai/docs/
By default, it uses OpenAI embeddings
and models with a NumPy vector DB to embed and search documents.
However, you can easily use other closed-source or open-source models and embeddings (see details below).
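For instance, swapping both the LLMs and the embedding model is a one-object change via `Settings`. This is an illustrative configuration sketch: the model names below are examples, and any LiteLLM-supported model string should work in their place.

```python
from paperqa import Settings

# Illustrative model names; substitute any LiteLLM-supported models.
settings = Settings(
    llm="claude-3-5-sonnet-20240620",          # LLM that generates the answer
    summary_llm="claude-3-5-sonnet-20240620",  # LLM for contextual summaries
    embedding="text-embedding-3-small",        # embedding model for retrieval
)
```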
PaperQA2 depends on some awesome libraries/APIs that make our repo possible.
Here are some in no particular order:
- Semantic Scholar
- Crossref
- Unpaywall
- Pydantic
- tantivy
- [LiteLLM][LiteLLM general docs]
- pybtex
### PaperQA2 vs PaperQA
We've been working hard on fundamental upgrades for a while,
and until December 2025 we mostly followed SemVer,
meaning we incremented the major version number on each breaking change.
This brought us to the current major version, v5.
So why is the repo now called PaperQA2?
We wanted to mark the fact that we've
exceeded human performance on many important metrics.
So we arbitrarily call version 5 and onward PaperQA2,
and versions before it PaperQA1, to denote the significant change in performance.
We recognize that we are challenged at naming and counting at FutureHouse,
so we reserve the right at any time to arbitrarily change the name to PaperCrow.
### PaperQA2 Goes CalVer in December 2025
Prior to December 2025 we used semantic versioning.
This eventually led to confusion in two ways:

1. Developers: should we bump the major version based on
   settings changes or on fundamental system capabilities?
   What if a bug fix requires breaking changes to the agent's behaviors?
2. Speaking: should one use terminology from our publications
   (e.g. PaperQA1, PaperQA2)
   or the Git tags (e.g. v5) from this repo/package?
   When someone says "PaperQA", which version do they mean?
To resolve these confusions, in December 2025
we moved to calendar versioning.
The developer burden is diminished because
we're essentially removing guarantees of backwards compatibility across releases
(as with ZeroVer, CalVer ties versions to dates rather than to compatibility promises).
It solves the "speaking" issue because Git tags are now
quite different from publication terminology (e.g. PaperQA2 vs v2025.12.17).…