FTP Service
Base URL: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc
The PMC FTP Service provides access to:
- index files and files for the articles in the PMC Open Access Subset;
- the Author Manuscript Collection and its index files; and
- the Historical OCR Collection.
The FTP service also allows users to cross-reference PMC articles with identifiers such as PubMed IDs, DOIs, and Manuscript IDs.
Please note the following:
- After a series of experiments using FTP clients with NCBI's FTP server, we've found that the configuration of FTP clients can seriously affect performance. NCBI recommends setting the TCP buffer size to 32Mb. For more information, please see ftp://ftp.ncbi.nlm.nih.gov/README.ftp.
- To access the complete OA Subset you will need to use the Commercial Use and Non-Commercial Use Collections. These collections complement each other, rather than duplicating files.
- In order to prevent any one FTP folder from having thousands of files, the .tgz and .pdf files in the oa_package and pa_pdf directories are distributed randomly in a two-level-deep structure. There are two ways to locate a specific article on the FTP site:
  - Use one of the file index lists described below.
  - Use the OA web service, which provides an API to locate articles on the FTP site by PMCID, or by an update date range.
If you have questions or comments about the FTP service, please write to the PMC help desk. Further information on retrieving full text and other common developer queries can be found on the Developer Resources page.
Index Files for the PMC Open Access Subset
The FTP site includes six index files to assist with locating an open access article on the FTP site. Search these index files for either a PMC accession number (PMCID) or a PubMed ID (PMID). The matching entry will point you to the specific FTP directory and file name for the article.
.txt Index Files
Filename | Location | Content of Index File |
---|---|---|
oa_file_list.txt | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt | Complete Open Access Subset |
oa_comm_use_file_list.txt | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file_list.txt | Commercial Use Collection (i.e., Open Access Subset articles with a machine-readable CC BY or CC0 license) |
oa_non_comm_use_pdf.txt | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_non_comm_use_pdf.txt | Non-Commercial Use PDF Collection (i.e., Open Access Subset articles with machine-readable non-commercial use licenses that have PDFs) |
The first line of each .txt index file gives the date and time at which it was last generated. Every subsequent line contains information about one article in PMC.
Each of these lines is divided into 5 fields, delimited by tab characters. For example:
oa_package/66/8b/PMC555938.tar.gz BMC Bioinformatics. 2005 Mar 7; 6:44 PMC555938 PMID:15748298 CC BY
The 5 fields are:
- The fully qualified name of the .tar.gz file for an article
- The article citation, comprising the journal title abbreviation, publication date, volume, issue, and the page range or elocation ID
- PMC accession number (PMCID)
- PubMed ID (PMID)
- License type*
* The field value for “license type” can be any of the standard Creative Commons license variants (e.g., CC BY; CC BY-NC; CC BY-NC-ND) or “NO-CC CODE”. “NO-CC CODE” appears when the license is missing, has custom terms (i.e., not a Creative Commons license), or is not machine decodable.
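As a rough illustration of the format described above, the following Python sketch splits one line of a .txt index file into its five fields and scans a sequence of lines for a given PMCID. The names `IndexEntry`, `parse_txt_index_line`, and `find_by_pmcid` are hypothetical and are not part of any NCBI tooling. The sketch assumes the first line of the file is the generation timestamp and that every other non-blank line has exactly five tab-delimited fields.

```python
from typing import Iterable, NamedTuple, Optional


class IndexEntry(NamedTuple):
    """One article line from a .txt index file (hypothetical helper type)."""
    path: str          # name of the .tar.gz file, e.g. oa_package/66/8b/...
    citation: str      # journal abbreviation, date, volume, issue, pages
    pmcid: str         # PMC accession number, e.g. PMC555938
    pmid: str          # PubMed ID as written in the file, e.g. PMID:15748298
    license_type: str  # e.g. CC BY, CC BY-NC, NO-CC CODE


def parse_txt_index_line(line: str) -> IndexEntry:
    """Split a tab-delimited index line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-delimited fields, got {len(fields)}")
    return IndexEntry(*fields)


def find_by_pmcid(lines: Iterable[str], pmcid: str) -> Optional[IndexEntry]:
    """Return the first entry whose PMCID matches, or None if absent."""
    iterator = iter(lines)
    next(iterator, None)  # assumption: the first line is the timestamp header
    for line in iterator:
        if not line.strip():
            continue  # tolerate blank lines
        entry = parse_txt_index_line(line)
        if entry.pmcid == pmcid:
            return entry
    return None
```

Because `find_by_pmcid` accepts any iterable of lines, it can be given an open file object for a downloaded copy of an index file, so the whole file does not have to be held in memory.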
.csv Index Files
Filename | Location | Content of Index File |
---|---|---|
oa_file_list.csv | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv | Complete Open Access Subset |
oa_comm_use_file_list.csv | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file_list.csv | Commercial Use Collection (i.e., articles with a machine-readable CC BY or CC0 license) |
oa_non_comm_use_pdf.csv | ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_non_comm_use_pdf.csv | Non-Commercial Use PDF Collection (i.e., articles with machine-readable non-commercial use licenses that have PDFs) |
The metadata fields are the same as in the .txt index files, except that they are separated by commas and include an additional timestamp indicating the last update to the article in PMC. The timestamp appears before the PMID. For example:
oa_package/d2/6d/PMC2137107.tar.gz,Environ Health Perspect. 2007 Dec; 115(12):A580a,PMC2137107,2014-05-16 12:59:15,18087575,CC0
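A line in this format could be read along the following lines. This is a minimal Python sketch, not an official parser. The function name `parse_csv_index_line` and the dictionary keys are hypothetical, and the timestamp layout `YYYY-MM-DD HH:MM:SS` is inferred from the single example above. The standard `csv` module is used instead of a plain `split(",")` on the assumption that a citation containing a comma would be quoted in the file.

```python
import csv
from datetime import datetime


def parse_csv_index_line(line: str) -> dict:
    """Parse one article line of a .csv index file into named fields."""
    row = next(csv.reader([line.rstrip("\r\n")]))
    if len(row) != 6:
        raise ValueError(f"expected 6 comma-separated fields, got {len(row)}")
    path, citation, pmcid, updated, pmid, license_type = row
    return {
        "path": path,
        "citation": citation,
        "pmcid": pmcid,
        # Timestamp format inferred from the example line (an assumption).
        "last_updated": datetime.strptime(updated, "%Y-%m-%d %H:%M:%S"),
        "pmid": pmid,
        "license_type": license_type,
    }
```

In the example line, the PMID appears as a bare number (18087575) rather than with the "PMID:" prefix used in the .txt example, so the sketch keeps the value as a string without altering it.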
Directories and File Formats
The directories available via the FTP service include:
Directory | Contents | Format |
---|---|---|
oa_package | Open access individual articles packages | .tar.gz including |
oa_pdf | Open access individual article PDFs available for non-commercial use* | .pdf – same PDF as found in the oa_package tar.gz file |
oa_bulk | Open access bulk articles packages for either XML or extracted text. Divided into two collections: | .tar.gz |
manuscript | | .tar.gz including: |
historical_ocr | Extracted text of | .tar.gz |
* To access PDFs that allow commercial use, use the oa_package directory to download articles you confirm are part of the Commercial Use Collection.
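The footnote above can be sketched in Python as a license check followed by URL construction. This is only an illustration under stated assumptions: the set of commercial-use licenses is taken literally from the index file descriptions (CC BY or CC0), the package path from an index file is assumed to be relative to the base URL given at the top of this page, and the names `is_commercial_use` and `package_url` are hypothetical.

```python
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc"

# Assumption: taken from the index file descriptions, which characterize the
# Commercial Use Collection as articles with a CC BY or CC0 license.
COMMERCIAL_USE_LICENSES = {"CC BY", "CC0"}


def is_commercial_use(license_type: str) -> bool:
    """Return True if the license field marks a commercial-use article."""
    return license_type.strip() in COMMERCIAL_USE_LICENSES


def package_url(relative_path: str) -> str:
    """Join the base URL and a package path taken from an index file."""
    return f"{BASE_URL}/{relative_path.lstrip('/')}"


if __name__ == "__main__":
    # Values copied from the .csv example line on this page.
    path, license_type = "oa_package/d2/6d/PMC2137107.tar.gz", "CC0"
    if is_commercial_use(license_type):
        print(package_url(path))
```

A stricter way to confirm membership would be to look the PMCID up in oa_comm_use_file_list.txt or oa_comm_use_file_list.csv rather than relying on the license field alone.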