Title: | R wrapper for the UniProt website REST API |
---|---|
Description: | Simple functions to access the UniProt website REST API. Wraps httr2 functions to easily map IDs and query all available databases. |
Authors: | Charlotte Dawson [aut, cre] |
Maintainer: | Charlotte Dawson <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2025-02-19 03:44:08 UTC |
Source: | https://github.com/csdaw/uniprotREST |
This function performs a request for data from the UniProt REST API, fetches the results using pagination, and saves them to a file or into memory.
You likely won't use this function directly, but rather one of the wrapper functions: uniprot_map(), uniprot_search(), or uniprot_single().
The pagination endpoint is less expensive for the API infrastructure to handle than the stream endpoint used by fetch_stream(), because the memory demand is spread over a longer period of time.
If the connection is interrupted while fetching results with pagination, only the request for the current page needs to be reattempted. With the stream endpoint, the entire request needs to be completely restarted.
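The retry behaviour described above can be illustrated without touching the network. The sketch below is a toy simulation in base R, not the package's implementation: fetch_page() and fetch_all_pages() are hypothetical helpers, and the "server" is a local list that fails once on page 2. It only shows that a failure costs one extra request for the affected page rather than a full restart.

```r
# Toy "server": three pages of results held in memory (made-up data).
pages <- list(c("A", "B"), c("C", "D"), c("E"))

# Counts how many times each page has been requested.
attempts <- integer(length(pages))

# Hypothetical page fetcher that fails the first time page 2 is requested,
# imitating a dropped connection.
fetch_page <- function(i) {
  attempts[i] <<- attempts[i] + 1L
  if (i == 2L && attempts[i] == 1L) stop("connection interrupted")
  pages[[i]]
}

# Fetch every page in turn, retrying only the page that failed.
fetch_all_pages <- function(n_pages, max_tries = 3L) {
  out <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    res <- NULL
    for (try_no in seq_len(max_tries)) {
      res <- tryCatch(fetch_page(i), error = function(e) NULL)
      if (!is.null(res)) break
    }
    if (is.null(res)) stop("page ", i, " failed after ", max_tries, " tries")
    out[[i]] <- res
  }
  unlist(out)
}

results <- fetch_all_pages(3)
attempts  # pages 1 and 3 were requested once, page 2 twice
```

With a stream-style request the equivalent failure would mean requesting all of the results again.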
    fetch_paged(req, n_pages, format = "tsv", path = NULL, verbosity = NULL)
req |
|
n_pages |
|
format |
|
path |
|
verbosity |
|
By default, returns an object whose type depends on format:
tsv: data.frame
json: list
fasta: Biostrings::AAStringSet (or named character if Biostrings not installed)
If parse = FALSE, returns an httr2_response. If path is specified, saves the parsed results to the file path indicated, and returns NULL invisibly.
    ## Not run:
    req <- uniprot_request(
      "https://rest.uniprot.org/uniref/search",
      query = "P99999",
      format = "tsv",
      fields = "id,name,count",
      size = 1
    )
    fetch_paged(req, n_pages = 3)
    ## End(Not run)
This function performs a request for data from the UniProt REST API, fetches the results using the stream endpoint, and saves them to a file or into memory.
You likely won't use this function directly, but rather one of the wrapper functions: uniprot_map(), uniprot_search(), or uniprot_single().
The stream endpoint is expensive for the API to process. If this endpoint receives too many requests, a 429 status error will occur. In this case, use fetch_paged() or try again later.
Up to 10,000,000 results can be fetched via stream. If you want more results than that, use fetch_paged(), or consider downloading some datasets from UniProt's FTP website.
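The advice to switch to pagination after a 429 can be sketched as a small wrapper. This is only an illustration: fetch_with_fallback() is a hypothetical helper, and it assumes the 429 reaches the caller as an httr2 condition of class httr2_http_429, which may not hold if the error is re-wrapped along the way. Because the stream and search endpoints use different URLs, the two fetchers are passed in as functions so that each can carry its own request.

```r
# Hypothetical helper: try the stream-based fetcher first and fall back to
# the paged fetcher only when a 429 (too many requests) error is signalled.
# `stream_fun` and `paged_fun` are zero-argument functions supplied by the
# caller, e.g. function() fetch_stream(stream_req) and
# function() fetch_paged(search_req, n_pages = 10).
fetch_with_fallback <- function(stream_fun, paged_fun) {
  tryCatch(
    stream_fun(),
    # Assumption: the error carries the class "httr2_http_429".
    # Any other error is left to propagate unchanged.
    httr2_http_429 = function(e) {
      message("Stream endpoint is busy (429); falling back to pagination.")
      paged_fun()
    }
  )
}
```

Other errors (for example a malformed query) are deliberately not caught, since retrying them through a different endpoint would not help.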
    fetch_stream(req, format = "tsv", parse = TRUE, path = NULL, verbosity = NULL)
req |
|
format |
|
parse |
|
path |
|
verbosity |
|
By default, returns an object whose type depends on format:
tsv: data.frame
json: list
fasta: Biostrings::AAStringSet (or named character if Biostrings not installed)
If parse = FALSE, returns an httr2_response. If path is specified, saves the parsed results to the file path indicated, and returns NULL invisibly.
    ## Not run:
    req <- uniprot_request(
      "https://rest.uniprot.org/uniref/stream",
      query = "P99999",
      format = "tsv",
      fields = "id,name,count"
    )
    fetch_stream(req)
    ## End(Not run)
This data frame contains the available download file types (i.e. the valid values for format) for each UniProt database.
See Examples below for how to use this object.
    formats
An object of class data.frame with 43 rows and 3 columns.
Columns:
func | character, uniprotREST function to be used. |
database | character, UniProt database to be queried. |
format | character, the different formats available to download i.e. the string you'll use with the uniprotREST function. |
UniProtKB download formats have been determined by hand by querying each REST API database with an unavailable format (usually txt) and determining the allowed formats from the resulting error response.
    # What UniProt databases are available to query?
    levels(formats$database)[1:12]
    # What formats are available for querying the `proteomes` database,
    # using `uniprot_search()`
    formats[formats$database == "proteomes" & formats$func == "search", "format"]
This data frame contains details of the databases that can be mapped from/to with uniprot_map(). See from_to_rules for the rules on which database identifiers can be mapped to which other databases.
    from_to_dbs
An object of class data.frame with 100 rows and 5 columns.
Columns:
name | character, database name. |
from | character, is the database allowed to be mapped from? |
to | character, is the database allowed to be mapped to? |
url | character, database URL. |
formats_db | factor, relevant formats database, see formats for which download formats are available. |
From/to pairs have been downloaded according to the "Valid from and to databases pairs" section on the UniProt ID Mapping page.
This list contains the valid from/to pairings that can be used with uniprot_map(). See Examples below for how to use this object.
Also see from_to_dbs for more information on each database.
    from_to_rules
An object of class list of length 98.
From/to pairs have been downloaded according to the "Valid from and to databases pairs" section on the UniProt ID Mapping page.
    # Show valid `from` values
    names(from_to_rules)
    # Show valid `to` values for a given `from` e.g. SGD
    from_to_rules[["SGD"]]
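Building on the lookups above, a from/to pair could be checked before a mapping request is sent. The following is a minimal sketch: is_valid_pair() is a hypothetical helper, and it assumes each element of the list is a character vector of allowed `to` databases, as the example output suggests. A toy list with made-up database names stands in for from_to_rules so the snippet runs without the package.

```r
# Hypothetical helper: is `to` a valid target for `from` according to a
# rules list shaped like from_to_rules (names = `from` databases, each
# element = character vector of allowed `to` databases)?
is_valid_pair <- function(rules, from, to) {
  from %in% names(rules) && to %in% rules[[from]]
}

# Toy rules list for illustration only; the database names are made up.
toy_rules <- list(
  DB_A = c("DB_B", "DB_C"),
  DB_B = "DB_A"
)

is_valid_pair(toy_rules, "DB_A", "DB_C")  # allowed pairing
is_valid_pair(toy_rules, "DB_B", "DB_C")  # `to` not allowed for this `from`
is_valid_pair(toy_rules, "DB_X", "DB_A")  # unknown `from`
```

With the package loaded, the same helper could be called as is_valid_pair(from_to_rules, from, to).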
This function gets FASTA data from an httr2_response and either saves it to a file or parses it in memory into a Biostrings::AAStringSet or named character vector.
    resp_body_fasta(resp, con = NULL, encoding = NULL)
resp |
|
con |
|
encoding |
|
By default, returns a Biostrings::AAStringSet object. If the Biostrings package is not installed, returns a named character vector.
If con is not NULL, returns nothing and saves the FASTA sequences to the file specified by con.
    resp <- structure(
      list(
        method = "GET",
        url = "https://example.com",
        body = charToRaw(">Protein1\nAAA\n>Protein2\nCCC")
      ),
      class = "httr2_response"
    )
    resp_body_fasta(resp)
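To make the "named character vector" return type concrete, here is a rough base R sketch of turning FASTA text into that shape. It is not the code used by resp_body_fasta(): parse_fasta() is a hypothetical helper, and using the whole header line (minus the leading ">") as the name is an assumption.

```r
# Hypothetical helper: parse FASTA text into a named character vector.
parse_fasta <- function(txt) {
  lines <- strsplit(txt, "\r?\n")[[1]]
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  # Each header starts a new record; sequence lines inherit its index.
  record <- factor(
    cumsum(is_header)[!is_header],
    levels = seq_len(sum(is_header))
  )
  # Join the (possibly wrapped) sequence lines of each record.
  seqs <- vapply(
    split(lines[!is_header], record),
    paste, character(1), collapse = ""
  )
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

fasta_txt <- rawToChar(charToRaw(">Protein1\nAAA\n>Protein2\nCCC"))
parsed <- parse_fasta(fasta_txt)
parsed
```

Sequences wrapped over several lines are concatenated, so ">x\nAA\nBB" yields a single element named x.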
This function gets tab-delimited data from an httr2_response and either saves it to a file or parses it into a data.frame in memory.
    resp_body_tsv(resp, page = NULL, con = NULL, encoding = NULL)
resp |
|
page |
|
con |
|
encoding |
|
By default, returns a data.frame. If con is not NULL, returns nothing and saves tab-delimited text to the file specified by con.
    resp <- structure(
      list(
        method = "GET",
        url = "https://example.com",
        body = charToRaw("Entry\tGene Names (primary)\nP99999\tCYCS\n")
      ),
      class = "httr2_response"
    )
    resp_body_tsv(resp)
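For readers curious what parsing such a body involves, the snippet below reads the same tab-delimited bytes with base R alone. It is an approximation for illustration, not the package's implementation; in particular, whether resp_body_tsv() keeps column labels such as "Gene Names (primary)" unmodified is an assumption here.

```r
# Raw bytes as they might arrive in a response body.
body <- charToRaw("Entry\tGene Names (primary)\nP99999\tCYCS\n")

# Decode to text and read as tab-delimited data.
df <- read.delim(
  text = rawToChar(body),
  check.names = FALSE,       # keep headers exactly as sent by the API
  stringsAsFactors = FALSE
)

names(df)
df$Entry
```

Setting check.names = FALSE matters because the API's human-readable labels contain spaces and parentheses, which R would otherwise rewrite into syntactic names.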
This data frame contains the valid fields (i.e. columns) of data you can request from UniProt. The strings in the field column are what you'll use with the functions in this package.
See Examples below for how to use this object.
    return_fields
An object of class data.frame with 389 rows and 4 columns.
Columns:
database | character, UniProt database to be queried. |
section | character, similar return fields are grouped together in sections. |
field | character, the return field i.e. the string used to request the desired column of information. |
label | character, human readable column name that will be returned by the API. |
UniProtKB return fields have been scraped from the UniProtKB return fields page. The return fields from other UniProt databases have been determined by hand using Web Developer Tools (F12) to inspect the GET request made when searching the different databases on the UniProt website.
    # What UniProt databases are available to query?
    levels(return_fields$database)
    # What fields are available for the `proteomes` database?
    return_fields[return_fields$database == "proteomes", "field"]
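The same table can be used to catch misspelled field names before a request is made. The sketch below assumes only the database and field columns described above; unknown_fields() is a hypothetical helper, and a small toy table stands in for return_fields so the code runs without the package installed.

```r
# Hypothetical helper: return the requested fields that are NOT listed for
# `database` in a table shaped like return_fields.
unknown_fields <- function(fields_table, database, fields) {
  known <- fields_table$field[fields_table$database == database]
  setdiff(fields, known)
}

# Toy table for illustration; with the package loaded you would pass
# return_fields instead.
toy_fields <- data.frame(
  database = c("uniprotkb", "uniprotkb", "proteomes"),
  field    = c("accession", "gene_primary", "upid"),
  stringsAsFactors = FALSE
)

bad <- unknown_fields(toy_fields, "uniprotkb", c("accession", "gene_name"))
bad
```

An empty result means every requested field is listed for that database.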
This function wraps the UniProt ID Mapping service, which maps the identifiers used in one database to the identifiers of another. By default, it maps UniProtKB accessions and IDs to UniProtKB and returns a data.frame with metadata about the mapped protein accessions. You can also map IDs from/to other databases, e.g. from = "Ensembl", to = "UniProtKB".
This service has limits on the number of IDs allowed. Very large mapping requests are likely to fail. Try to split your queries into smaller chunks in case of problems.
100,000 = maximum number of input ids allowed
500,000 = maximum number of entries that will be output
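Splitting a large ID vector into smaller chunks, as suggested above, needs nothing beyond base R. In this sketch chunk_ids() is a hypothetical helper, the default chunk size of 10,000 is an arbitrary choice well below the documented limit, and the IDs are made-up placeholders.

```r
# Hypothetical helper: split a vector of IDs into chunks of at most
# `chunk_size` elements, preserving order.
chunk_ids <- function(ids, chunk_size = 10000L) {
  stopifnot(chunk_size >= 1)
  groups <- ceiling(seq_along(ids) / chunk_size)
  unname(split(ids, groups))
}

ids <- sprintf("ID%05d", 1:25)  # made-up placeholder IDs
chunks <- chunk_ids(ids, chunk_size = 10)
lengths(chunks)

# Each chunk could then be mapped in turn and the results combined.
# Assuming the default tsv format (one data.frame per call), for example:
# res <- do.call(rbind, lapply(chunks, uniprot_map))
```

Smaller chunks mean more requests but make it cheaper to repeat any single request that fails.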
    uniprot_map(
      ids,
      from = "UniProtKB_AC-ID",
      to = "UniProtKB",
      format = "tsv",
      path = NULL,
      fields = NULL,
      isoform = NULL,
      method = "paged",
      page_size = 500,
      compressed = NULL,
      verbosity = NULL,
      dry_run = FALSE
    )
ids |
|
from |
|
to |
|
format |
|
path |
|
fields |
|
isoform |
|
method |
|
page_size |
|
compressed |
|
verbosity |
|
dry_run |
|
By default, returns an object whose type depends on format:
tsv: data.frame
fasta: Biostrings::AAStringSet (or named character if Biostrings not installed)
If path is specified, saves the results to the file path indicated, and returns NULL invisibly. If dry_run = TRUE, returns a list containing information about the request, including the request method, path, and headers.
Other API wrapper functions: uniprot_search(), uniprot_single()
    ## Not run:
    # Default, get info about UniProt IDs
    uniprot_map(
      "P99999",
      format = "tsv",
      fields = c("accession", "gene_primary", "feature_count")
    )
    # Other common use, mapping other IDs to UniProt
    # (or vice-versa)
    uniprot_map(
      c("ENSG00000088247", "ENSG00000162613"),
      from = "Ensembl",
      to = "UniProtKB"
    )
    ## End(Not run)
This function creates an httr2::request object.
You likely won't use this function directly, but rather one of the wrapper functions: uniprot_map(), uniprot_search(), or uniprot_single().
    uniprot_request(url, method = "GET", ..., max_tries = 5, rate = 1/1)
url |
|
method |
|
... |
Name-value pairs that provide query parameters. Each value must be
either a length-1 atomic vector (which is automatically escaped) or |
max_tries |
|
rate |
|
Returns an httr2_request object (which is essentially a list).
    # Construct a request for the accession and
    # gene name for human Cytochrome C
    uniprot_request(
      url = "https://rest.uniprot.org/uniprotkb/P99999",
      fields = "accession,gene_primary",
      format = "tsv"
    )
Search UniProt via its REST API. By default, it searches the supplied query against UniProtKB and returns a data.frame of matching proteins.
It is a wrapper for this UniProt API endpoint.
    uniprot_search(
      query,
      database = "uniprotkb",
      format = "tsv",
      path = NULL,
      fields = NULL,
      isoform = NULL,
      method = "paged",
      page_size = 500,
      compressed = NULL,
      verbosity = NULL,
      dry_run = FALSE
    )
query |
|
database |
|
format |
|
path |
|
fields |
|
isoform |
|
method |
|
page_size |
|
compressed |
|
verbosity |
|
dry_run |
|
By default, returns an object whose type depends on format:
tsv: data.frame
fasta: Biostrings::AAStringSet (or named character if Biostrings not installed)
If path is specified, saves the results to the file path indicated, and returns NULL invisibly. If dry_run = TRUE, returns a list containing information about the request, including the request method, path, and headers.
The following databases are available to query:
uniprotkb: UniProt Knowledge Base
uniref: UniProt Reference Clusters
uniparc: UniProt Archive
proteomes: Reference proteomes
taxonomy: Taxonomy
keywords: Keywords
citations: Literature references
diseases: Disease queries
database: Cross references
locations: Subcellular location
unirule: UniRule
arba: ARBA (Association-Rule-Based Annotator)
Other API wrapper functions: uniprot_map(), uniprot_single()
    ## Not run:
    # Search for all human glycoproteins from SwissProt
    res <- uniprot_search(
      query = "(proteome:UP000005640) AND (keyword:KW-0325) AND (reviewed:true)",
      database = "uniprotkb",
      format = "tsv",
      fields = c("accession", "gene_primary", "feature_count")
    )
    # Look at the resulting dataframe
    head(res)
    ## End(Not run)
Get a single entry from UniProt.
By default, it fetches the entry using format = "json" and returns a list of information, but different formats are available from different databases. It is a wrapper for this UniProt API endpoint.
    uniprot_single(
      id,
      database = "uniprotkb",
      format = "json",
      path = NULL,
      fields = NULL,
      isoform = NULL,
      verbosity = NULL,
      dry_run = FALSE
    )
id |
|
database |
|
format |
|
path |
|
fields |
|
isoform |
|
verbosity |
|
dry_run |
|
By default, returns an object whose type depends on format:
tsv: data.frame
json: list
fasta: Biostrings::AAStringSet (or named character if Biostrings not installed)
If path is specified, saves the results to the file path indicated, and returns NULL invisibly. If dry_run = TRUE, returns a list containing information about the request, including the request method, path, and headers.
The following databases are available to query:
uniprotkb: UniProt Knowledge Base
uniref: UniProt Reference Clusters
uniparc: UniProt Archive
proteomes: Reference proteomes
taxonomy: Taxonomy
keywords: Keywords
citations: Literature references
diseases: Disease queries
database: Cross references
locations: Subcellular location
unirule: UniRule
arba: ARBA (Association-Rule-Based Annotator)
Other API wrapper functions: uniprot_map(), uniprot_search()
    ## Not run:
    # Download the entry for human Cytochrome C
    res <- uniprot_single("P99999", format = "json")
    # Look at the structure of the resulting list
    str(res, max.level = 1)
    ## End(Not run)