---
title: "Getting started with uniprotREST"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with uniprotREST}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

This document will show you the basics of **uniprotREST**. This package uses [httr2](https://httr2.r-lib.org) to wrap the latest UniProt REST API, which was updated in June 2022. I wrote this package as an easy-to-use interface to the API for R users who need to regularly and reproducibly download information from UniProt.

**uniprotREST** has 3 main functions to use:

1. `uniprot_map()` to map _to_ or _from_ UniProt accessions.
1. `uniprot_search()` to perform text search queries.
1. `uniprot_single()` to get detailed information for a single entry.

```r
library(uniprotREST)
```

## 1. ID mapping with `uniprot_map`

This is by far the most frequently used tool. Say hypothetically, you have been given a list of [UniProt accessions](https://www.uniprot.org/help/accession_numbers). You have no clue what proteins they refer to, or what properties these proteins have. You can use `uniprot_map()` to find this out.

```r
# Accessions of interest
aoi <- c("A0A8I6AN81", "A0A0N4SVP8", "Q9H6R0")
```

### Default settings

Here we just use the default settings, which will map the IDs from `UniProtKB_AC-ID` to `UniProtKB`, and output a dataframe.

```r
result1 <- uniprot_map(ids = aoi)
## Running job: 30b29a9ff6a8afbb177c3cd24d2860825a383978
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
```

The job ID is automatically printed (stop printing by setting the `verbosity` argument to 0). Job IDs and the job data are kept by UniProt for approximately 7 days, and are then deleted.
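
Since every call submits a job to UniProt, it can be worth checking that the inputs look like UniProtKB accessions before calling `uniprot_map()`. The sketch below is not part of **uniprotREST**: the helper name `is_uniprot_accession()` is hypothetical, and it simply applies the accession pattern documented on the UniProt accession numbers help page using base R.

```r
# Hypothetical helper (not a uniprotREST function): TRUE for strings that
# match the UniProtKB accession pattern from the UniProt help pages.
is_uniprot_accession <- function(x) {
  pattern <- "^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
  grepl(pattern, x)
}

# The accessions used in this vignette all pass the check
aoi <- c("A0A8I6AN81", "A0A0N4SVP8", "Q9H6R0")
is_uniprot_accession(aoi)

# Inputs that fail the check can be inspected before submitting a job
candidates <- c(aoi, "not_an_id", "ENSG00000088247")
candidates[!is_uniprot_accession(candidates)]
```

Isoform identifiers such as `P04406-2` carry a suffix and will not match this pattern, so treat the check as a rough filter rather than a full validator.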
```r
# All 3 proteins are RNA helicases
head(result1)
##         From      Entry       Entry.Name   Reviewed
## 1 A0A8I6AN81 A0A8I6AN81   A0A8I6AN81_RAT unreviewed
## 2 A0A0N4SVP8 A0A0N4SVP8 A0A0N4SVP8_MOUSE unreviewed
## 3     Q9H6R0     Q9H6R0      DHX33_HUMAN   reviewed
##                                                          Protein.names
## 1                                           RNA helicase (EC 3.6.4.13)
## 2                                           RNA helicase (EC 3.6.4.13)
## 3 ATP-dependent RNA helicase DHX33 (EC 3.6.4.13) (DEAH box protein 33)
##    Gene.Names                Organism Length
## 1        Rig1 Rattus norvegicus (Rat)    881
## 2    Eif4a3l2    Mus musculus (Mouse)    411
## 3 DHX33 DDX33    Homo sapiens (Human)    707
```

By default, the output will be a dataframe with 8 columns:

- `From` = accessions used to map from
- `Entry` = accessions they were mapped to
- `Entry.Name` = UniProtKB entry name
- `Reviewed` = is the protein in Swiss-Prot?
- `Protein.names` = name of protein in UniProtKB
- `Gene.Names` = gene names associated with this protein (can be multiple)
- `Organism` = name of organism the protein is from
- `Length` = amino acid length

The number of rows depends on:

- How many IDs were successfully mapped
- Whether the mapping was 1:1 or not

The output columns can be customised with the `fields` argument.

### Return fields

UniProt has a lot of metadata available for each protein. You can access this in the results by requesting different columns or 'return fields' using the `fields` argument. Here we will request some different return fields. See [Return Fields - UniProt](https://csdaw.github.io/uniprotREST/articles/01_return_fields_uniprot.html) and [Return Fields - Other](https://csdaw.github.io/uniprotREST/articles/02_return_fields_other.html) for lists of all available `fields`.

```r
# Jobs are stored for 7 days
# so subsequent queries will be faster
result2 <- uniprot_map(
  ids = aoi,
  fields = c(
    "gene_primary",
    "organism_name",
    "length",
    "mass"
  )
)
## Running job: 30b29a9ff6a8afbb177c3cd24d2860825a383978
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
```

```r
head(result2)
##         From Gene.Names..primary.                Organism Length   Mass
## 1 A0A8I6AN81                 Rig1 Rattus norvegicus (Rat)    881 101151
## 2 A0A0N4SVP8             Eif4a3l2    Mus musculus (Mouse)    411  46959
## 3     Q9H6R0                DHX33    Homo sapiens (Human)    707  78874
```

### From/to database

`uniprot_map()` can be used to map IDs from other databases to UniProt IDs, and vice versa. See [Databases](https://csdaw.github.io/uniprotREST/articles/03_databases.html) for a list of databases available for mapping, and [From/to Rules](https://csdaw.github.io/uniprotREST/reference/from_to_rules.html) for the rules on which databases can be mapped to which.

Here we'll map some Ensembl gene IDs to _reviewed_ UniProtKB accessions.

```r
# Genes of interest
goi <- c("ENSG00000088247", "ENSG00000162613")
```

The `fields` argument only works when mapping _to_ a UniProtKB, UniRef, or UniParc database.

```r
result3 <- uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  fields = c("accession", "gene_primary")
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
```

```r
head(result3)
##              From  Entry Gene.Names..primary.
## 1 ENSG00000088247 Q92945                KHSRP
## 2 ENSG00000162613 Q96AE4                FUBP1
```

### Format

The UniProt REST API can deliver results in different data formats. The formats available depend on the database being accessed and the **uniprotREST** function being used. See [Formats](https://csdaw.github.io/uniprotREST/articles/04_formats.html) for a full list of available formats.

The **uniprotREST** wrapper functions do not support all formats yet. Each tool currently supports the following formats:

- `uniprot_map()` = `tsv, fasta`
- `uniprot_search()` = `tsv, fasta`
- `uniprot_single()` = `tsv, fasta, json`

Here we'll re-use the `result3` job above, but request the FASTA protein sequences instead. If the `Biostrings` package is installed (highly recommended), the output will be a `Biostrings::AAStringSet`; otherwise it will be a named character vector.
```r
result4 <- uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  format = "fasta"
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
```

```r
result4
## AAStringSet object of length 2:
##     width seq                                               names
## [1]   711 MSDYSTGGPPPGPPPPAGGGGGA...YGQTPGPGGPQPPPTQQGQQQAQ sp|Q92945|FUBP2_H...
## [2]   644 MADYSTVPPPSSGSAGGGGGGGG...QAAYYAQTSPQGMPQHPPAPQGQ sp|Q96AE4|FUBP1_H...
```

### Path

The previous examples all save the data from UniProt into an object in memory. However, you can also save the data to a file on disk. To do this, just specify a file path with the correct extension. The file must not already exist, otherwise an error is thrown.

```r
# Get temp path for this example (and delete when done)
tmp <- tempfile(fileext = ".tsv")
on.exit(unlink(tmp))

# Save results to a tsv file
uniprot_map(
  ids = goi,
  from = "Ensembl",
  to = "UniProtKB-Swiss-Prot",
  fields = c("accession", "gene_primary"),
  format = "tsv",
  path = tmp
)
## Running job: fcc9728d59b6c0e3f618d5cf59445fc83d359a9a
## Checking job status...
## Job complete!
## Downloading: page 1 of 1

# Check file contents
read.delim(tmp)
##              From  Entry Gene.Names..primary.
## 1 ENSG00000088247 Q92945                KHSRP
## 2 ENSG00000162613 Q96AE4                FUBP1
```

### Other arguments

The other arguments in `uniprot_map()` are as follows:

#### Isoform

By default, the UniProt APIs will only provide results with a protein's [canonical sequence](https://www.uniprot.org/help/canonical_and_isoforms). If you set `isoform = TRUE`, then isoform sequences will be included as well. This is typically only relevant when `format = "fasta"`, although I have run into some exceptions.

Here we get the canonical and isoform sequences for human GAPDH.

```r
result5 <- uniprot_map(
  ids = "P04406",
  format = "fasta",
  isoform = TRUE
)
## Running job: 5feb5aff87ab072e97ab3ab620e270f19b36e103
## Checking job status...
## Job complete!
## Downloading: page 1 of 1
```

```r
result5
## AAStringSet object of length 2:
##     width seq                                               names
## [1]   335 MGKVKVGVNGFGRIGRLVTRAAF...WYDNEFGYSNRVVDLMAHMASKE sp|P04406|G3P_HUM...
## [2]   293 MVYMFQYDSTHGKFHGTVKAENG...WYDNEFGYSNRVVDLMAHMASKE sp|P04406-2|G3P_H...
```

#### Method and page_size

The UniProt API provides results via 2 endpoints, stream and pagination, which you can choose between via the `method` argument. By default, `uniprot_map()` and `uniprot_search()` use `method = "paged"`, which is more robust but slightly slower, with the recommended default `page_size` of 500. `uniprot_single()`, in contrast, only uses the stream endpoint.

Paged endpoint:

- Slightly slower.
- Processes results in chunks, so it is much more robust to connection issues.
- Can theoretically handle more than 10,000,000 results.

Stream endpoint:

- Slightly faster.
- Expensive for the API, uses a lot of memory.
- Can return a `429` status error if it currently has too many requests.
- Up to 10,000,000 results can be fetched.

#### Compressed

Should gzipped data be requested? This is `FALSE` by default, and it is only used if `method = "stream"` and `path` is specified. For example:

```r
# Get temp path for this example (and delete when done)
tmp <- tempfile(fileext = ".fasta.gz")
on.exit(unlink(tmp))

# Save results to a gzipped FASTA file
uniprot_map(
  ids = "P04406",
  format = "fasta",
  isoform = TRUE,
  method = "stream",
  path = tmp,
  compressed = TRUE
)
## Running job: 5feb5aff87ab072e97ab3ab620e270f19b36e103
## Checking job status...
## Job complete!
## Downloading: 420 B

# Check file contents
Biostrings::readAAStringSet(tmp)
## AAStringSet object of length 2:
##     width seq                                               names
## [1]   335 MGKVKVGVNGFGRIGRLVTRAAF...WYDNEFGYSNRVVDLMAHMASKE sp|P04406|G3P_HUM...
## [2]   293 MVYMFQYDSTHGKFHGTVKAENG...WYDNEFGYSNRVVDLMAHMASKE sp|P04406-2|G3P_H...
```

#### Verbosity

Controls the amount of information to print:

- Use `verbosity = 0` to not print anything.
- Use `verbosity = 1`, `2`, or `3` to print increasing amounts of information about the HTTP requests made to the UniProt API (typically for debugging purposes).

#### Dry_run

If `TRUE`, performs the request locally with `httr2::req_dry_run()` instead of actually sending it to the UniProt REST API. This is useful for debugging if you are getting `400 - Bad request` status errors.

## 2. Querying UniProt with `uniprot_search`

This function is used to perform text searches against UniProt, akin to using the search bar on their website. The different databases available from the search bar are also available via `uniprot_search()` (see [Databases](https://csdaw.github.io/uniprotREST/articles/03_databases.html)).

It's very important that the search string is constructed correctly; see [this page](https://www.uniprot.org/help/text-search) for help building queries. If you get a `400 - Bad request` error, it's likely your search string is not formatted correctly.

Here we'll search the human reference proteome (`UP000005640`) for proteins that are annotated with the glycoprotein keyword (`KW-0325`) and are shorter than 100 amino acids.

```r
result6 <- uniprot_search(
  query = "(proteome:UP000005640) AND (keyword:KW-0325) AND (length<100)",
  database = "uniprotkb",
  format = "tsv",
  fields = c("accession", "gene_primary")
)
## Downloading: page 1 of 1
```

```r
head(result6)
##    Entry Gene.Names..primary.
## 1 P06028                 GYPB
## 2 P80098                 CCL7
## 3 Q16627                CCL14
## 4 P0DMC3                APELA
## 5 P25063                 CD24
## 6 P31358                 CD52
```

UniProt databases other than UniProtKB are available to query as well. In this example we'll look for proteomes that match the word 'dog'.
```r
result7 <- uniprot_search(
  "dog",
  database = "proteomes",
  format = "tsv",
  fields = c("upid", "organism")
)
## Downloading: page 1 of 1
```

```r
head(result7)
##   Proteome.Id
## 1 UP000252519
## 2 UP000645828
## 3 UP000805418
## 4 UP000029752
## 5 UP000201396
## 6 UP000277561
##                                                                                            Organism
## 1                                                                Ancylostoma caninum (Dog hookworm)
## 2                                       Nyctereutes procyonoides (Raccoon dog) (Canis procyonoides)
## 3                                                   Canis lupus familiaris (Dog) (Canis familiaris)
## 4 Cadicivirus A (isolate Dog/Hong Kong/209/2008) (CaPdV-1) (Canine picodicistrovirus (isolate 209))
## 5                                                                             Raccoon dog amdovirus
## 6                                                                  Human associated gemykibivirus 2
```

## 3. Retrieving an entry with `uniprot_single`

`uniprot_single()` is used to quickly retrieve information about a single entry in UniProt. By default the `json` format is requested and is parsed into a list which contains _all_ information available for that particular entry. All other arguments work the same as `uniprot_search()`. For example:

```r
result8 <- uniprot_single(
  id = "P99999",
  verbosity = 0
)
```

```r
str(result8, max.level = 1)
## List of 17
##  $ entryType               : chr "UniProtKB reviewed (Swiss-Prot)"
##  $ primaryAccession        : chr "P99999"
##  $ secondaryAccessions     :List of 6
##  $ uniProtkbId             : chr "CYC_HUMAN"
##  $ entryAudit              :List of 5
##  $ annotationScore         : num 5
##  $ organism                :List of 4
##  $ proteinExistence        : chr "1: Evidence at protein level"
##  $ proteinDescription      :List of 1
##  $ genes                   :List of 1
##  $ comments                :List of 10
##  $ features                :List of 36
##  $ keywords                :List of 14
##  $ references              :List of 19
##  $ uniProtKBCrossReferences:List of 179
##  $ sequence                :List of 5
##  $ extraAttributes         :List of 3
```

Again, there are other UniProt databases available apart from UniProtKB (see [Databases](https://csdaw.github.io/uniprotREST/articles/03_databases.html)).
For example UniParc:

```r
result9 <- uniprot_single(
  id = "UPI0001C61C61",
  database = "uniparc",
  verbosity = 0
)
```

```r
str(result9, max.level = 1)
## List of 6
##  $ uniParcId                : chr "UPI0001C61C61"
##  $ uniParcCrossReferences   :List of 4
##  $ sequence                 :List of 5
##  $ sequenceFeatures         :List of 7
##  $ oldestCrossRefCreated    : chr "2010-03-03"
##  $ mostRecentCrossRefUpdated: chr "2023-06-28"
```
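
Because `uniprot_single()` returns a nested list, you will usually want to pull out just a few elements. The sketch below shows one way to do this with base R. To keep it runnable offline it uses a small hand-built stand-in list called `entry` (hypothetical) instead of a real API result. The top-level names are taken from the `str()` output above, while the nested names (`scientificName`, `taxonId`, `value`, `length`) are assumptions about the UniProt JSON layout and should be checked against a real result.

```r
# Hypothetical, trimmed-down stand-in for a list returned by uniprot_single().
# Nested field names are assumptions; the sequence is a short placeholder,
# not the real protein sequence.
entry <- list(
  primaryAccession = "P99999",
  uniProtkbId = "CYC_HUMAN",
  organism = list(scientificName = "Homo sapiens", taxonId = 9606),
  sequence = list(value = "MGDVEKGKK", length = 9)
)

# Pull out individual elements with $ or [[ ]]
entry$primaryAccession
entry[["organism"]][["scientificName"]]

# Flatten a few elements of interest into a one-row data frame
summary_df <- data.frame(
  accession = entry$primaryAccession,
  entry_name = entry$uniProtkbId,
  organism = entry$organism$scientificName,
  seq_length = entry$sequence$length
)
summary_df
```

With a real result, running `str()` with a larger `max.level` on the element of interest is a quick way to confirm the nested names before extracting them.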