feat: added ability to download files in parallel (#88)

## changes

- [x] ability to download multiple files in parallel using ~~asyncio~~ concurrent.futures
- [x] docs improvements
- [x] added changelog to docs
- [x] added docs on how to download files
- [x] added methods to download files from the CLI
- [x] added docs on platform API features
- [x] ~~download_database should use download_files, not download_file~~ next sprint
- [x] download_files should allow users to download to a specific folder -- no need to specify all names
- [x] download_files should work with no args -- and it will download all files
- [x] dropped support for python 3.9

## technical discussion

there are two ways we can go about downloading files. one is to use asyncio and the other is to use concurrent.futures

i initially implemented everything with asyncio, but decided to switch to concurrent.futures because

- mixing async with sync code leads to a lot of boilerplate and repeated code
- asyncio code doesn't work natively in jupyter notebooks. we have to use a 3rd party package called nest_asyncio to get it to work, which is more overhead
- technically asyncio should be lighter and faster, but i didn't see this in practice
- using asyncio also requires us to use the low-level async client, which leads to more doubling of work and code
- using the low-level asyncio client makes testing more complex, because we can't hot swap clients

### rejected asyncio implementation:

```python
async def _create_file_download_urls_async(file_ids: list[str]) -> list[str]:
    """async method to create file download URLs for a list of files.

    Do not use this method.
    This is called internally by the `download_files` method."""

    async_client = _api._get_default_client(use_async=True)

    tasks = []
    for file_id in file_ids:
        tasks.append(
            _api.create_file_download_url(file_id=file_id, client=async_client)
        )

    data = await asyncio.gather(*tasks)
    urls = [item.data.download_url for item in data]
    return urls
```

and

```python
@beartype
def download_files(
    file_ids: Optional[list[str]] = None,
    names: Optional[list[str]] = None,
):
    """download multiple files in parallel using asyncio

    If you want to download a single file, use download_file
    as it has lower overhead.

    Args:
        file_ids: IDs of the files on Deep Origin
        names: Names of the files. Optional. If None, names will be retrieved from Deep Origin

    Returns:
        None
    """

    # we need nest_asyncio to allow this to work
    # in a jupyter kernel
    import nest_asyncio

    nest_asyncio.apply()

    if names is None:
        # names not provided, determine names from Deep Origin
        files = list_files(file_ids=file_ids)
        names = [item.file.name for item in files]

    # create presigned URLs for all files in parallel
    urls = asyncio.run(_create_file_download_urls_async(file_ids))

    # download files in parallel
    asyncio.run(_download_files_async(urls, names))

    return urls
```

and

```python
async def _download_async(session, url, save_path) -> None:
    """Downloads a single file asynchronously and saves it to the specified path.

    Do not use this. Use the synchronous wrapper function download_files instead."""

    async with session.get(url) as response:
        with open(save_path, "wb") as file:
            async for chunk in response.content.iter_chunked(8192):
                if chunk:  # Filter out keep-alive chunks
                    file.write(chunk)


async def _download_files_async(urls: list[str], save_paths: list[str]) -> None:
    """Downloads multiple files asynchronously.

    Do not use this.
    Use the synchronous wrapper function download_files instead."""

    async with aiohttp.ClientSession() as session:
        tasks = []
        for url, save_path in zip(urls, save_paths):
            tasks.append(_download_async(session, url, save_path))
        await asyncio.gather(*tasks)
```
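
For contrast with the rejected asyncio version, here is a minimal sketch of the thread-pool approach with `concurrent.futures`. It is not the code merged in this commit: `download_all`, `_fetch`, and the `fetch` parameter are hypothetical names used for illustration, and the sketch assumes the presigned URLs have already been created.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Sequence
from urllib.request import urlopen


def _fetch(url: str) -> bytes:
    """Hypothetical default fetcher: read the whole response body."""
    with urlopen(url) as response:
        return response.read()


def download_all(
    urls: Sequence[str],
    save_paths: Sequence[str],
    fetch: Callable[[str], bytes] = _fetch,
    max_workers: int = 8,
) -> list[Path]:
    """Download each URL to its matching path using a pool of threads.

    `fetch` is injectable so tests can swap in a fake without any
    network access (an assumption of this sketch, not the real API).
    """
    if len(urls) != len(save_paths):
        raise ValueError("urls and save_paths must have the same length")

    def _one(url: str, save_path: str) -> Path:
        path = Path(save_path)
        # create the destination folder if it does not exist yet
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(url))
        return path

    # pool.map returns results in the same order as the inputs
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_one, urls, save_paths))
```

Because this is plain synchronous code run in worker threads, it needs no event loop, so it can be called the same way from a script or a jupyter kernel, and the fetcher can be swapped out in tests.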
deeporiginbio · Oct 8, 2024 · 88856a3
1 parent eacd1ff
commit 88856a3
Showing 22 changed files with 372 additions and 462 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -55,7 +55,7 @@ jobs:
       fail-fast: true
       matrix:
         os: [ubuntu-latest, windows-latest]
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
+        python-version: ["3.10", "3.11", "3.12"]
 
     needs: [code-formatting-syntax-and-docstrings]
     steps:

diff --git a/docs/changelog.md b/docs/changelog.md
@@ -1,6 +1,25 @@
 # Changelog
 
-### `deeporigin v2.1.1` 
+
+## `deeporigin v2.2.1`
+
+- added the ability to interact with the platform, including
+    - fetching user secrets from the platform
+    - fetching user info from the platform
+    - fetching workstation info from the platform
+- removed support for graphQL endpoints for secrets and variables
+- support for new (breaking) syntax for deleting databases, folders, rows, and columns
+- better support for caching tokens in filesystem
+- numerous bug fixes and small improvements
+- dropped support for Python 3.9
+- better syntax for interacting with the `api.list_files` function
+- Improvements to Deep Origin DataFrames, including:
+    - ability to create new columns in dataframes and push them to a database
+    - DataFrames now show the names of the users who created and last updated a database, not just an ID
+    - DataFrame pretty printing methods now never interrupt core logic, and fall back to printing methods of superclass on failure
+    - DataFrames now support `df.tail()` and `df.head()`
+
+## `deeporigin v2.1.1` 
 
 
 - fixed a bug where dataframe header links were sometimes broken
@@ -9,7 +28,7 @@
 - added ability to check for newest versions on PyPI
 - version numbers are now [PEP 440](https://peps.python.org/pep-0440/) compliant
 
-### `deeporigin v2.1.0` 
+## `deeporigin v2.1.0` 
 
 
 - dropped support for column keys

diff --git a/docs/data-hub/dataframes.md b/docs/data-hub/dataframes.md
@@ -136,4 +136,4 @@ df
 
 ## Reference
 
-Read more about the `to_deeporigin` method [here](../ref/data-hub/types.md#src.data_hub.dataframe.DataFrame.sync). 
+Read more about the `to_deeporigin` method [here](../ref/data-hub/types.md#src.data_hub.dataframe.DataFrame.to_deeporigin). 
diff --git a/docs/how-to/variables.md b/docs/how-to/compute-hub/variables.md
diff --git a/docs/how-to/workstation-info.md b/docs/how-to/compute-hub/workstation-info.md
diff --git a/docs/how-to/data-hub/create.md b/docs/how-to/data-hub/create.md
@@ -104,4 +104,4 @@ To create a new database column in an existing database, run:
     )
     ```
 
-This code creates a new column in the existing database. To configure the type of the column, use the `type` argument. The type must a member of [DataType](../../ref/data-hub/types.md#src.utils.DataType).
+This code creates a new column in the existing database. To configure the type of the column, use the `type` argument. The type must be a member of [DataType](../../ref/data-hub/types.md#src.utils.constants.DataType).
diff --git a/docs/how-to/data-hub/download-files.md b/docs/how-to/data-hub/download-files.md
@@ -0,0 +1,44 @@
+# Download files
+
+This page describes how to download files from Deep Origin to your local computer. 
+
+
+## Download one or many files from the Data hub
+
+To download file(s) from the Deep Origin data hub, run the following commands:
+
+=== "CLI"
+
+    ```bash
+    deeporigin data download-files
+    ```
+
+    This will download all files on Deep Origin to the current folder. 
+
+    To download files that have been assigned to a particular row, use:
+
+    ```bash
+    deeporigin data download-files --assigned-row-ids <row-id-1>  <row-id-2> ...
+    ```
+
+    To download specific files, pass the file IDs using:
+
+
+    ```bash
+    deeporigin data download-files --file-ids <file-1> <file-2> ...
+    ```
+
+
+=== "Python"
+
+    ```py
+    from deeporigin.data_hub import api
+    api.download_files(files)
+    ```
+
+    `files` is the list of files to download, given as `ListFilesResponse` objects. To obtain this list, call `api.list_files()`; its output can be passed directly to `download_files`. 
+
+    !!! Tip "Download all files"
+        To download all files, call `api.list_files()` and pass the output to `download_files`.
+
+
diff --git a/docs/how-to/platform/user-info.md b/docs/how-to/platform/user-info.md
@@ -0,0 +1,62 @@
+# Get user information 
+
+## Get info about current user
+
+To get information about the currently logged in user, including the user ID, use:
+
+```python
+from deeporigin.platform import api
+api.whoami()
+```
+
+This returns information about the current user. A typical response is:
+
+```json
+{
+    "data": {
+        "attributes": {
+            "company": null,
+            "expertise": null,
+            "industries": null,
+            "pendingInvites": [],
+            "platform": "OS",
+            "title": null
+        },
+        "id": "google-apps|[email protected]",
+        "type": "User"
+    },
+    "links": {
+        "self": "https://os.deeporigin.io/users/me"
+    }
+}
+```
+
+## Get information about a user 
+
+To get information about a user, use:
+
+
+```python
+from deeporigin.platform import api
+api.resolve_user("user-id")
+```
+
+where `user-id` is the ID of the user, in the format returned by `api.whoami()`. A typical response looks like:
+
+
+```json
+{
+    "data": {
+        "attributes": {
+            "avatar": "https://...",
+            "email": "[email protected]",
+            "name": "User Name"
+        },
+        "id": "918ddd25-ab97-4400-9a14-7a8be1216754",
+        "type": "User"
+    },
+    "links": {
+        "self": "https://..."
+    }
+}
+```
diff --git a/docs/how-to/platform/workstation-info.md b/docs/how-to/platform/workstation-info.md
@@ -0,0 +1,75 @@
+
+# Get information about workstations
+
+To list all workstations on Deep Origin, use:
+
+```python
+from deeporigin.platform import api
+api.get_workstations()
+```
+
+This returns a list of objects, where each object corresponds to a workstation. A typical entry looks like:
+
+```json
+{
+    "attributes": {
+        "accessMethods": [
+            {
+                "icon": "/assets/icons/catalog-items/jupyterlab.svg",
+                "id": "jupyterlab",
+                "name": "JupyterLab"
+            },
+            {
+                "icon": "/assets/icons/catalog-items/code-server.svg",
+                "id": "code-server",
+                "name": "VS Code (web)"
+            }
+        ],
+        "accessSettings": {
+            "publicKey": "ssh-ed25519 ... ",
+            "ssh": true
+        },
+        "autoStopIdleCPUThreshold": 0,
+        "autoStopIdleDuration": 30,
+        "blueprint": "deeporigin/deeporigin-python:staging",
+        "cloudProvider": {
+            "region": "us-west-2",
+            "vendor": "aws"
+        },
+        "clusterId": "3bb775e4-8be6-4936-a6b9",
+        "created": "2024-10-05T17:01:06.840Z",
+        "description": "dfd",
+        "drn": "drn:...",
+        "enableAutoStop": true,
+        "name": "forthcoming-tyrannosaurus-8fd",
+        "nextUserActions": [
+            "DELETE"
+        ],
+        "orgHandle": "deeporigin-com",
+        "requestedResources": {
+            "cpu": 8,
+            "gpu": 0,
+            "gpuSize": "NONE",
+            "memory": 32,
+            "storage": 250
+        },
+        "state": {
+            "error": "",
+            "isError": false,
+            "stage": "READY",
+            "status": "TERMINATED"
+        },
+        "status": "TERMINATED",
+        "summary": "",
+        "templateVersion": "v0.1.0",
+        "updated": "2024-10-07T12:46:46.511Z",
+        "userHandle": "google-apps|[email protected]",
+        "volumeDrns": [
+            "..."
+        ],
+        "wasAutoStopped": false
+    },
+    "id": "...",
+    "type": "ComputeBench"
+}
+```
diff --git a/docs/platform/index.md b/docs/platform/index.md
@@ -0,0 +1,3 @@
+# Platform API
+
+The Deep Origin CLI and Python client allow you to control and interact with the Deep Origin platform.