feat: added ability to download files in parallel (#88)
## changes

- [x] ability to download multiple files in parallel using ~~asyncio~~
concurrent.futures
- [x] docs improvements
- [x] added changelog to docs
- [x] added docs on how to download files
- [x] added methods to download files from the CLI
- [x] added docs on platform API features 
- [x] ~~download_database should use download_files, not download_file~~
next sprint
- [x] download_files should allow users to download to a specific folder
-- no need to specify all names
- [x] download_files should work with no args -- and it will download
all files
- [x] dropped support for python 3.9


## technical discussion

there are two ways we can go about downloading files. one is to use
asyncio and the other is to use concurrent.futures

i initially implemented everything asyncio, but decided to switch to
concurrent.futures because

- mixing async with sync code leads to a lot of boilerplate and repeated
code
- asyncio code doesn't work natively in jupyter notebooks. we have to
use a 3rd party package called nest_asyncio to get it to work, which
is more overhead
- technically asyncio should be lighter and faster, but i didn't see
this in practice
- using asyncio also requires us to use the low level async client,
which leads to more doubling of work and code
- using the low-level asyncio client makes testing more complex, because
we can't hot swap clients
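
for illustration, here is a minimal sketch of what the concurrent.futures approach can look like. this is not the code in this commit: `download_all`, `_fetch_to_disk`, and the `fetch` parameter are hypothetical names, and the sketch assumes the presigned URLs have already been created.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def _fetch_to_disk(url: str, save_path: Path) -> Path:
    """Stream one URL to disk in 8 KiB chunks (plain blocking I/O)."""
    with urlopen(url) as response, open(save_path, "wb") as file:
        while chunk := response.read(8192):
            file.write(chunk)
    return save_path


def download_all(
    urls: list[str],
    save_paths: list[Path],
    fetch: Callable[[str, Path], Path] = _fetch_to_disk,  # hypothetical hook
    max_workers: int = 8,
) -> list[Path]:
    """Download every URL to its matching path using a thread pool."""
    if len(urls) != len(save_paths):
        raise ValueError("urls and save_paths must have the same length")
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in input order; a worker's exception is
        # re-raised here when its result is consumed
        return list(executor.map(fetch, urls, save_paths))
```

because everything is synchronous, a function like this can be called from a script or a jupyter kernel without an event loop, and passing a fake `fetch` in tests avoids touching the network.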


### rejected asyncio implementation:

```python
async def _create_file_download_urls_async(file_ids: list[str]) -> list[str]:
    """async method to create file download URLs for a list of files.

    Do not use this method. This is called internally by the
    `download_files` method."""

    async_client = _api._get_default_client(use_async=True)

    tasks = []
    for file_id in file_ids:
        tasks.append(
            _api.create_file_download_url(file_id=file_id, client=async_client)
        )
    data = await asyncio.gather(*tasks)
    urls = [item.data.download_url for item in data]
    return urls

```

and 


```python
@beartype
def download_files(
    file_ids: Optional[list[str]] = None,
    names: Optional[list[str]] = None,
):
    """download multiple files in parallel using asyncio

    If you want to download a single file, use download_file as it has lower overhead.

    Args:
        file_ids: IDs of the files on Deep Origin
        names: Names of the files. Optional. If None, names will be retrieved from Deep Origin

    Returns:
        None

    """

    # we need nest_asyncio to allow this to work
    # in a jupyter kernel
    import nest_asyncio

    nest_asyncio.apply()

    if names is None:
        # names not provided, determine names from Deep Origin
        files = list_files(file_ids=file_ids)
        names = [item.file.name for item in files]

    # create presigned URLs for all files in parallel
    urls = asyncio.run(_create_file_download_urls_async(file_ids))

    # download files in parallel
    asyncio.run(_download_files_async(urls, names))

    return urls
```

and 

```python
async def _download_async(session, url, save_path) -> None:
    """Downloads a single file asynchronously and saves it to the specified path.

    Do not use this. Use the synchronous wrapper function
    download_files instead."""

    async with session.get(url) as response:
        with open(save_path, "wb") as file:
            async for chunk in response.content.iter_chunked(8192):
                if chunk:  # Filter out keep-alive chunks
                    file.write(chunk)


async def _download_files_async(urls: list[str], save_paths: list[str]) -> None:
    """Downloads multiple files asynchronously.

    Do not use this. Use the synchronous wrapper function
    download_files instead."""

    async with aiohttp.ClientSession() as session:
        tasks = []
        for url, save_path in zip(urls, save_paths):
            tasks.append(_download_async(session, url, save_path))

        await asyncio.gather(*tasks)
```
sg-s authored Oct 8, 2024
1 parent eacd1ff commit 88856a3
Showing 22 changed files with 372 additions and 462 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -55,7 +55,7 @@ jobs:
fail-fast: true
matrix:
os: [ubuntu-latest, windows-latest]
python-version: ["3.9", "3.10", "3.11", "3.12"]
python-version: ["3.10", "3.11", "3.12"]

needs: [code-formatting-syntax-and-docstrings]
steps:
23 changes: 21 additions & 2 deletions docs/changelog.md
@@ -1,6 +1,25 @@
# Changelog

### `deeporigin v2.1.1`

## `deeporigin v2.2.1`

- added the ability to interact with the platform, including
    - fetching user secrets from the platform
    - fetching user info from the platform
    - fetching workstation info from the platform
- removed support for graphQL endpoints for secrets and variables
- support for new (breaking) syntax for deleting databases, folders, rows, and columns
- better support for caching tokens in filesystem
- numerous bug fixes and small improvements
- dropped support for Python 3.9
- better syntax for interacting with the `api.list_files` function
- Improvements to Deep Origin DataFrames, including:
    - ability to create new columns in DataFrames and push them to a database
    - DataFrames now show the names of the users who created and last updated a database, not just an ID
    - DataFrame pretty-printing methods now never interrupt core logic, and fall back to the superclass's printing methods on failure
    - DataFrames now support `df.tail()` and `df.head()`

## `deeporigin v2.1.1`


- fixed a bug where dataframe header links were sometimes broken
@@ -9,7 +28,7 @@
- added ability to check for newest versions on PyPI
- version numbers are now [PEP 440](https://peps.python.org/pep-0440/) compliant

### `deeporigin v2.1.0`
## `deeporigin v2.1.0`


- dropped support for column keys
2 changes: 1 addition & 1 deletion docs/data-hub/dataframes.md
@@ -136,4 +136,4 @@ df

## Reference

Read more about the `to_deeporigin` method [here](../ref/data-hub/types.md#src.data_hub.dataframe.DataFrame.sync).
Read more about the `to_deeporigin` method [here](../ref/data-hub/types.md#src.data_hub.dataframe.DataFrame.to_deeporigin).
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/how-to/data-hub/create.md
@@ -104,4 +104,4 @@ To create a new database column in an existing database, run:
)
```

This code creates a new column in the existing database. To configure the type of the column, use the `type` argument. The type must a member of [DataType](../../ref/data-hub/types.md#src.utils.DataType).
This code creates a new column in the existing database. To configure the type of the column, use the `type` argument. The type must be a member of [DataType](../../ref/data-hub/types.md#src.utils.constants.DataType).
44 changes: 44 additions & 0 deletions docs/how-to/data-hub/download-files.md
@@ -0,0 +1,44 @@
# Download files

This page describes how to download files from Deep Origin to your local computer.


## Download one or many files from the Data hub

To download file(s) from the Deep Origin data hub, run the following commands:

=== "CLI"

```bash
deeporigin data download-files
```

This will download all files on Deep Origin to the current folder.

To download files that have been assigned to a particular row, use:

```bash
deeporigin data download-files --assigned-row-ids <row-id-1> <row-id-2> ...
```

To download specific files, pass the file IDs using:


```bash
deeporigin data download-files --file-ids <file-1> <file-2> ...
```


=== "Python"

```py
from deeporigin.data_hub import api
api.download_files(files)
```

`files` is a list of `ListFilesResponse` objects describing the files to download. To obtain this list, call `api.list_files()`; its output can be passed directly to `download_files`.

!!! Tip "Download all files"
To download all files, call `api.list_files()` and pass the output to `download_files`.
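
To download only some of the files, the list returned by `api.list_files()` can be filtered before it is passed to `download_files`. The sketch below is one way to do that, under two assumptions: each entry exposes its name as `item.file.name`, and `select_files_by_suffix` is a hypothetical helper, not part of the client. Stand-in objects are used so the example runs without a Deep Origin account.

```python
from types import SimpleNamespace
from typing import Any


def select_files_by_suffix(files: list[Any], suffix: str) -> list[Any]:
    """Keep only the entries whose file name ends with `suffix`."""
    return [item for item in files if item.file.name.endswith(suffix)]


# Stand-ins that mimic the assumed `item.file.name` shape; with the real
# client you would start from `files = api.list_files()` instead.
files = [
    SimpleNamespace(file=SimpleNamespace(name="reads.fastq")),
    SimpleNamespace(file=SimpleNamespace(name="notes.txt")),
    SimpleNamespace(file=SimpleNamespace(name="more-reads.fastq")),
]

fastq_files = select_files_by_suffix(files, ".fastq")
# With the real client: api.download_files(fastq_files)
```

Filtering on the client side keeps the call to `download_files` unchanged, since it still receives a list of the same objects that `list_files` produced.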


62 changes: 62 additions & 0 deletions docs/how-to/platform/user-info.md
@@ -0,0 +1,62 @@
# Get user information

## Get info about current user

To get information about the currently logged in user, including the user ID, use:

```python
from deeporigin.platform import api
api.whoami()
```

This returns information about the current user. A typical response is:

```json
{
  "data": {
    "attributes": {
      "company": null,
      "expertise": null,
      "industries": null,
      "pendingInvites": [],
      "platform": "OS",
      "title": null
    },
    "id": "google-apps|[email protected]",
    "type": "User"
  },
  "links": {
    "self": "https://os.deeporigin.io/users/me"
  }
}
```
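
The user ID sits under `data.id` in this response. A minimal sketch of pulling it out, assuming the response behaves like a nested mapping (the client may return a different object type) and using a hypothetical helper name, `extract_user_id`:

```python
from typing import Any, Mapping


def extract_user_id(whoami_response: Mapping[str, Any]) -> str:
    """Return the `data.id` field of a `whoami`-style response."""
    return whoami_response["data"]["id"]


# Trimmed stand-in for the response shown above; the ID is a placeholder.
sample_response = {
    "data": {
        "attributes": {"platform": "OS", "pendingInvites": []},
        "id": "example-user-id",
        "type": "User",
    },
    "links": {"self": "https://os.deeporigin.io/users/me"},
}

user_id = extract_user_id(sample_response)
# With the real client, assuming `api.whoami()` returns a mapping:
# user_id = extract_user_id(api.whoami())
```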

## Get information about a user

To get information about a user, use:


```python
from deeporigin.platform import api
api.resolve_user("user-id")
```

where `user-id` is the ID of the user, in the format returned by `api.whoami()`. A typical response looks like:


```json
{
  "data": {
    "attributes": {
      "avatar": "https://...",
      "email": "[email protected]",
      "name": "User Name"
    },
    "id": "918ddd25-ab97-4400-9a14-7a8be1216754",
    "type": "User"
  },
  "links": {
    "self": "https://..."
  }
}
```
75 changes: 75 additions & 0 deletions docs/how-to/platform/workstation-info.md
@@ -0,0 +1,75 @@

# Get information about workstations

To list all workstations on Deep Origin, use:

```python
from deeporigin.platform import api
api.get_workstations()
```

This returns a list of objects, where each object corresponds to a workstation. A typical entry looks like:

```json
{
  "attributes": {
    "accessMethods": [
      {
        "icon": "/assets/icons/catalog-items/jupyterlab.svg",
        "id": "jupyterlab",
        "name": "JupyterLab"
      },
      {
        "icon": "/assets/icons/catalog-items/code-server.svg",
        "id": "code-server",
        "name": "VS Code (web)"
      }
    ],
    "accessSettings": {
      "publicKey": "ssh-ed25519 ... ",
      "ssh": true
    },
    "autoStopIdleCPUThreshold": 0,
    "autoStopIdleDuration": 30,
    "blueprint": "deeporigin/deeporigin-python:staging",
    "cloudProvider": {
      "region": "us-west-2",
      "vendor": "aws"
    },
    "clusterId": "3bb775e4-8be6-4936-a6b9",
    "created": "2024-10-05T17:01:06.840Z",
    "description": "dfd",
    "drn": "drn:...",
    "enableAutoStop": true,
    "name": "forthcoming-tyrannosaurus-8fd",
    "nextUserActions": [
      "DELETE"
    ],
    "orgHandle": "deeporigin-com",
    "requestedResources": {
      "cpu": 8,
      "gpu": 0,
      "gpuSize": "NONE",
      "memory": 32,
      "storage": 250
    },
    "state": {
      "error": "",
      "isError": false,
      "stage": "READY",
      "status": "TERMINATED"
    },
    "status": "TERMINATED",
    "summary": "",
    "templateVersion": "v0.1.0",
    "updated": "2024-10-07T12:46:46.511Z",
    "userHandle": "google-apps|[email protected]",
    "volumeDrns": [
      "..."
    ],
    "wasAutoStopped": false
  },
  "id": "...",
  "type": "ComputeBench"
}
```
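
Each entry carries many fields, so it can help to reduce the list to a short summary. The sketch below assumes the entries behave like nested dictionaries with the keys shown above; `summarize_workstations` is a hypothetical helper, not part of the client, and the sample entry is trimmed from the example response.

```python
from typing import Any


def summarize_workstations(workstations: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Reduce each workstation entry to its name, status, and requested CPU/memory."""
    rows = []
    for entry in workstations:
        attrs = entry["attributes"]
        resources = attrs["requestedResources"]
        rows.append(
            {
                "name": attrs["name"],
                "status": attrs["status"],
                "cpu": resources["cpu"],
                "memory": resources["memory"],
            }
        )
    return rows


# Trimmed stand-in for one entry of the list; with the real client you
# would pass the output of `api.get_workstations()` (assumed dict-like).
sample_workstations = [
    {
        "attributes": {
            "name": "forthcoming-tyrannosaurus-8fd",
            "status": "TERMINATED",
            "requestedResources": {"cpu": 8, "gpu": 0, "memory": 32, "storage": 250},
        },
        "id": "...",
        "type": "ComputeBench",
    }
]

summary = summarize_workstations(sample_workstations)
```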
3 changes: 3 additions & 0 deletions docs/platform/index.md
@@ -0,0 +1,3 @@
# Platform API

The Deep Origin CLI and Python client allow you to control and interact with the Deep Origin platform.