WIP: Started a data module, to eventually replace the data folder. #25
Open: arokem wants to merge 1 commit into uwescience:master from arokem:download-data
@@ -0,0 +1,144 @@
"""shablona.data: download and read data."""
import os
import sys
import contextlib
import os.path as op
from hashlib import md5
from shutil import copyfileobj

if sys.version_info[0] < 3:
    from urllib2 import urlopen
    from urlparse import urljoin
else:
    from urllib.request import urlopen
    from urllib.parse import urljoin

# Set a user-writeable file-system location to put files:
SHABLONA_HOME = op.join(os.path.expanduser('~'), '.shablona')

def get_file_data(fname, url):
    """
    Put data from a URL into a local file.

    Parameters
    ----------
    fname : str
        Local file-name for the resulting data file.

    url : str
        The URL of the remote file to download.
    """
    with contextlib.closing(urlopen(url)) as opener:
        with open(fname, 'wb') as data:
            copyfileobj(opener, data)


def get_file_md5(filename):
    """
    Compute the md5 checksum of a file.

    Parameters
    ----------
    filename : string
        The name of the file.

    Returns
    -------
    md5 checksum of the file contents
    """
    md5_data = md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128*md5_data.block_size), b''):
            md5_data.update(chunk)
    return md5_data.hexdigest()


def check_md5(filename, stored_md5=None):
    """
    Compute the md5 checksum of a file and validate against stored checksum.

    Parameters
    ----------
    filename : string
        Path to a file.
    stored_md5 : string
        Known md5 of filename to check against. If None (default), checking is
        skipped.
    """
    if stored_md5 is not None:
        computed_md5 = get_file_md5(filename)
        if stored_md5 != computed_md5:
            msg = """The downloaded file, %s, does not have the expected md5
checksum of "%s". Instead, the md5 checksum was: "%s". This could mean that
something is wrong with the file or that the upstream file has been changed.
""" % (filename, stored_md5, computed_md5)
            raise ValueError(msg)


def fetch_data(files, folder, data_size=None):
    """
    Download files to folder and validate their md5 checksums.

    Parameters
    ----------
    files : dictionary
        For each file in `files` the key should be the local file name and
        the value should be (url, md5). The file will be downloaded from url
        if the file does not already exist or if the file exists but the md5
        checksum does not match.

    folder : str
        The directory in which to save the files. The directory will be
        created if it does not already exist.
    data_size : str, optional
        A string describing the size of the data (e.g. "91 MB") to be logged to
        the screen. Default does not produce any information about data size.

    Raises
    ------
    ValueError
        Raised if the md5 checksum of the file does not match the expected
        value. The downloaded file is not deleted when this error is raised.

    """
    if not os.path.exists(folder):
        print("Creating new folder %s" % (folder))
        os.makedirs(folder)

    if data_size is not None:
        print('Data size is approximately %s' % data_size)

    all_skip = True
    for f in files:
        # Named stored_md5 so that it does not shadow hashlib's md5.
        url, stored_md5 = files[f]
        fullpath = op.join(folder, f)
        if os.path.exists(fullpath) and (get_file_md5(fullpath) == stored_md5):
            continue
        all_skip = False
        print('Downloading "%s" to %s' % (f, folder))
        get_file_data(fullpath, url)
        check_md5(fullpath, stored_md5)
    if all_skip:
        msg = 'Dataset is already in place. If you want to fetch it again '
        msg += 'please first remove the folder %s ' % folder
        print(msg)
    else:
        print("Files successfully downloaded to %s" % (folder))


def fetch_shablona_data(data_size="16 kb"):
    """
    Download the shablona example data files into SHABLONA_HOME.

    Returns
    -------
    files : dictionary
        Maps each local file name to its (url, md5) pair.
    """
    # This is the URL to figshare data repository:
    base_url = "https://ndownloader.figshare.com/articles/2543089/versions/2"
    files = {"ortho.csv": (urljoin(base_url, "ortho.csv"),
                           '001eff7cf46bc57220bc6288e6c21563'),
             "para.csv": (urljoin(base_url, "para.csv"),
                          'a0252d4ac6e0846f87e9d8995c79070e')}

    fetch_data(files, SHABLONA_HOME, data_size=data_size)
    return files


def read_shablona_data():
    files = fetch_shablona_data()
    for k in files.keys():
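One detail of `fetch_shablona_data` that may be worth checking: `base_url` has no trailing slash, and `urljoin` resolves a relative name by replacing the last path segment of the base. The sketch below (Python 3 import only, no network access) prints the two URLs that result with and without the slash. It makes no claim about which of them figshare actually serves.

```python
from urllib.parse import urljoin

base_url = "https://ndownloader.figshare.com/articles/2543089/versions/2"

# Without a trailing slash, the final segment ("2") is replaced by the file name.
no_slash = urljoin(base_url, "ortho.csv")

# With a trailing slash, the file name is appended below ".../versions/2/".
with_slash = urljoin(base_url + "/", "ortho.csv")

print(no_slash)
print(with_slash)
```

If the version number is meant to stay in the path, appending the slash before joining would keep it.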
How about pulling from an environment variable (`SHABLONA_DATA_PATH`), and defaulting to this?

Also, I tend to avoid putting data in a hidden directory--I use those for settings only, in general. Just a note about my preference, no request for change :)
Scikit-learn allows an optional environment variable; see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/base.py#L72
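
A minimal sketch of the environment-variable suggestion, assuming the variable is named `SHABLONA_DATA_PATH` as proposed above and that `~/.shablona` remains the fallback. The helper name `get_shablona_home` and its `environ` parameter are hypothetical; they are here only so the lookup can be exercised without touching the real environment.

```python
import os
import os.path as op


def get_shablona_home(environ=None):
    """Return $SHABLONA_DATA_PATH if it is set, else ~/.shablona."""
    # Hypothetical helper: accept a mapping so callers and tests can
    # substitute their own environment.
    environ = os.environ if environ is None else environ
    default = op.join(op.expanduser('~'), '.shablona')
    # An unset or empty variable falls back to the default location.
    path = environ.get('SHABLONA_DATA_PATH') or default
    return op.abspath(op.expanduser(path))


# Example: the module-level constant could then be computed once at import.
SHABLONA_HOME = get_shablona_home()
```

With this in place, a user could redirect downloads by exporting `SHABLONA_DATA_PATH` before importing the module, and everyone else would keep the current behavior.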
If it's going to be hidden, please use an XDG-compliant directory instead of cluttering `~`.

Thanks for taking a look, and thanks for the comment. What is an XDG-compliant directory?
From a brief look around, it seems that these are often directories like `/usr/local/`, which are not always user-writeable. This needs to be user-writeable! But if I got it all wrong, please educate me. Better yet, a proposal for what to replace this with would be much appreciated.

Something that follows the XDG basedir standard.
Basically, if it's data, it goes in `$XDG_DATA_HOME`; if it's cache, it can go in `$XDG_CACHE_HOME`. If these variables are not defined, they default to `$HOME/.local/share` and `$HOME/.cache`, respectively. And then you'd place it in an application-defined subdirectory of those.
There's also the appdirs package that will find the correct location for these sorts of things on all platforms.
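
The lookup described above can be sketched with the standard library alone. The function names `xdg_base_dir`, `shablona_data_dir`, and `shablona_cache_dir`, and the `shablona` subdirectory name, are assumptions made for illustration; nothing in the PR defines them. The sketch follows the basedir rules as stated in the comment (unset or empty variables use the default) and additionally ignores relative paths, which the specification treats as invalid.

```python
import os
import os.path as op


def xdg_base_dir(variable, default_parts, environ=None):
    """Resolve an XDG base directory, falling back to a default under $HOME."""
    environ = os.environ if environ is None else environ
    base = environ.get(variable, '')
    # Unset, empty, or relative values are ignored in favor of the default.
    if not base or not op.isabs(base):
        base = op.join(op.expanduser('~'), *default_parts)
    return base


def shablona_data_dir(environ=None):
    """Application subdirectory of $XDG_DATA_HOME (default ~/.local/share)."""
    base = xdg_base_dir('XDG_DATA_HOME', ('.local', 'share'), environ)
    return op.join(base, 'shablona')


def shablona_cache_dir(environ=None):
    """Application subdirectory of $XDG_CACHE_HOME (default ~/.cache)."""
    base = xdg_base_dir('XDG_CACHE_HOME', ('.cache',), environ)
    return op.join(base, 'shablona')
```

Both defaults live under the user's home directory, so they stay user-writeable. This covers only the XDG convention; the appdirs package mentioned above provides `user_data_dir(appname)` for a per-platform location, at the cost of an extra dependency.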