A flexible, cache-aware download manager for Python, supporting multiple backends (requests, pycurl), with integrated caching and metadata management.
- Multiple Backends: Choose between `requests` and `pycurl` for downloads.
- Cache Integration: Seamless integration with `cachedir` for efficient file reuse and metadata tracking.
- Flexible Destinations: Download to disk, in-memory buffer, or cache.
- Automatic Metadata: Tracks download status, timestamps, HTTP headers, file hashes, and more.
- Configurable: Supports configuration via Python dict or config file.
- Pre-commit, Linting, and CI: Ready for robust development workflows.
pip install git+https://github.com/saezlab/dlmachine.git
If you are developing:
git clone https://github.com/saezlab/dlmachine.git
cd dlmachine
poetry install
import dlmachine as dm
# Basic download to buffer
manager = dm.DownloadManager(backend='requests')
data = manager.download('https://www.google.com', dest=False)
print(data.read())
# Download to a file
manager = dm.DownloadManager(path='/tmp')
filepath = manager.download('https://www.google.com', dest='/tmp/google.html')
print(f"Downloaded to {filepath}")
# Download with cache integration
manager = dm.DownloadManager(path='/tmp')
filepath = manager.download('https://www.google.com')
print(f"Cached at {filepath}")
The package is built around four core components:
- `DownloadManager`: orchestrates cache lookup, backend selection, retries, and metadata updates.
- `Descriptor`: normalizes request parameters (URL, query, headers, JSON, multipart, TLS CA path).
- `RequestsDownloader` and `CurlDownloader`: backend-specific implementations of the download workflow.
- `cachedir`: optional persistence layer for file reuse and download metadata.
flowchart LR
U[User code] --> M[DownloadManager]
M --> D[Descriptor]
M --> C[(cachedir Cache)]
M --> B{backend}
B --> R[RequestsDownloader]
B --> P[CurlDownloader]
D --> R
D --> P
R --> OUT[Path or BytesIO]
P --> OUT
M --> OUT
- Build or accept a `Descriptor`.
- Resolve backend from config (`requests` by default).
- Resolve destination policy:
  - `dest='/path/file'`: force download to that path.
  - `dest=None` or `dest=True`: use cache path if cache is configured, otherwise memory buffer.
  - `dest=False`: force memory buffer.
- If cache is enabled, look up the best matching item using the URI and the relevant download parameters.
- If no valid cached item exists, perform the download and update cache metadata (status, timestamps, response headers, checksum, size, HTTP code).
- Return either a path or an `io.BytesIO`.
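The destination policy in the steps above can be sketched as a small pure function. This is only an illustration under stated assumptions: `resolve_destination` and its `cache_path` argument are hypothetical names, not part of the dlmachine API, and the real manager may resolve destinations differently.

```python
import io
from pathlib import Path
from typing import Optional, Union


def resolve_destination(
    dest: Union[str, bool, None],
    cache_path: Optional[str],
) -> Union[Path, io.BytesIO]:
    """Hypothetical helper mirroring the documented destination policy."""
    # An explicit path always wins, even when a cache is configured.
    if isinstance(dest, str):
        return Path(dest)
    # dest=False forces an in-memory buffer.
    if dest is False:
        return io.BytesIO()
    # dest=None or dest=True: prefer the cache path when one is available.
    if cache_path is not None:
        return Path(cache_path)
    # No cache configured: fall back to an in-memory buffer.
    return io.BytesIO()
```

Keeping the policy in one function of this kind makes the three `dest` modes easy to test without touching the network.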
sequenceDiagram
participant U as User
participant M as DownloadManager
participant C as Cache
participant X as Backend Downloader
U->>M: download(url, dest, kwargs)
M->>M: Build Descriptor
M->>C: best_or_new(...) if cache enabled
alt cache hit
M-->>U: return cached path
else cache miss/uninitialized
M->>X: instantiate(desc, path_or_none)
M->>X: download()
X-->>M: headers, status, bytes/file
M->>C: update metadata
M-->>U: return path or BytesIO
end
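The hit/miss branches of the sequence above can be mimicked with a toy, standard-library-only model. Everything here is an assumption made for illustration: `ToyManager`, `cache_key`, and `fake_fetch` are hypothetical names, the cache is a plain dict rather than `cachedir`, and the toy always returns a buffer where the real manager may return a cached path.

```python
import hashlib
import io
import json
from typing import Callable, Dict, Optional


def cache_key(url: str, params: Optional[dict] = None) -> str:
    """Derive a stable key from the URL plus the parameters of the request."""
    payload = json.dumps({"url": url, "params": params or {}}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyManager:
    """Stand-in for the manager; NOT the dlmachine implementation."""

    def __init__(self, fetch: Callable[[str, dict], bytes], use_cache: bool = True):
        self.fetch = fetch  # plays the role of the backend downloader
        self.cache: Optional[Dict[str, dict]] = {} if use_cache else None

    def download(self, url: str, **params) -> io.BytesIO:
        key = cache_key(url, params)
        if self.cache is not None and key in self.cache:
            # Cache hit: the backend is never called.
            return io.BytesIO(self.cache[key]["content"])
        content = self.fetch(url, params)
        if self.cache is not None:
            # Cache miss: store the payload together with some metadata.
            self.cache[key] = {
                "content": content,
                "size": len(content),
                "sha256": hashlib.sha256(content).hexdigest(),
            }
        return io.BytesIO(content)


calls = []


def fake_fetch(url: str, params: dict) -> bytes:
    """Fake backend that records each call instead of using the network."""
    calls.append(url)
    return b"payload for " + url.encode("utf-8")


manager = ToyManager(fake_fetch)
first = manager.download("https://example.org/data", query={"a": 1})
second = manager.download("https://example.org/data", query={"a": 1})
```

The second call is served from the dict, so `fake_fetch` runs only once. Changing the URL or any parameter produces a different key and triggers a new fetch.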
- In-memory processing: use `dest=False` to get `io.BytesIO`.
- Forced file output: pass explicit `dest='/tmp/file.ext'`.
- Cache-first retrieval: initialize `DownloadManager(path='/tmp/cache')` and call `download(url)` without `dest`.
- POST/JSON: pass `query={...}` with `post=True` or `json=True`.
- Multipart uploads: pass `multipart={...}` with file paths included in the mapping.
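To make the POST/JSON pattern concrete, here is a rough sketch of how one `query` mapping could be turned into a GET query string, a form-encoded body, or a JSON body. `ToyDescriptor` and its methods are hypothetical and use only the standard library. The actual `Descriptor` class may normalize requests differently.

```python
from dataclasses import dataclass, field
from json import dumps
from typing import Dict, Optional
from urllib.parse import urlencode


@dataclass
class ToyDescriptor:
    """Illustrative request description; names and behavior are assumptions."""

    url: str
    query: Dict[str, str] = field(default_factory=dict)
    post: bool = False
    json: bool = False

    def method(self) -> str:
        # Either flag switches the request from GET to POST.
        return "POST" if (self.post or self.json) else "GET"

    def full_url(self) -> str:
        # For GET, the query is appended to the URL.
        if self.method() == "GET" and self.query:
            return f"{self.url}?{urlencode(self.query)}"
        return self.url

    def body(self) -> Optional[bytes]:
        # For POST, the query becomes the request body.
        if self.json:
            return dumps(self.query).encode("utf-8")
        if self.post:
            return urlencode(self.query).encode("utf-8")
        return None


get_req = ToyDescriptor("https://example.org/api", query={"q": "tp53"})
form_req = ToyDescriptor("https://example.org/api", query={"q": "tp53"}, post=True)
json_req = ToyDescriptor("https://example.org/api", query={"q": "tp53"}, json=True)
```

In this sketch `get_req.full_url()` carries the parameters in the URL, while `form_req.body()` and `json_req.body()` carry the same mapping in the request body in two different encodings.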
- `DownloadManager`: Main interface for downloads and cache management.
- `Descriptor`: Describes a download (URL, headers, POST/GET, etc.).
- `CurlDownloader`: PyCurl-based downloader.
- `RequestsDownloader`: Requests-based downloader.
You can configure the download manager via keyword arguments or a config file:
dm.DownloadManager(
path='/my/cache/dir',
backend='curl', # or 'requests'
# ...other options
)
- Linting: `poetry run flake8 dlmachine`
- Tests: `poetry run pytest`
- Coverage: `poetry run pytest --cov`
- Pre-commit: Install with `pre-commit install`
BSD 3-Clause License
Developed by the OmniPath team at Heidelberg University Hospital.
If you use this software, please cite the repository and the OmniPath team.