pachctl put file
#
At any time, run pachctl put file --help
for the complete list of flags available to you.
Load your data into Pachyderm by using
pachctl
requires that one or several input repositories have been created.pachctl create repo <repo name>
Use the
pachctl put file
command to put your data into the created repository. Select from the following options:- Atomic commit: no open commit exists in your input repo. Pachyderm automatically starts a new commit, adds your data, and finishes the commit.
pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
Alternatively, you can manually start a new commit, add your data in multiple
put file
calls, and close the commit by runningpachctl finish commit
.- Start a commit:
pachctl start commit <repo>@<branch>
- Put your data:
pachctl put file <repo>@<branch>:</path/to/file1> -f <file1>
- Put more data:
pachctl put file <repo>@<branch>:</path/to/file2> -f <file2>
- Close the commit:
pachctl finish commit <repo>@<branch>
- Start a commit:
Filepath Formats #
Pachyderm uses *?[]{}!()@+^
as reserved characters for glob patterns. Because of this, you cannot use these characters in your filepath.
In Pachyderm, you specify the path to file by using the -f
option. A path
to file can be a local path or a URL to an external resource. You can add
multiple files or directories by using the -i
option. To add contents
of a directory, use the -r
flag.
The following table provides examples of pachctl put file
commands with
various filepaths and data sources:
Put data from a URL:
pachctl put file <repo>@<branch>:</path/to/file> -f http://url_path
Put data from an object store. You can use
s3://
,gcs://
, oras://
in your filepath:chctl put file <repo>@<branch>:</path/to/file> -f s3://object_store_url
If you are configuring a local cluster to access an external bucket, make sure that Pachyderm has been given the proper access.
Add multiple files at once by using the
-i
option or multiple-f
flags. In the case of-i
, the target file must be a list of files, paths, or URLs that you want to input all at once:chctl put file <repo>@<branch> -i <file containing list of files, paths, or URLs>
Add an entire directory or all of the contents at a particular URL, either HTTP(S) or object store URL,
s3://
,gcs://
, andas://
, by using the recursive flag,-r
:pachctl put file <repo>@<branch> -r -f <dir>
Loading Your Data Partially #
Depending on your use case and the volume of your data, you might decide to keep your dataset in its original source and process only a subset in Pachyderm.
Add a metadata file containing a list of URL/path to your external data to your repo.
Your pipeline code will retrieve the data following their path without the need to preload it all. In this case, Pachyderm will not keep versions of the source file, but it will keep track and provenance of the resulting output commits.