S3

Read: s3cmd usage page

s3 is the tier2 storage system DCAN uses mostly for storing large datasets. Each user has 120 TB of space, which is why it is preferred to store data here as opposed to tier 1. For a comprehensive introduction to s3, see this tutorial video (recorded as part of the Data Processing Workshop hosted for undergraduate research assistants). It is encouraged to watch this tutorial prior to your first time using s3 commands. Note that after you run workbench commands, make sure to module rm workbench prior to doing any s3cmd commands.

s3 Basic Commands

  1. Setting up the bucket and basic commands:

    • To make a new bucket, use: s3cmd mb s3://bucket-name/

      • NOTE: You are no longer allowed to have to an underscore or uppercase letter in your bucket name. MSI might still let you create a bucket with an uppercase, but it can cause issues with other programs that interact with s3 buckets. As a guideline, follow the example bucket name ("s3://bucket-name/") above. For more information, visit this link.
    • To see what is inside the bucket, use: s3cmd ls s3://bucket-name/

    • To add a file to the bucket, use: s3cmd sync path/to/directory s3://bucket-name/

    • To pull down a single file from a bucket use s3cmd get s3://path/to/file /path/to/where/you/want/file/

    • To place a single file into a bucket use s3cmd put path/to/file s3://path/to/where/you/want/file/

    • To remove a file in a bucket, use s3cmd rm s3://bucket-name/ --recursive

    • To remove a bucket, use s3cmd rb s3://bucket-name/ --recursive

    • To sync a file or directory from your s3 bucket, use: s3cmd sync s3://bucket-name/sub_dir /path/to/where/you/want/the/directory/to/go/

      • Include the --recursive flag if it is a directory
    • To add a directory to the bucket, use: s3cmd sync --recursive dir_name s3://bucket-name/

      • If your job runs out of time before the data upload is complete, use s3cmd sync --recursive --no-check-md5 to restart the upload process where it left off and assure everything makes it. Run this even if the sync says complete to validate

      • If you're re-running a pipeline/process and you only want the new outputs to be in the directory that you're syncing to, include the --delete-removed flag to your sync command. This will remove files that are present in the destination path but aren't present in the source path.

    • To check the size of a single bucket: s3cmd du -H s3://bucket-name/

      • This process can take awhile and an srun will most likely be needed. Recommended: use s5cmd client for commands such as du and ls on large buckets
    • To understand how much of your individual s3 quota you have used (each person in the lab is allocated 120TB of s3 space): s3info -u x500

    • For more s3 command usage, see here.

Setting a policy on s3

  1. Setting an S3 policy on MSI:

    • Must use this path for your ~/.bashrc to be able to use the set_s3policy command: export PATH=/home/dhp/public/storage/s3policy_bin/:$PATH

      • To open your ~/.bashrc and modify it, use emacs ~/.bashrc &

      • For this change to take effect, close out of your terminal/OpenOnDemand session and reenter

    • To check who already has access to the bucket, run: /home/faird/shared/code/internal/utilities/MSI-utilities/s3_get_x500/get_x500.sh s3://bucket_name/

      • Note that this won't include you, so don't forget to include yourself, especially if it isn't your bucket (you can revoke your own access).

      • The DCAN google drive has a Lab Data Locations sheet with the locations and owners of s3 buckets

    • Then use this command to set the s3 policy, DO NOT include the s3:// prefix in the command:

      set_s3policy x500_1,x500_2,etc bucket_name FULL

    • IMPORTANT: every time you set a policy you need to list EVERYONE that needs access. You CANNOT just list that one person's x500 or only that one person will have access to the bucket! (Along with the bucket owner)

  2. Instructions to the manual method of setting an s3 policy on MSI can be found at this link. An example policy file that gives full access to the s3 bucket “bucket_name” is as follows:

    Example s3 policy

Setting up a .s3cfg

  1. Setting up a .s3cfg file:

    • Open a new terminal in MSI and create a file named .s3cfg-RANDOM in your home directory

      • Can't just be .s3cfg, name it something identifiable
    • Run s3cmd -c ~/.s3cfg-RANDOM --configure and provide your Access Key ID and Secret Access Key when prompted

    • You shouldn't have to make any changes to the file unless the s3 bucket is hosted by something different than s3.amazonaws.com

    • Test your access through an ls command - the s3cmd -c ~/.s3cfg-RANDOM will have to be part of every command you run to access the buckets that require the credentials you provided.

s3 access with Cyberduck

Resource: View your MSI S3 credentials (login required)

Read: Cyberduck S3 documentation

Cyberduck is a cloud storage browser for Windows and macOS devices, which can be used to access s3 storage from your local machine. To access MSI Tier 2 storage in Cyberduck, connect to s3.msi.umn.edu using your s3 credentials, which can be retrieved from the link at the top of this subsection or by running s3info in your MSI terminal.

To log in for the first time, follow the following steps:

  • Click "Open Connection" in the top left corner of the screen

  • Select "Amazon s3" from the drop down menu

  • Type in s3.msi.umn.edu for the server

  • Enter you access and secret keys

  • Click "Connect"

s5cmd client

Read: s5cmd GitHub

The s5cmd client is installed on MSI at /home/faird/shared/code/external/utilities/s5cmd/s5cmd.

s5cmd is an alternative S3 client that performs much faster on MSI than s3cmd for certain operations (roughly 10x-100x faster for du and ls). Its usage is similar to s3cmd, but does require the additional argument --endpoint-url https://s3.msi.umn.edu so that commands are directed to MSI's S3 (Tier 2) and not AWS S3.

Example using du -H: /home/faird/shared/code/external/utilities/s5cmd/s5cmd --endpoint-url https://s3.msi.umn.edu du -H s3://MY_BUCKET/*

Currently, s5cmd cannot resume interrupted file transfers (upload or download), so it is advisable not to use it over s3cmd for file transfers.