mirror of https://github.com/jlevy/the-art-of-command-line.git synced 2024-12-14 10:53:03 +02:00

曾楚杰 45a6d37888 Finished translating the "Processing files and data" section

2015-06-21 18:55:05 +08:00

The Art of Command Line

Meta
Basics
Everyday use
Processing files and data
System debugging
One-liners
Obscure but useful
More resources
Disclaimer

curl -s 'https://raw.githubusercontent.com/jlevy/the-art-of-command-line/master/README.md' | egrep -o '\w+' | tr -d '`' | cowsay -W50

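Fluency on the command line is a skill that is often neglected or considered arcane, but it improves your flexibility and productivity as an engineer. This is a selection of notes and tips on using the command line that I have found useful when working on Linux. Some tips are elementary, and some are fairly sophisticated or even obscure. This page is not long, but if you can use and recall all the items here, you know a lot about the command line.

Much of this originally appeared on Quora, but given the interest there, Github seems the better place for it, since people here are talented and can readily suggest improvements. If you see an error or something that could be better, please submit an issue or pull request! (Of course, please review the meta section and the existing issues/PRs first.)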

Basics

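Learn basic Bash. Specifically, type man bash and at least skim the whole thing; it is easy to follow and not that long. Alternative shells can be nice, but Bash is powerful and available almost everywhere (learning only zsh, fish, or another shell is convenient on your own machine but restricts you in many situations, such as when you need to work on servers).
Learn at least one text-based editor well. Vim (vi) is usually the best choice.
Know how to read documentation with man. Know how to find man pages with apropos. Know that some commands are not executables but Bash builtins, and that you can get help on them with help and help -d.
Learn to redirect output and input with > and <, and to connect commands with pipes using |. Learn about standard output (stdout) and standard error (stderr).
Learn about file glob expansion with * (and ideally ? and {...}), about quoting, and about the difference between single ' and double " quotes.
Be familiar with Bash job management: &, ctrl-z, ctrl-c, jobs, fg, bg, kill, etc.
Know ssh and the basics of passwordless authentication via ssh-agent, ssh-add, etc.
Learn basic file management: ls and ls -l (in particular, learn what every column in ls -l means), less, head, tail and tail -f (or even less +F), ln and ln -s (learn the difference between hard and soft links), chown, chmod, du (for a quick summary of disk usage: du -sk *). For filesystem management, learn df, mount, fdisk, mkfs.
Learn basic network management: ip or ifconfig, dig.
Know regular expressions well, and the various flags to grep/egrep, such as -i, -o, -A, and -B.
Learn to use apt-get, yum, or dnf (depending on your Linux distribution) to find and install packages. Make sure you have pip to install Python-based command-line tools (some are easiest to install via pip).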
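
The redirection and quoting items above can be tried safely in a scratch directory. This is a minimal sketch, not a definitive reference: the function name speak and the variable workdir are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical demo function: writes one line to stdout and one to stderr.
speak() {
  echo "to stdout"
  echo "to stderr" >&2
}

workdir=$(mktemp -d)            # scratch directory, an assumption for the demo

speak > "$workdir/out.txt" 2> "$workdir/err.txt"   # send the two streams to separate files
speak 2>/dev/null | tr 'a-z' 'A-Z'                 # only stdout enters the pipe

name="world"
echo 'single: $name'            # single quotes: no expansion
echo "double: $name"            # double quotes: variables are expanded

touch "$workdir"/{a,b}.txt "$workdir/c.log"        # {a,b} expands to two names
ls "$workdir"/?.txt             # ? matches exactly one character: a.txt and b.txt
```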

Everyday use

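In Bash, use Tab to complete arguments and ctrl-r to search through command history.
In Bash, use ctrl-w to delete the last word you typed and ctrl-u to delete everything before the cursor (the whole line, when the cursor is at the end). Use alt-b and alt-f to move by word, ctrl-k to delete from the cursor to the end of the line, and ctrl-l to clear the screen. See man readline for the default keybindings in Bash; there are a lot. For example, alt-. cycles through previous arguments, and alt-* expands a glob.
If you like vi-style keybindings, use set -o vi.
To see recent commands, type history. There are also many abbreviations such as !$ (last argument) and !! (last command), though these are often easily replaced with ctrl-r and alt-..
To go back to the previous working directory: cd -
If you are halfway through typing a command but change your mind, hit alt-# to add a # at the beginning of the line (turning the command into a comment) and press enter. You can then return to the half-typed command later via command history.
Use xargs (or parallel). They are very powerful. Note you can control how many input lines are passed per command invocation (-L) as well as the maximum parallelism (-P). If you are not sure it will do the right thing, use xargs echo first. Also, -I{} is handy. Examples: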

      find . -name '*.py' | xargs grep some_function
      cat hosts | xargs -I{} ssh root@{} hostname
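
The dry-run advice above can be sketched as follows, assuming GNU xargs. The host names are placeholders, and echo stands in for the real command so nothing is actually executed remotely.

```shell
#!/usr/bin/env bash
# Dry run: prefix the real command with echo to see what xargs would execute.
# web1..web3 are placeholder names, not real machines.
printf '%s\n' web1 web2 web3 | xargs -I{} echo ssh root@{} hostname

# -L 2 passes two input lines to each invocation of the command.
printf '%s\n' a b c d | xargs -L 2 echo

# -P 2 runs up to two invocations at a time; output order is then not guaranteed.
printf '%s\n' 1 2 3 4 | xargs -L 1 -P 2 echo job
```

Once the echoed commands look right, removing echo runs them for real.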

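pstree -p is a helpful display of the process tree.
Use pgrep and pkill to find or signal processes by name.
Know the various signals you can send to processes. For example, to suspend a process, use kill -STOP [pid]. For the full list, see man 7 signal.
Use nohup or disown if you want a background process to keep running.
Check which processes are listening on ports with netstat -lntp or ss -plat (TCP by default; add -u for UDP).
See also lsof for open sockets and files.
In Bash scripts, use set -x for debugging output. Use strict modes whenever possible. Use set -e to abort on errors rather than carry on. Use set -o pipefail as well, to be strict about errors (though this topic is a bit subtle). For more involved scripts, also use trap.
In Bash scripts, subshells (written with parentheses) are a convenient way to group commands. A common example is to temporarily move to a different working directory: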

      # do something in current dir
      (cd /some/other/dir && other-command)
      # continue in original dir
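
The strict-mode, trap, and subshell tips can be combined. The following is only a sketch under stated assumptions: with_scratch_dir is a hypothetical helper name, and the work done inside it is a stand-in for a real task.

```shell
#!/usr/bin/env bash
# Hypothetical helper. The body is wrapped in parentheses, so it runs in a
# subshell: strict mode, the trap, and the cd do not leak into the caller.
with_scratch_dir() (
  set -euo pipefail                    # abort on errors, unset variables, failed pipes
  scratch=$(mktemp -d)
  trap 'rm -rf "$scratch"' EXIT        # runs when the subshell exits, even on error
  cd "$scratch"
  echo "scratch data" > note.txt       # stand-in for real work
  pwd                                  # report where the work happened
)

before=$PWD
used=$(with_scratch_dir)
echo "worked in: $used"
if [ ! -e "$used" ]; then echo "scratch dir was cleaned up"; fi
if [ "$PWD" = "$before" ]; then echo "caller is still in $before"; fi
```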

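In Bash, note there are many kinds of variable expansion. Checking that a variable exists: ${name:?error message}. For example, if a Bash script requires a single argument, just write input_file=${1:?usage: $0 input_file}. Arithmetic expansion: i=$(( (i + 1) % 5 )). Sequences: {1..10}. Trimming of strings: ${var%suffix} and ${var#prefix}. For example, if var=foo.pdf, then echo ${var%.pdf}.txt prints foo.txt.
The output of a command can be treated like a file via <(some command). For example, compare the local /etc/hosts with a remote one: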

      diff /etc/hosts <(ssh somehost cat /etc/hosts)
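
The expansions and process substitution described above can be tried without a remote host. In this sketch the function to_txt_name and the sample path are illustrative assumptions, and printf stands in for the two commands being compared.

```shell
#!/usr/bin/env bash
# Hypothetical function that requires one argument, checked with ${1:?...}.
to_txt_name() {
  local input_file=${1:?usage: to_txt_name input_file}
  echo "${input_file%.pdf}.txt"       # strip the .pdf suffix, append .txt
}

to_txt_name foo.pdf                   # prints foo.txt

path=/var/log/syslog                  # sample value, not read from disk
echo "${path#/var/}"                  # strip the /var/ prefix: log/syslog

i=4
i=$(( (i + 1) % 5 ))                  # arithmetic expansion: wraps around to 0
echo "$i"

echo {1..5}                           # sequence: 1 2 3 4 5

# Process substitution: each <(...) looks like a file name to diff.
diff <(printf 'a\nb\n') <(printf 'a\nc\n') || true   # diff exits 1 when inputs differ
```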

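Know about "here documents" in Bash, as in cat <<EOF ....
In Bash, redirect both standard output and standard error via: some-command >logfile 2>&1. Often, to ensure a command does not leave an open file handle on standard input, tying it to the terminal you are in, it is also good practice to add </dev/null.
Use man ascii for an ASCII table with hex and decimal values. For general encoding information, man unicode, man utf-8, and man latin1 are helpful.
Use screen or tmux to multiplex the screen, which is especially useful on remote ssh sessions because you can detach from and re-attach to a session. A more minimal alternative is dtach.
In ssh, knowing how to set up a tunnel with -L or -D (and occasionally -R) is useful, e.g. to access web sites from a remote server.
A few optimizations to your ssh configuration can be useful; for example, this ~/.ssh/config contains settings to avoid dropped connections in certain network environments, compress data, and multiplex channels: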

      TCPKeepAlive=yes
      ServerAliveInterval=15
      ServerAliveCountMax=6
      Compression=yes
      ControlMaster auto
      ControlPath /tmp/%r@%h:%p
      ControlPersist yes

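A few other ssh options are security sensitive and should be enabled with care, e.g. only in trusted networks: StrictHostKeyChecking=no, ForwardAgent=yes
To get the permissions on a file in octal form, use something like: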

      stat -c '%A %a %n' /etc/timezone
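
As a self-contained illustration of the octal-permission tip, the sketch below creates a temporary file with a known mode and reads it back. It assumes GNU stat, as found on Linux; BSD and macOS stat use different flags.

```shell
#!/usr/bin/env bash
# Assumes GNU stat (Linux).
f=$(mktemp)                    # temporary file for the demo
chmod 640 "$f"
stat -c '%A %a %n' "$f"        # symbolic mode, octal mode, file name
mode=$(stat -c '%a' "$f")      # just the octal digits, handy in scripts
echo "$mode"
rm -f "$f"
```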

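Use percol to interactively select values from the output of another command.
Use fpp (PathPicker) to interact with files based on the output of another command (such as git).
For a simple web server that exposes all files in the current directory (and subdirectories) to anyone on your network, use: python -m SimpleHTTPServer 7777 (for port 7777 and Python 2) or python -m http.server 7777 (for port 7777 and Python 3).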
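
Returning to the here-document and redirection items earlier in this section, here is a small sketch. The function noisy and the variable names are invented for the demonstration; it appends with >> so that the here-document output is kept.

```shell
#!/usr/bin/env bash
logfile=$(mktemp)                 # scratch log file for the demo

# Here document: the lines up to EOF become the standard input of cat.
# Variables are expanded because the delimiter is unquoted.
greeting="hello"
cat <<EOF > "$logfile"
first line: $greeting
second line
EOF

# Hypothetical noisy command that writes to both streams.
noisy() { echo "progress"; echo "warning" >&2; }

# Send stdout to the file first, then point stderr at the same place.
# </dev/null detaches standard input from the terminal.
noisy >>"$logfile" 2>&1 </dev/null

cat "$logfile"
```

The order matters: writing 2>&1 before the file redirection would leave stderr pointing at the terminal.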

Processing files and data

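To locate a file by name under the current directory, use find . -iname '*something*' (or similar). To find a file anywhere by name, use locate something (but bear in mind that updatedb may not have indexed recently created files).
For general searching through source or data files (better than grep -r), use ag.
To convert HTML to text: lynx -dump -stdin
For Markdown, HTML, and all kinds of document conversion, try pandoc.
If you must handle XML, xmlstarlet is old but good.
For JSON, use jq.
For Excel or CSV files, csvkit provides in2csv, csvcut, csvjoin, csvgrep, etc.
For Amazon S3, s3cmd is convenient and s4cmd is faster. Amazon's official aws tool is essential for other AWS-related tasks.
Know about sort and uniq, including uniq's -u and -d options; see the one-liners below.
Know about cut, paste, and join to manipulate text files. Many people use cut but forget about join.
Know about wc to count newlines (-l), characters (-m), words (-w), and bytes (-c).
Know about tee to copy from stdin to a file and also to stdout, as in ls -al | tee file.txt.
Know that locale affects many command-line tools in subtle ways, including sorting order and performance. Most Linux installations set LANG or other locale variables to a local setting. Be aware that sorting results can change if you change locale, and that internationalization routines can make sort or other commands run many times slower. In some situations (such as the set operations below) you can safely ignore internationalization entirely and use byte-based order with export LC_ALL=C.
Know basic awk and sed for simple data munging. For example, summing all numbers in the third column of a text file: awk '{ x += $3 } END { print x }'. This is probably 3X faster and 3X shorter than equivalent Python.
To replace all occurrences of a string in place, in one or more files: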

      perl -pi.bak -e 's/old-string/new-string/g' my-files-*.txt
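
Since join is the tool from the cut, paste, and join tip earlier in this section that people tend to forget, here is a small sketch of it. The two tables and their contents are invented for illustration.

```shell
#!/usr/bin/env bash
export LC_ALL=C                      # byte-order collation, so sort and join agree
dir=$(mktemp -d)

# Two hypothetical tables keyed by a user id in the first field.
printf '%s\n' '2 bob' '1 alice' '3 carol' > "$dir/names"
printf '%s\n' '3 admin' '1 dev'           > "$dir/roles"

# join needs both inputs sorted on the join field; ids without a match are dropped.
join <(sort -k1,1 "$dir/names") <(sort -k1,1 "$dir/roles")

# cut picks fields; paste -s glues the lines of its input into one line.
cut -d' ' -f2 "$dir/names" | paste -s -d, -
```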

To rename many files at once according to a pattern, use rename. For complex renames, repren may help.

      # Recover backup files foo.bak -> foo:
      rename 's/\.bak$//' *.bak
      # Full rename of filenames, directories, and contents foo -> bar:
      repren --full --preserve-case --from foo --to bar .

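Use shuf to select random lines from a file.
Know sort's options. Know how keys work (-t and -k). In particular, note that you need to write -k1,1 to sort by only the first field; -k1 means sort according to the whole line.
Stable sort (sort -s) can be useful. For example, to sort with the second field as the primary key and the first field as the secondary key, you can use sort -k1,1 | sort -s -k2,2
If you ever need to write a tab literal on the Bash command line, press ctrl-v [Tab] or write $'\t' (the latter is probably better, as you can copy and paste it).
For binary files, use hd for hex dumps and bvi for binary editing.
Also for binary files, strings (plus grep, etc.) lets you find bits of text.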
To convert text encodings, try iconv. Or uconv for more advanced use; it supports some advanced Unicode things. For example, this command lowercases and removes all accents (by expanding and dropping them):

      uconv -f utf-8 -t utf-8 -x '::Any-Lower; ::Any-NFD; [:Nonspacing Mark:] >; ::Any-NFC; ' < input.txt > output.txt

To split files, see split (to split by size) and csplit (to split by a pattern).
Use zless, zmore, zcat, and zgrep to operate on compressed files.
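
The notes on sort keys and stable sorting in this section are easier to follow with data in hand. The sample rows below are invented; the sketch assumes a sort that supports -s, such as GNU sort.

```shell
#!/usr/bin/env bash
export LC_ALL=C
# Hypothetical data: field 1 is a name, field 2 is a team.
data=$(printf '%s\n' 'zoe red' 'amy red' 'bob blue' 'amy blue')

# -k1,1 restricts the key to field 1; plain -k1 would run from field 1 to end of line.
# The second sort is stable (-s), so rows that tie on field 2 keep their field-1 order.
sorted=$(printf '%s\n' "$data" | sort -k1,1 | sort -s -k2,2)
printf '%s\n' "$sorted"

# A literal tab as the field separator, written as $'\t'; n makes the key numeric.
printf 'b\t2\na\t1\n' | sort -t $'\t' -k2,2n
```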

System debugging

For web debugging, curl and curl -I are handy, or their wget equivalents, or the more modern httpie.
To know disk/cpu/network status, use iostat, netstat, top (or the better htop), and (especially) dstat. Good for getting a quick idea of what's happening on a system.
For a more in-depth system overview, use glances. It presents you with several system level statistics in one terminal window. Very helpful for quickly checking on various subsystems.
To know memory status, run and understand the output of free and vmstat. In particular, be aware the "cached" value is memory held by the Linux kernel as file cache, so effectively counts toward the "free" value.
Java system debugging is a different kettle of fish, but a simple trick on Oracle's and some other JVMs is that you can run kill -3 <pid> and a full stack trace and heap summary (including generational garbage collection details, which can be highly informative) will be dumped to stderr/logs.
Use mtr as a better traceroute, to identify network issues.
For looking at why a disk is full, ncdu saves time over the usual commands like du -sh *.
To find which socket or process is using bandwidth, try iftop or nethogs.
The ab tool (comes with Apache) is helpful for quick-and-dirty checking of web server performance. For more complex load testing, try siege.
For more serious network debugging, wireshark, tshark, or ngrep.
Know about strace and ltrace. These can be helpful if a program is failing, hanging, or crashing, and you don't know why, or if you want to get a general idea of performance. Note the profiling option (-c), and the ability to attach to a running process (-p).
Know about ldd to check shared libraries etc.
Know how to connect to a running process with gdb and get its stack traces.
Use /proc. It's amazingly helpful sometimes when debugging live problems. Examples: /proc/cpuinfo, /proc/xxx/cwd, /proc/xxx/exe, /proc/xxx/fd/, /proc/xxx/smaps.
When debugging why something went wrong in the past, sar can be very helpful. It shows historic statistics on CPU, memory, network, etc.
For deeper systems and performance analyses, look at stap (SystemTap), perf, and sysdig.
Confirm what Linux distribution you're using (works on most distros): lsb_release -a
Use dmesg whenever something's acting really funny (it could be hardware or driver issues).
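
As a rough sketch of the /proc tip above (Linux only), the following inspects the current shell. The function name proc_summary is a made-up helper, not a standard tool.

```shell
#!/usr/bin/env bash
# Linux only: /proc/<pid>/ exposes live details about a process.
# proc_summary is a hypothetical helper name.
proc_summary() {
  local pid=${1:?usage: proc_summary pid}
  echo "cwd: $(readlink "/proc/$pid/cwd")"         # current working directory
  echo "exe: $(readlink "/proc/$pid/exe")"         # path of the running binary
  echo "open fds: $(ls "/proc/$pid/fd" | wc -l)"   # number of open file descriptors
}

proc_summary $$                                    # $$ is the PID of this shell
grep -c '^processor' /proc/cpuinfo                 # number of logical CPUs on most architectures
```

Inspecting another user's process this way usually requires root.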

One-liners

A few examples of piecing together commands:

It is remarkably helpful sometimes that you can do set intersection, union, and difference of text files via sort/uniq. Suppose a and b are text files that are already uniqued. This is fast, and works on files of arbitrary size, up to many gigabytes. (Sort is not limited by memory, though you may need to use the -T option if /tmp is on a small root partition.) See also the note about LC_ALL above and sort's -u option (left out for clarity below).

      cat a b | sort | uniq > c   # c is a union b
      cat a b | sort | uniq -d > c   # c is a intersect b
      cat a b b | sort | uniq -u > c   # c is set difference a - b
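
To see the three set operations on concrete data, here is a sketch with two tiny files. The file contents are invented; the pipelines are the same as above, with results captured in variables instead of a file c.

```shell
#!/usr/bin/env bash
export LC_ALL=C                  # byte order keeps sort and uniq consistent
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/a"    # set a
printf '%s\n' banana cherry date  > "$dir/b"    # set b

union=$(cat "$dir/a" "$dir/b" | sort | uniq)
intersection=$(cat "$dir/a" "$dir/b" | sort | uniq -d)
# b is listed twice so every line of b occurs at least twice and is dropped by -u.
difference=$(cat "$dir/a" "$dir/b" "$dir/b" | sort | uniq -u)

printf 'union:\n%s\n' "$union"
printf 'intersection:\n%s\n' "$intersection"
printf 'a - b:\n%s\n' "$difference"
```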

Use grep . * to visually examine all contents of all files in a directory, e.g. for directories filled with config settings, like /sys, /proc, /etc.
Summing all numbers in the third column of a text file (this is probably 3X faster and 3X less code than equivalent Python):

      awk '{ x += $3 } END { print x }' myfile
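
A worked instance of the column sum, with invented sample data, plus a grouped variant as one possible extension of the same idea:

```shell
#!/usr/bin/env bash
# Hypothetical three-column data: name, region, amount.
data=$(mktemp)
printf '%s\n' 'alice east 10' 'bob west 2.5' 'carol east 30' > "$data"

awk '{ x += $3 } END { print x }' "$data"            # total of column 3: 42.5

# Same idea, grouped by the second column. The iteration order of
# "for (k in sum)" is unspecified, so the result is sorted afterwards.
awk '{ sum[$2] += $3 } END { for (k in sum) print k, sum[k] }' "$data" | sort
```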

If you want to see sizes/dates on a tree of files, this is like a recursive ls -l but is easier to read than ls -lR:

      find . -type f -ls

Use xargs or parallel whenever you can. Note you can control how many items execute per line (-L) as well as parallelism (-P). If you're not sure if it'll do the right thing, use xargs echo first. Also, -I{} is handy. Examples:

      find . -name '*.py' | xargs grep some_function
      cat hosts | xargs -I{} ssh root@{} hostname

Say you have a text file, like a web server log, and a certain value that appears on some lines, such as an acct_id parameter that is present in the URL. If you want a tally of how many requests for each acct_id:

      cat access.log | egrep -o 'acct_id=[0-9]+' | cut -d= -f2 | sort | uniq -c | sort -rn
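
The tally can be checked against a tiny made-up log. In this sketch the log lines and the function name tally_acct_ids are assumptions; grep -E is used as the current spelling of egrep, and a final awk step is added only to normalize the padded counts that uniq -c prints.

```shell
#!/usr/bin/env bash
# Hypothetical access log lines; only the acct_id parameter matters here.
log=$(mktemp)
cat > "$log" <<'EOF'
GET /view?acct_id=42&page=1
GET /view?acct_id=7&page=3
GET /edit?acct_id=42
GET /health
GET /view?acct_id=42&page=2
EOF

# Hypothetical wrapper around the pipeline; prints "acct_id count", busiest first.
tally_acct_ids() {
  grep -Eo 'acct_id=[0-9]+' "$1" | cut -d= -f2 | sort | uniq -c | sort -rn |
    awk '{ print $2, $1 }'
}

tally_acct_ids "$log"
```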

Run this function to get a random tip from this document (parses Markdown and extracts an item):

      function taocl() {
        curl -s https://raw.githubusercontent.com/jlevy/the-art-of-command-line/master/README.md |
          pandoc -f markdown -t html |
          xmlstarlet fo --html --dropdtd |
          xmlstarlet sel -t -v "(html/body/ul/li[count(p)>0])[$RANDOM mod last()+1]" |
          xmlstarlet unesc | fmt -80
      }

Obscure but useful

expr: perform arithmetic or boolean operations or evaluate regular expressions
m4: simple macro processor
yes: print a string a lot
cal: nice calendar
env: run a command (useful in scripts)
look: find English words (or lines in a file) beginning with a string
cut and paste and join: data manipulation
fmt: format text paragraphs
pr: format text into pages/columns
fold: wrap lines of text
column: format text into columns or tables
expand and unexpand: convert between tabs and spaces
nl: add line numbers
seq: print numbers
bc: calculator
factor: factor integers
gpg: encrypt and sign files
toe: table of terminfo entries
nc: network debugging and data transfer
socat: socket relay and tcp port forwarder (similar to netcat)
slurm: network traffic visualization
dd: moving data between files or devices
file: identify type of a file
tree: display directories and subdirectories as a nesting tree; like ls but recursive
stat: file info
tac: print files in reverse
shuf: random selection of lines from a file
comm: compare sorted files line by line
hd and bvi: dump or edit binary files
strings: extract text from binary files
tr: character translation or manipulation
iconv or uconv: conversion for text encodings
split and csplit: splitting files
7z: high-ratio file compression
ldd: dynamic library info
nm: symbols from object files
ab: benchmarking web servers
strace: system call debugging
mtr: better traceroute for network debugging
cssh: visual concurrent shell
rsync: sync files and folders over SSH
wireshark and tshark: packet capture and network debugging
ngrep: grep for the network layer
host and dig: DNS lookups
lsof: process file descriptor and socket info
dstat: useful system stats
glances: high level, multi-subsystem overview
iostat: CPU and disk usage stats
htop: improved version of top
last: login history
w: who's logged on
id: user/group identity info
sar: historic system stats
iftop or nethogs: network utilization by socket or process
ss: socket statistics
dmesg: boot and system error messages
hdparm: SATA/ATA disk manipulation/performance
lsb_release: Linux distribution info
lshw: hardware information
fortune, ddate, and sl: um, well, it depends on whether you consider steam locomotives and Zippy quotations "useful"
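
A few of the tools in this list are quick to try. The sketch below assumes GNU coreutils and uses throwaway inputs chosen only for illustration.

```shell
#!/usr/bin/env bash
export LC_ALL=C
seq 1 3 | tac                      # prints 3, 2, 1 on separate lines
factor 84                          # 84: 2 2 3 7
echo "abc" | tr 'a-c' 'A-C'        # ABC
echo "abcdefghij" | fold -w 4      # wraps into abcd, efgh, ij
# comm compares two sorted inputs; -12 keeps only the lines common to both.
comm -12 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
```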

More resources

awesome-shell: A curated list of shell tools and resources.
Strict mode for writing better shell scripts.

Disclaimer

With the exception of very small tasks, code is written so others can read it. With power comes responsibility. The fact you can do something in Bash doesn't necessarily mean you should! ;)
