awk: A Text-Processing Workhorse for Linux Bash Scripts


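awk is genuinely complex, and day to day I use only a small part of it. I look things up as I go, so these notes record the parts I use regularly and make them easy to find again.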

*Invocation
awk [-F field-separator] 'commands' input-file(s)
By default, fields are separated by whitespace (runs of spaces and tabs).
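A minimal sketch of the -F option. The sample lines are made up for illustration and only shaped like /etc/passwd entries; the variable names are likewise hypothetical.

```shell
# Hypothetical sample input, shaped like /etc/passwd entries.
data='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# -F: splits each line on ":" instead of whitespace.
# Print the login name (field 1) and the shell (field 7).
shells=$(printf '%s\n' "$data" | awk -F: '{ print $1, $7 }')
printf '%s\n' "$shells"
```

Without -F: each of these lines would be a single field, because none of them contains whitespace.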

*Patterns
awk 'BEGIN{} {command} END{}' input.txt
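As a small illustration of the three parts (the input numbers are invented for the example): BEGIN runs once before any input, the middle block runs once per line, and END runs once after the last line.

```shell
# BEGIN initialises the sum, the main block adds field 1 of every line,
# END prints the result together with the number of lines read (NR).
total=$(printf '3\n4\n5\n' |
    awk 'BEGIN { sum = 0 } { sum += $1 } END { print "sum=" sum, "lines=" NR }')
printf '%s\n' "$total"
```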

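*Regular expressions
\ ^ $ . [] | () * + ?
Note that + (one or more) and ? (zero or one) are not operators in the basic regular expressions that grep and sed use by default; there they need grep -E or sed -E.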
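A short sketch of ? in practice, using an invented word list. awk accepts the operator directly, while grep is given -E here so that it reads the same pattern as an extended regular expression.

```shell
# Hypothetical word list for the demonstration.
words='color
colour
colouur
clr'

# "u?" matches zero or one "u", so only the first two words match.
from_awk=$(printf '%s\n' "$words" | awk '/^colou?r$/')

# The same pattern with grep requires -E (extended regular expressions).
from_grep=$(printf '%s\n' "$words" | grep -E '^colou?r$')

printf '%s\n' "$from_awk"
```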

*Matching and non-matching
awk '{if ($3~/pattern/) actions}' input.txt
awk '{if ($3!~/pattern/) actions}' input.txt
awk '{if ($3=="abc") actions}' input.txt
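The three tests can be tried on a small invented table. An if statement is an action, so it sits inside braces; the last command shows the equivalent pattern form, where the condition goes in front of the braces.

```shell
# Hypothetical three-column input: name, age, role.
rows='alice 30 admin
bob 25 guest
carol 41 admin'

# ~ : field 3 matches the regular expression.
admins=$(printf '%s\n' "$rows" | awk '{ if ($3 ~ /^adm/) print $1 }')

# !~ : field 3 does not match the regular expression.
others=$(printf '%s\n' "$rows" | awk '{ if ($3 !~ /^adm/) print $1 }')

# == : exact string comparison, written as a pattern before the action.
exact=$(printf '%s\n' "$rows" | awk '$3 == "guest" { print $1, $2 }')

printf '%s\n' "$admins" "$others" "$exact"
```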

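*awk built-in variables
-----------------------------------------------------
ARGC      number of command-line arguments
ARGV      array of command-line arguments
ENVIRON   array of environment variables, indexed by name
FILENAME  name of the file awk is currently reading
FNR       record number within the current file
FS        input field separator, equivalent to the -F command-line option
NF        number of fields in the current record
NR        number of records read so far
OFS       output field separator
ORS       output record separator
RS        input record separator
-----------------------------------------------------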
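A small sketch showing a few of these variables together, on invented comma-separated input. FS and OFS are set in BEGIN; assigning $1 = $1 makes awk rebuild $0 with the new output separator.

```shell
# FS splits the input on commas; OFS joins the output with "|".
# NR is the running record number, NF the field count of the current record.
report=$(printf 'a,b,c\nd,e\n' |
    awk 'BEGIN { FS = ","; OFS = "|" } { $1 = $1; print NR, NF, $0 }')
printf '%s\n' "$report"
```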

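*awk built-in string functions
-----------------------------------------------------
gsub(r, s)         replace every match of r in $0 with s
gsub(r, s, t)      replace every match of r in t with s
index(s, t)        return the position of the first occurrence of string t in s (0 if absent)
length(s)          return the length of s
match(s, r)        return the position in s where r first matches (0 if there is no match)
split(s, a, fs)    split s into array a on separator fs
sprintf(fmt, exp)  return exp formatted according to fmt
sub(r, s)          replace the leftmost longest match of r in $0 with s
substr(s, p)       return the suffix of s starting at position p
substr(s, p, n)    return the substring of s of length n starting at position p
-----------------------------------------------------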
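The functions above can be tried from a BEGIN block, which needs no input file. This is only a sketch on an invented string; the variable names are arbitrary.

```shell
funcs=$(awk 'BEGIN {
    s = "hello world"
    print length(s)                  # length of the string
    print index(s, "world")          # position of the first "world"
    print substr(s, 7)               # suffix starting at position 7
    print substr(s, 1, 5)            # 5 characters starting at position 1
    t = s; n = gsub(/o/, "0", t)     # replace every "o" in t, n counts them
    print n, t
    u = s; sub(/l+/, "L", u)         # replace only the leftmost longest match
    print u
    m = match(s, /wor/)              # also sets RSTART and RLENGTH
    print m, RSTART, RLENGTH
    k = split("a:b:c", parts, ":")   # k is the number of pieces
    print k, parts[2]
    print sprintf("%05.1f", 3.14159) # zero-padded, one decimal place
}')
printf '%s\n' "$funcs"
```

Note that sub and gsub modify their target in place and return the number of replacements, which is why the example works on copies of s.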

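$1, $2, ... are built-in variables holding the first field, the second field, and so on; $0 holds the whole record.
awk runs BEGIN first; after it has read all input lines, it runs END (if present).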
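For instance, on a single invented line of three words, the field variables behave as follows ($NF, the last field, is added here as a common companion to $1 and $0):

```shell
# Each print statement emits one line: field 1, field 2, the whole record,
# and the last field (NF is the number of fields, so $NF is field NF).
fields=$(printf 'one two three\n' | awk '{ print $1; print $2; print $0; print $NF }')
printf '%s\n' "$fields"
```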


And now for a grand example:

# This awk program collects statistics on two 
# "random variables" and the relationships
# between them. It looks only at fields 1 and
# 2 by default. Define the variables F and G
# on the command line to force it to look at
# different fields. For example:
# awk -f stat_2o1.awk F=2 G=3 stuff.dat \
# F=3 G=5 otherstuff.dat
# or, from standard input:
# awk -f stat_2o1.awk F=1 G=3
# It ignores blank lines, lines where either
# one of the requested fields is empty, and
# lines whose first field contains a number
# sign. It requires only one pass through the
# data. This script works with vanilla awk
# under SunOS 4.1.3.
BEGIN{
F=1;
G=2;
}
length($F) > 0 && \
length($G) > 0 && \
$1 !~/^#/ {
sx1+= $F; sx2 += $F*$F;
sy1+= $G; sy2 += $G*$G;
sxy1+= $F*$G;
if( N==0 ) xmax = xmin = $F;
if( xmin > $F ) xmin=$F;
if( xmax < $F ) xmax=$F;
if( N==0 ) ymax = ymin = $G;
if( ymin > $G ) ymin=$G;
if( ymax < $G ) ymax=$G;
N++;
}

END {
printf("%d # N\n" ,N );
if (N <= 1)
{
printf("What's the point?\n");
exit 1;
}
printf("%g # xmin\n",xmin);
printf("%g # xmax\n",xmax);
printf("%g # xmean\n",xmean=sx1/N);
xSigma = sx2 - 2 * xmean * sx1+ N*xmean*xmean;
printf("%g # xvar\n" ,xvar =xSigma/ N );
printf("%g # xvar unbiased\n",xvaru=xSigma/(N-1));
printf("%g # xstddev\n" ,sqrt(xvar ));
printf("%g # xstddev unbiased\n",sqrt(xvaru));

printf("%g # ymin\n",ymin);
printf("%g # ymax\n",ymax);
printf("%g # ymean\n",ymean=sy1/N);
ySigma = sy2 - 2 * ymean * sy1+ N*ymean*ymean;
printf("%g # yvar\n" ,yvar =ySigma/ N );
printf("%g # yvar unbiased\n",yvaru=ySigma/(N-1));
printf("%g # ystddev\n" ,sqrt(yvar ));
printf("%g # ystddev unbiased\n",sqrt(yvaru));
if ( xSigma * ySigma <= 0 )
r=0;
else
r=(sxy1 - xmean*sy1- ymean * sx1+ N * xmean * ymean) \
/sqrt(xSigma * ySigma);
printf("%g # correlation coefficient\n", r);
if( r > 1 || r < -1 ) {
printf("SERIOUS ERROR! CORRELATION COEFFICIENT");
printf(" OUTSIDE RANGE -1..1\n");
}

if( 1-r*r != 0 )
printf("%g # Student's T (use with N-2 degfreed)\n", \
t=r*sqrt((N-2)/(1-r*r)) );
else {
printf("0 # Correlation is perfect,");
printf(" Student's T is plus infinity\n");
}
b = (sxy1 - ymean * sx1)/(sx2 - xmean * sx1);
a = ymean - b * xmean;
ss=sy2 - 2*a*sy1- 2*b*sxy1 + N*a*a + 2*a*b*sx1+ b*b*sx2 ;
ss/= N-2;
printf("%g # a = y-intercept\n", a);
printf("%g # b = slope\n" , b);
printf("%g # s^2 = unbiased estimator for sigsq\n",ss);
printf("%g + %g * x # equation ready for cut-and-paste\n",a,b);
ra = sqrt(ss * sx2 / (N * xSigma));
rb = sqrt(ss / ( xSigma));
printf("%g # radius of confidence interval ", ra);
printf("for a, multiply by t\n");
printf("%g # radius of confidence interval ", rb);
printf("for b, multiply by t\n");
}
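The single-pass idea behind the script can be checked on a tiny data set. The sketch below is a cut-down, one-variable version written for illustration, not the script itself: it accumulates the sum and the sum of squares while reading, then derives the mean and both variances in END using the same expansion as xSigma above. The input 1, 2, 3, 4 is invented.

```shell
# s1 = sum of x, s2 = sum of x*x, n = number of values.
# ss = s2 - 2*mean*s1 + n*mean*mean equals the sum of squared deviations.
stats=$(printf '1\n2\n3\n4\n' | awk '
{ s1 += $1; s2 += $1 * $1; n++ }
END {
    mean = s1 / n
    ss = s2 - 2 * mean * s1 + n * mean * mean
    # mean, population variance, unbiased variance
    printf("%g %g %g\n", mean, ss / n, ss / (n - 1))
}')
printf '%s\n' "$stats"
```

For this input the sums are s1 = 10 and s2 = 30, so ss = 30 - 50 + 25 = 5, giving a variance of 1.25 and an unbiased variance of about 1.66667.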
