coreutils: Input processing in ptx

7.5.3 Word selection and input processing
-----------------------------------------

‘-b FILE’
‘--break-file=FILE’

     This option provides an alternative to ‘-W’ for describing which
     characters make up words.  It names a file containing the list of
     characters which _cannot_ be part of a word; this file is called
     the “Break file”.  Any character which is not listed in the Break
     file is a word constituent.  If both ‘-b’ and ‘-W’ are specified,
     ‘-W’ takes precedence and ‘-b’ is ignored.

     When GNU extensions are enabled, the only way to keep newline from
     being a break character is to write all the break characters in
     the file with no newline at all, not even at the end of the file.
     When GNU extensions are disabled, spaces, tabs and newlines are
     always considered break characters, even if they are not included
     in the Break file.

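     The Break file rule can be modeled in a few lines of Python.  This
     is only a sketch of the rule as described above, not the code used
     by ‘ptx’: the name ‘split_words’ and the sample strings are
     hypothetical, and only the case where GNU extensions are enabled
     (no implicit break characters) is modeled.

```python
def split_words(text, break_chars):
    """Split text into words; any character in break_chars ends a word."""
    words, current = [], []
    for ch in text:
        if ch in break_chars:
            # A break character closes the word collected so far, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            # Every character not listed in the Break file is a constituent.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


# Hypothetical Break file contents, taken as raw characters.
with_newline = set(" ,.\n")    # the file ends with a newline
without_newline = set(" ,.")   # the file contains no newline at all

sample = "one two\nthree"
words_a = split_words(sample, with_newline)
words_b = split_words(sample, without_newline)
```

     In this model a trailing newline in the Break file is enough to
     make newline a break character, which is why ‘words_a’ and
     ‘words_b’ differ.
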
‘-i FILE’
‘--ignore-file=FILE’

     The file associated with this option contains a list of words which
     will never be taken as keywords in concordance output.  It is
     called the “Ignore file”.  The file contains exactly one word per
     line; this end-of-line separation of words is not affected by the
     value of the ‘-S’ option.

‘-o FILE’
‘--only-file=FILE’

     The file associated with this option contains a list of words which
     will be retained in concordance output; any word not mentioned in
     this file is ignored.  The file is called the “Only file”.  The
     file contains exactly one word per line; this end-of-line
     separation of words is not affected by the value of the ‘-S’
     option.

     There is no default for the Only file.  When both an Only file and
     an Ignore file are specified, a word is considered a keyword only
     if it is listed in the Only file and not in the Ignore file.

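     The interaction between the Only file and the Ignore file can be
     written as a small predicate.  The following Python sketch is an
     illustration under stated assumptions, not the code used by ‘ptx’:
     the names ‘load_word_list’ and ‘is_keyword’ are hypothetical, the
     word lists are inline strings rather than real files, and case
     folding is not modeled.

```python
def load_word_list(text):
    """Parse file contents that hold exactly one word per line."""
    return {line for line in text.split("\n") if line}


def is_keyword(word, only=None, ignore=None):
    """Apply the Only file and Ignore file rules to one candidate word."""
    # With an Only file, a word must be listed there to be retained.
    if only is not None and word not in only:
        return False
    # With an Ignore file, a listed word is never a keyword.
    if ignore is not None and word in ignore:
        return False
    return True


# Hypothetical file contents.
only = load_word_list("index\nfile\nword\n")
ignore = load_word_list("the\nfile\n")

candidates = ["the", "index", "file", "word"]
kept = [w for w in candidates if is_keyword(w, only, ignore)]
```

     Here ‘file’ is dropped although it appears in the Only file,
     because it is also listed in the Ignore file.
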
‘-r’
‘--references’

     On each input line, the leading sequence of non-white space
     characters is taken to be a reference whose purpose is to identify
     this input line in the resulting permuted index.  See Output
     formatting in ptx, for more information about reference
     production.  Using this option changes the default value of
     option ‘-S’.

     With this option, the program does not try very hard to remove
     references from contexts in output, but it succeeds in doing so
     _when_ the context ends exactly at the newline.  If option ‘-r’ is
     used with the default value of ‘-S’, or when GNU extensions are
     disabled, this condition is always met and references are
     completely excluded from the output contexts.

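     As an illustration of what “leading sequence of non-white space
     characters” means, the Python sketch below separates a reference
     from the rest of an input line.  The name ‘split_reference’ and
     the sample line are hypothetical; treating a line that starts with
     white space as having an empty reference is an assumption of this
     sketch, not documented behavior.

```python
import re


def split_reference(line):
    """Return (reference, rest) for one input line."""
    # Group 1: leading run of non-white space characters (the reference).
    # Group 2: everything after the white space that follows it.
    match = re.match(r"(\S*)\s*(.*)", line, re.DOTALL)
    return match.group(1), match.group(2)


reference, rest = split_reference("ch1:12 The quick brown fox")
```
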
‘-S REGEXP’
‘--sentence-regexp=REGEXP’

     This option selects which regular expression will describe the end
     of a line or the end of a sentence.  In fact, this regular
     expression is not the only distinction between ends of lines and
     ends of sentences, and input line boundaries have no special
     significance outside this option.  By default, when GNU extensions
     are enabled and the ‘-r’ option is not used, ends of sentences are
     used.  In this case, the REGEXP is imported from GNU Emacs:

          [.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*

     Whenever GNU extensions are disabled or the ‘-r’ option is used,
     ends of lines are used; in this case, the default REGEXP is just:

          \n

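     To see what the default sentence REGEXP matches, it can be
     transcribed into another regular expression dialect.  The Python
     sketch below is a hand translation and is offered as an
     approximation only: ‘ptx’ uses its own regular expression
     matcher, and the names ‘SENTENCE_END’ and ‘split_sentences’ are
     hypothetical.  The Emacs groups ‘\\(’, ‘\\|’ and ‘\\)’ become
     ‘(?:’, ‘|’ and ‘)’, and ‘$’ is made to match at line ends with
     ‘re.MULTILINE’.

```python
import re

# Hand translation of the Emacs-style default shown above (an assumption
# of this sketch, not a pattern taken from the ptx sources).
SENTENCE_END = re.compile(r"""[.?!][\]"')}]*(?:$|\t|  )[ \t\n]*""",
                          re.MULTILINE)


def split_sentences(text):
    """Cut text after each match of SENTENCE_END."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        # Keep the closing punctuation, drop the trailing white space.
        sentences.append(text[start:match.end()].rstrip())
        start = match.end()
    if start < len(text):
        # Trailing text that is not closed by a sentence end.
        sentences.append(text[start:].rstrip())
    return sentences


sample = "It works.  Does it?  Yes!\nA e.g. case."
sentences = split_sentences(sample)
```

     In this sample, the periods in ‘e.g.’ do not end a sentence: each
     is followed by a letter or by a single space, not by an end of
     line, a tab or two spaces.
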
     Using an empty REGEXP is equivalent to completely disabling end of
     line or end of sentence recognition.  In this case, the whole file
     is considered to be a single big line or sentence.  The user might
     want to disallow all truncation flag generation as well, through
     option ‘-F ""’.  See Syntax of Regular Expressions in The GNU
     Emacs Manual, node (emacs)Regexps.

     When the keywords happen to be near the beginning of the input line
     or sentence, this often creates an unused area at the beginning of
     the output context line; when the keywords happen to be near the
     end of the input line or sentence, this often creates an unused
     area at the end of the output context line.  The program tries to
     fill those unused areas by wrapping around context in them; the
     tail of the input line or sentence is used to fill the unused area
     on the left of the output line; the head of the input line or
     sentence is used to fill the unused area on the right of the output
     line.

     As a matter of convenience to the user, many of the usual
     backslashed escape sequences from the C language are recognized
     and converted to the corresponding characters by ‘ptx’ itself.

‘-W REGEXP’
‘--word-regexp=REGEXP’

     This option selects which regular expression will describe each
     keyword.  By default, if GNU extensions are enabled, a word is a
     sequence of letters; the REGEXP used is ‘\w+’.  When GNU extensions
     are disabled, a word is by default anything which ends with a
     space, a tab or a newline; the REGEXP used is ‘[^ \t\n]+’.

     An empty REGEXP is equivalent to not using this option.  See
     Syntax of Regular Expressions in The GNU Emacs Manual, node
     (emacs)Regexps.

     As a matter of convenience to the user, many of the usual
     backslashed escape sequences, as found in the C language, are
     recognized and converted to the corresponding characters by ‘ptx’
     itself.

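     The difference between the two default word REGEXPs is easy to see
     on text that contains punctuation.  The Python sketch below only
     approximates it: Python's ‘\w’ also matches digits and the
     underscore, which may not agree exactly with the matcher used by
     ‘ptx’, and the variable names are hypothetical.

```python
import re

# The two default word REGEXPs quoted above, in Python syntax.
GNU_DEFAULT = re.compile(r"\w+")
NON_GNU_DEFAULT = re.compile(r"[^ \t\n]+")

text = "don't stop\tnow"

# The apostrophe splits the first word under \w+ ...
gnu_words = GNU_DEFAULT.findall(text)
# ... but stays inside the word when only space, tab and newline delimit.
plain_words = NON_GNU_DEFAULT.findall(text)
```
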