coreutils: Input processing in ptx

7.5.3 Word selection and input processing
-----------------------------------------

‘-b FILE’
‘--break-file=FILE’

This option provides an alternative (to ‘-W’) method of describing
which characters make up words. It introduces the name of a file
which contains a list of characters which can_not_ be part of one
word; this file is called the “Break file”. Any character which is
not part of the Break file is a word constituent. If both options
‘-b’ and ‘-W’ are specified, then ‘-W’ has precedence and ‘-b’ is
ignored.

When GNU extensions are enabled, the only way to avoid newline as a
break character is to write all the break characters in the file
with no newline at all, not even at the end of the file. When GNU
extensions are disabled, spaces, tabs and newlines are always
considered as break characters even if not included in the Break
file.
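The Break file rule can be modeled in a few lines of Python. This is
only an illustrative sketch, not how ‘ptx’ is implemented: the function
name `split_on_break_chars` and the sample character sets are
assumptions made for the example.

```python
def split_on_break_chars(text, break_chars):
    """Return the maximal runs of characters that are not break characters."""
    words, current = [], []
    for ch in text:
        if ch in break_chars:
            # A break character ends the word being collected, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            # Anything not listed as a break character is a word constituent.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


sample = "alpha,beta gamma.\ndelta"

# Hypothetical Break file whose contents end with a newline.
print(split_on_break_chars(sample, set(" ,.\n")))
# → ['alpha', 'beta', 'gamma', 'delta']

# The same break characters written with no newline anywhere in the file.
print(split_on_break_chars(sample, set(" ,.")))
# → ['alpha', 'beta', 'gamma', '\ndelta']
```

The second call shows why a Break file without any newline matters when
GNU extensions are enabled: the newline then counts as a word
constituent.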
1
‘-i FILE’
‘--ignore-file=FILE’

The file associated with this option contains a list of words which
will never be taken as keywords in concordance output. It is
called the “Ignore file”. The file contains exactly one word in
each line; the end of line separation of words is not subject to
the value of the ‘-S’ option.

‘-o FILE’
‘--only-file=FILE’

The file associated with this option contains a list of words which
will be retained in concordance output; any word not mentioned in
this file is ignored. The file is called the “Only file”. The
file contains exactly one word in each line; the end of line
separation of words is not subject to the value of the ‘-S’ option.

There is no default for the Only file. When both an Only file and
an Ignore file are specified, a word is considered a keyword only
if it is listed in the Only file and not in the Ignore file.
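The interaction between the Only file and the Ignore file amounts to a
small set-membership test. The sketch below restates that rule in
Python; the helper names `load_word_list` and `is_keyword` are
hypothetical, and the code does not attempt to model other ‘ptx’
behavior such as case folding.

```python
def load_word_list(file_contents):
    """Parse a word list with exactly one word per line."""
    return {line for line in file_contents.splitlines() if line}


def is_keyword(word, only_words=None, ignore_words=frozenset()):
    """Apply the Only/Ignore rule to a single word.

    only_words is None when no Only file was given (assumption: in that
    case every word is a candidate).
    """
    if only_words is not None and word not in only_words:
        return False
    return word not in ignore_words


only_words = load_word_list("apple\nbanana\n")
ignore_words = load_word_list("banana\nthe\n")

print(is_keyword("apple", only_words, ignore_words))   # → True
print(is_keyword("banana", only_words, ignore_words))  # → False
print(is_keyword("cherry", only_words, ignore_words))  # → False
print(is_keyword("cherry", None, ignore_words))        # → True
```

With both lists present, "banana" is rejected even though it appears in
the Only list, because it is also in the Ignore list.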
1
‘-r’
‘--references’

On each input line, the leading sequence of non-white space
characters will be taken to be a reference that has the purpose of
identifying this input line in the resulting permuted index. See
Output formatting in ptx, for more information about reference
production. Using this option changes the default value for option
‘-S’.

Using this option, the program does not try very hard to remove
references from contexts in output, but it succeeds in doing so
_when_ the context ends exactly at the newline. If option ‘-r’ is
used with ‘-S’ default value, or when GNU extensions are disabled,
this condition is always met and references are completely excluded
from the output contexts.
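To make the notion of a reference concrete, the following Python sketch
separates the leading run of non-white space characters from the rest of
an input line. The function name `split_reference` and the sample line
are made up for illustration; how ‘ptx’ itself treats surrounding white
space is not specified here, so the stripping of the remaining text is
an assumption.

```python
import re


def split_reference(line):
    """Split a line into (reference, remaining text).

    The reference is the leading sequence of non-white space characters.
    Stripping white space around the remaining text is an assumption of
    this sketch.
    """
    match = re.match(r"\S*", line)
    reference = match.group(0)
    context = line[match.end():].strip()
    return reference, context


print(split_reference("doc.txt:12 The quick brown fox\n"))
# → ('doc.txt:12', 'The quick brown fox')
```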
1
‘-S REGEXP’
‘--sentence-regexp=REGEXP’

This option selects which regular expression will describe the end
of a line or the end of a sentence. In fact, this regular
expression is not the only distinction between end of lines or end
of sentences, and input line boundaries have no special
significance outside this option. By default, when GNU extensions
are enabled and if ‘-r’ option is not used, end of sentences are
used. In this case, this REGEXP is imported from GNU Emacs:

     [.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*

Whenever GNU extensions are disabled or if ‘-r’ option is used, end
of lines are used; in this case, the default REGEXP is just:

     \n

Using an empty REGEXP is equivalent to completely disabling end of
line or end of sentence recognition. In this case, the whole file
is considered to be a single big line or sentence. The user might
want to disallow all truncation flag generation as well, through
option ‘-F ""’. See Syntax of Regular Expressions: (emacs)Regexps.
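The effect of the two default expressions can be tried out with a
Python approximation. This is a sketch under stated assumptions: the
Emacs-style pattern has been translated by hand into Python `re` syntax
(plain `(`, `|` and `)` instead of the backslashed forms), the two
literal spaces of the Emacs sentence-end pattern are assumed, and the
helper `split_units` is hypothetical. It does not reproduce how ‘ptx’
builds its contexts.

```python
import re

# Hand translation of the Emacs-derived sentence-end pattern.
SENTENCE_END = re.compile(r"""[.?!][\]"')}]*(?:$|\t|  )[ \t\n]*""", re.MULTILINE)

# Default used when end of lines are recognized instead.
LINE_END = re.compile(r"\n")


def split_units(text, end_pattern):
    """Cut text after each match of end_pattern.

    Passing None stands for an empty REGEXP: the whole text is one unit.
    """
    if end_pattern is None:
        return [text]
    units, start = [], 0
    for match in end_pattern.finditer(text):
        if match.end() > start:
            units.append(text[start:match.end()])
            start = match.end()
    if start < len(text):
        units.append(text[start:])
    return units


text = "It works.  Really?\nYes!  Done"
print(split_units(text, SENTENCE_END))
# → ['It works.  ', 'Really?\n', 'Yes!  ', 'Done']
print(split_units(text, LINE_END))
# → ['It works.  Really?\n', 'Yes!  Done']
print(split_units(text, None))
# → ['It works.  Really?\nYes!  Done']
```

Note that a period followed by a single space, as in "Dr. Smith", does
not end a sentence under this pattern; the punctuation must be followed
by an end of line, a tab, or two spaces.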
1
When the keywords happen to be near the beginning of the input line
or sentence, this often creates an unused area at the beginning of
the output context line; when the keywords happen to be near the
end of the input line or sentence, this often creates an unused
area at the end of the output context line. The program tries to
fill those unused areas by wrapping around context in them; the
tail of the input line or sentence is used to fill the unused area
on the left of the output line; the head of the input line or
sentence is used to fill the unused area on the right of the output
line.

As a matter of convenience to the user, many usual backslashed
escape sequences from the C language are recognized and converted
to the corresponding characters by ‘ptx’ itself.

‘-W REGEXP’
‘--word-regexp=REGEXP’

This option selects which regular expression will describe each
keyword. By default, if GNU extensions are enabled, a word is a
sequence of letters; the REGEXP used is ‘\w+’. When GNU extensions
are disabled, a word is by default anything which ends with a
space, a tab or a newline; the REGEXP used is ‘[^ \t\n]+’.

An empty REGEXP is equivalent to not using this option. See
Syntax of Regular Expressions: (emacs)Regexps.
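The difference between the two default word expressions is easy to see
on a short sample. The following Python sketch uses the `re` module as a
stand-in for the matcher used by ‘ptx’, which is an assumption: in
Python, `\w` also matches digits and the underscore, so the result is
only indicative of what ‘ptx’ would select.

```python
import re

# Default with GNU extensions enabled.
GNU_WORD = re.compile(r"\w+")

# Default with GNU extensions disabled: runs of non-blank characters.
PLAIN_WORD = re.compile(r"[^ \t\n]+")

text = "foo-bar baz_qux, end.\n"

print(GNU_WORD.findall(text))
# → ['foo', 'bar', 'baz_qux', 'end']
print(PLAIN_WORD.findall(text))
# → ['foo-bar', 'baz_qux,', 'end.']
```

With the second expression, punctuation attached to a word stays part of
the keyword, which is why "end." and "baz_qux," keep their trailing
characters.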
1
As a matter of convenience to the user, many usual backslashed
escape sequences, as found in the C language, are recognized and
converted to the corresponding characters by ‘ptx’ itself.