SFK Commands



Command: extract

Extract text from files using wildcards and expressions on the command line
with the free Swiss File Knife for Windows, Mac OS X and Linux.

sfk extract dirName "/searchtext/totext/"

extract data from text and binary files using wildcards * and ?
as well as SFK Simple Expressions in brackets [].

produces a (binary) data stream that can be
- written to terminal as hex dump (default)
- written to file by option -tofile
- sent to xed by +xed command chaining

subdirectories are included by default
   the sfk default for most commands is to process the given folders,
   as well as all subdirs within them. specify -nosub to disable this.

options
   -nosub        do not include files in subdirectories.
   -nobin[ary]   skip binary files.
   -case         case-sensitive text comparison. default is insensitive.
                 for details type: sfk help nocase
   -text         starts a list of search patterns of the form /src/ or
                 /src/totext/ where / is the separator char, src the text
                 to search for, and totext a mask to reformat output.
                 any separator char can be used which is not part of the
                 search text, i.e. /foo/ or _foo_ both search "foo".
                 -text is not required if a single filename is given.
   -pat          the same as -text, starting a pattern list.
   -bylist x.txt read search patterns from a file x.txt, supporting
                 multiple lines per pattern. (add -full for more.)
   -bylinelist x read /from/to/ or just /from/ patterns from a file x
                 with one pattern per line. (add -full for more.)
                 -by(line)list does not support sfk variables.
                 to use variables in patterns create an sfk script
                 with patterns as parameters. "sfk script" for more.
   -arc          XE: include content of .zip .jar .tar etc. archives
                     as deep as possible, including nested archives.
                 XD: demo will read first 1000 bytes of each entry.
   -qarc         quick read top level archives but not nested ones.
   -firsthit     process only first found pattern match per file.
   -quiet        do not show progress information.
   -stat         show statistics like hits per pattern and no. of files.
   -perf         show performance statistics.
   -full         print full help text covering -bylist pattern files,
                 special character case sensitivity and nested or repeated
                 replace behaviour.

output options
   -dump         create hexdump of search hits or replaced text.
    -wide        with -dump: show 16 bytes per line.
    -lean        with -dump: show  8 bytes per line.
   -dumpfrom     always dump search hits but not replaced text.
   -dumpall      dump search text and replaced text.
   -nodump       do not create a hexdump, list only matching files.
   -astext       no hexdump, but print search hits as plain text.
                 use this only with plain text files, not binary.
   -showle       highlight CR/LF line endings in hex dump output
   -context=n    with hexdump: show additional n bytes of context.
   -reldist      with hexdump: tell relative distances to previous hits.
   -nofile       do not insert :file header lines in output.
   -crlf, -lf    for file headers and default totext: force crlf or lf
                 line endings instead of system default
   -filehead s   file header to insert on every matching file.
                 only [file.name] surrounded by text can be used.
                 default is -filehead ":file [file.name]" unless a
                 single file is searched. cannot be used with xhexfind.
                 to get result and name in the same line use [file.name]
                 in the expression, like: sfk xfind -pure -nofile mydir
                 "/foo*bar/[file.name]: [all]\n/"
   -sep s        define separator s between hits in a file
   -to dir\$file write output files to given path. for details about
                 output file masks, type "sfk help opt" or "sfk run".
   -tofile x     write output data to a single output filename x
                 (which is not interpreted as a mask but taken as is).
   +tofile x     as last parameter (command chaining): write text as
                 displayed on terminal to a file x.
   -more[n]      pause output every 30 or n lines.

return codes for batch files
   0 = no matches, 1 = matches found, >1 = major error occurred.
   see also "sfk help opt" on how to influence error processing.
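
When sfk extract is driven from a script instead of a batch file, the return codes above can be mapped to their documented meaning. The following is a minimal Python sketch under stated assumptions: `classify_exit` and `run_extract` are hypothetical helper names (not part of sfk), and `run_extract` assumes an `sfk` binary is available on the PATH.

```python
import subprocess

def classify_exit(code):
    """Map an sfk return code to the meaning documented above."""
    if code == 0:
        return "no matches"
    if code == 1:
        return "matches found"
    # documented: >1 = major error; any other value is treated the same here
    return "error"

def run_extract(args):
    """Hypothetical wrapper: run `sfk extract` with a list of arguments."""
    # Passing a list (no shell) avoids shell interpretation of * ? | & etc.
    result = subprocess.run(["sfk", "extract", *args], capture_output=True)
    return classify_exit(result.returncode)
```

A caller would then branch on the returned string, for example treating "no matches" as a normal outcome and only "error" as a failure.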

quoted multi line parameters are supported in scripts
   using full trim. type "sfk script" for details.

wildcards and SFK expressions
   SFK Expressions are simple patterns containing literal text,
   wildcards * and ? and character classes in square brackets [].
   basically, the syntax provides extended wildcards but no
   further logic and is not related to regular expressions.

   search patterns are surrounded by a separator character which
   can be anything not contained in the search text, like / or _
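
The separator rule can be illustrated with a small Python sketch. It is not sfk's parser, only an illustration of the rule as described: the first character of the pattern is taken as the separator, and the text between separators becomes the fromtext and the optional totext. The function name `split_pattern` is hypothetical.

```python
def split_pattern(pattern):
    """Split '/from/' or '/from/to/' into (fromtext, totext or None)."""
    if len(pattern) < 2:
        raise ValueError("pattern too short")
    sep = pattern[0]                 # the first character defines the separator
    if pattern[-1] != sep:
        raise ValueError("pattern must end with its separator")
    fields = pattern[1:-1].split(sep)
    if len(fields) == 1:
        return fields[0], None       # /from/ : search only
    if len(fields) == 2:
        return fields[0], fields[1]  # /from/to/ : search and reformat
    raise ValueError("separator occurs inside the pattern text")
```

With this sketch, `/foo/` and `_foo_` both yield the search text `foo`, and a pattern whose text contains the separator is rejected, which is why another separator or `\x2f` is needed to search for a forward slash.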

   within a pattern /fromtext/totext/ the fromtext may contain:

     *                       - 0 to 4000 characters in the same
                               text line or paragraph, i.e. all
                               bytes not being CR, LF or NULL.
                               4000 is just a default maximum
                               that can be changed by:
     [0.100000 chars]        - 0 to 100000 characters in the same
                               text line or paragraph, i.e. the
                               same as * but with a larger range.
     ?                       - one character.
     ?????                   - same as [5.5 chars] or [5 chars]
     [bytes]                 - 0 to 4000 bytes (with CR,LF,NULL)
                               i.e. it collects stream text
                               across lines, even in binary data
     **                      - the same as [bytes].
     [0.100 bytes]           - 0 to 100 bytes
     [.100000 bytes]         - up to 100000 bytes
     [1.* bytes]             - 1 to default maximum bytes
     [2 chars]               - exactly 2 chars
     [30 bytes]              - exactly 30 bytes
     [byte of aeiou]         - one vowel (a OR A OR e OR ...),
                               case insensitive by default.
                               "aeiou" is a character list.
     [byte of \\\x2f]        - a backslash \ or forw. slash /
     [bytes of \r\n \t]      - whitespace incl. line ends
     [bytes of (\r\n \t)]    - the same, () are optional
     [bytes not \r\n\0]      - up to 4000 bytes as long as no
                               CR, LF or NULL byte appears
     [chars]                 - the same as [bytes not \r\n\0],
                               i.e. collect text in a line
     [char not ( \t)]        - same as [byte not ( \r\n\0\t)],
                               everything not blanks and tabs
     [char not )( \t]        - not brackets, blanks and tabs,
                               same as not (\(\) \t)
     [chars of a-z0-9]       - means a-zA-Z0-9 as search is
                               case insensitive by default
     [chars of \x61-\x7A]    - search a-z but not A-Z, or use
                               option -case for case search
     [eol]                   - end of line by characters:
                               CRLF or LF or CR

     [white]     = chars of (\t )     - 0 or more whitespaces
     [xwhite]    = bytes of (\t \r\n) - same but across lines
     [1 white]   = byte  of (\t )     - 1 whitespace
     [digit]     = byte  of (0-9)     - 1 digit
     [digits]    = bytes of (0-9)     - 0 or more digits
     [hexdigit]  = byte  of (0-9a-f)  - 1 hexadecimal digit
     [hexdigits] = bytes of (0-9a-f)  - 0 or more hex digits

     special keywords that do not count as tokens:
     [skip]   - at the start of a pattern: skip such text
                completely, do not count it as a search hit.
     [keep]   - search also the following text but keep it
                in the input data, without consuming it.
     [ortext] - foo[ortext]bar searches word foo or bar.
                [ortext] is allowed only between literals.

     anchors that have no length of their own:
     [start]  - start of file
     [end]    - end of file
     [lstart] - line start, i.e. start or CRLF or CR or LF
     [lend]   - logical line end, i.e. eol or end of file.
                to replace line ends use [eol] instead.

     how to search or replace special characters:
     -  to search or replace text containing the literal characters
        * ? \ [ ] then these must be escaped like \* \? \\ \[ \]
     -  ( ) are escaped only within character lists, like \( \)
     -  to search or replace the forward slash '/' type \x2f or use
        another char around from/to text, e.g. _fromtext_totext_
     -  parameters with blanks and non trivial characters need double
        quotes "", see also "about Shell Command Characters" below.

     expansion priorities: (highest first)
     if two search parts are side by side, and the same input
     character matches both, then these priorities apply:

       5:  start, end, lstart, lend
       4:  literal text, eol
       3:  whitelist classes: byte of, bytes of
       2:  blacklist classes: chars not, bytes not
       1:  plain wildcards: ?, *, **, byte, bytes, chars

     this means in "/[bytes]foo/" the [bytes] will stop collecting
     characters as soon as "foo" is found, as "foo" is a literal.
     on same or higher priority the right side stops the left side.
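
For readers used to regular expressions, the difference between * and ** and the "literal stops wildcard" rule can be approximated in Python. This is only a rough analogy: SFK Expressions are not regular expressions, and the two translations below are assumptions made for illustration, not a description of how sfk works internally.

```python
import re

# Assumed analogies:
#   /foo*bar/   * collects up to 4000 bytes that are not CR, LF or NULL
#               and stops as soon as the literal "bar" is found.
#   /foo**bar/  ** (same as [bytes]) also collects CR, LF and NULL.
# Matching is case-insensitive, like the sfk default without -case.
LINE_WILDCARD = re.compile(rb"foo[^\r\n\x00]{0,4000}?bar", re.IGNORECASE)
BYTE_WILDCARD = re.compile(rb"foo[\x00-\xff]{0,4000}?bar", re.IGNORECASE)

def first_hit(regex, data):
    """Return the first matching byte sequence, or None."""
    m = regex.search(data)
    return m.group(0) if m else None

same_line = b"x Foo 123 bar and bar again\n"
two_lines = b"foo(\r\n  1, 2);bar\n"
```

In this sketch the line-bound pattern finds a hit in `same_line` that ends at the first `bar`, and finds nothing in `two_lines` because a CR/LF lies between `foo` and `bar`; the byte-oriented pattern matches across the line break.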

   the totext may contain:

     [part 1]            use first text part of the fromtext.
                         e.g. the fromtext /*foo[.100 chars]bar*/
                         contains parts :   1 2         3    4 5
     [part1]             the same (blank is optional).
     [parts 1,2,3]       use parts 1, 2 and 3.
     [parts 1-10]        use parts 1 to 10.
     [strip(part1,\0)]   use part 1 but remove zero bytes.
                         only zero bytes "\0" can be removed.
     [file.name]         full input filename with path
     [file.relname]      input filename without path
     [file.path]         input file's path
     [file.base]         relname without last .extension
     [file.ext]          input filename extension
     [all]               use all parts from fromtext.

     [setvar name]...[endvar]   set variable "name" with data
                                between setvar and endvar.
     [getvar name]              fill in data from variable "name"

     although anchors like lstart, lend count as a separate part
     they need NOT be specified in the totext. this means that
     /[lstart]foo[lend]/bar/ just changes the word "foo".
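
Part numbering can be made concrete with a Python sketch in which every literal and every wildcard of the fromtext becomes one numbered group. It mimics the pattern "/$version:vernum=*,*name=*,*os=*,/[file.name]: [part6] v[part2] for [part10]\n/" and assumes, as an analogy only, that * behaves like a lazy run of bytes other than CR, LF and NULL. The names `FROMTEXT` and `reformat` are hypothetical.

```python
import re

# C stands in for "*" (assumed analogy, not sfk's implementation).
C = r"[^\r\n\x00]{0,4000}?"
FROMTEXT = re.compile(
    r"(\$version:vernum=)" + f"({C})" +   # parts 1, 2
    r"(,)" + f"({C})" +                   # parts 3, 4
    r"(name=)" + f"({C})" +               # parts 5, 6
    r"(,)" + f"({C})" +                   # parts 7, 8
    r"(os=)" + f"({C})" +                 # parts 9, 10
    r"(,)",                               # part 11
    re.IGNORECASE,
)

def reformat(text, file_name):
    """Apply the totext '[file.name]: [part6] v[part2] for [part10]\\n'."""
    m = FROMTEXT.search(text)
    if m is None:
        return None
    return f"{file_name}: {m.group(6)} v{m.group(2)} for {m.group(10)}\n"
```

Counting literals as parts is what makes the program name land in part 6 and the operating system in part 10, rather than in parts 2 and 3.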

supported slash patterns
   \t    = TAB
   \r    = CR
   \n    = LF
   \x00  = one byte with code 00 hexadecimal
   \0    = short form for \x00
   \q    = a double quote "
   \\    = the backslash character \ itself
   \[    = the bracket open character [
   \]    = the bracket close character ]
   \*    = the literal star character *
   \?    = the literal question mark  ?
   \-    = to use literal "-" in a command
   Within multi line -bylist files:
   \     = slash+blank is changed to a single blank
   Only within "char of" or "byte not" lists:
   \(    = to use literal character "("
   \)    = to use literal character ")"

SFK expression options
   -showpart(s)  print /from/ part numbers, range statistics
                 and expansion priority points per part.
                 done automatically if a required /to/ text
                 is not given with a command.
   -showbest     if a /from/ pattern finds nothing, use this to
                 see how many parts would match so far, and with
                 up to how many bytes per part. anchors like [lstart]
                 may show a non zero length when matching (CR)LF.
   -showlist     with -bylist, show the internal joined list if
                 commands are spread across multiple lines.
   -showall      show all of the above.
   -xmaxlen=n    set default maximum length for chars or bytes commands,
                 e.g. -xmaxlen=10000 means /foo*bar/ matches with up to
                 10000 characters between foo and bar. the default max
                 length without this option is 4000 characters.

performance notes
 - always use a string literal, or single byte or char, at the start
   of your search expressions, like in /foo*bar/ starting with 'f'.
   Do not use a wildcard like * at the start like in /*foobar/
    when searching huge input data, as your search will slow down by
    a factor of 256. Use /[lstart]*foobar/ instead.
 - the system may cache output file(s), writing to disk in background
   after sfk has finished. subsequent batch commands may execute slower.

chaining support
   sfk extract output can be sent only to +xed or +xex.
   other commands require an xed conversion step like
   sfk extract ... +xed +view

aliases
   sfk xhexfind is the same as xfind -hex
   to extract unmodified binary data you may use either
   sfk xfind -pure ... -tofile or sfk extract ... -tofile

office file support
   sfk ofind        search in .xml text file contents of
                    office files like .docx .xlsx .ods .odt.
   sfk help office  for more information and options

see also
   --- open source commands ---
   sfk xfind     search  wildcard text in   plain text files
   sfk ofind     search  in office files    .docx .xlsx .ods
   sfk xfindbin  search  wildcard text in   text/binary files
   sfk xhexfind  search  in text/binary with hex dump output
   sfk extract   extract wildcard data from text/binary files
   sfk filter    filter  and edit text with simple wildcards
   sfk find      search  fixed    text in   text        files
   sfk findbin   search  fixed    text in   text/binary files
   sfk hexfind   search  fixed    text in        binary files
   sfk replace   replace fixed    text in   text/binary files
   --- freeware commands ---
   sfk view      GUI tool to search text as you type
   --- xe commercial commands ---
   sfk replace   replace fixed    text with high performance
   sfk xreplace  replace wildcard text in   text/binary files
   sfk help xe   about SFK XE and xreplace with SFK Expressions.

beware of Shell Command Characters.
   to find or replace text patterns containing spaces or special
   characters like <>|!&?* you must add quotes "" around parameters
   or the shell environment will destroy your command. for example,
   pattern /foo bar/other/ must be written like "/foo bar/other/".
   within a .bat or .cmd file the percent % must be escaped like %%
   even within quotes: sfk echo -spat "percent %% is a percent \x25"

web reference
   http://stahlworks.com/sfk-extract

about example numbers with [brackets]
   if you see [1] type "sfk cmd 1" for the whole command in one line.

bad examples with corrections
   if input text contains:
      bool bClFoo;
      bool bClBar   ;
   sfk xfind in.txt "/bool[xwhite]bCl*[xwhite];/"
      does NOT match "bool bClFoo;" because * eats the
      whole input line including ";" so no input is left
      for "[xwhite];" and the whole expression fails.
   sfk xfind in.txt "/bool[xwhite]bCl[* not ;][xwhite];/"
      matches both "bool bClFoo;" and "bool bClBar   ;".
      this means whenever your search fails to work, spell out
      in detail which characters (not) to collect where.
   sfk xex in.txt "/[lstart]foo/[lstart]goo/"
      there is no need to write an anchor like [lstart]
      within totext as it contains no data. use instead:
         sfk xex in.txt "/[lstart]foo/goo/"
   sfk xex in.txt "/foo[lend]bar/goo[part2]bar/"
      anchors like [lend] must be at start or end of fromtext
      and cannot be referenced within totext. use instead:
         sfk xex in.txt "/foo[eol]bar/goo[part2]bar/"

working examples
   sfk xfind -text "/class [bytes]{[bytes]}/[all]\n\n/"
    -dir mydir -file .hpp +tofile out.txt
      collect class definitions from mydir and write output
      indirectly (via command chaining) to out.txt [13]
   sfk extract in.txt -text "/foo*bar/"
      search in.txt for patterns starting with foo and ending
      with bar, in the same line, with up to 4000 characters in between.
   sfk extract in.txt -text "/foo*bar/" +xed +view
      same as above, but show the result in the Depeche View
      text browser tool for easy reading.
   sfk xhex -text "/foo[0.100000 bytes]bar/" -dir mydir
      search all text and binary files of mydir for patterns of
      foo and bar with 0 to 100000 bytes (including NULL, CR
      and LF) in between and print output as hex dump.
   sfk extract -text "/printf(**);/" -dir mydir -file .cpp
      find all printf statements in source code, including statements
      across multiple lines.
   sfk extract in.txt "/foo[0.100 chars of (a-z0-9_@ )]bar/"
      extracts from a single input file in.txt all phrases
      starting foo and ending bar, in the same line, with
      0 to 100 characters in between being alphanumeric or
      one of @ _ or a blank character.
   sfk sel mydir .txt +extract "/foo*bar/"
      extract foo*bar from all .txt files in mydir.
   sfk extract mydir "/\x66\x6f\x6f[0.100 bytes]\x62\x61\x72/" -tofile out.dat
      find binary data starting with bytes 0x66, 0x6f, 0x6f,
      ending with 0x62, 0x61, 0x72 and up to 100 bytes in between
      within all files of folder mydir, writing found data
      to a single file out.dat
   sfk extract -text "/class [bytes]{[bytes]}/[all]\n\n/"
    -tofile out.txt -dir mydir -file .hpp
      collect class definitions from mydir directly to out.txt [10]
   sfk extract -dir mydir -file .cpp -text "/printf([bytes]);/[all]\n/"
    +xed "/);[eol]/[all]/" "/[eol][1.* white]/ /"
      extract all (multi line) printf statements from source code,
      convert multi line to single line, stripping whitespace. [11]
      the "/);[eol]/[all]/" keeps all line endings after ");"
   sfk extract -text "/$version:vernum=*,*name=*,*os=*,/
    [file.name]: [part6] v[part2] for [part10]\n/"
    -tofile versions.txt -dir mydir -file .exe -nofilenames
      search all .exe files in mydir for a text block like
       $version:vernum=1.6.9,name=fooprog,os=windows
      then extract and reformat version information,
      writing results without :file headers to versions.txt [12]
   sfk extract in.zip "/PK\x05\x06[0.100 bytes]/"
      search characters 'P','K' then bytes 0x05 0x06
      and then up to 100 bytes in raw compressed data
      of a .zip file without extracting any contents.
   sfk extract -arc in.zip "/class*/"
      XE: find phrases starting with "class" in .zip contents
      XD: demo will search first 1000 bytes per .zip sub file