grep command to filter distinct values from XML tags

I have a ton of Oracle Forms XML export files and wanted to know which different patterns occur as values of the FormatMask XML attribute. The input looks as follows:

<Item Name="CREATION_DATE" UpdateAllowed="false" DirtyInfo="false" Visible="false" QueryAllowed="false" InsertAllowed="false" Comment="TABLE ALIAS&amp;#10;  FDA&amp;#10;&amp;#10;BASED ON TABLE&amp;#10;  TMI_FINANCIAL_DATA&amp;#10;&amp;#10;COLUMN USAGES&amp;#10;  ...    CREATION_DATE                 SEL&amp;#10;" ParentModule="OBJLIB1" Width="10" Required="false" ColumnName="CREATION_DATE" DataType="Date" ParentModuleType="25" Label="Creation Date" ParentType="15" ParentName="QMSSO$QUERY_ONLY_ITEM" MaximumLength="10" PersistentClientInfoLength="142" ParentFilename="tmiolb65_mla.olb" FormatMask="DD-MM-RRRR">

A naive grep command would print out the whole matching line, prefixed with the file name. After a few iterations I arrived at the following command, which does what I want in a single line.

grep -R -h -o -e 'FormatMask="[^"]*' * | sed 's/FormatMask="//g' | sort | uniq

What the command does is:

  • grep recursively (-R) for a regular expression (-e)
  • match FormatMask=" followed by every character up to (but not including) the next quotation mark
  • print only the matching part of the line (-o). This will include the prefix FormatMask="
  • print without the file name (-h)
  • strip off the prefix with sed
  • sort the results alphabetically
  • remove duplicate lines (uniq)
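To try the steps above without real export files, here is a minimal sketch. The scratch directory and the file names a.xml and b.xml are made up for illustration, and the sample lines are shortened stand-ins for real Item elements. The pipeline is the same as above, except that the pattern is single-quoted and the directory to search is passed explicitly.

```shell
# Hypothetical sample data: two tiny files in a scratch directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/forms/sub"
printf '%s\n' \
  '<Item Name="A" DataType="Date" FormatMask="DD-MM-RRRR">' \
  '<Item Name="B" DataType="Number" FormatMask="0999">' \
  > "$workdir/forms/a.xml"
printf '%s\n' \
  '<Item Name="C" DataType="Date" FormatMask="DD-MM-RRRR">' \
  '<Item Name="D" Width="10">' \
  > "$workdir/forms/sub/b.xml"

# -R: recurse, -h: no file names, -o: only the matching part, -e: the pattern.
# sed strips the FormatMask=" prefix; sort | uniq removes the duplicates.
result=$(grep -R -h -o -e 'FormatMask="[^"]*' "$workdir/forms" \
  | sed 's/FormatMask="//g' | sort | uniq)
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$workdir"
```

With this sample data the duplicate DD-MM-RRRR collapses into a single line, and the item without a FormatMask attribute contributes nothing.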

The result (excerpt) is:

00
09
099
0999
0999999
09999999
0D0
0D999
0D9999
9
90
90D0
90D000
...