About

Raw data cannot always be presented in text as is. Numbers might need some formatting, names need cleaning or normalization. With Data Enrichment component, a set of data transformation functions can be defined. They will modify the raw data before sending it to the text generation component.

Let's say we have accounting data like this

Account CurrentPeriod (Q2) PriorPeriod (Q1)
Gross Sales (ID1220) 90447 82018
Advertising (ID3011) 1280 1982

When generating text we do not want ID part in the account name and we want the amounts in periods rounded to thousands using bite size formatting plus "USD" is needed at the end. We want something like this:

Account CurrentPeriod (Q2) PriorPeriod (Q1)
Gross Sales around 90k USD around 82k USD
Advertising around 1k USD around 2k USD

Defining transformations

Accelerated Text stores data transformation rules in the api/resources/config/enrich.edn file. There might be separate transformation rules for different data types, this is controlled via filename-pattern parameter. Which fields (columns) have to receive which changes is specified under fields parameter. Fields in turn is a collection of per field configurations. Configuration file structure is as follows:

  • file-pattern - regular expression defining the file name for which this config will be active
  • fields - a collection of field configurations
    • name-pattern - regex defining column name for this data type
    • transformations - a collection of functions performing transformations
      • function - any function which can transform the data (see bellow for the required parameter list for such function)
      • args - a map of arguments for the transformation function

Note that -pattern fields will take on regular expressions, but their patternless versions will be used for exact match. Thus file-pattern can be replace with filename and name-pattern with name.

Configuration example

The following example configuration does the transformations outlined above. This has to be placed in api/resources/config/enrich.edn file for the transformations to take the effect.

[{:filename "accounts.csv"
  :fields
  [{:name "Account"
    :transformations
    [{:function :api.nlg.enrich.data.transformations/cleanup
      :args     {:regex #regex" \\(.*?\\)" :replacement ""}}]}
   {:name-pattern #regex".*Period .*"
    :transformations
    [{:function :api.nlg.enrich.data.transformations/number-approximation
      :args     {:scale      1000
                 :language   :en
                 :formatting :numberwords.domain/bites
                 :relation   :numberwords.domain/around}}
     {:function :api.nlg.enrich.data.transformations/add-symbol
      :args     {:symbol " USD" :position :back}}]}
   {:name "Increase"
    :transformations
    [{:function :api.nlg.enrich.data.transformations/add-symbol
      :args     {:symbol "$" :position :front :skip #{\- \+}}}]}]}]

Transformation functions

Any custom transformation function can be used as long as it conforms to this specification:

  • its first parameter is the value from the data cell
  • its second parameter is a map as specified in args configuration section
  • it returns a modified cell value as string

Accelerated text provides a few transformation functions in its api.nlg.enrich.data.transformations namespace:

  • number-approximation - Using Number Words package turn a number to its numeric expression
  • add-symbol - Add extra symbol to the front or the back of the value. Useful to add measurements or currency symbols
  • cleanup - Cleanup the string using clojure.string/replace
  • reformat-date - Change the date formatting. If your data is in YYYY-MM-dd and you want to go to YYYY/dd/MM, use this function and specify correspoding formats in input-format and output-format arguments.