CountPattern Transform
The countpattern transform counts the number of occurrences of a text pattern, as defined by a regular expression. This transform will iterate through each data object and count all unique pattern matches found within the designated text field.
Both the pattern and stopwords parameters below are not “raw” regular expression patterns – they are embedded in a string. As a result, take care to make sure you use proper escape characters as needed. For example, to match digits, use "\\d"
, not "\d"
.
Transform Parameters
Property | Type | Description |
---|---|---|
field | Field | Required. The data field containing the text data. |
pattern | String | A string containing a well-formatted regular expression, defining a pattern to match in the text. All unique pattern matches will be separately counted. The default value is [\\w\']+ , which will match sequences containing word characters and apostrophes, but no other characters. |
case | String | A lower- or upper-case transformation to apply prior to pattern matching. One of lower , upper or mixed (the default). |
stopwords | String | A string containing a well-formatted regular expression, defining a pattern of text to ignore. For example, the value "(foo|bar|baz)" will treat the words "foo" , "bar" and "baz" as stopwords that should be ignored. The default value is the empty string ("" ), indicating no stop words. |
as | String[ ] | The output fields for the text pattern and occurrence count. The default is ["text", "count"] . |
Usage
This example counts the occurrences of each digit sequence in the comment field, except for the number 13.
{
"type": "countpattern",
"field": "comment",
"pattern": "\\d+",
"stopwords": "13"
}
Running the transform on this input data
[
{"comment": "between 12 and 12.43"},
{"comment": "43 minutes past 12 o'clock (and 13 seconds)"}
]
will produce the output
[
{"text": "12", "count": 3},
{"text": "43", "count": 2},
]