A Brief Introduction to bdpar

Miguel Ferreiro Diaz

2020-01-08

Abstract

bdpar is a tool for easily building customized data flows to preprocess large volumes of information from different sources. To this end, bdpar allows users to (i) easily use and create new functionalities and (ii) develop new data source extractors according to their needs. Additionally, the package provides a predefined data flow to extract and preprocess the most relevant information (tokens, dates, …) from several textual sources (SMS, email, tweets, YouTube comments).

Introduction and basics

The package has been developed using R6 classes to implement, for example, the extraction of input data and the 18 Pipes available in the application by default. The following sections explain the different tools that make up the application and how the features it offers can be extended.

For more specific questions, consult the package help through ?bdpar.

Instance

To manage the information obtained from the input data, the package uses a structure called Instance, which stores the properties extracted by the different Pipes. The fields comprising the Instance class are described below.

  • source: The input data without modifications.
  • date: The date on which the source was generated or sent.
  • data: The input data with modifications.
  • properties: Contains a list of properties extracted from the data that is being processed.
  • path: Identifier of the input data.
  • isValid: Indicates if the Instance is valid or not.
  • flowPipes: The list of Pipes that the Instance has passed through.
  • banPipes: The list of Pipes that can no longer be executed from that moment on.
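The fields listed above can be pictured with a small R6 sketch. It is only an illustration of the data structure: the class name InstanceSketch and the addProperty helper are hypothetical and do not reproduce the package's actual Instance class.

```r
library(R6)

# Hypothetical stand-in for illustration; not bdpar's actual Instance class.
InstanceSketch <- R6Class(
  "InstanceSketch",
  public = list(
    path = NULL,        # identifier of the input data
    source = NULL,      # input data without modifications
    date = NULL,        # date on which the source was generated or sent
    data = NULL,        # input data with modifications
    properties = NULL,  # named list of extracted properties
    isValid = TRUE,     # whether the Instance is still valid
    flowPipes = NULL,   # names of the Pipes already executed
    banPipes = NULL,    # names of the Pipes that may no longer run

    initialize = function(path) {
      self$path <- path
      self$properties <- list()
      self$flowPipes <- character(0)
      self$banPipes <- character(0)
    },

    # Store one extracted property under the given name (hypothetical helper).
    addProperty = function(name, value) {
      self$properties[[name]] <- value
      invisible(self)
    }
  )
)

instance <- InstanceSketch$new("example.tsms")
instance$source <- "Hello #world"
instance$data <- tolower(instance$source)
instance$addProperty("length", nchar(instance$data))
```

Keeping source untouched while Pipes rewrite data is what allows later steps to compare the processed text with the original input.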

Types of input data available by default

The package implements four types of input data by default. bdpar is able to load data from multiple sources: by default, it is fully compatible with email (eml), SMS (tsms), Twitter (twtid) and YouTube (ytbid) data sources. New data loaders can be easily added by implementing (i) a new class that implements the obtainSource and obtainDate abstract methods of the Instance class and (ii) a new subclass that overrides the createInstance method of the InstanceFactory class.
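As a rough sketch of step (i), a new extractor could look like the code below. To keep the example runnable without the package, InstanceBase is a simplified stand-in for the Instance class. The ExtractorTytb behaviour shown here (reading a plain-text file and using its modification time as the date) is an assumption made for illustration only.

```r
library(R6)

# Simplified stand-in for bdpar's abstract Instance class (illustration only).
InstanceBase <- R6Class(
  "InstanceBase",
  public = list(
    path = NULL,
    source = NULL,
    data = NULL,
    date = NULL,
    initialize = function(path) {
      self$path <- path
    },
    obtainSource = function() stop("obtainSource() must be implemented by a subclass"),
    obtainDate = function() stop("obtainDate() must be implemented by a subclass")
  )
)

# Hypothetical extractor for plain-text files with a .tytb extension.
ExtractorTytb <- R6Class(
  "ExtractorTytb",
  inherit = InstanceBase,
  public = list(
    obtainSource = function() {
      # Assumption: the file content is the source; data starts as a copy of it.
      self$source <- paste(readLines(self$path, warn = FALSE), collapse = "\n")
      self$data <- self$source
      invisible(self)
    },
    obtainDate = function() {
      # Assumption: the file modification time stands in for the date.
      self$date <- format(file.mtime(self$path), "%Y-%m-%d")
      invisible(self)
    }
  )
)

tmp <- tempfile(fileext = ".tytb")
writeLines("Nice video!", tmp)
extractor <- ExtractorTytb$new(tmp)
extractor$obtainSource()
extractor$obtainDate()
```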

It is important to take into account that the type of Instance used is deduced by default from the file extension. However, this behaviour can be easily customized according to user needs.

Enabling a new Instance

In order to automatically execute the new Instance class (ExtractorTytb), it must be registered by (i) including the new class in the default InstanceFactory class or (ii) implementing a customized class that inherits from InstanceFactory. In the second case, a new class (for example, InstanceFactoryCustom) inherits from InstanceFactory, overrides the createInstance() method of the parent class and registers the extractors to be used, for example ExtractorEml and ExtractorTytb.
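A minimal sketch of such a customized factory might look as follows. So that it runs without bdpar installed, InstanceFactory and the two extractors are replaced by tiny stand-in classes. Choosing the extractor with tools::file_ext and switch is an assumption about how the extension-based selection could be written, not a copy of the package code.

```r
library(R6)

# Minimal stand-ins so the sketch runs without bdpar (illustration only).
ExtractorStub <- R6Class(
  "ExtractorStub",
  public = list(
    path = NULL,
    initialize = function(path) {
      self$path <- path
    }
  )
)
ExtractorEml <- R6Class("ExtractorEml", inherit = ExtractorStub)
ExtractorTytb <- R6Class("ExtractorTytb", inherit = ExtractorStub)
InstanceFactory <- R6Class(
  "InstanceFactory",
  public = list(createInstance = function(path) NULL)
)

InstanceFactoryCustom <- R6Class(
  "InstanceFactoryCustom",
  inherit = InstanceFactory,
  public = list(
    # Choose the extractor from the file extension; NULL if unsupported.
    createInstance = function(path) {
      switch(tools::file_ext(path),
             eml = ExtractorEml$new(path),
             tytb = ExtractorTytb$new(path),
             NULL)
    }
  )
)

factory <- InstanceFactoryCustom$new()
instance <- factory$createInstance("comments/video1.tytb")
class(instance)[1]
```

Returning NULL for unknown extensions leaves it to the caller to decide whether unsupported files are skipped or reported.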

Pipe

A pipe is a simple task responsible for generating a new output by applying some transformations to the input data. A set of sequentially interconnected pipes that achieves a required result is called a pipeline. Pipes in bdpar are represented by the PipeGeneric class, while the pipelining process is defined through the TypePipe abstract class. Note that a TypePipe should include a set of pipes (PipeGeneric objects) representing the whole preprocessing flow.

Dependencies

An important feature added to the Pipe concept is control over the functionalities that run before and after a Pipe, that is, control of the preprocessing flow. On the one hand, this control ensures that if a Pipe requires another Pipe to run first, that Pipe has already been executed. On the other hand, it prevents a Pipe from being executed later if it would interfere with the functionality of a previous Pipe.

This functionality can be customized in each of the pipes that the user uses and/or develops, making it possible to decide what to do when the dependencies are not respected.
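The two kinds of dependency can be sketched with plain R vectors that play the role of the flowPipes and banPipes fields. The function names, argument names and the pipe names PipeA to PipeD are hypothetical, and the sketch simply answers TRUE or FALSE instead of deciding how a violation should be handled.

```r
# Hypothetical helpers for illustration; names are not bdpar's API.
# flow_pipes: Pipes already executed; ban_pipes: Pipes that may no longer run.
can_run_pipe <- function(pipe_name, always_before, flow_pipes, ban_pipes) {
  missing_deps <- setdiff(always_before, flow_pipes)
  length(missing_deps) == 0 && !(pipe_name %in% ban_pipes)
}

# After a Pipe runs, record it and ban the Pipes it declares incompatible.
register_pipe <- function(state, pipe_name, not_after = character(0)) {
  state$flow_pipes <- c(state$flow_pipes, pipe_name)
  state$ban_pipes <- union(state$ban_pipes, not_after)
  state
}

state <- list(flow_pipes = character(0), ban_pipes = character(0))
state <- register_pipe(state, "PipeA")
state <- register_pipe(state, "PipeB", not_after = "PipeC")

# PipeD needs PipeB first: allowed, because PipeB has already run.
ok_d <- can_run_pipe("PipeD", "PipeB", state$flow_pipes, state$ban_pipes)
# PipeC was banned by PipeB: not allowed.
ok_c <- can_run_pipe("PipeC", character(0), state$flow_pipes, state$ban_pipes)
# PipeE needs PipeZ, which never ran: not allowed.
ok_e <- can_run_pipe("PipeE", "PipeZ", state$flow_pipes, state$ban_pipes)
```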

Pipes available by default

The framework provides 18 different pipes (inherited from PipeGeneric). Each pipe falls into one of two categories: (i) basic-functionality pipes and (ii) external file access pipes.

(i) Pipes of basic functionality

File2Pipe

Obtains the source using the obtainSource method implemented by the corresponding subclass of Instance. By default, the implemented subclasses are ExtractorEml, ExtractorSms, ExtractorTwtid and ExtractorYtbid.

FindEmojiPipe

Creates a new emoji property that stores the emojis found in the data attribute. Optionally, the data property of the Instance can be modified by removing the emojis found.

FindEmoticonPipe

Creates a new emoticon property that stores the emoticons found in the data attribute. Optionally, the data property of the Instance can be modified by removing the emoticons found.

FindHashtagPipe

Creates a new hashtag property that stores the hashtags found in the data attribute. Optionally, the data property of the Instance can be modified by removing the hashtags found.
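The find-and-optionally-remove behaviour shared by the Find pipes can be sketched in base R. The helper below is illustrative and is not the package's implementation. In particular, the pattern assumes that a hashtag is "#" followed by letters, digits or underscores, which may differ from the rule bdpar applies.

```r
# Illustrative helper, not bdpar's implementation.
# Assumption: a hashtag is "#" followed by letters, digits or underscores.
find_hashtags <- function(data, remove = TRUE) {
  pattern <- "#[[:alnum:]_]+"
  hashtags <- regmatches(data, gregexpr(pattern, data))[[1]]
  if (remove) {
    data <- gsub(pattern, "", data)
  }
  list(data = data, hashtag = hashtags)
}

text <- "Loving #rstats and #bdpar today"
removed <- find_hashtags(text)
kept <- find_hashtags(text, remove = FALSE)
removed$hashtag
```

Removing matches leaves the surrounding spaces in place, so a later step may be needed to collapse repeated spaces.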

FindUrlPipe

Creates a new URLs property that stores the URLs found in the data attribute. Optionally, the data property of the Instance can be modified by removing the URLs found.

FindUserNamePipe

Creates a new userName property that stores the user names found in the data attribute. Optionally, the data property of the Instance can be modified by removing the user names found.

GuessDatePipe

Obtains the date using the obtainDate method implemented by the corresponding subclass of Instance. By default, the implemented subclasses are ExtractorEml, ExtractorSms, ExtractorTwtid and ExtractorYtbid.

GuessLanguagePipe

Guesses the language using the language detector of the cld2 library and creates the language property, which indicates the language of the text. Optionally, the language provided by Twitter can be used instead.

MeasureLengthPipe

Creates the length property, which indicates the length of the text. The name of the property can be customized through the class constructor.

StoreFileExtPipe

Creates the extension property, which indicates the file's extension.

TargetAssigningPipe

Identifies the class of the Instance (target attribute) based on the path of the file.

TeeCSVPipe

Generates a CSV from the properties of the Instance.

ToLowerCasePipe

Converts the data of an Instance to lower case.

(ii) Pipes that access external files

AbbreviationPipe

Creates a new abbreviation property that stores the abbreviations found in the data attribute. Optionally, the data property of the Instance can be modified by replacing the abbreviations found with their extended versions. The abbreviations and their substitutions have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.
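The replacement step can be sketched in base R as shown below. A named character vector stands in for the content of a parsed JSON resource, and its entries are made up for the example. The helper name is hypothetical, and the whole-word, case-insensitive matching is an assumption rather than the package's documented behaviour.

```r
# Illustrative only: a named character vector stands in for the parsed JSON
# resource (abbreviation -> extended version). The entries are made up.
abbreviations <- c("asap" = "as soon as possible", "btw" = "by the way")

# Hypothetical helper. Assumes abbreviations contain no regex metacharacters.
replace_abbreviations <- function(data, dictionary, replace = TRUE) {
  found <- character(0)
  for (abbr in names(dictionary)) {
    # Match the abbreviation as a whole word, ignoring case.
    pattern <- paste0("\\b", abbr, "\\b")
    if (grepl(pattern, data, ignore.case = TRUE, perl = TRUE)) {
      found <- c(found, abbr)
      if (replace) {
        data <- gsub(pattern, dictionary[[abbr]], data,
                     ignore.case = TRUE, perl = TRUE)
      }
    }
  }
  list(data = data, abbreviation = found)
}

result <- replace_abbreviations("btw, call me asap", abbreviations)
result$data
```

The same shape applies to the other resource-based pipes: only the dictionary changes, and some of them remove the matches instead of replacing them.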

ContractionPipe

Creates a new contraction property that stores the contractions found in the data attribute. Optionally, the data property of the Instance can be modified by replacing the contractions found with their extended versions. The contractions and their substitutions have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.

InterjectionPipe

Creates a new interjection property that stores the interjections found in the data attribute. Optionally, the data property of the Instance can be modified by removing the interjections found. The interjections have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.

SlangPipe

Creates a new slang property that stores the slang words found in the data attribute. Optionally, the data property of the Instance can be modified by replacing the slang words found with their extended versions. The slang words and their substitutions have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.

StopWordPipe

Creates a new stopwords property that stores the stop words found in the data attribute. Optionally, the data property of the Instance can be modified by removing the stop words found. The stop words have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.

How to create your customized Pipe

Additionally, to improve customization, bdpar makes it easy to design and develop new personalized pipes by implementing the pipe method of the PipeGeneric class (inheritance relation). For example, a simple pipe could be in charge of removing multiple consecutive spaces from a string.
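A minimal sketch of a pipe that removes multiple consecutive spaces is given here. PipeGeneric is replaced by a simplified stand-in so that the code runs on its own. The real class has its own constructor arguments and dependency handling, which are not reproduced. The class name RemoveExtraSpacesPipe is hypothetical, and an environment plays the role of the Instance.

```r
library(R6)

# Simplified stand-in for bdpar's PipeGeneric (illustration only).
PipeGeneric <- R6Class(
  "PipeGeneric",
  public = list(
    pipe = function(instance) stop("pipe() must be implemented by a subclass")
  )
)

# Hypothetical custom pipe: collapses runs of spaces into a single space.
RemoveExtraSpacesPipe <- R6Class(
  "RemoveExtraSpacesPipe",
  inherit = PipeGeneric,
  public = list(
    pipe = function(instance) {
      instance$data <- gsub(" {2,}", " ", instance$data)
      instance
    }
  )
)

# An environment stands in for an Instance, so the change is made in place.
instance <- new.env()
instance$data <- "too   many    spaces"
RemoveExtraSpacesPipe$new()$pipe(instance)
instance$data
```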


Flow of Pipes (pipelining process)

A flow of pipes is the set of pipes comprising the whole preprocessing process. bdpar provides a default pipelining process (implemented in SerialPipes) comprising all 18 available pipes.

Operator

To manage the flow of pipes, bdpar provides a new operator called %>I%. This operator checks whether the Instance was invalidated by the latest Pipe. If the Instance is invalid, the flow of Pipes stops and preprocessing moves on to the next Instance. Otherwise, the Instance continues through the current flow of Pipes.
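The idea behind the operator can be sketched with a few lines of base R. The definition below is a stand-in written for this illustration and is not the package's implementation. It assumes that the right-hand side is a plain function and that the Instance is a list with an isValid element. The three step functions are hypothetical.

```r
# Stand-in definition for illustration; not bdpar's implementation.
# Assumption: the right-hand side is a function that takes and returns an
# Instance-like list with an isValid element.
`%>I%` <- function(instance, pipe_fun) {
  if (!isTRUE(instance$isValid)) {
    return(instance)  # invalid: the remaining steps are skipped
  }
  pipe_fun(instance)
}

# Hypothetical steps used only to exercise the operator.
to_lower <- function(instance) {
  instance$data <- tolower(instance$data)
  instance
}
reject_empty <- function(instance) {
  if (!nzchar(instance$data)) instance$isValid <- FALSE
  instance
}
measure_length <- function(instance) {
  instance$length <- nchar(instance$data)
  instance
}

valid <- list(data = "Hello", isValid = TRUE) %>I%
  to_lower %>I% reject_empty %>I% measure_length
invalid <- list(data = "", isValid = TRUE) %>I%
  to_lower %>I% reject_empty %>I% measure_length
```

In the second chain, reject_empty invalidates the Instance, so measure_length never runs and no length is stored.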

Configuration file

The configuration file stores the configuration parameters of the pipes used in the preprocessing. For example, it holds the keys required to work with APIs such as YouTube or Twitter, as well as parameters that customize the behavior of the application, such as the text format to use for multipart emails (plain text or HTML). Parameters that are not needed can be left without a value. The structure of the configuration file is described in the package help (?bdpar). The tool includes a default template that the user can modify through the editConfigurationFile parameter, in both simple and advanced mode.

The configuration file template (configurationsTemplate.ini) initially looks as follows:

[twitter] 
ConsumerKey = <<consumer_key>>
ConsumerSecret = <<consumer_secret>>
AccessToken = <<access_token>>
AccessTokenSecret = <<access_token_secret>>

[youtube] 
app_id = <<app_id>>
app_password = <<app_password>>

[eml] 
PartSelectedOnMPAlternative = <<part_selected>> (text/html or text/plain)

[resourcesPath]
resourcesAbbreviationsPath = <<resources_abbreviations_path>>
resourcesContractionsPath = <<resources_contractions_path>>
resourcesInterjectionsPath = <<resources_interjections_path>>
resourcesSlangsPath = <<resources_slangs_path>>
resourcesStopWordsPath = <<resources_stop_words_path>>
 
[CSVPath]
outPutTeeCSVPipePath = <<out_put_TeeCSVPipe_path>>

[cache] 
cachePathTwtid = <<cache_path_twtid>>
cachePathYtbid = <<cache_path_ytbid>>
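To make the section and key layout of the template concrete, the following base R sketch reads lines in this format into a nested list. It is a deliberately minimal reader written for illustration (no comments, quoting or validation) and is not how bdpar itself loads the file. The function name and the example values are made up.

```r
# Minimal INI reader for illustration; ignores comments and quoting rules.
read_ini_sketch <- function(lines) {
  config <- list()
  section <- NULL
  for (line in trimws(lines)) {
    if (!nzchar(line)) next
    if (grepl("^\\[.*\\]$", line)) {
      # A "[name]" line opens a new section.
      section <- sub("^\\[(.*)\\]$", "\\1", line)
      config[[section]] <- list()
    } else if (!is.null(section) && grepl("=", line, fixed = TRUE)) {
      # A "key = value" line is stored under the current section.
      key <- trimws(sub("=.*$", "", line))
      value <- trimws(sub("^[^=]*=", "", line))
      config[[section]][[key]] <- value
    }
  }
  config
}

# Example values are made up for the sketch.
example <- c("[eml]",
             "PartSelectedOnMPAlternative = text/plain",
             "",
             "[CSVPath]",
             "outPutTeeCSVPipePath = output.csv")
config <- read_ini_sketch(example)
config$eml$PartSelectedOnMPAlternative
```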

Development

The development version of the bdpar package is available on GitHub: github.com/miferreiro/bdpar