bdpar is a tool for easily building customized data flows that pre-process large volumes of information from different sources. To this end, bdpar allows users to (i) easily use existing functionalities and create new ones and (ii) develop new data source extractors according to their needs. Additionally, the package provides a predefined data flow that extracts and preprocesses the most relevant information (tokens, dates, ...) from several textual sources (SMS, email, tweets, YouTube comments).
The package has been developed using R6 classes to implement, for example, the extraction of input data and the 18 Pipes that the application provides by default. The following sections explain the different tools that make up the application and how the features it offers can be extended.
For more specific questions, consult the package help through ?bdpar.
To manage the information obtained from the input data, the package uses a structure called Instance, which stores the properties extracted by the different Pipes. The fields comprising the Instance class are described below.
The package implements four types of input data by default, handled by the ExtractorEml, ExtractorSms, ExtractorTwtid and ExtractorYtbid classes.
bdpar is able to load data from multiple sources. By default, it is fully compatible with email (eml), Twitter (twtid), SMS (tsms) and YouTube (ytbid) data sources. However, new data loaders can be easily added by implementing (i) a new class that implements the obtainSource and obtainDate abstract methods of the Instance class and (ii) a new subclass that overrides the createInstance method of the InstanceFactory class.
It is important to take into account that the type of Instance used is deduced by default from the file extension. However, this behaviour can be easily customized according to user needs.
This example shows how a new type of Instance (named ExtractorTytb) is created. In this particular case, ExtractorTytb is responsible for extracting textual comments from YouTube files (extension .tytb).
library(R6)
library(pipeR)   # provides the %>>% operator
library(readr)   # provides read_file()
library(bdpar)
ExtractorTytb <- R6Class(
  classname = "ExtractorTytb",
  inherit = Instance,
  public = list(
    initialize = function(path) {
      if (!"character" %in% class(path)) {
        stop("[ExtractorTytb][initialize][Error] ",
             "Checking the type of the variable: path ",
             class(path))
      }
      path %>>%
        super$initialize()
    },
    # Stores an empty date (this example does not extract one)
    obtainDate = function() {
      "" %>>%
        super$setDate()
      return()
    },
    # Reads the file content and stores it as both source and data
    obtainSource = function() {
      super$getPath() %>>%
        read_file() %>>%
        super$setSource()
      super$getSource() %>>%
        super$setData()
      return()
    }
  )
)
To execute the new Instance class (ExtractorTytb) automatically, it must be registered by (i) including the new class in the default InstanceFactory class or (ii) implementing a customized class that inherits from InstanceFactory. The example below shows how a new class (InstanceFactoryCustom) is implemented. As can be seen, this new class (which inherits from InstanceFactory) overrides the createInstance() method of the parent class and registers two extractors: ExtractorEml and ExtractorTytb.
library(R6)
library(tools)   # provides file_ext()
library(bdpar)
InstanceFactoryCustom <- R6Class(
  "InstanceFactoryCustom",
  inherit = InstanceFactory,
  public = list(
    initialize = function() {
    },
    createInstance = function(path) {
      if (!"character" %in% class(path)) {
        stop("[InstanceFactoryCustom][createInstance][Error] ",
             "Checking the type of the variable: path ",
             class(path))
      }
      # Select the extractor according to the file extension
      switch(file_ext(path),
             `eml` = return(ExtractorEml$new(path)),
             `tytb` = return(ExtractorTytb$new(path))
      )
      # Unsupported extensions yield NULL
      return()
    }
  )
)
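The extension-based dispatch used by createInstance() can be tried in isolation. The following is a minimal sketch in base R, assuming e-mail files use the .eml extension. The helper extractorNameFor is hypothetical (it is not part of bdpar) and returns only the name of the extractor class instead of building an object.

```r
library(tools)  # file_ext() ships with base R distributions

# Hypothetical helper: maps a file path to the name of the extractor class
# that a factory such as InstanceFactoryCustom would instantiate.
extractorNameFor <- function(path) {
  stopifnot(is.character(path), length(path) == 1)
  switch(file_ext(path),
         eml = "ExtractorEml",
         tytb = "ExtractorTytb",
         NULL)  # default branch: unsupported extension
}

emlName <- extractorNameFor("inbox/message.eml")
tytbName <- extractorNameFor("comments/video.tytb")
unknownName <- extractorNameFor("notes.txt")

print(emlName)
print(tytbName)
print(is.null(unknownName))
```

Returning NULL for unknown extensions mirrors the final return() of createInstance(), so the caller can decide how to handle unsupported files.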
A pipe is a simple task responsible for generating a new output by applying some transformations to the input data. A set of sequentially interconnected pipes that achieves a required result is called a pipeline. Pipes in bdpar are represented by the PipeGeneric class, while the pipelining process is defined through the TypePipe abstract class. It should be noted that a TypePipe should include a set of pipes (PipeGeneric class) representing the whole preprocessing flow.
An important feature added to the Pipe concept is the control of the functionalities that run before and after a Pipe, that is, control of the preprocessing flow. On the one hand, this control ensures that if a Pipe requires another Pipe to run first, that Pipe has already been executed. On the other hand, it prevents a Pipe from being executed later if it would interfere with the functionality of a previous Pipe.
This behaviour can be customized in each pipe that the user uses and/or develops, making it possible to decide what to do when the dependencies are not respected.
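The dependency control described above can be pictured with a small model. The sketch below is a simplified illustration in base R, not bdpar's implementation: newFlow, runPipe and the pipe names PipeA, PipeB and PipeC are hypothetical, and the argument names only echo the alwaysBeforeDeps and notAfterDeps parameters used by the pipes.

```r
# Simplified, hypothetical model of the dependency control (not bdpar's code).
# A flow records the pipes already executed and the pipes that are banned.
newFlow <- function() {
  list(executed = character(0), banned = character(0))
}

runPipe <- function(flow, name,
                    alwaysBeforeDeps = character(0),
                    notAfterDeps = character(0)) {
  # Every pipe listed in alwaysBeforeDeps must already have been executed
  missingDeps <- setdiff(alwaysBeforeDeps, flow$executed)
  if (length(missingDeps) > 0) {
    stop("Pipe ", name, " requires: ", paste(missingDeps, collapse = ", "))
  }
  # A pipe banned by a previously executed pipe cannot run
  if (name %in% flow$banned) {
    stop("Pipe ", name, " cannot run after a previously executed pipe")
  }
  flow$executed <- c(flow$executed, name)
  flow$banned <- union(flow$banned, notAfterDeps)
  flow
}

flow <- newFlow()
flow <- runPipe(flow, "PipeA")
flow <- runPipe(flow, "PipeB", alwaysBeforeDeps = "PipeA", notAfterDeps = "PipeC")

# PipeC was banned by PipeB, so this call fails and the error is caught
outcome <- tryCatch(runPipe(flow, "PipeC"),
                    error = function(e) conditionMessage(e))
print(outcome)
```

Here the failure is reported through stop(); a customized pipe could equally choose to warn or to skip the step, which is the kind of decision the text leaves to the user.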
The framework provides 18 different pipes (inherited from PipeGeneric). Each pipe falls into one of two categories: (i) basic-functionality pipes and (ii) external file access pipes.
File2Pipe: Obtains the source using the obtainSource method implemented by the corresponding subclass of Instance. By default, the implemented subclasses are ExtractorEml, ExtractorSms, ExtractorTwtid and ExtractorYtbid.
FindEmojiPipe: Creates a new emoji property that stores the emojis found in the data attribute. Optionally, the emojis found can also be removed from the data property of the Instance.
FindEmoticonPipe: Creates a new emoticon property that stores the emoticons found in the data attribute. Optionally, the emoticons found can also be removed from the data property of the Instance.
FindHashtagPipe: Creates a new hashtag property that stores the hashtags found in the data attribute. Optionally, the hashtags found can also be removed from the data property of the Instance.
FindUrlPipe: Creates a new URLs property that stores the URLs found in the data attribute. Optionally, the URLs found can also be removed from the data property of the Instance.
FindUserNamePipe: Creates a new userName property that stores the user names found in the data attribute. Optionally, the user names found can also be removed from the data property of the Instance.
GuessDatePipe: Obtains the date using the obtainDate method implemented by the corresponding subclass of Instance. By default, the implemented subclasses are ExtractorEml, ExtractorSms, ExtractorTwtid and ExtractorYtbid.
GuessLanguagePipe: Guesses the language using the language detector of the cld2 library. Creates the language property, which indicates the language of the text. Optionally, the language provided by Twitter can be used instead.
MeasureLengthPipe: Creates the length property, which indicates the length of the text. The name of the property can be customized through the class constructor.
StoreFileExtPipe: Creates the extension property, which indicates the extension of the file.
TargetAssigningPipe: Identifies the class of the Instance (target attribute) from the path of the file.
TeeCSVPipe: Generates a CSV file from the properties of the Instance.
ToLowerCasePipe: Converts the data of an Instance to lower case.
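The pipes that extract emojis, emoticons, hashtags, URLs and user names share a common pattern: the matches are stored as a property and, optionally, removed from the data. The sketch below illustrates that pattern for hashtags in base R. The function findHashtags, its removeHashtags argument and the regular expression are assumptions made for illustration and do not reproduce the pattern bdpar actually uses.

```r
# Hypothetical helper illustrating the find-and-optionally-remove pattern.
# The regular expression is a simplification, not the one used by bdpar.
findHashtags <- function(text, removeHashtags = TRUE) {
  pattern <- "#[[:alnum:]_]+"
  # Collect every match found in the text
  found <- regmatches(text, gregexpr(pattern, text))[[1]]
  if (removeHashtags) {
    # Drop the matches, then collapse the whitespace left behind
    text <- gsub(pattern, "", text)
    text <- trimws(gsub("[[:space:]]+", " ", text))
  }
  list(hashtag = found, data = text)
}

result <- findHashtags("Loving #rstats and #bdpar today")
print(result$hashtag)
print(result$data)

# With removeHashtags = FALSE the data is left untouched
untouched <- findHashtags("Loving #rstats today", removeHashtags = FALSE)
print(untouched$data)
```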
AbbreviationPipe: Creates a new abbreviation property that stores the abbreviations found in the data attribute. Optionally, the abbreviations found in the data property of the Instance can be replaced with their extended versions. The abbreviations and their substitutions have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.
ContractionPipe: Creates a new contraction property that stores the contractions found in the data attribute. Optionally, the contractions found in the data property of the Instance can be replaced with their extended versions. The contractions and their substitutions have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.
InterjectionPipe: Creates a new interjection property that stores the interjections found in the data attribute. Optionally, the interjections can be removed from the data property of the Instance. The interjections have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.
SlangPipe: Creates a new slang property that stores the slang words found in the data attribute. Optionally, the slang words found in the data property of the Instance can be replaced with their extended versions. The slang words and their substitutions have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.
StopWordPipe: Creates a new stopwords property that stores the stop words found in the data attribute. Optionally, the stop words can be removed from the data property of the Instance. The stop words have to be stored in JSON files, and the file associated with the language of the text is chosen. The loading and handling of these files is done through the ResourceHandler class.
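The dictionary-based pipes can be pictured as a lookup over a set of term and substitution pairs. The following base R sketch assumes the JSON resource has already been parsed into a named character vector. The dictionary contents and the function replaceAbbreviations are hypothetical and are not taken from bdpar or from its resource files.

```r
# Hypothetical stand-in for the parsed content of an English abbreviations file
abbreviations <- c("asap" = "as soon as possible",
                   "btw" = "by the way")

# Hypothetical helper: records the abbreviations found and, optionally,
# replaces them with their extended version.
replaceAbbreviations <- function(text, dictionary, replace = TRUE) {
  found <- character(0)
  for (abbr in names(dictionary)) {
    # Match the abbreviation as a whole word, ignoring case
    pattern <- paste0("\\b", abbr, "\\b")
    if (grepl(pattern, text, ignore.case = TRUE, perl = TRUE)) {
      found <- c(found, abbr)
      if (replace) {
        text <- gsub(pattern, dictionary[[abbr]], text,
                     ignore.case = TRUE, perl = TRUE)
      }
    }
  }
  list(abbreviation = found, data = text)
}

result <- replaceAbbreviations("btw, reply asap", abbreviations)
print(result$abbreviation)
print(result$data)
```

Keeping one dictionary per language, as the text describes, would only change which named vector is passed to the helper.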
Additionally, to improve the customization capabilities, bdpar makes it easy to design and develop new personalized pipes by inheriting from the PipeGeneric class and implementing its pipe method. The code below exemplifies the creation of a simple pipe in charge of removing multiple consecutive spaces from a string.
library(R6)
library(pipeR)    # provides the %>>% operator
library(stringr)
library(bdpar)
RemovesWhiteSpaces <- R6Class(
  "RemovesWhiteSpaces",
  inherit = PipeGeneric,
  public = list(
    initialize = function(propertyName = "",
                          alwaysBeforeDeps = list(),
                          notAfterDeps = list()) {
      if (!"character" %in% class(propertyName)) {
        stop("[RemovesWhiteSpaces][initialize][Error] ",
             "Checking the type of the variable: propertyName ",
             class(propertyName))
      }
      if (!"list" %in% class(alwaysBeforeDeps)) {
        stop("[RemovesWhiteSpaces][initialize][Error] ",
             "Checking the type of the variable: alwaysBeforeDeps ",
             class(alwaysBeforeDeps))
      }
      if (!"list" %in% class(notAfterDeps)) {
        stop("[RemovesWhiteSpaces][initialize][Error] ",
             "Checking the type of the variable: notAfterDeps ",
             class(notAfterDeps))
      }
      super$initialize(propertyName, alwaysBeforeDeps, notAfterDeps)
    },
    pipe = function(instance) {
      if (!"Instance" %in% class(instance)) {
        stop("[RemovesWhiteSpaces][pipe][Error] ",
             "Checking the type of the variable: instance ",
             class(instance))
      }
      # Register this pipe in the flow and check its dependencies
      instance$addFlowPipes("RemovesWhiteSpaces")
      if (!instance$checkCompatibility("RemovesWhiteSpaces",
                                       self$getAlwaysBeforeDeps())) {
        stop("[RemovesWhiteSpaces][pipe][Error] ",
             "Bad compatibility between Pipes.")
      }
      # Ban the pipes that must not run after this one
      instance$addBanPipes(unlist(super$getNotAfterDeps()))
      # Trim the text and collapse consecutive spaces
      instance$getData() %>>%
        stringr::str_trim() %>>%
        stringr::str_squish() %>>%
        instance$setData()
      return(instance)
    }
  )
)
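The text transformation performed inside the pipe can be checked without building an Instance. The sketch below is a base R approximation of the trimming and squishing step. The helper squishSpaces is hypothetical and is used here only to avoid depending on stringr.

```r
# Base R approximation of the trimming and squishing step (illustrative only):
# collapse every run of whitespace into one space, then trim both ends.
squishSpaces <- function(text) {
  trimws(gsub("[[:space:]]+", " ", text))
}

cleaned <- squishSpaces("  bdpar   removes \t extra   spaces  ")
print(cleaned)
```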
A flow of pipes is the set of pipes that comprises the whole preprocessing process. By default, bdpar provides a pipelining process (implemented in SerialPipes) comprising all 18 available pipes.
The code below shows a pipelining example comprising the 18 pipes (MeasureLengthPipe is used twice, before and after cleaning the text):
instance %>I%
  TargetAssigningPipe$new()$pipe() %>I%
  StoreFileExtPipe$new()$pipe() %>I%
  GuessDatePipe$new()$pipe() %>I%
  File2Pipe$new()$pipe() %>I%
  MeasureLengthPipe$new("length_before_cleaning_text")$pipe() %>I%
  FindUserNamePipe$new()$pipe() %>I%
  FindHashtagPipe$new()$pipe() %>I%
  FindUrlPipe$new()$pipe() %>I%
  FindEmoticonPipe$new()$pipe() %>I%
  FindEmojiPipe$new()$pipe() %>I%
  GuessLanguagePipe$new()$pipe() %>I%
  ContractionPipe$new()$pipe() %>I%
  AbbreviationPipe$new()$pipe() %>I%
  SlangPipe$new()$pipe() %>I%
  ToLowerCasePipe$new()$pipe() %>I%
  InterjectionPipe$new()$pipe() %>I%
  StopWordPipe$new()$pipe() %>I%
  MeasureLengthPipe$new("length_after_cleaning_text")$pipe() %>I%
  TeeCSVPipe$new()$pipe()
Additionally, in order to build a flexible framework, bdpar allows users to define their own customized flow of pipes. To accomplish this, it is necessary to create a new class that inherits from TypePipe and implements the pipeAll() method. The following example shows how a new flow of pipes (called TestPipe) is created:
library(R6)
library(bdpar)
TestPipe <- R6Class(
  "TestPipe",
  inherit = TypePipe,
  public = list(
    initialize = function() {
    },
    pipeAll = function(instance) {
      if (!"Instance" %in% class(instance)) {
        stop("[TestPipe][pipeAll][Error] ",
             "Checking the type of the variable: instance ",
             class(instance))
      }
      message("[TestPipe][pipeAll][Info] ", instance$getPath(), "\n")
      tryCatch(
        instance %>I%
          TargetAssigningPipe$new()$pipe() %>I%
          StoreFileExtPipe$new()$pipe() %>I%
          File2Pipe$new()$pipe() %>I%
          RemovesWhiteSpaces$new()$pipe() %>I%
          TeeCSVPipe$new()$pipe()
        ,
        # On error, report the problem and invalidate the Instance
        error = function(e) {
          message("[TestPipe][pipeAll][Error]",
                  instance$getPath(),
                  " :",
                  paste(e),
                  "\n")
          instance$invalidate()
        }
      )
      return(instance)
    }
  )
)
To manage the flow of pipes, bdpar provides a new operator called %>I%. This operator checks whether the Instance was invalidated in the latest Pipe. If the Instance is invalid, the flow of Pipes stops and the preprocessing passes to the next Instance. Otherwise, the Instance continues through the current flow of Pipes.
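The short-circuit behaviour of such an operator can be modelled in a few lines of base R. The sketch below is not bdpar's implementation of %>I%: the operator %thenIfValid%, the list-based instance with a valid flag, and the functions addOne and invalidate are all hypothetical names chosen for illustration.

```r
# Hypothetical operator modelling the short-circuit idea behind %>I%.
# The left-hand side is a list with a logical 'valid' element; the right-hand
# side is a function call that receives the instance as its first argument.
`%thenIfValid%` <- function(lhs, rhs) {
  caller <- parent.frame()
  rhsCall <- substitute(rhs)
  # An invalid instance skips the remaining steps (rhs is never evaluated)
  if (!isTRUE(lhs$valid)) {
    return(lhs)
  }
  # Rebuild the call with the instance inserted as the first argument
  extraArgs <- as.list(rhsCall)[-1]
  newCall <- as.call(c(list(rhsCall[[1]], lhs), extraArgs))
  eval(newCall, caller)
}

# Two toy steps: one transforms the data, the other invalidates the instance
addOne <- function(instance) {
  instance$data <- instance$data + 1
  instance
}
invalidate <- function(instance) {
  instance$valid <- FALSE
  instance
}

start <- list(data = 0, valid = TRUE)
result <- start %thenIfValid%
  addOne() %thenIfValid%
  invalidate() %thenIfValid%
  addOne()

print(result$data)   # the last addOne() was skipped
print(result$valid)
```

Because R evaluates arguments lazily, the step on the right-hand side is never run once the instance has been invalidated, which matches the behaviour described for the flow of Pipes.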
The configuration file stores the configuration parameters of the pipes used in the preprocessing. For example, it holds the keys used to work with the APIs that require them (such as YouTube or Twitter), as well as parameters that customize the behavior of the application, such as the text format to use for multipart emails (plain text or HTML). If a parameter is not needed, its value can be omitted. The structure of the configuration file is described in the package help (?bdpar). The tool has a default template that the user can modify through the parameter editConfigurationFile, in both simple and advanced mode.
The configuration file template (configurationsTemplate.ini) initially contains the following:
[twitter]
ConsumerKey = <<consumer_key>>
ConsumerSecret = <<consumer_secret>>
AccessToken = <<access_token>>
AccessTokenSecret = <<access_token_secret>>
[youtube]
app_id = <<app_id>>
app_password = <<app_password>>
[eml]
PartSelectedOnMPAlternative = <<part_selected>> (text/html or text/plain)
[resourcesPath]
resourcesAbbreviationsPath = <<resources_abbreviations_path>>
resourcesContractionsPath = <<resources_contractions_path>>
resourcesInterjectionsPath = <<resources_interjections_path>>
resourcesSlangsPath = <<resources_slangs_path>>
resourcesStopWordsPath = <<resources_stop_words_path>>
[CSVPath]
outPutTeeCSVPipePath = <<out_put_TeeCSVPipe_path>>
[cache]
cachePathTwtid = <<cache_path_twtid>>
cachePathYtbid = <<cache_path_ytbid>>
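To make the section and key layout of the template concrete, the following base R sketch parses INI-style lines into a nested list. It is only an illustration of the file structure: parseIni is a hypothetical helper, the example paths are invented, and the sketch does not show how bdpar itself reads the configuration file.

```r
# Hypothetical helper: parses INI-style lines into a nested list of
# sections and key/value pairs. Comment and blank lines are ignored.
parseIni <- function(lines) {
  config <- list()
  section <- NULL
  for (line in trimws(lines)) {
    if (line == "" || startsWith(line, ";") || startsWith(line, "#")) {
      next
    }
    if (grepl("^\\[.*\\]$", line)) {
      # A new section header such as [cache]
      section <- sub("^\\[(.*)\\]$", "\\1", line)
      config[[section]] <- list()
    } else if (!is.null(section) && grepl("=", line, fixed = TRUE)) {
      # Split on the first '=' and trim both sides
      key <- trimws(sub("=.*$", "", line))
      value <- trimws(sub("^[^=]*=", "", line))
      config[[section]][[key]] <- value
    }
  }
  config
}

# Example values are invented for illustration
exampleLines <- c("[CSVPath]",
                  "outPutTeeCSVPipePath = output/properties.csv",
                  "",
                  "[cache]",
                  "cachePathTwtid = cache/twtid",
                  "cachePathYtbid = cache/ytbid")

config <- parseIni(exampleLines)
print(names(config))
print(config$cache$cachePathTwtid)
```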
A development version of the bdpar package is also available on its GitHub page: github.com/miferreiro/bdpar