Data Analysis Tools that Facilitate Knowledge Discovery – New Data Processing Capabilities based on R

2019-10-06

Regular Expressions and Text Patterns

- Working with regular expressions (regex) is mostly a matter of pattern matching.
- The regex syntax that R supports is documented in the help page opened by “?regex”.
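To make the pattern-matching idea concrete, here is a minimal sketch using only base R functions. The job titles are made-up sample strings, and the patterns are illustrative choices rather than recommended ones.

```r
# Hypothetical sample strings, not taken from any real data set.
titles <- c("Research Associate", "Senior Research Scientist", "Office Manager")

# grepl() returns TRUE/FALSE for each element that matches the pattern.
is_research <- grepl("research", titles, ignore.case = TRUE)

# grep(value = TRUE) returns the matching elements themselves.
senior_titles <- grep("^Senior", titles, value = TRUE)

# sub() replaces the first match; "\\s+" is one or more whitespace characters.
underscored <- sub("\\s+", "_", titles)

# regmatches() with regexpr() extracts the first matched substring.
first_words <- regmatches(titles, regexpr("^[A-Za-z]+", titles))
```

The four calls cover the usual needs: test for a match, pull out matching elements, replace text, and extract the matched part.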

dplyr

The “grammar of data manipulation”: a very useful extension of base R functionality that rewards careful, skillful use. See the package documentation for details.

tibble

With a tibble you can filter rows, select columns, create columns (mutate), and arrange (order by) rows. The steps closely resemble the Excel GUI process and are consistent and easy to follow.
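The four operations can be sketched as one short pipeline. This assumes the dplyr and tibble packages are installed; the “staff” table and its column names are invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical workforce records, invented for illustration only.
staff <- tibble(
  name   = c("Ann", "Bob", "Cara", "Dev"),
  group  = c("Research", "Admin", "Research", "Research"),
  salary = c(72000, 51000, 88000, 64000)
)

result <- staff %>%
  filter(group == "Research") %>%    # keep matching rows, like an Excel filter
  select(name, salary) %>%           # keep only the columns of interest
  mutate(monthly = salary / 12) %>%  # create a new column from an existing one
  arrange(desc(salary))              # sort rows, like ORDER BY ... DESC
```

Each verb takes a table and returns a table, which is why the steps chain together so easily.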

R and Machine Learning

“Best Machine Learning Packages in R” (2016-06) introduces some standout basic features of popular R packages. The author names dplyr, ggplot2, and reshape2 as basic tools for data scientists. reshape2 mostly transforms a data set into the usual analytical format, a task that SAS has been addressing for decades; see the linked basic intro.
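As a rough illustration of that kind of transformation, the sketch below reshapes a small table from wide to long and back. It assumes the reshape2 package is installed; the table and its column names are hypothetical.

```r
library(reshape2)

# Hypothetical wide table: one row per employee, one salary column per year.
wide <- data.frame(
  id    = c(1, 2),
  y2017 = c(50000, 60000),
  y2018 = c(52000, 63000)
)

# melt() stacks the year columns into (id, year, salary) rows.
long <- melt(wide, id.vars = "id", variable.name = "year", value.name = "salary")

# dcast() spreads the long table back to one column per year.
back <- dcast(long, id ~ year, value.var = "salary")
```

The long layout is usually the one that modeling and plotting functions expect, while the wide layout is easier to read by eye.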

Practical Challenges in Workforce Analysis

Search multiple Excel files and export the rows that contain a particular regex (word-group pattern) in any variable; also create a column holding the name of the source Excel file, so the rows can be checked if correction or further study is needed.
Select a column of a tibble based on a regex pattern.
Construct a metadata database with the basic identifier and all essential analysis variables, or any variables that may be of interest for a particular modeling requirement. This may be done by getting the file names (fileNames <- Sys.glob("*.csv")), then processing each file by searching its column names with regex.
Challenges in creating these regexes:
- race status with spelling variations, such as “White” vs “W”
- keywords used in a job title, or in any classification similar to job title, such as job group or job function but not department, e.g., “Research associate”
- variable names that may indicate base salary, such as “Salary” or “Annual Salary”
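The tasks above can be sketched in base R as follows. Everything here is an assumption made for illustration: the two CSV files are generated in a temporary folder, and the column-name patterns, the helper names “pick_column” and “read_one”, and the “source_file” column are hypothetical. Reading Excel files directly would need an extra package such as readxl, which this sketch avoids by using CSV.

```r
# Build two small hypothetical CSV files so the sketch runs on its own.
demo_dir <- file.path(tempdir(), "workforce_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ID = 1:2, Race = c("White", "W"),
                     `Job Title` = c("Research associate", "Clerk"),
                     `Annual Salary` = c(50000, 40000), check.names = FALSE),
          file.path(demo_dir, "site_a.csv"), row.names = FALSE)
write.csv(data.frame(ID = 3:4, Ethnicity = c("Asian", "white"),
                     JobGroup = c("Admin", "Research Associate II"),
                     Salary = c(45000, 58000)),
          file.path(demo_dir, "site_b.csv"), row.names = FALSE)

fileNames <- Sys.glob(file.path(demo_dir, "*.csv"))

# Assumed patterns for variable names; tune them to the files at hand.
col_patterns <- c(
  id     = "^id$",
  race   = "race|ethnic",
  job    = "job.?(title|group|function)",
  salary = "salary"
)

# Return the first column whose name matches the pattern, or NA if none does.
pick_column <- function(df, pattern) {
  hit <- grep(pattern, names(df), ignore.case = TRUE)
  if (length(hit) == 0) return(rep(NA, nrow(df)))
  df[[hit[1]]]
}

# Read one file, keep the matched columns, and record where the rows came from.
read_one <- function(path) {
  df <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
  out <- as.data.frame(lapply(col_patterns, function(p) pick_column(df, p)),
                       stringsAsFactors = FALSE)
  out$source_file <- basename(path)
  out
}

meta <- do.call(rbind, lapply(fileNames, read_one))

# Treat "White", "white", and "W" as the same category.
meta$race_white <- grepl("^w(hite)?$", trimws(meta$race), ignore.case = TRUE)

# Flag rows whose job field mentions a research associate.
meta$is_research_assoc <- grepl("research\\s+associate", meta$job, ignore.case = TRUE)
```

Matching on column names first and on cell values second keeps the two kinds of regex separate, which makes each pattern easier to test and correct.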
Regex for
- E-mail Addresses
- Postal Codes
- Telephone Numbers
- Dates and Times
- Social Security Numbers: this may be helpful, though it is a little complicated
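A simplified sketch of such patterns in base R is shown below. The formats are assumptions: US-style ZIP codes and telephone numbers, ISO-style dates (YYYY-MM-DD), and the common NNN-NN-NNNN layout for Social Security Numbers. The patterns check only the shape of a value, not whether it is real or was ever issued, and the helper name “check” is hypothetical.

```r
# Illustrative patterns only; real data will likely need looser or stricter rules.
patterns <- c(
  email = "^[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}$",
  zip   = "^[0-9]{5}(-[0-9]{4})?$",
  phone = "^\\(?[0-9]{3}\\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}$",
  date  = "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$",
  ssn   = "^[0-9]{3}-[0-9]{2}-[0-9]{4}$"
)

# TRUE where the value has the expected shape for the given kind of field.
check <- function(kind, x) grepl(patterns[[kind]], x)

email_ok <- check("email", c("jane.doe@example.org", "jane.doe@example"))
zip_ok   <- check("zip",   c("20500", "20500-0003", "2050"))
phone_ok <- check("phone", c("(202) 555-0143", "555-0143"))
date_ok  <- check("date",  c("2019-10-06", "2019-13-01"))
ssn_ok   <- check("ssn",   c("123-45-6789", "123456789"))
```

Anchoring each pattern with “^” and “$” makes it validate the whole value; dropping the anchors turns the same pattern into a search for the field inside longer text.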

Author: ResearchTech

Research scientist interested in improving discovery productivity through better research method and organization design.