2019-10-06
Regular Expression and Text Pattern
- Working with regular expressions (regex) is mostly pattern matching.
- The regex syntax that R supports is documented in R's own help page, opened with “?regex”.
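As a quick illustration of pattern matching in base R, here is a minimal sketch. The job titles are made-up sample data, and the three functions shown (grepl, sub, regmatches with regexpr) are only a small part of what “?regex” and “?grep” describe.

```r
# Base R only; no packages needed. The titles are invented sample data.
titles <- c("Research Associate", "Senior Analyst", "research assistant")

# grepl(): TRUE/FALSE for each element, here "starts with 'research'".
is_research <- grepl("^research", titles, ignore.case = TRUE)

# sub(): replace the first match in each element.
shortened <- sub("^Senior[[:space:]]+", "Sr. ", titles)

# regexpr() locates the first match; regmatches() extracts it.
first_words <- regmatches(titles, regexpr("^[A-Za-z]+", titles))

print(is_research)
print(shortened)
print(first_words)
```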
dplyr
dplyr, the “grammar of data manipulation”, is a very useful extension of base R functionality that rewards careful, skillful use; see the package documentation.
tibble
- Filter rows, select columns, create columns (mutate), and arrange rows (order by); the workflow closely resembles the Excel GUI process and is consistent and easy to follow.
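The four verbs can be chained in one pipeline. This is a minimal sketch, assuming the dplyr package is installed; the staff table and its column names are invented for illustration.

```r
library(dplyr)

# Invented sample data; tibble() is re-exported by dplyr.
staff <- tibble(
  name   = c("Ann", "Bob", "Cy", "Dee"),
  dept   = c("Lab", "Lab", "Admin", "Admin"),
  salary = c(52000, 61000, 48000, 75000)
)

result <- staff %>%
  filter(salary > 50000) %>%          # keep rows, like an Excel filter
  mutate(monthly = salary / 12) %>%   # create a new column
  select(name, dept, monthly) %>%     # keep only some columns
  arrange(desc(monthly))              # sort rows, highest first

print(result)
```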
R and Machine Learning
- Best Machine Learning Packages in R (2016-06) introduces some standout features of popular R packages. The author names dplyr, ggplot2, and reshape2 as basic tools for data scientists. reshape2 mostly transforms a data set into the usual analytical format, something SAS has been addressing for decades; see here for a basic intro.
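To make the reshape2 transformation concrete, here is a small sketch, assuming the reshape2 package is installed. The data frame and its column names are made up; melt() turns a wide table into a long one and dcast() turns it back.

```r
library(reshape2)

# Invented wide-format data: one salary column per year.
wide <- data.frame(
  id          = c(1, 2),
  salary_2018 = c(50000, 60000),
  salary_2019 = c(52000, 63000)
)

# Wide to long: one row per id and year.
long <- melt(wide, id.vars = "id",
             variable.name = "year", value.name = "salary")

# Long back to wide: one column per year again.
wide_again <- dcast(long, id ~ year, value.var = "salary")

print(long)
print(wide_again)
```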
Practical Challenges in Workforce Analysis
- Search multiple Excel files and export the rows where any variable matches a particular regex (a word-group pattern); also add a column holding the name of the source Excel file, so that rows can be traced back when correction or further study is needed.
- Select a column of a tibble based on a regex pattern
- Construct a metadata base with the basic identifier and all essential analysis variables, or any other variables that may be of interest for a particular modeling requirement. This may be done by getting the file names (fileNames <- Sys.glob("*.csv")), then processing each file by searching its column names with a regex
- Challenges in creating these regexes:
- race status with variations of spelling, such as “White” vs “W”
- keywords used in a job title, or in any classification similar to a job title, such as job group or job function, but not department, e.g., “Research associate”
- variable names that may indicate base salary, such as “Salary”, “Annual Salary”
- Regex for
- E-mail Addresses
- Postal Codes
- Telephone Numbers
- Dates and Times
- Social Security Numbers: this may be helpful, though it is a little complicated
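The Excel search challenge above could be approached roughly as follows. This is a sketch only: rows_matching and search_excel_files are hypothetical helper names, the readxl package is assumed to be installed, and only the first sheet of each workbook is read.

```r
# Keep the rows of a data frame where ANY column matches `pattern`.
# Hypothetical helper; matching is case-insensitive.
rows_matching <- function(df, pattern) {
  # One logical vector per column: does the cell, read as text, match?
  hits <- lapply(df, function(col) {
    grepl(pattern, as.character(col), ignore.case = TRUE)
  })
  keep <- Reduce(`|`, hits)  # TRUE if any column matched in that row
  df[keep, , drop = FALSE]
}

# Read the first sheet of each workbook, keep matching rows, and tag
# them with the file they came from. Assumes readxl is installed.
search_excel_files <- function(paths, pattern) {
  found <- lapply(paths, function(path) {
    sheet <- as.data.frame(readxl::read_excel(path))
    matched <- rows_matching(sheet, pattern)
    matched$source_file <- rep(basename(path), nrow(matched))
    matched
  })
  names(found) <- basename(paths)
  found
}

# Small in-memory demonstration with invented data.
demo <- data.frame(
  title = c("Research Associate", "Clerk", "Analyst"),
  group = c("Lab", "Research support", "Finance"),
  stringsAsFactors = FALSE
)
demo_hits <- rows_matching(demo, "research")
print(demo_hits)

# Possible use on real files:
# results <- search_excel_files(Sys.glob("*.xlsx"), "research")
```

The results come back as one data frame per file rather than one combined table, since workbooks may not share the same columns; each piece can be exported with write.csv.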
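Selecting tibble columns by regex, and cataloguing column names across CSV files, might look like the sketch below. It assumes dplyr is installed; column_catalogue is a hypothetical helper name and the sample column names are invented.

```r
library(dplyr)

# Invented sample tibble.
staff <- tibble(
  `Employee ID`   = c(1, 2),
  `Annual Salary` = c(52000, 61000),
  `Job Title`     = c("Research Associate", "Analyst")
)

# matches() selects every column whose name matches the regex
# (case-insensitive by default).
pay_columns <- select(staff, matches("salary"))
print(names(pay_columns))

# For each CSV file, list the column names that match `pattern`.
# Only the header and at most one data row are read per file.
column_catalogue <- function(paths, pattern) {
  per_file <- lapply(paths, function(path) {
    header <- names(read.csv(path, nrows = 1, check.names = FALSE))
    hits <- grep(pattern, header, ignore.case = TRUE, value = TRUE)
    data.frame(
      file   = rep(basename(path), length(hits)),
      column = hits,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, per_file)
}

# Possible use on real files:
# fileNames <- Sys.glob("*.csv")
# column_catalogue(fileNames, "salary|title")
```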
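For the spelling and naming challenges listed above, a few starting patterns are sketched here in base R. All three patterns and the sample values are assumptions for illustration; real data will almost certainly need more variants.

```r
# Race status: accept "White" or "W", ignoring case and stray spaces.
race <- c("White", "W", " white ", "Black", "Ww")
is_white <- grepl("^[[:space:]]*(w|white)[[:space:]]*$", race,
                  ignore.case = TRUE)

# Job title keyword: "research associate" with any amount of spacing.
titles <- c("Research Associate", "Sr. research  associate II",
            "Research Dept Clerk")
is_research_associate <- grepl("research[[:space:]]+associate", titles,
                               ignore.case = TRUE)

# Column names that may hold base salary: "Salary", "Annual Salary",
# "annual_salary", but not "Salary Grade".
columns <- c("Salary", "Annual Salary", "annual_salary",
             "Salary Grade", "Name")
is_salary_column <- grepl("^(annual[[:space:]_.]*)?salary$", columns,
                          ignore.case = TRUE)

print(is_white)
print(is_research_associate)
print(is_salary_column)
```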
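The common field types listed above can each be given a first-pass pattern. The ones below are deliberately simplified sketches that assume US-style formats (ZIP codes, ten-digit phone numbers, ISO dates); they check shape only, not validity, and the names patterns and looks_like are hypothetical.

```r
# Simplified, shape-only patterns; US formats assumed.
patterns <- c(
  email = "^[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}$",
  zip   = "^[0-9]{5}(-[0-9]{4})?$",
  phone = "^(\\([0-9]{3}\\) ?|[0-9]{3}[-. ]?)[0-9]{3}[-. ]?[0-9]{4}$",
  date  = "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$",
  ssn   = "^[0-9]{3}-[0-9]{2}-[0-9]{4}$"
)

# Does each value in `x` have the shape of the named field type?
looks_like <- function(type, x) {
  grepl(patterns[[type]], x)
}

print(looks_like("email", c("jane.doe@example.org", "not-an-email")))
print(looks_like("zip",   c("12345", "12345-6789", "1234")))
print(looks_like("phone", c("(555) 123-4567", "555-123-4567", "123-45")))
print(looks_like("date",  c("2019-10-06", "2019-13-01")))
print(looks_like("ssn",   c("123-45-6789", "123456789")))
```

The Social Security Number pattern only checks the ddd-dd-dddd layout; rejecting number ranges that are never issued is where it gets more complicated.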