14  Locales

Computer settings change depending on language and location, and being unaware of this possibility can make certain data wrangling challenges difficult to overcome.

The purpose of locales is to group together common settings that can affect:

  1. Month and day names, which are necessary for interpreting dates.
  2. The standard date format, also necessary for interpreting dates.
  3. The default time zone, essential for interpreting date-times.
  4. Character encoding, vital for reading non-ASCII characters.
  5. The symbols for decimals and number groupings, important for interpreting numerical values.

In R, a locale refers to a suite of settings that dictate how the system should behave with respect to cultural conventions. These settings affect the way data is formatted and presented, encompassing details such as date formatting, currency symbols, decimal separators, and other related aspects.

Locales in R affect several areas, including how character vectors are sorted, and date, number, and currency formatting. Additionally, errors, warnings, and other messages might be translated into languages other than English based on the locale.

14.1 Locales in R

To access the current locale settings in R, you can use the Sys.getlocale() function:

Sys.getlocale()
#> [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

To set a specific locale, use the Sys.setlocale() function. For example, to set the locale to US English:

Sys.setlocale("LC_ALL", "en_US.UTF-8")
#> [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

The exact string to use for setting the locale (like “en_US.UTF-8”) can depend on your operating system and its configuration.

The LC_ALL used in the above code refers to all locale categories. R, like many systems, breaks down the locale into categories, each responsible for different aspects listed below.

  • LC_COLLATE: for string collation
  • LC_TIME: date and time formatting
  • LC_MONETARY: currency formatting.
  • LC_MESSAGES: system message translations.
  • LC_NUMERIC: number formatting.

You can set the locale for each category individually if you don’t want to change everything with LC_ALL.

We have shown tools to control locales. These settings are important because they affect how your data looks and behaves. However, not all of these settings are available on every computer; their availability depends on what kind of computer you have and how it’s set up.

Changing these settings, especially LC_NUMERIC, can lead to unexpected problems when you’re working with numbers in R. For example, if you’re used to using a period as a decimal point, but your locale uses a comma, this disparity can create issues when importing data.

It is important to remember that these locale settings only last as long as one R session. If you change them while you’re working, they will revert to the default settings when you close R and open it again.

14.2 The locale function

The readr package includes a locale() function that can be used to learn or change the current locale from within R:

library(readr)
locale()
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %AD / %AT
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
#>         Thursday (Thu), Friday (Fri), Saturday (Sat)
#> Months: January (Jan), February (Feb), March (Mar), April (Apr), May
#>         (May), June (Jun), July (Jul), August (Aug), September
#>         (Sep), October (Oct), November (Nov), December (Dec)
#> AM/PM:  AM/PM

You can see all the locales available on your system by typing:

system("locale -a")

Here is what you obtain if you change the dates locale to Spanish:

locale(date_names = "es")
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %AD / %AT
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   domingo (dom.), lunes (lun.), martes (mar.), miércoles (mié.),
#>         jueves (jue.), viernes (vie.), sábado (sáb.)
#> Months: enero (ene.), febrero (feb.), marzo (mar.), abril (abr.), mayo
#>         (may.), junio (jun.), julio (jul.), agosto (ago.),
#>         septiembre (sept.), octubre (oct.), noviembre (nov.),
#>         diciembre (dic.)
#> AM/PM:  a. m./p. m.

14.3 Example: wrangling a Spanish dataset

In Section 6.3.2, we noted that reading the file:

fn <- file.path(system.file("extdata", package = "dslabs"), "calificaciones.csv")

had a encoding different than UTF-8, the default. We used guess_encoding to determine the correct one:

guess_encoding(fn)$encoding[1]
#> [1] "ISO-8859-1"

and used the locale function to change this and read in this encoding instead:

dat <- read_csv(fn, locale = locale(encoding = "ISO-8859-1"))

This file provides homework assignment scores for seven students. The columns represent the student name, their date of birth, the time they submitted their assignment, and the score they obtained, respectively. You can see the entire file using read_lines:

read_lines(fn, locale = locale(encoding = "ISO-8859-1"))
#> [1] "\"nombre\",\"f.n.\",\"estampa\",\"puntuación\""                       
#> [2] "\"Beyoncé\",\"04 de septiembre de 1981\",2023-09-22 02:11:02,\"87,5\""
#> [3] "\"Blümchen\",\"20 de abril de 1980\",2023-09-22 03:23:05,\"99,0\""    
#> [4] "\"João\",\"10 de junio de 1931\",2023-09-21 22:43:28,\"98,9\""        
#> [5] "\"López\",\"24 de julio de 1969\",2023-09-22 01:06:59,\"88,7\""       
#> [6] "\"Ñengo\",\"15 de diciembre de 1981\",2023-09-21 23:35:37,\"93,1\""   
#> [7] "\"Plácido\",\"24 de enero de 1941\",2023-09-21 23:17:21,\"88,7\""     
#> [8] "\"Thalía\",\"26 de agosto de 1971\",2023-09-21 23:08:02,\"83,0\""

As an illustrative example, we will write code to compute the students age and check if they turned in their assignment by the deadline of September 21, 2023, before midnight.

We can read in the file with correct encoding like this:

dat <- read_csv(fn, locale = locale(encoding = "ISO-8859-1"))

However, notice that the last column, which is supposed to contain exam scores between 0 and 100, shows numbers larger than 800:

dat$puntuación
#> [1] 875 990 989 887 931 887 830

This happens because the scores in the file use the European decimal point, which confuses read_csv.

To address this issue, we can also change the encoding to use European decimals, which fixes the problem:

dat <- read_csv(fn, locale = locale(decimal_mark = ",",
                                    encoding = "ISO-8859-1"))
dat$puntuación
#> [1] 87.5 99.0 98.9 88.7 93.1 88.7 83.0

Now, to compute the student ages, let’s try changing the submission times to date format:

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
dmy(dat$f.n.)
#> Warning: All formats failed to parse. No formats found.
#> [1] NA NA NA NA NA NA NA

Nothing gets converted correctly. This is because the dates are in Spanish. We can change the locale to use Spanish as the language for dates:

parse_date(dat$f.n., format = "%d de %B de %Y", locale = locale(date_names = "es"))
#> [1] "1981-09-04" "1980-04-20" "1931-06-10" "1969-07-24" "1981-12-15"
#> [6] "1941-01-24" "1971-08-26"

We can also reread the file using the correct locales:

dat <- read_csv(fn, locale = locale(date_names = "es",
                                    date_format = "%d de %B de %Y",
                                    decimal_mark = ",",
                                    encoding = "ISO-8859-1"))

Computing the students’ ages is now straightforward:

time_length(today() - dat$f.n., unit = "years") |> floor()
#> [1] 42 43 92 54 42 83 52

Finally, let’s check which students turned in their homework past the deadline of September 22:

dat$estampa >= make_date(2023, 9, 22)
#> [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE

We see that two students where late. However, with times we have to be particularly careful as some functions default to the UTC timezone:

tz(dat$estampa)
#> [1] "UTC"

If we change to the timezone to Eastern Standard Time (EST), we see no one was late:

with_tz(dat$estampa, tz =  "EST") >= make_date(2023, 9, 22)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

14.4 Exercises

1. Load the lubridate package and set the locale to French for this exercise.

2. Create a numeric vector containing the following numbers: 12345.67, 9876.54, 3456.78, and 5432.10.

3. Use the format() function to format the numeric vector as currency, displaying the values in Euros. Ensure that the decimal point is represented correctly according to the French locale. Print the formatted currency values.

4. Create a date vector with three dates: July 14, 1789, January 1, 1803, and July 5, 1962. Use the format() function to format the date vector in the “dd Month yyyy” format, where “Month” should be displayed in the French language. Ensure that the month names are correctly translated according to the French locale. Print the formatted date values.

5. Reset the locale to the default setting (e.g., “C” or “en_US.UTF-8”) to revert to the standard formatting.

6. Repeat steps 2-4 for the numeric vector, and steps 5-7 for the date vector to observe the standard formatting.