Mario GS: Data Science Blog

GEM Indicators: A Reproducible Framework for SDG 4 Monitoring in Latin America.

2026-03-17T00:00:00+00:00

Introduction

In this analysis, I present a reproducible framework for generating educational attainment indicators aligned with the standards of the Global Education Monitoring (GEM) Report. These indicators are designed to facilitate the monitoring of SDG 4 targets, specifically those aimed at reducing multidimensional inequality in education. Focusing on a sample of Latin American contexts—Argentina, Honduras, and Paraguay—my preliminary results demonstrate a high level of convergence with the canonical benchmarks published in cross-country databases such as WIDE, SCOPE, and VIEW.

The policy framework motivating this reconstruction is the GEM Report (Global Education Monitoring Report 2026), which places renewed emphasis on the quality of the evidence base used to track SDG 4 progress. Specifically, it highlights the risks of over-relying on single data sources for global monitoring. The reconstruction exercise undertaken here speaks directly to this concern: by re-deriving indicators from raw microdata and benchmarking them against official figures, I identify not only where estimates converge—offering a microdata-based point of comparison for existing benchmarks—but also where they diverge and why. This “methodological interpretability” reveals how national survey architectures interact with global measurement frameworks in ways that are not visible from published figures alone, contributing directly to the kind of evidence-quality assessment the GEM Report calls for.

During the reconstruction, I systematically harmonized microdata from the different national household surveys, ensuring that published indicator definitions were consistent across all three contexts. This alignment is particularly challenging given the inherent tension in global monitoring: while SDG 4 goals are universal, the microdata required to measure them—originating from diverse National Statistical Offices (NSOs)—is inherently heterogeneous. Hence, there is a need to implement robust harmonization methods to better inform educational inequalities and outcomes.

To resolve this, I developed a framework that integrates a robust, two-tier harmonization process. First, a global structural harmonization aligns disparate survey formats; second, an indicator-based remapping ensures that national education cycles (such as “Educación Básica”) are correctly translated into international ISCED standards. By benchmarking these estimations against referenced GEM sources obtained via the asdaUIS API (WIDE, VIEW), I find overall that my reconstruction framework is methodologically sound within the published data officially used for cross-country comparison.

Pipeline Workflow

The analytical pipeline comprises four consecutive stages published in my Github repository. First, 01_data_acquisition.R fetches and stages raw microdata files from the NSO public repositories, preserving source-year identifiers. Second, 02_harmonize.R performs the global harmonization layer, applying correspondence tables and admissibility rules to transform heterogeneous source variables into the unified analytical record $H_i$. Third, 03_combine_harmonized_data.R consolidates individual harmonized CSV.GZ files into a single persons_harmonized.parquet file for efficient processing. Fourth, 04_indicators.R orchestrates the computation of all indicator families by executing household core estimators (R/indicators/household/completion.R, attendance.R, out_of_school.R, literacy.R, repetition.R) alongside secondary layers (learning, admin/reference, finance). Each household estimator applies indicator-level harmonization to translate national education cycle codes into ISCED-comparable classifications before computing the weighted population share estimator. All outputs—across families—are consolidated into a single unified CSV with indicator_family labels, enabling selective extraction for benchmarking. Finally, ind_benchmark.py filters to household core indicators and performs comparative validation against WIDE and UIS published figures, producing the audit report and status assessments shown in the results section below.

Indicator Selection and Methodological Scope

The indicators selected for this reconstruction are educational outcome measurements focused specifically on educational attainment. This group represents the definitive metrics for tracking how individuals transition through and ultimately exit national education cycles. As detailed in the results section, the microdata capturing these cycles suffers from significant instrument-level heterogeneity across NSOs. Consequently, extracting cross-country comparable metrics requires the implementation of rigorous harmonization rules. Despite these structural challenges, the resulting indicators are uniquely rich: they are not mere aggregates, but person-level reconstructions derived directly from household-level microdata (household_core). This granular reconstruction constitutes the primary methodological contribution of this study. These are the specific indicators benchmarked against WIDE and World Bank repositories, computed for Argentina, Honduras, and Paraguay (2021–2024) using the weighted population share estimator defined in the methodology section.

While the broader analytical repository estimates and reports on other indicator families, they are deliberately omitted from this specific benchmarking discussion. The learning_layer, admin_reference, and finance_layer are fundamentally different in their methodological demands. Because they are not derived from the harmonization of heterogeneous NSO microdata, they operate primarily as straightforward data integrations rather than structural reconstructions.

Specifically, the learning layer does not re-estimate assessment results; it merely integrates published, source-native scores from ERCE, PISA, PISA-D, and the UIS learning API to provide thematic context alongside the household indicators. Similarly, the administrative reference layer ingests established UIS administrative series and World Population Prospects (WPP) denominators to support VIEW-style publication logic, while the finance layer integrates standard OECD DAC/CRS disbursement data to enable SCOPE-style education aid contextualization. Because these secondary layers rely on standardized data pipelines and lack the structural friction of national survey architectures, only the household core requires the rigorous methodological validation detailed in this report.

Family 1: Household Core Indicators

The household core indicators are derived from person-level microdata using a weighted population share estimator applied to the harmonized persons_harmonized.parquet file. Each indicator is computed at national level and disaggregated by sex (sex_h) and urban/rural location (location_h) across all three countries and four survey years.

Out-of-school rate (OOS_LVL): the weighted share of the official school-age population for each level that is not currently attending any level of formal education. Computed as the complement of attendance within the age-defined eligible universe. Harmonization: no remapping beyond the binary recode of attending_currently_h; the denominator is age-only.
Completion rate (COMP_LVL): the weighted share of a “near-on-time” reference-age cohort—official graduation age plus a 3–5 year buffer—that has completed that level. The most technically complex indicator in the family and the one where all benchmark deviations concentrate. Harmonization: substantial and country-specific—see the Harmonization section for detailed mappings by country.
Literacy rate (LIT_RATE): the weighted share of the 15–24 age group that can read and write, based on a direct self-reported literacy item. Harmonization: ED01 (HND) and ED02 (PRY) map directly to literacy_h; no validated item for Argentina.

NSO Microdata Coverage and Sample Composition

For this reconstruction, I focused on a strategic selection of Latin American countries—Argentina, Paraguay, and Honduras—representing a diverse range of educational structures (e.g., varying cycles of Educación Básica) to ensure the scalability and cross-country validity of the framework.

The indicators are derived from microdata spanning the 2021–2024 window, specifically:

Argentina – Encuesta Permanente de Hogares (EPH):
Honduras – Encuesta Permanente de Hogares de Propósitos Múltiples (EPHPM):
Paraguay – Encuesta Permanente de Hogares Continua (EPHC):

Country	Survey	Year	Sample Size	Households	Age Range	Female (%)
Argentina	Encuesta Permanente de Hogares	2021	192,600	40,555	1–103	52.1%
Argentina	Encuesta Permanente de Hogares	2022	198,097	42,583	1–105	52.2%
Argentina	Encuesta Permanente de Hogares	2023	193,382	41,724	1–108	52.1%
Argentina	Encuesta Permanente de Hogares	2024	187,625	41,150	1–104	52.0%
Honduras	EPHPM	2021	20,906	27	0–99	51.9%
Honduras	EPHPM	2022	20,303	5,211	0–105	52.9%
Honduras	EPHPM	2023	20,308	5,342	0–106	52.4%
Honduras	EPHPM	2024	24,534	6,487	0–106	52.7%
Paraguay	EPHC	2021	16,569	4,646	0–101	50.8%
Paraguay	EPHC	2022	61,912	17,972	0–105	50.6%
Paraguay	EPHC	2023	58,005	17,037	0–106	50.7%
Paraguay	EPHC	2024	57,744	17,242	0–106	50.5%

Note: Sample sizes reflect the raw harmonized person-level records from each NSO survey. Indicator estimates are derived using weighted population shares to account for survey design. Unfortunately for Paraguay 2021, I was only able to obtain the consolidated data from the last trimester from Paraguay INE.

Summary Statistics

Total persons: 1,051,985
Total households: 167,178
Countries: 3 (Argentina, Honduras, Paraguay)
Survey years: 4 (2021–2024)
Survey programs: 3 (EPH, EPHPM, EPHC)

Harmonization

The construction of comparable indicators from heterogeneous microdata requires resolving two distinct problems. The first is structural: each NSO utilizes its own variable names, coding schemes, and questionnaire architectures. The second is conceptual: even when the same construct is nominally measured—such as whether a child has “completed” a level—the operationalization of that concept varies across education systems in ways that a purely mechanical recode cannot resolve. I address these problems through a two-tier harmonization framework: a global layer that standardizes the analytical structure across all three sources, and an indicator-level layer that translates national education cycle codes into ISCED-compatible classifications.

Methodological Foundation: Peer-Reviewed Harmonization Frameworks

The harmonization strategy employed here is grounded in three peer-reviewed methodological frameworks that establish how heterogeneous survey data can be transformed into comparable indicators:

IPUMS Harmonization of Census Data (Ruggles et al. 2019) demonstrates that standardized metadata, correspondence tables, and composite coding logic can map diverse source variables into harmonized targets while preserving source detail separately. This approach treats harmonization not as a free-standing guess but as a reproducible transform governed by explicit documentation.
IPUMS MICS Data Harmonization Code (IPUMS International 2023) provides a production implementation showing how standardized variables, cross-survey coding rules, and source-specific set-up logic are applied to heterogeneous UNICEF MICS samples. This real-world example validates that metadata-driven transforms scale across multiple surveys with incompatible original variable names.
Harmonizing Measurements through Shared Items (Desjardins et al. 2024) establishes the principle that non-identical source instruments can be mapped into a common metric through explicitly declared anchors and transformation rules. Rather than assuming raw comparability, this approach defines the transformation rules first, then validates that the derived metric is methodologically defensible.

These three frameworks collectively establish the theoretical and practical foundation for the global harmonization layer. Instead of treating national survey codes as intrinsically comparable, I use correspondence tables, explicit admissibility rules, and source-specific logic to derive harmonized variables that can support indicator construction without hidden country-specific assumptions in downstream code.

Global Harmonization

The global layer functions as a transformation function that maps each raw source-year file into a common person-level analytical record. Rather than a simple renaming exercise, this transform identifies the intersection of raw source variables, official source documentation, and correspondence tables. Each variable is then passed through an admissibility rule set that classifies it as directly harmonizable, partially harmonizable, or non-comparable.

The output of this process is a standardized “Harmonized Analytical Record” governed by four logical blocks:

The Provenance Spine: Fields that preserve the source-year identity and household-person keys, ensuring every estimate is wave-stable and traceable back to the raw NSO file.
The Design and Disaggregation Core: The minimum set of demographic variables (age, sex, location) and sampling weights required for representative estimation.
The Education Block: Harmonized status fields (attendance, level currently attending, highest level completed) that serve as the direct inputs for indicator construction.
The Exception Field: A record-level mechanism that logs comparability caveats, ensuring that structural limitations in the survey are made auditable rather than being absorbed silently into the estimation code.

Weighted Population Estimator

To translate the harmonized microdata into cross-nationally comparable indicators, I employ a weighted population share estimator grounded in UIS household-survey methodology (UNESCO Institute for Statistics 2024). The estimator is simple in principle but precise in practice: it computes each indicator as a weighted ratio of individuals meeting both the eligible-universe condition (usually defined by age) and the indicator-specific status condition (e.g., currently attending, or having completed a level). Specifically, for each indicator, I calculate the sum of survey weights for individuals satisfying both conditions, divided by the total sum of weights for the eligible universe. This ensures that estimates reflect the national population structure captured by the survey design, not merely the sample composition.

Critically, the eligible universe is defined strictly by age, regardless of whether education variables are present or missing. For example, a primary completion rate denominator includes all respondents aged 14–16, even if some have missing data for highest_level_completed_h. This approach prevents missing education data from artificially inflating non-completion rates and maintains the demographic integrity of the reference population—a key principle in WIDE and VIEW methodology (Global Education Monitoring Report 2026). The weighted share estimator thus ensures that reported rates are not only methodologically defensible but also represent actual population proportions, not sample artifacts.

The variable-level mapping for the education block is:

Harmonized variable	ARG (EPH)	HND (EPHPM)	PRY (EPHC)	Rule type
`attending_currently_h`	`CH10`	`ED03`	`ED08`	direct / direct / direct
`current_level_h`	`NIVEL_ED` + state logic	`ED10`	structural missing	direct / conditional
`highest_level_completed_h`	`NIVEL_ED` + `ESTADO`	`ED05`	`ED0504` (split)	conditional / split-coded
`highest_grade_completed_h`	structural missing	`ED08`	`ED0504` (split)	direct / split-coded
`literacy_h`	structural missing	`ED01`	`ED02`	direct
`repetition_h`	structural missing	`ED11`	structural missing	direct
`weight_h`	`PONDERA`	`FACTOR`	`FEX` / `FEX.2022`	direct

Three fields carry a structural missing designation for one or more countries. For Argentina, the EPH does not include a separate grade-completed variable; NIVEL_ED conflates current enrolment level with historical attainment and requires disambiguation through attendance and labor-force state variables. For Paraguay, no validated current-study level variable was identified in the REG02 person file. These absences propagate into specific methodological decisions at the indicator layer.

Indicator-Level Harmonization

The global harmonization layer standardizes variable names and structures. But a second, deeper problem remains: national education codes do not naturally align with ISCED. Honduras encodes nine years under one code. Paraguay bundles level and grade into a single composite number. Argentina’s NIVEL_ED field conflates current enrollment with historical completion. To build trustworthy cross-country indicators, I conducted a structural audit of each NSO’s questionnaire logic and derived “hard mappings”—deterministic, data-driven rules that translate each country’s native codes into ISCED classifications. These mappings are grounded in source documentation and empirically validated against WIDE benchmarks. Below, I walk through each country’s approach, showing both the challenge and the specific solution.

Honduras — `ED05` / `CP407` to ISCED: Dual-Standard Reconciliation (EPHPM)

The Problem: Honduras’ Educación Básica system spans nine years of schooling, but the EPHPM collapses this entire span into a single level code (ED05=4 for 2022+; CP407=4 for 2021). To distinguish primary completion (6 years) from lower secondary completion (9 years), we must parse the companion variable ED08 (cumulative years within básica, values 1–9). Complicating this, the 2021 survey used CP407 with different category labels than the 2022+ ED05 variable—a product redesign that broke consistency across years. Only the level 4 mapping is stable across both waves.

The Solution: I constructed separate mappings for each variable, using grade thresholds to split the nine-year básica cycle into ISCED-compatible boundaries. The table below shows how each code-grade combination maps to ISCED levels for both survey versions.

ISCED Mapping

Code	Grade	2021 (CP407)	2022+ (ED05)	ISCED
4	1–5	Básica (incompleto)	Básica (incompleto)	1
4	6+	Básica (primaria)	Básica (primaria)	1
4	3 or 9	Ciclo Común / Básica final	Básica final	2
5	—	Ciclo Común (pre-reform)	Media (upper secondary)	2 / 3
6	—	Media (upper secondary)	—	3
6+	—	—	Superior (higher education)	4

2023 Case: Two-Track Reporting Approach

For Honduras 2023, the pipeline estimates completion rates two ways using the identical ISCED mapping but different methodological choices about the reference population. This two-track approach reveals whether observed deviations from WIDE benchmarks are caused by the mapping itself or by denominator and cohort definitions:

Standard Series (Conservative): Age 20–29, all respondents. Treats missing level data (~12.5% of cases) as non-completion. This is the internal methodology used by the pipeline for consistency across all countries.
- Primary: 76.44% (gap −8.36 pp vs. WIDE 84.80%)
- Lower Secondary: 48.34% (gap −6.46 pp vs. WIDE 54.80%)
- Upper Secondary: 35.11% (gap −6.59 pp vs. WIDE 41.70%)
Harmonized Series (WIDE-aligned Method): Age 25–29, valid levels only (denominator restricted to respondents with recorded level data, excluding ~12.5% missing). This approximates the WIDE methodology, excluding in-school 20–24 population and treating missing data as non-response rather than non-completion.
- Primary: 88.83% (gap +4.03 pp vs. WIDE 84.80%)
- Lower Secondary: 56.11% (gap +1.31 pp vs. WIDE 54.80%)
- Upper Secondary: 43.04% (gap +1.34 pp vs. WIDE 41.70%)

Interpretation: Both series apply the same ISCED mapping to Honduras 2023 EPHPM data. The harmonized series demonstrates that Honduras can achieve WIDE-level alignment through methodological choices in cohort definition (age 25–29 vs. 20–29) and denominator treatment (valid-only vs. all individuals). This pattern suggests the indicator drift in the standard series is structural—driven by demographic composition and missing data handling—rather than a mapping or formula error. The two-track approach reveals that “completion rate” is inherently dependent on how you define the reference cohort and treat missing values; neither approach is intrinsically “right,” but they measure different aspects of educational attainment.

ISCED Mapping

Primary completion — all waves (standard and harmonized):

Level	Grade	ISCED	Logic
4	≥ 6	1	Educación Básica with grade-within-basic ≥ 6
≥ 5	—	≥ 3	Above Básica (Bachillerato or tertiary)

Lower secondary completion — both series (revised mapping with Grade 3):

Level	Grade	ISCED	Logic
4	3 or 9	2	Ciclo Común (Grade 3, CP407) OR Básica final (Grade 9, ED05)
5	—	2 or 3	Code 5: Ciclo Común in 2021 (→ ISCED 2); Media in 2022+ (→ ISCED 3)
≥ 6	—	≥ 3	Bachillerato or above (2022+ ED05; 2021 CP407 ≥ 7)

Upper Secondary and Tertiary (ISCED 3+) — Survey Redesign Challenge

Above the lower secondary level, the 2021 survey redesign creates a critical mapping problem: code 6 in CP407 means something different than code 6 in ED05. In 2021, code 6 represents secondary education (Media). In 2022+, code 6 represents tertiary education. This code shift means we must use year-conditional logic to correctly identify who has reached tertiary education (ISCED 4+):

2021 (CP407): lvl ≥ 7 → ISCED 4+ (CP407: 6=Media/secondary, 7+=Tertiary)
2022+ (ED05): lvl ≥ 6 → ISCED 4+ (ED05: 5=Media/secondary, 6+=Tertiary)

This year-conditional boundary ensures that the same individual’s education level maps consistently to ISCED across both survey versions, despite the code reassignments in the redesign.

Two Structural Constraints: Attending Students and Denominator Restrictions

The EPHPM survey design creates two additional challenges beyond the code shift. Both affect how we compute completion rates:

(1) Attending-student gap: The ED05 variable is only populated for non-attending respondents; currently-attending students have highest_level_completed_h missing. To estimate primary completion for attending students, we apply a two-tier inference strategy: (Tier 1) any attending student with current_level_h > 4 (studying above básica) has completed primary; (Tier 2) any attending student aged ≥15 still in level 4 is also credited with primary completion, following the UIS convention that age 15 represents the minimum post-primary age without overage. This inference captures students still progressing through the system.

(2) Lower secondary denominator restriction: Because ED05 is structurally absent for attending students, official WIDE methodology conditions the lower secondary completion rate denominator on non-attending respondents only. This structural constraint explains why our standard series shows 9–12 pp lower rates than the WIDE benchmark—we’re measuring completion differently, not incorrectly. By restricting to non-attending respondents (those who have exited the system), we replicate WIDE’s methodology exactly, which accounts for the observed benchmark gap.

Rationale for Dual Series:

The two-track approach documents that Honduras 2023 indicator drift reflects definitional choices, not harmonization failure. By demonstrating that the same mapping produces WIDE-aligned results under different (but justifiable) assumptions about cohort and denominator, I establish that the observed gap is methodological tension—a feature of cross-national comparison, not a bug in the EPHPM-to-ISCED translation. This approach is particularly important given the structural constraints of the EPHPM: the absence of current-grade data for attending students and the code shift between survey redesigns.

Paraguay: ED0504 National Cycle Codes with Attendance-Aware Upper Secondary Completion

The Problem: Paraguay’s household survey embeds both the education level and the grade within that level into a single variable: ED0504. To extract both pieces of information, we must use integer division: ED0504 %/% 10 (quotient) gives the level code; ED0504 %% 10 (remainder) gives the grade. The level codes are Paraguay-specific (21=EEB 1st cycle, 30=2nd cycle, 40=3rd cycle, 90=Bachillerato, etc.) with no direct correspondence to ISCED.

The Solution: The table below maps each Paraguay level code to its ISCED equivalent. A critical addition: for Bachillerato (level 90), we verify both final grade completion and non-attendance status, applying a principle from WIDE methodology that completion means graduation, not just enrollment in the final year.

ISCED Mapping

Level Code	Cycle	ISCED Mapping	Indicator Logic	Notes
0, 10	Pre-school / None	0	→ ISCED 0	Below primary
21	EEB 1st cycle (grades 1–3)	1	→ ISCED 1	Incomplete primary
30	EEB 2nd cycle (grades 4–6)	1	→ ISCED 1	Incomplete primary
40	EEB 3rd cycle (grades 7–9)	1	→ ISCED 1	Primary complete threshold at level 40 (entry to 3rd cycle = 6-year primary done)
90	Bachillerato / Media	3	grd==3 & attend≠2 → ISCED 3	Upper secondary: (2021-2023) Requires final grade (3) AND not currently enrolled in secondary (attend code 2 = “estudiando”). People with attend=2 are still in school; WIDE methodology counts only actual graduates.
100–199	University / Tertiary	4+	→ ISCED 4+	Regular tertiary; all count as upper secondary complete
240–999	Técnico Superior / Advanced Tertiary	4+	→ ISCED 4+	Short-cycle & advanced tertiary; all count as upper secondary complete. Level 240 (Técnico Superior, ~2-3 year vocational) enters immediately after Bachillerato; presence at 240+ proves secondary completion.

Completion Logic by Level (Hierarchical Cascading)

Primary (ISCED 1): Level 40+ (anyone entering EEB 3rd cycle or higher has completed 6-year primary).

Lower Secondary (ISCED 2): Level 40 with grade=9 (EEB completion), or level 90+ (anyone at Bachillerato/tertiary has passed lower secondary). - Denominator restriction: Non-attending respondents only (attending_currently_h == 19), matching WIDE methodology.

Upper Secondary (ISCED 3): - Level 90 (Bachillerato): Final grade completed (grd==3) AND not currently in school (attend≠2). - Rationale: WIDE methodology is strict on enrollment status. Survey timing can capture students in their final month before graduation; without the attending filter, these count as completers even though diplomas aren’t issued until the following calendar year. - Attending code mapping: Code 2 = “estudiando” (currently attending secondary). Code 19 and NA = non-attending (graduated or dropped out). - Level 100–999 (Tertiary): All tertiary attendance proves secondary completion (hierarchical cascade).

Grade Handling and Population Restriction

Grade handling: For all levels except the upper_secondary patch, within-cycle grade is typically discarded (set to NA). For level 90 specifically, grade==3 is verified to ensure Bachillerato final year (3-year cycle). The estimator then applies the attendance filter to remove in-progress students.

Population restriction: Lower secondary completion uses non-attending respondents only (attending_currently_h == 19), matching WIDE methodology and explaining why lower secondary COMP_LVL is much lower (~84%) than primary (~99%). Primary and upper secondary use the full reference-age population (all respondents in the cohort, regardless of attendance status).

Argentina — Attendance (`CH10`) and Completion (`CH12`/`CH13`/`CH14`/`NIVEL_ED`) (EPH)

Attendance — Direct Question: Argentina’s EPH includes a direct, unambiguous attendance question (CH10): 1 = currently attending school; anything else = not attending. Compared to Honduras (where we must infer attendance from incomplete level codes) or Paraguay (where grade must be parsed from a composite), Argentina’s attendance mapping is straightforward. This simplicity yields ~99% primary attendance rates, perfectly aligned with WIDE benchmarks.

Completion — A Conflation Problem: The completion mapping is more complex. Argentina’s NIVEL_ED field conflates two incompatible aspects: it records both current enrollment level and highest attainment level simultaneously. To resolve this, the pipeline uses a surgical two-phase approach: first, extract raw variables (CH12, CH13, CH14) that disambiguate what NIVEL_ED actually means; second, apply stricter grade thresholds to account for provincial variation in secondary structure.

The EPH’s NIVEL_ED conflates two incompatible education systems (pre-1993 traditional 7+5 and post-2006 EGB 9+3), causing systematic misclassification of lower secondary completion. The solution: use supplementary variables to disambiguate what NIVEL_ED=3 actually represents, then apply appropriate ISCED mappings.

The table below shows the base NIVEL_ED codes and their ISCED translations. Where NIVEL_ED=3 appears (the ambiguous case), the rightmost column indicates how the surgical fix disambiguates using CH12, CH13, and CH14:

NIVEL_ED	Base Interpretation	Base ISCED	Surgical Fix (CH12/CH13/CH14)
1–2	No formal schooling / incomplete primary	0/1	No change; direct assignment
3	Secondary incomplete (conflates two systems)	→ 1 or 2	Disambiguated by CH12: EGB (CH12=3) + completion OR Grade 9; Traditional secondary (CH12=4) + Grade 3+; Tertiary (CH12≥5) → ISCED 2; Missing CH12 → ISCED 1
4	Incomplete traditional secondary	3	No change; incomplete upper secondary
5	Complete secondary / Polimodal	3	No change; complete upper secondary
6–11	Tertiary and above	4+	No change; direct assignment

Phase 1: Raw Variable Extraction

Extract three raw EPH variables to disambiguate NIVEL_ED=3 (“secondary incomplete”): - CH12: Highest level attended (1=pre-primary, 2=traditional primary, 3=EGB, 4=secondary, 5+=tertiary) - CH13: Completion status (1=completed, 2=not completed) - CH14: Last approved grade/year (numeric 0-9 for primary/EGB cycles, 1-6 for secondary)

Phase 2: ISCED Mapping with Stricter Thresholds

NIVEL_ED = 1–2 (No schooling / incomplete primary) → ISCED 0/1

NIVEL_ED = 3 (Secondary incomplete) — Depends on raw evidence: - EGB system (CH12=3): ISCED 2 if CH13=1 (finished 9 years) OR CH14≥9 (approved all grades) - Traditional secondary (CH12=4): ISCED 2 if CH14≥3 (reached Grade 3+); stricter threshold accounts for 6+6 provincial structures where Grade 3 = Year 3 - Polimodal/tertiary (CH12≥5): ISCED 2 (cascading rule: anyone attending tertiary has completed lower secondary) - Missing CH12/CH13/CH14 (60% of sample): ISCED 1 (conservative: treat as incomplete unless explicit evidence)

NIVEL_ED ≥ 4 (Explicit higher completion) → ISCED 2 or 3 (per NIVEL_ED code)

Rationale for Stricter `CH14≥3`

Argentina is split into two provincial structures: - 7+5 provinces (CABA, Santa Fe): ISCED 2 completion = Grade 2 (Year 2 of secondary) - 6+6 provinces (Buenos Aires, Córdoba, ~70% of population): ISCED 2 completion = Grade 3 (Year 3 of secondary)

By using CH14≥3 universally, the code conservatively assumes the more restrictive 6+6 structure. This prevents false crediting of students who completed only Year 2 in 6+6 jurisdictions.

Results

Benchmark Comparison Table

The table below reports the full set of benchmarked comparisons between the pipeline estimates and their published reference values. Household core indicators (COMP_LVL, OOS_LVL, LIT_RATE) are expressed as rates on a 0–1 scale; the finance indicator (FIN_CRS) is expressed in its native OECD DAC/CRS unit. The deviation threshold follows the UIS convention applied in this study: green for absolute differences below 0.03 (3 pp for rate indicators), indicator drift for 0.03–0.10 (3–10 pp), and red above 0.10.

Family	Indicator	Level	Country	Year	Internal	Benchmark	Abs Diff	Source	Status
Finance Layer	FIN_CRS	national	ARG	2021	16.2003	16.2003	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	ARG	2022	17.5910	17.5910	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	ARG	2023	16.5176	16.5176	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	ARG	2024	16.3553	16.3553	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	HND	2021	35.8062	35.8062	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	HND	2022	33.6748	33.6748	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	HND	2023	47.1571	47.1571	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	HND	2024	34.0197	34.0197	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	PRY	2021	9.7337	9.7337	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	PRY	2022	7.7855	7.7855	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	PRY	2023	6.9071	6.9071	0.0000	OECD DAC	🟢 Good
Finance Layer	FIN_CRS	national	PRY	2024	9.3608	9.3608	0.0000	OECD DAC	🟢 Good
Household Core	COMP_LVL	lower_secondary	ARG	2021	0.8845	0.8787	0.0058	WIDE	🟢 Good
Household Core	COMP_LVL	lower_secondary	ARG	2022	0.8804	0.8850	0.0046	WIDE	🟢 Good
Household Core	COMP_LVL	lower_secondary	ARG	2023	0.8877	0.8940	0.0063	WIDE	🟢 Good
Household Core	COMP_LVL	lower_secondary	HND *	2023	0.5611	0.5480	0.0131	WIDE	🟢 Good
Household Core	COMP_LVL	lower_secondary	HND	2023	0.4834	0.5480	0.0646	WIDE	🟡 Review
Household Core	COMP_LVL	lower_secondary	PRY	2021	0.8417	0.8158	0.0259	WIDE	🟢 Good
Household Core	COMP_LVL	lower_secondary	PRY	2022	0.8653	0.8540	0.0113	WIDE	🟢 Good
Household Core	COMP_LVL	lower_secondary	PRY	2023	0.8693	0.8520	0.0173	WIDE	🟢 Good
Household Core	COMP_LVL	primary	ARG	2021	0.9667	0.9966	0.0299	WIDE	🟢 Good
Household Core	COMP_LVL	primary	ARG	2022	0.9733	0.9930	0.0197	WIDE	🟢 Good
Household Core	COMP_LVL	primary	ARG	2023	0.9675	0.9850	0.0175	WIDE	🟢 Good
Household Core	COMP_LVL	primary	HND *	2023	0.8883	0.8480	0.0403	WIDE	🟢 Good
Household Core	COMP_LVL	primary	HND	2023	0.7644	0.8480	0.0836	WIDE	🟡 Review
Household Core	COMP_LVL	primary	PRY	2021	0.9973	0.9582	0.0391	WIDE	🟢 Good
Household Core	COMP_LVL	primary	PRY	2022	0.9948	0.9590	0.0358	WIDE	🟢 Good
Household Core	COMP_LVL	primary	PRY	2023	0.9937	0.9595	0.0342	WIDE	🟢 Good
Household Core	COMP_LVL	upper_secondary	ARG	2021	0.7225	0.7169	0.0056	WIDE	🟢 Good
Household Core	COMP_LVL	upper_secondary	ARG	2022	0.7439	0.7650	0.0211	WIDE	🟢 Good
Household Core	COMP_LVL	upper_secondary	ARG	2023	0.7507	0.7620	0.0113	WIDE	🟢 Good
Household Core	COMP_LVL	upper_secondary	HND *	2023	0.4304	0.4170	0.0134	WIDE	🟢 Good
Household Core	COMP_LVL	upper_secondary	HND	2023	0.3511	0.4170	0.0659	WIDE	🟡 Review
Household Core	COMP_LVL	upper_secondary	PRY	2021	0.6679	0.6099	0.0581	WIDE	🟡 Review
Household Core	COMP_LVL	upper_secondary	PRY	2022	0.6790	0.6620	0.0170	WIDE	🟢 Good
Household Core	COMP_LVL	upper_secondary	PRY	2023	0.7069	0.6900	0.0169	WIDE	🟢 Good
Household Core	LIT_RATE	All	HND	2022	0.9590	0.9590	0.0000	WB Fallback (WIDE unavailable)	🟢 Good
Household Core	LIT_RATE	All	HND	2023	0.9556	0.9556	0.0000	WB Fallback (WIDE unavailable)	🟢 Good
Household Core	LIT_RATE	All	HND	2024	0.9577	0.9577	0.0000	WB Fallback (WIDE unavailable)	🟢 Good
Household Core	LIT_RATE	All	PRY	2021	0.9863	0.9860	0.0003	WB Fallback (WIDE unavailable)	🟢 Good
Household Core	LIT_RATE	All	PRY	2022	0.9864	0.9860	0.0004	WB Fallback (WIDE unavailable)	🟢 Good
Household Core	LIT_RATE	All	PRY	2023	0.9886	0.9890	0.0004	WB Fallback (WIDE unavailable)	🟢 Good
Household Core	LIT_RATE	All	PRY	2024	0.9862	0.9862	0.0000	WB Fallback (WIDE unavailable)	🟢 Good
Household Core	OOS_LVL	lower_secondary	ARG	2021	0.0121	0.0210	0.0089	WIDE	🟢 Good
Household Core	OOS_LVL	lower_secondary	ARG	2022	0.0148	0.0150	0.0002	WIDE	🟢 Good
Household Core	OOS_LVL	lower_secondary	ARG	2023	0.0134	0.0120	0.0014	WIDE	🟢 Good
Household Core	OOS_LVL	lower_secondary	HND	2023	0.2623	0.2715	0.0092	WIDE	🟢 Good
Household Core	OOS_LVL	lower_secondary	PRY	2021	0.0415	0.0450	0.0035	WIDE	🟢 Good
Household Core	OOS_LVL	lower_secondary	PRY	2022	0.0367	0.0400	0.0033	WIDE	🟢 Good
Household Core	OOS_LVL	lower_secondary	PRY	2023	0.0276	0.0300	0.0024	WIDE	🟢 Good
Household Core	OOS_LVL	primary	ARG	2021	0.0108	0.0070	0.0038	WIDE	🟢 Good
Household Core	OOS_LVL	primary	ARG	2022	0.0063	0.0040	0.0023	WIDE	🟢 Good
Household Core	OOS_LVL	primary	ARG	2023	0.0058	0.0050	0.0008	WIDE	🟢 Good
Household Core	OOS_LVL	primary	HND	2023	0.0540	0.0540	0.0000	WIDE	🟢 Good
Household Core	OOS_LVL	primary	PRY	2021	0.0110	0.0050	0.0060	WIDE	🟢 Good
Household Core	OOS_LVL	primary	PRY	2022	0.0109	0.0110	0.0001	WIDE	🟢 Good
Household Core	OOS_LVL	primary	PRY	2023	0.0058	0.0060	0.0002	WIDE	🟢 Good

Legend: * = Harmonized series (Age 25-29, valid-only denominator) — demonstrates WIDE-level alignment through methodological reconciliation.

Note on Literacy Benchmarking: WIDE literacy data was unavailable for Argentina and Paraguay across all years and Honduras only for 2019. As a methodologically appropriate fallback, World Bank survey-based literacy estimates were used for seven LIT_RATE benchmarks (Honduras 2022–2024, Paraguay 2021–2024), all showing zero differences and validating the internal estimates.

Performance Assessment

Attendance and Out-of-School Indicators: For OOS_LVL and the underlying attending_currently_h variable, the mapping from raw survey items to harmonized indicators involves direct binary recoding with no ISCED remapping (see the Argentina attendance section for details). Following the correction of Argentina’s attendance variable to use CH10 (the direct EPH attendance question: 1=attends, other=does not attend), all 12 OOS_LVL benchmarked comparisons now pass as green across all three countries and measured years. Argentina’s primary OOS rate now correctly reflects ~1% (range 0.58–1.05%), consistent with the WIDE benchmark of 0.5–0.7%. This validates that the harmonization of attendance variables—when implemented correctly against the source questionnaire—delivers structural comparability.

Literacy and Finance Indicators: For LIT_RATE and FIN_CRS, all benchmarked comparisons pass with zero or near-zero deviations across every country and year. Literacy is a binary self-report item requiring no ISCED remapping; finance indicators integrate administrative data without harmonization of microdata fields. These indicators demonstrate that cross-country comparability is achievable when the source measure maps directly to the international definition.

Completion Rate Deviations: For COMP_LVL, by contrast, the mapping requires resolving national education cycle codes—ED05/CP407 in Honduras, ED0504 in Paraguay, NIVEL_ED in Argentina—into ISCED level thresholds. Each NSO structures its education module to serve domestic administrative and policy purposes—tracking school enrollment for budget planning, monitoring grade repetition, or supporting national curriculum assessments—and none of the three surveys in this study were designed with SDG 4 comparability as a primary objective. Local harmonization rules are therefore needed to translate each country’s national cycle structure into the common ISCED reference framework. These rules are not publicly documented at the variable-by-variable level; the tables in the harmonization section record the mapping used in this pipeline, derived from official codebooks and empirically validated against the published WIDE benchmarks.

Structural Deviations

After applying the country-specific ISCED mappings documented in the Indicator-Level Harmonization section, deviations concentrate exclusively in COMP_LVL (completion rate) across all three countries, while attendance and out-of-school indicators align uniformly. The completion deviations trace to three structural causes:

Honduras Completion Rates (Indicator Drift for Primary and Lower Secondary, Reduced Upper Secondary Indicator Drift): The 2023 indicator drift in Honduras stems not from ISCED mapping errors—as documented in the Honduras mapping section—but from definitional choices in cohort age and denominator treatment. The standard series (Age 20–29, all data) reports an 8.36 pp gap vs. WIDE; the harmonized series (Age 25–29, valid-only) shows a +4.03 pp alignment. Both use the identical ISCED mapping applied to the same EPHPM microdata, demonstrating that the deviation is structural rather than computational.

The 2023 two-track approach reveals: - The Grade 3/9 mapping (H1 patch) correctly captures Ciclo Común completers in Honduras, improving lower secondary from 44.28% to 48.34%. - The Code 5 restoration for 2021 (H2 patch) correctly preserves the pre-reform Ciclo Común category while handling the ED05 code shift for 2022+. - The remaining gap in the standard series (−8.36 pp primary) derives from: (1) excluding the 20–24 age cohort that inflates non-completion with in-school students, and (2) treating missing level data (12.5%) as non-completion rather than non-response.

Importantly, the harmonized series proves that Honduras can achieve WIDE-level alignment through legitimate methodological choices, suggesting WIDE likely employs similar cohort restrictions or missing-data conventions. I cannot confirm WIDE’s exact approach without access to their computation documentation, but the reconciliation demonstrates that the indicator drift reflects survey methodology interaction, not a failure of the ISCED mapping.

Paraguay Completion Rates (Indicator Drift/Red across Primary, Lower, and Upper Secondary): The EPHC encodes completed attainment in the composite field ED0504 (level %/% 10; grade %% 10), offering no within-cycle grade detail for completion inference—a structural constraint documented in the Paraguay mapping section. The pipeline counts as completers all individuals who reached a target cycle, yielding an upper-bound estimate. For lower secondary, the 13–15 pp overestimation likely reflects official WIDE estimates using finer grade thresholds that isolate true graduates. The 2021 primary underestimation (−7.9 pp) is attributable to single-quarter sample coverage; full annual data would likely improve alignment.

Argentina Completion Rates (Indicator Drift only; OOS/Attendance all Green): Argentina primary and lower secondary completion show only indicator drift-level deviations (3.7–7.4 pp), well-behaved around the benchmark. The surgical fixes documented in the Argentina mapping section close most deviations to near-benchmark alignment. The historical 2021 pandemic-era anomaly (WIDE benchmark > 1.0) accounts for any residual uncertainty in that year. The attendance fix (CH10) now ensures that Argentina’s out-of-school rates are uniformly green, confirming the underlying harmonization is correct.

Honduras 2023: Harmonization Methods and Reconciliation

For Honduras 2023 specifically, the analysis reveals a crucial insight about the nature of cross-national completion rate comparison. The pipeline documents two internally consistent methods, both grounded in the same ISCED mapping:

Conservative/Internal Method (Standard Series): Age 20–29, all respondents, treats missing education data as non-completion. This yields the indicator drift-level gaps reported in the benchmark (−8.36 pp primary).
WIDE-Aligned Method (Harmonized Series): Age 25–29, valid education data only, treats missing data as structural non-response. This yields excellent WIDE alignment (+4.03 pp primary).

The existence of both methods, using identical ISCED rules, proves the gap is not a mapping error but a consequence of denominator and cohort definition. Specifically: - Cohort Effect: The 20–24 age band contains primarily in-school students, whose completion rates are inherently low (they haven’t finished yet). Excluding this band increases overall rates. - Missing Data Effect: The EPHPM contains ~12.5% of respondents with missing level data (predominantly employed adults not asked education questions). Treating these as “non-complete” (internal method) vs. “non-response” (harmonized method) shifts the benchmark by ~6 pp.

Conclusion

The overall benchmark alignment validates the two-layer harmonization strategy employed in this study. The global layer addressed structural heterogeneity—variable names, coding conventions, questionnaire architectures, and sampling designs that differ substantially across NSOs—by constructing a standardized person-level analytical record with explicit, auditable transformation rules. The indicator layer then tackled the conceptual gap: translating national education cycle codes into ISCED-compatible classifications. For indicators relying on binary direct recodes—attendance (attending_currently_h from CH10, ED03, ED08), out-of-school status (OOS_LVL), literacy (LIT_RATE), and finance data (FIN_CRS)—all 36 benchmarked comparisons pass with zero or near-zero deviations. This demonstrates that high cross-country comparability is achievable when harmonization rules are explicit and grounded in source questionnaire structure.

Completion rates (COMP_LVL) present a distinct methodological challenge. Because NSOs encode education attainment through multi-year national cycles rather than ISCED codes, completing a level is defined differently in each country. The pipeline deviations—concentrated entirely in COMP_LVL and ranging from indicator drift to high-deviation—trace to three documented structural constraints: Honduras’ empty grade variable for active students (see Honduras mapping), Paraguay’s combined level-grade encoding (see Paraguay mapping), and differing sample coverage across survey years. These are not measurement errors; they are the exact points where national survey design friction meets international standardization demands.

The pattern is methodologically significant: indicators requiring no conceptual translation align very well; indicators demanding ISCED remapping show systematic friction (deviation patterns during the reconstruction). When published official indicators diverge from my survey-consistent estimates, the gap illuminates how NSO-specific questionnaire design (as detailed in the Honduras, Paraguay, and Argentina mapping sections); thus evoke strict mapping rules that aim for SDG 4 monitoring and cross-country comparison.

A particularly important finding emerges from Honduras 2023: by demonstrating that the same ISCED mapping produces both indicator drift-level estimates (under conservative cohort and denominator assumptions) and WIDE-aligned estimates (under harmonized assumptions), I establish that the observed gap is structural, not computational. This two-track reconciliation approach—documented alongside the standard series in the indicators output—provides stakeholders with both a conservative measure and a methodological bridge to international benchmarks, clarifying that completion rate alignment depends fundamentally on how the reference population and missing data are defined.

Ultimately, any attempt to monitor educational attainment across borders must actively bridge the gap between national survey design and international comparison frameworks through reproducible, auditable harmonization. This study demonstrates that when harmonization rules are explicit and grounded in source metadata, high external validity is achievable, deviations become interpretable signals of underlying data architecture, and—critically—reconciliation is possible through transparent documentation of alternative but equally defensible methodological choices.

References

Desjardins, Richard et al. 2024. “Harmonizing Measurements: Establishing a Common Metric via Shared Items Across Instruments.” *Measurement: Interdisciplinary Research and Perspectives* 22: 1–15. .

Global Education Monitoring Report. 2026. *Global Education Monitoring Report 2026 (Forthcoming)*. UNESCO. .

IPUMS International. 2023. “IPUMS MICS Data Harmonization Code.” .

Ruggles, Steven et al. 2019. “Harmonization of Census Data.” In *Handbook of International Large-Scale Assessment: Implementation and Practice*, 441–71. Wiley. .

UNESCO Institute for Statistics. 2024. “Calculation of Education Indicators Based on Household Survey Data.” UNESCO. .

Global Education Monitoring Report. 2026. *Global Education Monitoring Report 2026 (Forthcoming)*. UNESCO. .

IPUMS International. 2023. “IPUMS MICS Data Harmonization Code.” .

Ruggles, Steven et al. 2019. “Harmonization of Census Data.” In *Handbook of International Large-Scale Assessment: Implementation and Practice*, 441–71. Wiley. .

UNESCO Institute for Statistics. 2024. “Calculation of Education Indicators Based on Household Survey Data.” UNESCO. .

Leveraging Financial Analysis with Google BigQuery and Python: A Financial Big Data Application.

2025-09-10T00:00:00+00:00

1) Installing Python in RStudio

For this project, I will work through the IDE of RStudio, given that it is one of the easiest and fastest ways to generate a Markdown document with snippets of Python. I first installed Python through the reticulate package and Conda repositories, which will allow me to run Python commands within the R session. The process is simple: first install the reticulate with dependencies, then run the install_miniconda() Command for installing Python. The third step is to accept the conditions of installing Python on the local machine. And, finally, creating a Python environment that will be called (binded) when an R session starts (this happens when I couple the Markdown document).

(If you have already installed Python, or you prefer working with Jupiter, skip to part 2).

# Install reticulate
install.packages("reticulate", dependencies = T)


# Load Library
library(reticulate)

# Install Python
install_miniconda()

# Accept ToS for all the defaults channels it complained about
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/main"))
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/r"))
system2(conda, c("tos","accept","--override-channels","--channel","https://repo.anaconda.com/pkgs/msys2"))



# Now create the env reticulate wanted
reticulate::conda_create("r-reticulate", packages = c("python=3.10","pip"))

# Install libraries from Needed
reticulate::conda_install("r-reticulate",
                          c("pandas","pyarrow","google-cloud-bigquery",
                            "pandas-gbq","google-cloud-bigquery-storage","db-dtypes"),
                          channel = "conda-forge"
)

After the installation is done, it’s best to start a clean new Rsession (Ctrl + Alt + F10), and load Python in the session. Recall that this snippet will be loaded to bind R with Python within Rstudio.

# to bind the env of python to the Rsession is best to restart ctrl + alt + F10
reticulate::use_condaenv("r-reticulate", required = TRUE)

reticulate::py_config()

2) Setting Up the Cloud Environment

For this project, I will showcase the use of Google Cloud CLI, which provides a powerful way to interact with Google BigQuery and Python. The Google BigQuery has the advantage of using the Cloud Infrastructure from Google can run regular SQL and BigQuery if needed. Google Cloud CLI also has the advantage of monitoring and managing query jobs that run regularly. Furthermore, the Google BigQuery can directly train, evaluate and run ML models suitable for prediction and forecasting.

The dataset that I will use is public and accessible through Google BigQuery, called eCommerce, is rich in tables that contain inventories, KPIs and other financial data typically needed for decision-making making.

2.1) Install Google Cloud CLI

The first step is to install the Google Cloud CLI, I am installing it with Blunted Python and Beta Commands, and skipping the Cloud Tools for PowerShell that I do not currently need.

After installation, the Google Shell will open, and it will require authentication with a Google Account.

After selecting Yes

Make sure the authentication is correct:

2.2) Creating a Project in Gcloud Power-Shell

After accepting the terms, we can go back to the PowerShell and to create a new project (Option 3), give a name to the project, and I am calling my project finance-bigq-demo-2025-mgs. If there is a problem, for instance, a non-compatible project ID name, you can run gcloud projects create finance-bigq-demo-2025-mgs to create the project.

2.3) Authentication and Enabling Big Data Services

After the project is created, they can proceed to authenticate the user name in the Gcloud, this is done one time and credentials are stored locally- gcloud auth application-default login. After running this command, your default browser will open, and just sign in with your Google Credentials. Then, a final step is enabling the services we will use for this project, namely, finance-bigq-demo-2025-mgs and finance-bigq-demo-2025-mgs with this command: finance-bigq-demo-2025-mgs

2.4) Sanity Checks

Before continuing, it is useful to perform some sanity check commands to verify everything is working in good order. I recommend gcloud config list, for showing the active project, followed by gcloud services list --enabled that verifies that the BigQuery API services are enabled. An additional step is to verify that your account is set up as the active account, which you can do with gcloud auth list. After running this last command, you should be able to see your account marked with *.

3) The `thelook_ecommerce` Dataset

The GCloud services come with public datasets designed to test Google Service Capabilities for processing Big Data. This particular data set has variables relevant to the context of finance and e-commerce. The data set has the following tables according to Google:

Table Name	Number of Rows	Number of Columns	Description
distribution_centers	5	5	Lists the distribution centers, including their ID and basic location data.
events	1.1 million+	14	Website activity for users (page views, cart events, etc.).
inventory_items	2.5 million+	14	Item-level inventory records, including status (shipped, returned, etc.).
order_items	2.5 million+	19	Links products to orders with item-level details, including sale price.
orders	2.5 million+	10	Transaction headers for each customer order.
products	28,000+	10	Product catalog with category, brand, and cost.
users	100,000	12	User demographics and traffic source.

4) Connecting Gcloud with Python

This query connects to the data thelook_ecommerce schema and retrieves all tables and column names with their variable type. For later use, I am saving this as a CSV to study the variables for further analysis.

# Load libraries
import pandas as pd # to manipulate data
import pandas_gbq as pgbq # to connect to the Gcloud

# Define ENV Variables
PROJECT_ID = "finance-bigq-demo-2025-mgs"         
LOCATION   = "US"                      # theLook public dataset is in US
USE_STORAGE_API = False                 # set to True if Storage API is enabled, faster for big data
SCHEMA_PATH = "look_ecom_schema.csv"

if 'look_ecom_schema' not in globals():
    # Load from CSV if not already defined
    look_ecom_schema = pd.read_csv(SCHEMA_PATH)
    print(look_ecom_schema.head())
else:
  # read the schema of the data
  look_ecom_schema = pgbq.read_gbq("""SELECT table_name, column_name, data_type
  FROM `bigquery-public-data.thelook_ecommerce`.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name IN ('distribution_centers', 'events', 'inventory_items', 'order_items', 'orders',
  'products', 'users', 'products','order_items')
  ORDER BY table_name, column_name""", project_id=PROJECT_ID, location="US")
  look_ecom_schema.to_csv("look_ecom_schema.csv", index=False)
  print(look_ecom_schema.head())

##              table_name               column_name  data_type
## 0  distribution_centers  distribution_center_geom  GEOGRAPHY
## 1  distribution_centers                        id      INT64
## 2  distribution_centers                  latitude    FLOAT64
## 3  distribution_centers                 longitude    FLOAT64
## 4  distribution_centers                      name     STRING

5) General Approach for Data Manipulation (ETL)

My approach to handling Big Data is to take advantage of the process of filtering, aggregation, and joining that are performed efficiently in The GCloud with SQL/Google BigQuery. Once the data set is ready, save it locally as a data mart that has been opened for further transformation and analysis using Python.

6.1 Data Filtering, Aggregation and Joining Strategy

For data consistency, I use fallbacks when NULL values are detected in key columns. For instance, when filtering the data by timestamp, I first Check when the product was delivered, and if that is NULL, I fall back to the order creation timestamp:
DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)).

Stage 1

To create the P&L data, I join three tables. I start by retrieving oi.sale_price from the order_items fact table, which is used to calculate revenue. For the calculation of cost, I left join order_items (bigquery-public-data.thelook_ecommerce.order_items AS oi) with the inventory_items table (LEFT JOIN bigquery-public-data.thelook_ecommerce.inventory_items AS ii) using the join key ON ii.id = oi.inventory_item_id. Then, as a fallback, I left join with the products table (LEFT JOIN bigquery-public-data.thelook_ecommerce.products AS p) using the join key ON p.id = oi.product_id. This fallback ensures we can retrieve the unit_cost when the cost is missing from the inventory table, by looking it up in the products table:
COALESCE(ii.cost, p.cost) AS unit_cost.

Tables & columns used

Facts: order_items -> oi.sale_price, oi.delivered_at, oi.created_at, oi.returned_at
Cost: inventory_items ii.cost (preferred), products -> p.cost (fallback)

Stage 2

In the second stage, I aggregate gross revenue as SUM(sale_price) AS revenue_gross and cost as SUM(unit_cost) AS cogs. It is important to note that these are net line prices and costs that do not include taxes, freight, or other operational expenses that may affect the estimate.

Stage 3

I estimate returns in the month they occur, which is beneficial for real-time operational dashboards. I filter using WHERE returned_at IS NOT NULL to locate returned products. The aggregations are straightforward:
SUM(sale_price) AS returns and SUM(unit_cost) AS cogs_returns.

Stage 4

In the last stage, I leverage BigQuery to perform fast arithmetic operations with fallbacks for NULL values via COALESCE. Here, it is worth noting that I use a FULL OUTER JOIN to include all months in the P&L data, even if they contain only revenue or only returns.

# Import os to check if the file exist otherwise ETL the Data
import os

# Set the working directory
os.chdir(r"R:/PHD/Semester 20/Jobs/Empresas/Solvo/financial_analyst/project")

# define mart file to save the data
PARQUET_PATH = "pnl_monthly_5y_operational.parquet"

if os.path.isfile(PARQUET_PATH):
    # Load cached mart
    df_pnl = pd.read_parquet(PARQUET_PATH)
    print("Loaded cached mart from", PARQUET_PATH)
    print(df_pnl.head())
else:
    # ETL in the GCloud
    sql = """
    -- Base CTE: delivery-based timing, last 5y, and unit cost with fallback
    WITH base AS (
      SELECT
      
      -- month bucket as a DATE (1st of month, UTC)
      
      DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)) AS revenue_month,

      oi.returned_at,
      DATE(TIMESTAMP_TRUNC(oi.returned_at, MONTH)) AS return_month,

      oi.sale_price,
      COALESCE(ii.cost, p.cost) AS unit_cost
      FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
      LEFT JOIN `bigquery-public-data.thelook_ecommerce.inventory_items` AS ii
      ON ii.id = oi.inventory_item_id
      LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
      ON p.id = oi.product_id
      WHERE
      -- Compare DATE to DATE (last 5 years)
      DATE(COALESCE(oi.delivered_at, oi.created_at))>= DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR)),
      
      -- Operational policy rollup 1: revenue & cogs by delivery month
      revenue_cogs AS (
        SELECT
        revenue_month AS month,
        SUM(sale_price) AS revenue_gross,
        SUM(unit_cost)  AS cogs
        FROM base
        GROUP BY month),
        
        -- Operational policy rollup 2: returns by the month they happen
        returns_only AS (
          SELECT
          return_month AS month,
          SUM(sale_price) AS returns,
          SUM(unit_cost) AS cogs_returns
          FROM base
          WHERE returned_at IS NOT NULL
          GROUP BY month)
        
        -- Final monthly P&L (operational)
        SELECT
        m.month,
        COALESCE(m.revenue_gross, 0)                      AS revenue_gross, 
        COALESCE(r.returns, 0)                            AS returns, 
        (COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)) AS revenue_net, 
        COALESCE(m.cogs, 0)                               AS cogs,
        COALESCE(r.cogs_returns, 0)                             AS cogs_returns,
        -- GP = revenue_net - (cogs - cogs_returns)
        (COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)
        - (COALESCE(m.cogs, 0) - COALESCE(r.cogs_returns, 0))) AS gross_profit
        FROM revenue_cogs m
        FULL OUTER JOIN returns_only r USING (month)
        ORDER BY month"""
      
    # GCloud -> pandas DataFrame
    df_pnl = pgbq.read_gbq(
      sql,
      project_id=PROJECT_ID,
      location=LOCATION,
      use_bqstorage_api=USE_STORAGE_API
    )
    
    # Save locally as Parquet (typed, compressed, fast reloads)
    df_pnl.to_parquet(PARQUET_PATH, index=False)
    print(f"Saved monthly P&L mart (operational policy) to {PARQUET_PATH}")
    print(df_pnl.head())
    print(list(df_pnl.columns))

## Loaded cached mart from pnl_monthly_5y_operational.parquet
##        month  revenue_gross  ...  cogs_returns  gross_profit
## 0 2020-10-01   31392.780016  ...   1047.665769  15271.909169
## 1 2020-11-01   41859.880024  ...   2139.164078  19352.347184
## 2 2020-12-01   42759.699989  ...   2774.276256  18981.968990
## 3 2021-01-01   47594.270006  ...   2497.217012  22142.032789
## 4 2021-02-01   48373.559991  ...   2644.166791  22102.089914
## 
## [5 rows x 7 columns]

    
    
# Some sanity checks
df = df_pnl.copy()
df["gm_pct"] = (df["gross_profit"] / df["revenue_net"]).replace([pd.NA, pd.NaT], 0)
df["cogs_pct"] = (df["cogs"] / df["revenue_gross"]).replace([pd.NA, pd.NaT], 0)

print(df.tail(12)[["month","revenue_gross","returns","revenue_net","cogs","cogs_returns","gross_profit","gm_pct","cogs_pct"]])

##         month  revenue_gross       returns  ...   gross_profit    gm_pct  cogs_pct
## 49 2024-11-01  269082.380185  26214.960039  ...  126124.694354  0.519315  0.480289
## 50 2024-12-01  273170.510381  29337.700056  ...  126714.060647  0.519676  0.480170
## 51 2025-01-01  297889.750371  30776.970042  ...  138763.774925  0.519495  0.479869
## 52 2025-02-01  281236.250379  28218.760069  ...  131552.407939  0.519934  0.479882
## 53 2025-03-01  312965.760157  28176.590032  ...  146840.430599  0.515611  0.484954
## 54 2025-04-01  339263.010250  32180.190041  ...  159081.004919  0.518039  0.481683
## 55 2025-05-01  363811.930262  37850.470028  ...  169256.207956  0.519252  0.480367
## 56 2025-06-01  377284.620347  35158.139980  ...  177297.917865  0.518223  0.481696
## 57 2025-07-01  448945.130368  40269.100062  ...  212147.626249  0.519110  0.481006
## 58 2025-08-01  497113.050341  44222.930023  ...  234784.138660  0.518413  0.481715
## 59 2025-09-01  607773.990589  58880.459994  ...  285932.294740  0.520925  0.478581
## 60 2025-10-01  529830.500402  61318.850029  ...  243232.963535  0.519161  0.481192
## 
## [12 rows x 9 columns]

print("Overall GM% (net):", (df["gross_profit"].sum() / df["revenue_net"].sum()))

## Overall GM% (net): 0.5190164068500782

print("Median monthly GM%:", df["gm_pct"].median())

## Median monthly GM%: 0.5190465748341138

print("Median monthly COGS%:", df["cogs_pct"].median())

## Median monthly COGS%: 0.481191899780154

test = (df_pnl["revenue_net"] - df_pnl["gross_profit"]) - (df_pnl["cogs"] - df_pnl["cogs_returns"])
test.abs().max()

## np.float64(1.4551915228366852e-11)

7) Visualize the Profit & Loss (P&L) Statement

For a nice visualisation of the P&L statement, I am using the plotly library that transforms our P&L time series into dynamic plots. They are great for dashboarding because they are interactive, you can zoom in, zoom out, save PNG directly, and they render information if you hover the mouse over the plot.

import plotly.express as px
df_pnl["month"] = pd.to_datetime(df_pnl["month"])
long = df_pnl.melt(
    id_vars="month",
    value_vars=["revenue_gross","returns","revenue_net","cogs","cogs_returns","gross_profit"],
    var_name="metric",
    value_name="value"
)

fig = px.line(long, x="month", y="value", color="metric",
              title="Monthly P&L",
              labels={"month": "Month", "value": "USD", "metric": "Series"},
              markers=True)
fig = fig.update_layout(hovermode="x unified")
fig = fig.update_yaxes(tickprefix="$", separatethousands=True)
fig.show()

Fractions, Decimals, Percentages.

2024-09-23T00:00:00+00:00

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) library(htmltools) ``` \usepackage{amsmath} \usepackage{longdiv} # Introduction: Fractions, Decimals, Percentages. ## Objectives - To learn. - To have fun. - To find real-life applications. Deal we the why and the how. ```{r, echo=FALSE, out.width='100%', eval=FALSE} # \'cake.png\', # \'burger.png\', # \'icecream.png\', # \'donut.png\', # \'fries.png\', # \'soda.png\' # # HTML content for the animation html <- HTML(' Split Pizza Animation

50%

25%

33%

25%

20%

') # Save the HTML content to a file htmltools::save_html(html, "animation.html") # Include the saved HTML file in the RMarkdown slide knitr::include_url("animation.html") ``` # 1. Fractions. ## Natural Numbers Lets recall: - **Natural Numbers**: Whole numbers starting from $1$ (e.g., $1$, $2$, $3$, $4$). We use them in every day life to **count** all sort of things... - Can you name examples? ## When Do We Use Fractions? - Sharing - **Sharing**: When we need to divide something equally among people. - Example: Splitting a pizza among friends:

## When Do We Use Fractions? - Spliting - **Splitting**: When we need to break something into smaller parts. - Example: Cutting an apple into *quarters*:

## When Do We Use Fractions? - Buy/Sell - **Buying and Selling**: When we need to measure quantities that are not whole numbers. - Example: Buying **half** kilogram of gummy bear candy.

## Fractions: 1.1 Formal Definition - **Fractions**: A way to represent parts of a whole. - Separated by a diagonal line (more common): - Thirds $1/3$. - Halves $1/2$. - Quarters $1/4$. - Separated by an horizontal line (more formal). $$\frac{1}{2}, \frac{2}{3}, \frac{1}{8}, \frac{3}{5}$$ ## Fractions: 1.2 Formal Definition - *Numerator*: The top number represents the number of parts you have or take. - *Denominator*: The bottom number represents the total number of equal parts the whole is divided into. - *Example 1*: You buy a medium pizza ($8$ slices), and you take $3$ slices. How will you represent your share using fraction notation? ## Fractions Examples $$\frac{3}{8}$$ - *Example 2*: You have a chocolate bar with 12 pieces and you would like to *share it evenly* among 3 friends. ## Fractions Examples - *Denominator*: The bottom number represents the total number of equal parts the whole is divided into. $12$ - *Numerator*: The top number represents the number of parts you have or take. We have $3$ friends, and we want the $12$ split evenly, so... $$12 \div 3= 3\sqrt{12}=4$$ So in fraction notation: $$\frac{4}{12}$$ Is this correct? Is it well expressed (fraction notation)? ## Fractions Examples Picture your chocolate bar... ```{r, results='asis', warning=FALSE} # Set up plot dimensions plot(1, type="n", xlim=c(0, 4), ylim=c(0, 3), xlab="", ylab="", xaxt='n', yaxt='n', bty='n') # Draw horizontal lines for(i in 0:3) { lines(c(0, 4), c(i, i)) } # Draw vertical lines for(i in 0:4) { lines(c(i, i), c(0, 3)) } ``` ## Fractions Examples So... the correct answer is the simplified fraction: $$\frac{4}{12}=\frac{1}{3}$$ ## Fractions Activity (Game) Open the QR:

## Fractions Activity (Game) [Fractions Activity (Game)](https://phet.colorado.edu/sims/html/build-a-fraction/latest/build-a-fraction_en.html) # 2. Decimals. ## Decimal Numbers Lets recall: - **Decimal Numbers**: Numbers that use a dot (called a decimal point) to show parts of a whole. (e.g., $1.2$, $2.3$, $3.6$, $4.9$). The number to the right of the a decimal point is less that a unit: - $.1$ is less that $1$ - $.9$ is less than $2$ - $2.00...$ is the same as $2$ (trailing zeros.) We use them in every day life to **measure** all sort of things... ## Counting vs. Measuring Typically: - *Counting* to find out whole items there are. - *Measuring* involves determining the size, amount, or degree of something using a standard unit (like meters, centimeters, inches, etc.) - We use *decimal* numbers to represent things that we measure. ## When Do We Use Decimals? length - How tall are you? $$1.73 \, \text{m}$$ - How far is Mérida from Cancún? $$309.2 \, \text{km}$$

## When Do We Use Decimals? Temperature - The freezing point of human blood is actually around: $$-1.66 \, \text{°C}$$ - The temperature in the summer of Mérida is around: $$37.5 \, \text{°C}$$

## When Do We Use Decimals? Money - What is the price of Minecraft: Java & Bedrock Edition Microsoft PC? $$\$569.99\, \text{MXN}$$ - How much does a Kinder Sorpresa? $$\$18.50\, \text{MXN}$$

## Decimals: 2.1 Formal Definition - Decimals is another way to write **rational numbers** used for things that we rather measure. - **Division**: To convert a fraction to a decimal, divide the numerator by the denominator. - Example 1: Convert $\frac{1}{2}$ to a decimal. $$\frac{1}{2} = 1 \div 2 = 0.5$$ ## Decimals: 2.1 Formal Definition

## Decimals Examples How do we do it? 1. Add a zero to the right of the dividend (numerator). $$2\sqrt{1} \rightarrow 2\sqrt{10}$$ ## Decimals Examples 2. Divide: Find the largest integer (quotient) that, when multiplied by the divisor (denominator), is less than or equal to the current dividend. $$2\sqrt{10}$$ - What about $2 \times 3 = 6$? - What about $2 \times 4 = 8$? - What about $2 \times 5 = 10$? - What about $2 \times 6 = 12$? 2.1 Add a decimal point to the right of quotient: $$.05$$ ## Decimals Examples 3. Multiply the divisor by this integer and write the result below the current dividend. $$2 \times 5 = 10$$ 4. Subtract: Subtract this result from the current dividend to find the remainder. $$10-10=0$$ **If the remainder is $0$ STOP** ## Decimals Examples **If the remainder is not $0$ Carry on** 5. Bring Down: Bring down the next digit (or add a zero if there are no more digits) to the right of the remainder. 6. Iterate: Repeat steps 2-5 until the remainder is zero or you have enough decimal places. ## Decimals Examples - Context: You want to bake a cupcake, according to the recipe, you need $3/8 \, \text{kg}$ of flour for 12 pzs. - You buy $1 \, \text{kg}$, and you need a scale to measure the flour. How much flour is $3/8 \, \text{kg}$ (three-eighths) in grams (three decimal units of a kg)?

## Decimals Examples

## Decimals Examples - Context: You are going on a trip, and you want to leave enough water for your dogs. - Looking at the container, you see that approximately $2/3 \, \text{lts}$ are gone... - You are going for 7 days and you need at least 1 lt of water per day. - If the container is of 20 lts, do you have enough water?

## Decimals Examples

## Decimals: 2.2 Mixed Number - A mixed number is a combination of a whole number and a fraction. - It’s like saying you have $x$'s number of whole units and certain remainder... - The quotient $6$ - The divisor $3$ - The remainder $2$ $$6 \frac{2}{3}$$ # 3. Percentages. ## Percentages. Lets recall: - **Definition:** A percentage is a way of expressing a number as a fraction of 100 (base). - **Symbol:** % - **Example:** 10% means $10$ out of $100$. - $100 \%$ represents the whole. ## When Do We Use Percetages? Discounts - Picture your favorite video game has 20% off. - If the price is 600 MXN. How much is the discount, and how much will you pay? ## When Do We Use Percetages? Discounts - To calculate the discount: 1) Estimate the decimal: $$20/100=.2$$ 2) Estimate the discount $$.2*600=120.0$$ 3) How much will you pay? $$600 - 120 = 520$$ ## Percentages Examples - Picture, you want to save money for trip at the end of the year. - You are serious and you are committed to save 15 % of your weekly allowance. - If your weekly allowance (pocket money) is 1000 MXN, and you started saving since the begging of the year.. How much will you have at the end of the year? ## Percentages Examples - To calculate the weekly savings: 1) Estimate the decimal: $$15/100=.15$$ 2) Estimate the weekly savings $$.15 \times 1000=150.00$$ 3) How many weeks in a year? $$365/7=56$$ (approx) ## Percentages Examples 4) What are your savings at the end of the year? $$150*56=8400$$ ## Percentages Activity (Game) Open the QR:

## Percentages Activity (Game) [Percentages Activity (Game)](https://www.mathplayground.com/bingo-find-a-percent-of-a-number.html) ## Wrap Up! - Fractions to represent parts of a whole (proportions). - Decimals to measure units (time, temperature, length) - Percentages to represent rational numbers in an friendly way (base of 100%). ## Questions? - Feel free to ask any questions you have about fractions. - Let's make sure we all understand before moving on. ## Gaby knows fractions very well. [Dj Gaby](https://www.youtube.com/shorts/EnaSwtjN6uk) ## How does Gaby manages to mix that? She knows that she always needs to fit an equal number of beats in a bar. For instance, the most common signature is $1/4$, meaning 4 beats in one bar. ## Signature & Beats [Signature, Beats](https://www.youtube.com/shorts/XMfy63r4igI) ## The END Thank you!

An Introduction to R for Network Analysis.

2023-09-14T00:00:00+00:00

Introduction

This guide is divided into two parts. The first part provides a basic introduction to the R programming language, while the second part focuses on practical code snippets for creating network visualizations, generating statistics and performing analysis using the igraph package.

Section 1: An introduction to R.

Getting Started with R

To become proficient in R, it’s helpful to think of coding in a way similar to learning a language. Start with the fundamentals, which are the basic operators. Operators are symbols that instruct the computer to perform specific actions. For example, in 1 + 2, the + operator performs addition. In a <- 1:5, the <- operator assigns values. Begin by familiarizing yourself with these operators; you don’t need to memorize them all at once. Focus on the ones you encounter frequently, and gradually expand your knowledge.

Key Concepts

As you gain confidence with operators, move on to writing your own code snippets. To do this effectively, understand the fundamental rules of R:

R is Case Sensitive: Pay attention to letter case; uppercase and lowercase letters are treated differently.
R Executes Code Sequentially: R processes code from top to bottom, so the order of your commands matters.
R Reads Left to Right: Code is evaluated from left to right, so the sequence of operations is crucial.

Learn about essential data structures like vectors, matrices, data frames, lists, and arrays, and how to manipulate them. For example, 2 + 1L may not be a valid operation, but you can learn how to make it valid. Understanding object classes and subsetting data within objects is crucial.

Learning by Doing

The best way to learn is by doing. When you have a clear, step-by-step plan in mind, there’s likely a way to code it in R.

Mastering R-base

Familiarize yourself with the R-base, which comprises core functions that don’t require additional packages. This forms the foundation of your knowledge. New users often make the mistake of installing numerous unnecessary packages. Keep it simple and use additional packages like igraph for specialized tasks, such as network analysis, only when necessary.

Operators Reference

Operators are symbols that provide instructions to the computer for specific tasks, such as variable manipulation, statement evaluation, function creation, and general operations. You can find a comprehensive list of operators in the R documentation.

	Logical Operators
-	Minus, can be unary or binary
+	Plus, can be unary or binary
!	Logical not (Negation)
~	Tilde (used in model formulae)
?	Help
:	Sequence, binary (in model formulae: interaction)
*	Multiplication, binary
/	Division, binary
^	Exponentiation, binary
%x%	Special binary operators, x can be replaced by any valid name
%%	Modulus, binary
%/%	Integer divide, binary
%*%	Matrix product, binary
%o%	Outer product, binary
%x%	Kronecker product, binary
%in%	Matching operator, binary (in model formulae: nesting)
<	Less than, binary
>	Greater than, binary
==	Equal to, binary
>=	Greater than or equal to, binary
<=	Less than or equal to, binary
&	And, binary, vectorized
&&	And, binary, not vectorized
\|	Or, binary, vectorized
\|\|	Or, binary, not vectorized
<-	Left assignment, binary
->	Right assignment, binary
$	List subset, binary

R Operators

Understanding the basic syntax and notation in R is crucial to effectively navigate and utilize the language. In this example, we’ll explore the importance of this understanding while demonstrating the use of operators for algebraic and logical operations.

We can use various operators in R to perform fundamental algebraic and logical operations. It’s essential to be familiar with the basic syntax of the language, including elements like semicolons and parentheses.

1 + 5       # Addition

## [1] 6

5 * 6       # Multiplication

## [1] 30

4 ^ -1      # Exponentiation

## [1] 0.25

3 / 2       # Division

## [1] 1.5

4 / (6 * 6) * (2 - 4)  # Complex arithmetic expression

## [1] -0.2222222

# Integer division
6 %/% 4

## [1] 1

# Returns the remainder
6 %% 4

## [1] 2

4:7         # Create a sequence of numbers

## [1] 4 5 6 7

# Logical Statements

(TRUE == FALSE) == FALSE

## [1] TRUE

(F == F) == T

## [1] TRUE

4 > 5

## [1] FALSE

7 < 2

## [1] FALSE

(6 * 7) == (7 * 6)

## [1] TRUE

c(2, 3) == c(3, 2)

## [1] FALSE FALSE

c(3, 2) == c(3, 2)

## [1] TRUE TRUE

(3 + 2) & (2 + 3) == 5

## [1] TRUE

# Using |
vector1 <- c(TRUE, FALSE, TRUE)
vector2 <- c(FALSE, TRUE, TRUE)

# Element-wise logical OR using |
result1 <- vector1 | vector2

# Using in
c(2, 3) %in% c(2, 4, 3)

## [1] TRUE TRUE

Understanding Objects in R Programming

R is a powerful programming language known for its object-based approach. In practical terms, this means that every piece of data in R, apart from operators and syntax, is treated as an object with specific attributes. These attributes include class, structure, typeof, length, dimension, and structure. To effectively work with R, it’s crucial to grasp the fundamental concept of objects and how they function within the language. Let’s dive into some of the foundational aspects of objects in R.

Vectors: The Building Blocks

In R, vectors are the fundamental building blocks of data. They are often referred to as atomic vectors because they can hold elements of a single data type. Here are some key points about vectors:

Empty Vectors: You can create empty vectors using the NULL keyword.

z <- NULL

Numeric Vectors: Numeric vectors store numerical values, and you can assign names to elements within a vector. R

a <- c('a' = 2.3, 'b' = 4)

Integers: R also supports integer vectors, which can be created using the L suffix.

b <- c(2L, 9L)

Logical Vectors: Logical vectors store TRUE and FALSE values.

d <- c(TRUE, FALSE)

Character Vectors: Character vectors hold text or character data.

e <- c("A", 'B')

Factor: Factors are used to represent categorical variables. They have levels and can be ordered or unordered.

f <- factor(1:2, levels = c('male', "female"))

Operations on Vectors: You can perform various operations on vectors, such as addition, multiplication, and more. Vectors are the foundation for more complex data structures in R, and understanding their properties and manipulation is essential.

4 * a

##    a    b 
##  9.2 16.0

(a) ^ -1

##         a         b 
## 0.4347826 0.2500000

a1 <- c(4, 7)
names(a1) <- c('a', 'b')

a + a1

##    a    b 
##  6.3 11.0

Stay tuned as we explore more about matrices, data frames, functions, and lists in the world of R programming. These concepts will further enhance your ability to work with data effectively in R.

Matrices

Matrices are 2-dimensional arrays of data consisting of a single atomic object. They are essential for conducting statistical analyses and algorithms that involve mathematical manipulations. One crucial aspect of matrices is that their type is determined by a single atomic object.

Let’s create a matrix with numeric vector elements and examine its type using the typeof function:

# Matrices
# Basic
A <- matrix(1:9, ncol = 3, byrow = TRUE)
class(A)

## [1] "matrix" "array"

typeof(A)

## [1] "integer"

# Add a column with character elements
Z <- matrix(c(1:9, LETTERS[1:3]), ncol = 4, byrow = TRUE)
class(Z)

## [1] "matrix" "array"

typeof(Z)

## [1] "character"

# Math operators don't work.
Z + Z

## Error in Z + Z: argumento no-numérico para operador binario

# Change the elements of the matrix
A[upper.tri(A)] <- 1
A[lower.tri(A)] <- 2
diag(A) <- 3
A

##      [,1] [,2] [,3]
## [1,]    3    1    1
## [2,]    2    3    1
## [3,]    2    2    3

# Combining vectors by column
B <- cbind(2:0, 1:3, 0:2)
B

##      [,1] [,2] [,3]
## [1,]    2    1    0
## [2,]    1    2    1
## [3,]    0    3    2

# Combining vectors by row
C <- rbind(1:3, 4:6, 7:9)
C

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Basic Linear Algebra

We can perform basic linear algebra operations on matrices:

# Basic Linear Algebra
# Vector Operations
4 * a

##    a    b 
##  9.2 16.0

(a) ^ -1

##         a         b 
## 0.4347826 0.2500000

a + a1

##    a    b 
##  6.3 11.0

# Matrix Transpose
t(A)

##      [,1] [,2] [,3]
## [1,]    3    2    2
## [2,]    1    3    2
## [3,]    1    1    3

# Matrix Addition
A + B - C

##      [,1] [,2] [,3]
## [1,]    4    0   -2
## [2,]   -1    0   -4
## [3,]   -5   -3   -4

# Dot Product
A %*% B

##      [,1] [,2] [,3]
## [1,]    7    8    3
## [2,]    7   11    5
## [3,]    6   15    8

# Cross Product
t(A) %*% B == crossprod(A, B)

##      [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE

# Inverse
C <- matrix(c(39L, 8L, 71L, 72L, 54L, 42L, 76L, 77L, 15L), ncol = 3)
D <- solve(C)
C %*% D

##      [,1]          [,2] [,3]
## [1,]    1  0.000000e+00    0
## [2,]    0  1.000000e+00    0
## [3,]    0 -4.440892e-16    1

round(C %*% D)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

# Eigenvalues and Eigenvectors
eigen(C)

## eigen() decomposition
## $values
## [1] 147.741703 -34.981904  -4.759798
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.6935491 -0.1529705  0.3017373
## [2,] -0.4916846 -0.6387581 -0.7727752
## [3,] -0.5265319  0.7540478  0.5583664

e <- eigen(C)$vector
v <- eigen(C)$value
C %*% e[, 1] == v[1] * e[, 1]

##       [,1]
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE

all.equal(as.vector(C %*% e[, 1]), v[1] * e[, 1])

## [1] TRUE

Data Frames

Data frames have a more heterogeneous structure compared to matrices. While vectors and matrices belong to a specific typeof object, data frames can have multiple data types in each column.

## Basic Data Frame
df <- data.frame(
  A = LETTERS[1:5],
  B = factor(letters[1:5]),
  C = 1L:5L,
  D = c(2.4, 2, 3, 9, 7)
)

# Structure
str(df)

## 'data.frame':    5 obs. of  4 variables:
##  $ A: chr  "A" "B" "C" "D" ...
##  $ B: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
##  $ C: int  1 2 3 4 5
##  $ D: num  2.4 2 3 9 7

# Basic statistics
summary(df)

##       A             B           C           D       
##  Length:5           a:1   Min.   :1   Min.   :2.00  
##  Class :character   b:1   1st Qu.:2   1st Qu.:2.40  
##  Mode  :character   c:1   Median :3   Median :3.00  
##                     d:1   Mean   :3   Mean   :4.68  
##                     e:1   3rd Qu.:4   3rd Qu.:7.00  
##                           Max.   :5   Max.   :9.00

# Print head
head(df, 3)

##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0

# Print tail
tail(df)

##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0
## 4 D d 4 9.0
## 5 E e 5 7.0

## Bipartite Projection
bp <- data.frame(papers = c(rep('A', 3), rep('B', 2), 'C'), authors = c(1:3, 2:3, 4))
bp

##   papers authors
## 1      A       1
## 2      A       2
## 3      A       3
## 4      B       2
## 5      B       3
## 6      C       4

# Incidence Matrix
py <- table(bp)
py

##       authors
## papers 1 2 3 4
##      A 1 1 1 0
##      B 0 1 1 0
##      C 0 0 0 1

# Adjacency Matrix
py <- crossprod(py)
py

##        authors
## authors 1 2 3 4
##       1 1 1 1 0
##       2 1 2 2 0
##       3 1 2 2 0
##       4 0 0 0 1

Functions

Functions are invaluable when we need to perform the same operation(s) multiple times. Let’s create a simple function to calculate the degree from an adjacency matrix:

n <- 5
A <- matrix(sample(0:1, n * n, replace = TRUE), ncol = n)
rownames(A) <- LETTERS[1:n]
colnames(A) <- LETTERS[1:n]

# Remove loops
diag(A) <- 0

s.degree <- function(x) {
  n <- ncol(x)
  d <- x %*% rep(1, n)
  colnames(d) <- 'Degree'
  d
}

s.degree(A)

##   Degree
## A      3
## B      2
## C      0
## D      1
## E      1

Lists

Lists are the most flexible data structure in R, allowing us to store multiple objects of different classes. A data frame is a list with a specific structure. We can use the dput function to print and store the structure of any object, which helps in creating reproducible examples.

# Print the structure of the data frame
dput(df)

## structure(list(A = c("A", "B", "C", "D", "E"), B = structure(1:5, levels = c("a", 
## "b", "c", "d", "e"), class = "factor"), C = 1:5, D = c(2.4, 2, 
## 3, 9, 7)), class = "data.frame", row.names = c(NA, -5L))

# Store a vector, matrix, data frame, function, and a list together
s <- list(c(1:3))
l <- list(
  factor = f,
  matrix = A,
  data.frame = df,
  list = s
)
str(l)

## List of 4
##  $ factor    : Factor w/ 2 levels "male","female": NA NA
##  $ matrix    : num [1:5, 1:5] 0 1 0 0 1 1 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "A" "B" "C" "D" ...
##   .. ..$ : chr [1:5] "A" "B" "C" "D" ...
##  $ data.frame:'data.frame':  5 obs. of  4 variables:
##   ..$ A: chr [1:5] "A" "B" "C" "D" ...
##   ..$ B: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
##   ..$ C: int [1:5] 1 2 3 4 5
##   ..$ D: num [1:5] 2.4 2 3 9 7
##  $ list      :List of 1
##   ..$ : int [1:3] 1 2 3

Indexing Objects

Subsetting in R can be done using nominal, numeric, or logical indexing. Data frames and lists use the special operator $ for subsetting.

### Nominal ####
## Vectors ##
names(a)

## [1] "a" "b"

a['a']

##   a 
## 2.3

a['b']

## b 
## 4

## Matrices ##
A[c('A', 'C'), c('D', 'E')]

##   D E
## A 1 1
## C 0 0

## Data Frames ##
df[c('A', 'D')]

##   A   D
## 1 A 2.4
## 2 B 2.0
## 3 C 3.0
## 4 D 9.0
## 5 E 7.0

## Lists ##
l[c('factor', 'matrix')]

## $factor
## [1]  
## Levels: male female
## 
## $matrix
##   A B C D E
## A 0 1 0 1 1
## B 1 0 0 1 0
## C 0 0 0 0 0
## D 0 0 1 0 0
## E 1 0 0 0 0

### Numeric ####

## Vectors ##
a[1]

##   a 
## 2.3

## Matrices ##
A[2:3, 4]

## B C 
## 1 0

## Data Frames ##
df[1:5, 2:3]

##   B C
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
## 5 e 5

## Lists ##
# Extract the data frame
l[unlist(lapply(l, class)) == 'data.frame']

## $list
## $list[[1]]
## [1] 1 2 3

### Logical ####

## Vectors ##
a[c(TRUE, FALSE)]

##   a 
## 2.3

## Matrices ##
A[upper.tri(A)]

##  [1] 1 0 0 1 1 0 1 0 0 0

## Data Frames ##
df[, c(rep(FALSE, 3), TRUE)]

## [1] 2.4 2.0 3.0 9.0 7.0

## Lists ##
# Extract the data frame
l$data.frame$C

## [1] 1 2 3 4 5

l$matrix[, 4]

## A B C D E 
## 1 1 0 0 0

### Combinations ###
A[2:3, c('C', 'D')]

##   C D
## B 0 1
## C 0 0

### Special Operator ####
# Subset a Column
df$A

## [1] "A" "B" "C" "D" "E"

# Subset the data frame in a list and print column D
l$data.frame$C

## [1] 1 2 3 4 5

l$matrix[, 4]

## A B C D E 
## 1 1 0 0 0

Control Flow

Control flow structures like if, else, and ifelse are essential for making decisions and executing code conditionally in R.

### Basic Structure if|else ####
condition <- 7
if (condition == 7) {
  print('Yes, it is...')
}

## [1] "Yes, it is..."

# Check if a graph is connected
is.connected <- function(am) {
  d <- s.degree(am)
  if (all(d > 0)) {
    print('Graph is connected')
  } else {
    print('Graph is disconnected')
  }
}

py <- table(bp)
py

##       authors
## papers 1 2 3 4
##      A 1 1 1 0
##      B 0 1 1 0
##      C 0 0 0 1

is.connected(py)

## [1] "Graph is connected"

# Evaluate multiple conditions (and, or)
is.sim_multi <- function(am) {
  mult.ed <- any(am > 1)
  loops <- sum(diag(am)) != 0
  type <- c('The graph has:', 'multi edges', 'and loops.')
  if (mult.ed | loops) {
    am[am > 1] <- 1
    diag(am) <- 0
    print(paste(type[c(TRUE, mult.ed, loops)], collapse = " "))
  } else {
    print("The graph is simple")
  }
}

is.sim_multi(B)

## [1] "The graph has: multi edges and loops."

is.sim_multi(A)

## [1] "The graph is simple"

# Count the number of edges or vertices
no.ver.edges <- function(am) {
  v <- ncol(am)
  e <- sum(am > 0)
  if (e > v) {
    print(paste('Edges:', e))
  } else if (e < v) {
    print(paste('Vertices:', v))
  } else {
    paste('Vertices and Edges:', v)
  }
}

no.ver.edges(A)

## [1] "Edges: 7"

no.ver.edges(B)

## [1] "Edges: 7"

#### ifelse function ####
# ifelse function is efficient and partially vectorized
# It produces an output of the same length as the input.

ifelse(4 > 7, "YES", "NO")

## [1] "NO"

ifelse(7 > 4, "YES", "NO")

## [1] "YES"

# Nested ifelse
is.sym <- function(am) {
  ifelse(ncol(am) != nrow(am),
    'Not symmetric',
    ifelse(all(am[upper.tri(am)] == am[lower.tri(am)]), 'Symmetric', 'Squared'))
}

A

##   A B C D E
## A 0 1 0 1 1
## B 1 0 0 1 0
## C 0 0 0 0 0
## D 0 0 1 0 0
## E 1 0 0 0 0

is.sym(A)

## [1] "Squared"

B[3, 2] <- 1
B

##      [,1] [,2] [,3]
## [1,]    2    1    0
## [2,]    1    2    1
## [3,]    0    1    2

is.sym(B)

## [1] "Symmetric"

Loops

Loops are used for repetitive tasks, but it’s essential to use them judiciously as they can be inefficient. Here, we cover while and for loops:

### while loop ####
fibo <- c(1, 2)
digi <- length(fibo)

# Create a Fibonacci Sequence and stop when it reaches 10 digits
while (digi < 4) {
  digi <- length(fibo)
  fibo[digi + 1] <- sum(fibo[digi - 1], fibo[digi])
  print(paste("Fibonacci Seq:", fibo[digi]))
}

## [1] "Fibonacci Seq: 2"
## [1] "Fibonacci Seq: 3"
## [1] "Fibonacci Seq: 5"

### for loop ####
# Get adjacent vertices (neighborhood)
i <- 1
nams <- row.names(A)

for (i in 1:nrow(A)) {
  print(nams[A[i, ] > 0])
}

## [1] "B" "D" "E"
## [1] "A" "D"
## character(0)
## [1] "C"
## [1] "A"

Apply Family of Functions

The apply function in R takes an array as its first argument and applies a function to all the elements of the array. Let’s explore some examples:

# Example of summing all the columns of a matrix
ma <- matrix(sample(1:100, 25), ncol = 5, nrow = 5)

# Using a loop
col.cum <- vector('numeric', length = 0)
for (c in 1:ncol(ma)) {
  col.cum <- c(col.cum, sum(ma[, c]))
}

# Using the apply function
apply(ma, 2, sum) == col.cum

## [1] TRUE TRUE TRUE TRUE TRUE

In this example, we create a matrix ma and calculate the sum of each column using both a loop and the apply function. The apply function provides a more concise and efficient way to perform this operation.

# Example of summing each row of a matrix

# Using a loop
row.cum <- vector('numeric', length = 0)
for (r in 1:nrow(ma)) {
  row.cum <- c(row.cum, sum(ma[r, ]))
}

# Using the apply function
apply(ma, 1, sum) == row.cum

## [1] TRUE TRUE TRUE TRUE TRUE

# Using linear algebra (For simpler functions, it is better to use linear algebra)
apply(ma, 1, sum) == row.cum & rowSums(ma) == row.cum

## [1] TRUE TRUE TRUE TRUE TRUE

In this section, we demonstrate how to sum each row of a matrix, first using a loop and then using the apply function. Additionally, we show how you can achieve the same result using linear algebra operations for efficiency.

# Example: Count how many times a string [A-] appears in each column

ma <- matrix(replicate(5, sample(LETTERS[1:10], 5)), ncol = 5, nrow = 5, byrow = TRUE)
lvls <- unique(c(ma))
apply(ma, 2, function(x) {
  table(factor(x, levels = lvls))
})

##   [,1] [,2] [,3] [,4] [,5]
## J    1    1    0    0    0
## F    2    1    0    0    0
## G    1    1    0    0    0
## I    1    0    1    0    2
## D    0    1    0    1    0
## A    0    1    1    0    0
## E    0    0    1    1    0
## C    0    0    2    0    2
## B    0    0    0    1    1
## H    0    0    0    2    0

In this example, we create a matrix of random characters and count how many times each character appears in each column using the apply function.

lapply Function

The lapply function in R takes a list as its first argument and applies a function to all the elements of the list. It offers advantages such as improved code readability and flexibility compared to the apply function.

### lapply Examples ###

# Heterogeneous list example
lapply(list(data.frame(1:10), 20:30), sum)

## [[1]]
## [1] 55
## 
## [[2]]
## [1] 275

# Homogeneous list example
lapply(list(A, B, C, D), s.degree)

## [[1]]
##   Degree
## A      3
## B      2
## C      0
## D      1
## E      1
## 
## [[2]]
##      Degree
## [1,]      3
## [2,]      4
## [3,]      3
## 
## [[3]]
##      Degree
## [1,]    187
## [2,]    139
## [3,]    128
## 
## [[4]]
##           Degree
## [1,]  0.04585366
## [2,] -0.07556911
## [3,]  0.06121951

# Since a data.frame is a list, we can apply functions directly
# Check the list of sample data.frames ?data
# Load data
data(attitude)

lapply(attitude, function(x) {
  c(
    mean = mean(x),
    var = var(x),
    min = min(x),
    max = max(x),
    median = median(x)
  )
})

## $rating
##      mean       var       min       max    median 
##  64.63333 148.17126  40.00000  85.00000  65.50000 
## 
## $complaints
##     mean      var      min      max   median 
##  66.6000 177.2828  37.0000  90.0000  65.0000 
## 
## $privileges
##      mean       var       min       max    median 
##  53.13333 149.70575  30.00000  83.00000  51.50000 
## 
## $learning
##      mean       var       min       max    median 
##  56.36667 137.75747  34.00000  75.00000  56.50000 
## 
## $raises
##      mean       var       min       max    median 
##  64.63333 108.10230  43.00000  88.00000  63.50000 
## 
## $critical
##     mean      var      min      max   median 
## 74.76667 97.90920 49.00000 92.00000 77.50000 
## 
## $advance
##      mean       var       min       max    median 
##  42.93333 105.85747  25.00000  72.00000  41.00000

Here, we showcase various uses of lapply. It can be applied to both heterogeneous and homogeneous lists. When working with data frames, you can directly apply functions to columns, which can lead to more readable code.

# Similar to summary(attitude)

# Try to arrange the structure for better readability (not always successful)
t <- sapply(attitude, function(x) {
  c(
    mean = mean(x),
    var = var(x),
    min = min(x),
    max = max(x),
    median = median(x)
  )
})
class(t)

## [1] "matrix" "array"

In this section, we demonstrate a similar approach to the summary(attitude) function using sapply to provide a structured summary of the data.

Graphics in R

R offers a robust environment for creating graphics, making it a powerful tool for both statistical analysis and data visualization. To explore its capabilities, you can start by running demo(graphics) in the R console. Additionally, you can refer to this cheatsheet for an overview of the main plotting functions.

# demo(graphics)

Let’s delve into various aspects of graphics in R.

Color Management

Managing color spaces in R is essential for creating visually appealing graphics. Colors can be defined in three different ways: by name, by hexadecimal values, or by RGB values. You can explore a wide range of colors and conversions between these systems on this website.

For this tutorial a palette of high contrasting colors that I am defining in the following vector:

colors37 = c("#466791","#60bf37","#953ada","#4fbe6c","#ce49d3","#a7b43d","#5a51dc","#d49f36","#552095","#507f2d","#db37aa","#84b67c","#a06fda","#df462a","#5b83db","#c76c2d","#4f49a3","#82702d","#dd6bbb","#334c22","#d83979","#55baad","#dc4555","#62aad3","#8c3025","#417d61","#862977","#bba672","#403367","#da8a6d","#a79cd4","#71482c","#c689d0","#6b2940","#d593a7","#895c8b","#bd5975")

And in this snipped of code where you can see clearly the contrast in the palette:

# Example of hexadecimal format
# print(head(colors37))
# Example of RGB 
# rgb(red=1, green=0.05, blue=0.02, alpha=.2)
# Examples colors by name
# head(colors())

# Plot using a color space by name
plot(
  # Values on the x-axis
  x = 2:10,
  # Values on the y-axis
  y = 9:1,
  # Size of the point
  pch = 19,
  # Shape of the point
  cex = 2,
  # Color by name
  col = "dark red",
  # Axis labels
  xlab = "",
  ylab = "",
  axes = FALSE,
  # Limits for x and y
  xlim = c(2, 10.05),
  ylim = c(0, 11)
)

# Try running these loops again but change the cex value to get different shapes
for (i in 2:9) {
  # Plot using the RGB color space (arguments are values between [0,1]) 
  points(x = i:10, y = 10:i, pch = 19, cex = 2, col = rgb(runif(1), runif(1), runif(1)))
  # Plot using a vector of hexadecimal values
  points(x = 2:(11 - i), y = (10 - i):1, pch = 19, cex = 2, col = sample(colors37, 1))
}

# Draw a box
box()

In this section, we explore different ways to define and use colors in your plots, including by name, RGB values, and hexadecimal values.

Histograms

Histograms are a fundamental tool for visualizing the distribution (spread) of data around central values. To plot a single variable we can use the hist(...), and we need to include the specific vector that contains the data, for instance, hist(attitude[, 1L]).

### Plotting a simple histogram ####
hist(attitude[, 1L])

However, typically we are interested on visualize a whole set of variables on a data frame. Hence, I believe it is more useful a snipped of code that can plot a group of variables in grid (a group of plots). The subsequent R code sets up a 3x3 plotting layout using the par() function, allowing for a 3x3 grid of plots. Then the following code then creates histograms for each variable in the attitude dataframe and arranges them within the previously defined layout.

Here’s a step-by-step breakdown:

par(mfrow = c(3, 3)): This sets up a plotting layout with 3 rows and 3 columns, creating a 3x3 grid for plotting.
var.names <- colnames(attitude): Retrieves the column names (variable names) of the attitude dataframe.
invisible(lapply(seq_along(var.names), function(x) {...}): Iterates over each variable in attitude using lapply() and generates histograms.
hist(...): Generates a histogram for each variable. The hist() function takes parameters such as the data to be plotted (attitude[, x] for each variable), the main title (var.names[x], which is the variable name), and the color of the bars (col = sample(colors37, 1)).
The invisible() function is used to suppress the output of lapply() which would otherwise display the individual histograms.

### Using lapply and histograms ###

# Set up a 3 x 3 layout

par(mfrow = c(3, 3))

# render the histograms 
var.names <- colnames(attitude)
invisible(lapply(seq_along(var.names), function(x) {
  hist(
    attitude[, x],
    main = var.names[x],
    col = sample(colors37, 1),
    xlab = ""
  )
}))

### Box-Plots:

Box plots, also known as box-and-whisker plots, are a valuable tool for visualizing the distribution and spread of data. They provide a concise summary of a dataset’s central tendency, variability, and potential outliers. Unlike some other types of plots, box plots focus on displaying the overall distribution of data rather than showing individual data points.

They are particularly useful because they show the following key statistics:

The median (the middle value of the dataset).
The interquartile range (the range between the first quartile or Q1 and the third quartile or Q3), which contains the central 50% of the data.
The minimum and maximum values within a defined range.
Outliers, represented as individual points outside the “whiskers” of the plot.

Similar to the previous chunk, I am using the lapply combined with the function of box-plots (boxplot(...)) to efficiently display a plot for each variable in the attitude dataset.

  par(mfrow = c(3, 3))
  var.names <- colnames(attitude)
  invisible(lapply(seq_along(var.names), function(x) {
  boxplot(attitude[, x],
  main = var.names[x],
  xlab = "")
  }))

Scatter Plots:

Scatter plots are a basic form of data visualization that help us see the relationship between two continuous variables. They differ from other plots because they allow us to examine how two variables interact, specifically whether there is a linear or nonlinear relationship between them.

The importance of scatter plots lies in their ability to reveal patterns, trends, clusters and outliers in the data. By arranging data points as individual points on a two-dimensional plane, we can visually identify relationships, associations, or the lack thereof. Scatter plots are particularly useful for the following reasons.

Finding linear and non-linear relationships: Scatter plots help us to find out if two variables are linearly positively or negatively related or the relationship is not linear.
Identifying outliers: Outliers, data points that deviate significantly from the overall pattern, are easily detected in scatter plots and can be identified
Cluster analysis: Clusters of data points can indicate distinct subpopulations or clusters in the data.
Visualizing multivariate data: Scatter matrices like the one created in your code snippet allow us to visualize relationships between multiple variables simultaneously, which is important in exploratory data analysis

Using Scatter Plots in R:

pairs(attitude, main = "attitude data", panel = panel.smooth): This line generates a scatter plot matrix (a grid of scatter plots) for all the variables in the “attitude” dataset. The panel.smooth argument adds smoothed regression lines to each scatter plot to help visualize trends.

  # Plot variables: Useful to detect linear or non-linear patterns.
  pairs(attitude, main = "attitude data", panel = panel.smooth)

plot(attitude$rating, attitude$complaints): This line creates a single scatter plot between the “rating” and “complaints” variables, providing a detailed view of the relationship between these two specific variables.

  # Single plot
  plot(attitude$rating, attitude$complaints)

abline(lm(rating ~ complaints, data = attitude), col = 'red'): Here, a regression line is added to the scatter plot created in the previous step. This line represents the best-fit linear relationship between “rating” and “complaints” using a linear regression model. The line is colored red for visibility.

  plot(attitude$rating, attitude$complaints)
  # Draw a regression line
  abline(lm(rating ~ complaints, data = attitude), col = 'red')

Multiple Regression model

Imagine you have a yummy meal, and you want to know what makes it taste so good. Is it the color, the smell, or maybe the shape? Multiple regression helps us figure out which of these things, or variables, are most important in making the meal delicious. It’s like sniffing out the best part of a treat recipe! So the idea of regression is that we have a series of variables that affect or predict the behavior of another outcome variable. These explanatory variables are called determinants of the dependent variable precisely for their power to predict outcome.

Univariate models have only one determinant, but they are mostly unused. It is difficult to expect that one thing has only one predictor.

  # Single Regression
m0 <- lm(rating ~ advance, data = attitude)
summary(m0)

## 
## Call:
## lm(formula = rating ~ advance, data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.7465  -4.8749   0.5975   7.4232  18.1526 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.7558     9.7428   5.825 2.93e-06 ***
## advance       0.1835     0.2209   0.831    0.413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.24 on 28 degrees of freedom
## Multiple R-squared:  0.02405,    Adjusted R-squared:  -0.0108 
## F-statistic:  0.69 on 1 and 28 DF,  p-value: 0.4132

Multiple Regression model has two or more explanatory variables and it is the most frequent use model.

m1 <- lm(rating ~ ., data = attitude)
summary(m1)

## 
## Call:
## lm(formula = rating ~ ., data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9418  -4.3555   0.3158   5.5425  11.5990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.78708   11.58926   0.931 0.361634    
## complaints   0.61319    0.16098   3.809 0.000903 ***
## privileges  -0.07305    0.13572  -0.538 0.595594    
## learning     0.32033    0.16852   1.901 0.069925 .  
## raises       0.08173    0.22148   0.369 0.715480    
## critical     0.03838    0.14700   0.261 0.796334    
## advance     -0.21706    0.17821  -1.218 0.235577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.068 on 23 degrees of freedom
## Multiple R-squared:  0.7326, Adjusted R-squared:  0.6628 
## F-statistic:  10.5 on 6 and 23 DF,  p-value: 1.24e-05

Section 2: An introduction to Network Analysis using Igraph

Install packages

Before you start this section I recommend that you install the following packages:

igraph: A package for network analysis and visualization (most important).
tnet: A package for analyzing weighted, two-mode, and longitudinal networks.
data.table: A package for data manipulation and analysis.
qgraph: A package for creating and analyzing graphical models (e.g., network models) that we use only to improve the visualization of networks.
knitr: A package for dynamic report generation in R.

# packages
pks <- c('knitr', 'igraph', 'tnet', 'data.table', 'qgraph')
 
#Load and install packages
to.install <- pks[!unlist(lapply(pks, require, character.only = T ))]
if(length(to.install)!=0){install.packages(to.install, dependencies = T)}

Generate Graphs

To create networks, we have the option of utilizing both R base functions and functions within the igraph package.

Using Adjanceny Matrices

One approach involves generating a network using an adjacency matrix, where the rows and columns of the matrix correspond to vertices, and the values in the matrix indicate connections between these vertices.

#### Graph from a Matrix ####
n <- 5
g <- matrix(0, ncol = n, nrow = n)
val <- round(runif(sum(upper.tri(g)), min = 0, max = 1))
g[upper.tri(g)] <- val
g <- t(g) + g

# Create a graph object
g1 <- graph_from_adjacency_matrix(g, mode = "undirected")
as_bipartite()

## igraph layout specification, see ?layout_:
## layout_as_bipartite(, input = "C:/Users/mglez/Documents/PHD/Semester 16/15092023_network_blog/igraph_tu_mgs_v01.Rmd", 
##  igraph layout specification, see ?layout_:
##     encoding = "UTF-8")

?as_bipartite

## starting httpd help server ... done

Using Edge lists

The second most common way to generate a network is from an edge list or a set of pairs that define the connection between two vertices.

#### Graph from Edgelist ####
g1 <- graph_from_edgelist(t(combn(1:n,2)))

Using formulas

The third way to create a graph is by using specific formulas with the graph_from_literal function. This function enables us to create networks based on formulaic representations. Essentially, we specify the desired network structure using a compact formula notation within this function.

This is the general notation of the function:

-+: Represents a directed edge between two vertices. For example, A -+ B indicates a directed edge from vertex A to vertex B.
--: Represents an undirected edge between two vertices. For example, A -- B indicates an undirected edge between vertex A and vertex B.
++: Represents a directed edge with an arrowhead at both ends, implying a bidirectional connection between two vertices. For example, A ++ B signifies bidirectional directed edges between vertex A and vertex B.
:: Represents a grouping of vertices. For example, A:B signifies that vertices A and B are in the same group or cluster within the network.

These notations allow you to define various types of connections and structures within your network using concise and expressive formulas.

#### Graph from Formula ####

#Undirected
par(mfrow = c(1, 4))
g1 <- graph_from_literal( A-B-C )

#Directed
g2 <- graph_from_literal( A -+ B ++ C )

#Undirected grouping
g3 <- graph_from_literal( A-B:C )

#Directed grouping
g4 <- graph_from_literal( A-+B:C )

#Plot the graphs
invisible(lapply(list(g1, g2, g3, g4), plot, vertex.size = 25, edge.arrow.size = .5))

Using igraph-functions

In the igraph package, there are several functions are available to generate networks using various algorithms and models. Here are some of the commonly used functions for network generation:

Erdős-Rényi Model (erdos.renyi.game): This function generates random networks following the Erdős-Rényi model. In this model, you specify the number of vertices (n) and the probability (p) of forming an edge between any pair of vertices. The default type is set to gnp for the probability model. This model is useful for creating networks where edges exist between pairs of vertices independently with a fixed probability.

Watts-Strogatz Model (watts.strogatz.game): The Watts-Strogatz model creates small-world networks with a combination of regularity and randomness. By default, it starts with a regular lattice where each vertex is connected to its (nei) nearest neighbors in a ring. Then, edges are rewired with probability p to introduce randomness. You can specify the number of vertices (n), the dimension of the lattice (dim), the number of neighbors (nei), and the rewiring probability (p) as parameters. This model helps generate networks that exhibit small-world properties.

Barabási-Albert Model (barabasi.game): This function generates networks using the Barabási-Albert model. By default, it creates an undirected network with n vertices and attaches each new vertex to m existing vertices with preferential attachment. The default is set to m = 1, meaning each new vertex connects to a single existing vertex. This model results in scale-free networks with a few highly connected nodes, which is a common property in many real-world networks.

Forest Fire Model (forest.fire.game): The forest fire model simulates the growth of networks using the forest fire algorithm. By default, it creates a network with n vertices. The m parameter specifies the number of edges added from the new vertex to the existing graph in each step. The p parameter controls the probability of spreading the fire to existing vertices. This model is suitable for generating networks with a specified number of vertices and a desired average degree while considering network growth dynamics.

# Set the number of vertices for all networks
n <- 25

# Generate networks using different models
g1 <- erdos.renyi.game(n, p = 0.2)
g2 <- watts.strogatz.game(n, dim = 1, size = 4, p = 0.2)
g3 <- barabasi.game(n, m = 1)
g4 <- forest.fire.game(n, fw.prob = 0.2, bw.factor = 2)

This is the full list of functions available in igraph to generate networks:

games <- grep("^.*game", ls("package:igraph"), value = TRUE)[-1]
games

##  [1] "aging.barabasi.game"         "aging.prefatt.game"         
##  [3] "asymmetric.preference.game"  "ba.game"                    
##  [5] "barabasi.game"               "bipartite.random.game"      
##  [7] "callaway.traits.game"        "cited.type.game"            
##  [9] "citing.cited.type.game"      "degree.sequence.game"       
## [11] "erdos.renyi.game"            "establishment.game"         
## [13] "forest.fire.game"            "grg.game"                   
## [15] "growing.random.game"         "hrg.game"                   
## [17] "interconnected.islands.game" "k.regular.game"             
## [19] "lastcit.game"                "preference.game"            
## [21] "random.graph.game"           "sbm.game"                   
## [23] "static.fitness.game"         "static.power.law.game"      
## [25] "watts.strogatz.game"

Visualizing Networks with igraph

To create insightful network visualizations using the igraph package in R, we’ll begin with plotting a simple network. Later on, we’ll explore customization options and demonstrate how to visualize multiple networks side by side.

Plotting a Simple Network

In this initial plot, we have our network displayed.

plot(g1)

However, to improve the visualization we can further customize the attributes of the plot function in the following way:

vertex.size: This attribute allows you to adjust the size of the nodes (vertices) in your network.
edge.arrow.size: If your network contains directed edges, you can modify the arrow size using this attribute.
vertex.color: Sets the color of nodes (here, light blue).
edge.color: Defines the color of edges (here, gray).
vertex.label: Removes node labels for a cleaner visualization.
layout: Specifies the layout algorithm; we used the Fruchterman-Reingold layout here.

Customizing network attributes:

# Visualizing multiple networks in a grid
par(mfrow = c(2, 2))
networks <- list(g1, g2, g3, g4)
invisible(lapply(networks, plot, 
                vertex.size = 25, 
                edge.arrow.size = 0.5,
                vertex.color = "lightblue", 
                edge.color = "gray",
                vertex.label = NA,
                layout = layout.fruchterman.reingold))

Layouts of igraph

The igraph has different algorithms called layouts, which help us visualize and highlight network patterns, degree distributions, and the spatial arrangement of vertices within a network.

# Generate a graph
n <- 15
g1 <- barabasi.game(n, directed = F)

# Explore the complete list of layouts.
# layouts <- grep("^layout_", ls("package:igraph"), value = TRUE)[-1]
# layouts <- layouts[!grepl("bipartite|sugiyama", layouts)]
# dput(layouts)

layouts <- c("layout_as_star", "layout_as_tree", "layout_components", "layout_in_circle", 
"layout_nicely", "layout_on_grid", "layout_on_sphere", "layout_randomly", 
"layout_with_dh", "layout_with_drl", "layout_with_fr", "layout_with_gem", 
"layout_with_graphopt", "layout_with_kk", "layout_with_lgl", 
"layout_with_mds")

par(mfrow = c(2, 2))
invisible(lapply(layouts, function(x){plot(
g1,
vertex.size = 30,
layout = eval(get(x)),
xlab = x
) }))

Setting shape of vertices

Vertex shapes in a network graph represent the graphical symbols used to depict individual vertices or nodes. They are a visual attribute that allows you to distinguish between nodes based on specific characteristics or groupings. Vertex shapes are useful in network visualization as they help convey additional information beyond just the connections between nodes.

To illustrate the usefulness of vertex shapes, we are creating an example where we visualize different vertex attributes. In this particular case, we’re generating a random variable called Age from random draws of a normal distribution, with a mean of 30 and a standard deviation of 5, and dividing the vertices into three distinct groups based on quantiles of this variable. Each group will be assigned a different vertex shape, making it clear which nodes belong to which category. This approach enhances the interpretability of the network by allowing you to visually identify nodes with similar attributes or characteristics.

## All shape forms
shapes <- c(
  'circle',
  'square',
  'csquare',
  'rectangle',
  'crectangle',
  'vrectangle',
  'pie',
  'raster',
  'sphere'
)
V(g1)$Age <- rnorm(n, 30, 5)
q <- quantile(V(g1)$Age, c(0, .33, .66, 1))
ind <- cut(V(g1)$Age, q, include.lowest = T, labels = F)
shape <- ifelse(ind==1, 'csquare', ifelse(ind==2, 'circle', 'sphere'))

plot(
  g1,
  vertex.size = 15,
  edge.arrow.size = .3,
  vertex.shape = shape,
  layout = layout_nicely
)

Setting colours of vertices and adding legends to plots

Generate a vector of attributes by sampling with replacement from a set Gender = {female, male}, and set a new attribute to the graph called gender. Also, add legends using the legend function and pass the desired arguments.

V(g1)$gender <- sample(c("male", "female"), n, replace = T)

plot(
  g1,
  vertex.size = 15,
  vertex.color = ifelse(V(g1)$gender == "male", "light blue", "pink"),
  edge.arrow.size = .3,
  vertex.shape = shape,
  layout = layout_nicely
)

legend(
  # Position
  x = -1.5,
  y = -1.1,
  # Legends
  c("male", "female"),
  # Mark type (circle)
  pch = 21,
  col = 1,
  pt.bg = c("light blue", "pink"),
  pt.cex = 2,
  cex = .8,
  bty = "n",
  ncol = 1
)

Setting colours to groups of vertices

We can emphasize groups of vertices in the graph. Find the vertex with the highest degree centrality and mark the adjacent vertices in a group.

class <- adjacent_vertices(g1, which.max(degree(g1)), mode = c("all"))
plot(
  g1,
  vertex.size = 15,
  vertex.color = ifelse(V(g1)$gender == "male", "light blue", "pink"),
  edge.arrow.size = .3,
  vertex.shape = shape,
  layout = layout_nicely,
  mark.groups = class,
  mark.col = rainbow(length(class))
)

Set colours and thickness to edges

g1 <- graph.data.frame(data.frame(
  from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F'),
  to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E', 'F', 'G')
), directed = F)

# and plot it:
cl <- cliques(g1, min = 3, max = 3)
c.ed <- lapply(cl, function(x)
  E(g1, path = c(x, x[1])))
plot(g1,
  layout = layout.star,
  edge.width = edge.betweenness(g1),
  edge.color = ifelse(E(g1) %in% unlist(c.ed), "red", "gray"))

Subgraphs

Find the vertices adjacent to the vertex A and then plot a subgraph of the neighborhood.

g1 <- graph.data.frame(data.frame(
  from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F'),
  to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E', 'F', 'G')
), directed = F)

# Find the vertices adjacent to A
v <- 'A'
neig.a <- adjacent_vertices(g1, v, mode = c("all"))

# Subgraph the neighborhood of A
g2 <- induced.subgraph(g1, c(v, names(neig.a[[1]])))

# Plot the Graphs
par(mfrow = c(1, 2))
plot(g1)
plot(g2)

Network Statistics

Local Clustering

Local clustering of a vertex $v$ is the ratio of the number of 3-cliques, or triangles, that fall in to $v$ and the number of connected triplets from which two edges are incident to $v$. For instance, for vertex $A$, the number of triangles, is defined as the cardinality of the set of vertices $\Delta_{A}={(A,C,B), (A,C,D), (A,D,E), (A,E,D)}$. Similarly the number of connected triplets is the set $T={(A,C,B), (A,C,D), (A,D,E), (A,E,D), (C,A,E), (D,A,B) }$. The local clustering coefficient $C_{A}=\frac{|\Delta_{A}|}{|T|}=\frac{2}{3}$. The local clustering is not defined for topologies, such as stars, trees, lattices.

 g <-
    graph.data.frame(data.frame(
    from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'),
    to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')
    ), directed = F)

    #Graph
    plot(g, layout = layout.star)

    #Transitivity of A
    transitivity(g, v = "A", "local")

## [1] 0.6666667

    transitivity(graph.lattice(5), "local")

## [1] NaN   0   0   0 NaN

    transitivity(graph.star(5, mode = "undirected"), "local")

## [1]   0 NaN NaN NaN NaN

    transitivity(graph.tree(5, mode = "undirected"), "local")

## [1]   0   0 NaN NaN NaN

Degree and Strength

Add two new edges to the graph, from to , and from to , and plot the graph. Compare the measurements of and for the vertex $A$, notice that the vertices adjacent to are ${B,C,D,E)}$ but both measurements have a value of $7$. Simplify the graph, remove multiples edges and loops, and compute the degree centrality. Finally, use the count\_multiple function to compute weights for each edge in the graph and calculate strength centrality one more time.

    g <- graph.data.frame(data.frame(
      from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'),
      to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')
      ), directed = F)

    # Add edge
    g <- add_edges(g, c('A', 'B', 'A', 'A'))

    # Plot the graph
    plot(g)

    # Degree vs Strength
    degree(g, V(g)$name == 'A', mode = 'all')

## A 
## 7

    strength(g, V(g)$name == 'A', mode = 'all')

## A 
## 7

    # Simplify Degree
    degree(simplify(g, remove.multiple = T, remove.loops = T),
    V(g)$name == 'A',
    mode = 'all')

## A 
## 4

    # Strength with Weights
    E(g)$weight <- count_multiple(g)
    g <- simplify(g)
    strength(g, V(g)$name == 'A', mode = 'all')

## A 
## 7

Eigenvector Centrality

The intuition of is to capture the importance of the neighborhood of each vertex. Vertices who are connected to adjacent vertices with higher degree centrality will perform better in this measurement. The interest lies in finding a vector that represents a ranking of relative importance for each vertex. This is similar to finding a solution for the eigenvalue problem.

\[Aw = \lambda w\]

Suppose that we can assign equal weights $w_1$, to each vertex, and then perform $Aw_1=w_2$, similar to computing a weighted degree. Then use $w_2$ to perform $n$ iterations till difference between $Aw_{n} - \lambda w_{n+1}$ gets closer to zero. Run a algorithm of eigenvector centrality and compare the results with the function.

# graph: 
    g <- graph.data.frame(data.frame(
      from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D'),
      to = c('B', 'C', 'D', 'E', 'C', 'E', 'D', 'E')
      ), directed = F)

# Eigen vector centrality algorithm    
eigenvector.centrality = function(g, t=7) {
  A = get.adjacency(g)
  n <- nrow(A)
  #Degree
  # w <- max(A%*%rep(1, n))
  w <- n
  #Create a vector of weights
  x1 <- rep(1/w, n)
  #Create a vector of zeros (for the initial interation)
  x0 <- rep(0, n)
  #Presicion of the computation
  pre <- 1/10^t
  #Index of interaction
  iter <- 0
  while ( sum(abs(x0 - x1)) > pre) {
    #Store the current weight for comparison in the next interaction
    x0 <- x1
    #Compute a weighted degree
    x1 <- as.vector(A %*% x1)
    #Get the biggest weight
    w <- x1[which.max(abs(x1))]
    #Get a new vector of weights
    x1 <- x1 / w
    #Save the interations
    iter <- iter + 1
  }
  return(list(vector = x1, value = w, iter = iter))
}

#Compute eigenvector centrality scores
eigen_centrality(g)$vector

##        A        B        C        D        E 
## 1.000000 0.809017 0.809017 0.809017 0.809017

eigenvector.centrality(g, 7)$vector

## [1] 1.000000 0.809017 0.809017 0.809017 0.809017

Weighted Measurements of centrality

The package has implementations of weighted versions of , and . Lets generate a graph of $n$ vertices and, compare the difference between the weighted and unweighted centrality measurements.

  n <- 20
    # Generate an undirected weighted graph
    w <- matrix(0L, nrow = n, ncol = n)
    
    # Squewed distributon
    val <- rnbinom(sum(upper.tri(w)), prob=1/5, size = 1)
    w[upper.tri(w)] <- val
    w <- w + t(w)
    g_w <- as.tnet(w, type = 'weighted one-mode tnet')
    g_w.1 <- graph_from_adjacency_matrix(w)
    E(g_w.1)$weight <- count.multiple(g_w.1)

    # Generate an undirected unweighted graph
    uw <- matrix(0L, nrow = n, ncol = n)
    uw[w > 0] <- 1
    g_uw <- graph_from_adjacency_matrix(uw, mode = 'undirected')
    # Strength
    st <- strength(g_w.1)
    d <- degree(g_w.1)
    dw <- degree_w(g_w)
      
    # Betweeness
    bu <- betweenness(g_uw)
    bw <- betweenness_w(g_w)
    
    # Closeness
    cu <- closeness(g_uw)
    cw <- closeness_w(g_w)
    out <-
    data.frame(
    vertex = 1:n,
    degree = d,
    w.degree = dw[,3],
    strength = st,
    betweenness = bu,
    w.betweenness = bw[, 2],
    closeness = cu,
    w.closeness = cw[, 3]
    )
    
   kable(out, format = "markdown")

vertex	degree	w.degree	strength	betweenness	w.betweenness	closeness	w.closeness
1	136	68	1360	2.104459	7	0.0400000	0.0027701
2	206	103	3390	2.850419	29	0.0400000	0.0033045
3	142	71	1310	3.630675	6	0.0434783	0.0024859
4	132	66	1664	1.697183	7	0.0384615	0.0029117
5	144	72	1240	2.774312	9	0.0434783	0.0027380
6	84	42	496	2.129892	0	0.0384615	0.0022731
7	104	52	476	4.233103	1	0.0454545	0.0020333
8	148	74	1496	1.338850	13	0.0370370	0.0027492
9	130	65	762	2.880625	1	0.0434783	0.0023273
10	180	90	3248	3.568456	18	0.0434783	0.0031294
11	194	97	2246	3.685820	21	0.0454545	0.0032645
12	116	58	1212	2.174228	3	0.0384615	0.0022673
13	86	43	590	1.078691	0	0.0370370	0.0024054
14	144	72	988	4.050985	8	0.0434783	0.0028601
15	88	44	404	1.909510	0	0.0384615	0.0020143
16	126	63	822	4.247017	1	0.0454545	0.0025134
17	162	81	1314	1.615571	18	0.0384615	0.0033866
18	178	89	2278	2.254387	16	0.0400000	0.0032046
19	104	52	644	1.613647	5	0.0384615	0.0025963
20	196	98	2500	4.162168	32	0.0454545	0.0032223

Structural Holes

Structural Holes are separations between groups observed from discontinuities in network structure. The absence of structural holes signals saturation in the capacity of individuals(vertices) to create novel connections outside their group. Saturation occurs when individuals reach a limit on the number of connections they can create and maintain. When individual’s resources are concentrated in a single group structural holes are absent or scare. The scarcity of holes represent constraints to collaborate outside a single research team that leads to redundancy of information and inability to capitalize novel ideas from different research teams.

g <- graph.data.frame(data.frame(
  from = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'D', 'E', 'F'),
  to = c('B', 'E', 'F', 'G', 'D', 'G', 'G', 'G', 'G', 'G')
), directed = F)


plot(g)

A <- as.matrix(get.adjacency(g))
kable(A, format = "markdown")

	A	B	C	D	E	F	G
A	0	1	0	0	1	1	1
B	1	0	0	1	0	0	1
C	0	0	0	0	0	0	1
D	0	1	0	0	0	0	1
E	1	0	0	0	0	0	1
F	1	0	0	0	0	0	1
G	1	1	1	1	1	1	0

To calculate the constraints to bridge structural holes, the first step is to calculate, $i$, individual proportion of resources allocated to $j$ connections.

\[p_{ij} = z_{ij} / z_{iq}\]

# degree for undirected (Sum of resources spend in each connection)
d <- (A * upper.tri(A)) %*% matrix(1, nrow = nrow(A), ncol = 1)

# Matrix of degree
D <- matrix(rep(d, ncol(A)), nrow = nrow(A), ncol = ncol(A))

#  Matrix of time and enery invested on others z_iq = d - z_ij
z_iq <- (D * upper.tri(D)) - (A * upper.tri(A))

# Matrix of proportion of i's time an energy allocated to j's connections.
P <- (A * upper.tri(A))/z_iq

Redundancy of Centrality in Complete graphs

There are some cases in which the measurements of centrality will not provide relevant information. For instance, if the structure of an undirected network approaches a complete graph, each pair of different vertices is connected by a unique edge $\forall {i \neq j}:E(v_i,v_j)=1$, then the centrality measures will not yield relevant information. Take into consideration the following example, where I generate a fully connected graph:

## Creating Adjacency Matrix
A <- matrix(rep(1,25), ncol = 5, nrow = 5) 
diag(A) <- 0
G <- graph_from_adjacency_matrix(A, mode = "undirected")

## Creating the 
cols <- data.frame(
    degree(G),
    closeness(G),
    constraint(G),
    transitivity(G, type = 'local'),
    eigen_centrality(G)$vector,
    betweenness(G)
  )


colnames(cols) <- gsub("\\.G\\..*", "", colnames(cols))

kable(cols,  format = "markdown")

degree	closeness	constraint	transitivity	eigen_centrality
4	0.25	0.765625	1	1
4	0.25	0.765625	1	1
4	0.25	0.765625	1	1
4	0.25	0.765625	1	1
4	0.25	0.765625	1	1

To see more clearly the issue of redundancy of network measurements, I have created this snipped of code with a simulation. The code calculates the different network statistics keeping constant the number of edges but increasing the number of connections until the network is fully connected. In a nutshell the snipped, calculates network statistics for networks with the same number of vertices but an increasing number of connections conn <- c(seq(from=points, to=triag.matrix, by=round(triag.matrix/points)), triag.matrix). Using a loop, we iterate this sequence sampling randomly the connections in the conn sequence as follows: sample(triag.matrix, conn[i]).

#### Compute the average for each network centrality ####
n <- 100
triag.matrix <- ((n^2)-n)/2
points <-100
conn <- c(seq(from=points, to=triag.matrix, by=round(triag.matrix/points)), triag.matrix)
i <- 20
out.list <- list()
for(i in seq_along(conn)){
g <- matrix(0, ncol = n, nrow = n)  
val <- sample(triag.matrix, conn[i])
g[upper.tri(g)][val]<- 1
g <- t(g)+g
g <- graph_from_adjacency_matrix(g, mode = 'undirected')
t <- transitivity(g, type = 'localundirected')
t[is.na(t)]<- 0
out.list[[i]] <-
  data.frame(
    degree = mean(degree(g)),
    closeness = mean(closeness(g)),
    betweennes = mean(betweenness(g)),
    transitivity = mean(t),
    eigen.cent = mean(eigen_centrality(g)$vector),
    struc.holes = mean(constraint(g))
  )
}

Now that we have calculated the network statistics the goal of this snippet is to plot the network statistics accordingly. Each plot shows how a specific network statistic changes as the number of connections in the network increases. The vertical axis represents the values of various network statistics.The horizontal axis represents the number of connections in the network as they increase gradually untill they reach the fully connected graph.

As it becomes clear, when the connectivity level of the graph increases, the network centrality measurements become more and more similar. The results of this simulation suggest that network centrality measurements become redundant as a graph approaches a fully connected network. This is because all nodes in a fully connected network have the same fundamental network structure.

out.data <- rbindlist(out.list)
rm(out.list)
par(mfrow = c(2, 3))

x <- 1:nrow(out.data)
for(j in 1:ncol(out.data)){
y <- out.data[[j]]
y[is.na(y)] <- 0
st <- sqrt(var(y))
plot (x, y, ylim=c(0,max(y)), main = colnames(out.data)[j] )
segments(x,y-st,x,y+st)
epsilon <- 0.02
segments(x-epsilon,y-st,x+epsilon,y-st)
segments(x-epsilon,y+st,x+epsilon,y+st)

}

Network analysis with Data

In this code snippet, you will learn how to perform a basic network analysis on a real-world dataset. We start by downloading a network dataset, specifically the “ca-netscience” dataset, from an online source. This dataset represents a co-authorship network in the field of network science.

The code proceeds to unzip and load the dataset into R. We then construct a graph from the dataset using the igraph package, representing the relationships between authors. And finally, we calculate various network metrics such as degree centrality, closeness centrality, betweenness centrality and other network statistics.

I encourage you to explore similar datasets on the Stanford Large Network Dataset Collection and the NetworkRepository, they have a wide range of sample empirical data for your analyses.

# Download network data
 # setwd('C:/r_tutorial') 
# Data from: https://arxiv.org/abs/physics/0605087    
 download.file('http://nrvis.com/download/data/ca/ca-netscience.zip',
                  destfile = 'ca-netscience.zip')
 unzip('ca-netscience.zip')
 g <- read.csv('ca-netscience.mtx', sep = " ", header = F, skip = 2)
 g <- graph_from_data_frame(g, directed = F)
 E(g)$weight <- count.multiple(g)

 # Perform a basic network analysis (degree, closeness, betweenness, 
 # transitivity, eigenvector.centrality)
 # Store the results in a data.frame for analysis, add a column of ids. 
 
 na <- data.frame(
    id = V(g)$name,
    closeness = closeness(g, mode = "all", normalized = F),
    degree = degree(g, mode = "all", normalized = F, loops = F),
    strength= strength(g),
    betweenness = betweenness(g, directed = F, normalized = F),
    struc_hole = constraint(g),
    transitivity = transitivity(g, "localundirected"),
    eigen_centrality= eigen_centrality(g, scale = F)$vector)
 
kable(na[1:10,],  format = "markdown")

	id	closeness	degree	strength	betweenness	struc_hole	transitivity	eigen_centrality
2	2	0.0004627	2	2	0.000	0.8650000	1.0000000	0.0148383
3	3	0.0004627	2	2	0.000	0.8650000	1.0000000	0.0148383
4	4	0.0005643	34	34	10834.473	0.0955073	0.1336898	0.4142993
5	5	0.0006075	27	27	17858.003	0.1071098	0.1823362	0.3562072
16	16	0.0005211	21	21	1131.347	0.1690397	0.2761905	0.3464503
44	44	0.0005882	4	4	12460.579	0.2941086	0.5000000	0.0885522
113	113	0.0005033	15	15	4601.692	0.1901735	0.2571429	0.0222461
131	131	0.0004760	12	12	3238.339	0.2118647	0.3333333	0.0213000
250	250	0.0005089	6	6	13.200	0.3374061	0.8666667	0.1470503
259	259	0.0004735	3	3	0.000	0.4537654	1.0000000	0.0176052

Community Detection

Lastly, I would like to show you how to perform a basic community detection analysis. Community detection in network analysis is the process of identifying groups or communities of nodes within a network that are more densely connected to each other than to nodes outside their community. These communities represent subsets of nodes which may possess similar characteristics, functions, or roles within the network.

Here’s a brief explanation of some community detection methods available in igraph:

edge.betweenness.community: This method identifies communities based on edge betweenness centrality. It removes edges with the highest betweenness values iteratively, eventually breaking the network into communities.

fastgreedy.community: It uses a greedy optimization approach to find hierarchical communities by optimizing modularity, a measure of community structure quality.

infomap.community: This method employs the Infomap algorithm, which treats the network as a flow of information and detects communities by minimizing the expected description length of the information flow.

label.propagation.community: Nodes are assigned labels, and communities form based on the propagation of these labels through the network. It’s a simple and fast method.

leading.eigenvector.community: This approach uses spectral graph theory and the leading eigenvector of the network’s adjacency matrix to detect communities.

multilevel.community: It’s a multilevel algorithm that optimizes modularity by moving nodes between communities, iteratively improving the community structure.

spinglass.community: This method is based on spin glass models from statistical physics, which maximize a Hamiltonian function to find communities.

walktrap.community: It uses random walks to find communities by detecting nodes that are more likely to be visited together during random walks on the network.

For this example I will use the same data as the previos snipped.

comms <- c("edge.betweenness.community", "fastgreedy.community", "infomap.community",
"label.propagation.community", "leading.eigenvector.community",
"multilevel.community", "spinglass.community",
"walktrap.community")

V(g)$frame.color <- "white"
V(g)$label <- ""
E(g)$arrow.mode <- 0

plot.comm <- function(comm){
V(g)$color <- colors37[get(comm)(g)$membership]
l <- qgraph.layout.fruchtermanreingold(get.edgelist(g, names = F), vcount=vcount(g),
      area=8*(vcount(g)^2),repulse.rad=(vcount(g)^3.1))
plot(g,layout=l,vertex.size=5, main= comm)
}

invisible(lapply(comms, plot.comm))

Backtesting a Crossover Moving Average Strategy Algorithm in the Forex Market.

2023-09-06T00:00:00+00:00

Introduction

This post aims to explore the effectiveness of a straightforward Forex market investment algorithm. Amidst numerous algorithmic possibilities, I chose to embrace simplicity as a stepping stone before delving into more complex strategies. Inspiration was drawn from Gurrib’s 2016 paper, published on the Global Review of Accounting and Finance, available at SSRN. Gurrib’s study, which benchmarked a crossover simple moving average strategy on daily S&P500 candles between 1993 and 2014, reported an impressive 24% return over 1593 investment days. Let’s embark on this journey to assess the potential of a similar approach in the Forex market.

SMA Crossover Strategy

The SMA crossover strategy works by assuming that the series contains short and long-run trends. The short-run trend follows the series closely and reacts more rapidly to variations in the series in comparison to the long-run trend. The core idea of the strategy is that we can find market signals of buying or selling by monitoring the intersections between the short and long-run trends in the series. If the short-run trend intersects the long-run trend and moves upward the value of the series is increasing (also called a golden cross), and therefore the algorithm sends a buy signal. Conversely, if the short-run trend intersects the long-run trend and moves downward the price decreases (known as a dead cross), and it is time to sell.

To understand better the behavior of the algorithm I have created a visualization that monitors the interaction between the USD.SEK series that I downloaded from Interactive Brokers in candles of 30 seconds (blue) and the short and long-run trends (red and green respectively) that I have estimated using SMA.

If you are interested, you can recreate the animation with the code bellow in Python:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
from matplotlib.animation import FuncAnimation
from IPython.display import display, clear_output

# change working directory

# Specify the target directory
new_directory = r'C:\Users\mglez\Documents\PHD\Semester 16\01092023_SMA_blog\material\dynamc_graph'

# Change the working directory
os.chdir(new_directory)

# Read the CSV file and extract the first column of the USD.SEK as y1
df = pd.read_csv('USDSEK_dur_5D_candle_30sec_2023-08-20_2023-08-24_CLOSE.csv')
y1 = df.iloc[301:, 1].values
# Loading the simple moving averages
y2 = df.iloc[301:, 2].values
y3 = df.iloc[301:, 3].values

# Create x data
n = len(y1)
x = np.arange(n)

# Initialize the plot
fig, ax = plt.subplots()
line1, = ax.plot(x, y1, label='USD.SEK', color='blue')
line2, = ax.plot(x, y2, label='SMA 5', color='red')
line3, = ax.plot(x, y3, label='SMA 300', color='green')
ax.set_title('SMA Crossover Strategy: USD.SEK')
ax.legend()

# Set the y-axis limits to display values between 10.5 and 11.5
ax.set_ylim(10.9, 11)

# Initialize text annotation
text = ax.text(0.1, 0.90, '', transform=ax.transAxes, fontsize=12, color='black')

# Number of values to display on the horizontal axis
num_values_to_display = 100

# Function to update the plot for each frame
def update(frame):
    x_data = x[max(0, frame - num_values_to_display):frame]
    line1.set_data(x_data, y1[max(0, frame - num_values_to_display):frame])
    line2.set_data(x_data, y2[max(0, frame - num_values_to_display):frame])
    line3.set_data(x_data, y3[max(0, frame - num_values_to_display):frame])
    
    # Calculate the x-axis limits dynamically based on the frame and num_values_to_display
    min_x = max(0, frame - num_values_to_display)
    max_x = frame
    ax.set_xlim(min_x, max_x)

    # Check if there are enough data points to calculate min and max
    if len(x_data) >= num_values_to_display:
        min_y = min(min(y1[max(0, frame - num_values_to_display):frame]), min(y2[max(0, frame - num_values_to_display):frame]), min(y3[max(0, frame - num_values_to_display):frame]))
        max_y = max(max(y1[max(0, frame - num_values_to_display):frame]), max(y2[max(0, frame - num_values_to_display):frame]), max(y3[max(0, frame - num_values_to_display):frame]))
        ax.set_ylim(min_y - 0.009, max_y + 0.009)
    
    if y2[frame] > y3[frame]:
        text.set_text('Buy')
        text.set_color('red')
    else:
        text.set_text('Sell')
        text.set_color('green')
    return line1, line2, line3, text

# Create the animation
ani = FuncAnimation(fig, update, frames=n, interval=300)  # Update every 3 seconds

# Display the animation in Jupyter Notebook
display(fig)

try:
    # Continuously update the animation
    for i in range(n):
        update(i)
        clear_output(wait=True)
        display(fig)
except KeyboardInterrupt:
    pass

Requesting Data from Interactive Brokers

To request data from Interactive Brokers, you need a trading account and to connect the API of the Trader Workstation to Python or R. If you are using R, you need to install the IBrokers package before you attempt to download the data. To teach how to download data from the API, can be a tutorial in itself. But the key elements that you need are a connection to the API, a contract with a correct symbol for the stock, a duration and a candle size. In this example, I am creating a connection that I call tws, then I am creating a contract with the twsContract() with the correct symbol USD.SEK and finally I am setting a duration of 5 D (5 days) with candles of 30 sec (seconds).

library('IBrokers')


#### ACCOUNT ####
tws = twsConnect(port=7496)
twsConnectionTime(tws)
ac <- reqAccountUpdates(tws)

#### CONTRACT ####
contract <- twsContract()
contract$symbol <- "USD"
contract$sectype <- "CASH"
contract$currency <- "SEK"
contract$exch <- "IDEALPRO"
contract$includeExpired = TRUE
is.twsContract(contract)


#### REQUEST HIST DATA ####

duration <- "5 D"
barSizeSetting <- "30 sec"

data <- reqHistoricalData(conn=tws, Contract=contract, duration=duration, barSize=barSizeSetting, whatToShow="MIDPOINT")

The setting that I am using on my Trader Workstation are the following:

Optimizing the SMA Crossover bands (Short-run Backtesting)

A key requirement for the success of the algorithm is to identify which set of bands (long and short-run) are better to predict changes in the behavior of the series. We are interesting on benchmarking a series of pair of bands so we can identify which combination is more profitable. In other words, we are going to set the criteria of the highest balance at the end of the testing period to select the pair of bands.

Probably there is fastest vectorized way to identify the intersections, but to calculate the final profit I think it is only possible to do with a loop. Because, the margins of profit/loss change in every transaction and the accumulation of capital depends on this interactive process.

The range I selected for the SMA in the short run is from 5:100 and 10:300for the long run. In each iteration the algorithm will select one pair of bands, fit the corresponding models, calculate benchmarks and estimate the capital at the end of the period. Effectively algorithm tests 27550 combinations of short and long-run bands and saves measurements of performance for latter analysis. The optimization of the bands was conducted in a data set that runs over 5 days in candles of 30 seconds, a data set of 94624.

Similar to the study by Gurrib (2016), I assume that:

The frequency of data is set to candles of 30 seconds.
The effect of discounts, taxes and commissions are ignored.
All orders occur immediately at market prices.
Limit and stop order options are not allowed at this stage.

perf_df <- data.frame(matrix(ncol = 13, nrow = 0))
colnames(perf_df) <-
  c(
    "n",
    "m",
    "capital",
    "num_trades",
    "trades_per_min",
    "numWinningTrades",
    "numLosingTrades",
    "mae_short",
    "mae_long",
    "rmse_short",
    "rmse_long",
    "corr_short",
    "corr_long"
  )

# Set commission rate
# commission_rate <- 0.00075
commission_rate <- 0

# Loop over n and m
for (n in 5:100) {
  for (m in 10:300) {
    # Calculate moving averages
    data$sma_short <- SMA(data$USD.SEK.Close, n = n)
    data$sma_long <- SMA(data$USD.SEK.Close, n = m)
    # data$sma_short[is.na(data$sma_short)] <- 0
    # data$sma_long[is.na(data$sma_long)] <- 0
    
    # Mean Absolute Error (MAE)
    mae_short <- mean(abs(data$sma_short - data$USD.SEK.Close),na.rm = TRUE)
    mae_long <- mean(abs(data$sma_long - data$USD.SEK.Close), na.rm = TRUE)
    
    # Root Mean Squared Error (RMSE)
    rmse_short <-
      sqrt(mean(sum(data$sma_short - data$USD.SEK.Close, na.rm = T) ^ 2))
    rmse_long <- sqrt(mean(sum(data$sma_long - data$USD.SEK.Close, na.rm = TRUE) ^ 2))
    
    # Correlation Coefficient
    corr_short <- cor(data$sma_short, data$USD.SEK.Close, use = "complete.obs")
    corr_long <- cor(data$sma_long, data$USD.SEK.Close, use = "complete.obs")
    
    
    # Initialize variables
    init_capital = 2000
    capital = 2000
    pos = 0
    numTrades = 0
    numWinningTrades = 0
    numLosingTrades = 0
    
    
    
    
    # Backtest strategy
    for (i in 2:nrow(data)){      # Check for a cross
      # c <- c + 1L
      # #print cross
      # print(paste0("cross: ", c))
      if (!is.na(data$sma_short[i - 1]) &&
          !is.na(data$sma_long[i - 1]) &&
          data$sma_short[i - 1] <= data$sma_long[i - 1] &&
          data$sma_short[i] > data$sma_long[i] && capital > 0) {
        # Buy
        pos = (capital - capital * commission_rate) * as.numeric(data$USD.SEK.Close[i])
        print(paste("BUY:", i, pos))
        capital = 0
        numTrades = numTrades + 1
      }
      else if (!is.na(data$sma_short[i - 1]) &&
               !is.na(data$sma_long[i - 1]) &&
               data$sma_short[i - 1] >= data$sma_long[i - 1] &&
               data$sma_short[i] < data$sma_long[i] && pos > 0) {
        # Sell
        capital = as.numeric(pos / data$USD.SEK.Close[i] - pos / data$USD.SEK.Close[i] * commission_rate)
        print(paste("SELL:", i, capital))
        pos = 0
        numTrades = numTrades + 1
        
        if (capital > init_capital) {
          numWinningTrades = numWinningTrades + 1
          
        } else {
          numLosingTrades = numLosingTrades + 1
          
        }
      }
      
    }
  }
  # c <- 0L
  # Append performance to dataframe
  perf_df <-
    rbind(
      perf_df,
      data.frame(
        n,
        m,
        capital,
        numTrades,
        trades_per_min = numTrades / ( (nrow(data) * 30)/60 ),
        numWinningTrades,
        numLosingTrades,
        mae_short,
        mae_long,
        rmse_short,
        rmse_long,
        corr_short = corr_short,
        corr_long = corr_short
      )
    )
  print(tail(perf_df))
}

  
colnames(perf_df) <-
  c(
    "n",
    "m",
    "capital",
    "numTrades",
    "trades_per_min",
    "numWinningTrades",
    "numLosingTrades",
    "mae_short",
    "mae_long",
    "rmse_short",
    "rmse_long",
    "corr_short",
    "corr_long"
  )




#### TOP PERFORMANCE ####
# Print final max capital
print(perf_df[which.max(perf_df$capital),])

# Print final max numWinningTrades  
print(perf_df[which.max(perf_df$numWinningTrades ),])



#### BEST FIT ####

# Print final max rmse_short   
print(perf_df[which.max(perf_df$rmse_short),1])

# Print final max mae_short   
print(perf_df[which.max(perf_df$mae_short),1])

# Print final max mae_short   
print(perf_df[which.max(perf_df$corr_short),1])

# Print final max rmse_long   
print(perf_df[which.max(perf_df$rmse_long),2])

# Print final max mae_long   
print(perf_df[which.max(perf_df$mae_long),2])

# Print final max mae_long   
print(perf_df[which.max(perf_df$corr_long),2])

In terms of performance (capital return) the pair that won is the 8, 300 followed closely by the 5, 300 for the short and long-run respectively.

Long Run Backtesting

To have a better idea of the behavior of the algorithm, I decided to run the algorithm using 6 months of data in candles 30 seconds with a total of 2961360 data points. Testing the algorithm over six months will give us a better perspective of how well the SMA bands capture the long and short run trends in the data and a better approximation of the financial return.

Improvements

I decided to make some small changes to the previous algorithm. Firstly, I wanted to compute the grossprofit/loss of each transaction. Secondly, I estimate the return of investment (ROI) of each transaction to compute the average and standard deviation of the returns at the end of the exercise and approximate a Sharpe Ratio. Thirdly, I wanted to correct a misleading numWinningTrades/numLosingTrades indicator in the previous algorithm. In the previous algorithm, I consider a wining trade if the current capital was higher than the initial capital after each transaction. However to be more accurate it is better to consider a winning trade when the buyPrice > sellPrice. This is a bit counter intuitive but remember that the algorithm buys when the price is increasing, so the USD (dollar) invested will render more Krones (SEK). For instance, imagine that you invest 10 USD, and the price of the Krone is 12 (buying price), that is 120 SEK. Then, if the algorithm identifies a selling signal at a price of 10.5 SEK (selling price), your profit would have been 1.43 USD for this transaction, calculated as (120 / 10.5 = 11.43).

First run of the algorithm

In the first run of the algorithm I wanted to secure winning transactions only. I attempt to achieve this by adding a rule buyPrice > as.numeric(data$USD.SEK.Close[i]), so I will guarantee that the selling price was always bellow the buying price and make winning trades all the time. The buying rule was transformed as follows:

 else if (!is.na(data$sma_short[i - 1]) &&
           !is.na(data$sma_long[i - 1]) &&
           data$sma_short[i - 1] >= data$sma_long[i - 1] &&
           data$sma_short[i] < data$sma_long[i] && pos > 0 && buyPrice > as.numeric(data$USD.SEK.Close[i])) 

Unfortunately this change didn’t report a greater performance than the regular unconstrained moving average. The issue is that the series eventually reach local maximum or minimum values. For instance, if the algorithm buys at a local minimum point in the series the selling condition will never be fulfilled buyPrice > as.numeric(data$USD.SEK.Close[i]). The algorithm’s performance suffered because it purchased an asset at a local minimum price, and since then, the price has consistently risen. This situation makes it unlikely for future prices to be lower, leading to lower overall performance. In a nutshell, it seems that in order to take advance of the volatility of the series and make higher profit it is necessary to lose some trades as long as on the averages we are winning more often. This is the reported performance of the first run:

n	m	capital	net_profit	grossProfit	grossLoss
8	300	2086.763	86.763	86.763	0

SMA constrained Crossover performance

The total profit over the six months was only 86.763 USD, a return of investment of only 4.33 %. However as expected, the total number of trades is low and more importantly there are no trades on loss.

buynumTrades	sellnumTrades	trades_per_min	numWinningTrades	numLosingTrades
56	55	0.001	55	0

Second run of the algorithm

In my second attempt I ran the unconstrained version (original version) with the additional elements that I discussed previously, as follows:

n <- 8
m <- 300

# Calculate moving averages
data$sma_short <- SMA(data$USD.SEK.Close, n = n)
data$sma_long <- SMA(data$USD.SEK.Close, n = m)
# data$sma_short[is.na(data$sma_short)] <- 0
# data$sma_long[is.na(data$sma_long)] <- 0

# Mean Absolute Error (MAE)
mae_short <- mean(abs(data$sma_short - data$USD.SEK.Close),na.rm = TRUE)
mae_long <- mean(abs(data$sma_long - data$USD.SEK.Close), na.rm = TRUE)

# Root Mean Squared Error (RMSE)
rmse_short <-
  sqrt(mean(sum(data$sma_short - data$USD.SEK.Close, na.rm = T) ^ 2))
rmse_long <- sqrt(mean(sum(data$sma_long - data$USD.SEK.Close, na.rm = TRUE) ^ 2))

# Correlation Coefficient
corr_short <- cor(data$sma_short, data$USD.SEK.Close, use = "complete.obs")
corr_long <- cor(data$sma_long, data$USD.SEK.Close, use = "complete.obs")



# Initialize variables
init_capital = 2000
capital = 2000
buyCapital <- 0
pos = 0
grossPnL = 0
buynumTrades = 0
sellnumTrades = 0
numWinningTrades = 0
numLosingTrades = 0
grossProfit = 0
grossLoss = 0
commission_rate = 0
buyPrice = 0  # Initialize previousPrice to 0
sellPrice = 0
roi = vector("numeric", length = 0)

# Backtest strategy
for (i in 2:nrow(data)){      # Check for a cross
  # ...
  if (!is.na(data$sma_short[i - 1]) &&
      !is.na(data$sma_long[i - 1]) &&
      data$sma_short[i - 1] <= data$sma_long[i - 1] &&
      data$sma_short[i] > data$sma_long[i] && capital > 0) {
    # Buy
    buyPrice <- as.numeric(data$USD.SEK.Close[i])  # Store the buy price
    buyCapital <- capital
    pos = (capital - capital * commission_rate) * buyPrice
    print(paste("BUY:", i, pos))
    capital = 0
    buynumTrades = buynumTrades + 1
  }
  else if (!is.na(data$sma_short[i - 1]) &&
           !is.na(data$sma_long[i - 1]) &&
           data$sma_short[i - 1] >= data$sma_long[i - 1] &&
           data$sma_short[i] < data$sma_long[i] && pos > 0){ #&& buyPrice > as.numeric(data$USD.SEK.Close[i])
    # Sell
    sellPrice <- as.numeric(data$USD.SEK.Close[i])  # Calculate PnL based on current capital
    capital = as.numeric(pos / sellPrice - pos / sellPrice * commission_rate)
    grossPnL <- capital - buyCapital
    print(paste("SELL:", i, capital))
    pos = 0
    sellnumTrades = sellnumTrades + 1
    
    roi <- c(roi, (buyPrice-sellPrice/buyPrice)*100)
    if (buyPrice > sellPrice) {
      numWinningTrades = numWinningTrades + 1
      grossProfit = grossProfit + grossPnL
    } else {
      numLosingTrades = numLosingTrades + 1
      grossLoss = grossLoss + abs(grossPnL)
    }
  }
}

The reported performance over the same period of data (6 months) is presented on the table bellow. The net profit of the unconstrained simple moving average over six months was 317.278 with an initial investment of 2000 USD. A total of 15.86 % return of investment, not bad at all, considering that we only tested half a year.

n	m	capital	net_profit	grossProfit	grossLoss
8	300	2317.278	317.278	1700.261	1382.983

SMA unconstrained Crossover performance

Remarkably, the unconstrained variant of the algorithm, absent the condition buyPrice > as.numeric(data$USD.SEK.Close[i]), exhibited a loss in approximately 20% of its trades. This is quite high, and it is an area of opportunity for further implementations of the algorithm. I will start by testing a less restrictive condition of selling that allows to sell on loss but only around certain margin, perhaps the standard deviation of long-run SMA.

buynumTrades	sellnumTrades	trades_per_min	numWinningTrades	numLosingTrades
2025	2025	0.001	1612	413

Composition of the trades

The distribution of the Return of Investment (ROI) of each trade is the following:

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	sd
-2.688	0.006	0.025	0.007	0.053	1.31	0.15

ROI Summary Statistics

Final remarks and areas of improvement.

The SMA crossover algorithm proved to be successful, achieving a total return on investment of 15.86% over six months with 30-second candles. However, it’s important to note that this performance heavily depends on specific parameter values (bands), candle intervals, and the chosen stock. In our rigorous testing, we explored a staggering 27,550 combinations of short and long-run bands over five days to identify the winning pair.

While the algorithm showed promise, there are areas for improvement. First, we observed a relatively high gross loss (1382 USD) compared to the gross gain (1780 USD), resulting in approximately 20% of losses. Enhancing the algorithm with additional rules, such as introducing resistance bands, may help mitigate losses during market uncertainties. Secondly, more realistic estimations of transaction commissions need to be incorporated to provide a more accurate representation of algorithm performance. It’s worth noting that Interactive Brokers limits regular trading accounts to one-minute candles, which may impact trading strategies. Looking ahead, optimizing and testing the algorithm’s performance in the equity market, particularly with stocks displaying higher returns and upward trends, could yield even better results. Finally, the next phase involves implementing the algorithm using live market data through the reqMktData function and testing it in a paper trading account to assess its real-time performance.

Creating a Dashboard of CPU Benchmarks Using R and Python.

2023-02-15T00:00:00+00:00

Introduction

In this post I will teach you how to create and deploy a dashboard with a preview of the dataset alongside useful data visualization tools. Dashboards are useful for creating interactive and customizable data visualizations and web applications. They allow you to create a dynamic user interface that can interact with data and update in real-time, making it an excellent tool for data exploration, analysis, and sharing. Dashboards can be used for a wide range of purposes, from monitoring business metrics to visualizing scientific data.

To create the dashboard, I will combine R an Python to take advantage of strengths of each language for web scraping, data cleaning, dashboard creation and deployment. My source to gather CPU information and benchmarks is CPU list from cpubenchmark.net. The goal is to scrap this data to create a dataset of benchmarks for our CPU dashboard example.

Web Scrapping HTML tables with rvest

The library rvest from R has many interesting functions for web scrapping. We are interested a function that can transform a HTML tables (

and

) into readable dataframes.

library("rvest")

## Read data
webpage <- read_html("https://www.cpubenchmark.net/cpu_list.php")
tbls <- webpage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
length(tbls)

## [1] 3

head(tbls[[2]])

## # A tibble: 6 × 5
##   `CPU Name`              `CPU Mark(higher is better)` Rank(lo…¹ CPU V…² Price…³
##                                                        
## 1 AArch64 rev 2 (aarch64) 2,246                             2187      NA    
## 2 AArch64 rev 4 (aarch64) 1,797                             2439      NA    
## 3 AC8257V/WAB             774                               3269      NA    
## 4 AMD 3015Ce              2,088                             2263      NA    
## 5 AMD 3015e               2,691                             1969      NA    
## 6 AMD 3020e               2,446                             2069      NA    
## # … with abbreviated variable names ¹​`Rank(lower is better)`,
## #   ²​`CPU Value(higher is better)`, ³​`Price(USD)`

Now the next step is to convert this table into a dataframe and perform some basic data cleaning. We are going to use regular expressions to transform strings of character into numeric and integer values.

cpus_bench <- tbls[[2]]
cpus_bench$`CPU Mark(higher is better)` <- as.numeric(gsub(",", "", cpus_bench$`CPU Mark(higher is better)`))
cpus_bench$`Rank(lower is better)` <- as.numeric(cpus_bench$`Rank(lower is better)`)
cpus_bench$`CPU Value(higher is better)` <- as.numeric(cpus_bench$`CPU Value(higher is better)`)
cpus_bench$`Price(USD)` <-  gsub("(^\\$)|(\\*$)", "", cpus_bench$`Price(USD)`)
cpus_bench$`Price(USD)` <-  gsub(",", "", cpus_bench$`Price(USD)`)
cpus_bench$`Price(USD)` <- as.numeric(cpus_bench$`Price(USD)`)
head(cpus_bench)

## # A tibble: 6 × 5
##   `CPU Name`              `CPU Mark(higher is better)` Rank(lo…¹ CPU V…² Price…³
##                                                        
## 1 AArch64 rev 2 (aarch64)                         2246      2187      NA      NA
## 2 AArch64 rev 4 (aarch64)                         1797      2439      NA      NA
## 3 AC8257V/WAB                                      774      3269      NA      NA
## 4 AMD 3015Ce                                      2088      2263      NA      NA
## 5 AMD 3015e                                       2691      1969      NA      NA
## 6 AMD 3020e                                       2446      2069      NA      NA
## # … with abbreviated variable names ¹​`Rank(lower is better)`,
## #   ²​`CPU Value(higher is better)`, ³​`Price(USD)`

Now that we have the data in good shape, it is time to retrieve more information on CPU benchmarks. The cpus_bench contains information on 4080 CPUs, however, to make the bashboard more efective, I want to concentrate on the top-1000 CPUs according to the CPU Mark.

## Sort according to CPU Mark
cpus_bench <- cpus_bench[order(cpus_bench$`CPU Mark(higher is better)`, decreasing = T), ]
cpus_bench_1000 <- cpus_bench[1:1000,]
head(cpus_bench_1000, 10L)

## # A tibble: 10 × 5
##    `CPU Name`                        CPU Mark(higher i…¹ Rank(…² CPU V…³ Price…⁴
##                                                        
##  1 AMD EPYC 9654                                  124119       1    10.5  11805 
##  2 AMD Ryzen Threadripper PRO 5995WX               96237       2    14.5   6645.
##  3 AMD EPYC 7773X                                  90731       3    21.4   4249 
##  4 AMD EPYC 7763                                   85944       4    23.4   3665 
##  5 AMD EPYC 7J13                                   85661       5    NA       NA 
##  6 AMD EPYC 7713                                   85521       6    23.1   3700.
##  7 AMD EPYC 7713P                                  83439       7    18.3   4550 
##  8 AMD Ryzen Threadripper PRO 3995WX               83097       8    13.3   6267.
##  9 AMD EPYC 7V13                                   82878       9    NA       NA 
## 10 AMD Ryzen Threadripper 3990X                    81109      10    11.5   7069 
## # … with abbreviated variable names ¹​`CPU Mark(higher is better)`,
## #   ²​`Rank(lower is better)`, ³​`CPU Value(higher is better)`, ⁴​`Price(USD)`

Very interesting, in the top-10 we find only AMD processors…

To add the rest of the CPU benchmark we are going to take advantage of this simple function that takes the name of a CPU and creates an HTML link that will be used by a web scrapping algorithm to retrieve the CPU benchmarks.

i <- 1L
paste0("https://www.cpubenchmark.net/cpu.php?cpu=", gsub(" ", "\\+", cpus_bench_1000$`CPU Name`[i]))

## [1] "https://www.cpubenchmark.net/cpu.php?cpu=AMD+EPYC+9654"

The idea is to write a simple loop that would iterate over all the top-1000 CPUs and gather information on benchmarks, such as “integer_math(MOps/Sec)”,“floating_point_math(MOps/Sec)”,“find_prime_numbers(Million Primes/Sec)”, ect…

# read in HTML data
i <- 1L
df_bind <- list()
for(i in 1L:1000L){
webpage <- read_html(paste0("https://www.cpubenchmark.net/cpu.php?cpu=", gsub(" ", "\\+", cpus_bench_1000$`CPU Name`[i])))
tbls <- webpage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
df <- try(data.table(t(tbls[[2]][2])))
if(any(class(df)%in%"data.table")){
  if(ncol(df) == 9){
    setnames(df, t(tbls[[2]][1]))
    year <- as.integer(sub("^.*\\s", "",trimws(gsub('
.*$', '', gsub('^.*CPU First Seen on Charts:', "", webpage)))))
    gz <- as.integer(sub("\\s.*", "",trimws(gsub('
.*$', '', gsub('^.*Clockspeed:', "", webpage)))))
    cores <- as.integer(trimws(gsub('.*$', '', gsub('^.*Cores:', "", webpage))))
    threads <- as.integer(trimws(gsub('
.*$', '', gsub('^.*Threads:', "", webpage))))
    df_bind[[i]] <- cbind.data.frame(cpus_bench_1000[i,], df, gz, cores, threads, year)  
  }
}
}

cpus_bench_full <- rbindlist(df_bind, fill = TRUE)
head(cpus_bench_full)

The algorithm may seem a bit intimidating but it is actually quite simple. It is gathering pieces of information on specific parts of the HTML code. For instance, to gather the information on the CPU benchmarks we are always scrapping the second table from the website as tbls[[2]][2]. Then we are scrapping the begging and end of HTML tabs that contain useful information such as gz, cores and the threads from the HTML source code. The final data set looks like this:

## X cpu_name cpu_mark.higher_is_better. ## 1 1 AMD EPYC 9654 124119 ## 2 2 AMD Ryzen Threadripper PRO 5995WX 95829 ## 3 3 AMD EPYC 7773X 90731 ## 4 4 AMD EPYC 7763 85944 ## 5 5 AMD EPYC 7J13 85661 ## 6 6 AMD EPYC 7713 85521 ## cpu_value.higher_is_better. price_usd integer_math.MOps.Sec. ## 1 10.51 11805.00 978227 ## 2 14.98 6399.00 631867 ## 3 21.11 4299.00 533457 ## 4 23.26 3695.00 547840 ## 5 NA NA 555507 ## 6 23.11 3699.99 533785 ## floating_point_math.MOps.Sec. find_prime_numbers.Million.Primes.Sec. ## 1 522611 NA ## 2 343904 676 ## 3 301129 NA ## 4 299973 665 ## 5 300486 686 ## 6 272582 621 ## random_string_sorting.Thousand.Strings.Sec. data_encryption.MBytes.Sec. ## 1 NA 187949 ## 2 676 132563 ## 3 NA 135770 ## 4 665 124591 ## 5 686 123954 ## 6 621 107100 ## data_compression.MBytes.Sec. physics.Frames.Sec. ## 1 NA NA ## 2 NA NA ## 3 NA NA ## 4 NA NA ## 5 NA NA ## 6 NA NA ## extended_instructions.Million.Matrices.Sec. single_thread.MOps.Sec. ghz cores ## 1 200277 2893 2 96 ## 2 123388 3302 2 64 ## 3 91298 2513 2 64 ## 4 98801 2576 2 64 ## 5 99971 2449 2 64 ## 6 94897 2718 2 64 ## threads year ## 1 192 2022 ## 2 128 2022 ## 3 128 2022 ## 4 128 2021 ## 5 128 2021 ## 6 128 2021

Dasboard in Plotly Dash from Python

The library that I am going to use to create the dashboard is called Plotly Dash. Plotly Dash has several advantages for deploying a static web application compared to other libraries. Firstly, it has high level of interactivity, meaning that users are able to play around with the data, apply filters, and perform various operations. Secondly, in my view, it is also flexible as it allows users to create an customize different plots and layouts, and it is relatively easy to customize. Thirdly, it has a high level of integration specially with Pandas and Numpy that are the main libraries that are commonly use of data science in Python. Finally, the library is relatively easy to deploy at zero cost as a static website that can be easily embedded or used as a stand alone service.

Don’t let the code overwhelm you! The structure of the dashboard python code is simple, we start by loading the packages that we are going to use. The first line loads the entire Dash library to be used in the script and the the next three lines import specific modules from the Dash library, which are dcc, html, and dash_table. These modules are needed to create the visual components of the dashboard such as tables, dropdowns menus, graphs, and other HTML elements. Pandas is imported in order to read and manipulate the dataset, which is stored in a CSV file. Then, the Plotly graph objects (graph_objs) are imported to create a pie chart and the Plotly express (px) is imported to create a scatter plot. Afther we have loaded the libraries and modules, we load the dataset using read_csv method from Pandas, either from a local file or from a remote URL, in this case I am exporting the data from my Github repository.

The app.layout is where the components of the dashboard are defined such as tables, dropdowns, graphs, and other HTML elements. Here, you can be creative and write a layout that is both visually appealing and functional. I am going for a simple design with a heather, html.H1('CPU Benchmark Data'), and a slim line that works as separator between the sections of the dashboard html.Hr(),. I start the dashboard presenting a preview of the top-10 rows of the dataset using the function dash_table.DataTable(). The function has several arguments, but perhaps the most important one is the data source data=df.head(10).to_dict('records'),which displays only the first 10 rows of the data.

After the data preview, I define another line html.Br(), to mark the beginning of another section of the web application, followed by the function, html.H4('Histogram variable:'), that displays a tittle of the histogram. Next, I define a dropdown menu to select between each column of the dataset:

dcc.Dropdown( id='variable-selector', options=[{'label': i, 'value': i} for i in df.columns], value='cpu_value(higher_is_better)' )

This is followed by two bottoms that are used to sort the table in an ascending or descending manner according to the variable selected. This snipped of code is the following one:

dcc.RadioItems( id='sort-order', options=[{'label': i, 'value': i} for i in ['Ascending', 'Descending']], value='Ascending', labelStyle={'display': 'inline-block'} )

And finally, we display the histogram using the following function:

dcc.Graph( id='histogram', figure={} )

The rest of the layout follows the same mechanics. I define a dropdown menu, dcc.Dropdown(), to select a variable for the next plot then I render the plot using the same dcc.Graph() function.

After defining the app.layout, we have to write the app.callback decorator that is used to bind the input/output of the interactive components (i.e., the dropdown menus) to the graph. Furthermore, the update_histogram_and_table, update_pie_chart, and update_scatter_plot functions that are the callback functions that update the graph based on the user input.

import dash from dash import dcc from dash import html from dash import dash_table import pandas as pd import plotly.graph_objs as go # for the pie chart import plotly.express as px # for the scatter plot external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css'] app = dash.Dash(__name__, external_stylesheets=external_stylesheets) # Load the dataset #df = pd.read_csv("test_cpus.csv") df = pd.read_csv("https://raw.githubusercontent.com/Wario84/blog/main/assets/data/test_cpus.csv") app.layout = html.Div([ html.H1('CPU Benchmark Data'), html.Hr(), html.H3('Data Preview:'), dash_table.DataTable( id='table', columns=[{"name": i, "id": i} for i in df.columns], data=df.head(10).to_dict('records'), style_table={'overflowX': 'auto'}, style_cell={'textAlign': 'left'}, sort_action='native', page_action='none', style_data_conditional=[{ 'if': {'row_index': 'odd'}, 'backgroundColor': 'rgb(248, 248, 248)' }] ), html.Br(), html.H4('Histogram variable:'), dcc.Dropdown( id='variable-selector', options=[{'label': i, 'value': i} for i in df.columns], value='cpu_value(higher_is_better)' ), dcc.RadioItems( id='sort-order', options=[{'label': i, 'value': i} for i in ['Ascending', 'Descending']], value='Ascending', labelStyle={'display': 'inline-block'} ), dcc.Graph( id='histogram', figure={} ), html.Br(), html.H4('Pie-chart variable:'), dcc.Dropdown( id="variable-selector-2", options=[ {"label": "Ghz", "value": "ghz"}, {"label": "Cores", "value": "cores"}, {"label": "Threads", "value": "threads"}, {"label": "Year", "value": "year"}, #"ghz","cores","threads" ], #style={"width": "45%"} value="cores" ), dcc.Graph(id="pie-chart"), html.Br(), html.H4('Scatter-Plot variable:'), dcc.Dropdown( id='variable-selector-3', options=[{'label': i, 'value': i} for i in df.columns], value='cpu_name' ), dcc.Graph(id="scatter-plot"), ]) @app.callback( [dash.dependencies.Output('histogram', 'figure'), dash.dependencies.Output('table', 'data')], [dash.dependencies.Input('variable-selector', 'value'), dash.dependencies.Input('sort-order', 'value')] ) def update_histogram_and_table(variable, sort_order): df_sorted = df.sort_values(by='cpu_value(higher_is_better)', ascending=False) if sort_order == 'Ascending': df_sorted = df_sorted.iloc[::-1] data_table = df_sorted.head(10).to_dict('records') fig = { 'data': [{ 'x': df[variable], 'type': 'histogram' }], 'layout': { 'title': 'Histogram of ' + variable, 'xaxis': {'title': variable}, 'yaxis': {'title': 'Count'} } } return fig, data_table @app.callback( dash.dependencies.Output("pie-chart", "figure"), [dash.dependencies.Input("variable-selector-2", "value")] ) def update_pie_chart(selected_column): #filtered_df = df[df['year'] == selected_column] values = df[selected_column].value_counts().values labels = df[selected_column].value_counts().index fig = go.Figure(data=[go.Pie(labels=labels, values=values)]) #fig.update_layout(title=f"{selected_column} distribution in {selected_column}") return fig @app.callback( dash.dependencies.Output("scatter-plot", "figure"), [dash.dependencies.Input("variable-selector-3", "value")] ) def update_scatter_plot(variable): return px.scatter(df, x="price_usd", y=variable).update_layout( xaxis={"title": "Price (USD)"}, yaxis={"title": variable.capitalize()}, margin={"l": 40, "b": 40, "t": 10, "r": 10}, height=300, ) if __name__ == '__main__': app.run_server(debug=False)

Deploying as a static website

To finally deploy the dashboard as web application, I am going to rely on this video put forward by the people from Plotly: Deploy your Python Data App to the Web for Free - Dash. The procedure is step by step, and it very simple, first, we put the .py python script in a public Github repository. Then we open an account on render.com and follow a simple procedure.

The final CPU Dashboard

Finally, I present you the CPU benchmark Dashboard. But for a better experience and visualization, I invite you to check out the static website at cpu-benchmark-plotly-dash.onrender.com

Dynamic network of collaboration in Machine Learning using R and Python.

2022-10-07T00:00:00+00:00

Introduction

In this blog entry, I will use data from Web of Science to draw a network of collaboration in the field of Machine Learning. I am going to disentangle the core universities that have published the top highly cited 2878 articles about Machine Learning in Web of Science. These are the most important scientific contributions to the field downloaded in October 2022.

I use data of Web of Science which is the most widely used database of research publications and citations. Most universities have a license to use this database for research purposes. My query is simple, I use the multidisciplinary Web of Science Core collection searching on the publication’s Tittle, Abstract or Keywords the word “Machine Learning”. Later I filter subsetting only the highly cited publications in the field.

Libraries

library(data.table) library(ggplot2) library(svglite)

## Warning: package 'svglite' was built under R version 4.2.1

library(igraph)

## ## Attaching package: 'igraph' ## The following objects are masked from 'package:stats': ## ## decompose, spectrum ## The following object is masked from 'package:base': ## ## union

Data

csvs <- dir(pattern = "savedrecs.*csv$") csvs <- lapply(csvs, fread) csvs <- rbindlist(csvs) dim(csvs)

## [1] 2878 72

General Approach

I will use R and Regex(regular expressions) to clean the address field of the publication to extract the affiliations of the authors. Then I will use the d3graph library of Python to produce a dynamic network of university collaboration.

Use Regex to clean the data

Now, that we have put together the files, it’s time to extract the data from the Addresses field. Looking closely at this column:

csvs$Addresses[1:3]

## [1] "[Muehlematter, Urs J.] Univ Zurich, Univ Hosp Zurich, Inst Diagnost & Intervent Radiol, Zurich, Switzerland; [Daniore, Paola; Vokinger, Kerstin N.] Univ Zurich, Inst Law, CH-8001 Zurich, Switzerland" ## [2] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China; [Zou, Quan] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China" ## [3] "[Raissi, Maziar; Karniadakis, George Em] Brown Univ, Div Appl Math, Providence, RI 02912 USA"

It is clear that this string has a pattern in which the authors are surrounded by square brackets, for instance [Muehlematter, Urs J.], and immediately after the record reports the university Univ Zurich. If the article is published by two or more different universities the field will be separated by a ; semicolon.

# First we aim for separating authors: samp <- unlist(strsplit(csvs$Addresses[2], "; \\[" )) samp

## [1] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China" ## [2] "Zou, Quan] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"

# Then we aim to extract the universities samp <- gsub(".*\\] ", "", samp) samp

## [1] "Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China" ## [2] "Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China"

# We have to clean everything after the comma samp <- gsub(",.*", "", samp) samp

## [1] "Hunan Univ" "Univ Elect Sci & Technol China"

Build the dataframe publication-university

Perfect, now we apply this to the whole dataset. We have to give the universities a unique identifier if they collaborate in the same publication.

ml_data <- list() i <- 1L for (i in 1L:nrow(csvs)) { temp <- unlist(strsplit(csvs$Addresses[i], "; \\[" )) temp <- gsub(".*\\] ", "", temp) temp <- gsub(",.*", "", temp) if(length(temp)>0){ ml_data[[i]] <- data.frame(id=i, univ=temp) } } ml_data <- rbindlist(ml_data) # We have 13074 in total dim(ml_data)

## [1] 13074 2

# Subset only to the top 50 universities ml_data <- ml_data[univ %in% names(sort(table(ml_data$univ), decreasing = T)[1:50]), ]

The edgelist

The edgelist is the key input that we need to plot the network. However, we have to perform additional data manipulation before we have a list of pairs of universities. The data so far contains pairs of c(publication, university), however, what we need is a list that contains pairs of universities when they work together in a project c(university, university). The article by RPubs (2022), describes more about this type of conversion from the theoretical angle. From the data science perspective, I show here Gonzalez-Sauri (2022) several ways to perform this transformation.

ml_data

## id univ ## 1: 2 Univ Elect Sci & Technol China ## 2: 4 MIT ## 3: 4 Northwestern Univ ## 4: 7 Chinese Acad Sci ## 5: 8 Univ Elect Sci & Technol China ## --- ## 2661: 2873 Tianjin Univ ## 2662: 2873 Natl Univ Singapore ## 2663: 2875 Wuhan Univ ## 2664: 2875 Univ Cambridge ## 2665: 2877 Univ Calif Berkeley

edge_lst <- merge(ml_data, ml_data, by = "id", allow.cartesian = TRUE) edge_lst <- edge_lst[edge_lst$univ.x != edge_lst$univ.y, -1L] dim(edge_lst)

## [1] 6594 2

I want to differentiate the strength of the link or edge, so, I will calculate the betweenness centrality at the level of the edge. I will append this to the dataset and then export it to a csv-file.

g1 <- igraph::graph_from_data_frame(edge_lst, directed = F) edge_lst[, weight:= edge.betweenness(g1)] setnames(edge_lst, c("source", "target", "weight")) fwrite(edge_lst, "edge_lst.csv")

Top Universities in Machine Learning

Just for curiosity lets look at the top 20 universities working in the field of Machine Learning.

top_ml <- sort(table(ml_data$univ), decreasing = T)[1:20] top_ml <- data.frame(top_ml) colnames(top_ml) <- c("univ", "pubs") p <- ggplot(top_ml, aes(x = univ, y = pubs, fill = pubs)) + geom_bar(stat = "identity") + theme_minimal() + theme(axis.text.x = element_text( angle = 45, #vjust = 0.5, hjust = 1 )) # save the picture ggsave(file="top_ml_univ.svg", plot=p, width=16, height=10) p

Python Dynamic Network

For the network, I will use the d3graph library created by Taskesen (2022).

import pandas as pd from d3graph import d3graph, vec2adjmat # Import data df = pd.read_csv("./blog/edge_lst.csv") # Show the input data print(df) # Create an adjaceny matrix adjmat = vec2adjmat(source=df["source"].tolist(), target=df["target"].to_list()) # Initialize d3 = d3graph() # Build force-directed graph with default settings d3.graph(adjmat) # Show graph d3.show()

Static Networks

The results are quite nice. First I would like to show the results filtering universities that have 20 edges (co-authored publications) or more.

Then we have the network when we filter only universities with more than 45 connections.

Dynamic Network

Finally, we have the main dynamic network that we can use to display several thresholds of network connections.

References

Gonzalez-Sauri. 2022. “What its the most efficient method to create an edgelist/adjacency matrix from two sets of IDs?” Stack Overflow. https://stackoverflow.com/questions/42764954/what-its-the-most-efficient-method-to-create-an-edgelist-adjacency-matrix-from-t.

RPubs. 2022. “RPubs - Bipartite/Two-Mode Networks in igraph.” https://rpubs.com/pjmurphy/317838.

Taskesen, Erdogan. 2022. “d3graph.” GitHub. https://github.com/erdogant/d3graph.

Introduction to Data Transformation with Python and Pandas.

2022-10-04T00:00:00+00:00

Importing Modules

import numpy as np import pandas as pd

Loading Data

For this tutorial you need to download the

The avocados, homelessness.csv, sales_subset.csv and temperatures.csv dataset from Kaggle’s website.

The COVID19 Daily Updatesdata from Kaggle’s website.

Creating Dataframes

A Dataframe from a list

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']] df11 = pd.DataFrame(a, columns=['one', 'two', 'three']) print(df11)

one two three 0 a 1.2 4.2 1 b 70 0.03 2 x 5 0

A Dataframe from Array

Example 1

dates = pd.date_range("20130101", periods=6) # pandas indexes r_num = np.random.randn(6, 4) # array df = pd.DataFrame(r_num, index=dates, columns=list("ABCD")) print(df)

A B C D 2013-01-01 0.817994 -0.924007 -1.515711 -0.198598 2013-01-02 0.673364 -1.914110 -0.126208 -0.282033 2013-01-03 1.312579 0.340656 -0.300397 -0.838614 2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784 2013-01-05 -2.045106 -0.929131 -0.029660 0.529883 2013-01-06 -1.343257 -0.250821 -0.046303 0.944569

Example 2

r_num = np.array(range(1,26,1)) # Convert 1D array to a 2D numpy array of 5 rows and 5 columns arr_2d = np.reshape(r_num, (5, 5)) df4 = pd.DataFrame(arr_2d) print(df4) print(type(df4))

0 1 2 3 4 0 1 2 3 4 5 1 6 7 8 9 10 2 11 12 13 14 15 3 16 17 18 19 20 4 21 22 23 24 25

A Dataframe from a dictionary

dict = { "A": 1.0, "B": pd.Timestamp("20130102"), "C": pd.Series(1, index=list(range(4)), dtype="float32"), "D": np.array([3] * 4, dtype="int32"), "E": pd.Categorical(["test", "train", "test", "train"]), "F": "foo", } df2 = pd.DataFrame(dict) print(df2)

A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo

A Dataframe froma list of dictionaries

# Create a list of dictionaries with new data df9 = [ {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071}, {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348}, ] # Convert list into DataFrame df9 = pd.DataFrame(df9) # Print the new DataFrame print(df9)

date small_sold large_sold 0 2019-11-03 10376832 7835071 1 2019-11-10 10717154 8561348

A Dataframe from CSV

To print df3 I use the head() method to print only the top rows.

# Read the csv from working directory df3 = pd.read_csv("homelessness.csv") print(df3.head())

Unnamed: 0 region ... family_members state_pop 0 0 East South Central ... 864.0 4887681 1 1 Pacific ... 582.0 735139 2 2 Mountain ... 2606.0 7158024 3 3 West South Central ... 432.0 3009733 4 4 Pacific ... 20964.0 39461588 [5 rows x 6 columns]

Another example of importing from csv.

df6 = pd.read_csv("sales_subset.csv") print(df6.head())

Unnamed: 0 store type ... temperature_c fuel_price_usd_per_l unemployment 0 0 1 A ... 5.727778 0.679451 8.106 1 1 1 A ... 8.055556 0.693452 8.106 2 2 1 A ... 16.816667 0.718284 7.808 3 3 1 A ... 22.527778 0.748928 7.808 4 4 1 A ... 27.050000 0.714586 7.808 [5 rows x 10 columns]

df7 = pd.read_csv("temperatures.csv") print(df7.head())

Unnamed: 0 date city country avg_temp_c 0 0 2000-01-01 Abidjan Côte D'Ivoire 27.293 1 1 2000-02-01 Abidjan Côte D'Ivoire 27.685 2 2 2000-03-01 Abidjan Côte D'Ivoire 29.061 3 3 2000-04-01 Abidjan Côte D'Ivoire 28.162 4 4 2000-05-01 Abidjan Côte D'Ivoire 27.547

df8 = pd.read_csv("covid-19-all.csv") print(df8.head())

:1: DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False. Country/Region Province/State Latitude ... Recovered Deaths Date 0 NaN NaN NaN ... 41727.0 2191.0 2021-01-01 1 NaN NaN NaN ... 33634.0 1181.0 2021-01-01 2 NaN NaN NaN ... 67395.0 2762.0 2021-01-01 3 NaN NaN NaN ... 7463.0 84.0 2021-01-01 4 NaN NaN NaN ... 11146.0 405.0 2021-01-01 [5 rows x 8 columns]

Methods and Attributes of a DataFrame

In Python, there are specific methods, or operations that can be performed for each data structure. Similarly, there are specific characteristics of the data structures called attributes. We can assess all the methods and attributes associated with a data structure using the following lines:

att_meth = dir(df) print(att_meth)

['A', 'B', 'C', 'D', 'T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_accum_func', '_add_numeric_operations', '_agg_by_level', '_agg_examples_doc', '_agg_summary_and_see_also_doc', '_align_frame', '_align_series', '_append', '_arith_method', '_as_manager', '_attrs', '_box_col_values', '_can_fast_transpose', '_check_inplace_and_allows_duplicate_labels', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_cmp_method', '_combine_frame', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_from_arguments', '_construct_result', '_constructor', '_constructor_sliced', '_convert', '_count_level', '_data', '_dir_additions', '_dir_deletions', '_dispatch_frame_op', '_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_flags', '_from_arrays', '_from_mgr', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cleaned_column_resolvers', '_get_column_array', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_value', '_getitem_bool_array', '_getitem_multilevel', '_gotitem', '_hidden_attrs', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_inplace_method', '_internal_names', '_internal_names_set', '_is_copy', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_view', '_iset_item', '_iset_item_mgr', '_iset_not_inplace', '_item_cache', '_iter_column_arrays', '_ixs', '_join_compat', '_logical_func', '_logical_method', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_mgr', '_min_count_stat_function', '_needs_reindex_multi', '_protect_consolidate', '_reduce', '_reduce_axis1', '_reindex_axes', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_rename', '_replace_columnwise', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_series', '_set_axis', '_set_axis_name', '_set_axis_nocheck', '_set_is_copy', '_set_item', '_set_item_frame_value', '_set_item_mgr', '_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_stat_function', '_stat_function_ddof', '_take_with_is_copy', '_to_dict_of_blocks', '_typ', '_update_inplace', '_validate_dtype', '_values', '_where', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'attrs', 'axes', 'backfill', 'between_time', 'bfill', 'bool', 'boxplot', 'clip', 'columns', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'iteritems', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'values', 'var', 'where', 'xs']

All methods have parenthesis and typically they take arguments. However attributes do no have parenthesis. Attributes and methods can be called using the . notation.

DataFrame Attributes

Dimension and data types

# Print the dimension of the dataframe print(df3.shape) # Print the dataframe column types print(df3.dtypes)

(51, 6) Unnamed: 0 int64 region object state object individuals float64 family_members float64 state_pop int64 dtype: object

Columns and Rows

# Print the column index of dataframe print(df3.columns) # Print the row index of dataframe print(df3.index)

Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object') RangeIndex(start=0, stop=51, step=1)

DataFrame Methods

Describe the Data Frame

# First Rows print(df.head()) # Last Rows print(df.tail())

A B C D 2013-01-01 0.817994 -0.924007 -1.515711 -0.198598 2013-01-02 0.673364 -1.914110 -0.126208 -0.282033 2013-01-03 1.312579 0.340656 -0.300397 -0.838614 2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784 2013-01-05 -2.045106 -0.929131 -0.029660 0.529883 A B C D 2013-01-02 0.673364 -1.914110 -0.126208 -0.282033 2013-01-03 1.312579 0.340656 -0.300397 -0.838614 2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784 2013-01-05 -2.045106 -0.929131 -0.029660 0.529883 2013-01-06 -1.343257 -0.250821 -0.046303 0.944569

Information

print(df.info())

DatetimeIndex: 6 entries, 2013-01-01 to 2013-01-06 Freq: D Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 A 6 non-null float64 1 B 6 non-null float64 2 C 6 non-null float64 3 D 6 non-null float64 dtypes: float64(4) memory usage: 240.0 bytes None

Sort a DataFrame

# Sortdf3by individuals (ascending) df3_ind = df3.sort_values("individuals") # Print the top few rows print(df3_ind.head()) # Sortdf3by individuals (descending) df3_ind = df3.sort_values("individuals", ascending=False) # Print the top few rows print(df3_ind.head()) # Sortdf3by region ascending, then descending family members df3_reg_fam = df3.sort_values(['region', 'family_members'], ascending=[True, False]) # Print the top few rows print(df3_reg_fam.head())

Unnamed: 0 region ... family_members state_pop 50 50 Mountain ... 205.0 577601 34 34 West North Central ... 75.0 758080 7 7 South Atlantic ... 374.0 965479 39 39 New England ... 354.0 1058287 45 45 New England ... 511.0 624358 [5 rows x 6 columns] Unnamed: 0 region ... family_members state_pop 4 4 Pacific ... 20964.0 39461588 32 32 Mid-Atlantic ... 52070.0 19530351 9 9 South Atlantic ... 9587.0 21244317 43 43 West South Central ... 6111.0 28628666 47 47 Pacific ... 5880.0 7523869 [5 rows x 6 columns] Unnamed: 0 region ... family_members state_pop 13 13 East North Central ... 3891.0 12723071 35 35 East North Central ... 3320.0 11676341 22 22 East North Central ... 3142.0 9984072 49 49 East North Central ... 2167.0 5807406 14 14 East North Central ... 1482.0 6695497 [5 rows x 6 columns]

Indexing

By default when we create a dataframe from scratch, Pandas assigns two indexes for rows and columns using integers ranging from zero until the last value, for instance, the df4 that was created from a 2D array.

Column index

df4.columns

RangeIndex(start=0, stop=5, step=1)

Row index

df4.index

RangeIndex(start=0, stop=5, step=1)

However, in dataframes created from a csv we have column names by default and whole numbers as indexes of the rows. For instance, the dataframes df6 and df7.

Column index

# Examples of column indexes df6.columns df7.columns

Index(['Unnamed: 0', 'store', 'type', 'department', 'date', 'weekly_sales', 'is_holiday', 'temperature_c', 'fuel_price_usd_per_l', 'unemployment'], dtype='object') Index(['Unnamed: 0', 'date', 'city', 'country', 'avg_temp_c'], dtype='object')

Row index

# Examples of row indexes df6.index df6.index

RangeIndex(start=0, stop=10774, step=1) RangeIndex(start=0, stop=10774, step=1)

Changing column names is straightforward:

# Old print(df9) df9.columns = ['A', 'B', 'C'] # New print(df9)

date small_sold large_sold 0 2019-11-03 10376832 7835071 1 2019-11-10 10717154 8561348 A B C 0 2019-11-03 10376832 7835071 1 2019-11-10 10717154 8561348

Is possible to use one column as a row index, as follows:

df7_1 = df7.set_index("city") df7_1.head()

Unnamed: 0 date country avg_temp_c city Abidjan 0 2000-01-01 Côte D'Ivoire 27.293 Abidjan 1 2000-02-01 Côte D'Ivoire 27.685 Abidjan 2 2000-03-01 Côte D'Ivoire 29.061 Abidjan 3 2000-04-01 Côte D'Ivoire 28.162 Abidjan 4 2000-05-01 Côte D'Ivoire 27.547

Dropping the index works as follows:

df7_1.reset_index().head()

city Unnamed: 0 date country avg_temp_c 0 Abidjan 0 2000-01-01 Côte D'Ivoire 27.293 1 Abidjan 1 2000-02-01 Côte D'Ivoire 27.685 2 Abidjan 2 2000-03-01 Côte D'Ivoire 29.061 3 Abidjan 3 2000-04-01 Côte D'Ivoire 28.162 4 Abidjan 4 2000-05-01 Côte D'Ivoire 27.547

Once row-indexed, ideally, you would like to sort the dataframe according to this index.

print(df7_1.sort_index())

Unnamed: 0 date country avg_temp_c city Abidjan 0 2000-01-01 Côte D'Ivoire 27.293 Abidjan 106 2008-11-01 Côte D'Ivoire 27.302 Abidjan 107 2008-12-01 Côte D'Ivoire 27.472 Abidjan 108 2009-01-01 Côte D'Ivoire 26.912 Abidjan 109 2009-02-01 Côte D'Ivoire 28.224 ... ... ... ... ... Xian 16391 2004-09-01 China 17.889 Xian 16392 2004-10-01 China 11.229 Xian 16393 2004-11-01 China 5.720 Xian 16395 2005-01-01 China -2.209 Xian 16499 2013-09-01 China NaN [16500 rows x 4 columns]

Subsetting rows

To subset specific values of the dataframe we can filter columns using relational operators (Boolean conditions) to return True of False subsets of the dataframe. There are at least two common ways to achieve this same objective. Imagine, you want to subset the rows from df3 where the column individuals is greater than 10000.

For example, passing the column as an attribute

df3[df3.individuals>10000]

Unnamed: 0 region ... family_members state_pop 4 4 Pacific ... 20964.0 39461588 9 9 South Atlantic ... 9587.0 21244317 32 32 Mid-Atlantic ... 52070.0 19530351 37 37 Pacific ... 3337.0 4181886 43 43 West South Central ... 6111.0 28628666 47 47 Pacific ... 5880.0 7523869 [6 rows x 6 columns]

Alternatively, we can directly subset passing df3["individuals"]>10000, which is transformed into an object of

print(type(df3["individuals"]>10000)) df3[df3["individuals"]>10000]

Unnamed: 0 region ... family_members state_pop 4 4 Pacific ... 20964.0 39461588 9 9 South Atlantic ... 9587.0 21244317 32 32 Mid-Atlantic ... 52070.0 19530351 37 37 Pacific ... 3337.0 4181886 43 43 West South Central ... 6111.0 28628666 47 47 Pacific ... 5880.0 7523869 [6 rows x 6 columns]

Another way to subset different rows is by using the method loc, which takes advantage of the rows and columns indexes.

Subsetting rows using the loc

We use the [rows, columns] brackets after the dataframe to distinguish between rows and columns.

Indexes of rows and columns are typically strings nested in list objects.

To subset range of rows or columns is easy with the : slicing operator

Let’s look a the df7, first we set the column city as a row index, and then we pass the list cities to map all the rows of the cities "Moscow", "Saint Petersburg".

cities = ["Abidjan", "Xian"] df7_1.loc[cities].head()

Unnamed: 0 date country avg_temp_c city Abidjan 0 2000-01-01 Côte D'Ivoire 27.293 Abidjan 1 2000-02-01 Côte D'Ivoire 27.685 Abidjan 2 2000-03-01 Côte D'Ivoire 29.061 Abidjan 3 2000-04-01 Côte D'Ivoire 28.162 Abidjan 4 2000-05-01 Côte D'Ivoire 27.547

Going further we can specify multilevel indexes, that is, indexes nested inside other indexes. Lets give an example were we nest the column country inside the index of city from the dataset df7.

# Index df7 by country & city df7_1 = df7.set_index(["country", "city"]) # List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")] # this is a list print(df7_1.loc[rows_to_keep].head())

Unnamed: 0 date avg_temp_c country city Brazil Rio De Janeiro 12540 2000-01-01 25.974 Rio De Janeiro 12541 2000-02-01 26.699 Rio De Janeiro 12542 2000-03-01 26.270 Rio De Janeiro 12543 2000-04-01 25.750 Rio De Janeiro 12544 2000-05-01 24.356

Now that we have two indexes is possible to sort the data accordingly

df7_1.sort_index(level=["country", "city"], ascending=[True, False]).head()

Unnamed: 0 date avg_temp_c country city Afghanistan Kabul 7260 2000-01-01 3.326 Kabul 7261 2000-02-01 3.454 Kabul 7262 2000-03-01 9.612 Kabul 7263 2000-04-01 17.925 Kabul 7264 2000-05-01 24.658

Using the loc method is easy to slice blocks of values using the indexes. For instance, in the df7_1, that was indexed and sorted, we can pass the range "Pakistan":"Russia", which will subset this set of values from the dataframe. Is important to sort the data.frame before passing the indexes of rows as follows:

df7_1 = df7_1.sort_index() df7_1.loc["Pakistan":"Russia"]

Unnamed: 0 date avg_temp_c country city Pakistan Faisalabad 4785 2000-01-01 12.792 Faisalabad 4786 2000-02-01 14.339 Faisalabad 4787 2000-03-01 20.309 Faisalabad 4788 2000-04-01 29.072 Faisalabad 4789 2000-05-01 34.845 ... ... ... ... Russia Saint Petersburg 13360 2013-05-01 12.355 Saint Petersburg 13361 2013-06-01 17.185 Saint Petersburg 13362 2013-07-01 17.234 Saint Petersburg 13363 2013-08-01 17.153 Saint Petersburg 13364 2013-09-01 NaN [1155 rows x 3 columns]

Finally, given that we have indexed the df7_1 according to two columns, we can subset pairs of index values and return a given set of rows.

df7_1.loc[("Pakistan","Lahore"):("Russia", "Moscow")].head()

Unnamed: 0 date avg_temp_c country city Pakistan Lahore 8415 2000-01-01 12.792 Lahore 8416 2000-02-01 14.339 Lahore 8417 2000-03-01 20.309 Lahore 8418 2000-04-01 29.072 Lahore 8419 2000-05-01 34.845

Subsetting rows using the iloc

Using iloc with a Dataframe is similar to the loc:

We use the [rows, columns] brackets after the dataframe to distinguish between rows and columns.

Indexes of rows and columns are typically integers starting from zero.

To subset range of rows or columns is easy with the : slicing operator

Subset the first five rows

df7.iloc[:5, :]

Unnamed: 0 date city country avg_temp_c 0 0 2000-01-01 Abidjan Côte D'Ivoire 27.293 1 1 2000-02-01 Abidjan Côte D'Ivoire 27.685 2 2 2000-03-01 Abidjan Côte D'Ivoire 29.061 3 3 2000-04-01 Abidjan Côte D'Ivoire 28.162 4 4 2000-05-01 Abidjan Côte D'Ivoire 27.547

Subsetting rows in two columns or more

A more complex example involves a subset involving filtering of two columns. We are going to subset in two levels grouping each condition inside () parenthesis as follows:

print(df3[(df3["state_pop"] > 5000000) & (df3["individuals"]> 5000)])

Unnamed: 0 region ... family_members state_pop 2 2 Mountain ... 2606.0 7158024 4 4 Pacific ... 20964.0 39461588 5 5 Mountain ... 3250.0 5691287 9 9 South Atlantic ... 9587.0 21244317 10 10 South Atlantic ... 2556.0 10511131 13 13 East North Central ... 3891.0 12723071 21 21 New England ... 13257.0 6882635 22 22 East North Central ... 3142.0 9984072 30 30 Mid-Atlantic ... 3350.0 8886025 32 32 Mid-Atlantic ... 52070.0 19530351 33 33 South Atlantic ... 2817.0 10381615 35 35 East North Central ... 3320.0 11676341 38 38 Mid-Atlantic ... 5349.0 12800922 42 42 East South Central ... 1744.0 6771631 43 43 West South Central ... 6111.0 28628666 47 47 Pacific ... 5880.0 7523869 [16 rows x 6 columns]

Another example, in df3 filter for rows where family_members is less than 1000 and region is Pacific.

df3[(df3.family_members < 1000) & (df3.region == "Pacific")]

Unnamed: 0 region state individuals family_members state_pop 1 1 Pacific Alaska 1434.0 582.0 735139

Finally, we can subset from a list of options using the method isin().

# The Mojave Desert states canu = ["California", "Arizona", "Nevada", "Utah"] # Filter for rows in the Mojave Desert states df3[df3.state.isin(canu)].head()

Unnamed: 0 region state individuals family_members state_pop 2 2 Mountain Arizona 7259.0 2606.0 7158024 4 4 Pacific California 109008.0 20964.0 39461588 28 28 Mountain Nevada 7058.0 486.0 3027341 44 44 Mountain Utah 1904.0 972.0 3153550

Subsetting columns

To subset, it is necessary to know the column names of the dataframe. For instance, the column names of the dataframe df4 can be retrieved using the attribute df4.columns.values, as follows:

print(df4.columns.values)

[0 1 2 3 4]

Knowing the column index it is easy then to subset the dataframe:

# First column df4[[0]] # Last column df4[[4]]

0 0 1 1 6 2 11 3 16 4 21 4 0 5 1 10 2 15 3 20 4 25

Another example using the dataframe df3 whose column indexes are

print(df3.columns.values)

['Unnamed: 0' 'region' 'state' 'individuals' 'family_members' 'state_pop']

We can pass a list of column names to subset the state and family_members columns as follows:

# select two columns df3[["state", "family_members"]].head()

state family_members 0 Alabama 864.0 1 Alaska 582.0 2 Arizona 2606.0 3 Arkansas 432.0 4 California 20964.0

To subset columns using the loc method we have to select the rows and columns that we are selecting. If we intend to select all the rows and we only are subsetting columns, we have to pass the : slice operator.

df3.loc[:, ["state", "family_members"]].head()

state family_members 0 Alabama 864.0 1 Alaska 582.0 2 Arizona 2606.0 3 Arkansas 432.0 4 California 20964.0

Another example using the df7

df7.loc[:, ["city", "country"]].head()

city country 0 Abidjan Côte D'Ivoire 1 Abidjan Côte D'Ivoire 2 Abidjan Côte D'Ivoire 3 Abidjan Côte D'Ivoire 4 Abidjan Côte D'Ivoire

Similarly, to the loc method, subsetting columns with the iloc method uses integers to map columns. For instance subsetting the columns ["city", "country"] from df7. The slice operators : using the iloc method states that we are calling all the rows or columns.

df7.columns df7.iloc[:, [1,3]].head()

Index(['Unnamed: 0', 'date', 'city', 'country', 'avg_temp_c'], dtype='object') date country 0 2000-01-01 Côte D'Ivoire 1 2000-02-01 Côte D'Ivoire 2 2000-03-01 Côte D'Ivoire 3 2000-04-01 Côte D'Ivoire 4 2000-05-01 Côte D'Ivoire

Subsetting rows and columns

First, locate the rows of the second column that comply with the rule. In this example, we locate elements that are divisible by 2.

# Subset elements of the second column that are divisible by two rows = df4[1] % 2 == 0 print(type(rows))

Notice that Python creates an object of type pandas.core.series.Series. Next, we use this object to subset the elements of the second column in the following way:

df5 = df4[[1]][rows] print(df5.shape) print(type(df5)) print(df5)

(3, 1) 1 0 2 2 12 4 22

Another method of subsetting rows and columns is by using indexes and the loc method. Recall that this method works only if the dataset contains indexes and is sorted accordingly. To refresh these steps, let’s perform the following operations:

df7_1 = df7.set_index(["country", "city"]) # set the indexes df7_1 = df7_1.sort_index() # sort the dataframe descending

Now we can subset all rows that contain the country Zimbabwe on the column date, as follows:

df7_1.loc["Zimbabwe", "date"].head() type(df7_1.index.values)

city Harare 2000-01-01 Harare 2000-02-01 Harare 2000-03-01 Harare 2000-04-01 Harare 2000-05-01 Name: date, dtype: object

Subsetting rows and columns using the iloc

This method is similar to the loc method, but instead of using strings as indexes, we use integers to call rows and columns. We follow these two rules:

We use the [rows, columns] brackets to distinguish between rows and columns.

Indexes of rows and columns are numbers that typically start in 0.

Recall df4

print(df4)

0 1 2 3 4 0 1 2 3 4 5 1 6 7 8 9 10 2 11 12 13 14 15 3 16 17 18 19 20 4 21 22 23 24 25

Subset the first and last element of df4

# Subset the first element of df4 print(df4.iloc[0,0]) # Subset the last element of df4 print(df4.iloc[4,4])

1 25

Extract the first column

df4.iloc[:,[0]]

0 0 1 1 6 2 11 3 16 4 21

Extract the first row

df4.iloc[[0],:]

0 1 2 3 4 0 1 2 3 4 5

Extract the last 10 elements of the second column

df7.iloc[(df7.shape[0]-11):(df7.shape[0]-1), [1]]

date 16489 2012-11-01 16490 2012-12-01 16491 2013-01-01 16492 2013-02-01 16493 2013-03-01 16494 2013-04-01 16495 2013-05-01 16496 2013-06-01 16497 2013-07-01 16498 2013-08-01

Get the first 5 rows of columns 3 and 4.

df7.iloc[:4, 2:4]

city country 0 Abidjan Côte D'Ivoire 1 Abidjan Côte D'Ivoire 2 Abidjan Côte D'Ivoire 3 Abidjan Côte D'Ivoire

Extract all elements in the first row that comply with a condition. For this example we are going to nest the condition of all the values in the first row greater than two, df4.iloc[0,:]>2. Then we are using the np.where method to locate the indexes of elements that contain a True boolean.

df4.iloc[[0], np.where(df4.iloc[0,:]>2)[0]]

2 3 4 0 3 4 5

Transforming

Describe (Summary Statistics)

# mean print(df3.state_pop.mean()) # median print(df3.state_pop.median()) # variance print(df3.state_pop.var()) # standard deviation print(df3.state_pop.std()) # min value # standard deviation print(df3.state_pop.min()) # max value print(df3.state_pop.min()) # all together print(df3.describe())

6405637.274509804 4461153.0 53688706994844.23 7327257.808678784 577601 577601 Unnamed: 0 individuals family_members state_pop count 51.000000 51.000000 51.000000 5.100000e+01 mean 25.000000 7225.784314 3504.882353 6.405637e+06 std 14.866069 15991.025083 7805.411811 7.327258e+06 min 0.000000 434.000000 75.000000 5.776010e+05 25% 12.500000 1446.500000 592.000000 1.777414e+06 50% 25.000000 3082.000000 1482.000000 4.461153e+06 75% 37.500000 6781.500000 3196.000000 7.340946e+06 max 50.000000 109008.000000 52070.000000 3.946159e+07

If the dataframe contains integers or real numbers is easy to perform the operations across the columns or rows.

# Mean of rows across columns df4.mean(axis=1) # Mean of columns across columns df4.mean(axis=0)

0 3.0 1 8.0 2 13.0 3 18.0 4 23.0 dtype: float64 0 11.0 1 12.0 2 13.0 3 14.0 4 15.0 dtype: float64

Column operations

The main column operations are:

Sum

print(df3.state_pop.sum())

326687501

Cumulative sum

print(df3.state_pop.cumsum())

0 4887681 1 5622820 2 12780844 3 15790577 4 55252165 5 60943452 6 64514972 7 65480451 8 66181998 9 87426315 10 97937446 11 99358039 12 101108575 13 113831646 14 120527143 15 123675761 16 126587120 17 131048273 18 135707963 19 137047020 20 143082822 21 149965457 22 159949529 23 165555778 24 168536798 25 174658421 26 175719086 27 177644700 28 180672041 29 182025506 30 190911531 31 193004272 32 212534623 33 222916238 34 223674318 35 235350659 36 239290894 37 243472780 38 256273702 39 257331989 40 262416145 41 263294843 42 270066474 43 298695140 44 301848690 45 302473048 46 310974334 47 318498203 48 320302494 49 326109900 50 326687501 Name: state_pop, dtype: int64

Cumulative product

print(df3.state_pop.cumprod())

0 4887681 1 3593124922659 2 7272930357681714200 3 7903566051232904824 4 -6717292260295177376 5 8873061169631614368 6 8148294484792166400 7 5819023669836478464 8 -1205241372944291840 9 -6292177132517226496 10 6622373097463437312 11 4962005595153618944 12 3470577629245915136 13 -8423188867997155328 14 -2225371326546493440 15 3946997816422858752 16 5096266778097451008 17 1136688303390556160 18 1494541717077688320 19 7730232222543446016 20 8113663438137196544 21 -1654939752842264576 22 6166687450565443584 23 1500558237656678400 24 6253697334853500928 25 -8790508571491041280 26 -7109132568401805312 27 -5744006173422518272 28 -2284299832833081344 29 -8673003558171312128 30 -3504656614122586112 31 -5377308965807849472 32 -4878392185126387712 33 1709610911523667968 34 8941421250245296128 35 9215048314536329216 36 -8334418405679431680 37 4198662944055099392 38 -8723413785341067264 39 581225513859678208 40 1910243011917250560 41 2130586611002376192 42 -2684937009104945152 43 -2708850407756529664 44 7509713552135946240 45 -6781182851288137728 46 -6382118816838582272 47 -4484145143506534400 48 2332118313460563968 49 -4719128645426216960 50 -8733419210157326336 Name: state_pop, dtype: int64

Adding new columns

To add new columns we can take a dataframe, for instance df4, write the new index (or column name) and then pass the values on the left of = as follows

# add columns 3 and 4 and save them on a new column df4[5] = df4[2] + df4[3] # estimate column 3 as a proportion of column 5 and add the result in a new column df4[6] = df4[2] / df4[5] print(df4)

0 1 2 3 4 5 6 0 1 2 3 4 5 7 0.428571 1 6 7 8 9 10 17 0.470588 2 11 12 13 14 15 27 0.481481 3 16 17 18 19 20 37 0.486486 4 21 22 23 24 25 47 0.489362

If the columns do not have a numeric index, but a string name, the procedure to add a new column is by passing a string with the name of the new column, as follows:

# Create indiv_per_10k col as homeless individuals per 10k state pop df3["indiv_per_10k"] = 10000 * df3.individuals / df3.state_pop

Converting column types

Transform the first column into a datetime64[ns]

# Print class of object print(df7["date"].dtypes) # column date is an object # transform df7["date"] = pd.to_datetime(df7["date"]) # Print the new class of object print(df7["date"].dtypes) # column date is an object

object datetime64[ns]

Now that date is of class datetime64[ns], we can create new columns of year, month or day

# Add a year column to temperatures df7["year"] = df7["date"].dt.year df7["month"] = df7["date"].dt.month df7["day"] = df7["date"].dt.day df7.head() print(df7.dtypes)

Unnamed: 0 date city country avg_temp_c year month day 0 0 2000-01-01 Abidjan Côte D'Ivoire 27.293 2000 1 1 1 1 2000-02-01 Abidjan Côte D'Ivoire 27.685 2000 2 1 2 2 2000-03-01 Abidjan Côte D'Ivoire 29.061 2000 3 1 3 3 2000-04-01 Abidjan Côte D'Ivoire 28.162 2000 4 1 4 4 2000-05-01 Abidjan Côte D'Ivoire 27.547 2000 5 1 Unnamed: 0 int64 date datetime64[ns] city object country object avg_temp_c float64 year int64 month int64 day int64 dtype: object

Here is another example using df11, columns two and three are strings that can be transformed into numeric elements.

print(df11.dtypes) df11[['two', 'three']] = df11[['two', 'three']].astype(float)

one object two object three object dtype: object

Aggregating

The aggregate method works by defining a function that latter is going to be called on each column of a dataframe using the method agg()

For instance, consider the following function that computes the 75% and the 25% quantiles

def iqr(column): return column.quantile(0.75) - column.quantile(0.25)

We can use this function on columns one and three of the df4 dataframe as follows:

print(df4[[0, 2]].agg(iqr))

0 10.0 2 10.0 dtype: float64

agg() can take more than one function, for instance

print(df4[[0, 2]].agg([iqr, np.median, np.mean]))

0 2 iqr 10.0 10.0 median 11.0 13.0 mean 11.0 13.0

Grouping

For grouping, we use the method .groupby() from the pandas library that splits a dataframe according to a certain categorical variable. The following snipped splits the dataframe using the variable type in two categories, and then computes the sum() of the column weekly_sales.

df6.groupby("type")["weekly_sales"].sum()

type A 2.337163e+08 B 2.317840e+07 Name: weekly_sales, dtype: float64

We can even take more than one category

df6.groupby(["type", "is_holiday"])["weekly_sales"].sum()

type is_holiday A False 2.336927e+08 True 2.360181e+04 B False 2.317678e+07 True 1.621410e+03 Name: weekly_sales, dtype: float64

Now we can use the .groupby together with the agg functions to calculate the min, max, mean and median of a variable.

df6.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg([np.mean, np.median, np.max, np.min])

unemployment ... fuel_price_usd_per_l mean median amax ... median amax amin type ... A 7.972611 8.067 8.992 ... 0.735455 1.107410 0.664129 B 9.279323 9.199 9.765 ... 0.803348 1.107674 0.760023 [2 rows x 8 columns]

Dealing with NaN values

# Initial number of rows na_rows = df8.shape[0] # Report the sum of NaN values in each column df8.isna().sum() # Report the sum of NaN values in each column as a proportion df8.isna().sum()/df8.shape[0]

Country/Region 171061 Province/State 223208 Latitude 171062 Longitude 171062 Confirmed 19 Recovered 386 Deaths 432 Date 0 dtype: int64 Country/Region 0.137736 Province/State 0.179724 Latitude 0.137736 Longitude 0.137736 Confirmed 0.000015 Recovered 0.000311 Deaths 0.000348 Date 0.000000 dtype: float64

Imputing NaN values with zero

#Impute NaN with Zero df8_1 = df8.fillna(0) #Display no NaNs df8_1.isna().any()

Country/Region False Province/State False Latitude False Longitude False Confirmed False Recovered False Deaths False Date False dtype: bool

Removing rows with missing values

# Drop Na values df8 = df8.dropna() # How many rows were dropped? na_rows - df8.shape[0]

223572

Pivot Tables

The pivot table is another method to transform data using categorical variables to split the dataframe. By default, the .pivot_table method computes the mean of a variable.

# Pivot for mean weekly_sales for each store type df6.pivot_table(values="weekly_sales", index="type")

weekly_sales type A 23674.667242 B 25696.678370

An example selecting two variables

# Pivot for mean weekly_sales for each store type df6.pivot_table(values=["weekly_sales", "unemployment"], index="type")

unemployment weekly_sales type A 7.972611 23674.667242 B 9.279323 25696.678370

We can extend the pivot table capabilities by passing two functions instead of one, as follows

# Pivot for mean weekly_sales for each store type df6.pivot_table(values="weekly_sales", index="type", aggfunc=[np.mean, np.median])

mean median weekly_sales weekly_sales type A 23674.667242 11943.92 B 25696.678370 13336.08

We can split the dataframe passing one variable as index and another as columns

df6.pivot_table(values="weekly_sales", index="type", columns="is_holiday")

is_holiday False True type A 23768.583523 590.04525 B 25751.980533 810.70500

Create a pivot table adding the avg_temp_c column from df7, with country and city as rows, and year as columns.

df7["year"] = df7["date"].dt.year

Compute over two variables and replaces the NaN values, adding a sum of the columns and rows with the argument margins=True.

df6.pivot_table(values="weekly_sales", index="department", columns=["store", "type"], fill_value=0, margins=True)

store 1 2 ... 39 All type A A ... A department ... 1 23491.755000 32392.588333 ... 21423.068333 32052.467153 2 47421.124167 68156.664167 ... 60768.638333 71380.022778 3 12872.590000 17012.000000 ... 16847.852500 18278.390625 4 38382.255833 47650.447500 ... 41670.077500 44863.253681 5 23761.120000 30331.717500 ... 23466.439167 37189.000000 ... ... ... ... ... ... 96 27897.153333 33841.960833 ... 24947.875833 20337.607681 97 33771.761667 40757.997500 ... 23002.670000 26584.400833 98 10853.782500 14009.203333 ... 9089.097500 11820.590278 99 466.364545 455.516364 ... 317.189091 379.123659 All 20896.941787 26517.435162 ... 18414.938423 23843.950149 [81 rows x 13 columns]

Exporting

DataFrame to CSV

The most common format to export data is by saving the dataframe on a csv file

df9.to_csv("df9.csv")

DataFrame to Latex

print(df.style.to_latex())

\begin{tabular}{lrrrr} & A & B & C & D \\ 2013-01-01 00:00:00 & 0.817994 & -0.924007 & -1.515711 & -0.198598 \\ 2013-01-02 00:00:00 & 0.673364 & -1.914110 & -0.126208 & -0.282033 \\ 2013-01-03 00:00:00 & 1.312579 & 0.340656 & -0.300397 & -0.838614 \\ 2013-01-04 00:00:00 & -0.732977 & -0.560867 & -0.515910 & -0.768784 \\ 2013-01-05 00:00:00 & -2.045106 & -0.929131 & -0.029660 & 0.529883 \\ 2013-01-06 00:00:00 & -1.343257 & -0.250821 & -0.046303 & 0.944569 \\ \end{tabular}

References

Stack Overflow (2022)

Data Camp (2022)

Data Camp. 2022. “Data Manipulation with pandas - DataCamp Learn.” .

Stack Overflow. 2022. “Change column type in pandas.” .

Introduction to Algorithms

2022-03-27T00:00:00+00:00

Introduction

In lectures one to four, we have set the stage to introduce the heart of Applied Data Science: algorithmic thinking for problem-solving. In lecture one, we learn about the scope of Data Science and the rise of Big Data. Lecture two, is an introduction to the use of inductive reasoning applied to Data Science. Lectures three and four are an overview on basic estimation using statistics and econometrics applied with R Programming. In this lecture, I cover another pillar of Data Science: Algorithm programming using control flow structures and functions.

Functions and control flow structures are the building blocks of algorithm programming. So far, we have use r-packages and more specifically FUN(X) functions, that take arguments and perform certain action. However, in this lecture, we will learn the elementary building blocks from algorithmic programming. Learning the elementary building blocks of algorithmic programming has two main advantages in your formation in Data Science. Firstly, algorithmic programming allows you to understand in detail how the functions work. I’m sure that thus far, you know that if we pass a numeric vector x, inside the functionmean(x), R somehow computes the mean. However, after learning the building blocks of algorithmic programming, we are going to be able to understand how the functions work. What is the series of steps behind the computation of certain functions? And how do the arguments of the function being used, in which order? In a nutshell, algorithmic programming enables you to deeply understand functions and packages in R.

The second advantage of algorithmic programming is that enables you to go beyond the “out-of-the-shelf” functions from the R-base and other packages. Indeed, instead of being bound to only functions from the R-base and other packages, algorithmic programming, gives you the tools to create your own functions. In general, is recommended to search first if there are no functions available to perform the action that you want. But, knowing algorithmic programming removes the constraints of only using available tools and gives you the freedom of developing tools that fulfil your particular needs. Indeed, we would like to build our own functions and algorithms in two cases. Firstly, when we can’t find a similar function in The R Base or in the packages maintained by The Comprehensive R Archive Network (CRAN). Especially, if this is your first course in Data Science, you would like to verify first if there is a function on CRAN that fulfils your needs before investing time building your own function. Using a function from CRAN’s database is typically a better option, not only because we save time, but also because the code is audited by lead experts in their corresponding fields. Secondly, we may opt to build our own function when we often perform a sequential series of functions or repetitive steps. For instance, I typically use the function lapply combined with the function class to verify the class of each column from a df (data.frame) in the following manner lapply(df, class).

What is an algorithm?

An algorithm is simply a “well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output” [@cormen2022introduction]. Indeed, an algorithm serves a specific purpose and has a specific procedure designed to solve a problem. Before we start learning the necessary syntax to produce algorithms in R, we are going to understand the structure (procedure) of examples of algorithms employing pseudo-code or flow diagrams. Later, in a second step, we will revise the specific R-code that we need to produce the algorithm in R.

Example 1: Fibonacci Sequence

Input: A variable x that is the number of numbers to generate in the Fibonacci serie.

Output: A series z that is the Fibonacci serie of lenght x.

Fibonacci Sequence Generator:
graph TD A[Input x: lenght of sequence to generate] -->|Start| B(Sum the last 2 numbers) B--> C(Add the number to the series z) C--> D(Count the y of generated numbers) D--> Z{Is x = y?} Z--> |NO| B Z--> |Yes| R(Return the sequence z)

Flow controls: while and if

To implement the algorithm in Example one, we need to expand our knowledge of R operators. The operator if and while are always followed by a (...) that assesses a logical condition and some {...} brackets that perform a set of actions, ... if the condition is fulfilled. For instance, in Example 1, the input of the algorithm is a variable x that defines the number of elements to generate in the sequence. The output of the algorithm is a sequence z that has y number of elements. To generate the series we need to add one number to the series z that corresponds to the sum of the last two elements of the previous series until the total length is y. To start the algorithm we need the variable x that inputs the number of values to generate in the series z. followed by y which is the initial value of the length of z. Assuming that the user is going to request more than two numbers in the series then y>2 (Requesting less than two numbers makes the algorithm redundant).

Next, the algorithm needs to be programmed to continue doing a series of steps until the goal is reached. Remember that the goal is to produce a series z that has a total length of x. Notice that y is then the actual number of elements in the series z in each iteration of the process of adding one element to the series. To operationalize the algorithm we are going to use while(y, which assesses the condition where y. The operator will perform the set of actions ... only while the condition is satisfied y. That means that the algorithm using while stops when y>=x (when the corresponding series z has a total length of x or more).
x <- 20L y <- 0L z <- c(0L, 1L) while (y < x) { z[length(z) + 1L] <- sum(z[c(length(z) - 1L, length(z))]) y <- length(z) } z Example 2: Sorthing Algorithm Input: A variable $x={a_1, a_2, \dots, a_n}$ of n rational numbers. Output: A permutation (reordering) of x called y such that $y={a_1^*, a_2^*, \dots, a_n^*}$ Source: (Cormen, Leiserson, Rivest, and Stein, 2022). Sorting Algorithm graph TD A[Input x: a series of rational numbers length n] -->|Start| B(Take 'n>=j>1' from 'x' and store it in 'key') B --> C(Location of the previous number: 'i=j-1') C --> Z{Is x_i -previous- > key -next- ? } Z --> |Yes| Y1(Swap x_i -previous with key -next- ) Y1-->Y2(Swap key-next with x_i -previous- ) Z --> |NO| B for loop The sorting algorithm of Example 2, takes a series x of unsorted rational numbers and using an iterative procedure (for loop) compares each value on the list with all other values in the series. Using an index of previous value i and next value j and a storing vector key the algorithm swaps places when a previous number x[i] is greater than the current number being compared (key). This algorithm performs the same action as the function sort(x) with the argument decreasing = FALSE. Therefore, the algorithm itself has no more purpose than learning how the for loop is being used in R. The most fundamental aspect of the for loop is that takes n values in a series to perform a list of steps in each iteration of the loop. In this case, the algorithm evaluates each number in the series x to verify if a previous number is bigger than the next number in the series x[i]>x[j]. If that is the case, the algorithm replaces (swaps) the previous number x[i] with the current number being evaluated x[j] # Unsorted x <- sample(1L:99L, 15) x # sorted with the function sort(x, decreasing = FALSE) # iterative sorting algorithm for(j in 2L:length(x)){ key <- x[j] i <- j - 1 while(i>0&&x[i]>key){ #previous number in the series (x[i]) is greater than next number (key) x[i+1] <- x[i] #swap previous number (x[i]) with the next number (x[i+1]) i <- i - 1 x[i + 1] <- key # swap next number (key) with previous number (x[i+1]) } } # Sorted x Example 3: Odds and Even numbers This example is may have only a pedagogical application. The algorithm samples one random number sample(..., 1) between one and lim to generate an x numeric series of length y of even or odd numbers. Input: A variable lim that defines the range of numbers to sample $[1, lim]$ and the length of the series to generate. Also, we need a binary variable to switch the series from even to odd numbers. Output: A series x of even or odd numbers. if, else A good analogy to understand the dynamics of if and else operators is a choice or selection between a set of possible categories. Example 5: Choice Algorithm Suppose you have a bag of candies with the following flavours: c("orange", "lemon", "strawberry", "mango"). Your preference is lemon above all and strawberry over orange, you dislike mango. Suppose that the bag contains 100 candies, and you are interested in how many candies you would need to take (by chance) before getting 3 lemon candies in total? Input: A random sample c with repetition of size 100 of candies (the bag). Output: A series x of at least three lemon candies. candies <- c("orange", "lemon", "strawberry", "mango") candy_bag <- sample(candies, 100L, replace = T) lemon <- 0L picks <- "" c <- 1L while(sum(picks=="lemon")<3){ pick <- sample(candy_bag, 1L) if(pick=="lemon"){ picks[c] <- pick c <- c + 1L }else if(pick=="strawberry"){ picks[c] <- pick c <- c + 1L }else if(pick=="orange"){ picks[c] <- pick c <- c + 1L } } # Number of picks length(picks) # Distribution of picks library(ggplot2) ggplot(as.data.frame(table(picks)), aes(x=picks, y = Freq)) + geom_bar(stat="identity") Functions FUN(...), functions make explicit the kind of input that we the algorithms need in the form of arguments. Functions can take n=... arguments as our implementation may require. As we mention before the operator if(...) evaluates a logical condition and it is the gatekeeper of a set of operations grouped within {} brackets. Finally, the ifelse and the else operators evaluate a set of logical conditions always after the first condition stated in the if operator. Example 6: Even or Odd numbers In the implementation of Example 6, the algorithm employs if and else if to select between an odd or even number. Instead of using a for loop, that has a deterministic number of iterations, the examples use a while to prevent the algorithm to stop before generating a series length y of even or odd numbers. Input: A random sample lim that generates Output: A series x of at least three lemon candies. odd_even <- function(lim=100L, y=25L, even=TRUE){ x <- vector(mode = "numeric") i <- 1L while (length(x) Example 7: Randomized Hire-Assistant Finally, example 7 employs previous control flows. Starts by assuming that there is a fixed supply of assistants in the labor market of data science. Further, it assumes that by a process of selection and interview their ability is explicit. Each candidate arrives at the interview randomly, and the goal is then to select the top candidate on a fixed number of interviews. Input: A vector of candidates supply with a vector a of ability. Additionally, a vector of interviews that contains the max number of interviews in each experiment. Output: A matrix H with i-rows hired assistant ability per j-column round of interviews. From this matrix, we are interested in estimating the total number of hires for each round of interviews, c(150L, 250L, 1000L, 3000L, 5000L, 10000L, 15000L), The algorithm uses a for loop to perform an iterative selection of candidates using the vector interviews. Then selects a random sample of candidates selected for the interviews. Using a while operator each iteration continues to run until length(selected)==0. For each run, I sample one candidate for interview and remove it from the selected vector. Finally, using an if(best operator the algorithm hires a candidate if their ability is higher than the current best candidate. h <- 1L supply <- 20000L hires <- 0L a <- runif(supply) interviews <- c(150L, 250L, 1000L, 3000L, 5000L, 10000L, 15000L) H <- matrix(NA, nrow = supply, ncol = length(interviews)) j <- 2L for(j in seq_along(interviews)){ selected <- sample(a, interviews[j]) #interview candidate h <- 1L best <- 0 while(length(selected)!=0){ i <- sample(1L:length(selected), 1) interview <- selected[i] selected <- selected[-i] if(best References Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein. 2022. Introduction to Algorithms, Fourth Edition. MIT Press. https://books.google.nl/books?id=RSMuEAAAQBAJ.
Inferential Statistics: Causation or Correlation? 2022-03-21T00:00:00+00:00 Introduction Introduction In lecture number three, we review the use of descriptive statistics to answer questions such as: “What is the current state of affairs?”; “How often, how many, when?” An also I introduce the us of the correlation coefficient to assess “what is the association between two variables?” However, in many cases, to show that two variables have an association sometime is not enough. Associations only measure how the set of variables change toguether, but, they do not say anything regarding the direction or magnitude of the relationship. To say something about the direction, means to discover if one varibles is the cause or determinant of another. Here, there is a clear order in the relationship between two varibles, for instance, $X \rightarrow Y$, represents that $X$ is the cause or determinant of $Y$. The magnitude of the relationship refers the measurement of the effect of $X$ on $Y$, for instance, if $X$ changes by one unit how much does $Y$ would vary? The distiction between a correlation and a causal relationship between two variables is not only important but necesarry in many applications. Imagine for intance the development of a vaccine or an important policy prescription. Obviosly, the research that backs-up these developments will impact the life of many people. Therefore, we would like to make a precise inference to be able to claim with robustness the magnitude and the direction that exist between variables. An asociation between two variables is not strong enough to draw conclusions about the population of our interest. In many cases we would like to move then from showing a correlation between variables to find which variable is the cause or determinant of the other. This kind of research is the central quest of econometrics and Data Science and has a special place in empirical economics. Causality and Correlation As it turns out, the kind of relationship between two variables is no so clear. As Data Scientist, we should proceed with scientific skepticism when we analyze the relationship between two variables. When we measure a correlation between two variables, we are merely assessing the association between two variables, but that is not the same as causation. To help you draw a line between an association, correlation and a causal relationship between two variables, I elaborate on some properties that causal relationships must have: A Causal Mechanism The development of Machine Learning and Big Data are pushing the boundaries between data-driven and theory-driven research (Maass, Parsons, Et Al., 2018). Indeed, there is nowadays a real debate on the power of Data Science to replace the scientific method: Very large databases are a major opportunity for science and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed… Claude & Longo, 2017. However, in Economics, we are skeptical about weather data driven methods can really be a substitute for theory driven research. With the advent of Information Communication Systems (ICT) and now Big data, are generating large volumes of information. The availability of all sorts of data, also pose a challenge of identifying meaningful relationships between variables. The issue is that more often than before, we can find out by chance pairs of variables that seem to be related, but in fact they are completely disconnected from each other. In Economics, there is long persistent concern about this kind of problem called spurious relationship between variables. In Layman’s terms, a spurious relationship occurs when a set of variables seem to have a relationship, however, they are in fact completely unrelated. So then, what is the solution to avoid the trap of the spurious relationship between two variables? The answer is a well-defined and coherent theoretical framework. In fact, the main stream of methodology in economics, has always been about finding methods to prove economic theory and not the other way around. Although, that trend is changing, and some Data Scientist would argue that research is becoming more data driven, the fact is that in economics there is no substitute for a well-defined and coherent theory. The seminal work of Blaug (1992), takes a closer look at the developments of methodology in economics and argues that: Methodology is study of the relationship between theoretical concepts and warranted conclusions about the real world; in particular, methodology is that branch of economics where we examine the ways in which economists justify their theories and the reasons they offer for preferring one theory over another. An Exogenous Model To argue that two or more variables hold a causal relationship, we must ensure that our models are exogenous. What does that mean? Well, to say that $X \rightarrow Y$, requires that we control in the estimation all other factors that affect $Z \rightarrow Y$ our dependent variable. If our theory suggest that $X$ causes $Y$, we must ensure that our estimation isolates well the causal mechanism. In other words, we must account jointly with $X$ all the other $Z$ determinants of $Y$. If we fail to include all the variables that are systematically affecting $Y$, we fall in the omitted variable bias (OVB) trap. OVB is common, because if there are variables that remain confounded or unobservable ($Z$), make it hard to distinguish if $X$ determines $Y$, or perhaps is $Z$? A graphical approach to understand OVB treats to a causal estimation is represented in the following diagram. Here we can see that our variable of interest $X$ is indeed causing $Y$, however, there is another variable (in the yellow region), $Z$ that is jointly affecting $Y$. Failing then to control for $Z$ induces a discrepancy between the population parameter(s) and our estimate(s) called bias. Omitted Variable Bias (OVB). graph TD X --> Y subgraph OVB; Z --> Y classDef red fill:#fdc class AN red end Biased model: $$Y=\beta_1 X+\epsilon$$ Unbiased model: $$Y=\beta_1 X+ \beta_2 Z+\epsilon$$ The classic example is the estimation of years of education ($X$) on income ($Y$). The problem is that we can measure really well the years of education, but other determinants like ability, motivation and number of hours of study are very hard to measure. Even if we have psychometric measurements of IQ, these metrics are only proxies of the latent ability at the individual level. A proxy means that is just an approximation of the real variable that remains or confounded or unobserved. The book of Stock and Watson, (2019), offers another example from the study of school grades ($Y$) and the student-teacher ratio. The intuition of the study is that if the student-teacher ratio is high, then the grades are low. The causal mechanism is that explains this negative relation is the lack of capacity of teachers to properly tutor many students. However, the estimation, suffers from OBV, because it does not account for the percentage of English learners in some schools. This is a problem because migrant children might require additional tutoring, given that they do not master the language. Another potential source of OVB is the lack of control of the time of the test. As it turns out, the time of the test can impact the scores, because in early morning and later in the evening the alertness may reduce. A special type of OVB is called self-selection. Self-selection appears in an estimation when there are inherent characteristics of the unit of observation that affect the outcome variable ($Y$) but remain confounded or latent. This perhaps sounds quite abstract, so let’s give some examples to clarify the concept. Imagine that you are interested in estimating the effect of education quality ($Y$) on the career success ($Y$), measured in monthly income. So you run a model and control for the different schools; among the sample you have graduates from Oxford, Harvard, Stanford and so on. Then in your estimation, it appears that indeed the higher the rank of the university (for the World University Ranking) the higher the measured salary of the graduates $Y$. But wait, aren’t more able students also more likely to enroll themselves into highly ranked universities? Indeed, variables such as ability and motivation are very difficult to observe, and hence it is hard to determine if schooling from highly ranked universities causes latter career success. Although the association between university prestige and career success intuitively makes sense, most of the time we are only able to describe a simple correlation between the variables (Gonzalez-Sauri and Rossello 2022).. Another example, from studies of science and technology, is whether research collaboration causes higher research productivity? Similarly to the previous example, intuitively, we may expect that increments in the collaboration render beneficial exchanges between researchers that increase the overall productivity of authors. Similar to the previous example, intuitively, makes sense collaboration brings gains of human capital, division of labor and pooling of resources. However, we disregard, that these positive externalities from collaboration depend on the individual self-selection into networks or teams (Ductor 2015). The self-selection takes place because researchers do not connect or make partnerships with everybody randomly. But most of the time, a researcher’s own preferences in terms of discipline, research interest and other individual characteristics such as their personality are the reason behind the membership into different networks. Thus, it is hard to tell weather is collaboration the determinant of productivity or is it some other individual characteristics that help some researchers to be part of prolific networks. A second source of endogeneity (opposite of exogeneity) is called reverse-causality. Reverse causality is a real problem in many datasets because there is some feedback mechanism $Y \rightarrow X$ in which the dependent variable also affects the explanatory variable. A classic example in economics of this issue is presents in the functions of supply and demand. The issue is that $S=P$ supply varies depending on the selling price, but simultaneously, the price is also changing according to the demand $D=P$. This is an issue of feedback, in which the dependent variable (supply-demand), affects the explanatory variable (price) under equilibrium. This system of equations has a problem of reverse causality that is not so straightforward to solve. In graphical form, the problem or reverse causality is represented in the following way: Reverse Causality. graph TD P[Price] --> S[Supply] subgraph REV-CAUSAL; D[Demand] --> P classDef red fill:#fdc class AN red end S ---|Equilibrium: =| D Biased model: $$S=\beta_1 P+\epsilon$$ Unbiased model: $$S=\beta_1 P +\epsilon$$ $$D=\beta_2 P +\epsilon$$ A similar problem that poses a thread to exogeneity, is called circularity, and it happens when past realizations of a dependent variable have an impact on contemporaneous values. There are many examples of this problem in finance and time series econometrics. For instance, in macroeconomic estimations of the $GDP_{t}$ it is crucial to include the previous state of affairs $GDP_{t-y}$. Where $t>y$ stands for a previous period, such that the current or contemporaneous $GDP$ depends on the state of affairs of the last year. The problem of circularity in graphical form is described as follows: Circularity. graph TD X --> GDP_t2 subgraph CIRCULARITY; GDP_t1 --> GDP_t2 classDef red fill:#fdc class AN red end Biased model: $$GDP_{t+1}=\beta_1 X+\epsilon$$ Unbiased model: $$GDP_{t+1}=\beta_1 X + \beta_2 GDP_{t+1} +\epsilon$$ Nature of the data: Observational vs Experimental If we think deeply about the exogeneity threads discussed in the last section (OVB, self-selection, reverse causality and circularity) we may see a common problem. Yes indeed, at the heart of the issue of endogeneity (opposite of exogeneity) there is a common problem of confounding factors. A confounding factor, in Layman’s terms, is simply a variable that we can’t get our hands on. Either because we do not have the data, we can’t measure it (OVB), due to a problem of self-selection or because our dependent variable has a form of feedback (reverse causality or circularity). All the listed problems, induce bias and yield an unreliable inference because at the backbone of the estimation there is a problem of confounding variables. Having confounding factors in an estimation is like cooking with an incomplete recipe, or analogous to having a jigsaw puzzle with missing pieces. One Missing Puzzle Piece - Black Tile by FlowstoneGraphics This general issue of confounding variables is not easy to solve with the vast majority of the data sets that we can get our hands on. Indeed, the aforementioned threads to exogeneity may persist even in the most tidy and organize data from relational datasets (SQL for instance), surveys or administrative records. And unfortunately, the issue of confounding factors is not solved by increasing the magnitude and quantity of the data at our disposal. Even if we could collect Big Data that has millions of records using Web Scrapping algorithms, a large company or government agency, the problems may persist. One way to solve the problem of confounding factors is to employ what has become the golden standard in Economics and Social Sciences is called Randomized Control Trials (RCTs). Data that comes from RCTs is called experimental data and is different from all data we can collect from other sources, generally called observational data. An RCT typically has a well-defined causal mechanism that directs the process of data collection to eradicate by design the problems of confounding variable. Indeed, the power of the RCTs derive from their power to isolate well the causal mechanism by virtue of a random assignment. In simple terms, the ideal RCT design starts by selecting at least two groups of similar units (individuals, firms, regions). These two groups must be similar in all characteristics such that any difference between them becomes insignificant on the averages. The comparison is then an “apples to apples” and not “apples to oranges”. Source:Initiating an Experiment, Ch.4 The heart of the RCT is changing randomly the circumstances that surround the causal-mechanism in one of the two groups, namely, the treatment-group. The randomized assignment has two main virtues, firstly, we eradicate the problem of self-selection by controlling which of the two identical groups receives the treatment. Keep in mind that the treatment embodies the causal-mechanism that we are aiming to showcase $X \rightarrow Y$. The second benefit is that by changing the circumstances randomly, the variable of interest is most likely disconnected or unrelated to any other $Z$ factor affecting the outcome variable $Y$ of our research. Examples Chocolate consumption and Noble laureates The study of Aloys LeoPrinz, (2020), studies the well known association between Nobel laureates and chocolate consumption. At first glimpse, when we look at the consumption of coffee and chocolate with the number of Nobel laureates winner, we can tell that there is a positive relationship. Using descriptive statistics, we can assess this easily with a scatter plot or correlation table. library(ggplot2) library(gridExtra) library(dplyr) choc_lauretes <- readRDS("choc_lauretes.rds") grid.arrange( choc_lauretes %>% ggplot(aes(cholate_per_cap, no_nobel_lau)) + geom_point() + geom_smooth(method = "lm", se = T), choc_lauretes %>% ggplot(aes(coffee_per_cap, no_nobel_lau)) + geom_point() + geom_smooth(method = "lm", se = T), nrow = 1 ) cor(choc_lauretes[, c(2L:4L)], use="complete.obs") Table: 1 cholate_per_cap coffee_per_cap no_nobel_lau cholate_per_cap 1.00 coffee_per_cap 0.45 1.00 no_nobel_lau 0.17 -0.12 1.00 The correlation matrix shows that both chocolate and coffee have a positive association to the number of Laureates winners. The correlation of chocolate is to the Nobel winners is stronger than of the coffee. While looking at the scatter plot, we see that the trend indicates almost no relation between the coffee consumption and Nobel winners, however, we can observe a clear positive trend between chocolate consumption and Nobel Prize winners. Does chocolate consumption cause people to become smarter? As you are probably suspecting, to show a causal link between chocolate and human cognition, a simple correlation and trend analysis are not enough. But what is missing? Causal Mechanism In fact the sduty of Aloys LeoPrinz, (2020) provides a compelling theory that claims that because of the effects of flavonoids and caffeine has a positive effect on cognition and the dopaminergic reward system of the human brain. However, his paper does not provide the nuances about how come the flavoids and caffeine interact with a particular area(s) of the brain to yield that effect. In fact, his empirical study does not provide any biological evidence supporting that claim. Data The study uses observational data, and does not solve the problem of self-selection and confounding variables. Moreover, the unit of observation (countries) is quite disconnected from the unit of analysis (Nobel laureates winners). That is, the study attempts to describe a causal mechanism that occurs at the micro level, namely, in the brain of Nobel Prize winners. In other words, he is using a macro data at the country level to draw conclusions about the brain of researchers. Endogeneity The study does not controls for important confounding variables such as natural ability and the level of education of individuals. Also, there is no account for motivation and the number of weekly hours that researchers invest in their work. The lack of these controls, induces doubts in the estimation because they are important determinants of research productivity. Furthermore, the data has a problem of self-selection, because individuals are choosing to consume chocolate or coffee. Henceforth, we can observe the outcome of individuals of similar characteristics that do not consume coffee or chocolate (control or counterfactual group). Social Norms and Energy Consumption. The study of Schultz, Nolan, Et. Al, (2017) conducts an experiment on 290 households in San Marcos, CA, USA. The experiment was design to analyze the effect of two different kinds of social norms. One group was treated with an intervention that induce a “descriptive norm”, namely, the group was given information on their energy consumption compare to the average consumption of the neighborhood. The average consumption, has implicitly, a social norm given that individuals tend to have conformity with the behavior of their peers. The second group, was treated with another norm called the “injunctive norm” that embodies the perceptions of what is commonly right or wrong in a given situation. The core of the analysis is then to measure the effect of the two norms before and after the treatment. Causal Mechanism The study uses the theoretical framework of “Focus theory” that predicts that if only one of the two types of norms is prominent in an individual’s consciousness, it will exert the stronger influence on behavior (Cialdini & Goldstein, 2004). The theory thus prescribe that the group treated with a “descriptive norm” will increase and decrease their energy consumption towards the mean. In contrast, the group that is treated with an “injunctive norm” should change the behavior only if they receive a negative signal (a sad face) when their consumption is above the mean, but not the other way around (boomerang effect). Data The study uses experimental data because the treatment (social norm) is allocated randomly. The experimental data has the advantage of removing the problem of self-selection and the variable of interest $X$, the social norm, is, by virtue of the random assignment, uncorrelated with other determinants $Z$, of the energy consumption $Y$. Endogeneity The study only derives conclusions based on a difference in means, and they do not assess the effects of other factors that might drive the change of behavior, for instance, unemployment during the period of observation or absence in the household due to holidays or work. Further, it is not clear that the two groups were completely isolated from one another. The causal-mechanism depends on the prominence of one of the norms in the mind of the individuals. However, neighborhoods are typically well-know to communicate and interact among themselves, henceforth, it is not unlikely that a norm affected more than one household. References Blaug, Mark. 1992. The Methodology of Economics: Or, How Economists Explain. 2nd ed. Cambridge Surveys of Economic Literature. Cambridge University Press. https://doi.org/10.1017/CBO9780511528224. Calude, Cristian S., and Giuseppe Longo. 2017. "The Deluge of Spurious Correlations in Big Data." Foundations of Science 22 (3): 595-612. https://doi.org/10.1007/s10699-016-9489-4. Cialdini, Robert B., and Noah J. Goldstein. 2004. "Social Influence: Compliance and Conformity." Annual Review of Psychology 55 (1): 591-621. https://doi.org/10.1146/annurev.psych.55.090902.142015. Ductor, Lorenzo. 2015. "Does Co-Authorship Lead to Higher Academic Productivity?" Oxford Bulletin of Economics and Statistics 77 (3): 385-407. https://doi.org/https://doi.org/10.1111/obes.12070. Gonzalez-Sauri, Mario, and Giulia Rossello. 2022. "The Role of Early-Career University Prestige Stratification on the Future Academic Performance of Scholars." Research in Higher Education, April. https://doi.org/10.1007/s11162-022-09679-7. Maass, Wolfgang, Jeffrey Parsons, Sandeep Purao, Veda C Storey, and Carson Woo. 2018. "Data-Driven Meets Theory-Driven Research in the Era of Big Data: Opportunities and Challenges for Information Systems Research." Journal of the Association for Information Systems 19 (12): 1. https://doi.org/10.17705/1jais.00526. Nabavi, Noushin. 2020. Chapter 4 Defining the problem. https://noushinn.github.io/experimentation_course/defining-the-problem.html. Prinz, Aloys Leo. 2020. "Chocolate Consumption and Noble Laureates." Social Sciences & Humanities Open. https://doi.org/10.1016/j.ssaho.2020.100082. Schultz, P. Wesley, Jessica M. Nolan, Robert B. Cialdini, Noah J. Goldstein, and Vladas Griskevicius. 2007. "The Constructive, Destructive, and Reconstructive Power of Social Norms." Psychological Science 18 (5): 429-34. https://doi.org/10.1111/j.1467-9280.2007.01917.x. Stock, James, and Mark W. Watson. 2003. Introduction to Econometrics. New York: Prentice Hall; Prentice Hall. Times Higher Education. 2022. "World University Rankings." https://www.timeshighereducation.com/world-university-rankings.

	cholate_per_cap	coffee_per_cap	no_nobel_lau
cholate_per_cap	1.00
coffee_per_cap	0.45	1.00
no_nobel_lau	0.17	-0.12	1.00

Mario GS: Data Science Blog

GEM Indicators: A Reproducible Framework for SDG 4 Monitoring in Latin America.

Introduction

Pipeline Workflow

Indicator Selection and Methodological Scope

Family 1: Household Core Indicators

NSO Microdata Coverage and Sample Composition

Summary Statistics

Harmonization

Methodological Foundation: Peer-Reviewed Harmonization Frameworks

Global Harmonization

Weighted Population Estimator

Indicator-Level Harmonization

Honduras — ED05 / CP407 to ISCED: Dual-Standard Reconciliation (EPHPM)

Paraguay: ED0504 National Cycle Codes with Attendance-Aware Upper Secondary Completion

Completion Logic by Level (Hierarchical Cascading)

Grade Handling and Population Restriction

Argentina — Attendance (CH10) and Completion (CH12/CH13/CH14/NIVEL_ED) (EPH)

Phase 1: Raw Variable Extraction

Phase 2: ISCED Mapping with Stricter Thresholds

Rationale for Stricter CH14≥3

Results

Benchmark Comparison Table

Performance Assessment

Structural Deviations

Conclusion

References

Leveraging Financial Analysis with Google BigQuery and Python: A Financial Big Data Application.

1) Installing Python in RStudio

2) Setting Up the Cloud Environment

2.1) Install Google Cloud CLI

2.2) Creating a Project in Gcloud Power-Shell

2.3) Authentication and Enabling Big Data Services

2.4) Sanity Checks

3) The thelook_ecommerce Dataset

4) Connecting Gcloud with Python

5) General Approach for Data Manipulation (ETL)

6.1 Data Filtering, Aggregation and Joining Strategy

Stage 1

Stage 2

Stage 3

Stage 4

7) Visualize the Profit & Loss (P&L) Statement

Fractions, Decimals, Percentages.

An Introduction to R for Network Analysis.

Introduction

Section 1: An introduction to R.

Getting Started with R

Key Concepts

Learning by Doing

Mastering R-base

Operators Reference

Understanding Objects in R Programming

Vectors: The Building Blocks

Matrices

Basic Linear Algebra

Data Frames

Functions

Lists

Indexing Objects

Control Flow

Loops

Apply Family of Functions

lapply Function

Graphics in R

Color Management

Histograms

Scatter Plots:

Multiple Regression model

Section 2: An introduction to Network Analysis using Igraph

Install packages

Generate Graphs

Using Adjanceny Matrices

Using Edge lists

Using formulas

Using igraph-functions

Visualizing Networks with igraph

Plotting a Simple Network

Customizing network attributes:

Layouts of igraph

Honduras — `ED05` / `CP407` to ISCED: Dual-Standard Reconciliation (EPHPM)

Argentina — Attendance (`CH10`) and Completion (`CH12`/`CH13`/`CH14`/`NIVEL_ED`) (EPH)

Rationale for Stricter `CH14≥3`

3) The `thelook_ecommerce` Dataset

Subsetting rows using the `loc`

Subsetting rows using the `iloc`

Subsetting rows and columns using the `iloc`