<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mario1084.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mario1084.github.io/blog/" rel="alternate" type="text/html" /><updated>2026-03-18T08:02:04+00:00</updated><id>https://mario1084.github.io/blog/feed.xml</id><title type="html">Mario GS: Data Science Blog</title><subtitle>&quot;Mario GS: Data Science Blog.&quot;</subtitle><entry><title type="html">GEM Indicators: A Reproducible Framework for SDG 4 Monitoring in Latin America.</title><link href="https://mario1084.github.io/blog/2026/03/17/gem_indicators.html" rel="alternate" type="text/html" title="GEM Indicators: A Reproducible Framework for SDG 4 Monitoring in Latin America." /><published>2026-03-17T00:00:00+00:00</published><updated>2026-03-17T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2026/03/17/gem_indicators</id><content type="html" xml:base="https://mario1084.github.io/blog/2026/03/17/gem_indicators.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In this analysis, I present a reproducible framework for generating
educational attainment indicators aligned with the standards of the
Global Education Monitoring (GEM) Report. These indicators are designed
to facilitate the monitoring of SDG 4 targets, specifically those aimed
at reducing multidimensional inequality in education. Focusing on a
sample of Latin American contexts—Argentina, Honduras, and Paraguay—my
preliminary results demonstrate a high level of convergence with the
canonical benchmarks published in cross-country databases such as WIDE,
SCOPE, and VIEW.</p>

<p>The policy framework motivating this reconstruction is the GEM Report
(Global Education Monitoring Report 2026), which places renewed emphasis
on the quality of the evidence base used to track SDG 4 progress.
Specifically, it highlights the risks of over-relying on single data
sources for global monitoring. The reconstruction exercise undertaken
here speaks directly to this concern: by re-deriving indicators from raw
microdata and benchmarking them against official figures, I identify not
only where estimates converge—offering a microdata-based point of
comparison for existing benchmarks—but also where they diverge and why.
This “methodological interpretability” reveals how national survey
architectures interact with global measurement frameworks in ways that
are not visible from published figures alone, contributing directly to
the kind of evidence-quality assessment the GEM Report calls for.</p>

<p>During the reconstruction, I systematically harmonized microdata from
the different national household surveys, ensuring that published
indicator definitions were consistent across all three contexts. This
alignment is particularly challenging given the inherent tension in
global monitoring: while SDG 4 goals are universal, the microdata
required to measure them, originating from diverse National Statistical
Offices (NSOs), is inherently heterogeneous. Robust harmonization methods
are therefore needed before these sources can support comparable
evidence on educational inequalities and outcomes.</p>

<p>To resolve this, I developed a framework that integrates a robust,
two-tier harmonization process. First, a global structural harmonization
aligns disparate survey formats; second, an indicator-based remapping
ensures that national education cycles (such as “Educación Básica”) are
correctly translated into international ISCED standards. Benchmarking
these estimates against the GEM reference sources (WIDE, VIEW) obtained
via the UIS API, I find that the reconstructed indicators are broadly
consistent with the published figures officially used for cross-country
comparison.</p>

<h3 id="pipeline-workflow">Pipeline Workflow</h3>

<p>The analytical pipeline comprises four consecutive stages, followed by
a benchmarking step, all published in my
<a href="https://github.com/mario1084/gem_unesco_st_analysis_demo">GitHub
repository</a>.
First, <code class="language-plaintext highlighter-rouge">01_data_acquisition.R</code> fetches and stages raw microdata files
from the NSO public repositories, preserving source-year identifiers.
Second, <code class="language-plaintext highlighter-rouge">02_harmonize.R</code> performs the <a href="#harmonization">global harmonization
layer</a>, applying correspondence tables and admissibility
rules to transform heterogeneous source variables into the unified
analytical record $H_i$. Third, <code class="language-plaintext highlighter-rouge">03_combine_harmonized_data.R</code>
consolidates individual harmonized CSV.GZ files into a single
<code class="language-plaintext highlighter-rouge">persons_harmonized.parquet</code> file for efficient processing. Fourth,
<code class="language-plaintext highlighter-rouge">04_indicators.R</code> orchestrates the computation of all indicator families
by executing household core estimators
(<code class="language-plaintext highlighter-rouge">R/indicators/household/completion.R</code>, <code class="language-plaintext highlighter-rouge">attendance.R</code>,
<code class="language-plaintext highlighter-rouge">out_of_school.R</code>, <code class="language-plaintext highlighter-rouge">literacy.R</code>, <code class="language-plaintext highlighter-rouge">repetition.R</code>) alongside secondary
layers (learning, admin/reference, finance). Each household estimator
applies <a href="#indicator-level-harmonization">indicator-level harmonization</a>
to translate national education cycle codes into ISCED-comparable
classifications before computing the weighted population share
estimator. All outputs—across families—are consolidated into a single
unified CSV with <code class="language-plaintext highlighter-rouge">indicator_family</code> labels, enabling selective
extraction for benchmarking. Finally, <code class="language-plaintext highlighter-rouge">ind_benchmark.py</code> filters to
household core indicators and performs comparative validation against
WIDE and UIS published figures, producing the audit report and status
assessments shown in the <a href="#results">results section</a> below.</p>
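
<p>As a rough illustration of the selective-extraction step, the sketch below filters a unified indicator table down to the household core family before benchmarking. It is written in Python (the language of <code class="language-plaintext highlighter-rouge">ind_benchmark.py</code>) and is not the repository's actual code: the <code class="language-plaintext highlighter-rouge">indicator_family</code> column and the <code class="language-plaintext highlighter-rouge">household_core</code> label come from the text, while the remaining column names, the sample rows, and the function name are assumptions made for the example.</p>

```python
import csv
import io

# Toy stand-in for the unified output CSV. Only `indicator_family` and the
# `household_core` label come from the post; the other columns and the
# non-household indicator names are hypothetical.
UNIFIED_CSV = """indicator_family,indicator,country,year
household_core,COMP_LVL,HND,2023
learning_layer,ERCE_READING,HND,2019
finance_layer,ODA_EDUCATION,HND,2022
household_core,LIT_RATE,PRY,2023
"""


def select_family(csv_text, family):
    """Return the rows whose indicator_family equals `family`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["indicator_family"] == family]


core_rows = select_family(UNIFIED_CSV, "household_core")
print([row["indicator"] for row in core_rows])
```

<p>Keeping every family in one file with an explicit label means the benchmarking step only needs a filter, not a separate export per family.</p>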

<h2 id="indicator-selection-and-methodological-scope">Indicator Selection and Methodological Scope</h2>

<p>The indicators selected for this reconstruction are educational outcome
measurements focused specifically on educational attainment. This group
represents the definitive metrics for tracking how individuals
transition through and ultimately exit national education cycles. As
detailed in the results section, the microdata capturing these cycles
suffers from significant instrument-level heterogeneity across NSOs.
Consequently, extracting cross-country comparable metrics requires the
implementation of rigorous harmonization rules. Despite these structural
challenges, the resulting indicators are uniquely rich: they are not
mere aggregates, but person-level reconstructions derived directly from
household-level microdata (<code class="language-plaintext highlighter-rouge">household_core</code>). This granular
reconstruction constitutes the primary methodological contribution of
this study. These are the specific indicators benchmarked against WIDE
and World Bank repositories, computed for Argentina, Honduras, and
Paraguay (2021–2024) using the weighted population share estimator
defined in the methodology section.</p>

<p>While the broader analytical repository estimates and reports on other
indicator families, they are deliberately omitted from this specific
benchmarking discussion. The <code class="language-plaintext highlighter-rouge">learning_layer</code>, <code class="language-plaintext highlighter-rouge">admin_reference</code>, and
<code class="language-plaintext highlighter-rouge">finance_layer</code> are fundamentally different in their methodological
demands. Because they are not derived from the harmonization of
heterogeneous NSO microdata, they operate primarily as straightforward
data integrations rather than structural reconstructions.</p>

<p>Specifically, the learning layer does not re-estimate assessment
results; it merely integrates published, source-native scores from ERCE,
PISA, PISA-D, and the UIS learning API to provide thematic context
alongside the household indicators. Similarly, the administrative
reference layer ingests established UIS administrative series and World
Population Prospects (WPP) denominators to support VIEW-style
publication logic, while the finance layer integrates standard OECD
DAC/CRS disbursement data to enable SCOPE-style education aid
contextualization. Because these secondary layers rely on standardized
data pipelines and lack the structural friction of national survey
architectures, only the household core requires the rigorous
methodological validation detailed in this report.</p>

<h3 id="family-1-household-core-indicators">Family 1: Household Core Indicators</h3>

<p>The household core indicators are derived from person-level microdata
using a weighted population share estimator applied to the harmonized
<code class="language-plaintext highlighter-rouge">persons_harmonized.parquet</code> file. Each indicator is computed at
national level and disaggregated by sex (<code class="language-plaintext highlighter-rouge">sex_h</code>) and urban/rural
location (<code class="language-plaintext highlighter-rouge">location_h</code>) across all three countries and four survey
years.</p>

<ul>
  <li>
    <p><strong>Out-of-school rate</strong> (<code class="language-plaintext highlighter-rouge">OOS_LVL</code>): the weighted share of the official
school-age population for each level that is not currently attending
any level of formal education. Computed as the complement of
attendance within the age-defined eligible universe. <em>Harmonization:</em>
no remapping beyond the binary recode of <code class="language-plaintext highlighter-rouge">attending_currently_h</code>; the
denominator is age-only.</p>
  </li>
  <li>
    <p><strong>Completion rate</strong> (<code class="language-plaintext highlighter-rouge">COMP_LVL</code>): the weighted share of a
“near-on-time” reference-age cohort—official graduation age plus a 3–5
year buffer—that has completed that level. The most technically
complex indicator in the family and the one where all benchmark
deviations concentrate. <em>Harmonization:</em> substantial and
country-specific—see the <a href="#harmonization">Harmonization</a> section for
detailed mappings by country.</p>
  </li>
  <li>
    <p><strong>Literacy rate</strong> (<code class="language-plaintext highlighter-rouge">LIT_RATE</code>): the weighted share of the 15–24 age
group that can read and write, based on a direct self-reported
literacy item. <em>Harmonization:</em> <code class="language-plaintext highlighter-rouge">ED01</code> (HND) and <code class="language-plaintext highlighter-rouge">ED02</code> (PRY) map
directly to <code class="language-plaintext highlighter-rouge">literacy_h</code>; no validated item for Argentina.</p>
  </li>
</ul>
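
<p>The reference-age cohort used for the completion rate can be sketched as a small helper. This is a minimal illustration rather than the pipeline's implementation: the function names are hypothetical, and the graduation age of 11 is an assumed example value chosen because it reproduces the 14–16 window used later as the primary-completion example.</p>

```python
def completion_age_window(official_graduation_age, buffer_min=3, buffer_max=5):
    """Return the inclusive (low, high) ages of the near-on-time cohort."""
    return (official_graduation_age + buffer_min,
            official_graduation_age + buffer_max)


def in_cohort(age, window):
    """True when `age` falls inside the inclusive window."""
    low, high = window
    return age >= low and high >= age


# Assumed example: a primary cycle whose final grade is intended for age 11.
primary_window = completion_age_window(11)
print(primary_window)
print(in_cohort(15, primary_window), in_cohort(17, primary_window))
```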

<h3 id="nso-microdata-coverage-and-sample-composition">NSO Microdata Coverage and Sample Composition</h3>

<p>For this reconstruction, I focused on a strategic selection of Latin
American countries—Argentina, Paraguay, and Honduras—representing a
diverse range of educational structures (e.g., varying cycles of
Educación Básica) to ensure the scalability and cross-country validity
of the framework.</p>

<p>The indicators are derived from microdata spanning the 2021–2024 window,
specifically:</p>

<ul>
  <li>
    <p><strong>Argentina – Encuesta Permanente de Hogares (EPH)</strong></p>
  </li>
  <li>
    <p><strong>Honduras – Encuesta Permanente de Hogares de Propósitos Múltiples
(EPHPM)</strong></p>
  </li>
  <li>
    <p><strong>Paraguay – Encuesta Permanente de Hogares Continua (EPHC)</strong></p>
  </li>
</ul>

<table>
  <thead>
    <tr>
      <th>Country</th>
      <th>Survey</th>
      <th>Year</th>
      <th>Sample Size</th>
      <th>Households</th>
      <th>Age Range</th>
      <th>Female (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Argentina</td>
      <td>Encuesta Permanente de Hogares</td>
      <td>2021</td>
      <td>192,600</td>
      <td>40,555</td>
      <td>1–103</td>
      <td>52.1%</td>
    </tr>
    <tr>
      <td>Argentina</td>
      <td>Encuesta Permanente de Hogares</td>
      <td>2022</td>
      <td>198,097</td>
      <td>42,583</td>
      <td>1–105</td>
      <td>52.2%</td>
    </tr>
    <tr>
      <td>Argentina</td>
      <td>Encuesta Permanente de Hogares</td>
      <td>2023</td>
      <td>193,382</td>
      <td>41,724</td>
      <td>1–108</td>
      <td>52.1%</td>
    </tr>
    <tr>
      <td>Argentina</td>
      <td>Encuesta Permanente de Hogares</td>
      <td>2024</td>
      <td>187,625</td>
      <td>41,150</td>
      <td>1–104</td>
      <td>52.0%</td>
    </tr>
    <tr>
      <td>Honduras</td>
      <td>EPHPM</td>
      <td>2021</td>
      <td>20,906</td>
      <td>27</td>
      <td>0–99</td>
      <td>51.9%</td>
    </tr>
    <tr>
      <td>Honduras</td>
      <td>EPHPM</td>
      <td>2022</td>
      <td>20,303</td>
      <td>5,211</td>
      <td>0–105</td>
      <td>52.9%</td>
    </tr>
    <tr>
      <td>Honduras</td>
      <td>EPHPM</td>
      <td>2023</td>
      <td>20,308</td>
      <td>5,342</td>
      <td>0–106</td>
      <td>52.4%</td>
    </tr>
    <tr>
      <td>Honduras</td>
      <td>EPHPM</td>
      <td>2024</td>
      <td>24,534</td>
      <td>6,487</td>
      <td>0–106</td>
      <td>52.7%</td>
    </tr>
    <tr>
      <td>Paraguay</td>
      <td>EPHC</td>
      <td>2021</td>
      <td><em>16,569</em></td>
      <td>4,646</td>
      <td>0–101</td>
      <td>50.8%</td>
    </tr>
    <tr>
      <td>Paraguay</td>
      <td>EPHC</td>
      <td>2022</td>
      <td>61,912</td>
      <td>17,972</td>
      <td>0–105</td>
      <td>50.6%</td>
    </tr>
    <tr>
      <td>Paraguay</td>
      <td>EPHC</td>
      <td>2023</td>
      <td>58,005</td>
      <td>17,037</td>
      <td>0–106</td>
      <td>50.7%</td>
    </tr>
    <tr>
      <td>Paraguay</td>
      <td>EPHC</td>
      <td>2024</td>
      <td>57,744</td>
      <td>17,242</td>
      <td>0–106</td>
      <td>50.5%</td>
    </tr>
  </tbody>
</table>

<p><strong>Note:</strong> Sample sizes are unweighted counts of harmonized person-level records
from each NSO survey. Indicator estimates are derived using weighted
population shares to account for survey design. For Paraguay 2021
(italicized in the table), I was only able to obtain the consolidated
file for the last quarter of the year from <a href="https://anda.ine.gov.py/anda/index.php/catalog/?page=1&amp;sk=EPHC&amp;from=2021&amp;to=2021&amp;ps=100">Paraguay
INE</a>.</p>

<h3 id="summary-statistics">Summary Statistics</h3>

<ul>
  <li><strong>Total persons:</strong> 1,051,985</li>
  <li><strong>Total households:</strong> 167,178</li>
  <li><strong>Countries:</strong> 3 (Argentina, Honduras, Paraguay)</li>
  <li><strong>Survey years:</strong> 4 (2021–2024)</li>
  <li><strong>Survey programs:</strong> 3 (EPH, EPHPM, EPHC)</li>
</ul>

<h2 id="harmonization">Harmonization</h2>

<p>The construction of comparable indicators from heterogeneous microdata
requires resolving two distinct problems. The first is structural: each
NSO utilizes its own variable names, coding schemes, and questionnaire
architectures. The second is conceptual: even when the same construct is
nominally measured—such as whether a child has “completed” a level—the
operationalization of that concept varies across education systems in
ways that a purely mechanical recode cannot resolve. I address these
problems through a two-tier harmonization framework: a global layer that
standardizes the analytical structure across all three sources, and an
<a href="#indicator-level-harmonization">indicator-level layer</a> that translates
national education cycle codes into ISCED-compatible classifications.</p>

<h3 id="methodological-foundation-peer-reviewed-harmonization-frameworks">Methodological Foundation: Peer-Reviewed Harmonization Frameworks</h3>

<p>The harmonization strategy employed here is grounded in three
peer-reviewed methodological frameworks that establish how heterogeneous
survey data can be transformed into comparable indicators:</p>

<ol>
  <li>
    <p><strong>IPUMS Harmonization of Census Data</strong> (Ruggles et al. 2019)
demonstrates that standardized metadata, correspondence tables, and
composite coding logic can map diverse source variables into
harmonized targets while preserving source detail separately. This
approach treats harmonization not as a free-standing guess but as a
reproducible transform governed by explicit documentation.</p>
  </li>
  <li>
    <p><strong>IPUMS MICS Data Harmonization Code</strong> (IPUMS International 2023)
provides a production implementation showing how standardized
variables, cross-survey coding rules, and source-specific set-up
logic are applied to heterogeneous UNICEF MICS samples. This
real-world example validates that metadata-driven transforms scale
across multiple surveys with incompatible original variable names.</p>
  </li>
  <li>
    <p><strong>Harmonizing Measurements through Shared Items</strong> (Desjardins et
al. 2024) establishes the principle that non-identical source
instruments can be mapped into a common metric through explicitly
declared anchors and transformation rules. Rather than assuming raw
comparability, this approach defines the transformation rules first,
then validates that the derived metric is methodologically
defensible.</p>
  </li>
</ol>

<p>These three frameworks collectively establish the theoretical and
practical foundation for the global harmonization layer. Instead of
treating national survey codes as intrinsically comparable, I use
correspondence tables, explicit admissibility rules, and source-specific
logic to derive harmonized variables that can support indicator
construction without hidden country-specific assumptions in downstream
code.</p>

<h3 id="global-harmonization">Global Harmonization</h3>

<p>The global layer functions as a transformation function that maps each
raw source-year file into a common person-level analytical record.
Rather than a simple renaming exercise, this transform identifies the
intersection of raw source variables, official source documentation, and
correspondence tables. Each variable is then passed through an
admissibility rule set that classifies it as directly harmonizable,
partially harmonizable, or non-comparable.</p>

<p>The output of this process is a standardized “Harmonized Analytical
Record” governed by four logical blocks:</p>

<ol>
  <li>
    <p>The Provenance Spine: Fields that preserve the source-year identity
and household-person keys, ensuring every estimate is wave-stable
and traceable back to the raw NSO file.</p>
  </li>
  <li>
    <p>The Design and Disaggregation Core: The minimum set of demographic
variables (age, sex, location) and sampling weights required for
representative estimation.</p>
  </li>
  <li>
    <p>The Education Block: Harmonized status fields (attendance, level
currently attending, highest level completed) that serve as the
direct inputs for indicator construction.</p>
  </li>
  <li>
    <p>The Exception Field: A record-level mechanism that logs
comparability caveats, ensuring that structural limitations in the
survey are made auditable rather than being absorbed silently into
the estimation code.</p>
  </li>
</ol>
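
<p>One way to picture the four blocks is as a single typed record. The sketch below is illustrative only: the <code class="language-plaintext highlighter-rouge">_h</code> field names appear elsewhere in this post, but the provenance keys (<code class="language-plaintext highlighter-rouge">country</code>, <code class="language-plaintext highlighter-rouge">year</code>, <code class="language-plaintext highlighter-rouge">household_id</code>, <code class="language-plaintext highlighter-rouge">person_id</code>), the <code class="language-plaintext highlighter-rouge">exceptions</code> list, and the example values are assumptions, not the repository's schema.</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HarmonizedRecord:
    # 1. Provenance spine (hypothetical field names)
    country: str
    year: int
    household_id: str
    person_id: str
    # 2. Design and disaggregation core
    age: int
    sex_h: str
    location_h: str
    weight_h: float
    # 3. Education block (None marks a missing or structurally absent value)
    attending_currently_h: Optional[bool] = None
    current_level_h: Optional[int] = None
    highest_level_completed_h: Optional[int] = None
    # 4. Exception field: comparability caveats logged per record
    exceptions: List[str] = field(default_factory=list)


# Invented example values, not taken from any survey file.
rec = HarmonizedRecord("PRY", 2023, "hh-001", "hh-001-02", 13,
                       "female", "rural", 212.5)
rec.exceptions.append("current_level_h: structural missing in source")
print(rec.current_level_h, rec.exceptions)
```

<p>Carrying the caveat on the record itself is what keeps a structural gap visible downstream instead of letting it look like an ordinary missing value.</p>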

<h4 id="weighted-population-estimator">Weighted Population Estimator</h4>

<p>To translate the harmonized microdata into cross-nationally comparable
indicators, I employ a weighted population share estimator grounded in
UIS household-survey methodology (UNESCO Institute for Statistics 2024).
The estimator is simple in principle but precise in practice: it
computes each indicator as a weighted ratio of individuals meeting both
the eligible-universe condition (usually defined by age) and the
indicator-specific status condition (e.g., currently attending, or
having completed a level). Specifically, for each indicator, I calculate
the sum of survey weights for individuals satisfying both conditions,
divided by the total sum of weights for the eligible universe. This
ensures that estimates reflect the national population structure
captured by the survey design, not merely the sample composition.</p>
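
<p>Written compactly, the description above corresponds to the ratio below. The notation is introduced here for illustration and is not taken from the UIS source: $w_i$ is the survey weight of person $i$, $U$ is the eligible universe, and $s_i \in \{0, 1\}$ flags whether person $i$ meets the indicator-specific status condition.</p>

<p>$$\hat{p} = \frac{\sum_{i \in U} w_i \, s_i}{\sum_{i \in U} w_i}$$</p>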

<p>Critically, the eligible universe is defined <strong>strictly by age</strong>,
regardless of whether education variables are present or missing. For
example, a primary completion rate denominator includes all respondents
aged 14–16, even if some have missing data for
<code class="language-plaintext highlighter-rouge">highest_level_completed_h</code>. This approach prevents item non-response from
shrinking the denominator, which would otherwise inflate the estimated
rates, and it maintains the demographic integrity of the reference
population, a key principle in WIDE and VIEW methodology (Global Education
Monitoring Report 2026). The weighted share estimator thus ensures that
reported rates are not only methodologically defensible but also
represent actual population proportions, not sample artifacts.</p>
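
<p>A minimal Python sketch of the estimator with an age-only denominator follows. The pipeline itself is written in R, so this is an illustration of the logic rather than the actual code; the function name, the <code class="language-plaintext highlighter-rouge">completed_primary</code> key, and the toy records are assumptions.</p>

```python
def weighted_share(records, age_min, age_max, status_key):
    """Weighted share of the age-defined universe whose status is True.

    A missing status (None) stays in the denominator and never counts
    toward the numerator.
    """
    numerator = 0.0
    denominator = 0.0
    for rec in records:
        if rec["age"] >= age_min and age_max >= rec["age"]:
            denominator += rec["weight_h"]
            if rec.get(status_key) is True:
                numerator += rec["weight_h"]
    if denominator == 0:
        return None  # empty universe: no estimate
    return numerator / denominator


# Toy records; weights and statuses are invented for illustration.
people = [
    {"age": 14, "weight_h": 100.0, "completed_primary": True},
    {"age": 15, "weight_h": 300.0, "completed_primary": False},
    {"age": 16, "weight_h": 100.0, "completed_primary": None},  # missing
    {"age": 30, "weight_h": 500.0, "completed_primary": True},  # outside universe
]

share = weighted_share(people, 14, 16, "completed_primary")
print(share)
```

<p>In this toy case the person with a missing status still contributes a weight of 100 to the denominator, so the share is 100 / 500 rather than 100 / 400.</p>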

<p>The variable-level mapping for the education block is:</p>

<table>
  <thead>
    <tr>
      <th>Harmonized variable</th>
      <th>ARG (EPH)</th>
      <th>HND (EPHPM)</th>
      <th>PRY (EPHC)</th>
      <th>Rule type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">attending_currently_h</code></td>
      <td><code class="language-plaintext highlighter-rouge">CH10</code></td>
      <td><code class="language-plaintext highlighter-rouge">ED03</code></td>
      <td><code class="language-plaintext highlighter-rouge">ED08</code></td>
      <td>direct / direct / direct</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">current_level_h</code></td>
      <td><code class="language-plaintext highlighter-rouge">NIVEL_ED</code> + state logic</td>
      <td><code class="language-plaintext highlighter-rouge">ED10</code></td>
      <td>structural missing</td>
      <td>direct / conditional</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">highest_level_completed_h</code></td>
      <td><code class="language-plaintext highlighter-rouge">NIVEL_ED</code> + <code class="language-plaintext highlighter-rouge">ESTADO</code></td>
      <td><code class="language-plaintext highlighter-rouge">ED05</code></td>
      <td><code class="language-plaintext highlighter-rouge">ED0504</code> (split)</td>
      <td>conditional / split-coded</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">highest_grade_completed_h</code></td>
      <td>structural missing</td>
      <td><code class="language-plaintext highlighter-rouge">ED08</code></td>
      <td><code class="language-plaintext highlighter-rouge">ED0504</code> (split)</td>
      <td>direct / split-coded</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">literacy_h</code></td>
      <td>structural missing</td>
      <td><code class="language-plaintext highlighter-rouge">ED01</code></td>
      <td><code class="language-plaintext highlighter-rouge">ED02</code></td>
      <td>direct</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">repetition_h</code></td>
      <td>structural missing</td>
      <td><code class="language-plaintext highlighter-rouge">ED11</code></td>
      <td>structural missing</td>
      <td>direct</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">weight_h</code></td>
      <td><code class="language-plaintext highlighter-rouge">PONDERA</code></td>
      <td><code class="language-plaintext highlighter-rouge">FACTOR</code></td>
      <td><code class="language-plaintext highlighter-rouge">FEX</code> / <code class="language-plaintext highlighter-rouge">FEX.2022</code></td>
      <td>direct</td>
    </tr>
  </tbody>
</table>

<p>Four fields carry a <code class="language-plaintext highlighter-rouge">structural missing</code> designation for one or more
countries. For Argentina, the EPH does not include a separate
grade-completed variable; <code class="language-plaintext highlighter-rouge">NIVEL_ED</code> conflates current enrolment level
with historical attainment and requires disambiguation through
attendance and labor-force state variables. For Paraguay, no validated
current-study level variable was identified in the <code class="language-plaintext highlighter-rouge">REG02</code> person file.
These absences propagate into specific methodological decisions at the
indicator layer.</p>
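
<p>The mapping table can also be held as a plain correspondence structure, which makes the structural gaps easy to list per country. The sketch below transcribes the table in simplified form (it keeps only the main source variable and drops the conditional logic and the year-specific <code class="language-plaintext highlighter-rouge">FEX.2022</code> weight); the dictionary layout and the function name are assumptions, not the repository's format.</p>

```python
# Education-block correspondence table, transcribed from the mapping table
# in this section. None stands for "structural missing".
CORRESPONDENCE = {
    "attending_currently_h": {"ARG": "CH10", "HND": "ED03", "PRY": "ED08"},
    "current_level_h": {"ARG": "NIVEL_ED", "HND": "ED10", "PRY": None},
    "highest_level_completed_h": {"ARG": "NIVEL_ED", "HND": "ED05", "PRY": "ED0504"},
    "highest_grade_completed_h": {"ARG": None, "HND": "ED08", "PRY": "ED0504"},
    "literacy_h": {"ARG": None, "HND": "ED01", "PRY": "ED02"},
    "repetition_h": {"ARG": None, "HND": "ED11", "PRY": None},
    "weight_h": {"ARG": "PONDERA", "HND": "FACTOR", "PRY": "FEX"},
}


def structural_gaps(table):
    """Map each country to the harmonized fields it cannot supply."""
    gaps = {}
    for target, sources in table.items():
        for country, source_var in sources.items():
            if source_var is None:
                gaps.setdefault(country, []).append(target)
    return gaps


gaps = structural_gaps(CORRESPONDENCE)
print(gaps)
```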

<h3 id="indicator-level-harmonization">Indicator-Level Harmonization</h3>

<p>The global harmonization layer standardizes variable names and
structures. But a second, deeper problem remains: <strong>national education
codes do not naturally align with ISCED</strong>. Honduras encodes nine years
under one code. Paraguay bundles level and grade into a single composite
number. Argentina’s <code class="language-plaintext highlighter-rouge">NIVEL_ED</code> field conflates current enrollment with
historical completion. To build trustworthy cross-country indicators, I
conducted a structural audit of each NSO’s questionnaire logic and
derived “hard mappings”—deterministic, data-driven rules that translate
each country’s native codes into ISCED classifications. These mappings
are grounded in source documentation and empirically validated against
WIDE benchmarks. Below, I walk through each country’s approach, showing
both the challenge and the specific solution.</p>

<h4 id="honduras--ed05--cp407-to-isced-dual-standard-reconciliation-ephpm">Honduras — <code class="language-plaintext highlighter-rouge">ED05</code> / <code class="language-plaintext highlighter-rouge">CP407</code> to ISCED: Dual-Standard Reconciliation (EPHPM)</h4>

<p><strong>The Problem:</strong> Honduras’ <em>Educación Básica</em> system spans nine years of
schooling, but the EPHPM collapses this entire span into a single level
code (<code class="language-plaintext highlighter-rouge">ED05=4</code> for 2022+; <code class="language-plaintext highlighter-rouge">CP407=4</code> for 2021). To distinguish primary
completion (6 years) from lower secondary completion (9 years), we must
parse the companion variable <code class="language-plaintext highlighter-rouge">ED08</code> (cumulative years within básica,
values 1–9). Complicating this, the 2021 survey used <code class="language-plaintext highlighter-rouge">CP407</code> with
different category labels than the 2022+ <code class="language-plaintext highlighter-rouge">ED05</code> variable, a questionnaire
redesign that broke consistency across years. Only the level 4 mapping
is stable across both waves.</p>

<p><strong>The Solution:</strong> I constructed separate mappings for each variable,
using grade thresholds to split the nine-year básica cycle into
ISCED-compatible boundaries. The table below shows how each code-grade
combination maps to ISCED levels for both survey versions.</p>

<p><strong>ISCED Mapping</strong></p>

<table>
  <thead>
    <tr>
      <th>Code</th>
      <th>Grade</th>
      <th>2021 (CP407)</th>
      <th>2022+ (ED05)</th>
      <th>ISCED</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>4</td>
      <td>1–5</td>
      <td>Básica (incompleto)</td>
      <td>Básica (incompleto)</td>
      <td>1</td>
    </tr>
    <tr>
      <td>4</td>
      <td>6+</td>
      <td>Básica (primaria)</td>
      <td>Básica (primaria)</td>
      <td>1</td>
    </tr>
    <tr>
      <td>4</td>
      <td>3 or 9</td>
      <td>Ciclo Común / Básica final</td>
      <td>Básica final</td>
      <td>2</td>
    </tr>
    <tr>
      <td>5</td>
      <td>—</td>
      <td>Ciclo Común (pre-reform)</td>
      <td>Media (upper secondary)</td>
      <td>2 / 3</td>
    </tr>
    <tr>
      <td>6</td>
      <td>—</td>
      <td>Media (upper secondary)</td>
      <td>—</td>
      <td>3</td>
    </tr>
    <tr>
      <td>6+</td>
      <td>—</td>
      <td>—</td>
      <td>Superior (higher education)</td>
      <td>4</td>
    </tr>
  </tbody>
</table>

<p><strong>2023 Case: Two-Track Reporting Approach</strong></p>

<p>For Honduras 2023, the pipeline estimates completion rates two ways
using the <strong>identical ISCED mapping</strong> but different methodological
choices about the reference population. This two-track approach reveals
whether observed deviations from WIDE benchmarks are caused by the
mapping itself or by denominator and cohort definitions:</p>

<ol>
  <li><strong>Standard Series (Conservative):</strong> Age 20–29, all respondents.
Treats missing level data (~12.5% of cases) as non-completion. This
is the internal methodology used by the pipeline for consistency
across all countries.
    <ul>
      <li>Primary: 76.44% (gap −8.36 pp vs. WIDE 84.80%)</li>
      <li>Lower Secondary: 48.34% (gap −6.46 pp vs. WIDE 54.80%)</li>
      <li>Upper Secondary: 35.11% (gap −6.59 pp vs. WIDE 41.70%)</li>
    </ul>
  </li>
  <li><strong>Harmonized Series (WIDE-aligned Method):</strong> Age 25–29, valid levels
only (denominator restricted to respondents with recorded level
data, excluding the ~12.5% with missing values). This approximates the
WIDE methodology: it excludes the 20–24 age group, part of which is
still in school, and treats missing data as non-response rather than
non-completion.
    <ul>
      <li>Primary: 88.83% (gap +4.03 pp vs. WIDE 84.80%)</li>
      <li>Lower Secondary: 56.11% (gap +1.31 pp vs. WIDE 54.80%)</li>
      <li>Upper Secondary: 43.04% (gap +1.34 pp vs. WIDE 41.70%)</li>
    </ul>
  </li>
</ol>
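
<p>The difference between the two tracks comes down to two switches: the age window and whether missing levels stay in the denominator. The sketch below shows that mechanism on an invented five-person cohort. It is not the pipeline's code and does not use EPHPM data; for brevity it also compares a raw level code against a threshold instead of applying the full level-and-grade mapping.</p>

```python
def completion_rate(records, age_min, age_max, level_threshold, valid_only):
    """Weighted percentage with a completed level at or above the threshold.

    valid_only=False keeps missing levels in the denominator (standard series);
    valid_only=True drops them from the denominator (harmonized series).
    """
    numerator = 0.0
    denominator = 0.0
    for rec in records:
        if not (rec["age"] >= age_min and age_max >= rec["age"]):
            continue
        level = rec["highest_level_completed_h"]
        if level is None and valid_only:
            continue
        denominator += rec["weight_h"]
        if level is not None and level >= level_threshold:
            numerator += rec["weight_h"]
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


# Invented toy cohort with equal weights.
toy = [
    {"age": 21, "weight_h": 1.0, "highest_level_completed_h": None},
    {"age": 23, "weight_h": 1.0, "highest_level_completed_h": 4},
    {"age": 26, "weight_h": 1.0, "highest_level_completed_h": 5},
    {"age": 27, "weight_h": 1.0, "highest_level_completed_h": 3},
    {"age": 28, "weight_h": 1.0, "highest_level_completed_h": None},
]

standard = completion_rate(toy, 20, 29, 4, valid_only=False)
harmonized = completion_rate(toy, 25, 29, 4, valid_only=True)
print(standard, harmonized)
```

<p>On the toy cohort the standard track gives 2 of 5 (40%) and the harmonized track gives 1 of 2 (50%), with the mapping rule held fixed in both calls.</p>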

<p><strong>Interpretation:</strong> Both series apply the same ISCED mapping to Honduras
2023 EPHPM data. The harmonized series demonstrates that Honduras <em>can</em>
achieve WIDE-level alignment through methodological choices in cohort
definition (age 25–29 vs. 20–29) and denominator treatment (valid-only
vs. all individuals). This pattern suggests the indicator drift in the
standard series is structural—driven by demographic composition and
missing data handling—rather than a mapping or formula error. The
two-track approach reveals that “completion rate” is inherently
dependent on how you define the reference cohort and treat missing
values; neither approach is intrinsically “right,” but they measure
different aspects of educational attainment.</p>

<p><strong>ISCED Mapping</strong></p>

<p><strong>Primary completion — all waves (standard and harmonized):</strong></p>

<table>
  <thead>
    <tr>
      <th>Level</th>
      <th>Grade</th>
      <th>ISCED</th>
      <th>Logic</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>4</td>
      <td>≥ 6</td>
      <td>1</td>
      <td>Educación Básica with grade-within-basic ≥ 6</td>
    </tr>
    <tr>
      <td>≥ 5</td>
      <td>—</td>
      <td>≥ 3</td>
      <td>Above Básica (Bachillerato or tertiary)</td>
    </tr>
  </tbody>
</table>

<p><strong>Lower secondary completion — both series (revised mapping with Grade
3):</strong></p>

<table>
  <thead>
    <tr>
      <th>Level</th>
      <th>Grade</th>
      <th>ISCED</th>
      <th>Logic</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>4</td>
      <td>3 or 9</td>
      <td>2</td>
      <td>Ciclo Común (Grade 3, CP407) OR Básica final (Grade 9, ED05)</td>
    </tr>
    <tr>
      <td>5</td>
      <td>—</td>
      <td>2 or 3</td>
      <td>Code 5: Ciclo Común in 2021 (→ ISCED 2); Media in 2022+ (→ ISCED 3)</td>
    </tr>
    <tr>
      <td>≥ 6</td>
      <td>—</td>
      <td>≥ 3</td>
      <td>Bachillerato or above (2022+ ED05; 2021 CP407 ≥ 7)</td>
    </tr>
  </tbody>
</table>
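
<p>Read as code, the two completion tables above reduce to a pair of small predicates. This is a sketch under stated assumptions, not the repository's implementation: the function names are hypothetical, <code class="language-plaintext highlighter-rouge">level</code> stands for the <code class="language-plaintext highlighter-rouge">ED05</code> / <code class="language-plaintext highlighter-rouge">CP407</code> code, <code class="language-plaintext highlighter-rouge">grade</code> stands for the companion grade variable, and a missing level is returned as <code class="language-plaintext highlighter-rouge">None</code> so the caller can decide how to treat it.</p>

```python
def completed_primary(level, grade):
    """Primary (ISCED 1) completion flag from the level and grade codes."""
    if level is None:
        return None  # level not recorded
    if level >= 5:
        return True  # above Básica
    return level == 4 and grade is not None and grade >= 6


def completed_lower_secondary(level, grade):
    """Lower secondary (ISCED 2) completion flag, per the revised mapping."""
    if level is None:
        return None
    if level >= 5:
        return True  # code 5 or above maps to ISCED 2 or higher in both waves
    return level == 4 and grade in (3, 9)


print(completed_primary(4, 6), completed_primary(4, 5))
print(completed_lower_secondary(4, 9), completed_lower_secondary(4, 6))
```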

<p><strong>Upper Secondary and Tertiary (ISCED 3+) — Survey Redesign Challenge</strong></p>

<p>Above the lower secondary level, the questionnaire redesign between the
2021 and 2022 waves creates a critical mapping problem: code 6 in CP407
does not mean the same thing as code 6 in ED05. In 2021 (CP407), code 6
represents upper secondary education (Media). From 2022 onward (ED05),
code 6 represents tertiary education. Because of this code shift,
year-conditional logic is needed to correctly identify who has reached
tertiary education (ISCED 4+):</p>

<ul>
  <li>
    <p><strong>2021 (CP407):</strong> <code class="language-plaintext highlighter-rouge">lvl ≥ 7</code> → ISCED 4+ (CP407: 6=Media/secondary,
7+=Tertiary)</p>
  </li>
  <li>
    <p><strong>2022+ (ED05):</strong> <code class="language-plaintext highlighter-rouge">lvl ≥ 6</code> → ISCED 4+ (ED05: 5=Media/secondary,
6+=Tertiary)</p>
  </li>
</ul>

<p>This year-conditional boundary ensures that the same individual’s
education level maps consistently to ISCED across both survey versions,
despite the code reassignments in the redesign.</p>
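
<p>The year-conditional boundary can be sketched in a few lines of Python. This is an illustration rather than the pipeline’s implementation: the function name and the choice to raise an error for waves before 2021 are assumptions, since the rule above only covers 2021 and 2022+.</p>

```python
def reached_tertiary(survey_year: int, lvl: int) -> bool:
    """Return True when a level code implies tertiary (ISCED 4+) for that survey year."""
    if survey_year >= 2022:
        threshold = 6  # ED05: 5 = Media/secondary, 6+ = tertiary
    elif survey_year == 2021:
        threshold = 7  # CP407: 6 = Media/secondary, 7+ = tertiary
    else:
        # Earlier waves are not covered by the rule described here.
        raise ValueError(f"no tertiary threshold defined for {survey_year}")
    return lvl >= threshold


# Code 6 is read differently depending on the wave.
print(reached_tertiary(2021, 6))  # False
print(reached_tertiary(2022, 6))  # True
```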

<p><strong>Two Structural Constraints: Attending Students and Denominator
Restrictions</strong></p>

<p>The EPHPM survey design creates two additional challenges beyond the
code shift. Both affect how we compute completion rates:</p>

<p><strong>(1) Attending-student gap:</strong> The <code class="language-plaintext highlighter-rouge">ED05</code> variable is only populated for
non-attending respondents; currently-attending students have
<code class="language-plaintext highlighter-rouge">highest_level_completed_h</code> missing. To estimate primary completion for
attending students, we apply a two-tier inference strategy: (Tier 1) any
attending student with <code class="language-plaintext highlighter-rouge">current_level_h &gt; 4</code> (studying above básica) has
completed primary; (Tier 2) any attending student aged ≥15 still in
level 4 is also credited with primary completion, following the UIS
convention that age 15 represents the minimum post-primary age without
overage. This inference captures students still progressing through the
system.</p>
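
<p>A minimal Python sketch of the two-tier inference follows. The function name is hypothetical, and returning <code class="language-plaintext highlighter-rouge">False</code> for attending students who match neither tier is an assumption; the text above only states which students are credited.</p>

```python
from typing import Optional

BASICA_LEVEL = 4  # level code for Educación Básica in the mapping tables


def primary_completed_if_attending(current_level: Optional[int],
                                   age: Optional[int]) -> Optional[bool]:
    """Infer primary completion for a currently-attending student."""
    if current_level is None:
        return None  # nothing to infer from
    if current_level > BASICA_LEVEL:
        return True  # Tier 1: studying above básica
    if current_level == BASICA_LEVEL and age is not None and age >= 15:
        return True  # Tier 2: aged 15+ and still in level 4
    return False     # assumption: everyone else is left uncredited


print(primary_completed_if_attending(5, 12))  # True
print(primary_completed_if_attending(4, 14))  # False
```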

<p><strong>(2) Lower secondary denominator restriction:</strong> Because <code class="language-plaintext highlighter-rouge">ED05</code> is
structurally absent for attending students, official WIDE methodology
conditions the lower secondary completion rate denominator on
non-attending respondents only (those who have exited the system). This
structural constraint explains why our standard series shows 9–12 pp
lower rates than the WIDE benchmark: the two series measure completion
over different populations, and neither is computed incorrectly.
Applying the same non-attending restriction replicates WIDE’s
methodology and accounts for the observed benchmark gap.</p>

<p><strong>Rationale for Dual Series:</strong></p>

<p>The two-track approach documents that the Honduras 2023 indicator drift
reflects definitional choices, not harmonization failure. By
demonstrating that the same mapping produces WIDE-aligned results under
different (but justifiable) assumptions about cohort and denominator, I
establish that the observed gap is a methodological tension inherent in
cross-national comparison rather than a bug in the EPHPM-to-ISCED
translation. This approach is particularly important given the
structural constraints of the EPHPM: the absence of current-grade data
for attending students and the code shift between survey redesigns.</p>

<h4 id="paraguay-ed0504-national-cycle-codes-with-attendance-aware-upper-secondary-completion">Paraguay: ED0504 National Cycle Codes with Attendance-Aware Upper Secondary Completion</h4>

<p><strong>The Problem:</strong> Paraguay’s household survey embeds <em>both</em> the education
level <strong>and</strong> the grade within that level in a single variable,
<code class="language-plaintext highlighter-rouge">ED0504</code>. The two pieces are separated arithmetically: integer
division, <code class="language-plaintext highlighter-rouge">ED0504 %/% 10</code> (quotient), gives the level code, and the
modulo, <code class="language-plaintext highlighter-rouge">ED0504 %% 10</code> (remainder), gives the grade. The level codes are
Paraguay-specific (21=EEB 1st cycle, 30=2nd cycle, 40=3rd cycle,
90=Bachillerato, etc.) with no direct correspondence to ISCED.</p>
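
<p>For readers more used to Python than R, the same split can be sketched with <code class="language-plaintext highlighter-rouge">divmod</code>. The function name and the two example values are illustrative assumptions, chosen to match the level and grade combinations discussed in this section.</p>

```python
from typing import Tuple


def split_ed0504(ed0504: int) -> Tuple[int, int]:
    """Split a composite ED0504 value into (level code, grade)."""
    # divmod(x, 10) returns (x // 10, x % 10), the Python counterparts of R's %/% and %%.
    level, grade = divmod(ed0504, 10)
    return level, grade


print(split_ed0504(409))  # (40, 9)
print(split_ed0504(903))  # (90, 3)
```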

<p><strong>The Solution:</strong> The table below maps each Paraguay level code to its
ISCED equivalent. A critical addition: for Bachillerato (level 90), we
verify both final grade completion <strong>and</strong> non-attendance status,
applying a principle from WIDE methodology that completion means
graduation, not just enrollment in the final year.</p>

<p><strong>ISCED Mapping</strong></p>

<table>
  <thead>
    <tr>
      <th>Level Code</th>
      <th>Cycle</th>
      <th>ISCED Mapping</th>
      <th>Indicator Logic</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0, 10</td>
      <td>Pre-school / None</td>
      <td>0</td>
      <td>→ ISCED 0</td>
      <td>Below primary</td>
    </tr>
    <tr>
      <td>21</td>
      <td>EEB 1st cycle (grades 1–3)</td>
      <td>1</td>
      <td>→ ISCED 1</td>
      <td>Incomplete primary</td>
    </tr>
    <tr>
      <td>30</td>
      <td>EEB 2nd cycle (grades 4–6)</td>
      <td>1</td>
      <td>→ ISCED 1</td>
      <td>Incomplete primary</td>
    </tr>
    <tr>
      <td>40</td>
      <td>EEB 3rd cycle (grades 7–9)</td>
      <td>1</td>
      <td>→ ISCED 1</td>
      <td>Primary complete threshold at level 40 (entry to 3rd cycle = 6-year primary done)</td>
    </tr>
    <tr>
      <td>90</td>
      <td>Bachillerato / Media</td>
      <td>3</td>
      <td><strong>grd==3 &amp; attend≠2</strong> → ISCED 3</td>
      <td>Upper secondary: <strong>(2021-2023)</strong> Requires final grade (3) AND not currently enrolled in secondary (attend code 2 = “estudiando”). People with attend=2 are still in school; WIDE methodology counts only actual graduates.</td>
    </tr>
    <tr>
      <td>100–199</td>
      <td>University / Tertiary</td>
      <td>4+</td>
      <td>→ ISCED 4+</td>
      <td>Regular tertiary; all count as upper secondary complete</td>
    </tr>
    <tr>
      <td>240–999</td>
      <td>Técnico Superior / Advanced Tertiary</td>
      <td>4+</td>
      <td>→ ISCED 4+</td>
      <td>Short-cycle &amp; advanced tertiary; all count as upper secondary complete. Level 240 (Técnico Superior, ~2-3 year vocational) enters immediately after Bachillerato; presence at 240+ proves secondary completion.</td>
    </tr>
  </tbody>
</table>

<h5 id="completion-logic-by-level-hierarchical-cascading">Completion Logic by Level (Hierarchical Cascading)</h5>

<p><strong>Primary (ISCED 1):</strong> Level 40+ (anyone entering EEB 3rd cycle or
higher has completed 6-year primary).</p>

<p><strong>Lower Secondary (ISCED 2):</strong> Level 40 with grade=9 (EEB completion),
or level 90+ (anyone at Bachillerato/tertiary has passed lower
secondary).</p>

<ul>
  <li>
    <p><em>Denominator restriction:</em> Non-attending respondents only
(<code class="language-plaintext highlighter-rouge">attending_currently_h == 19</code>), matching WIDE methodology.</p>
  </li>
</ul>

<p><strong>Upper Secondary (ISCED 3):</strong></p>

<ul>
  <li>
    <p><strong>Level 90 (Bachillerato):</strong> Final
grade completed (grd==3) <strong>AND</strong> not currently in school (attend≠2).</p>
    <ul>
      <li>
        <p><em>Rationale:</em> WIDE methodology is strict on enrollment status. Survey
timing can capture students in their final month before graduation;
without the attending filter, these count as completers even though
diplomas aren’t issued until the following calendar year.</p>
      </li>
      <li>
        <p><em>Attending code mapping:</em> Code 2 = “estudiando” (currently attending secondary).
Code 19 and NA = non-attending (graduated or dropped out).</p>
      </li>
    </ul>
  </li>
  <li>
    <p><strong>Level 100–999 (Tertiary):</strong> All tertiary attendance proves secondary
completion (hierarchical cascade).</p>
  </li>
</ul>
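
<p>The cascade can be summarised as a small Python sketch that takes the decoded level, grade, and attendance code and returns one flag per ISCED level. The function and key names are hypothetical, and the sketch only builds numerators: the non-attending denominator restriction for lower secondary is applied elsewhere.</p>

```python
from typing import Dict, Optional


def paraguay_completion(level: int, grade: int, attend: Optional[int]) -> Dict[str, bool]:
    """Completion flags from the decoded level/grade and the attendance code."""
    in_school = attend == 2  # 2 = "estudiando"; 19 or missing = non-attending
    return {
        # Entering the EEB 3rd cycle (level 40) or higher implies 6-year primary is done.
        "primary": level >= 40,
        # EEB finished (level 40, grade 9), or any level at Bachillerato and above.
        "lower_secondary": (level == 40 and grade == 9) or level >= 90,
        # Bachillerato final grade and not attending, or any tertiary level.
        "upper_secondary": (level == 90 and grade == 3 and not in_school) or level >= 100,
    }


print(paraguay_completion(90, 3, 2)["upper_secondary"])   # False
print(paraguay_completion(90, 3, 19)["upper_secondary"])  # True
```

<p>The two calls differ only in the attendance code, which is exactly the case the attending filter is meant to separate.</p>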

<h5 id="grade-handling-and-population-restriction">Grade Handling and Population Restriction</h5>

<p><em>Grade handling:</em> For all levels except the upper_secondary patch,
within-cycle grade is typically discarded (set to <code class="language-plaintext highlighter-rouge">NA</code>). For level 90
specifically, grade==3 is verified to ensure Bachillerato final year
(3-year cycle). The estimator then applies the attendance filter to
remove in-progress students.</p>

<p><em>Population restriction:</em> Lower secondary completion uses non-attending
respondents only (<code class="language-plaintext highlighter-rouge">attending_currently_h == 19</code>), matching WIDE
methodology and explaining why lower secondary COMP_LVL is much lower
(~84%) than primary (~99%). Primary and upper secondary use the full
reference-age population (all respondents in the cohort, regardless of
attendance status).</p>
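
<p>As a rough illustration of the population restriction, the Python sketch below computes a weighted completion share with and without the non-attending filter. The record keys <code class="language-plaintext highlighter-rouge">weight</code> and <code class="language-plaintext highlighter-rouge">completed</code> are hypothetical names; only <code class="language-plaintext highlighter-rouge">attending_currently_h == 19</code> comes from the text.</p>

```python
from typing import Dict, Iterable, Optional

NON_ATTENDING = 19  # code quoted in the text for non-attending respondents


def lower_secondary_rate(rows: Iterable[Dict],
                         non_attending_only: bool = True) -> Optional[float]:
    """Weighted share of completers, optionally restricted to non-attending respondents."""
    total = 0.0
    completed = 0.0
    for row in rows:
        if non_attending_only and row["attending_currently_h"] != NON_ATTENDING:
            continue  # attending respondents are outside the denominator
        total += row["weight"]
        if row["completed"]:
            completed += row["weight"]
    return completed / total if total > 0 else None
```

<p>Calling the function twice on the same records, once with <code class="language-plaintext highlighter-rouge">non_attending_only=False</code>, shows how much of a difference in rates comes from the denominator alone.</p>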

<h4 id="argentina--attendance-ch10-and-completion-ch12ch13ch14nivel_ed-eph">Argentina — Attendance (<code class="language-plaintext highlighter-rouge">CH10</code>) and Completion (<code class="language-plaintext highlighter-rouge">CH12</code>/<code class="language-plaintext highlighter-rouge">CH13</code>/<code class="language-plaintext highlighter-rouge">CH14</code>/<code class="language-plaintext highlighter-rouge">NIVEL_ED</code>) (EPH)</h4>

<p><strong>Attendance — Direct Question:</strong> Argentina’s EPH includes a direct,
unambiguous attendance question (<code class="language-plaintext highlighter-rouge">CH10</code>): 1 = currently attending
school; anything else = not attending. Compared to Honduras (where we
must infer attendance from incomplete level codes) or Paraguay (where
grade must be parsed from a composite), Argentina’s attendance mapping
is straightforward. This simplicity yields ~99% primary attendance
rates, closely aligned with WIDE benchmarks.</p>

<p><strong>Completion — A Conflation Problem:</strong> The completion mapping is more
complex. Argentina’s <code class="language-plaintext highlighter-rouge">NIVEL_ED</code> field conflates two incompatible
aspects: it records both current enrollment level <em>and</em> highest
attainment level simultaneously. To resolve this, the pipeline uses a
surgical two-phase approach: first, extract raw variables (<code class="language-plaintext highlighter-rouge">CH12</code>,
<code class="language-plaintext highlighter-rouge">CH13</code>, <code class="language-plaintext highlighter-rouge">CH14</code>) that disambiguate what <code class="language-plaintext highlighter-rouge">NIVEL_ED</code> actually means;
second, apply stricter grade thresholds to account for provincial
variation in secondary structure.</p>

<p>The EPH’s <code class="language-plaintext highlighter-rouge">NIVEL_ED</code> also merges two incompatible education structures
(the traditional primary + secondary structure, 7+5, and the EGB +
Polimodal structure, 9+3, introduced by the 1993 federal education
reform), causing systematic misclassification of lower secondary
completion. The solution: use supplementary variables to disambiguate
what <code class="language-plaintext highlighter-rouge">NIVEL_ED=3</code> actually represents, then apply appropriate ISCED
mappings.</p>

<p>The table below shows the base <code class="language-plaintext highlighter-rouge">NIVEL_ED</code> codes and their ISCED
translations. Where <code class="language-plaintext highlighter-rouge">NIVEL_ED=3</code> appears (the ambiguous case), the
rightmost column indicates how the surgical fix disambiguates using
<code class="language-plaintext highlighter-rouge">CH12</code>, <code class="language-plaintext highlighter-rouge">CH13</code>, and <code class="language-plaintext highlighter-rouge">CH14</code>:</p>

<table>
  <thead>
    <tr>
      <th>NIVEL_ED</th>
      <th>Base Interpretation</th>
      <th>Base ISCED</th>
      <th>Surgical Fix (CH12/CH13/CH14)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1–2</td>
      <td>No formal schooling / incomplete primary</td>
      <td>0/1</td>
      <td>No change; direct assignment</td>
    </tr>
    <tr>
      <td>3</td>
      <td>Secondary incomplete (conflates two systems)</td>
      <td>→ 1 or 2</td>
      <td><strong>Disambiguated by CH12</strong>: EGB (CH12=3) + completion OR Grade 9; Traditional secondary (CH12=4) + Grade 3+; Tertiary (CH12≥5) → ISCED 2; Missing CH12 → ISCED 1</td>
    </tr>
    <tr>
      <td>4</td>
      <td>Incomplete traditional secondary</td>
      <td>3</td>
      <td>No change; incomplete upper secondary</td>
    </tr>
    <tr>
      <td>5</td>
      <td>Complete secondary / Polimodal</td>
      <td>3</td>
      <td>No change; complete upper secondary</td>
    </tr>
    <tr>
      <td>6–11</td>
      <td>Tertiary and above</td>
      <td>4+</td>
      <td>No change; direct assignment</td>
    </tr>
  </tbody>
</table>

<h5 id="phase-1-raw-variable-extraction">Phase 1: Raw Variable Extraction</h5>

<p>Extract three raw EPH variables to disambiguate <code class="language-plaintext highlighter-rouge">NIVEL_ED=3</code> (“secondary
incomplete”):</p>

<ul>
  <li>
    <p><strong>CH12</strong>: Highest level attended (1=pre-primary,
2=traditional primary, 3=EGB, 4=secondary, 5+=tertiary)</p>
  </li>
  <li>
    <p><strong>CH13</strong>: Completion status (1=completed, 2=not completed)</p>
  </li>
  <li>
    <p><strong>CH14</strong>: Last approved grade/year (numeric 0-9 for primary/EGB cycles, 1-6 for
secondary)</p>
  </li>
</ul>

<h5 id="phase-2-isced-mapping-with-stricter-thresholds">Phase 2: ISCED Mapping with Stricter Thresholds</h5>

<p><strong>NIVEL_ED = 1–2</strong> (No schooling / incomplete primary) → <strong>ISCED 0/1</strong></p>

<p><strong>NIVEL_ED = 3</strong> (Secondary incomplete): depends on raw evidence.</p>

<ul>
  <li>
    <p><strong>EGB system (CH12=3)</strong>: ISCED 2 if CH13=1 (finished 9 years) OR CH14≥9
(approved all grades)</p>
  </li>
  <li>
    <p><strong>Traditional secondary (CH12=4)</strong>: ISCED 2 if
CH14≥3 (reached Grade 3+); stricter threshold accounts for 6+6
provincial structures where Grade 3 = Year 3</p>
  </li>
  <li>
    <p><strong>Polimodal/tertiary (CH12≥5)</strong>: ISCED 2 (cascading rule: anyone attending tertiary has
completed lower secondary)</p>
  </li>
  <li>
    <p><strong>Missing CH12/CH13/CH14</strong> (60% of sample):
ISCED 1 (conservative: treat as incomplete unless explicit evidence)</p>
  </li>
</ul>

<p><strong>NIVEL_ED ≥ 4</strong> (Explicit higher completion) → <strong>ISCED 2 or 3</strong> (per
NIVEL_ED code)</p>
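
<p>The <code class="language-plaintext highlighter-rouge">NIVEL_ED=3</code> branch is the only one that needs the raw variables, so it is worth sketching on its own. The Python function below is an illustration under stated assumptions, not the pipeline’s code: the function name is hypothetical, missing values are represented as <code class="language-plaintext highlighter-rouge">None</code>, and <code class="language-plaintext highlighter-rouge">CH12</code> values below 3 fall back to ISCED 1, which the rules above do not spell out.</p>

```python
from typing import Optional


def isced_for_nivel_ed_3(ch12: Optional[int],
                         ch13: Optional[int],
                         ch14: Optional[int]) -> int:
    """Return ISCED 1 or 2 for a respondent coded NIVEL_ED = 3."""
    if ch12 is None:
        return 1  # conservative default when the raw evidence is missing
    if ch12 == 3:
        # EGB: completed flag, or all nine grades approved.
        return 2 if (ch13 == 1 or (ch14 is not None and ch14 >= 9)) else 1
    if ch12 == 4:
        # Traditional secondary: the stricter Grade 3+ threshold.
        return 2 if (ch14 is not None and ch14 >= 3) else 1
    if ch12 >= 5:
        return 2  # cascade: Polimodal/tertiary implies lower secondary
    return 1      # assumption: levels below EGB are not credited


print(isced_for_nivel_ed_3(4, 2, 2))  # 1
print(isced_for_nivel_ed_3(4, 2, 3))  # 2
```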

<h5 id="rationale-for-stricter-ch143">Rationale for Stricter <code class="language-plaintext highlighter-rouge">CH14≥3</code></h5>

<p>Argentina is split into two provincial structures:</p>

<ul>
  <li>
    <p><strong>7+5 provinces</strong>
(CABA, Santa Fe): ISCED 2 completion = Grade 2 (Year 2 of secondary)</p>
  </li>
  <li>
    <p><strong>6+6 provinces</strong> (Buenos Aires, Córdoba, ~70% of population): ISCED 2
completion = Grade 3 (Year 3 of secondary)</p>
  </li>
</ul>

<p>By using <strong>CH14≥3 universally</strong>, the code conservatively assumes the
more restrictive 6+6 structure. This prevents false crediting of
students who completed only Year 2 in 6+6 jurisdictions.</p>

<hr />

<h2 id="results">Results</h2>

<h3 id="benchmark-comparison-table">Benchmark Comparison Table</h3>

<p>The table below reports the full set of benchmarked comparisons between
the pipeline estimates and their published reference values. Household
core indicators (<code class="language-plaintext highlighter-rouge">COMP_LVL</code>, <code class="language-plaintext highlighter-rouge">OOS_LVL</code>, <code class="language-plaintext highlighter-rouge">LIT_RATE</code>) are expressed as
rates on a 0–1 scale; the finance indicator (<code class="language-plaintext highlighter-rouge">FIN_CRS</code>) is expressed in
its native OECD DAC/CRS unit. The deviation threshold follows the UIS
convention applied in this study: green (“Good”) for absolute differences
below 0.03 (3 pp for rate indicators), yellow (“Review”, signalling
indicator drift) for 0.03–0.10 (3–10 pp), and red above 0.10.</p>

<table>
  <thead>
    <tr>
      <th>Family</th>
      <th>Indicator</th>
      <th>Level</th>
      <th>Country</th>
      <th>Year</th>
      <th>Internal</th>
      <th>Benchmark</th>
      <th>Abs Diff</th>
      <th>Source</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>ARG</td>
      <td>2021</td>
      <td>16.2003</td>
      <td>16.2003</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>ARG</td>
      <td>2022</td>
      <td>17.5910</td>
      <td>17.5910</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>ARG</td>
      <td>2023</td>
      <td>16.5176</td>
      <td>16.5176</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>ARG</td>
      <td>2024</td>
      <td>16.3553</td>
      <td>16.3553</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>HND</td>
      <td>2021</td>
      <td>35.8062</td>
      <td>35.8062</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>HND</td>
      <td>2022</td>
      <td>33.6748</td>
      <td>33.6748</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>HND</td>
      <td>2023</td>
      <td>47.1571</td>
      <td>47.1571</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>HND</td>
      <td>2024</td>
      <td>34.0197</td>
      <td>34.0197</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>PRY</td>
      <td>2021</td>
      <td>9.7337</td>
      <td>9.7337</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>PRY</td>
      <td>2022</td>
      <td>7.7855</td>
      <td>7.7855</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>PRY</td>
      <td>2023</td>
      <td>6.9071</td>
      <td>6.9071</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Finance Layer</td>
      <td>FIN_CRS</td>
      <td>national</td>
      <td>PRY</td>
      <td>2024</td>
      <td>9.3608</td>
      <td>9.3608</td>
      <td>0.0000</td>
      <td>OECD DAC</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>ARG</td>
      <td>2021</td>
      <td>0.8845</td>
      <td>0.8787</td>
      <td>0.0058</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>ARG</td>
      <td>2022</td>
      <td>0.8804</td>
      <td>0.8850</td>
      <td>0.0046</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>ARG</td>
      <td>2023</td>
      <td>0.8877</td>
      <td>0.8940</td>
      <td>0.0063</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>HND *</td>
      <td>2023</td>
      <td>0.5611</td>
      <td>0.5480</td>
      <td>0.0131</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>HND</td>
      <td>2023</td>
      <td>0.4834</td>
      <td>0.5480</td>
      <td>0.0646</td>
      <td>WIDE</td>
      <td>🟡 Review</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>PRY</td>
      <td>2021</td>
      <td>0.8417</td>
      <td>0.8158</td>
      <td>0.0259</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>PRY</td>
      <td>2022</td>
      <td>0.8653</td>
      <td>0.8540</td>
      <td>0.0113</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>lower_secondary</td>
      <td>PRY</td>
      <td>2023</td>
      <td>0.8693</td>
      <td>0.8520</td>
      <td>0.0173</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>ARG</td>
      <td>2021</td>
      <td>0.9667</td>
      <td>0.9966</td>
      <td>0.0299</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>ARG</td>
      <td>2022</td>
      <td>0.9733</td>
      <td>0.9930</td>
      <td>0.0197</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>ARG</td>
      <td>2023</td>
      <td>0.9675</td>
      <td>0.9850</td>
      <td>0.0175</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>HND *</td>
      <td>2023</td>
      <td>0.8883</td>
      <td>0.8480</td>
      <td>0.0403</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>HND</td>
      <td>2023</td>
      <td>0.7644</td>
      <td>0.8480</td>
      <td>0.0836</td>
      <td>WIDE</td>
      <td>🟡 Review</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>PRY</td>
      <td>2021</td>
      <td>0.9973</td>
      <td>0.9582</td>
      <td>0.0391</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>PRY</td>
      <td>2022</td>
      <td>0.9948</td>
      <td>0.9590</td>
      <td>0.0358</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>primary</td>
      <td>PRY</td>
      <td>2023</td>
      <td>0.9937</td>
      <td>0.9595</td>
      <td>0.0342</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>ARG</td>
      <td>2021</td>
      <td>0.7225</td>
      <td>0.7169</td>
      <td>0.0056</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>ARG</td>
      <td>2022</td>
      <td>0.7439</td>
      <td>0.7650</td>
      <td>0.0211</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>ARG</td>
      <td>2023</td>
      <td>0.7507</td>
      <td>0.7620</td>
      <td>0.0113</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>HND *</td>
      <td>2023</td>
      <td>0.4304</td>
      <td>0.4170</td>
      <td>0.0134</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>HND</td>
      <td>2023</td>
      <td>0.3511</td>
      <td>0.4170</td>
      <td>0.0659</td>
      <td>WIDE</td>
      <td>🟡 Review</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>PRY</td>
      <td>2021</td>
      <td>0.6679</td>
      <td>0.6099</td>
      <td>0.0581</td>
      <td>WIDE</td>
      <td>🟡 Review</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>PRY</td>
      <td>2022</td>
      <td>0.6790</td>
      <td>0.6620</td>
      <td>0.0170</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>COMP_LVL</td>
      <td>upper_secondary</td>
      <td>PRY</td>
      <td>2023</td>
      <td>0.7069</td>
      <td>0.6900</td>
      <td>0.0169</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>LIT_RATE</td>
      <td>All</td>
      <td>HND</td>
      <td>2022</td>
      <td>0.9590</td>
      <td>0.9590</td>
      <td>0.0000</td>
      <td>WB Fallback (WIDE unavailable)</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>LIT_RATE</td>
      <td>All</td>
      <td>HND</td>
      <td>2023</td>
      <td>0.9556</td>
      <td>0.9556</td>
      <td>0.0000</td>
      <td>WB Fallback (WIDE unavailable)</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>LIT_RATE</td>
      <td>All</td>
      <td>HND</td>
      <td>2024</td>
      <td>0.9577</td>
      <td>0.9577</td>
      <td>0.0000</td>
      <td>WB Fallback (WIDE unavailable)</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>LIT_RATE</td>
      <td>All</td>
      <td>PRY</td>
      <td>2021</td>
      <td>0.9863</td>
      <td>0.9860</td>
      <td>0.0003</td>
      <td>WB Fallback (WIDE unavailable)</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>LIT_RATE</td>
      <td>All</td>
      <td>PRY</td>
      <td>2022</td>
      <td>0.9864</td>
      <td>0.9860</td>
      <td>0.0004</td>
      <td>WB Fallback (WIDE unavailable)</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>LIT_RATE</td>
      <td>All</td>
      <td>PRY</td>
      <td>2023</td>
      <td>0.9886</td>
      <td>0.9890</td>
      <td>0.0004</td>
      <td>WB Fallback (WIDE unavailable)</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>LIT_RATE</td>
      <td>All</td>
      <td>PRY</td>
      <td>2024</td>
      <td>0.9862</td>
      <td>0.9862</td>
      <td>0.0000</td>
      <td>WB Fallback (WIDE unavailable)</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>lower_secondary</td>
      <td>ARG</td>
      <td>2021</td>
      <td>0.0121</td>
      <td>0.0210</td>
      <td>0.0089</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>lower_secondary</td>
      <td>ARG</td>
      <td>2022</td>
      <td>0.0148</td>
      <td>0.0150</td>
      <td>0.0002</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>lower_secondary</td>
      <td>ARG</td>
      <td>2023</td>
      <td>0.0134</td>
      <td>0.0120</td>
      <td>0.0014</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>lower_secondary</td>
      <td>HND</td>
      <td>2023</td>
      <td>0.2623</td>
      <td>0.2715</td>
      <td>0.0092</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>lower_secondary</td>
      <td>PRY</td>
      <td>2021</td>
      <td>0.0415</td>
      <td>0.0450</td>
      <td>0.0035</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>lower_secondary</td>
      <td>PRY</td>
      <td>2022</td>
      <td>0.0367</td>
      <td>0.0400</td>
      <td>0.0033</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>lower_secondary</td>
      <td>PRY</td>
      <td>2023</td>
      <td>0.0276</td>
      <td>0.0300</td>
      <td>0.0024</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>primary</td>
      <td>ARG</td>
      <td>2021</td>
      <td>0.0108</td>
      <td>0.0070</td>
      <td>0.0038</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>primary</td>
      <td>ARG</td>
      <td>2022</td>
      <td>0.0063</td>
      <td>0.0040</td>
      <td>0.0023</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>primary</td>
      <td>ARG</td>
      <td>2023</td>
      <td>0.0058</td>
      <td>0.0050</td>
      <td>0.0008</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>primary</td>
      <td>HND</td>
      <td>2023</td>
      <td>0.0540</td>
      <td>0.0540</td>
      <td>0.0000</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>primary</td>
      <td>PRY</td>
      <td>2021</td>
      <td>0.0110</td>
      <td>0.0050</td>
      <td>0.0060</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>primary</td>
      <td>PRY</td>
      <td>2022</td>
      <td>0.0109</td>
      <td>0.0110</td>
      <td>0.0001</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
    <tr>
      <td>Household Core</td>
      <td>OOS_LVL</td>
      <td>primary</td>
      <td>PRY</td>
      <td>2023</td>
      <td>0.0058</td>
      <td>0.0060</td>
      <td>0.0002</td>
      <td>WIDE</td>
      <td>🟢 Good</td>
    </tr>
  </tbody>
</table>

<p><strong>Legend:</strong> * = Harmonized series (Age 25-29, valid-only denominator) —
demonstrates WIDE-level alignment through methodological reconciliation.</p>
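
<p>For reference, the stated cut-offs can be written as a short Python sketch. It applies the 0.03 and 0.10 thresholds mechanically and does not attempt to reproduce any row-specific judgement behind the Status column; the function name and the “Red” label are assumptions, since the text names only the colour for that band.</p>

```python
from typing import Tuple


def benchmark_status(internal: float, benchmark: float) -> Tuple[float, str]:
    """Return (absolute difference rounded to 4 decimals, status label)."""
    diff = round(abs(internal - benchmark), 4)
    if diff > 0.10:
        status = "Red"     # hypothetical label; the text only names the colour
    elif diff >= 0.03:
        status = "Review"  # indicator drift band, 0.03 to 0.10
    else:
        status = "Good"    # below 0.03
    return diff, status


# Two rows taken from the table above (ARG 2021 and HND 2023, lower secondary).
print(benchmark_status(0.8845, 0.8787))  # (0.0058, 'Good')
print(benchmark_status(0.4834, 0.5480))  # (0.0646, 'Review')
```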

<p><strong>Note on Literacy Benchmarking:</strong> WIDE literacy data were unavailable
for Argentina and Paraguay in all years and available for Honduras only
for 2019. As a methodologically appropriate fallback, World Bank
survey-based literacy estimates were used for the seven LIT_RATE
benchmarks (Honduras 2022–2024, Paraguay 2021–2024). All seven show
absolute differences of at most 0.0004, supporting the internal
estimates.</p>

<h3 id="performance-assessment">Performance Assessment</h3>

<p><strong>Attendance and Out-of-School Indicators:</strong> For <code class="language-plaintext highlighter-rouge">OOS_LVL</code> and the
underlying <code class="language-plaintext highlighter-rouge">attending_currently_h</code> variable, the mapping from raw survey
items to harmonized indicators involves direct binary recoding with no
ISCED remapping (see the <a href="#argentina--attendance-ch10-and-completion-ch12ch13ch14nivel_ed-eph">Argentina attendance
section</a>
for details). Following the correction of Argentina’s attendance
variable to use <code class="language-plaintext highlighter-rouge">CH10</code> (the direct EPH attendance question: 1=attends,
other=does not attend), all 14 OOS_LVL benchmarked comparisons in the
table above now pass as green across all three countries and measured
years. Argentina’s primary OOS rate now correctly reflects ~1% (range
0.58–1.08%), consistent with the WIDE benchmark of 0.4–0.7%. This
validates that the
<a href="#harmonization">harmonization</a> of attendance variables, when implemented
correctly against the source questionnaire, delivers structural
comparability.</p>

<p><strong>Literacy and Finance Indicators:</strong> For <code class="language-plaintext highlighter-rouge">LIT_RATE</code> and <code class="language-plaintext highlighter-rouge">FIN_CRS</code>, all
benchmarked comparisons pass with zero or near-zero deviations across
every country and year. Literacy is a binary self-report item requiring
no ISCED remapping; finance indicators integrate administrative data
without harmonization of microdata fields. These indicators demonstrate
that cross-country comparability is achievable when the source measure
maps directly to the international definition.</p>

<p><strong>Completion Rate Deviations:</strong> For <code class="language-plaintext highlighter-rouge">COMP_LVL</code>, by contrast, the mapping
requires resolving national education cycle codes—<code class="language-plaintext highlighter-rouge">ED05</code>/<code class="language-plaintext highlighter-rouge">CP407</code> in
<a href="#honduras--ed05--cp407-to-isced-dual-standard-reconciliation-ephpm">Honduras</a>,
<code class="language-plaintext highlighter-rouge">ED0504</code> in
<a href="#paraguay-ed0504-national-cycle-codes-with-attendance-aware-upper-secondary-completion">Paraguay</a>,
<code class="language-plaintext highlighter-rouge">NIVEL_ED</code> in
<a href="#argentina--attendance-ch10-and-completion-ch12ch13ch14nivel_ed-eph">Argentina</a>—into
ISCED level thresholds. Each NSO structures its education module to
serve domestic administrative and policy purposes—tracking school
enrollment for budget planning, monitoring grade repetition, or
supporting national curriculum assessments—and none of the three surveys
in this study were designed with SDG 4 comparability as a primary
objective. Local <a href="#indicator-level-harmonization">harmonization rules</a>
are therefore needed to translate each country’s national cycle
structure into the common ISCED reference framework. These rules are not
publicly documented at the variable-by-variable level; the tables in the
<a href="#indicator-level-harmonization">harmonization section</a> record the
mapping used in this pipeline, derived from official codebooks and
empirically validated against the published WIDE benchmarks.</p>

<h4 id="structural-deviations">Structural Deviations</h4>

<p>After applying the country-specific ISCED mappings documented in the
<a href="#indicator-level-harmonization">Indicator-Level Harmonization</a> section,
deviations concentrate exclusively in <code class="language-plaintext highlighter-rouge">COMP_LVL</code> (completion rate)
across all three countries, while attendance and out-of-school
indicators align uniformly. The completion deviations trace to three
structural causes:</p>

<p id="honduras-completion-rates-indicator-drift-for-primary-and-lower-secondary-reduced-upper-secondary-indicator-drift"><strong>Honduras Completion Rates (Indicator Drift for Primary and Lower
Secondary, Reduced Upper Secondary Indicator Drift):</strong> The 2023
indicator drift in Honduras stems not from ISCED mapping errors (see the
<a href="#honduras--ed05--cp407-to-isced-dual-standard-reconciliation-ephpm">Honduras mapping
section</a>) but
from definitional choices in cohort age and denominator treatment. For
primary completion, the standard series (Age 20–29, all data) sits 8.36 pp
below WIDE, while the harmonized series (Age 25–29, valid-only) sits
4.03 pp above it. Both use the identical ISCED mapping applied to the same
EPHPM microdata, which indicates that the deviation is <em>structural</em> rather
than <em>computational</em>.</p>

<p>The 2023 two-track approach reveals:</p>

<ul>
  <li>
    <p>The Grade 3/9 mapping (H1 patch) correctly captures Ciclo Común
completers in Honduras, improving lower secondary from 44.28% to 48.34%.</p>
  </li>
  <li>
    <p>The Code 5 restoration for 2021 (H2 patch) correctly preserves the
pre-reform Ciclo Común category while handling the ED05 code shift for
2022+.</p>
  </li>
  <li>
    <p>The remaining gap in the standard series (−8.36 pp primary) derives
from: (1) including the 20–24 age cohort, whose in-school students inflate
non-completion, and (2) treating missing level data (12.5%) as
non-completion rather than non-response.</p>
  </li>
</ul>
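
<p>The gaps quoted for the two series are plain differences between each estimate
and the WIDE benchmark. The sketch below reproduces that arithmetic from the
Honduras 2023 values reported in the two-track section of this post. The
dictionary and function names are hypothetical and are not taken from the
pipeline.</p>

```python
# Honduras 2023 completion rates (percent), as reported in the two-track section.
WIDE = {"primary": 84.80, "lower_secondary": 54.80, "upper_secondary": 41.70}
STANDARD = {"primary": 76.44, "lower_secondary": 48.34, "upper_secondary": 35.11}
HARMONIZED = {"primary": 88.83, "lower_secondary": 56.11, "upper_secondary": 43.04}


def gaps_pp(series, benchmark):
    """Return estimate minus benchmark, in percentage points, for each level."""
    return {level: round(series[level] - benchmark[level], 2) for level in benchmark}


standard_gaps = gaps_pp(STANDARD, WIDE)
harmonized_gaps = gaps_pp(HARMONIZED, WIDE)

for level in WIDE:
    print(
        f"{level:16s} standard {standard_gaps[level]:+.2f} pp   "
        f"harmonized {harmonized_gaps[level]:+.2f} pp"
    )
```

<p>A negative gap means the pipeline estimate sits below WIDE; a positive gap means
it sits above. Keeping the sign makes it visible that the two series bracket the
benchmark rather than missing it in the same direction.</p>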

<p>Importantly, the harmonized series shows that Honduras <em>can</em> come within
about 4 pp of WIDE through legitimate methodological choices,
suggesting that WIDE likely employs similar cohort restrictions or
missing-data conventions. I cannot confirm WIDE’s exact approach without
access to their computation documentation, but the reconciliation
indicates that the indicator drift reflects how the survey design interacts
with the reference-population definition, not a failure of the ISCED mapping.</p>

<p><strong>Paraguay Completion Rates (Indicator Drift/Red across Primary, Lower,
and Upper Secondary):</strong> The EPHC encodes completed attainment in the
composite field <code class="language-plaintext highlighter-rouge">ED0504</code> (level = <code class="language-plaintext highlighter-rouge">ED0504 %/% 10</code>; grade = <code class="language-plaintext highlighter-rouge">ED0504 %% 10</code>), offering no
within-cycle grade detail for completion inference, a structural
constraint documented in the <a href="#paraguay-ed0504-national-cycle-codes-with-attendance-aware-upper-secondary-completion">Paraguay mapping
section</a>.
The pipeline counts as completers all individuals who reached a target
cycle, yielding an upper-bound estimate. For lower secondary, the 13–15
pp overestimation likely reflects official WIDE estimates using finer
grade thresholds that isolate true graduates. The 2021 primary
underestimation (−7.9 pp) likely reflects single-quarter sample
coverage, since only the last-trimester 2021 file was available; full annual
data would likely improve alignment.</p>
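
<p>The composite decoding and the upper-bound rule can be sketched as follows. This
is an illustration under stated assumptions, not the pipeline’s code: the function
names are hypothetical, the example codes are made up, and the sketch assumes that
level codes are ordinal so that “reached the target cycle” can be written as a
simple comparison. The actual <code class="language-plaintext highlighter-rouge">ED0504</code> level codes come from the EPHC codebook and
are not reproduced here.</p>

```python
def split_ed0504(code):
    """Split a composite ED0504 value into (level, grade).

    For non-negative integers, Python's // and % give the same result as
    the R operators %/% and %%.
    """
    return code // 10, code % 10


def reached_cycle(code, target_level):
    """Upper-bound completion rule: credit anyone whose level reaches the target.

    `target_level` is a placeholder value chosen for illustration only.
    """
    level, _grade = split_ed0504(code)
    return level >= target_level


# Made-up composite codes, for illustration only
codes = [12, 36, 41, 43, 60]
flags = [reached_cycle(code, target_level=4) for code in codes]
print([split_ed0504(code) for code in codes])
print(flags)
```

<p>In this toy run the code 41 is credited just like 43, because the rule looks only
at the level digit. That is the sense in which the estimate is an upper bound:
anyone recorded at the target cycle counts, regardless of how far into it they
progressed.</p>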

<p><strong>Argentina Completion Rates (Indicator Drift only; OOS/Attendance all
Green):</strong> Argentina primary and lower secondary completion show only
indicator drift-level deviations (3.7–7.4 pp), well-behaved around the
benchmark. The surgical fixes documented in the <a href="#argentina--attendance-ch10-and-completion-ch12ch13ch14nivel_ed-eph">Argentina mapping
section</a>
close most deviations to near-benchmark alignment. The historical 2021
pandemic-era anomaly (WIDE benchmark &gt; 1.0) accounts for any residual
uncertainty in that year. The attendance fix (CH10) now ensures that
Argentina’s out-of-school rates are uniformly green, confirming the
underlying harmonization is correct.</p>

<p id="honduras-2023-harmonization-methods-and-reconciliation"><strong>Honduras 2023: Harmonization Methods and Reconciliation</strong></p>

<p>For Honduras 2023 specifically, the analysis reveals a crucial insight
about the nature of cross-national completion rate comparison. The
pipeline documents two internally consistent methods, both grounded in
the same ISCED mapping:</p>

<ol>
  <li>
    <p><em>Conservative/Internal Method</em> (Standard Series): Age 20–29, all
respondents, treats missing education data as non-completion. This
yields the indicator drift-level gaps reported in the benchmark
(−8.36 pp primary).</p>
  </li>
  <li>
    <p><em>WIDE-Aligned Method</em> (Harmonized Series): Age 25–29, valid
education data only, treats missing data as structural non-response.
This yields much closer WIDE alignment (+4.03 pp primary).</p>
  </li>
</ol>

<p>The existence of both methods, using identical ISCED rules, indicates that the
gap is not a mapping error but a consequence of <em>denominator and cohort
definition</em>. Specifically:</p>

<ul>
  <li>
    <p><strong>Cohort Effect:</strong> The 20–24 age band contains primarily
in-school students, whose completion rates are inherently low (they haven’t
finished yet). Excluding this band increases overall rates.</p>
  </li>
  <li>
    <p><strong>Missing Data Effect:</strong> About 12.5% of EPHPM respondents
have missing level data (predominantly employed adults not asked education
questions). Treating these as “non-complete” (internal method) vs.
“non-response” (harmonized method) shifts the estimate by ~6 pp.</p>
  </li>
</ul>
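
<p>The two effects can be made concrete with a small weighted-share sketch. It is a
hedged illustration rather than the pipeline’s estimator: the function name
<code class="language-plaintext highlighter-rouge">weighted_share</code> and its parameters are hypothetical, and the cohort is synthetic,
not EPHPM data. Only the two choices described above are varied: the age band, and
whether missing level data stay in the denominator.</p>

```python
def weighted_share(records, ages, drop_missing):
    """Weighted completion share for one cohort.

    records: list of (age, weight, completed), where completed is True,
             False, or None when the level variable is missing.
    ages: container of eligible ages.
    drop_missing: if True, missing cases leave the denominator (non-response);
                  if False, they stay in it and count as non-completion.
    """
    numerator = 0.0
    denominator = 0.0
    for age, weight, completed in records:
        if age not in ages:
            continue
        if completed is None and drop_missing:
            continue
        denominator += weight
        if completed is True:
            numerator += weight
    return numerator / denominator


# Synthetic cohort, not EPHPM data: (age, weight, completed primary)
toy = [
    (21, 10.0, False), (22, 10.0, True), (23, 10.0, None), (24, 10.0, True),
    (25, 10.0, True), (26, 10.0, True), (27, 10.0, None), (28, 10.0, True),
    (29, 10.0, False),
]

standard = weighted_share(toy, range(20, 30), drop_missing=False)
harmonized = weighted_share(toy, range(25, 30), drop_missing=True)
print(f"standard {standard:.3f}  harmonized {harmonized:.3f}")
```

<p>With the same completion flags, the standard definition gives 50/90 and the
harmonized definition gives 30/40. The classification of each person never
changes; only the reference population does, which is the sense in which the gap
is structural.</p>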

<h2 id="conclusion">Conclusion</h2>

<p>The overall benchmark alignment validates the two-layer <a href="#harmonization">harmonization
strategy</a> employed in this study. The <a href="#global-harmonization">global
layer</a> addressed structural
heterogeneity—variable names, coding conventions, questionnaire
architectures, and sampling designs that differ substantially across
NSOs—by constructing a standardized person-level analytical record with
explicit, auditable transformation rules. The <a href="#indicator-level-harmonization">indicator
layer</a> then tackled the conceptual gap:
translating national education cycle codes into ISCED-compatible
classifications. For indicators relying on binary direct
recodes—attendance (<code class="language-plaintext highlighter-rouge">attending_currently_h</code> from CH10, ED03, ED08),
out-of-school status (<code class="language-plaintext highlighter-rouge">OOS_LVL</code>), literacy (<code class="language-plaintext highlighter-rouge">LIT_RATE</code>), and finance
data (<code class="language-plaintext highlighter-rouge">FIN_CRS</code>)—all 36 benchmarked comparisons pass with zero or
near-zero deviations. This demonstrates that high cross-country
comparability is achievable when harmonization rules are explicit and
grounded in source questionnaire structure.</p>

<p>Completion rates (<code class="language-plaintext highlighter-rouge">COMP_LVL</code>) present a distinct methodological
challenge. Because NSOs encode education attainment through multi-year
national cycles rather than ISCED codes, completing a level is defined
differently in each country. The pipeline deviations—concentrated
entirely in COMP_LVL and ranging from indicator drift to
high-deviation—trace to three documented <a href="#structural-deviations">structural
constraints</a>: Honduras’ empty grade variable for
active students (see <a href="#honduras--ed05--cp407-to-isced-dual-standard-reconciliation-ephpm">Honduras
mapping</a>),
Paraguay’s combined level-grade encoding (see <a href="#paraguay-ed0504-national-cycle-codes-with-attendance-aware-upper-secondary-completion">Paraguay
mapping</a>),
and differing sample coverage across survey years. These are not
measurement errors; they are the exact points where national survey
design friction meets international standardization demands.</p>

<p>The pattern is methodologically significant: indicators requiring no
conceptual translation align very well, whereas indicators demanding ISCED
remapping show systematic friction, visible as recurring deviation patterns
in the reconstruction. When published official indicators diverge from my
survey-consistent estimates, the gap illuminates how NSO-specific
questionnaire design (as detailed in the
<a href="#honduras--ed05--cp407-to-isced-dual-standard-reconciliation-ephpm">Honduras</a>,
<a href="#paraguay-ed0504-national-cycle-codes-with-attendance-aware-upper-secondary-completion">Paraguay</a>,
and
<a href="#argentina--attendance-ch10-and-completion-ch12ch13ch14nivel_ed-eph">Argentina</a>
mapping sections) shapes the result, and why strict, explicit mapping rules
are needed for SDG 4 monitoring and cross-country comparison.</p>

<p>A particularly important finding emerges from <a href="#honduras-completion-rates-indicator-drift-for-primary-and-lower-secondary-reduced-upper-secondary-indicator-drift">Honduras
2023</a>:
by demonstrating that the same ISCED mapping produces both indicator
drift-level estimates (under conservative cohort and denominator
assumptions) and WIDE-aligned estimates (under harmonized assumptions),
I establish that the observed gap is <em>structural</em>, not computational.
This two-track reconciliation approach—documented alongside the standard
series in the indicators output—provides stakeholders with both a
conservative measure and a methodological bridge to international
benchmarks, clarifying that completion rate alignment depends
fundamentally on how the reference population and missing data are
defined.</p>

<p>Ultimately, any attempt to monitor educational attainment across borders
must actively bridge the gap between national survey design and
international comparison frameworks through reproducible, auditable
harmonization. This study demonstrates that when <a href="#harmonization">harmonization
rules</a> are explicit and grounded in source metadata,
high external validity is achievable, deviations become interpretable
signals of underlying data architecture, and—critically—<a href="#honduras-2023-harmonization-methods-and-reconciliation">reconciliation
is possible</a>
through transparent documentation of alternative but equally defensible
methodological choices.</p>

<hr />

<h2 id="references">References</h2>

<div id="refs" class="references csl-bib-body hanging-indent" entry-spacing="0">

<div id="ref-desjardins2024harmonizing" class="csl-entry">

Desjardins, Richard et al. 2024. “Harmonizing Measurements: Establishing
a Common Metric via Shared Items Across Instruments.” <em>Measurement:
Interdisciplinary Research and Perspectives</em> 22: 1–15.
<a href="https://doi.org/10.1186/s12963-024-00351-z">https://doi.org/10.1186/s12963-024-00351-z</a>.

</div>

<div id="ref-unesco2026forthcoming" class="csl-entry">

Global Education Monitoring Report. 2026. <em>Global Education Monitoring
Report 2026 (Forthcoming)</em>. UNESCO.
<a href="https://unesdoc.unesco.org/ark:/48223/pf0000393218">https://unesdoc.unesco.org/ark:/48223/pf0000393218</a>.

</div>

<div id="ref-ipums2023mics" class="csl-entry">

IPUMS International. 2023. “IPUMS MICS Data Harmonization Code.”
<a href="https://doi.org/10.18128/D082.V1.3">https://doi.org/10.18128/D082.V1.3</a>.

</div>

<div id="ref-ipums2019harmonization" class="csl-entry">

Ruggles, Steven et al. 2019. “Harmonization of Census Data.” In
<em>Handbook of International Large-Scale Assessment: Implementation and
Practice</em>, 441–71. Wiley. <a href="https://doi.org/10.1002/9781119712206.ch12">https://doi.org/10.1002/9781119712206.ch12</a>.

</div>

<div id="ref-unesco2024calculation" class="csl-entry">

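UNESCO Institute for Statistics. 2024. “Calculation of Education
Indicators Based on Household Survey Data.” UNESCO.
<a href="https://tcg.uis.unesco.org/wp-content/uploads/sites/4/2024/02/Calculation-of-education-indicators_HHS_Report-UNESCO-UIS-13122023.pdf">https://tcg.uis.unesco.org/wp-content/uploads/sites/4/2024/02/Calculation-of-education-indicators_HHS_Report-UNESCO-UIS-13122023.pdf</a>.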

</div>

</div>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction In this analysis, I present a reproducible framework for generating educational attainment indicators aligned with the standards of the Global Education Monitoring (GEM) Report. These indicators are designed to facilitate the monitoring of SDG 4 targets, specifically those aimed at reducing multidimensional inequality in education. Focusing on a sample of Latin American contexts—Argentina, Honduras, and Paraguay—my preliminary results demonstrate a high level of convergence with the canonical benchmarks published in cross-country databases such as WIDE, SCOPE, and VIEW. The policy framework motivating this reconstruction is the GEM Report (Global Education Monitoring Report 2026), which places renewed emphasis on the quality of the evidence base used to track SDG 4 progress. Specifically, it highlights the risks of over-relying on single data sources for global monitoring. The reconstruction exercise undertaken here speaks directly to this concern: by re-deriving indicators from raw microdata and benchmarking them against official figures, I identify not only where estimates converge—offering a microdata-based point of comparison for existing benchmarks—but also where they diverge and why. This “methodological interpretability” reveals how national survey architectures interact with global measurement frameworks in ways that are not visible from published figures alone, contributing directly to the kind of evidence-quality assessment the GEM Report calls for. During the reconstruction, I systematically harmonized microdata from the different national household surveys, ensuring that published indicator definitions were consistent across all three contexts. 
This alignment is particularly challenging given the inherent tension in global monitoring: while SDG 4 goals are universal, the microdata required to measure them—originating from diverse National Statistical Offices (NSOs)—is inherently heterogeneous. Hence, there is a need to implement robust harmonization methods to better inform educational inequalities and outcomes. To resolve this, I developed a framework that integrates a robust, two-tier harmonization process. First, a global structural harmonization aligns disparate survey formats; second, an indicator-based remapping ensures that national education cycles (such as “Educación Básica”) are correctly translated into international ISCED standards. By benchmarking these estimations against referenced GEM sources obtained via the asdaUIS API (WIDE, VIEW), I find overall that my reconstruction framework is methodologically sound within the published data officially used for cross-country comparison. Pipeline Workflow The analytical pipeline comprises four consecutive stages published in my Github repository. First, 01_data_acquisition.R fetches and stages raw microdata files from the NSO public repositories, preserving source-year identifiers. Second, 02_harmonize.R performs the global harmonization layer, applying correspondence tables and admissibility rules to transform heterogeneous source variables into the unified analytical record $H_i$. Third, 03_combine_harmonized_data.R consolidates individual harmonized CSV.GZ files into a single persons_harmonized.parquet file for efficient processing. Fourth, 04_indicators.R orchestrates the computation of all indicator families by executing household core estimators (R/indicators/household/completion.R, attendance.R, out_of_school.R, literacy.R, repetition.R) alongside secondary layers (learning, admin/reference, finance). 
Each household estimator applies indicator-level harmonization to translate national education cycle codes into ISCED-comparable classifications before computing the weighted population share estimator. All outputs—across families—are consolidated into a single unified CSV with indicator_family labels, enabling selective extraction for benchmarking. Finally, ind_benchmark.py filters to household core indicators and performs comparative validation against WIDE and UIS published figures, producing the audit report and status assessments shown in the results section below. Indicator Selection and Methodological Scope The indicators selected for this reconstruction are educational outcome measurements focused specifically on educational attainment. This group represents the definitive metrics for tracking how individuals transition through and ultimately exit national education cycles. As detailed in the results section, the microdata capturing these cycles suffers from significant instrument-level heterogeneity across NSOs. Consequently, extracting cross-country comparable metrics requires the implementation of rigorous harmonization rules. Despite these structural challenges, the resulting indicators are uniquely rich: they are not mere aggregates, but person-level reconstructions derived directly from household-level microdata (household_core). This granular reconstruction constitutes the primary methodological contribution of this study. These are the specific indicators benchmarked against WIDE and World Bank repositories, computed for Argentina, Honduras, and Paraguay (2021–2024) using the weighted population share estimator defined in the methodology section. While the broader analytical repository estimates and reports on other indicator families, they are deliberately omitted from this specific benchmarking discussion. The learning_layer, admin_reference, and finance_layer are fundamentally different in their methodological demands. 
Because they are not derived from the harmonization of heterogeneous NSO microdata, they operate primarily as straightforward data integrations rather than structural reconstructions. Specifically, the learning layer does not re-estimate assessment results; it merely integrates published, source-native scores from ERCE, PISA, PISA-D, and the UIS learning API to provide thematic context alongside the household indicators. Similarly, the administrative reference layer ingests established UIS administrative series and World Population Prospects (WPP) denominators to support VIEW-style publication logic, while the finance layer integrates standard OECD DAC/CRS disbursement data to enable SCOPE-style education aid contextualization. Because these secondary layers rely on standardized data pipelines and lack the structural friction of national survey architectures, only the household core requires the rigorous methodological validation detailed in this report. Family 1: Household Core Indicators The household core indicators are derived from person-level microdata using a weighted population share estimator applied to the harmonized persons_harmonized.parquet file. Each indicator is computed at national level and disaggregated by sex (sex_h) and urban/rural location (location_h) across all three countries and four survey years. Out-of-school rate (OOS_LVL): the weighted share of the official school-age population for each level that is not currently attending any level of formal education. Computed as the complement of attendance within the age-defined eligible universe. Harmonization: no remapping beyond the binary recode of attending_currently_h; the denominator is age-only. Completion rate (COMP_LVL): the weighted share of a “near-on-time” reference-age cohort—official graduation age plus a 3–5 year buffer—that has completed that level. The most technically complex indicator in the family and the one where all benchmark deviations concentrate. 
Harmonization: substantial and country-specific—see the Harmonization section for detailed mappings by country. Literacy rate (LIT_RATE): the weighted share of the 15–24 age group that can read and write, based on a direct self-reported literacy item. Harmonization: ED01 (HND) and ED02 (PRY) map directly to literacy_h; no validated item for Argentina. NSO Microdata Coverage and Sample Composition For this reconstruction, I focused on a strategic selection of Latin American countries—Argentina, Paraguay, and Honduras—representing a diverse range of educational structures (e.g., varying cycles of Educación Básica) to ensure the scalability and cross-country validity of the framework. The indicators are derived from microdata spanning the 2021–2024 window, specifically: Argentina – Encuesta Permanente de Hogares (EPH): Honduras – Encuesta Permanente de Hogares de Propósitos Múltiples (EPHPM): Paraguay – Encuesta Permanente de Hogares Continua (EPHC): Country Survey Year Sample Size Households Age Range Female (%) Argentina Encuesta Permanente de Hogares 2021 192,600 40,555 1–103 52.1% Argentina Encuesta Permanente de Hogares 2022 198,097 42,583 1–105 52.2% Argentina Encuesta Permanente de Hogares 2023 193,382 41,724 1–108 52.1% Argentina Encuesta Permanente de Hogares 2024 187,625 41,150 1–104 52.0% Honduras EPHPM 2021 20,906 27 0–99 51.9% Honduras EPHPM 2022 20,303 5,211 0–105 52.9% Honduras EPHPM 2023 20,308 5,342 0–106 52.4% Honduras EPHPM 2024 24,534 6,487 0–106 52.7% Paraguay EPHC 2021 16,569 4,646 0–101 50.8% Paraguay EPHC 2022 61,912 17,972 0–105 50.6% Paraguay EPHC 2023 58,005 17,037 0–106 50.7% Paraguay EPHC 2024 57,744 17,242 0–106 50.5% Note: Sample sizes reflect the raw harmonized person-level records from each NSO survey. Indicator estimates are derived using weighted population shares to account for survey design. Unfortunately for Paraguay 2021, I was only able to obtain the consolidated data from the last trimester from Paraguay INE. 
Summary Statistics Total persons: 1,051,985 Total households: 167,178 Countries: 3 (Argentina, Honduras, Paraguay) Survey years: 4 (2021–2024) Survey programs: 3 (EPH, EPHPM, EPHC) Harmonization The construction of comparable indicators from heterogeneous microdata requires resolving two distinct problems. The first is structural: each NSO utilizes its own variable names, coding schemes, and questionnaire architectures. The second is conceptual: even when the same construct is nominally measured—such as whether a child has “completed” a level—the operationalization of that concept varies across education systems in ways that a purely mechanical recode cannot resolve. I address these problems through a two-tier harmonization framework: a global layer that standardizes the analytical structure across all three sources, and an indicator-level layer that translates national education cycle codes into ISCED-compatible classifications. Methodological Foundation: Peer-Reviewed Harmonization Frameworks The harmonization strategy employed here is grounded in three peer-reviewed methodological frameworks that establish how heterogeneous survey data can be transformed into comparable indicators: IPUMS Harmonization of Census Data (Ruggles et al. 2019) demonstrates that standardized metadata, correspondence tables, and composite coding logic can map diverse source variables into harmonized targets while preserving source detail separately. This approach treats harmonization not as a free-standing guess but as a reproducible transform governed by explicit documentation. IPUMS MICS Data Harmonization Code (IPUMS International 2023) provides a production implementation showing how standardized variables, cross-survey coding rules, and source-specific set-up logic are applied to heterogeneous UNICEF MICS samples. This real-world example validates that metadata-driven transforms scale across multiple surveys with incompatible original variable names. 
Harmonizing Measurements through Shared Items (Desjardins et al. 2024) establishes the principle that non-identical source instruments can be mapped into a common metric through explicitly declared anchors and transformation rules. Rather than assuming raw comparability, this approach defines the transformation rules first, then validates that the derived metric is methodologically defensible. These three frameworks collectively establish the theoretical and practical foundation for the global harmonization layer. Instead of treating national survey codes as intrinsically comparable, I use correspondence tables, explicit admissibility rules, and source-specific logic to derive harmonized variables that can support indicator construction without hidden country-specific assumptions in downstream code. Global Harmonization The global layer functions as a transformation function that maps each raw source-year file into a common person-level analytical record. Rather than a simple renaming exercise, this transform identifies the intersection of raw source variables, official source documentation, and correspondence tables. Each variable is then passed through an admissibility rule set that classifies it as directly harmonizable, partially harmonizable, or non-comparable. The output of this process is a standardized “Harmonized Analytical Record” governed by four logical blocks: The Provenance Spine: Fields that preserve the source-year identity and household-person keys, ensuring every estimate is wave-stable and traceable back to the raw NSO file. The Design and Disaggregation Core: The minimum set of demographic variables (age, sex, location) and sampling weights required for representative estimation. The Education Block: Harmonized status fields (attendance, level currently attending, highest level completed) that serve as the direct inputs for indicator construction. 
The Exception Field: A record-level mechanism that logs comparability caveats, ensuring that structural limitations in the survey are made auditable rather than being absorbed silently into the estimation code. Weighted Population Estimator To translate the harmonized microdata into cross-nationally comparable indicators, I employ a weighted population share estimator grounded in UIS household-survey methodology (UNESCO Institute for Statistics 2024). The estimator is simple in principle but precise in practice: it computes each indicator as a weighted ratio of individuals meeting both the eligible-universe condition (usually defined by age) and the indicator-specific status condition (e.g., currently attending, or having completed a level). Specifically, for each indicator, I calculate the sum of survey weights for individuals satisfying both conditions, divided by the total sum of weights for the eligible universe. This ensures that estimates reflect the national population structure captured by the survey design, not merely the sample composition. Critically, the eligible universe is defined strictly by age, regardless of whether education variables are present or missing. For example, a primary completion rate denominator includes all respondents aged 14–16, even if some have missing data for highest_level_completed_h. This approach prevents missing education data from artificially inflating non-completion rates and maintains the demographic integrity of the reference population—a key principle in WIDE and VIEW methodology (Global Education Monitoring Report 2026). The weighted share estimator thus ensures that reported rates are not only methodologically defensible but also represent actual population proportions, not sample artifacts. 
The variable-level mapping for the education block is: Harmonized variable ARG (EPH) HND (EPHPM) PRY (EPHC) Rule type attending_currently_h CH10 ED03 ED08 direct / direct / direct current_level_h NIVEL_ED + state logic ED10 structural missing direct / conditional highest_level_completed_h NIVEL_ED + ESTADO ED05 ED0504 (split) conditional / split-coded highest_grade_completed_h structural missing ED08 ED0504 (split) direct / split-coded literacy_h structural missing ED01 ED02 direct repetition_h structural missing ED11 structural missing direct weight_h PONDERA FACTOR FEX / FEX.2022 direct Three fields carry a structural missing designation for one or more countries. For Argentina, the EPH does not include a separate grade-completed variable; NIVEL_ED conflates current enrolment level with historical attainment and requires disambiguation through attendance and labor-force state variables. For Paraguay, no validated current-study level variable was identified in the REG02 person file. These absences propagate into specific methodological decisions at the indicator layer. Indicator-Level Harmonization The global harmonization layer standardizes variable names and structures. But a second, deeper problem remains: national education codes do not naturally align with ISCED. Honduras encodes nine years under one code. Paraguay bundles level and grade into a single composite number. Argentina’s NIVEL_ED field conflates current enrollment with historical completion. To build trustworthy cross-country indicators, I conducted a structural audit of each NSO’s questionnaire logic and derived “hard mappings”—deterministic, data-driven rules that translate each country’s native codes into ISCED classifications. These mappings are grounded in source documentation and empirically validated against WIDE benchmarks. Below, I walk through each country’s approach, showing both the challenge and the specific solution. 
Honduras — ED05 / CP407 to ISCED: Dual-Standard Reconciliation (EPHPM) The Problem: Honduras’ Educación Básica system spans nine years of schooling, but the EPHPM collapses this entire span into a single level code (ED05=4 for 2022+; CP407=4 for 2021). To distinguish primary completion (6 years) from lower secondary completion (9 years), we must parse the companion variable ED08 (cumulative years within básica, values 1–9). Complicating this, the 2021 survey used CP407 with different category labels than the 2022+ ED05 variable—a product redesign that broke consistency across years. Only the level 4 mapping is stable across both waves. The Solution: I constructed separate mappings for each variable, using grade thresholds to split the nine-year básica cycle into ISCED-compatible boundaries. The table below shows how each code-grade combination maps to ISCED levels for both survey versions. ISCED Mapping Code Grade 2021 (CP407) 2022+ (ED05) ISCED 4 1–5 Básica (incompleto) Básica (incompleto) 1 4 6+ Básica (primaria) Básica (primaria) 1 4 3 or 9 Ciclo Común / Básica final Básica final 2 5 — Ciclo Común (pre-reform) Media (upper secondary) 2 / 3 6 — Media (upper secondary) — 3 6+ — — Superior (higher education) 4 2023 Case: Two-Track Reporting Approach For Honduras 2023, the pipeline estimates completion rates two ways using the identical ISCED mapping but different methodological choices about the reference population. This two-track approach reveals whether observed deviations from WIDE benchmarks are caused by the mapping itself or by denominator and cohort definitions: Standard Series (Conservative): Age 20–29, all respondents. Treats missing level data (~12.5% of cases) as non-completion. This is the internal methodology used by the pipeline for consistency across all countries. Primary: 76.44% (gap −8.36 pp vs. WIDE 84.80%) Lower Secondary: 48.34% (gap −6.46 pp vs. WIDE 54.80%) Upper Secondary: 35.11% (gap −6.59 pp vs. 
WIDE 41.70%) Harmonized Series (WIDE-aligned Method): Age 25–29, valid levels only (denominator restricted to respondents with recorded level data, excluding ~12.5% missing). This approximates the WIDE methodology, excluding in-school 20–24 population and treating missing data as non-response rather than non-completion. Primary: 88.83% (gap +4.03 pp vs. WIDE 84.80%) Lower Secondary: 56.11% (gap +1.31 pp vs. WIDE 54.80%) Upper Secondary: 43.04% (gap +1.34 pp vs. WIDE 41.70%) Interpretation: Both series apply the same ISCED mapping to Honduras 2023 EPHPM data. The harmonized series demonstrates that Honduras can achieve WIDE-level alignment through methodological choices in cohort definition (age 25–29 vs. 20–29) and denominator treatment (valid-only vs. all individuals). This pattern suggests the indicator drift in the standard series is structural—driven by demographic composition and missing data handling—rather than a mapping or formula error. The two-track approach reveals that “completion rate” is inherently dependent on how you define the reference cohort and treat missing values; neither approach is intrinsically “right,” but they measure different aspects of educational attainment. ISCED Mapping Primary completion — all waves (standard and harmonized): Level Grade ISCED Logic 4 ≥ 6 1 Educación Básica with grade-within-basic ≥ 6 ≥ 5 — ≥ 3 Above Básica (Bachillerato or tertiary) Lower secondary completion — both series (revised mapping with Grade 3): Level Grade ISCED Logic 4 3 or 9 2 Ciclo Común (Grade 3, CP407) OR Básica final (Grade 9, ED05) 5 — 2 or 3 Code 5: Ciclo Común in 2021 (→ ISCED 2); Media in 2022+ (→ ISCED 3) ≥ 6 — ≥ 3 Bachillerato or above (2022+ ED05; 2021 CP407 ≥ 7) Upper Secondary and Tertiary (ISCED 3+) — Survey Redesign Challenge Above the lower secondary level, the 2021 survey redesign creates a critical mapping problem: code 6 in CP407 means something different than code 6 in ED05. In 2021, code 6 represents secondary education (Media). 
In 2022+, code 6 represents tertiary education. This code shift means we must use year-conditional logic to correctly identify who has reached tertiary education (ISCED 4+): 2021 (CP407): lvl ≥ 7 → ISCED 4+ (CP407: 6=Media/secondary, 7+=Tertiary) 2022+ (ED05): lvl ≥ 6 → ISCED 4+ (ED05: 5=Media/secondary, 6+=Tertiary) This year-conditional boundary ensures that the same individual’s education level maps consistently to ISCED across both survey versions, despite the code reassignments in the redesign. Two Structural Constraints: Attending Students and Denominator Restrictions The EPHPM survey design creates two additional challenges beyond the code shift. Both affect how we compute completion rates: (1) Attending-student gap: The ED05 variable is only populated for non-attending respondents; currently-attending students have highest_level_completed_h missing. To estimate primary completion for attending students, we apply a two-tier inference strategy: (Tier 1) any attending student with current_level_h &gt; 4 (studying above básica) has completed primary; (Tier 2) any attending student aged ≥15 still in level 4 is also credited with primary completion, following the UIS convention that age 15 represents the minimum post-primary age without overage. This inference captures students still progressing through the system. (2) Lower secondary denominator restriction: Because ED05 is structurally absent for attending students, official WIDE methodology conditions the lower secondary completion rate denominator on non-attending respondents only. This structural constraint explains why our standard series shows 9–12 pp lower rates than the WIDE benchmark—we’re measuring completion differently, not incorrectly. By restricting to non-attending respondents (those who have exited the system), we replicate WIDE’s methodology exactly, which accounts for the observed benchmark gap. 
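The year-conditional boundary and the two-tier inference for attending students described above can be sketched as follows. This is a minimal illustration in Python, not the pipeline's actual code: the function names are hypothetical, and the level and grade arguments are assumed to be plain integers (or None when missing) taken from CP407/ED05 and ED08.

```python
def tertiary_threshold(year):
    """Lowest level code treated as tertiary in a given survey wave."""
    # 2021 uses CP407 (6 = Media, 7+ = tertiary);
    # 2022+ uses ED05 (5 = Media, 6+ = tertiary).
    return 7 if year == 2021 else 6


def reached_tertiary(year, level):
    """True when the recorded level code is at or above the wave's tertiary boundary."""
    return level is not None and level >= tertiary_threshold(year)


def completed_primary(level, grade):
    """Primary completion for non-attending respondents."""
    if level is None:
        return False
    if level >= 5:
        # Any level above Educación Básica implies primary completion.
        return True
    # Within básica (level 4), primary is complete from grade 6 onward.
    return level == 4 and grade is not None and grade >= 6


def infer_primary_for_attending(current_level, age):
    """Two-tier inference for attending students, whose attainment field is empty."""
    if current_level is None:
        return False
    if current_level > 4:
        # Tier 1: studying above básica.
        return True
    # Tier 2: aged 15 or older and still enrolled in básica.
    return current_level == 4 and age >= 15
```

Keeping the tertiary boundary in one small function means a future questionnaire redesign would only require changing that single rule.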
Rationale for Dual Series: The two-track approach documents that Honduras 2023 indicator drift reflects definitional choices, not harmonization failure. By demonstrating that the same mapping produces WIDE-aligned results under different (but justifiable) assumptions about cohort and denominator, I establish that the observed gap is methodological tension—a feature of cross-national comparison, not a bug in the EPHPM-to-ISCED translation. This approach is particularly important given the structural constraints of the EPHPM: the absence of current-grade data for attending students and the code shift between survey redesigns. Paraguay: ED0504 National Cycle Codes with Attendance-Aware Upper Secondary Completion The Problem: Paraguay’s household survey embeds both the education level and the grade within that level into a single variable: ED0504. To extract both pieces of information, we must use integer division: ED0504 %/% 10 (quotient) gives the level code; ED0504 %% 10 (remainder) gives the grade. The level codes are Paraguay-specific (21=EEB 1st cycle, 30=2nd cycle, 40=3rd cycle, 90=Bachillerato, etc.) with no direct correspondence to ISCED. The Solution: The table below maps each Paraguay level code to its ISCED equivalent. A critical addition: for Bachillerato (level 90), we verify both final grade completion and non-attendance status, applying a principle from WIDE methodology that completion means graduation, not just enrollment in the final year. 
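The split of ED0504 and the attendance-aware Bachillerato check can be sketched in Python, where divmod plays the role of R's %/% and %% for non-negative codes. This is an illustrative sketch under stated assumptions: the function names and the boolean attending flag are hypothetical, and the final Bachillerato grade is taken to be 3, as in the mapping table.

```python
def split_ed0504(code):
    """Split the composite ED0504 value into (level, grade)."""
    level, grade = divmod(code, 10)
    return level, grade


def completed_upper_secondary(code, attending):
    """Upper secondary completion under the attendance-aware rule."""
    level, grade = split_ed0504(code)
    if level >= 100:
        # Any tertiary level implies upper secondary completion.
        return True
    # Bachillerato (level 90): final grade reached and no longer enrolled.
    return level == 90 and grade == 3 and not attending


# Example: a respondent coded 903 who is still enrolled is not yet a completer.
print(split_ed0504(903), completed_upper_secondary(903, attending=True))
```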
ISCED Mapping Level Code Cycle ISCED Mapping Indicator Logic Notes 0, 10 Pre-school / None 0 → ISCED 0 Below primary 21 EEB 1st cycle (grades 1–3) 1 → ISCED 1 Incomplete primary 30 EEB 2nd cycle (grades 4–6) 1 → ISCED 1 Incomplete primary 40 EEB 3rd cycle (grades 7–9) 1 → ISCED 1 Primary complete threshold at level 40 (entry to 3rd cycle = 6-year primary done) 90 Bachillerato / Media 3 grd==3 &amp; attend≠2 → ISCED 3 Upper secondary: (2021-2023) Requires final grade (3) AND not currently enrolled in secondary (attend code 2 = “estudiando”). People with attend=2 are still in school; WIDE methodology counts only actual graduates. 100–199 University / Tertiary 4+ → ISCED 4+ Regular tertiary; all count as upper secondary complete 240–999 Técnico Superior / Advanced Tertiary 4+ → ISCED 4+ Short-cycle &amp; advanced tertiary; all count as upper secondary complete. Level 240 (Técnico Superior, ~2-3 year vocational) enters immediately after Bachillerato; presence at 240+ proves secondary completion. Completion Logic by Level (Hierarchical Cascading) Primary (ISCED 1): Level 40+ (anyone entering EEB 3rd cycle or higher has completed 6-year primary). Lower Secondary (ISCED 2): Level 40 with grade=9 (EEB completion), or level 90+ (anyone at Bachillerato/tertiary has passed lower secondary). - Denominator restriction: Non-attending respondents only (attending_currently_h == 19), matching WIDE methodology. Upper Secondary (ISCED 3): - Level 90 (Bachillerato): Final grade completed (grd==3) AND not currently in school (attend≠2). - Rationale: WIDE methodology is strict on enrollment status. Survey timing can capture students in their final month before graduation; without the attending filter, these count as completers even though diplomas aren’t issued until the following calendar year. - Attending code mapping: Code 2 = “estudiando” (currently attending secondary). Code 19 and NA = non-attending (graduated or dropped out). 
- Level 100–999 (Tertiary): All tertiary attendance proves secondary completion (hierarchical cascade). Grade Handling and Population Restriction Grade handling: For all levels except the upper_secondary patch, within-cycle grade is typically discarded (set to NA). For level 90 specifically, grade==3 is verified to ensure Bachillerato final year (3-year cycle). The estimator then applies the attendance filter to remove in-progress students. Population restriction: Lower secondary completion uses non-attending respondents only (attending_currently_h == 19), matching WIDE methodology and explaining why lower secondary COMP_LVL is much lower (~84%) than primary (~99%). Primary and upper secondary use the full reference-age population (all respondents in the cohort, regardless of attendance status). Argentina — Attendance (CH10) and Completion (CH12/CH13/CH14/NIVEL_ED) (EPH) Attendance — Direct Question: Argentina’s EPH includes a direct, unambiguous attendance question (CH10): 1 = currently attending school; anything else = not attending. Compared to Honduras (where we must infer attendance from incomplete level codes) or Paraguay (where grade must be parsed from a composite), Argentina’s attendance mapping is straightforward. This simplicity yields ~99% primary attendance rates, perfectly aligned with WIDE benchmarks. Completion — A Conflation Problem: The completion mapping is more complex. Argentina’s NIVEL_ED field conflates two incompatible aspects: it records both current enrollment level and highest attainment level simultaneously. To resolve this, the pipeline uses a surgical two-phase approach: first, extract raw variables (CH12, CH13, CH14) that disambiguate what NIVEL_ED actually means; second, apply stricter grade thresholds to account for provincial variation in secondary structure. 
The EPH’s NIVEL_ED conflates two incompatible education systems (pre-1993 traditional 7+5 and post-2006 EGB 9+3), causing systematic misclassification of lower secondary completion. The solution: use supplementary variables to disambiguate what NIVEL_ED=3 actually represents, then apply appropriate ISCED mappings. The table below shows the base NIVEL_ED codes and their ISCED translations. Where NIVEL_ED=3 appears (the ambiguous case), the rightmost column indicates how the surgical fix disambiguates using CH12, CH13, and CH14: NIVEL_ED Base Interpretation Base ISCED Surgical Fix (CH12/CH13/CH14) 1–2 No formal schooling / incomplete primary 0/1 No change; direct assignment 3 Secondary incomplete (conflates two systems) → 1 or 2 Disambiguated by CH12: EGB (CH12=3) + completion OR Grade 9; Traditional secondary (CH12=4) + Grade 3+; Tertiary (CH12≥5) → ISCED 2; Missing CH12 → ISCED 1 4 Incomplete traditional secondary 3 No change; incomplete upper secondary 5 Complete secondary / Polimodal 3 No change; complete upper secondary 6–11 Tertiary and above 4+ No change; direct assignment Phase 1: Raw Variable Extraction Extract three raw EPH variables to disambiguate NIVEL_ED=3 (“secondary incomplete”): - CH12: Highest level attended (1=pre-primary, 2=traditional primary, 3=EGB, 4=secondary, 5+=tertiary) - CH13: Completion status (1=completed, 2=not completed) - CH14: Last approved grade/year (numeric 0-9 for primary/EGB cycles, 1-6 for secondary) Phase 2: ISCED Mapping with Stricter Thresholds NIVEL_ED = 1–2 (No schooling / incomplete primary) → ISCED 0/1 NIVEL_ED = 3 (Secondary incomplete) — Depends on raw evidence: - EGB system (CH12=3): ISCED 2 if CH13=1 (finished 9 years) OR CH14≥9 (approved all grades) - Traditional secondary (CH12=4): ISCED 2 if CH14≥3 (reached Grade 3+); stricter threshold accounts for 6+6 provincial structures where Grade 3 = Year 3 - Polimodal/tertiary (CH12≥5): ISCED 2 (cascading rule: anyone attending tertiary has completed lower secondary) - 
Missing CH12/CH13/CH14 (60% of sample): ISCED 1 (conservative: treat as incomplete unless explicit evidence) NIVEL_ED ≥ 4 (Explicit higher completion) → ISCED 2 or 3 (per NIVEL_ED code) Rationale for Stricter CH14≥3 Argentina is split into two provincial structures: - 7+5 provinces (CABA, Santa Fe): ISCED 2 completion = Grade 2 (Year 2 of secondary) - 6+6 provinces (Buenos Aires, Córdoba, ~70% of population): ISCED 2 completion = Grade 3 (Year 3 of secondary) By using CH14≥3 universally, the code conservatively assumes the more restrictive 6+6 structure. This prevents false crediting of students who completed only Year 2 in 6+6 jurisdictions. Results Benchmark Comparison Table The table below reports the full set of benchmarked comparisons between the pipeline estimates and their published reference values. Household core indicators (COMP_LVL, OOS_LVL, LIT_RATE) are expressed as rates on a 0–1 scale; the finance indicator (FIN_CRS) is expressed in its native OECD DAC/CRS unit. The deviation threshold follows the UIS convention applied in this study: green for absolute differences below 0.03 (3 pp for rate indicators), indicator drift for 0.03–0.10 (3–10 pp), and red above 0.10. 
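The deviation threshold stated above can be written as a small classifier. This is a sketch in Python under my own assumptions: the function name is hypothetical, differences are rounded to four decimals to match the table's precision, and the boundary values 0.03 and 0.10 are assigned to the middle band, which the text does not specify.

```python
def classify_deviation(internal, benchmark):
    """Label the absolute difference between a pipeline estimate and its benchmark."""
    diff = round(abs(internal - benchmark), 4)
    if diff > 0.10:
        return "red"
    if diff >= 0.03:
        return "indicator drift"
    return "green"


# Example with two rows of the benchmark table (ARG and HND lower secondary, 2023).
print(classify_deviation(0.8877, 0.8940))
print(classify_deviation(0.4834, 0.5480))
```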
Family Indicator Level Country Year Internal Benchmark Abs Diff Source Status Finance Layer FIN_CRS national ARG 2021 16.2003 16.2003 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national ARG 2022 17.5910 17.5910 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national ARG 2023 16.5176 16.5176 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national ARG 2024 16.3553 16.3553 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national HND 2021 35.8062 35.8062 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national HND 2022 33.6748 33.6748 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national HND 2023 47.1571 47.1571 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national HND 2024 34.0197 34.0197 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national PRY 2021 9.7337 9.7337 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national PRY 2022 7.7855 7.7855 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national PRY 2023 6.9071 6.9071 0.0000 OECD DAC 🟢 Good Finance Layer FIN_CRS national PRY 2024 9.3608 9.3608 0.0000 OECD DAC 🟢 Good Household Core COMP_LVL lower_secondary ARG 2021 0.8845 0.8787 0.0058 WIDE 🟢 Good Household Core COMP_LVL lower_secondary ARG 2022 0.8804 0.8850 0.0046 WIDE 🟢 Good Household Core COMP_LVL lower_secondary ARG 2023 0.8877 0.8940 0.0063 WIDE 🟢 Good Household Core COMP_LVL lower_secondary HND * 2023 0.5611 0.5480 0.0131 WIDE 🟢 Good Household Core COMP_LVL lower_secondary HND 2023 0.4834 0.5480 0.0646 WIDE 🟡 Review Household Core COMP_LVL lower_secondary PRY 2021 0.8417 0.8158 0.0259 WIDE 🟢 Good Household Core COMP_LVL lower_secondary PRY 2022 0.8653 0.8540 0.0113 WIDE 🟢 Good Household Core COMP_LVL lower_secondary PRY 2023 0.8693 0.8520 0.0173 WIDE 🟢 Good Household Core COMP_LVL primary ARG 2021 0.9667 0.9966 0.0299 WIDE 🟢 Good Household Core COMP_LVL primary ARG 2022 0.9733 0.9930 0.0197 WIDE 🟢 Good Household Core COMP_LVL primary ARG 2023 0.9675 0.9850 0.0175 WIDE 🟢 Good Household Core COMP_LVL primary HND * 2023 0.8883 0.8480 0.0403 WIDE 🟢 Good Household Core COMP_LVL 
primary HND 2023 0.7644 0.8480 0.0836 WIDE 🟡 Review Household Core COMP_LVL primary PRY 2021 0.9973 0.9582 0.0391 WIDE 🟢 Good Household Core COMP_LVL primary PRY 2022 0.9948 0.9590 0.0358 WIDE 🟢 Good Household Core COMP_LVL primary PRY 2023 0.9937 0.9595 0.0342 WIDE 🟢 Good Household Core COMP_LVL upper_secondary ARG 2021 0.7225 0.7169 0.0056 WIDE 🟢 Good Household Core COMP_LVL upper_secondary ARG 2022 0.7439 0.7650 0.0211 WIDE 🟢 Good Household Core COMP_LVL upper_secondary ARG 2023 0.7507 0.7620 0.0113 WIDE 🟢 Good Household Core COMP_LVL upper_secondary HND * 2023 0.4304 0.4170 0.0134 WIDE 🟢 Good Household Core COMP_LVL upper_secondary HND 2023 0.3511 0.4170 0.0659 WIDE 🟡 Review Household Core COMP_LVL upper_secondary PRY 2021 0.6679 0.6099 0.0581 WIDE 🟡 Review Household Core COMP_LVL upper_secondary PRY 2022 0.6790 0.6620 0.0170 WIDE 🟢 Good Household Core COMP_LVL upper_secondary PRY 2023 0.7069 0.6900 0.0169 WIDE 🟢 Good Household Core LIT_RATE All HND 2022 0.9590 0.9590 0.0000 WB Fallback (WIDE unavailable) 🟢 Good Household Core LIT_RATE All HND 2023 0.9556 0.9556 0.0000 WB Fallback (WIDE unavailable) 🟢 Good Household Core LIT_RATE All HND 2024 0.9577 0.9577 0.0000 WB Fallback (WIDE unavailable) 🟢 Good Household Core LIT_RATE All PRY 2021 0.9863 0.9860 0.0003 WB Fallback (WIDE unavailable) 🟢 Good Household Core LIT_RATE All PRY 2022 0.9864 0.9860 0.0004 WB Fallback (WIDE unavailable) 🟢 Good Household Core LIT_RATE All PRY 2023 0.9886 0.9890 0.0004 WB Fallback (WIDE unavailable) 🟢 Good Household Core LIT_RATE All PRY 2024 0.9862 0.9862 0.0000 WB Fallback (WIDE unavailable) 🟢 Good Household Core OOS_LVL lower_secondary ARG 2021 0.0121 0.0210 0.0089 WIDE 🟢 Good Household Core OOS_LVL lower_secondary ARG 2022 0.0148 0.0150 0.0002 WIDE 🟢 Good Household Core OOS_LVL lower_secondary ARG 2023 0.0134 0.0120 0.0014 WIDE 🟢 Good Household Core OOS_LVL lower_secondary HND 2023 0.2623 0.2715 0.0092 WIDE 🟢 Good Household Core OOS_LVL lower_secondary PRY 2021 0.0415 0.0450 
0.0035 WIDE 🟢 Good Household Core OOS_LVL lower_secondary PRY 2022 0.0367 0.0400 0.0033 WIDE 🟢 Good Household Core OOS_LVL lower_secondary PRY 2023 0.0276 0.0300 0.0024 WIDE 🟢 Good Household Core OOS_LVL primary ARG 2021 0.0108 0.0070 0.0038 WIDE 🟢 Good Household Core OOS_LVL primary ARG 2022 0.0063 0.0040 0.0023 WIDE 🟢 Good Household Core OOS_LVL primary ARG 2023 0.0058 0.0050 0.0008 WIDE 🟢 Good Household Core OOS_LVL primary HND 2023 0.0540 0.0540 0.0000 WIDE 🟢 Good Household Core OOS_LVL primary PRY 2021 0.0110 0.0050 0.0060 WIDE 🟢 Good Household Core OOS_LVL primary PRY 2022 0.0109 0.0110 0.0001 WIDE 🟢 Good Household Core OOS_LVL primary PRY 2023 0.0058 0.0060 0.0002 WIDE 🟢 Good Legend: * = Harmonized series (Age 25-29, valid-only denominator) — demonstrates WIDE-level alignment through methodological reconciliation. Note on Literacy Benchmarking: WIDE literacy data was unavailable for Argentina and Paraguay across all years and Honduras only for 2019. As a methodologically appropriate fallback, World Bank survey-based literacy estimates were used for seven LIT_RATE benchmarks (Honduras 2022–2024, Paraguay 2021–2024), all showing zero differences and validating the internal estimates. Performance Assessment Attendance and Out-of-School Indicators: For OOS_LVL and the underlying attending_currently_h variable, the mapping from raw survey items to harmonized indicators involves direct binary recoding with no ISCED remapping (see the Argentina attendance section for details). Following the correction of Argentina’s attendance variable to use CH10 (the direct EPH attendance question: 1=attends, other=does not attend), all 12 OOS_LVL benchmarked comparisons now pass as green across all three countries and measured years. Argentina’s primary OOS rate now correctly reflects ~1% (range 0.58–1.05%), consistent with the WIDE benchmark of 0.5–0.7%. 
This validates that the harmonization of attendance variables—when implemented correctly against the source questionnaire—delivers structural comparability. Literacy and Finance Indicators: For LIT_RATE and FIN_CRS, all benchmarked comparisons pass with zero or near-zero deviations across every country and year. Literacy is a binary self-report item requiring no ISCED remapping; finance indicators integrate administrative data without harmonization of microdata fields. These indicators demonstrate that cross-country comparability is achievable when the source measure maps directly to the international definition. Completion Rate Deviations: For COMP_LVL, by contrast, the mapping requires resolving national education cycle codes—ED05/CP407 in Honduras, ED0504 in Paraguay, NIVEL_ED in Argentina—into ISCED level thresholds. Each NSO structures its education module to serve domestic administrative and policy purposes—tracking school enrollment for budget planning, monitoring grade repetition, or supporting national curriculum assessments—and none of the three surveys in this study were designed with SDG 4 comparability as a primary objective. Local harmonization rules are therefore needed to translate each country’s national cycle structure into the common ISCED reference framework. These rules are not publicly documented at the variable-by-variable level; the tables in the harmonization section record the mapping used in this pipeline, derived from official codebooks and empirically validated against the published WIDE benchmarks. Structural Deviations After applying the country-specific ISCED mappings documented in the Indicator-Level Harmonization section, deviations concentrate exclusively in COMP_LVL (completion rate) across all three countries, while attendance and out-of-school indicators align uniformly. 
The completion deviations trace to three structural causes: Honduras Completion Rates (Indicator Drift for Primary and Lower Secondary, Reduced Upper Secondary Indicator Drift): The 2023 indicator drift in Honduras stems not from ISCED mapping errors (as documented in the Honduras mapping section) but from definitional choices in cohort age and denominator treatment. For primary completion, the standard series (Age 20–29, all data) shows a −8.36 pp gap vs. WIDE, while the harmonized series (Age 25–29, valid-only) shows a +4.03 pp gap. Both use the identical ISCED mapping applied to the same EPHPM microdata, which indicates that the deviation is structural rather than computational. The 2023 two-track approach reveals: - The Grade 3/9 mapping (H1 patch) correctly captures Ciclo Común completers in Honduras, improving lower secondary from 44.28% to 48.34%. - The Code 5 restoration for 2021 (H2 patch) correctly preserves the pre-reform Ciclo Común category while handling the ED05 code shift for 2022+. - The remaining gap in the standard series (−8.36 pp primary) derives from: (1) including the 20–24 age cohort, whose in-school students inflate non-completion, and (2) treating missing level data (12.5%) as non-completion rather than non-response. Importantly, the harmonized series shows that Honduras can reach WIDE-level alignment through legitimate methodological choices, suggesting that WIDE likely employs similar cohort restrictions or missing-data conventions. I cannot confirm WIDE’s exact approach without access to their computation documentation, but the reconciliation indicates that the indicator drift reflects survey methodology interaction, not a failure of the ISCED mapping. 
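To make the denominator and cohort effects concrete, the two series can be sketched as one weighted estimator with two settings. This is an illustrative Python sketch, not the pipeline's code: the record layout (age, weight, completed, with None for missing level data), the function name, and the toy records are all hypothetical.

```python
def completion_rate(people, age_min, age_max, missing_as_noncompletion):
    """Weighted completion rate for one cohort and one missing-data rule."""
    numerator = 0.0
    denominator = 0.0
    for person in people:
        if person["age"] not in range(age_min, age_max + 1):
            continue
        completed = person["completed"]
        if completed is None and not missing_as_noncompletion:
            # Valid-only denominator: missing records are dropped as non-response.
            continue
        denominator += person["weight"]
        if completed:
            numerator += person["weight"]
    return numerator / denominator if denominator else float("nan")


# Toy records for illustration only; they are not survey data.
toy = [
    {"age": 22, "weight": 1.0, "completed": False},
    {"age": 23, "weight": 1.0, "completed": None},
    {"age": 26, "weight": 2.0, "completed": True},
    {"age": 27, "weight": 1.0, "completed": None},
    {"age": 28, "weight": 1.0, "completed": False},
]

standard = completion_rate(toy, 20, 29, missing_as_noncompletion=True)
harmonized = completion_rate(toy, 25, 29, missing_as_noncompletion=False)
print(round(standard, 4), round(harmonized, 4))
```

The same completion flags yield a lower standard rate than harmonized rate, because only the cohort bounds and the treatment of missing records differ between the two calls.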
Paraguay Completion Rates (Indicator Drift/Red across Primary, Lower, and Upper Secondary): The EPHC encodes completed attainment in the composite field ED0504 (level %/% 10; grade %% 10), offering no within-cycle grade detail for completion inference—a structural constraint documented in the Paraguay mapping section. The pipeline counts as completers all individuals who reached a target cycle, yielding an upper-bound estimate. For lower secondary, the 13–15 pp overestimation likely reflects official WIDE estimates using finer grade thresholds that isolate true graduates. The 2021 primary underestimation (−7.9 pp) is attributable to single-quarter sample coverage; full annual data would likely improve alignment. Argentina Completion Rates (Indicator Drift only; OOS/Attendance all Green): Argentina primary and lower secondary completion show only indicator drift-level deviations (3.7–7.4 pp), well-behaved around the benchmark. The surgical fixes documented in the Argentina mapping section close most deviations to near-benchmark alignment. The historical 2021 pandemic-era anomaly (WIDE benchmark &gt; 1.0) accounts for any residual uncertainty in that year. The attendance fix (CH10) now ensures that Argentina’s out-of-school rates are uniformly green, confirming the underlying harmonization is correct. Honduras 2023: Harmonization Methods and Reconciliation For Honduras 2023 specifically, the analysis reveals a crucial insight about the nature of cross-national completion rate comparison. The pipeline documents two internally consistent methods, both grounded in the same ISCED mapping: Conservative/Internal Method (Standard Series): Age 20–29, all respondents, treats missing education data as non-completion. This yields the indicator drift-level gaps reported in the benchmark (−8.36 pp primary). WIDE-Aligned Method (Harmonized Series): Age 25–29, valid education data only, treats missing data as structural non-response. 
This yields excellent WIDE alignment (+4.03 pp primary). The existence of both methods, using identical ISCED rules, proves the gap is not a mapping error but a consequence of denominator and cohort definition. Specifically: - Cohort Effect: The 20–24 age band contains primarily in-school students, whose completion rates are inherently low (they haven’t finished yet). Excluding this band increases overall rates. - Missing Data Effect: The EPHPM contains ~12.5% of respondents with missing level data (predominantly employed adults not asked education questions). Treating these as “non-complete” (internal method) vs. “non-response” (harmonized method) shifts the benchmark by ~6 pp. Conclusion The overall benchmark alignment validates the two-layer harmonization strategy employed in this study. The global layer addressed structural heterogeneity—variable names, coding conventions, questionnaire architectures, and sampling designs that differ substantially across NSOs—by constructing a standardized person-level analytical record with explicit, auditable transformation rules. The indicator layer then tackled the conceptual gap: translating national education cycle codes into ISCED-compatible classifications. For indicators relying on binary direct recodes—attendance (attending_currently_h from CH10, ED03, ED08), out-of-school status (OOS_LVL), literacy (LIT_RATE), and finance data (FIN_CRS)—all 36 benchmarked comparisons pass with zero or near-zero deviations. This demonstrates that high cross-country comparability is achievable when harmonization rules are explicit and grounded in source questionnaire structure. Completion rates (COMP_LVL) present a distinct methodological challenge. Because NSOs encode education attainment through multi-year national cycles rather than ISCED codes, completing a level is defined differently in each country. 
The pipeline deviations, concentrated entirely in COMP_LVL and ranging from indicator drift to high deviation, trace to three documented structural constraints: Honduras’ empty grade variable for active students (see Honduras mapping), Paraguay’s combined level-grade encoding (see Paraguay mapping), and differing sample coverage across survey years. These are not measurement errors; they are the exact points where national survey design meets international standardization demands. The pattern is methodologically significant: indicators requiring no conceptual translation align very well, whereas indicators demanding ISCED remapping show systematic friction in the form of recurring deviation patterns during the reconstruction. When published official indicators diverge from my survey-consistent estimates, the gap illuminates how NSO-specific questionnaire design (as detailed in the Honduras, Paraguay, and Argentina mapping sections) shapes the result, and why strict, explicit mapping rules are needed for SDG 4 monitoring and cross-country comparison. A particularly important finding emerges from Honduras 2023: by demonstrating that the same ISCED mapping produces both indicator drift-level estimates (under conservative cohort and denominator assumptions) and WIDE-aligned estimates (under harmonized assumptions), I establish that the observed gap is structural, not computational. This two-track reconciliation approach, documented alongside the standard series in the indicators output, provides stakeholders with both a conservative measure and a methodological bridge to international benchmarks, clarifying that completion rate alignment depends fundamentally on how the reference population and missing data are defined. Ultimately, any attempt to monitor educational attainment across borders must actively bridge the gap between national survey design and international comparison frameworks through reproducible, auditable harmonization. 
This study demonstrates that when harmonization rules are explicit and grounded in source metadata, high external validity is achievable, deviations become interpretable signals of underlying data architecture, and, critically, reconciliation is possible through transparent documentation of alternative but equally defensible methodological choices. References Desjardins, Richard et al. 2024. “Harmonizing Measurements: Establishing a Common Metric via Shared Items Across Instruments.” *Measurement: Interdisciplinary Research and Perspectives* 22: 1–15. &lt;https://doi.org/10.1186/s12963-024-00351-z&gt;. Global Education Monitoring Report. 2026. *Global Education Monitoring Report 2026 (Forthcoming)*. UNESCO. &lt;https://unesdoc.unesco.org/ark:/48223/pf0000393218&gt;. IPUMS International. 2023. “IPUMS MICS Data Harmonization Code.” &lt;https://doi.org/10.18128/D082.V1.3&gt;. Ruggles, Steven et al. 2019. “Harmonization of Census Data.” In *Handbook of International Large-Scale Assessment: Implementation and Practice*, 441–71. Wiley. &lt;https://doi.org/10.1002/9781119712206.ch12&gt;. UNESCO Institute for Statistics. 2024. “Calculation of Education Indicators Based on Household Survey Data.” UNESCO. &lt;https://tcg.uis.unesco.org/wp-content/uploads/sites/4/2024/02/Calculation-of-education-indicators_HHS_Report-UNESCO-UIS-13122023.pdf&gt;.]]></summary></entry><entry><title type="html">Leveraging Financial Analysis with Google BigQuery and Python: A Financial Big Data Application.</title><link href="https://mario1084.github.io/blog/2025/09/10/fin_bigquery.html" rel="alternate" type="text/html" title="Leveraging Financial Analysis with Google BigQuery and Python: A Financial Big Data Application." /><published>2025-09-10T00:00:00+00:00</published><updated>2025-09-10T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2025/09/10/fin_bigquery</id><content type="html" xml:base="https://mario1084.github.io/blog/2025/09/10/fin_bigquery.html"><![CDATA[<h2 id="1-installing-python-in-rstudio">1) Installing Python in RStudio</h2>

<p>For this project, I will work in the RStudio IDE, since it is
one of the easiest and fastest ways to generate a Markdown document with
Python snippets. I first installed Python through the reticulate
package and Conda repositories, which allows me to run Python commands
within the R session. The process has four steps: first, install the
reticulate package with its dependencies; second, run the install_miniconda()
command to install Python; third, accept the terms of service of the
default Conda channels on the local machine; and finally,
create a Python environment that is bound to the R session when it
starts (this happens when I render the Markdown document).</p>

<p>(If you have already installed Python, or you prefer working with
Jupyter, skip to part 2.)</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Install reticulate</span><span class="w">
</span><span class="n">install.packages</span><span class="p">(</span><span class="s2">"reticulate"</span><span class="p">,</span><span class="w"> </span><span class="n">dependencies</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">


</span><span class="c1"># Load Library</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">reticulate</span><span class="p">)</span><span class="w">

</span><span class="c1"># Install Python</span><span class="w">
</span><span class="n">install_miniconda</span><span class="p">()</span><span class="w">

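</span><span class="c1"># Locate the conda binary, then accept the ToS for the default channels conda complained about</span><span class="w">
</span><span class="n">conda</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">reticulate</span><span class="o">::</span><span class="n">conda_binary</span><span class="p">()</span><span class="w">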
</span><span class="n">system2</span><span class="p">(</span><span class="n">conda</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"tos"</span><span class="p">,</span><span class="s2">"accept"</span><span class="p">,</span><span class="s2">"--override-channels"</span><span class="p">,</span><span class="s2">"--channel"</span><span class="p">,</span><span class="s2">"https://repo.anaconda.com/pkgs/main"</span><span class="p">))</span><span class="w">
</span><span class="n">system2</span><span class="p">(</span><span class="n">conda</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"tos"</span><span class="p">,</span><span class="s2">"accept"</span><span class="p">,</span><span class="s2">"--override-channels"</span><span class="p">,</span><span class="s2">"--channel"</span><span class="p">,</span><span class="s2">"https://repo.anaconda.com/pkgs/r"</span><span class="p">))</span><span class="w">
</span><span class="n">system2</span><span class="p">(</span><span class="n">conda</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"tos"</span><span class="p">,</span><span class="s2">"accept"</span><span class="p">,</span><span class="s2">"--override-channels"</span><span class="p">,</span><span class="s2">"--channel"</span><span class="p">,</span><span class="s2">"https://repo.anaconda.com/pkgs/msys2"</span><span class="p">))</span><span class="w">



</span><span class="c1"># Now create the env reticulate wanted</span><span class="w">
</span><span class="n">reticulate</span><span class="o">::</span><span class="n">conda_create</span><span class="p">(</span><span class="s2">"r-reticulate"</span><span class="p">,</span><span class="w"> </span><span class="n">packages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"python=3.10"</span><span class="p">,</span><span class="s2">"pip"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Install the required libraries from conda-forge</span><span class="w">
</span><span class="n">reticulate</span><span class="o">::</span><span class="n">conda_install</span><span class="p">(</span><span class="s2">"r-reticulate"</span><span class="p">,</span><span class="w">
                          </span><span class="nf">c</span><span class="p">(</span><span class="s2">"pandas"</span><span class="p">,</span><span class="s2">"pyarrow"</span><span class="p">,</span><span class="s2">"google-cloud-bigquery"</span><span class="p">,</span><span class="w">
                            </span><span class="s2">"pandas-gbq"</span><span class="p">,</span><span class="s2">"google-cloud-bigquery-storage"</span><span class="p">,</span><span class="s2">"db-dtypes"</span><span class="p">),</span><span class="w">
                          </span><span class="n">channel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"conda-forge"</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>After the installation is done, it is best to start a clean R session
(Session &gt; Restart R, Ctrl + Shift + F10 by default) and load Python in the session.
The following snippet binds R to the Python environment within
RStudio.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Bind the Python env to the R session; it is best to restart R first (Ctrl + Shift + F10)</span><span class="w">
</span><span class="n">reticulate</span><span class="o">::</span><span class="n">use_condaenv</span><span class="p">(</span><span class="s2">"r-reticulate"</span><span class="p">,</span><span class="w"> </span><span class="n">required</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="n">reticulate</span><span class="o">::</span><span class="n">py_config</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<h2 id="2-setting-up-the-cloud-environment">2) Setting Up the Cloud Environment</h2>

<p>For this project, I showcase the Google Cloud CLI, which
provides a powerful way to interact with Google BigQuery from Python.
BigQuery has the advantage of running on Google’s cloud infrastructure,
and it accepts regular SQL as well as BigQuery-specific syntax if needed. The Google Cloud CLI
also has the advantage of monitoring and managing query jobs that run
regularly. Furthermore, BigQuery can directly train, evaluate,
and run ML models suitable for prediction and forecasting.</p>

<p>The dataset that I use is public and accessible through Google
BigQuery. It is called <code class="language-plaintext highlighter-rouge">thelook_ecommerce</code> and is rich in tables that contain
inventories, KPIs, and other financial data typically needed for
decision-making.</p>

<h3 id="21-install-google-cloud-cli">2.1) Install Google Cloud CLI</h3>

<p>The first step is to install the Google Cloud CLI. I am installing it with
Bundled Python and Beta Commands, and skipping the Cloud Tools for
PowerShell, which I do not currently need.</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/google_cli_1.png?raw=true" alt="" /><!-- --></p>

<p>After installation, the Google Shell will open, and it will require
authentication with a Google Account.</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/google_cli_2.png?raw=true" alt="" /><!-- --></p>

<p>After selecting Yes:</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/google_cli_3.png?raw=true" alt="" /><!-- --></p>

<p>Make sure the authentication is correct:</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/google_cli_4.png?raw=true" alt="" /><!-- --></p>

<h3 id="22-creating-a-project-in-gcloud-power-shell">2.2) Creating a Project in Gcloud Power-Shell</h3>

<p>After accepting the terms, we can go back to PowerShell,
create a new project (Option 3), and give it a name; I am
calling my project <code class="language-plaintext highlighter-rouge">finance-bigq-demo-2025-mgs</code>. If there is a problem,
for instance, an incompatible project ID, you can run <code class="language-plaintext highlighter-rouge">gcloud projects create finance-bigq-demo-2025-mgs</code>
to create the project.</p>

<h3 id="23-authentication-and-enabling-big-data-services">2.3) Authentication and Enabling Big Data Services</h3>

<p>After the project is created, you can proceed to authenticate your user
account in gcloud. This is done once, and the credentials are stored
locally: <code class="language-plaintext highlighter-rouge">gcloud auth application-default login</code>. After running this
command, your default browser will open; just sign in with your
Google credentials. The final step is enabling the BigQuery API services we will
use in the project <code class="language-plaintext highlighter-rouge">finance-bigq-demo-2025-mgs</code>
with <code class="language-plaintext highlighter-rouge">gcloud services enable</code>.</p>
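<p>The step that enables the services can also be scripted from Python. The sketch below only builds and prints the command rather than running it. The two service names are assumed to be the identifiers of the BigQuery API and the BigQuery Storage API, and the helper <code class="language-plaintext highlighter-rouge">gcloud_enable_command</code> is a hypothetical name.</p>

```python
import shlex

PROJECT_ID = "finance-bigq-demo-2025-mgs"

# Assumed service names: the BigQuery API and the BigQuery Storage API
SERVICES = ["bigquery.googleapis.com", "bigquerystorage.googleapis.com"]


def gcloud_enable_command(project_id, services):
    """Build the argument list for `gcloud services enable`."""
    return ["gcloud", "services", "enable", *services, "--project", project_id]


cmd = gcloud_enable_command(PROJECT_ID, SERVICES)
# Print the command for review; subprocess.run(cmd, check=True) would execute it
print(shlex.join(cmd))
```

<p>Printing the command first makes it easy to review the project ID before anything is changed in the cloud account.</p>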

<h3 id="24-sanity-checks">2.4) Sanity Checks</h3>

<p>Before continuing, it is useful to perform some sanity check commands to
verify everything is working in good order. I recommend
<code class="language-plaintext highlighter-rouge">gcloud config list</code>, for showing the active project, followed by
<code class="language-plaintext highlighter-rouge">gcloud services list --enabled</code> that verifies that the BigQuery API
services are enabled. An additional step is to verify that your account is
set up as the active account, which you can do with <code class="language-plaintext highlighter-rouge">gcloud auth list</code>.
After running this last command, you should be able to see your account
marked with <code class="language-plaintext highlighter-rouge">*</code>.</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/google_cli_5.png?raw=true" alt="" /><!-- --></p>

<h2 id="3-the-thelook_ecommerce-dataset">3) The <code class="language-plaintext highlighter-rouge">thelook_ecommerce</code> Dataset</h2>

<p>Google Cloud comes with public datasets designed to test its
capabilities for processing Big Data. This particular dataset has
variables relevant to the context of finance and e-commerce. According to Google, it
contains the following tables:</p>

<table>
  <thead>
    <tr>
      <th>Table Name</th>
      <th>Number of Rows</th>
      <th>Number of Columns</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>distribution_centers</td>
      <td>5</td>
      <td>5</td>
      <td>Lists the distribution centers, including their ID and basic location data.</td>
    </tr>
    <tr>
      <td>events</td>
      <td>1.1 million+</td>
      <td>14</td>
      <td>Website activity for users (page views, cart events, etc.).</td>
    </tr>
    <tr>
      <td>inventory_items</td>
      <td>2.5 million+</td>
      <td>14</td>
      <td>Item-level inventory records, including status (shipped, returned, etc.).</td>
    </tr>
    <tr>
      <td>order_items</td>
      <td>2.5 million+</td>
      <td>19</td>
      <td>Links products to orders with item-level details, including sale price.</td>
    </tr>
    <tr>
      <td>orders</td>
      <td>2.5 million+</td>
      <td>10</td>
      <td>Transaction headers for each customer order.</td>
    </tr>
    <tr>
      <td>products</td>
      <td>28,000+</td>
      <td>10</td>
      <td>Product catalog with category, brand, and cost.</td>
    </tr>
    <tr>
      <td>users</td>
      <td>100,000</td>
      <td>12</td>
      <td>User demographics and traffic source.</td>
    </tr>
  </tbody>
</table>

<h2 id="4-connecting-gcloud-with-python">4) Connecting Gcloud with Python</h2>

<p>This query connects to the <code class="language-plaintext highlighter-rouge">thelook_ecommerce</code> dataset and retrieves
every table and column name with its data type. For later use, I
save the result as a CSV to study the variables for further analysis.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Load libraries
</span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="c1"># to manipulate data
</span><span class="kn">import</span> <span class="nn">pandas_gbq</span> <span class="k">as</span> <span class="n">pgbq</span> <span class="c1"># to connect to the Gcloud
</span>
<span class="c1"># Define ENV Variables
</span><span class="n">PROJECT_ID</span> <span class="o">=</span> <span class="s">"finance-bigq-demo-2025-mgs"</span>         
<span class="n">LOCATION</span>   <span class="o">=</span> <span class="s">"US"</span>                      <span class="c1"># theLook public dataset is in US
</span><span class="n">USE_STORAGE_API</span> <span class="o">=</span> <span class="bp">False</span>                 <span class="c1"># set to True if Storage API is enabled, faster for big data
</span><span class="n">SCHEMA_PATH</span> <span class="o">=</span> <span class="s">"look_ecom_schema.csv"</span>

<span class="k">try</span><span class="p">:</span>
    <span class="c1"># Load the cached schema from CSV if it already exists
</span>    <span class="n">look_ecom_schema</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">SCHEMA_PATH</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">look_ecom_schema</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
<span class="k">except</span> <span class="ne">FileNotFoundError</span><span class="p">:</span>
  <span class="c1"># No cached CSV yet: read the schema from BigQuery
</span>  <span class="n">look_ecom_schema</span> <span class="o">=</span> <span class="n">pgbq</span><span class="p">.</span><span class="n">read_gbq</span><span class="p">(</span><span class="s">"""SELECT table_name, column_name, data_type
  FROM `bigquery-public-data.thelook_ecommerce`.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name IN ('distribution_centers', 'events', 'inventory_items', 'order_items', 'orders',
  'products', 'users')
  ORDER BY table_name, column_name"""</span><span class="p">,</span> <span class="n">project_id</span><span class="o">=</span><span class="n">PROJECT_ID</span><span class="p">,</span> <span class="n">location</span><span class="o">=</span><span class="s">"US"</span><span class="p">)</span>
  <span class="n">look_ecom_schema</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"look_ecom_schema.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
  <span class="k">print</span><span class="p">(</span><span class="n">look_ecom_schema</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##              table_name               column_name  data_type
## 0  distribution_centers  distribution_center_geom  GEOGRAPHY
## 1  distribution_centers                        id      INT64
## 2  distribution_centers                  latitude    FLOAT64
## 3  distribution_centers                 longitude    FLOAT64
## 4  distribution_centers                      name     STRING
</code></pre></div></div>
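<p>With the schema in a DataFrame, a quick summary helps to study the variables. The sketch below counts columns per table and data type. It uses only the five rows printed above as sample input, and <code class="language-plaintext highlighter-rouge">summarize_schema</code> is a hypothetical helper name.</p>

```python
import pandas as pd

# Sample rows copied from the printed head() above; the real frame covers every table
look_ecom_schema_sample = pd.DataFrame(
    {
        "table_name": ["distribution_centers"] * 5,
        "column_name": ["distribution_center_geom", "id", "latitude", "longitude", "name"],
        "data_type": ["GEOGRAPHY", "INT64", "FLOAT64", "FLOAT64", "STRING"],
    }
)


def summarize_schema(schema):
    """Count columns per table and per data type."""
    return (
        schema.groupby(["table_name", "data_type"])
        .size()
        .rename("n_columns")
        .reset_index()
    )


print(summarize_schema(look_ecom_schema_sample))
```

<p>Applied to the full <code class="language-plaintext highlighter-rouge">look_ecom_schema</code>, the same function shows at a glance which tables carry timestamps, numeric measures, or geography columns.</p>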

<h2 id="5-general-approach-for-data-manipulation-etl">5) General Approach for Data Manipulation (ETL)</h2>

<p>My approach to handling Big Data is to take advantage of the
filtering, aggregation, and joining operations that are performed efficiently in
Google Cloud with SQL in BigQuery. Once the data set is ready, I save it
locally as a data mart that can be opened later for further transformation
and analysis using Python.</p>

<h3 id="61-data-filtering-aggregation-and-joining-strategy">6.1 Data Filtering, Aggregation and Joining Strategy</h3>

<p>For data consistency, I use fallbacks when <code class="language-plaintext highlighter-rouge">NULL</code> values are detected in
key columns. For instance, when filtering the data by timestamp, I first
check when the product was delivered, and if that is <code class="language-plaintext highlighter-rouge">NULL</code>, I fall back
to the order creation timestamp:<br />
<code class="language-plaintext highlighter-rouge">DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH))</code>.</p>
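<p>For readers more familiar with pandas than SQL, the same fallback and month truncation can be sketched locally. The rows below are made up for illustration, and <code class="language-plaintext highlighter-rouge">month_bucket</code> is a hypothetical helper name; the actual computation in this post runs in BigQuery.</p>

```python
import pandas as pd

# Made-up rows for illustration only; the second item has no delivery timestamp
oi = pd.DataFrame(
    {
        "delivered_at": pd.to_datetime(["2025-03-18 10:15:00", None, "2025-04-02 08:00:00"]),
        "created_at": pd.to_datetime(["2025-03-10 09:00:00", "2025-03-29 17:30:00", "2025-03-30 12:00:00"]),
    }
)


def month_bucket(primary, fallback):
    """First day of the month of `primary`, using `fallback` where `primary` is missing."""
    ts = primary.fillna(fallback)  # plays the role of COALESCE(primary, fallback)
    return ts.dt.to_period("M").dt.to_timestamp()  # truncate to the start of the month


oi["revenue_month"] = month_bucket(oi["delivered_at"], oi["created_at"])
print(oi)
```

<p>The second row falls back to its creation date, while the third row is booked in April because it was delivered then, even though it was created in March.</p>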

<h4 id="stage-1">Stage 1</h4>

<p>To create the P&amp;L data, I join three tables. I start by retrieving
<code class="language-plaintext highlighter-rouge">oi.sale_price</code> from the <code class="language-plaintext highlighter-rouge">order_items</code> fact table, which is used to
calculate revenue. For the calculation of cost, I <strong>left join</strong>
<code class="language-plaintext highlighter-rouge">order_items</code>
(<code class="language-plaintext highlighter-rouge">bigquery-public-data.thelook_ecommerce.order_items AS oi</code>) with the
<code class="language-plaintext highlighter-rouge">inventory_items</code> table
(<code class="language-plaintext highlighter-rouge">LEFT JOIN bigquery-public-data.thelook_ecommerce.inventory_items AS ii</code>)
using the join key <code class="language-plaintext highlighter-rouge">ON ii.id = oi.inventory_item_id</code>. Then, as a
fallback, I left join with the <code class="language-plaintext highlighter-rouge">products</code> table
(<code class="language-plaintext highlighter-rouge">LEFT JOIN bigquery-public-data.thelook_ecommerce.products AS p</code>) using
the join key <code class="language-plaintext highlighter-rouge">ON p.id = oi.product_id</code>. This fallback ensures we can
retrieve the <code class="language-plaintext highlighter-rouge">unit_cost</code> when the cost is missing from the inventory
table, by looking it up in the products table:<br />
<code class="language-plaintext highlighter-rouge">COALESCE(ii.cost, p.cost) AS unit_cost</code>.</p>

<p><strong>Tables &amp; columns used</strong></p>

<ul>
  <li>
    <p>Facts: <code class="language-plaintext highlighter-rouge">order_items</code> -&gt; <code class="language-plaintext highlighter-rouge">oi.sale_price</code>, <code class="language-plaintext highlighter-rouge">oi.delivered_at</code>,
<code class="language-plaintext highlighter-rouge">oi.created_at</code>, <code class="language-plaintext highlighter-rouge">oi.returned_at</code></p>
  </li>
  <li>
    <p>Cost: <code class="language-plaintext highlighter-rouge">inventory_items</code> -&gt; <code class="language-plaintext highlighter-rouge">ii.cost</code> (preferred), <code class="language-plaintext highlighter-rouge">products</code> -&gt; <code class="language-plaintext highlighter-rouge">p.cost</code>
(fallback)</p>
  </li>
</ul>
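<p>The two left joins and the cost fallback from Stage 1 can be mirrored in pandas on a toy example. This is only a sketch with made-up rows: the column names follow the tables described above, and <code class="language-plaintext highlighter-rouge">add_unit_cost</code> is a hypothetical helper name.</p>

```python
import pandas as pd

# Made-up rows for illustration; column names follow the tables described above
order_items = pd.DataFrame(
    {"inventory_item_id": [10, 11, 12], "product_id": [1, 2, 3], "sale_price": [20.0, 35.0, 50.0]}
)
inventory_items = pd.DataFrame({"id": [10, 11], "cost": [8.0, None]})
products = pd.DataFrame({"id": [1, 2, 3], "cost": [9.0, 15.0, 22.0]})


def add_unit_cost(oi, ii, p):
    """Left join both cost sources and prefer the inventory cost when present."""
    ii_cost = ii[["id", "cost"]].rename(columns={"id": "inventory_item_id", "cost": "ii_cost"})
    p_cost = p[["id", "cost"]].rename(columns={"id": "product_id", "cost": "p_cost"})
    out = oi.merge(ii_cost, on="inventory_item_id", how="left")
    out = out.merge(p_cost, on="product_id", how="left")
    out["unit_cost"] = out["ii_cost"].fillna(out["p_cost"])  # COALESCE(ii.cost, p.cost)
    return out.drop(columns=["ii_cost", "p_cost"])


base = add_unit_cost(order_items, inventory_items, products)
print(base)
```

<p>The first item keeps its inventory cost, the second has a missing inventory cost, and the third has no inventory match at all; both of the latter take the cost from the product catalog.</p>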

<h4 id="stage-2">Stage 2</h4>

<p>In the second stage, I aggregate gross revenue as
<code class="language-plaintext highlighter-rouge">SUM(sale_price) AS revenue_gross</code> and cost as <code class="language-plaintext highlighter-rouge">SUM(unit_cost) AS cogs</code>.
It is important to note that these are net line prices and costs that do
not include taxes, freight, or other operational expenses that may
affect the estimate.</p>

<h4 id="stage-3">Stage 3</h4>

<p>I estimate returns in the <strong>month they occur</strong>, which is beneficial for
real-time operational dashboards. I filter using
<code class="language-plaintext highlighter-rouge">WHERE returned_at IS NOT NULL</code> to locate returned products. The
aggregations are straightforward:<br />
<code class="language-plaintext highlighter-rouge">SUM(sale_price) AS returns</code> and <code class="language-plaintext highlighter-rouge">SUM(unit_cost) AS cogs_returns</code>.</p>
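<p>The Stage 2 and Stage 3 rollups are plain group-by aggregations. A minimal pandas sketch, using made-up rows in the shape of the base table, makes the timing policy visible: revenue is grouped by revenue month, while returns are grouped by the month in which the return happened.</p>

```python
import pandas as pd

# Made-up base rows for illustration; the second item was returned in April
base = pd.DataFrame(
    {
        "revenue_month": pd.to_datetime(["2025-03-01", "2025-03-01", "2025-04-01"]),
        "return_month": pd.to_datetime([None, "2025-04-01", None]),
        "sale_price": [20.0, 35.0, 50.0],
        "unit_cost": [8.0, 15.0, 22.0],
    }
)

# Stage 2: revenue and COGS by revenue month
revenue_cogs = (
    base.groupby("revenue_month")
    .agg(revenue_gross=("sale_price", "sum"), cogs=("unit_cost", "sum"))
    .rename_axis("month")
    .reset_index()
)

# Stage 3: returns booked in the month they occur (rows without a return are dropped)
returns_only = (
    base.dropna(subset=["return_month"])
    .groupby("return_month")
    .agg(returns=("sale_price", "sum"), cogs_returns=("unit_cost", "sum"))
    .rename_axis("month")
    .reset_index()
)

print(revenue_cogs)
print(returns_only)
```

<p>In this toy data, the item sold in March reduces April revenue when it is returned, which is the behavior an operational dashboard would show.</p>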

<h4 id="stage-4">Stage 4</h4>

<p>In the last stage, I leverage BigQuery to perform fast arithmetic
operations with fallbacks for <code class="language-plaintext highlighter-rouge">NULL</code> values via <code class="language-plaintext highlighter-rouge">COALESCE</code>. Here, it is
worth noting that I use a <code class="language-plaintext highlighter-rouge">FULL OUTER JOIN</code> to include all months in the
P&amp;L data, even if they contain only revenue or only returns.</p>
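<p>The effect of the full outer join is easiest to see on a toy example. The pandas sketch below uses made-up monthly aggregates in which one month has only revenue and another has only returns; <code class="language-plaintext highlighter-rouge">build_pnl</code> is a hypothetical helper name, and the gross profit formula is the one used in the query, revenue_net - (cogs - cogs_returns).</p>

```python
import pandas as pd

# Made-up monthly aggregates; May has returns but no revenue, March has no returns
revenue_cogs = pd.DataFrame(
    {
        "month": pd.to_datetime(["2025-03-01", "2025-04-01"]),
        "revenue_gross": [55.0, 50.0],
        "cogs": [23.0, 22.0],
    }
)
returns_only = pd.DataFrame(
    {
        "month": pd.to_datetime(["2025-04-01", "2025-05-01"]),
        "returns": [35.0, 10.0],
        "cogs_returns": [15.0, 4.0],
    }
)


def build_pnl(rev, ret):
    """Outer-join the two rollups on month and derive net revenue and gross profit."""
    pnl = rev.merge(ret, on="month", how="outer").sort_values("month")
    value_cols = ["revenue_gross", "cogs", "returns", "cogs_returns"]
    pnl[value_cols] = pnl[value_cols].fillna(0)  # plays the role of COALESCE(x, 0)
    pnl["revenue_net"] = pnl["revenue_gross"] - pnl["returns"]
    pnl["gross_profit"] = pnl["revenue_net"] - (pnl["cogs"] - pnl["cogs_returns"])
    return pnl.reset_index(drop=True)


pnl = build_pnl(revenue_cogs, returns_only)
print(pnl)
```

<p>All three months survive the join. With an inner or left join, the month that contains only returns would silently disappear from the P&amp;L.</p>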

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Import os to check whether the mart file exists; otherwise run the ETL
</span><span class="kn">import</span> <span class="nn">os</span>

<span class="c1"># Set the working directory
</span><span class="n">os</span><span class="p">.</span><span class="n">chdir</span><span class="p">(</span><span class="sa">r</span><span class="s">"R:/PHD/Semester 20/Jobs/Empresas/Solvo/financial_analyst/project"</span><span class="p">)</span>

<span class="c1"># define mart file to save the data
</span><span class="n">PARQUET_PATH</span> <span class="o">=</span> <span class="s">"pnl_monthly_5y_operational.parquet"</span>

<span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">PARQUET_PATH</span><span class="p">):</span>
    <span class="c1"># Load cached mart
</span>    <span class="n">df_pnl</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_parquet</span><span class="p">(</span><span class="n">PARQUET_PATH</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Loaded cached mart from"</span><span class="p">,</span> <span class="n">PARQUET_PATH</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">df_pnl</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
<span class="k">else</span><span class="p">:</span>
    <span class="c1"># ETL in the GCloud
</span>    <span class="n">sql</span> <span class="o">=</span> <span class="s">"""
    -- Base CTE: delivery-based timing, last 5y, and unit cost with fallback
    WITH base AS (
      SELECT
      
      -- month bucket as a DATE (1st of month, UTC)
      
      DATE(TIMESTAMP_TRUNC(COALESCE(oi.delivered_at, oi.created_at), MONTH)) AS revenue_month,

      oi.returned_at,
      DATE(TIMESTAMP_TRUNC(oi.returned_at, MONTH)) AS return_month,

      oi.sale_price,
      COALESCE(ii.cost, p.cost) AS unit_cost
      FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
      LEFT JOIN `bigquery-public-data.thelook_ecommerce.inventory_items` AS ii
      ON ii.id = oi.inventory_item_id
      LEFT JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
      ON p.id = oi.product_id
      WHERE
      -- Compare DATE to DATE (last 5 years)
      DATE(COALESCE(oi.delivered_at, oi.created_at))&gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 5 YEAR)),
      
      -- Operational policy rollup 1: revenue &amp; cogs by delivery month
      revenue_cogs AS (
        SELECT
        revenue_month AS month,
        SUM(sale_price) AS revenue_gross,
        SUM(unit_cost)  AS cogs
        FROM base
        GROUP BY month),
        
        -- Operational policy rollup 2: returns by the month they happen
        returns_only AS (
          SELECT
          return_month AS month,
          SUM(sale_price) AS returns,
          SUM(unit_cost) AS cogs_returns
          FROM base
          WHERE returned_at IS NOT NULL
          GROUP BY month)
        
        -- Final monthly P&amp;L (operational)
        SELECT
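        -- USING (month) merges the join key, so the unqualified month is also filled for months with only returns
        month,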
        COALESCE(m.revenue_gross, 0)                      AS revenue_gross, 
        COALESCE(r.returns, 0)                            AS returns, 
        (COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)) AS revenue_net, 
        COALESCE(m.cogs, 0)                               AS cogs,
        COALESCE(r.cogs_returns, 0)                             AS cogs_returns,
        -- GP = revenue_net - (cogs - cogs_returns)
        (COALESCE(m.revenue_gross, 0) - COALESCE(r.returns, 0)
        - (COALESCE(m.cogs, 0) - COALESCE(r.cogs_returns, 0))) AS gross_profit
        FROM revenue_cogs m
        FULL OUTER JOIN returns_only r USING (month)
        ORDER BY month"""</span>
      
    <span class="c1"># GCloud -&gt; pandas DataFrame
</span>    <span class="n">df_pnl</span> <span class="o">=</span> <span class="n">pgbq</span><span class="p">.</span><span class="n">read_gbq</span><span class="p">(</span>
      <span class="n">sql</span><span class="p">,</span>
      <span class="n">project_id</span><span class="o">=</span><span class="n">PROJECT_ID</span><span class="p">,</span>
      <span class="n">location</span><span class="o">=</span><span class="n">LOCATION</span><span class="p">,</span>
      <span class="n">use_bqstorage_api</span><span class="o">=</span><span class="n">USE_STORAGE_API</span>
    <span class="p">)</span>
    
    <span class="c1"># Save locally as Parquet (typed, compressed, fast reloads)
</span>    <span class="n">df_pnl</span><span class="p">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">PARQUET_PATH</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Saved monthly P&amp;L mart (operational policy) to </span><span class="si">{</span><span class="n">PARQUET_PATH</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">df_pnl</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
    <span class="k">print</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">df_pnl</span><span class="p">.</span><span class="n">columns</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Loaded cached mart from pnl_monthly_5y_operational.parquet
##        month  revenue_gross  ...  cogs_returns  gross_profit
## 0 2020-10-01   31392.780016  ...   1047.665769  15271.909169
## 1 2020-11-01   41859.880024  ...   2139.164078  19352.347184
## 2 2020-12-01   42759.699989  ...   2774.276256  18981.968990
## 3 2021-01-01   47594.270006  ...   2497.217012  22142.032789
## 4 2021-02-01   48373.559991  ...   2644.166791  22102.089914
## 
## [5 rows x 7 columns]
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    
    
<span class="c1"># Some sanity checks
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df_pnl</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span><span class="p">[</span><span class="s">"gm_pct"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">"gross_profit"</span><span class="p">]</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">"revenue_net"</span><span class="p">]).</span><span class="n">replace</span><span class="p">([</span><span class="n">pd</span><span class="p">.</span><span class="n">NA</span><span class="p">,</span> <span class="n">pd</span><span class="p">.</span><span class="n">NaT</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">"cogs_pct"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">"cogs"</span><span class="p">]</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">"revenue_gross"</span><span class="p">]).</span><span class="n">replace</span><span class="p">([</span><span class="n">pd</span><span class="p">.</span><span class="n">NA</span><span class="p">,</span> <span class="n">pd</span><span class="p">.</span><span class="n">NaT</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">tail</span><span class="p">(</span><span class="mi">12</span><span class="p">)[[</span><span class="s">"month"</span><span class="p">,</span><span class="s">"revenue_gross"</span><span class="p">,</span><span class="s">"returns"</span><span class="p">,</span><span class="s">"revenue_net"</span><span class="p">,</span><span class="s">"cogs"</span><span class="p">,</span><span class="s">"cogs_returns"</span><span class="p">,</span><span class="s">"gross_profit"</span><span class="p">,</span><span class="s">"gm_pct"</span><span class="p">,</span><span class="s">"cogs_pct"</span><span class="p">]])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##         month  revenue_gross       returns  ...   gross_profit    gm_pct  cogs_pct
## 49 2024-11-01  269082.380185  26214.960039  ...  126124.694354  0.519315  0.480289
## 50 2024-12-01  273170.510381  29337.700056  ...  126714.060647  0.519676  0.480170
## 51 2025-01-01  297889.750371  30776.970042  ...  138763.774925  0.519495  0.479869
## 52 2025-02-01  281236.250379  28218.760069  ...  131552.407939  0.519934  0.479882
## 53 2025-03-01  312965.760157  28176.590032  ...  146840.430599  0.515611  0.484954
## 54 2025-04-01  339263.010250  32180.190041  ...  159081.004919  0.518039  0.481683
## 55 2025-05-01  363811.930262  37850.470028  ...  169256.207956  0.519252  0.480367
## 56 2025-06-01  377284.620347  35158.139980  ...  177297.917865  0.518223  0.481696
## 57 2025-07-01  448945.130368  40269.100062  ...  212147.626249  0.519110  0.481006
## 58 2025-08-01  497113.050341  44222.930023  ...  234784.138660  0.518413  0.481715
## 59 2025-09-01  607773.990589  58880.459994  ...  285932.294740  0.520925  0.478581
## 60 2025-10-01  529830.500402  61318.850029  ...  243232.963535  0.519161  0.481192
## 
## [12 rows x 9 columns]
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"Overall GM% (net):"</span><span class="p">,</span> <span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">"gross_profit"</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span> <span class="o">/</span> <span class="n">df</span><span class="p">[</span><span class="s">"revenue_net"</span><span class="p">].</span><span class="nb">sum</span><span class="p">()))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Overall GM% (net): 0.5190164068500782
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"Median monthly GM%:"</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">"gm_pct"</span><span class="p">].</span><span class="n">median</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Median monthly GM%: 0.5190465748341138
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"Median monthly COGS%:"</span><span class="p">,</span> <span class="n">df</span><span class="p">[</span><span class="s">"cogs_pct"</span><span class="p">].</span><span class="n">median</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Median monthly COGS%: 0.481191899780154
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test</span> <span class="o">=</span> <span class="p">(</span><span class="n">df_pnl</span><span class="p">[</span><span class="s">"revenue_net"</span><span class="p">]</span> <span class="o">-</span> <span class="n">df_pnl</span><span class="p">[</span><span class="s">"gross_profit"</span><span class="p">])</span> <span class="o">-</span> <span class="p">(</span><span class="n">df_pnl</span><span class="p">[</span><span class="s">"cogs"</span><span class="p">]</span> <span class="o">-</span> <span class="n">df_pnl</span><span class="p">[</span><span class="s">"cogs_returns"</span><span class="p">])</span>
<span class="n">test</span><span class="p">.</span><span class="nb">abs</span><span class="p">().</span><span class="nb">max</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## np.float64(1.4551915228366852e-11)
</code></pre></div></div>
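
<p>The residual above is on the order of 1e-11, which is consistent with floating-point rounding on values of this size rather than with an accounting discrepancy. A minimal sketch of turning the same check into an explicit pass/fail test follows. The toy figures and the tolerance of 1e-6 USD are assumptions chosen for illustration; only the column names come from <code class="language-plaintext highlighter-rouge">df_pnl</code>.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy P&L: column names match df_pnl, the values are made up.
toy = pd.DataFrame({
    "revenue_net": [100.10, 250.35, 80.20],
    "cogs": [60.07, 140.21, 50.05],
    "cogs_returns": [5.03, 10.11, 2.02],
})

# Gross profit is net revenue minus net COGS (COGS less the cost of returned goods).
toy["gross_profit"] = toy["revenue_net"] - (toy["cogs"] - toy["cogs_returns"])

# Same residual as in the check above: it should be zero up to rounding error.
residual = (toy["revenue_net"] - toy["gross_profit"]) - (toy["cogs"] - toy["cogs_returns"])

# Assumed absolute tolerance of 1e-6 USD; rtol=0 keeps the comparison purely absolute.
identity_holds = bool(np.allclose(residual, 0.0, rtol=0.0, atol=1e-6))

print("Max abs residual:", residual.abs().max())
print("Identity holds:", identity_holds)
```

<p>An absolute tolerance is used here because the expected value of the residual is exactly zero, so a relative tolerance would have nothing to scale against.</p>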

<h2 id="7-visualize-the-profit--loss-pl-statement">7) Visualize the Profit &amp; Loss (P&amp;L) Statement</h2>

<p>To visualise the P&amp;L statement, I use the <code class="language-plaintext highlighter-rouge">plotly</code> library, which turns the P&amp;L time series into interactive plots. These work well for dashboards: you can zoom in and out, save the figure as a PNG directly, and see the underlying values when you hover over the plot.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">plotly.express</span> <span class="k">as</span> <span class="n">px</span>
<span class="n">df_pnl</span><span class="p">[</span><span class="s">"month"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df_pnl</span><span class="p">[</span><span class="s">"month"</span><span class="p">])</span>
<span class="nb">long</span> <span class="o">=</span> <span class="n">df_pnl</span><span class="p">.</span><span class="n">melt</span><span class="p">(</span>
    <span class="n">id_vars</span><span class="o">=</span><span class="s">"month"</span><span class="p">,</span>
    <span class="n">value_vars</span><span class="o">=</span><span class="p">[</span><span class="s">"revenue_gross"</span><span class="p">,</span><span class="s">"returns"</span><span class="p">,</span><span class="s">"revenue_net"</span><span class="p">,</span><span class="s">"cogs"</span><span class="p">,</span><span class="s">"cogs_returns"</span><span class="p">,</span><span class="s">"gross_profit"</span><span class="p">],</span>
    <span class="n">var_name</span><span class="o">=</span><span class="s">"metric"</span><span class="p">,</span>
    <span class="n">value_name</span><span class="o">=</span><span class="s">"value"</span>
<span class="p">)</span>

<span class="n">fig</span> <span class="o">=</span> <span class="n">px</span><span class="p">.</span><span class="n">line</span><span class="p">(</span><span class="nb">long</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"month"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">"value"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"metric"</span><span class="p">,</span>
              <span class="n">title</span><span class="o">=</span><span class="s">"Monthly P&amp;L"</span><span class="p">,</span>
              <span class="n">labels</span><span class="o">=</span><span class="p">{</span><span class="s">"month"</span><span class="p">:</span> <span class="s">"Month"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">:</span> <span class="s">"USD"</span><span class="p">,</span> <span class="s">"metric"</span><span class="p">:</span> <span class="s">"Series"</span><span class="p">},</span>
              <span class="n">markers</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">update_layout</span><span class="p">(</span><span class="n">hovermode</span><span class="o">=</span><span class="s">"x unified"</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">update_yaxes</span><span class="p">(</span><span class="n">tickprefix</span><span class="o">=</span><span class="s">"$"</span><span class="p">,</span> <span class="n">separatethousands</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
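
<p>The <code class="language-plaintext highlighter-rouge">melt</code> call above reshapes the P&amp;L from wide format (one column per metric) to long format (one row per month and metric), which is the layout <code class="language-plaintext highlighter-rouge">px.line</code> needs to draw one coloured line per metric. The sketch below shows the reshape on a small made-up table; the values and the reduced set of metrics are assumptions for illustration only.</p>

```python
import pandas as pd

# Hypothetical two-month P&L in wide format (values are made up).
wide = pd.DataFrame({
    "month": pd.to_datetime(["2025-01-01", "2025-02-01"]),
    "revenue_net": [100.0, 120.0],
    "cogs": [60.0, 70.0],
    "gross_profit": [40.0, 50.0],
})

# Wide to long: each (month, metric) pair becomes its own row.
long_df = wide.melt(
    id_vars="month",
    value_vars=["revenue_net", "cogs", "gross_profit"],
    var_name="metric",
    value_name="value",
)

print(long_df)
```

<p>With two months and three metrics, the long table has six rows and the three columns <code class="language-plaintext highlighter-rouge">month</code>, <code class="language-plaintext highlighter-rouge">metric</code>, and <code class="language-plaintext highlighter-rouge">value</code>.</p>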

<div>                        <script type="text/javascript">window.PlotlyConfig = {MathJaxConfig: 'local'};</script>
        <script charset="utf-8" src="https://cdn.plot.ly/plotly-3.1.1.min.js" integrity="sha256-HUEFyfiTnZJxCxur99FjbKYTvKSzwDaD3/x5TqHpFu4=" crossorigin="anonymous"></script>                <div id="42fe75f5-05be-467a-9bc5-69474afbbfaa" class="plotly-graph-div" style="height:100%; width:100%;"></div>            <script type="text/javascript">                window.PLOTLYENV=window.PLOTLYENV || {};                                if (document.getElementById("42fe75f5-05be-467a-9bc5-69474afbbfaa")) {                    Plotly.newPlot(                        "42fe75f5-05be-467a-9bc5-69474afbbfaa",                        [{"hovertemplate":"Series=revenue_gross\u003cbr\u003eMonth=%{x}\u003cbr\u003eUSD=%{y}\u003cextra\u003e\u003c\u002fextra\u003e","legendgroup":"revenue_gross","line":{"color":"#636efa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines+markers","name":"revenue_gross","orientation":"v","showlegend":true,"x":["2020-10-01T00:00:00.000000000","2020-11-01T00:00:00.000000000","2020-12-01T00:00:00.000000000","2021-01-01T00:00:00.000000000","2021-02-01T00:00:00.000000000","2021-03-01T00:00:00.000000000","2021-04-01T00:00:00.000000000","2021-05-01T00:00:00.000000000","2021-06-01T00:00:00.000000000","2021-07-01T00:00:00.000000000","2021-08-01T00:00:00.000000000","2021-09-01T00:00:00.000000000","2021-10-01T00:00:00.000000000","2021-11-01T00:00:00.000000000","2021-12-01T00:00:00.000000000","2022-01-01T00:00:00.000000000","2022-02-01T00:00:00.000000000","2022-03-01T00:00:00.000000000","2022-04-01T00:00:00.000000000","2022-05-01T00:00:00.000000000","2022-06-01T00:00:00.000000000","2022-07-01T00:00:00.000000000","2022-08-01T00:00:00.000000000","2022-09-01T00:00:00.000000000","2022-10-01T00:00:00.000000000","2022-11-01T00:00:00.000000000","2022-12-01T00:00:00.000000000","2023-01-01T00:00:00.000000000","2023-02-01T00:00:00.000000000","2023-03-01T00:00:00.000000000","2023-04-01T00:00:00.000000000","2023-05-01T00:00:00.000000000","2023-06-01T00:00:00.000000000","202
3-07-01T00:00:00.000000000","2023-08-01T00:00:00.000000000","2023-09-01T00:00:00.000000000","2023-10-01T00:00:00.000000000","2023-11-01T00:00:00.000000000","2023-12-01T00:00:00.000000000","2024-01-01T00:00:00.000000000","2024-02-01T00:00:00.000000000","2024-03-01T00:00:00.000000000","2024-04-01T00:00:00.000000000","2024-05-01T00:00:00.000000000","2024-06-01T00:00:00.000000000","2024-07-01T00:00:00.000000000","2024-08-01T00:00:00.000000000","2024-09-01T00:00:00.000000000","2024-10-01T00:00:00.000000000","2024-11-01T00:00:00.000000000","2024-12-01T00:00:00.000000000","2025-01-01T00:00:00.000000000","2025-02-01T00:00:00.000000000","2025-03-01T00:00:00.000000000","2025-04-01T00:00:00.000000000","2025-05-01T00:00:00.000000000","2025-06-01T00:00:00.000000000","2025-07-01T00:00:00.000000000","2025-08-01T00:00:00.000000000","2025-09-01T00:00:00.000000000","2025-10-01T00:00:00.000000000"],"xaxis":"x","y":{"dtype":"f8","bdata":"AADI6zGo3kAAgCgpfHDkQAAKUGb24ORAAADko0g950AAgHHrsZ7nQAAAAJqpiupAAACZM7P960AAgL2a2bfqQAAAOXG9eexAAACbM5MH7kAAQNZm\u002ftHwQADgGM04RPBAAOCxuH4y8kAAwBuk1GHxQACAvz0+4\u002fFAAID6ZiZk80AAICMp1IPxQABAdilQGPZAAIAd9nyR80AABctwZS72QADAVQA81\u002fZAAEA9zVis+UAAwNUegfT4QADAsmbC+\u002flAAEC0PZKG+UAAwL2Psjf7QACAcNcnUPtAAEAXmhEa\u002fUAAwDNxTQD7QAAA4MJB9QBBAIBWSCmv\u002fkAA4GWuCQkAQQBgSymYLwFBAADjcH9FAkEAkKSun5gBQQAgreG0rwJBAIB6XOtAA0EAYI0A7FwCQQDAEHvExQRBAOCNAOLXBUEA4AcVrIYEQYAigdefZQdBAACpmSdFB0EAwObCe2MLQQAA+Uf31AhBAIAXZ\u002fYqCkEAgERSCHUOQYCifs2ujwtBAMj9o5ucEEEAME+FaWwQQQBYoQpKrBBBAGBhAIcuEkEAYGMAUSoRQQCgZgoXGhNBAOB+Cvy0FEEAsJa4jzQWQQBAPHsSBxdBADB\u002fhcRmG0EAgIwzZFceQQB4Lvs7jCJBAKg0AE0rIEE="},"yaxis":"y","type":"scatter"},{"hovertemplate":"Series=returns\u003cbr\u003eMonth=%{x}\u003cbr\u003eUSD=%{y}\u003cextra\u003e\u003c\u002fextra\u003e","legendgroup":"returns","line":{"color":"#EF553B","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines+markers","name":"returns","orientation":"v","showlegend":true,"x":["2020-10-01T00:00:00.000000000","2020-11-01T00:
00:00.000000000","2020-12-01T00:00:00.000000000","2021-01-01T00:00:00.000000000","2021-02-01T00:00:00.000000000","2021-03-01T00:00:00.000000000","2021-04-01T00:00:00.000000000","2021-05-01T00:00:00.000000000","2021-06-01T00:00:00.000000000","2021-07-01T00:00:00.000000000","2021-08-01T00:00:00.000000000","2021-09-01T00:00:00.000000000","2021-10-01T00:00:00.000000000","2021-11-01T00:00:00.000000000","2021-12-01T00:00:00.000000000","2022-01-01T00:00:00.000000000","2022-02-01T00:00:00.000000000","2022-03-01T00:00:00.000000000","2022-04-01T00:00:00.000000000","2022-05-01T00:00:00.000000000","2022-06-01T00:00:00.000000000","2022-07-01T00:00:00.000000000","2022-08-01T00:00:00.000000000","2022-09-01T00:00:00.000000000","2022-10-01T00:00:00.000000000","2022-11-01T00:00:00.000000000","2022-12-01T00:00:00.000000000","2023-01-01T00:00:00.000000000","2023-02-01T00:00:00.000000000","2023-03-01T00:00:00.000000000","2023-04-01T00:00:00.000000000","2023-05-01T00:00:00.000000000","2023-06-01T00:00:00.000000000","2023-07-01T00:00:00.000000000","2023-08-01T00:00:00.000000000","2023-09-01T00:00:00.000000000","2023-10-01T00:00:00.000000000","2023-11-01T00:00:00.000000000","2023-12-01T00:00:00.000000000","2024-01-01T00:00:00.000000000","2024-02-01T00:00:00.000000000","2024-03-01T00:00:00.000000000","2024-04-01T00:00:00.000000000","2024-05-01T00:00:00.000000000","2024-06-01T00:00:00.000000000","2024-07-01T00:00:00.000000000","2024-08-01T00:00:00.000000000","2024-09-01T00:00:00.000000000","2024-10-01T00:00:00.000000000","2024-11-01T00:00:00.000000000","2024-12-01T00:00:00.000000000","2025-01-01T00:00:00.000000000","2025-02-01T00:00:00.000000000","2025-03-01T00:00:00.000000000","2025-04-01T00:00:00.000000000","2025-05-01T00:00:00.000000000","2025-06-01T00:00:00.000000000","2025-07-01T00:00:00.000000000","2025-08-01T00:00:00.000000000","2025-09-01T00:00:00.000000000","2025-10-01T00:00:00.000000000"],"xaxis":"x","y":{"dtype":"f8","bdata":"AADgtx52oUAAAHBnpjexQAAAUNXjUbZAAABwrscvtEAAAKBm5tK1QAA
AOB6F\u002f65AAADY1iN3s0AAAMAVLoC2QAAAjJDCz7RAAACAHgV8uEAAACABADi9QAAAdoNrtr9AAADcjUJXu0AAAGhIYSm3QAAAgITrN7tAAABwC9dewEAAAOQKV0S6QAAAvvUoocFAAAAAPkrSv0AAANJmhhDCQAAAIIXrSsRAAADmcJ3twkAAAICGiwTEQAAALK\u002fnSMdAAABI4TpqwkAAANiFKwTAQAAAOAuXfsVAAAD8HkXgw0AAANRmJnTEQAAAgkgBQcxAAABKUvhfzEAAACAVDobMQAAANqQwg8pAAABUXC+v0EAAAPil0NjMQAAAnNfTmdFAAADw\u002f\u002f+b0UAAAOz1iKbIQAAA+h5l5tBAAAC4cY220EAAAOa5vtHOQAAA+D0aA9RAAACzuE550EAAAGwedV3TQAAAeuH6bNNAAAB\u002fAJAp1kAAAPCPwrDXQAAAxHAtU9RAAAANH8Xo2UAAAElxvZnZQAAAuM1sptxAAAAqFT4O3kAAAPmksI7bQAAAF8MlhNtAAACjKQxt30AAAHkKT3viQAAAt3rEKuFAAEC0M6Op40AAgL\u002fC3ZflQAAARbgOwOxAAABxM9vw7UA="},"yaxis":"y","type":"scatter"},{"hovertemplate":"Series=revenue_net\u003cbr\u003eMonth=%{x}\u003cbr\u003eUSD=%{y}\u003cextra\u003e\u003c\u002fextra\u003e","legendgroup":"revenue_net","line":{"color":"#00cc96","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines+markers","name":"revenue_net","orientation":"v","showlegend":true,"x":["2020-10-01T00:00:00.000000000","2020-11-01T00:00:00.000000000","2020-12-01T00:00:00.000000000","2021-01-01T00:00:00.000000000","2021-02-01T00:00:00.000000000","2021-03-01T00:00:00.000000000","2021-04-01T00:00:00.000000000","2021-05-01T00:00:00.000000000","2021-06-01T00:00:00.000000000","2021-07-01T00:00:00.000000000","2021-08-01T00:00:00.000000000","2021-09-01T00:00:00.000000000","2021-10-01T00:00:00.000000000","2021-11-01T00:00:00.000000000","2021-12-01T00:00:00.000000000","2022-01-01T00:00:00.000000000","2022-02-01T00:00:00.000000000","2022-03-01T00:00:00.000000000","2022-04-01T00:00:00.000000000","2022-05-01T00:00:00.000000000","2022-06-01T00:00:00.000000000","2022-07-01T00:00:00.000000000","2022-08-01T00:00:00.000000000","2022-09-01T00:00:00.000000000","2022-10-01T00:00:00.000000000","2022-11-01T00:00:00.000000000","2022-12-01T00:00:00.000000000","2023-01-01T00:00:00.000000000","2023-02-01T00:00:00.000000000","2023-03-01T00:00:00.000000000","2023-04-01T00:00:00.000000000","2023-05-01T00:00:00.000000000","2023-06
-01T00:00:00.000000000","2023-07-01T00:00:00.000000000","2023-08-01T00:00:00.000000000","2023-09-01T00:00:00.000000000","2023-10-01T00:00:00.000000000","2023-11-01T00:00:00.000000000","2023-12-01T00:00:00.000000000","2024-01-01T00:00:00.000000000","2024-02-01T00:00:00.000000000","2024-03-01T00:00:00.000000000","2024-04-01T00:00:00.000000000","2024-05-01T00:00:00.000000000","2024-06-01T00:00:00.000000000","2024-07-01T00:00:00.000000000","2024-08-01T00:00:00.000000000","2024-09-01T00:00:00.000000000","2024-10-01T00:00:00.000000000","2024-11-01T00:00:00.000000000","2024-12-01T00:00:00.000000000","2025-01-01T00:00:00.000000000","2025-02-01T00:00:00.000000000","2025-03-01T00:00:00.000000000","2025-04-01T00:00:00.000000000","2025-05-01T00:00:00.000000000","2025-06-01T00:00:00.000000000","2025-07-01T00:00:00.000000000","2025-08-01T00:00:00.000000000","2025-09-01T00:00:00.000000000","2025-10-01T00:00:00.000000000"],"xaxis":"x","y":{"dtype":"f8","bdata":"AADMFG553EAAgDpch0niQAAKpuu5FuJAAAAWrk+35EAAgJ0eVeTkQACAHEixmuhAAAC+uM6O6UAAgAXY0+fnQACAJx\u002fF3+lAAADLjxL46kAAgIjN\u002fPztQAAAwymkkexAACDUjwp98EAAgCoffd7vQACAd4W\u002fL\u002fBAAICMhUtY8UAAwOlwHb\u002fvQACAvgor5PNAAIA9UliU8UAAxfCjVOzzQADAsY\u002feTfRAAIAgH6VO90AAwAWu73P2QABAzXClEvdAAECL4Uo590AAwAIfLTf5QACACfZUoPhAAMA39gie+kAAQFmkyHH4QADAr1xjYv5AAEANPioj+0AAwCeaUYH8QAAAED7KDv9AAIBYhZkvAEEAIIpIJZb\u002fQACguWZ6fABBAIB8XGsNAUEAoC5xg9IAQQCAMdf3qAJBAOBWUhDBA0EAgGkpkJkCQYAiwo885QRBAKCSwv01BUEAQBkfzfcIQQDAyetXZwZBAKAHZ8RlB0EAgEYA8H4LQYAiZh9JBQlBAPAZpB78DUEAQHVcm6UNQQCwi3vGww1BAMAOH6NNEEEAoCfsy+IOQQAwNa7UYRFBALDkRyu+EkEAkEfXJeUTQQBg5eu54RRBAKgIH5DxGEEAkDR7aKQbQQAoqg87wCBBADD7mX6YHEE="},"yaxis":"y","type":"scatter"},{"hovertemplate":"Series=cogs\u003cbr\u003eMonth=%{x}\u003cbr\u003eUSD=%{y}\u003cextra\u003e\u003c\u002fextra\u003e","legendgroup":"cogs","line":{"color":"#ab63fa","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines+markers","name":"cogs","orientation":"v","showlegend":true,"x":["2020-10-01T00:00:00.000000000","
2020-11-01T00:00:00.000000000","2020-12-01T00:00:00.000000000","2021-01-01T00:00:00.000000000","2021-02-01T00:00:00.000000000","2021-03-01T00:00:00.000000000","2021-04-01T00:00:00.000000000","2021-05-01T00:00:00.000000000","2021-06-01T00:00:00.000000000","2021-07-01T00:00:00.000000000","2021-08-01T00:00:00.000000000","2021-09-01T00:00:00.000000000","2021-10-01T00:00:00.000000000","2021-11-01T00:00:00.000000000","2021-12-01T00:00:00.000000000","2022-01-01T00:00:00.000000000","2022-02-01T00:00:00.000000000","2022-03-01T00:00:00.000000000","2022-04-01T00:00:00.000000000","2022-05-01T00:00:00.000000000","2022-06-01T00:00:00.000000000","2022-07-01T00:00:00.000000000","2022-08-01T00:00:00.000000000","2022-09-01T00:00:00.000000000","2022-10-01T00:00:00.000000000","2022-11-01T00:00:00.000000000","2022-12-01T00:00:00.000000000","2023-01-01T00:00:00.000000000","2023-02-01T00:00:00.000000000","2023-03-01T00:00:00.000000000","2023-04-01T00:00:00.000000000","2023-05-01T00:00:00.000000000","2023-06-01T00:00:00.000000000","2023-07-01T00:00:00.000000000","2023-08-01T00:00:00.000000000","2023-09-01T00:00:00.000000000","2023-10-01T00:00:00.000000000","2023-11-01T00:00:00.000000000","2023-12-01T00:00:00.000000000","2024-01-01T00:00:00.000000000","2024-02-01T00:00:00.000000000","2024-03-01T00:00:00.000000000","2024-04-01T00:00:00.000000000","2024-05-01T00:00:00.000000000","2024-06-01T00:00:00.000000000","2024-07-01T00:00:00.000000000","2024-08-01T00:00:00.000000000","2024-09-01T00:00:00.000000000","2024-10-01T00:00:00.000000000","2024-11-01T00:00:00.000000000","2024-12-01T00:00:00.000000000","2025-01-01T00:00:00.000000000","2025-02-01T00:00:00.000000000","2025-03-01T00:00:00.000000000","2025-04-01T00:00:00.000000000","2025-05-01T00:00:00.000000000","2025-06-01T00:00:00.000000000","2025-07-01T00:00:00.000000000","2025-08-01T00:00:00.000000000","2025-09-01T00:00:00.000000000","2025-10-01T00:00:00.000000000"],"xaxis":"x","y":{"dtype":"f8","bdata":"ykPWAb0qzUAEAXIAw8PTQMnKioGHWdRAlCd9Jms\u
002f1kCQHMYoL8jWQOCGYrhQp9lASyFQFnl62kCywAKiydzZQM7LbuChM9tAIVMFO9Xp3EAmP7dTdTHgQFtSckGvbN9A3Sy\u002f2kGJ4UC44t4x4bTgQK0HB10JDOFAJKRGcyjD4kA4pDYMecPgQHnpnPd0XeVA7TaG2c3Q4kBOu7hD\u002fUjlQG8aUOCuxeVA6HBmFuCh6EAe9M8n9yHoQGbuv73gAelAeHFDi6Ot6EDiP8bFLEbqQMKOHsv\u002fROpASWKuGocg7EBEDtZ3QgHqQAxoQrfZLvBApQdSexRh7UDaQy2pA43uQDoqEOJKifBAvAWHt6qG8UCrSQvIvPHwQEA806OX3fFAv3Q96l9z8kBDOu6HYL\u002fxQFA8z505DPRAyw0royon9UBZdFTqUqPzQNK0uLPTifZAZOS6q22Y9kDXGtqDMFf6QNCnylGNDPhAwFeP+gE6+UCHXvS9fDj9QE3lXAOMqfpAUH5qEtHL\u002f0AexlyLVI3\u002fQArJnGACAwBBqXuFnx9zAUGc7PargnkAQdX6oPDuhgJBV\u002fr2VMnyA0GcW7q+WVUFQdxiANNELwZB9Sbr9EpcCkGk\u002fkgZVzsNQUCVVmrVwBFBAdwPKTEfD0E="},"yaxis":"y","type":"scatter"},{"hovertemplate":"Series=cogs_returns\u003cbr\u003eMonth=%{x}\u003cbr\u003eUSD=%{y}\u003cextra\u003e\u003c\u002fextra\u003e","legendgroup":"cogs_returns","line":{"color":"#FFA15A","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines+markers","name":"cogs_returns","orientation":"v","showlegend":true,"x":["2020-10-01T00:00:00.000000000","2020-11-01T00:00:00.000000000","2020-12-01T00:00:00.000000000","2021-01-01T00:00:00.000000000","2021-02-01T00:00:00.000000000","2021-03-01T00:00:00.000000000","2021-04-01T00:00:00.000000000","2021-05-01T00:00:00.000000000","2021-06-01T00:00:00.000000000","2021-07-01T00:00:00.000000000","2021-08-01T00:00:00.000000000","2021-09-01T00:00:00.000000000","2021-10-01T00:00:00.000000000","2021-11-01T00:00:00.000000000","2021-12-01T00:00:00.000000000","2022-01-01T00:00:00.000000000","2022-02-01T00:00:00.000000000","2022-03-01T00:00:00.000000000","2022-04-01T00:00:00.000000000","2022-05-01T00:00:00.000000000","2022-06-01T00:00:00.000000000","2022-07-01T00:00:00.000000000","2022-08-01T00:00:00.000000000","2022-09-01T00:00:00.000000000","2022-10-01T00:00:00.000000000","2022-11-01T00:00:00.000000000","2022-12-01T00:00:00.000000000","2023-01-01T00:00:00.000000000","2023-02-01T00:00:00.000000000","2023-03-01T00:00:00.000000000","2023-04-01T00:00:00.000000000","2023-05
-01T00:00:00.000000000","2023-06-01T00:00:00.000000000","2023-07-01T00:00:00.000000000","2023-08-01T00:00:00.000000000","2023-09-01T00:00:00.000000000","2023-10-01T00:00:00.000000000","2023-11-01T00:00:00.000000000","2023-12-01T00:00:00.000000000","2024-01-01T00:00:00.000000000","2024-02-01T00:00:00.000000000","2024-03-01T00:00:00.000000000","2024-04-01T00:00:00.000000000","2024-05-01T00:00:00.000000000","2024-06-01T00:00:00.000000000","2024-07-01T00:00:00.000000000","2024-08-01T00:00:00.000000000","2024-09-01T00:00:00.000000000","2024-10-01T00:00:00.000000000","2024-11-01T00:00:00.000000000","2024-12-01T00:00:00.000000000","2025-01-01T00:00:00.000000000","2025-02-01T00:00:00.000000000","2025-03-01T00:00:00.000000000","2025-04-01T00:00:00.000000000","2025-05-01T00:00:00.000000000","2025-06-01T00:00:00.000000000","2025-07-01T00:00:00.000000000","2025-08-01T00:00:00.000000000","2025-09-01T00:00:00.000000000","2025-10-01T00:00:00.000000000"],"xaxis":"x","y":{"dtype":"f8","bdata":"5llAv6lekEDYpgICVLagQF30ZXGNrKVAxtVAHG+Co0DtGphlVaikQItLX7qJP55A\u002ftCJ69RQokBlm4WlkrClQEdgHya83aNA2qKZxBDTpkCfSNUQLX+rQKeMssWxUK5AzVh9XT8TqkB\u002fM+WP+UKmQLbfu2K3lapAggEwJ3aEr0CyKixSjIipQEdEsS1vrrBAVwOrZE+xrkCQU+dckKqxQJYjCp6oyrNAZ41uzqlNskDuhHtTsHizQDJyPfqTb7ZAq\u002fuFcYfYsUC4RL10g6OvQADKAZgyXLRAJ5RFhEE3s0BxTtnnYyq0QHx6p4LenLpALgrNP5K5ukBxxG7cUau6QK3eXj9TlblA1DR1RpKev0Dw7NnXg3C7QKzu07ez8MBAHfO9oFbpwEB4sgolKIK3QPyWU2PmocBAPHz6secMwEBEjBnx95O9QERaoCKUSMNAFFoDz+88wEAT9smzagLDQFYch7MYBMNA\u002fP1OLYJ1xUBdkolMg5LGQH09REJz\u002fsNApJjfyhOsyEDggyd0R2fIQAN3XBTGcMtAemc8R3mDzEALVE80oFvKQN2FRGGQAMtAMdwH9KwbzkAuQX\u002fQfaLRQKno+5cCg9BA3h5Fyz320kDw5Mv2OdzUQLN+Ko0HQdtAYKpbU9353EA="},"yaxis":"y","type":"scatter"},{"hovertemplate":"Series=gross_profit\u003cbr\u003eMonth=%{x}\u003cbr\u003eUSD=%{y}\u003cextra\u003e\u003c\u002fextra\u003e","legendgroup":"gross_profit","line":{"color":"#19d3f3","dash":"solid"},"marker":{"symbol":"circle"},"mode":"lines+markers","name":"gross_profit","orientation":"v","show
legend":true,"x":["2020-10-01T00:00:00.000000000","2020-11-01T00:00:00.000000000","2020-12-01T00:00:00.000000000","2021-01-01T00:00:00.000000000","2021-02-01T00:00:00.000000000","2021-03-01T00:00:00.000000000","2021-04-01T00:00:00.000000000","2021-05-01T00:00:00.000000000","2021-06-01T00:00:00.000000000","2021-07-01T00:00:00.000000000","2021-08-01T00:00:00.000000000","2021-09-01T00:00:00.000000000","2021-10-01T00:00:00.000000000","2021-11-01T00:00:00.000000000","2021-12-01T00:00:00.000000000","2022-01-01T00:00:00.000000000","2022-02-01T00:00:00.000000000","2022-03-01T00:00:00.000000000","2022-04-01T00:00:00.000000000","2022-05-01T00:00:00.000000000","2022-06-01T00:00:00.000000000","2022-07-01T00:00:00.000000000","2022-08-01T00:00:00.000000000","2022-09-01T00:00:00.000000000","2022-10-01T00:00:00.000000000","2022-11-01T00:00:00.000000000","2022-12-01T00:00:00.000000000","2023-01-01T00:00:00.000000000","2023-02-01T00:00:00.000000000","2023-03-01T00:00:00.000000000","2023-04-01T00:00:00.000000000","2023-05-01T00:00:00.000000000","2023-06-01T00:00:00.000000000","2023-07-01T00:00:00.000000000","2023-08-01T00:00:00.000000000","2023-09-01T00:00:00.000000000","2023-10-01T00:00:00.000000000","2023-11-01T00:00:00.000000000","2023-12-01T00:00:00.000000000","2024-01-01T00:00:00.000000000","2024-02-01T00:00:00.000000000","2024-03-01T00:00:00.000000000","2024-04-01T00:00:00.000000000","2024-05-01T00:00:00.000000000","2024-06-01T00:00:00.000000000","2024-07-01T00:00:00.000000000","2024-08-01T00:00:00.000000000","2024-09-01T00:00:00.000000000","2024-10-01T00:00:00.000000000","2024-11-01T00:00:00.000000000","2024-12-01T00:00:00.000000000","2025-01-01T00:00:00.000000000","2025-02-01T00:00:00.000000000","2025-03-01T00:00:00.000000000","2025-04-01T00:00:00.000000000","2025-05-01T00:00:00.000000000","2025-06-01T00:00:00.000000000","2025-07-01T00:00:00.000000000","2025-08-01T00:00:00.000000000","2025-09-01T00:00:00.000000000","2025-10-01T00:00:00.000000000"],"xaxis":"x","y":{"dtype":"f8"
,"bdata":"c8epX\u002fTTzUDXU0M4FubSQMMH7gN+idJAJfM2GYKf1UDO5ifBhZXVQNltfHMKctlA1Rid+D7t2kC78rhi8KjYQDsgpOKfB9tAOuEj\u002fbHg20DIKr2V9AbfQDr\u002fyUqvgN1AsOjAOgcS4UCA8EmGy43gQE62EyTR\u002fOBA9FtF+rXl4UBz3tUpLZTgQBA\u002flgOvgORASHk\u002fwfdC4kAkucUP\u002fsTkQASq1FJjT+VAxWCoYR9F6ECA\u002fKo+\u002fjTnQOA\u002fIqNc8edA\u002fc0DJgMA6EBqFIuvZSLqQH6q9HMwh+lAPNBJAnOC60CKm9dNm2fpQDifLxuvuu9AoRnCSHI87EC0FLDGCcvtQLHD9Y\u002fUHvBAkU2Rd3HS8EAkdfy9cFvwQJaBmqBzOfFApUnToqHE8UDksL\u002fcyF3xQJA2\u002ftzyWfNAvAHC95Jc9EBrJJDnDGnzQHabH\u002fC3qfVA3sZK0yvb9UDro9EQt\u002fj5QLu7OZylIvdAAMgpGTdA+EDF0ymss5f8QGPnt6PU4PlAxFQlr+5B\u002f0BeqhIcy8r+QMxcafig7\u002f5AzsoLM17wAEGkqHVDAw8AQYmt3XHD7AFBbOMSCkhrA0GKrOSpQakEQTnayVePpAVB58yOAp3lCUH6nfkbAakMQati0C2xcxFBS\u002flRtQexDUE="},"yaxis":"y","type":"scatter"}],                        {"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.88
88888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermap":[{"type":"scattermap","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"s
cattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.22222222
22222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"}}},"xaxis":{"anchor":"y","domain":[0.0,1.0],"title":{"text":"Month"}},"yaxis":{"anchor":"x","domain":[0.0,1.0],"title":{"text":"USD"},"tickprefix":"$","separatethousands":true},"legend":{"title":{"text":"Series"},"tracegroupgap":0},"title":{"text":"Monthly P&L"},"hovermode":"x unified"},                        {"responsive": true}                    )                };            </script>        </div>]]></content><author><name>Mario H. 
Gonzalez-Sauri</name></author><summary type="html"><![CDATA[1) Installing Python in RStudio]]></summary></entry><entry><title type="html">Fractions, Decimals, Percentages.</title><link href="https://mario1084.github.io/blog/2024/09/23/frac_dec_perc_v00.Rmd" rel="alternate" type="text/html" title="Fractions, Decimals, Percentages." /><published>2024-09-23T00:00:00+00:00</published><updated>2024-09-23T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2024/09/23/frac_dec_perc_v00</id><content type="html" xml:base="https://mario1084.github.io/blog/2024/09/23/frac_dec_perc_v00.Rmd"><![CDATA[```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(htmltools)
```

\usepackage{amsmath}
\usepackage{longdiv}


# Introduction: Fractions, Decimals, Percentages.

## Objectives

-   To learn.
-   To have fun.
-   To find real-life applications.

Deal with the why and the how.



```{r, echo=FALSE, out.width='100%', eval=FALSE}

 # \'cake.png\',
 #    \'burger.png\',
 #    \'icecream.png\',
 #    \'donut.png\',
 #    \'fries.png\',
 #    \'soda.png\'
 #    

# HTML content for the animation 
html <- HTML('<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Split Pizza Animation</title>
  <style>
    .food-container {
      text-align: center;
      margin-top: 50px;
    }
    .food-item {
      width: 200px;
      height: 200px;
      position: relative;
      margin: 0 auto;
      transition: transform 2s ease-in-out;
    }
    .food-split {
      position: absolute;
      top: 0;
      left: 0;
      width: 100%;
      height: 100%;
      display: none;
      clip-path: circle(50% at 50% 50%);
    }
    .percent-label {
      position: absolute;
      top: 50%;
      left: 50%;
      transform: translate(-50%, -50%);
      font-size: 18px;
      color: white;
      font-weight: bold;
    }
  </style>
</head>
<body>

<div class="food-container">
  <div id="food-item" class="food-item">
    <img id="food-image" src="pizza.png" alt="Food Image" style="width:100%; height: 100%;">
    <div id="slice1" class="food-split">
      <img src="pizza.png" style="width: 100%; height: 100%; clip-path: polygon(50% 50%, 100% 0, 100% 100%);">
      <div class="percent-label">50%</div>
    </div>
    <div id="slice2" class="food-split">
      <img src="pizza.png" style="width: 100%; height: 100%; clip-path: polygon(50% 50%, 100% 0, 50% 100%);">
      <div class="percent-label">25%</div>
    </div>
    <div id="slice3" class="food-split">
      <img src="pizza.png" style="width: 100%; height: 100%; clip-path: polygon(50% 50%, 100% 0, 75% 100%);">
      <div class="percent-label">33%</div>
    </div>
    <div id="slice4" class="food-split">
      <img src="pizza.png" style="width: 100%; height: 100%; clip-path: polygon(50% 50%, 100% 0, 25% 100%);">
      <div class="percent-label">25%</div>
    </div>
    <div id="slice5" class="food-split">
      <img src="pizza.png" style="width: 100%; height: 100%; clip-path: polygon(50% 50%, 100% 0, 20% 100%);">
      <div class="percent-label">20%</div>
    </div>
  </div>
</div>

<script>
  const foodItem = document.getElementById("food-item"),
    foodImage = document.getElementById("food-image");

  const foodImages = ["pizza.png", "tomato.png"];

  let currentFoodIndex = 0;

  // Function to randomize the circular pizza slice splits
  function splitFood() {
    // Randomize between 2 and 5 parts (only five slice elements exist in the markup)
    const split = Math.floor(Math.random() * 4) + 2;
    const sliceAngle = 360 / split;

    // Hide all previous slices
    const slices = document.getElementsByClassName("food-split");
    for (let i = 0; i < slices.length; i++) {
      slices[i].style.display = "none";
    }

    // Show the new slices and set clip paths for circular pizza slices
    for (let i = 0; i < split; i++) {
      const slice = document.getElementById("slice" + (i + 1));
      if (slice) {
        const angleStart = i * sliceAngle,
          angleEnd = (i + 1) * sliceAngle;

        // Show slice
        slice.style.display = "block";
        slice.querySelector(".percent-label").innerText = `${(100 / split).toFixed(0)}%`;

        // Set clip path for each slice (circular wedge)
        slice.querySelector("img").style.clipPath = `polygon(50% 50%, ${50 + 50 * Math.cos(angleStart * Math.PI / 180)}% ${50 + 50 * Math.sin(angleStart * Math.PI / 180)}%, ${50 + 50 * Math.cos(angleEnd * Math.PI / 180)}% ${50 + 50 * Math.sin(angleEnd * Math.PI / 180)}%)`;
      }
    }

    // Rotate the entire image randomly to animate
    foodItem.style.transform = `rotate(${Math.random() * 360}deg)`;
  }

  function changeFood() {
    currentFoodIndex = (currentFoodIndex + 1) % foodImages.length;
    foodImage.src = foodImages[currentFoodIndex];

    // Call splitFood after changing the image
    splitFood();
  }

  setInterval(changeFood, 5000);

  window.onload = function() {
    splitFood();
  };
</script>

</body>
</html>')

# Save the HTML content to a file
htmltools::save_html(html, "animation.html")

# Include the saved HTML file in the RMarkdown slide
knitr::include_url("animation.html")


```



# 1. Fractions.


## Natural Numbers

Let's recall:

- **Natural Numbers**: Whole numbers starting from $1$ (e.g., $1$, $2$, $3$, $4$).

We use them in everyday life to **count** all sorts of things...

-   Can you name examples?



## When Do We Use Fractions? - Sharing

- **Sharing**: When we need to divide something equally among people.

- Example: Splitting a pizza among friends:

<div style="text-align: center;">
  <img src="pizza.png" alt="Pizza" style="width: 350px; height: 350px; object-fit: cover;">
</div>
  
  
  
## When Do We Use Fractions? - Splitting

- **Splitting**: When we need to break something into smaller parts.

- Example: Cutting an apple into *quarters*:

<div style="text-align: center;">
  <img src="apple.jpeg" alt="Apple" style="width: 350px; height: 350px; object-fit: cover;">
</div>



  
  
## When Do We Use Fractions?  - Buy/Sell   

- **Buying and Selling**: When we need to measure quantities that are not whole numbers.

- Example: Buying **half** a kilogram of gummy bear candy.

<div style="text-align: center;">
  <img src="bear2.png" alt="Gummy bears" style="width: 350px; height: 350px; object-fit: cover;">
</div>

  

## Fractions: 1.1 Formal Definition

- **Fractions**: A way to represent parts of a whole.
- Separated by a diagonal line (more common):
  -   Thirds $1/3$.
  -   Halves $1/2$.
  -   Quarters $1/4$.
  
  
- Separated by a horizontal line (more formal):
  $$\frac{1}{2}, \frac{2}{3}, \frac{1}{8}, \frac{3}{5}$$
  
## Fractions: 1.2 Formal Definition

-   *Numerator*: The top number represents the number of parts you have or take.


-   *Denominator*: The bottom number represents the total number of equal parts the whole is divided into.


-   *Example 1*: You buy a medium pizza ($8$ slices), and you take $3$ slices.

How will you represent your share using fraction notation?


## Fractions Examples

$$\frac{3}{8}$$

-   *Example 2*: You have a chocolate bar with 12 pieces and you would like to *share it evenly* among 3 friends.


## Fractions Examples

-   *Denominator*: The bottom number represents the total number of equal parts the whole is divided into.
$12$

-   *Numerator*: The top number represents the number of parts you have or take.

We have $3$ friends, and we want the $12$ split evenly, so...

$$12 \div 3 = 3\,\overline{)\,12\,} = 4$$

So in fraction notation:


$$\frac{4}{12}$$
Is this correct? Is it well expressed (fraction notation)?



## Fractions Examples

Picture your chocolate bar...

```{r, results='asis', warning=FALSE}

# Set up plot dimensions
plot(1, type="n", xlim=c(0, 4), ylim=c(0, 3), xlab="", ylab="", xaxt='n', yaxt='n', bty='n')

# Draw horizontal lines
for(i in 0:3) {
  lines(c(0, 4), c(i, i))
}

# Draw vertical lines
for(i in 0:4) {
  lines(c(i, i), c(0, 3))
}


```

## Fractions Examples

So... the correct answer is the simplified fraction:

$$\frac{4}{12}=\frac{1}{3}$$
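
The simplification can be written out step by step. As a worked sketch: divide the numerator and the denominator by the largest number that divides both, which is $4$ here.

$$\frac{4}{12} = \frac{4 \div 4}{12 \div 4} = \frac{1}{3}$$

Each of the $3$ friends gets one third of the bar, which is $4$ of the $12$ pieces.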



## Fractions Activity (Game)

Open the QR code:

<div style="text-align: center;">
  <img src="fractions_game.svg" alt="QR code for the fractions game" style="width: 350px; height: 350px; object-fit: cover;">
</div>


## Fractions Activity (Game)


[Fractions Activity (Game)](https://phet.colorado.edu/sims/html/build-a-fraction/latest/build-a-fraction_en.html)



# 2. Decimals.



## Decimal Numbers

Let's recall:

- **Decimal Numbers**: Numbers that use a dot (called a decimal point) to show parts of a whole.

(e.g., $1.2$, $2.3$, $3.6$, $4.9$).

The digits to the right of the decimal point are worth less than one unit:

-   $0.1$ is less than $1$

-   $0.9$ is less than $1$ too

-   $2.00...$ is the same as $2$ (trailing zeros do not change the value).

We use them in everyday life to **measure** all sorts of things...


## Counting vs. Measuring

Typically:


-   *Counting* tells us how many whole items there are.



-   *Measuring* involves determining the size, amount, or degree of something using a standard unit (like meters, centimeters, inches, etc.)


-   We use *decimal* numbers to represent things that we measure.


## When Do We Use Decimals? Length

-   How tall are you? 
 $$1.73 \, \text{m}$$

-   How far is Mérida from Cancún?

 $$309.2 \, \text{km}$$

<div style="text-align: center;">
  <img src="lenght.jpeg" alt="Length" style="width: 300px; height: 300px; object-fit: cover;">
</div>


## When Do We Use Decimals? Temperature


-   The freezing point of human blood is actually around:

$$-1.66 \, \text{°C}$$

-   The summer temperature in Mérida is around:

$$37.5 \, \text{°C}$$

<div style="text-align: center;">
  <img src="temperature.jpeg" alt="temperature" style="width: 280px; height: 220px; object-fit: cover;">
</div>



## When Do We Use Decimals? Money


- What is the price of Minecraft: Java & Bedrock Edition Microsoft PC?
  $$\$569.99\, \text{MXN}$$

- How much does a Kinder Sorpresa cost?
  $$\$18.50\, \text{MXN}$$

<div style="text-align: center;">
  <img src="minecraft.png" alt="minecraft" style="width: 280px; height: 220px; object-fit: cover;">
</div>


## Decimals: 2.1 Formal Definition

-   Decimals are another way to write **rational numbers**; we use them for things that we measure rather than count.

- **Division**: To convert a fraction to a decimal, divide the numerator by the denominator.



- Example 1: Convert $\frac{1}{2}$ to a decimal.

$$\frac{1}{2} = 1 \div 2 = 0.5$$

## Decimals: 2.1 Formal Definition

<div style="text-align: center;">
  <img src="division.png" alt="division" style="width: 600px; height: 350px; object-fit: cover;">
</div>







## Decimals Examples

How do we do it?


1.    Add a zero to the right of the dividend (numerator).

$$2\,\overline{)\,1\,} \rightarrow 2\,\overline{)\,10\,}$$



## Decimals Examples


2.    Divide: Find the largest integer (quotient) that, when multiplied by the divisor (denominator), is less than or equal to the current dividend.

$$2\,\overline{)\,10\,}$$

- What about $2 \times 3 = 6$?

- What about $2 \times 4 = 8$?

- What about $2 \times 5 = 10$?

- What about $2 \times 6 = 12$?

2.1   Place the decimal point in the quotient. We added one zero to the dividend, so the digit $5$ sits one place to the right of the decimal point:
$$0.5$$

## Decimals Examples

3.    Multiply the divisor by this integer and write the result below the current dividend.
$$2 \times 5 = 10$$

4.    Subtract: Subtract this result from the current dividend to find the remainder.
$$10-10=0$$

**If the remainder is $0$ STOP**


## Decimals Examples

**If the remainder is not $0$ Carry on**


5.    Bring Down: Bring down the next digit (or add a zero if there are no more digits) to the right of the remainder.
6.    Iterate: Repeat steps 2-5 until the remainder is zero or you have enough decimal places.
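
The steps above can be sketched in R. This is only an illustration: the function name `fraction_to_decimal` and the `max_digits` cut-off are assumptions made for this example, not part of the original slides.

```r
# Hypothetical helper (not from the slides): long division of a fraction,
# producing one decimal digit per loop iteration.
fraction_to_decimal <- function(numerator, denominator, max_digits = 6) {
  whole <- numerator %/% denominator      # whole-number part of the quotient
  remainder <- numerator %% denominator   # what is still left to divide
  digits <- character(0)

  # Stop when the remainder is 0 or we have enough decimal places
  while (remainder != 0 && length(digits) < max_digits) {
    remainder <- remainder * 10                      # add (bring down) a zero
    digits <- c(digits, remainder %/% denominator)   # largest digit that fits
    remainder <- remainder %% denominator            # multiply and subtract
  }

  if (length(digits) == 0) return(as.character(whole))
  paste0(whole, ".", paste(digits, collapse = ""))
}

fraction_to_decimal(1, 2)   # "0.5"
fraction_to_decimal(3, 8)   # "0.375"
fraction_to_decimal(2, 3)   # "0.666666" (cut off after max_digits places)
```

The result is returned as text so that the digits appear exactly as they were produced, one per step, without any rounding.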

## Decimals Examples

-   Context: You want to bake cupcakes. According to the recipe, you need $3/8 \, \text{kg}$ of flour for 12 pieces.
-   You buy $1 \, \text{kg}$, and you need a scale to measure the flour.

How much flour is $3/8 \, \text{kg}$ (three-eighths) in grams (three decimal places of a kg)?


<div style="text-align: center;">
  <img src="cupcake.jpeg" alt="cupcake" style="width: 300px; height: 300px; object-fit: cover;">
</div>




## Decimals Examples

<div style="text-align: center;">
  <img src="3_8_division.png" alt="3_8_division" style="width: 600px; height: 600px; object-fit: cover;">
</div>
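
Written out with the same steps as before, the long division of $3$ by $8$ runs as follows (a worked sketch; the layout in the picture may differ):

$$30 - 8 \times 3 = 6, \qquad 60 - 8 \times 7 = 4, \qquad 40 - 8 \times 5 = 0$$

$$\frac{3}{8} = 0.375 \, \text{kg} = 375 \, \text{g}$$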



## Decimals Examples

-   Context: You are going on a trip, and you want to leave enough water for your dogs.
-   Looking at the container, you see that approximately $2/3$ of the water is gone...
-   You are going for 7 days and you need at least $1 \, \text{L}$ of water per day.
-   If the container holds $20 \, \text{L}$, do you have enough water?

<div style="text-align: center;">
  <img src="dog_dispensador_.jpg" alt="dispensador" style="width: 600px; height: 250px; object-fit: cover;">
</div>


## Decimals Examples
<div style="text-align: center;">
  <img src="dog_division_.jpg" alt="dispensador_div" style="width: 600px; height: 600px; object-fit: cover;">
</div>
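
One way to set up the calculation, assuming that two thirds of the $20 \, \text{L}$ container are gone, so one third is left:

$$20 \times \frac{1}{3} = 20 \div 3 = 6.666\ldots \approx 6.67 \, \text{L}$$

$$7 \, \text{days} \times 1 \, \text{L per day} = 7 \, \text{L} > 6.67 \, \text{L}$$

Under that reading, there is not quite enough water for the whole trip.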


## Decimals: 2.2 Mixed Number

-   A mixed number is a combination of a whole number and a fraction.
-   It’s like saying you have $x$ whole units and a certain remainder...


-   The quotient $6$
-   The divisor $3$
-   The remainder $2$

$$6 \frac{2}{3}$$
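
The quotient and the remainder of a mixed number can be checked in R with integer division (`%/%`) and the remainder operator (`%%`). This is a small sketch; the variable names are made up for the example, and it assumes the values come from dividing $20$ by $3$ as in the water exercise.

```r
total <- 20   # assumed dividend: litres the container holds
parts <- 3    # divisor

whole_units <- total %/% parts   # integer division gives the quotient: 6
leftover    <- total %% parts    # the remainder: 2

# Print the mixed number as "quotient remainder/divisor"
cat(whole_units, " ", leftover, "/", parts, "\n", sep = "")   # prints 6 2/3
```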


# 3. Percentages.



## Percentages.

Let's recall:

- **Definition:** A percentage is a way of expressing a number as a fraction of 100 (base).

- **Symbol:** %

- **Example:** 10% means $10$ out of $100$.

- $100 \%$ represents the whole.

## When Do We Use Percentages? Discounts

-   Picture that your favorite video game is 20% off.
-   The price is 600 MXN.

How much is the discount, and how much will you pay?


## When Do We Use Percentages? Discounts


-   To calculate the discount:

1) Convert the percentage to a decimal:

$$20/100=0.2$$

2) Calculate the discount:

$$0.2 \times 600=120$$

3) How much will you pay?

$$600 - 120 = 480$$
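
The same three steps can be checked in R. This is a minimal sketch; `percent_of` is a hypothetical helper written for this example, not a built-in function.

```r
# Hypothetical helper: take `percent` percent of `amount`
percent_of <- function(percent, amount) {
  (percent / 100) * amount   # step 1 (decimal) and step 2 (multiply) together
}

price    <- 600                     # MXN
discount <- percent_of(20, price)   # 120
to_pay   <- price - discount        # 480
```

The helper works for any percentage, so `percent_of(15, 1000)` gives the 150 MXN that 15% of a 1000 MXN allowance amounts to.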

## Percentages Examples

-   Picture that you want to save money for a trip at the end of the year.
-   You are serious and you are committed to saving 15% of your weekly allowance.

-   Your weekly allowance (pocket money) is 1000 MXN, and you started saving at the beginning of the year.

How much will you have at the end of the year?


## Percentages Examples

-   To calculate the weekly savings:

1) Estimate the decimal:

$$15/100=.15$$

2) Estimate the weekly savings

$$.15 \times 1000=150.00$$

3) How many weeks are in a year?

$$365/7 \approx 52$$



## Percentages Examples


4) What are your savings at the end of the year?

$$150 \times 52=7800$$

## Percentages Activity (Game)

Open the QR:

<div style="text-align: center;">
  <img src="percentages_game.svg" alt="percentages_game" style="width: 350px; height: 350px; object-fit: cover;">
</div>

## Percentages Activity (Game)


[Percentages Activity (Game)](https://www.mathplayground.com/bingo-find-a-percent-of-a-number.html)

## Wrap Up!

-   Fractions to represent parts of a whole (proportions).
-   Decimals to measure units (time, temperature, length).
-   Percentages to represent rational numbers in a friendly way (base of 100).



## Questions?

- Feel free to ask any questions you have about fractions.
- Let's make sure we all understand before moving on.



## Gaby knows fractions very well.


[Dj Gaby](https://www.youtube.com/shorts/EnaSwtjN6uk)



## How does Gaby manage to mix that?

She knows that she always needs to fit an equal number of beats in a bar.

For instance, the most common signature is $4/4$, meaning 4 quarter-note beats in one bar.


## Signature & Beats


[Signature, Beats](https://www.youtube.com/shorts/XMfy63r4igI)


## The END

Thank you!]]></content><author><name>Mario H. Gonzalez Sauri</name></author><summary type="html"><![CDATA[```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) library(htmltools) ```]]></summary></entry><entry><title type="html">An Introduction to R for Network Analysis.</title><link href="https://mario1084.github.io/blog/2023/09/14/intro_net_anlysis_R.html" rel="alternate" type="text/html" title="An Introduction to R for Network Analysis." /><published>2023-09-14T00:00:00+00:00</published><updated>2023-09-14T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2023/09/14/intro_net_anlysis_R</id><content type="html" xml:base="https://mario1084.github.io/blog/2023/09/14/intro_net_anlysis_R.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>This guide is divided into two parts. The first part provides a basic
introduction to the R programming language, while the second part
focuses on practical code snippets for creating network visualizations,
generating statistics and performing analysis using the igraph package.</p>

<h1 id="section-1-an-introduction-to-r">Section 1: An introduction to R.</h1>

<h2 id="getting-started-with-r">Getting Started with R</h2>

<p>To become proficient in R, it’s helpful to think of coding in a way
similar to learning a language. Start with the fundamentals, which are
the basic operators. Operators are symbols that instruct the computer to
perform specific actions. For example, in <code class="language-plaintext highlighter-rouge">1 + 2</code>, the <code class="language-plaintext highlighter-rouge">+</code> operator
performs addition. In <code class="language-plaintext highlighter-rouge">a &lt;- 1:5</code>, the <code class="language-plaintext highlighter-rouge">&lt;-</code> operator assigns values.
Begin by familiarizing yourself with these operators; you don’t need to
memorize them all at once. Focus on the ones you encounter frequently,
and gradually expand your knowledge.</p>

<h3 id="key-concepts">Key Concepts</h3>

<p>As you gain confidence with operators, move on to writing your own code
snippets. To do this effectively, understand the fundamental rules of R:</p>

<ul>
  <li><strong>R is Case Sensitive</strong>: Pay attention to letter case; uppercase and
lowercase letters are treated differently.</li>
  <li><strong>R Executes Code Sequentially</strong>: R processes code from top to bottom,
so the order of your commands matters.</li>
  <li><strong>R Reads Left to Right</strong>: Operators of equal precedence are
evaluated from left to right (exponentiation and assignment group right to
left), so the sequence of operations is crucial.</li>
</ul>
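
<p>The three rules can be tried out directly. The snippet below is a
minimal sketch; the variable names <code class="language-plaintext highlighter-rouge">total</code>, <code class="language-plaintext highlighter-rouge">Total</code> and <code class="language-plaintext highlighter-rouge">x</code> are
made up for illustration and are not used elsewhere in this guide.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Case sensitivity: total and Total are two different objects
total &lt;- 10
Total &lt;- 20
total == Total    # FALSE

# Sequential execution: each line sees the results of the lines above it
x &lt;- 2
x &lt;- x * 3
x                 # 6

# Left to right: 10 - 4 is computed first, then 3 is subtracted
10 - 4 - 3        # 3
10 - (4 - 3)      # 9, parentheses change the order
</code></pre></div></div>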

<p>Learn about essential data structures like vectors, matrices, data
frames, lists, and arrays, and how to manipulate them. For example,
<code class="language-plaintext highlighter-rouge">2 + 1L</code> adds a double to an integer; R makes this valid by
silently coercing the integer to a double. Understanding object classes
and subsetting data within objects is crucial.</p>
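
<p>A short sketch of how object classes decide whether an operation is
valid. The text example <code class="language-plaintext highlighter-rouge">"1"</code> is an assumption added for illustration;
only base R functions are used.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class(2)              # "numeric"
class(1L)             # "integer"
class(2 + 1L)         # "numeric": the integer is coerced to a double

# Adding a number to text is not valid and raises an error:
# 2 + "1"

# Converting the text to a number first makes the operation valid
2 + as.numeric("1")   # 3
</code></pre></div></div>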

<h3 id="learning-by-doing">Learning by Doing</h3>

<p>The best way to learn is by doing. When you have a clear, step-by-step
plan in mind, there’s likely a way to code it in R.</p>

<h2 id="mastering-r-base">Mastering R-base</h2>

<p>Familiarize yourself with the R-base, which comprises core functions
that don’t require additional packages. This forms the foundation of
your knowledge. New users often make the mistake of installing numerous
unnecessary packages. Keep it simple and use additional packages like
<code class="language-plaintext highlighter-rouge">igraph</code> for specialized tasks, such as network analysis, only when
necessary.</p>

<h3 id="operators-reference">Operators Reference</h3>

<p>Operators are symbols that provide instructions to the computer for
specific tasks, such as variable manipulation, statement evaluation,
function creation, and general operations. You can find a comprehensive
list of operators in the <a href="https://cran.r-project.org/doc/manuals/r-devel/R-lang.html#Operators">R
documentation</a>.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Operator</th>
      <th style="text-align: left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">-</td>
      <td style="text-align: left">Minus, can be unary or binary</td>
    </tr>
    <tr>
      <td style="text-align: left">+</td>
      <td style="text-align: left">Plus, can be unary or binary</td>
    </tr>
    <tr>
      <td style="text-align: left">!</td>
      <td style="text-align: left">Logical not (Negation)</td>
    </tr>
    <tr>
      <td style="text-align: left">~</td>
      <td style="text-align: left">Tilde (used in model formulae)</td>
    </tr>
    <tr>
      <td style="text-align: left">?</td>
      <td style="text-align: left">Help</td>
    </tr>
    <tr>
      <td style="text-align: left">:</td>
      <td style="text-align: left">Sequence, binary (in model formulae: interaction)</td>
    </tr>
    <tr>
      <td style="text-align: left">*</td>
      <td style="text-align: left">Multiplication, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">/</td>
      <td style="text-align: left">Division, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">^</td>
      <td style="text-align: left">Exponentiation, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">%xyz%</td>
      <td style="text-align: left">Special binary operators, xyz can be replaced by any valid name</td>
    </tr>
    <tr>
      <td style="text-align: left">%%</td>
      <td style="text-align: left">Modulus, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">%/%</td>
      <td style="text-align: left">Integer divide, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">%*%</td>
      <td style="text-align: left">Matrix product, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">%o%</td>
      <td style="text-align: left">Outer product, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">%x%</td>
      <td style="text-align: left">Kronecker product, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">%in%</td>
      <td style="text-align: left">Matching operator, binary (in model formulae: nesting)</td>
    </tr>
    <tr>
      <td style="text-align: left">&lt;</td>
      <td style="text-align: left">Less than, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">&gt;</td>
      <td style="text-align: left">Greater than, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">==</td>
      <td style="text-align: left">Equal to, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">&gt;=</td>
      <td style="text-align: left">Greater than or equal to, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">&lt;=</td>
      <td style="text-align: left">Less than or equal to, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">&amp;</td>
      <td style="text-align: left">And, binary, vectorized</td>
    </tr>
    <tr>
      <td style="text-align: left">&amp;&amp;</td>
      <td style="text-align: left">And, binary, not vectorized</td>
    </tr>
    <tr>
      <td style="text-align: left">|</td>
      <td style="text-align: left">Or, binary, vectorized</td>
    </tr>
    <tr>
      <td style="text-align: left">||</td>
      <td style="text-align: left">Or, binary, not vectorized</td>
    </tr>
    <tr>
      <td style="text-align: left">&lt;-</td>
      <td style="text-align: left">Left assignment, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">-&gt;</td>
      <td style="text-align: left">Right assignment, binary</td>
    </tr>
    <tr>
      <td style="text-align: left">$</td>
      <td style="text-align: left">List subset, binary</td>
    </tr>
  </tbody>
</table>

<p>R Operators</p>

<p>Understanding the basic syntax and notation in R is crucial to
effectively navigate and utilize the language. In this example, we’ll
explore the importance of this understanding while demonstrating the use
of operators for algebraic and logical operations.</p>

<p>We can use various operators in R to perform fundamental algebraic and
logical operations. It’s essential to be familiar with the basic syntax
of the language, including elements like semicolons and parentheses.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">5</span><span class="w">       </span><span class="c1"># Addition</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 6
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">6</span><span class="w">       </span><span class="c1"># Multiplication</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 30
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">4</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">-1</span><span class="w">      </span><span class="c1"># Exponentiation</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 0.25
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">3</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">       </span><span class="c1"># Division</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 1.5
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">4</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">6</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">  </span><span class="c1"># Complex arithmetic expression</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] -0.2222222
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Integer division</span><span class="w">
</span><span class="m">6</span><span class="w"> </span><span class="o">%/%</span><span class="w"> </span><span class="m">4</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 1
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Returns the remainder</span><span class="w">
</span><span class="m">6</span><span class="w"> </span><span class="o">%%</span><span class="w"> </span><span class="m">4</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 2
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">4</span><span class="o">:</span><span class="m">7</span><span class="w">         </span><span class="c1"># Create a sequence of numbers</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 4 5 6 7
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Logical Statements</span><span class="w">

</span><span class="p">(</span><span class="kc">TRUE</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="kc">FALSE</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">F</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nb">T</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">4</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] FALSE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">7</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">2</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] FALSE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="m">6</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="p">(</span><span class="m">7</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] FALSE FALSE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE TRUE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Using |</span><span class="w">
</span><span class="n">vector1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">vector2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Element-wise logical OR using |</span><span class="w">
</span><span class="n">result1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">vector1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">vector2</span><span class="w">

</span><span class="c1"># Using %in%</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE TRUE
</code></pre></div></div>

<h2 id="understanding-objects-in-r-programming">Understanding Objects in R Programming</h2>

<p>R is a powerful programming language known for its object-based
approach. In practical terms, this means that every piece of data in R
is treated as an object with specific properties. These properties
include its class, its underlying type (<code class="language-plaintext highlighter-rouge">typeof</code>), its length, its
dimensions, and its structure. To effectively work with R, it’s crucial to
grasp the fundamental concept of objects and how they function within
the language. Let’s dive into some of the foundational aspects of
objects in R.</p>
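
<p>As a minimal sketch of these properties, the snippet below inspects a
small numeric vector with base R functions. The object name <code class="language-plaintext highlighter-rouge">x</code> is only
an illustration and is not used elsewhere in this post.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical example object (not used elsewhere in this post)
x &lt;- c(1.5, 2, 3)

class(x)    # "numeric"
typeof(x)   # "double"
length(x)   # 3
dim(x)      # NULL, because a plain vector has no dimensions
str(x)      # compact display of the structure
</code></pre></div></div>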

<h3 id="vectors-the-building-blocks">Vectors: The Building Blocks</h3>

<p>In R, vectors are the fundamental building blocks of data. They are
often referred to as atomic vectors because all of their elements share a
single data type. Here are some key points about vectors:</p>

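<p>Empty Vectors: You can initialize an empty object with <code class="language-plaintext highlighter-rouge">NULL</code>, which has length zero and can be extended later with <code class="language-plaintext highlighter-rouge">c()</code>.</p>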

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">z</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span></code></pre></div></div>

<p>Numeric Vectors: Numeric vectors store numerical values, and you can
assign names to elements within a vector.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'a'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.3</span><span class="p">,</span><span class="w"> </span><span class="s1">'b'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Integers: R also supports integer vectors, which can be created using
the L suffix.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">b</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="m">9L</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Logical Vectors: Logical vectors store TRUE and FALSE values.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Character Vectors: Character vectors hold text or character data.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">e</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>Factor: Factors are used to represent categorical variables. They have
levels and can be ordered or unordered.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'male'</span><span class="p">,</span><span class="w"> </span><span class="s2">"female"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
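
<p>To illustrate the ordered case, here is a short sketch. The variable
name <code class="language-plaintext highlighter-rouge">size</code> and its levels are made up for this example.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical ordered factor: the levels argument fixes the ordering
size &lt;- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

levels(size)        # "low" "medium" "high"
is.ordered(size)    # TRUE
size[1] &lt; size[2]   # TRUE, because "low" comes before "high"
as.integer(size)    # 1 3 2, the underlying integer codes
</code></pre></div></div>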

<p>Operations on Vectors: You can perform various operations on vectors,
such as addition, multiplication, and more. Vectors are the foundation
for more complex data structures in R, and understanding their
properties and manipulation is essential.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="m">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">a</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##    a    b 
##  9.2 16.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">-1</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##         a         b 
## 0.4347826 0.2500000
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">a1</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'a'</span><span class="p">,</span><span class="w"> </span><span class="s1">'b'</span><span class="p">)</span><span class="w">

</span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">a1</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##    a    b 
##  6.3 11.0
</code></pre></div></div>
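
<p>Arithmetic on vectors is element-wise, and when the operands differ in
length R recycles the shorter one. A small sketch with made-up vectors
<code class="language-plaintext highlighter-rouge">u</code> and <code class="language-plaintext highlighter-rouge">w</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical vectors used only for this illustration
u &lt;- c(1, 2, 3, 4)
w &lt;- c(10, 20)

u + w        # 11 22 13 24: w is recycled to 10 20 10 20
u * 2        # 2 4 6 8: the scalar 2 is applied to every element
sum(u * w)   # 1*10 + 2*20 + 3*10 + 4*20 = 160
</code></pre></div></div>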

<p>Stay tuned as we explore more about matrices, data frames, functions,
and lists in the world of R programming. These concepts will further
enhance your ability to work with data effectively in R.</p>

<h3 id="matrices">Matrices</h3>

<p>Matrices are two-dimensional arrays whose elements all share a single
atomic type. They are essential for statistical analyses and
algorithms that involve mathematical manipulations. One crucial consequence
is that a matrix has exactly one type: if values of different types are
combined, R coerces all of them to the most flexible type.</p>

<p>Let’s create a matrix with integer elements and examine its class and
its type using the <code class="language-plaintext highlighter-rouge">typeof</code> function:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Matrices</span><span class="w">
</span><span class="c1"># Basic</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="nf">class</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "matrix" "array"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">typeof</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "integer"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Add a fourth column with character elements (matrix() fills column-wise by default)</span><span class="w">
</span><span class="n">Z</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="nb">LETTERS</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">]),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">
</span><span class="nf">class</span><span class="p">(</span><span class="n">Z</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "matrix" "array"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">typeof</span><span class="p">(</span><span class="n">Z</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "character"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Arithmetic operators do not work on a character matrix.</span><span class="w">
</span><span class="n">Z</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Z</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Error in Z + Z: non-numeric argument to binary operator
</code></pre></div></div>
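
<p>The character type of <code class="language-plaintext highlighter-rouge">Z</code> follows from R’s implicit coercion: when
values of different types are combined into one atomic vector, all of them
are converted to the most flexible type, in the order logical, integer,
double, character. The sketch below shows that hierarchy; the name
<code class="language-plaintext highlighter-rouge">Znum</code> is hypothetical and only marks one possible way to keep arithmetic
working.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Coercion hierarchy: logical, then integer, then double, then character
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"

# Hypothetical alternative: keep only the numeric part of the data
Znum &lt;- matrix(1:9, ncol = 3)
Znum + Znum            # works, because the matrix is integer
</code></pre></div></div>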

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Change the elements of the matrix</span><span class="w">
</span><span class="n">A</span><span class="p">[</span><span class="n">upper.tri</span><span class="p">(</span><span class="n">A</span><span class="p">)]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">A</span><span class="p">[</span><span class="n">lower.tri</span><span class="p">(</span><span class="n">A</span><span class="p">)]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">diag</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">3</span><span class="w">
</span><span class="n">A</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    3    1    1
## [2,]    2    3    1
## [3,]    2    2    3
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Combining vectors by column</span><span class="w">
</span><span class="n">B</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="m">2</span><span class="o">:</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="o">:</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">B</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    2    1    0
## [2,]    1    2    1
## [3,]    0    3    2
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Combining vectors by row</span><span class="w">
</span><span class="n">C</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="o">:</span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="o">:</span><span class="m">9</span><span class="p">)</span><span class="w">
</span><span class="n">C</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
</code></pre></div></div>

<h3 id="basic-linear-algebra">Basic Linear Algebra</h3>

<p>We can perform basic linear algebra operations on matrices:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Basic Linear Algebra</span><span class="w">
</span><span class="c1"># Vector Operations</span><span class="w">
</span><span class="m">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">a</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##    a    b 
##  9.2 16.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">-1</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##         a         b 
## 0.4347826 0.2500000
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">a1</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##    a    b 
##  6.3 11.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Matrix Transpose</span><span class="w">
</span><span class="n">t</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    3    2    2
## [2,]    1    3    2
## [3,]    1    1    3
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Matrix Addition and Subtraction</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">C</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    4    0   -2
## [2,]   -1    0   -4
## [3,]   -5   -3   -4
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Matrix Product</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">B</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    7    8    3
## [2,]    7   11    5
## [3,]    6   15    8
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Cross Product</span><span class="w">
</span><span class="n">t</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">crossprod</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Inverse</span><span class="w">
</span><span class="n">C</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">39L</span><span class="p">,</span><span class="w"> </span><span class="m">8L</span><span class="p">,</span><span class="w"> </span><span class="m">71L</span><span class="p">,</span><span class="w"> </span><span class="m">72L</span><span class="p">,</span><span class="w"> </span><span class="m">54L</span><span class="p">,</span><span class="w"> </span><span class="m">42L</span><span class="p">,</span><span class="w"> </span><span class="m">76L</span><span class="p">,</span><span class="w"> </span><span class="m">77L</span><span class="p">,</span><span class="w"> </span><span class="m">15L</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">D</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">C</span><span class="p">)</span><span class="w">
</span><span class="n">C</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">D</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1]          [,2] [,3]
## [1,]    1  0.000000e+00    0
## [2,]    0  1.000000e+00    0
## [3,]    0 -4.440892e-16    1
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">round</span><span class="p">(</span><span class="n">C</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">D</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Eigenvalues and Eigenvectors</span><span class="w">
</span><span class="n">eigen</span><span class="p">(</span><span class="n">C</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## eigen() decomposition
## $values
## [1] 147.741703 -34.981904  -4.759798
## 
## $vectors
##            [,1]       [,2]       [,3]
## [1,] -0.6935491 -0.1529705  0.3017373
## [2,] -0.4916846 -0.6387581 -0.7727752
## [3,] -0.5265319  0.7540478  0.5583664
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">e</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">C</span><span class="p">)</span><span class="o">$</span><span class="n">vectors</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">C</span><span class="p">)</span><span class="o">$</span><span class="n">values</span><span class="w">
</span><span class="c1"># Exact comparison with == fails because of floating-point rounding</span><span class="w">
</span><span class="n">C</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">e</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">v</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">e</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##       [,1]
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># all.equal() compares within a numerical tolerance</span><span class="w">
</span><span class="n">all.equal</span><span class="p">(</span><span class="n">as.vector</span><span class="p">(</span><span class="n">C</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">e</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="n">v</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">e</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE
</code></pre></div></div>
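
<p>The same relationship can be checked for all eigenpairs at once, and the
matrix can be rebuilt from its eigendecomposition. This is only a sketch:
it copies the matrix under the hypothetical name <code class="language-plaintext highlighter-rouge">M</code> so the block is
self-contained, and the names <code class="language-plaintext highlighter-rouge">eig</code>, <code class="language-plaintext highlighter-rouge">V</code>, and <code class="language-plaintext highlighter-rouge">Lambda</code> are likewise
illustrative.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Same values as C above, stored under a new (hypothetical) name
M &lt;- matrix(c(39, 8, 71, 72, 54, 42, 76, 77, 15), ncol = 3)

eig &lt;- eigen(M)                 # compute the decomposition once
V &lt;- eig$vectors                # eigenvectors as columns
Lambda &lt;- diag(eig$values)      # eigenvalues on the diagonal

# M %*% V should equal V %*% Lambda, up to rounding error
all.equal(M %*% V, V %*% Lambda)

# Rebuild M as V %*% Lambda %*% solve(V), again up to rounding error
all.equal(M, V %*% Lambda %*% solve(V))
</code></pre></div></div>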

<h3 id="data-frames">Data Frames</h3>

<p>Data frames have a more heterogeneous structure than matrices.
While all elements of a vector or matrix share a single <code class="language-plaintext highlighter-rouge">typeof</code>, a data
frame can hold a different data type in each column.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Basic Data Frame</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
  </span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">LETTERS</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">],</span><span class="w">
  </span><span class="n">B</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nb">letters</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">]),</span><span class="w">
  </span><span class="n">C</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1L</span><span class="o">:</span><span class="m">5L</span><span class="p">,</span><span class="w">
  </span><span class="n">D</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2.4</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1"># Structure</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 'data.frame':    5 obs. of  4 variables:
##  $ A: chr  "A" "B" "C" "D" ...
##  $ B: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
##  $ C: int  1 2 3 4 5
##  $ D: num  2.4 2 3 9 7
</code></pre></div></div>
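
<p>To confirm that each column keeps its own type, one option is to apply
<code class="language-plaintext highlighter-rouge">class</code> to every column. The sketch below rebuilds a smaller data frame
under the hypothetical name <code class="language-plaintext highlighter-rouge">df2</code> and assumes R 4.0 or later, where strings
are not converted to factors by default.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical data frame mirroring the column types of df
df2 &lt;- data.frame(
  A = LETTERS[1:3],
  B = factor(letters[1:3]),
  C = 1:3,
  D = c(2.4, 2, 3)
)

sapply(df2, class)   # one class per column: character, factor, integer, numeric
typeof(df2)          # "list": a data frame is a list of equal-length columns
</code></pre></div></div>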

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Basic statistics</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##       A             B           C           D       
##  Length:5           a:1   Min.   :1   Min.   :2.00  
##  Class :character   b:1   1st Qu.:2   1st Qu.:2.40  
##  Mode  :character   c:1   Median :3   Median :3.00  
##                     d:1   Mean   :3   Mean   :4.68  
##                     e:1   3rd Qu.:4   3rd Qu.:7.00  
##                           Max.   :5   Max.   :9.00
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print head</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print tail</span><span class="w">
</span><span class="n">tail</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0
## 4 D d 4 9.0
## 5 E e 5 7.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Bipartite Projection</span><span class="w">
</span><span class="n">bp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">papers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="s1">'C'</span><span class="p">),</span><span class="w"> </span><span class="n">authors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span><span class="n">bp</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   papers authors
## 1      A       1
## 2      A       2
## 3      A       3
## 4      B       2
## 5      B       3
## 6      C       4
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Incidence Matrix</span><span class="w">
</span><span class="n">py</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">table</span><span class="p">(</span><span class="n">bp</span><span class="p">)</span><span class="w">
</span><span class="n">py</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##       authors
## papers 1 2 3 4
##      A 1 1 1 0
##      B 0 1 1 0
##      C 0 0 0 1
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Adjacency Matrix</span><span class="w">
</span><span class="n">py</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">crossprod</span><span class="p">(</span><span class="n">py</span><span class="p">)</span><span class="w">
</span><span class="n">py</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##        authors
## authors 1 2 3 4
##       1 1 1 1 0
##       2 1 2 2 0
##       3 1 2 2 0
##       4 0 0 0 1
</code></pre></div></div>

<h3 id="functions">Functions</h3>

<p>Functions are invaluable when we need to perform the same operation
multiple times. Let’s create a simple function that calculates the degree
of each vertex as the row sums of an adjacency matrix (the out-degree when
the graph is directed):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">5</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">sample</span><span class="p">(</span><span class="m">0</span><span class="o">:</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nb">LETTERS</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">]</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nb">LETTERS</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">]</span><span class="w">

</span><span class="c1"># Remove loops</span><span class="w">
</span><span class="n">diag</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">

</span><span class="n">s.degree</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
  </span><span class="n">colnames</span><span class="p">(</span><span class="n">d</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s1">'Degree'</span><span class="w">
  </span><span class="n">d</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">s.degree</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   Degree
## A      3
## B      2
## C      0
## D      1
## E      1
</code></pre></div></div>
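<p>As a complementary sketch, the same degrees can be obtained with the base
R functions <code class="language-plaintext highlighter-rouge">rowSums</code> and <code class="language-plaintext highlighter-rouge">colSums</code>. The matrix <code class="language-plaintext highlighter-rouge">M</code> below is a hypothetical
three-vertex directed graph, fixed by hand so that the result does not depend
on random sampling.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical directed graph (not part of the original example)
M &lt;- matrix(c(0, 1, 1,
              0, 0, 1,
              0, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c('A', 'B', 'C'), c('A', 'B', 'C')))

# Out-degree: ties sent by each vertex (row sums)
out.degree &lt;- rowSums(M)

# In-degree: ties received by each vertex (column sums)
in.degree &lt;- colSums(M)

cbind(Out = out.degree, In = in.degree)
</code></pre></div></div>

<p>For an undirected graph the adjacency matrix is symmetric, so both vectors
coincide.</p>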

<h3 id="lists">Lists</h3>

<p>Lists are the most flexible data structure in R, allowing us to store
multiple objects of different classes. A data frame is a list of
equal-length vectors. The <code class="language-plaintext highlighter-rouge">dput</code> function prints a text representation
of any object that can be pasted back into R to recreate it, which helps in
creating reproducible examples.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print the structure of the data frame</span><span class="w">
</span><span class="n">dput</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## structure(list(A = c("A", "B", "C", "D", "E"), B = structure(1:5, levels = c("a", 
## "b", "c", "d", "e"), class = "factor"), C = 1:5, D = c(2.4, 2, 
## 3, 9, 7)), class = "data.frame", row.names = c(NA, -5L))
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Store a factor, a matrix, a data frame, and a list together</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="n">l</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
  </span><span class="n">factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">f</span><span class="p">,</span><span class="w">
  </span><span class="n">matrix</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A</span><span class="p">,</span><span class="w">
  </span><span class="n">data.frame</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w">
  </span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## List of 4
##  $ factor    : Factor w/ 2 levels "male","female": NA NA
##  $ matrix    : num [1:5, 1:5] 0 1 0 0 1 1 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "A" "B" "C" "D" ...
##   .. ..$ : chr [1:5] "A" "B" "C" "D" ...
##  $ data.frame:'data.frame':  5 obs. of  4 variables:
##   ..$ A: chr [1:5] "A" "B" "C" "D" ...
##   ..$ B: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
##   ..$ C: int [1:5] 1 2 3 4 5
##   ..$ D: num [1:5] 2.4 2 3 9 7
##  $ list      :List of 1
##   ..$ : int [1:3] 1 2 3
</code></pre></div></div>

<h3 id="indexing-objects">Indexing Objects</h3>

<p>Objects in R can be subset by name (nominal indexing), by position
(numeric indexing), or with a logical vector (logical indexing). Data frames
and lists also support the <code class="language-plaintext highlighter-rouge">$</code> operator, which extracts a single element by name.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Nominal ####</span><span class="w">
</span><span class="c1">## Vectors ##</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">a</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "a" "b"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span><span class="p">[</span><span class="s1">'a'</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   a 
## 2.3
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span><span class="p">[</span><span class="s1">'b'</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## b 
## 4
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Matrices ##</span><span class="w">
</span><span class="n">A</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">)]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   D E
## A 1 1
## C 0 0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Data Frames ##</span><span class="w">
</span><span class="n">df</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">)]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   A   D
## 1 A 2.4
## 2 B 2.0
## 3 C 3.0
## 4 D 9.0
## 5 E 7.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Lists ##</span><span class="w">
</span><span class="n">l</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="s1">'factor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'matrix'</span><span class="p">)]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## $factor
## [1] &lt;NA&gt; &lt;NA&gt;
## Levels: male female
## 
## $matrix
##   A B C D E
## A 0 1 0 1 1
## B 1 0 0 1 0
## C 0 0 0 0 0
## D 0 0 1 0 0
## E 1 0 0 0 0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Numeric ####</span><span class="w">

</span><span class="c1">## Vectors ##</span><span class="w">
</span><span class="n">a</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   a 
## 2.3
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Matrices ##</span><span class="w">
</span><span class="n">A</span><span class="p">[</span><span class="m">2</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## B C 
## 1 0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Data Frames ##</span><span class="w">
</span><span class="n">df</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   B C
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
## 5 e 5
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Lists ##</span><span class="w">
</span><span class="c1"># Extract the data frame by its numeric position</span><span class="w">
</span><span class="n">l</span><span class="p">[</span><span class="n">which</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">l</span><span class="p">,</span><span class="w"> </span><span class="n">is.data.frame</span><span class="p">))]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## $data.frame
##   A B C   D
## 1 A a 1 2.4
## 2 B b 2 2.0
## 3 C c 3 3.0
## 4 D d 4 9.0
## 5 E e 5 7.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Logical ####</span><span class="w">

</span><span class="c1">## Vectors ##</span><span class="w">
</span><span class="n">a</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   a 
## 2.3
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Matrices ##</span><span class="w">
</span><span class="n">A</span><span class="p">[</span><span class="n">upper.tri</span><span class="p">(</span><span class="n">A</span><span class="p">)]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##  [1] 1 0 0 1 1 0 1 0 0 0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Data Frames ##</span><span class="w">
</span><span class="n">df</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 2.4 2.0 3.0 9.0 7.0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Lists ##</span><span class="w">
</span><span class="c1"># Extract column C of the data frame stored in the list</span><span class="w">
</span><span class="n">l</span><span class="o">$</span><span class="n">data.frame</span><span class="o">$</span><span class="n">C</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 1 2 3 4 5
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">l</span><span class="o">$</span><span class="n">matrix</span><span class="p">[,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## A B C D E 
## 1 1 0 0 0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Combinations ###</span><span class="w">
</span><span class="n">A</span><span class="p">[</span><span class="m">2</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">)]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   C D
## B 0 1
## C 0 0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Special Operator ####</span><span class="w">
</span><span class="c1"># Subset a Column</span><span class="w">
</span><span class="n">df</span><span class="o">$</span><span class="n">A</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "A" "B" "C" "D" "E"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Subset the data frame in a list and print column C</span><span class="w">
</span><span class="n">l</span><span class="o">$</span><span class="n">data.frame</span><span class="o">$</span><span class="n">C</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 1 2 3 4 5
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">l</span><span class="o">$</span><span class="n">matrix</span><span class="p">[,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## A B C D E 
## 1 1 0 0 0
</code></pre></div></div>
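<p>One detail worth illustrating is the difference between single and double
brackets on lists. The sketch below uses a hypothetical list <code class="language-plaintext highlighter-rouge">x</code>, defined here
only for illustration: single brackets return a shorter list, whereas double
brackets and <code class="language-plaintext highlighter-rouge">$</code> return the element itself.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical list (not part of the original example)
x &lt;- list(numbers = 1:3, label = 'net')

# Single brackets keep the list wrapper
sub.list &lt;- x['numbers']
class(sub.list)

# Double brackets return the stored object
element &lt;- x[['numbers']]
class(element)

# The $ operator is shorthand for double brackets with a name
identical(element, x$numbers)
</code></pre></div></div>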

<h3 id="control-flow">Control Flow</h3>

<p>Control-flow constructs such as <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">else</code>, together with the vectorized <code class="language-plaintext highlighter-rouge">ifelse</code> function, let us
make decisions and execute code conditionally in R.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Basic Structure if|else ####</span><span class="w">
</span><span class="n">condition</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">7</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">condition</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="s1">'Yes, it is...'</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Yes, it is..."
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Check that no vertex is isolated: a necessary, but not sufficient, condition for a connected graph</span><span class="w">
</span><span class="n">is.connected</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">s.degree</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">all</span><span class="p">(</span><span class="n">d</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="s1">'Graph is connected'</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="s1">'Graph is disconnected'</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">py</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">table</span><span class="p">(</span><span class="n">bp</span><span class="p">)</span><span class="w">
</span><span class="n">py</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##       authors
## papers 1 2 3 4
##      A 1 1 1 0
##      B 0 1 1 0
##      C 0 0 0 1
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">is.connected</span><span class="p">(</span><span class="n">py</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Graph is connected"
</code></pre></div></div>
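<p>A positive degree for every vertex does not guarantee that the graph is
connected. As a minimal sketch, a breadth-first search can test connectedness
directly. The function name <code class="language-plaintext highlighter-rouge">is.connected.bfs</code> is hypothetical, and the graph is
assumed to be undirected and stored as a square adjacency matrix; the matrix
<code class="language-plaintext highlighter-rouge">co</code> retypes the co-authorship adjacency matrix shown earlier.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: breadth-first search from the first vertex
is.connected.bfs &lt;- function(am) {
  am &lt;- (am + t(am)) &gt; 0          # ignore edge direction and edge weights
  n &lt;- ncol(am)
  visited &lt;- rep(FALSE, n)
  visited[1] &lt;- TRUE
  queue &lt;- 1
  while (length(queue) &gt; 0) {
    v &lt;- queue[1]                 # take the next vertex from the queue
    queue &lt;- queue[-1]
    nb &lt;- which(am[v, ] &amp; !visited)  # unvisited neighbours of v
    visited[nb] &lt;- TRUE
    queue &lt;- c(queue, nb)
  }
  all(visited)                    # TRUE only if every vertex was reached
}

# Co-authorship adjacency matrix from the earlier example
co &lt;- matrix(c(1, 1, 1, 0,
               1, 2, 2, 0,
               1, 2, 2, 0,
               0, 0, 0, 1), nrow = 4, byrow = TRUE)

is.connected.bfs(co)
</code></pre></div></div>

<p>The call returns <code class="language-plaintext highlighter-rouge">FALSE</code>: author 4 shares no paper with the other authors,
even though every row sum of the matrix is positive.</p>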

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Evaluate multiple conditions (and, or)</span><span class="w">
</span><span class="n">is.sim_multi</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">mult.ed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">any</span><span class="p">(</span><span class="n">am</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
  </span><span class="n">loops</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">diag</span><span class="p">(</span><span class="n">am</span><span class="p">))</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">0</span><span class="w">
  </span><span class="n">type</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'The graph has:'</span><span class="p">,</span><span class="w"> </span><span class="s1">'multi edges'</span><span class="p">,</span><span class="w"> </span><span class="s1">'and loops.'</span><span class="p">)</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">mult.ed</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">loops</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">am</span><span class="p">[</span><span class="n">am</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
    </span><span class="n">diag</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">type</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">mult.ed</span><span class="p">,</span><span class="w"> </span><span class="n">loops</span><span class="p">)],</span><span class="w"> </span><span class="n">collapse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">" "</span><span class="p">))</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="s2">"The graph is simple"</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">is.sim_multi</span><span class="p">(</span><span class="n">B</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "The graph has: multi edges and loops."
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">is.sim_multi</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "The graph is simple"
</code></pre></div></div>
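<p>Note that <code class="language-plaintext highlighter-rouge">is.sim_multi</code> modifies <code class="language-plaintext highlighter-rouge">am</code> only inside the function, so the
simplified matrix is discarded after the message is printed. A minimal sketch
of a companion function that returns the simplified matrix could look as
follows; the name <code class="language-plaintext highlighter-rouge">simplify.am</code> and the matrix <code class="language-plaintext highlighter-rouge">G</code> are hypothetical.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: return a simple graph from a multigraph
simplify.am &lt;- function(am) {
  am[am &gt; 1] &lt;- 1   # collapse multi-edges into single edges
  diag(am) &lt;- 0        # remove loops
  am                   # return the modified copy
}

# Hypothetical multigraph with one loop and one double edge
G &lt;- matrix(c(1, 2,
              2, 0), nrow = 2, byrow = TRUE)

simplify.am(G)
</code></pre></div></div>

<p>The original matrix <code class="language-plaintext highlighter-rouge">G</code> is left unchanged, because R passes a copy of the
argument to the function.</p>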

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Count the number of edges or vertices</span><span class="w">
</span><span class="n">no.ver.edges</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">v</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w">
  </span><span class="n">e</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">am</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">e</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s1">'Edges:'</span><span class="p">,</span><span class="w"> </span><span class="n">e</span><span class="p">))</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">e</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">v</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s1">'Vertices:'</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">))</span><span class="w">
  </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s1">'Vertices and Edges:'</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">))</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">no.ver.edges</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Edges: 7"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">no.ver.edges</span><span class="p">(</span><span class="n">B</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Edges: 7"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#### ifelse function ####</span><span class="w">
</span><span class="c1"># ifelse() is vectorized: the test is evaluated element by element</span><span class="w">
</span><span class="c1"># It returns a result with the same length as the test.</span><span class="w">

</span><span class="n">ifelse</span><span class="p">(</span><span class="m">4</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="s2">"YES"</span><span class="p">,</span><span class="w"> </span><span class="s2">"NO"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "NO"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ifelse</span><span class="p">(</span><span class="m">7</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="s2">"YES"</span><span class="p">,</span><span class="w"> </span><span class="s2">"NO"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "YES"
</code></pre></div></div>
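<p>The two calls above use a single comparison, which hides the vectorized
behaviour. As a short sketch, the hypothetical vector <code class="language-plaintext highlighter-rouge">deg</code> below reuses the
degrees printed in the Functions section and labels every vertex in one call.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical named vector holding the degrees shown earlier
deg &lt;- c(A = 3, B = 2, C = 0, D = 1, E = 1)

# One comparison per element; the result has the same length as deg
status &lt;- ifelse(deg &gt; 0, 'connected', 'isolate')
status
</code></pre></div></div>

<p>Only vertex C is labelled as an isolate. A plain <code class="language-plaintext highlighter-rouge">if</code> statement, by
contrast, expects a single logical value as its condition.</p>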

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Nested ifelse</span><span class="w">
</span><span class="n">is.sym</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">ifelse</span><span class="p">(</span><span class="n">ncol</span><span class="p">(</span><span class="n">am</span><span class="p">)</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">am</span><span class="p">),</span><span class="w">
    </span><span class="s1">'Not symmetric'</span><span class="p">,</span><span class="w">
    </span><span class="c1"># A matrix is symmetric when it equals its own transpose</span><span class="w">
    </span><span class="n">ifelse</span><span class="p">(</span><span class="nf">all</span><span class="p">(</span><span class="n">am</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">am</span><span class="p">)),</span><span class="w"> </span><span class="s1">'Symmetric'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Squared'</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">A</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   A B C D E
## A 0 1 0 1 1
## B 1 0 0 1 0
## C 0 0 0 0 0
## D 0 0 1 0 0
## E 1 0 0 0 0
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">is.sym</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Squared"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">B</span><span class="p">[</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">B</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##      [,1] [,2] [,3]
## [1,]    2    1    0
## [2,]    1    2    1
## [3,]    0    1    2
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">is.sym</span><span class="p">(</span><span class="n">B</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Symmetric"
</code></pre></div></div>
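
<p>Base R also ships <code class="language-plaintext highlighter-rouge">isSymmetric()</code>, which can serve as a cross-check for a hand-written test. The sketch below uses two small matrices, <code class="language-plaintext highlighter-rouge">S</code> and <code class="language-plaintext highlighter-rouge">N</code>, that are invented for illustration and carry no dimnames:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative matrices (assumptions, not the A and B used above)
S &lt;- matrix(c(2, 1, 0,
              1, 2, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)
N &lt;- S
N[1, 3] &lt;- 5          # break the symmetry on purpose

isSymmetric(S)        # TRUE
isSymmetric(N)        # FALSE
all(S == t(S))        # TRUE: the same idea written by hand
</code></pre></div></div>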

<h3 id="loops">Loops</h3>

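<p>Loops handle repetitive tasks, but use them judiciously: in R, a loop
that grows an object on every iteration is usually slower than a
vectorized alternative. Here, we cover <code class="language-plaintext highlighter-rouge">while</code> and <code class="language-plaintext highlighter-rouge">for</code>
loops:</p>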

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### while loop ####</span><span class="w">
</span><span class="n">fibo</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">digi</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fibo</span><span class="p">)</span><span class="w">

</span><span class="c1"># Extend the Fibonacci sequence; the loop stops once digi (the length read at the start of an iteration) reaches 4</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="n">digi</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">digi</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fibo</span><span class="p">)</span><span class="w">
  </span><span class="n">fibo</span><span class="p">[</span><span class="n">digi</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">fibo</span><span class="p">[</span><span class="n">digi</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">fibo</span><span class="p">[</span><span class="n">digi</span><span class="p">])</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Fibonacci Seq:"</span><span class="p">,</span><span class="w"> </span><span class="n">fibo</span><span class="p">[</span><span class="n">digi</span><span class="p">]))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Fibonacci Seq: 2"
## [1] "Fibonacci Seq: 3"
## [1] "Fibonacci Seq: 5"
</code></pre></div></div>
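
<p>When the number of terms is known in advance, the same sequence can be built with a <code class="language-plaintext highlighter-rouge">for</code> loop over a preallocated vector, which avoids growing the vector one element at a time. This is only a sketch; the names <code class="language-plaintext highlighter-rouge">n</code> and <code class="language-plaintext highlighter-rouge">fib</code> are assumptions introduced here:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n &lt;- 10                 # number of terms wanted (illustrative choice)
fib &lt;- numeric(n)       # preallocate the full vector
fib[1:2] &lt;- c(1, 2)     # same starting values as in the while loop

for (k in 3:n) {
  # Each term is the sum of the two previous terms
  fib[k] &lt;- fib[k - 1] + fib[k - 2]
}
fib
</code></pre></div></div>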

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### for loop ####</span><span class="w">
</span><span class="c1"># Get adjacent vertices (neighborhood)</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">nams</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">row.names</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">

</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">nams</span><span class="p">[</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "B" "D" "E"
## [1] "A" "D"
## character(0)
## [1] "C"
## [1] "A"
</code></pre></div></div>
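
<p>Printing inside the loop discards the neighborhoods. A possible variant, sketched below, stores them in a preallocated list and iterates with <code class="language-plaintext highlighter-rouge">seq_len()</code>, which is also safe for a matrix with zero rows. The block rebuilds <code class="language-plaintext highlighter-rouge">A</code> from the values printed earlier so that it runs on its own; the name <code class="language-plaintext highlighter-rouge">neighbors</code> is an assumption:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Adjacency matrix A, re-entered from the printed output above
A &lt;- matrix(c(0, 1, 0, 1, 1,
              1, 0, 0, 1, 0,
              0, 0, 0, 0, 0,
              0, 0, 1, 0, 0,
              1, 0, 0, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# One list slot per vertex, named after the rows
neighbors &lt;- vector("list", nrow(A))
names(neighbors) &lt;- rownames(A)

for (i in seq_len(nrow(A))) {
  # Column names whose entry in row i is positive
  neighbors[[i]] &lt;- colnames(A)[A[i, ] &gt; 0]
}
neighbors
</code></pre></div></div>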

<h3 id="apply-family-of-functions">Apply Family of Functions</h3>

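<p>The <code class="language-plaintext highlighter-rouge">apply</code> function in R takes an array (such as a matrix) as its first
argument and applies a function over one of its margins: <code class="language-plaintext highlighter-rouge">1</code> for rows,
<code class="language-plaintext highlighter-rouge">2</code> for columns. Let’s explore some examples:</p>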

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example of summing all the columns of a matrix</span><span class="w">
</span><span class="n">ma</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">25</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">

</span><span class="c1"># Using a loop</span><span class="w">
</span><span class="n">col.cum</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="s1">'numeric'</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">c</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">ncol</span><span class="p">(</span><span class="n">ma</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">col.cum</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">col.cum</span><span class="p">,</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">ma</span><span class="p">[,</span><span class="w"> </span><span class="n">c</span><span class="p">]))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># Using the apply function</span><span class="w">
</span><span class="n">apply</span><span class="p">(</span><span class="n">ma</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">col.cum</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE TRUE TRUE TRUE TRUE
</code></pre></div></div>

<p>In this example, we create a matrix <code class="language-plaintext highlighter-rouge">ma</code> and calculate the sum of each
column using both a loop and the <code class="language-plaintext highlighter-rouge">apply</code> function. The <code class="language-plaintext highlighter-rouge">apply</code> version
is more concise and avoids growing a vector inside the loop.</p>
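
<p>The second argument of <code class="language-plaintext highlighter-rouge">apply</code> selects the margin. A minimal sketch on a small, deterministic matrix <code class="language-plaintext highlighter-rouge">m</code> (invented here so the results can be checked by hand) may make the difference easier to see:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative 2 x 3 matrix; its columns are (1, 2), (3, 4), (5, 6)
m &lt;- matrix(1:6, nrow = 2)

apply(m, 1, sum)     # margin 1 = rows:    9 12
apply(m, 2, sum)     # margin 2 = columns: 3 7 11

# A function that returns two values per column yields a 2 x 3 matrix
apply(m, 2, range)
</code></pre></div></div>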

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example of summing each row of a matrix</span><span class="w">

</span><span class="c1"># Using a loop</span><span class="w">
</span><span class="n">row.cum</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="s1">'numeric'</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">r</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">ma</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">row.cum</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">row.cum</span><span class="p">,</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">ma</span><span class="p">[</span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="p">]))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># Using the apply function</span><span class="w">
</span><span class="n">apply</span><span class="p">(</span><span class="n">ma</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">row.cum</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE TRUE TRUE TRUE TRUE
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Using rowSums (for simple operations, the built-in vectorized function is preferable)</span><span class="w">
</span><span class="n">apply</span><span class="p">(</span><span class="n">ma</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">row.cum</span><span class="w"> </span><span class="o">&amp;</span><span class="w"> </span><span class="n">rowSums</span><span class="p">(</span><span class="n">ma</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">row.cum</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] TRUE TRUE TRUE TRUE TRUE
</code></pre></div></div>

<p>In this section, we sum each row of a matrix, first using a loop and
then using the <code class="language-plaintext highlighter-rouge">apply</code> function. We also show that the built-in
<code class="language-plaintext highlighter-rouge">rowSums</code> function gives the same result; for simple operations like
this one, it is the more efficient choice.</p>
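
<p>Row sums can also be written as a genuine linear algebra operation: multiplying the matrix by a vector of ones. The sketch below is self-contained; the seed and the names <code class="language-plaintext highlighter-rouge">m</code>, <code class="language-plaintext highlighter-rouge">ones</code> and <code class="language-plaintext highlighter-rouge">by_product</code> are assumptions chosen for illustration:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)                        # arbitrary seed so the sketch is repeatable
m &lt;- matrix(sample(1:100, 25), nrow = 5)

ones &lt;- rep(1, ncol(m))            # one entry per column of m
by_product &lt;- drop(m %*% ones)     # matrix-vector product, dropped to a plain vector

# Both comparisons should return TRUE
all.equal(by_product, rowSums(m))
all.equal(by_product, as.numeric(apply(m, 1, sum)))
</code></pre></div></div>

<p>Here <code class="language-plaintext highlighter-rouge">all.equal</code> is used instead of <code class="language-plaintext highlighter-rouge">==</code> because it tolerates the integer versus double difference and small floating-point error.</p>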

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example: Count how many times each letter [A-J] appears in each column</span><span class="w">

</span><span class="n">ma</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">replicate</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nb">LETTERS</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">],</span><span class="w"> </span><span class="m">5</span><span class="p">)),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">byrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">lvls</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">ma</span><span class="p">))</span><span class="w">
</span><span class="n">apply</span><span class="p">(</span><span class="n">ma</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">table</span><span class="p">(</span><span class="n">factor</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lvls</span><span class="p">))</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   [,1] [,2] [,3] [,4] [,5]
## J    1    1    0    0    0
## F    2    1    0    0    0
## G    1    1    0    0    0
## I    1    0    1    0    2
## D    0    1    0    1    0
## A    0    1    1    0    0
## E    0    0    1    1    0
## C    0    0    2    0    2
## B    0    0    0    1    1
## H    0    0    0    2    0
</code></pre></div></div>

<p>In this example, we create a matrix of random characters and count how
many times each character appears in each column using the apply
function.</p>
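
<p>For this particular task, a cross-tabulation of the values against their column index is one possible alternative to <code class="language-plaintext highlighter-rouge">apply</code>. The sketch uses a small fixed matrix <code class="language-plaintext highlighter-rouge">m</code> (an assumption, so that the counts can be verified by eye) together with the base function <code class="language-plaintext highlighter-rouge">col()</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative 3 x 2 character matrix: column 1 is A B A, column 2 is C B B
m &lt;- matrix(c("A", "B", "A",
              "C", "B", "B"), nrow = 3)

# col(m) gives the column index of every cell; table() cross-tabulates
counts &lt;- table(letter = c(m), column = c(col(m)))
counts

counts["A", "1"]   # "A" appears twice in column 1
</code></pre></div></div>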

<h3 id="lapply-function">lapply Function</h3>

<p>The <code class="language-plaintext highlighter-rouge">lapply</code> function in R takes a list (or vector) as its first argument,
applies a function to each element, and always returns a list of the same
length. Unlike <code class="language-plaintext highlighter-rouge">apply</code>, it needs no margin argument, which often makes
the code easier to read.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### lapply Examples ###</span><span class="w">

</span><span class="c1"># Heterogeneous list example</span><span class="w">
</span><span class="n">lapply</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">),</span><span class="w"> </span><span class="m">20</span><span class="o">:</span><span class="m">30</span><span class="p">),</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [[1]]
## [1] 55
## 
## [[2]]
## [1] 275
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Homogeneous list example</span><span class="w">
</span><span class="n">lapply</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">B</span><span class="p">,</span><span class="w"> </span><span class="n">C</span><span class="p">,</span><span class="w"> </span><span class="n">D</span><span class="p">),</span><span class="w"> </span><span class="n">s.degree</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [[1]]
##   Degree
## A      3
## B      2
## C      0
## D      1
## E      1
## 
## [[2]]
##      Degree
## [1,]      3
## [2,]      4
## [3,]      3
## 
## [[3]]
##      Degree
## [1,]    187
## [2,]    139
## [3,]    128
## 
## [[4]]
##           Degree
## [1,]  0.04585366
## [2,] -0.07556911
## [3,]  0.06121951
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Since a data.frame is a list, we can apply functions directly</span><span class="w">
</span><span class="c1"># List the available sample datasets with data(); see ?data for details</span><span class="w">
</span><span class="c1"># Load data</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">attitude</span><span class="p">)</span><span class="w">

</span><span class="n">lapply</span><span class="p">(</span><span class="n">attitude</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nf">c</span><span class="p">(</span><span class="w">
    </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">median</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## $rating
##      mean       var       min       max    median 
##  64.63333 148.17126  40.00000  85.00000  65.50000 
## 
## $complaints
##     mean      var      min      max   median 
##  66.6000 177.2828  37.0000  90.0000  65.0000 
## 
## $privileges
##      mean       var       min       max    median 
##  53.13333 149.70575  30.00000  83.00000  51.50000 
## 
## $learning
##      mean       var       min       max    median 
##  56.36667 137.75747  34.00000  75.00000  56.50000 
## 
## $raises
##      mean       var       min       max    median 
##  64.63333 108.10230  43.00000  88.00000  63.50000 
## 
## $critical
##     mean      var      min      max   median 
## 74.76667 97.90920 49.00000 92.00000 77.50000 
## 
## $advance
##      mean       var       min       max    median 
##  42.93333 105.85747  25.00000  72.00000  41.00000
</code></pre></div></div>

<p>Here, we showcase various uses of <code class="language-plaintext highlighter-rouge">lapply</code>. It can be applied to both
heterogeneous and homogeneous lists. When working with data frames, you
can directly apply functions to columns, which can lead to more readable
code.</p>
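
<p>The return type is worth checking on a small case. In the sketch below, the list <code class="language-plaintext highlighter-rouge">scores</code> is made up for illustration; the result of <code class="language-plaintext highlighter-rouge">lapply</code> is a list that keeps the input names, and <code class="language-plaintext highlighter-rouge">unlist</code> flattens it when a plain vector is more convenient:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical named list, for illustration only
scores &lt;- list(a = c(1, 2, 3), b = c(10, 20))

res &lt;- lapply(scores, mean)
is.list(res)     # TRUE: lapply always returns a list
names(res)       # "a" "b": names are carried over
unlist(res)      # named numeric vector with a = 2 and b = 15
</code></pre></div></div>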

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Similar to summary(attitude)</span><span class="w">

</span><span class="c1"># sapply tries to simplify the result to a vector or matrix (not always possible)</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">attitude</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nf">c</span><span class="p">(</span><span class="w">
    </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">var</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="w">
    </span><span class="n">median</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="nf">class</span><span class="p">(</span><span class="n">t</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "matrix" "array"
</code></pre></div></div>

<p>In this section, we build a summary similar to
<code class="language-plaintext highlighter-rouge">summary(attitude)</code> with <code class="language-plaintext highlighter-rouge">sapply</code>, which simplifies the per-column
results into a matrix with one column per variable.</p>
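
<p>When the shape of the result matters, <code class="language-plaintext highlighter-rouge">vapply</code> is a stricter relative of <code class="language-plaintext highlighter-rouge">sapply</code>: it takes a template for each result and fails if a result does not match it. The following is a sketch under that assumption, with a reduced set of statistics and a hypothetical object name <code class="language-plaintext highlighter-rouge">stats</code>:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data(attitude)

# FUN.VALUE declares that every column must yield two numeric values
stats &lt;- vapply(
  attitude,
  function(x) c(mean = mean(x), sd = sd(x)),
  FUN.VALUE = numeric(2)
)

dim(stats)       # 2 rows (statistics) by 7 columns (variables)
colnames(stats)  # the variable names of attitude
</code></pre></div></div>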

<h2 id="graphics-in-r">Graphics in R</h2>

<p>R offers a robust environment for creating graphics, making it a
powerful tool for both statistical analysis and data visualization. To
explore its capabilities, you can start by running <code class="language-plaintext highlighter-rouge">demo(graphics)</code> in
the R console. Additionally, you can refer to this cheatsheet for an
overview of the main plotting functions.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># demo(graphics)</span><span class="w">
</span></code></pre></div></div>

<p>Let’s delve into various aspects of graphics in R.</p>

<h3 id="color-management">Color Management</h3>

<p>Managing color spaces in R is essential for creating visually appealing
graphics. Colors can be defined in three different ways: by name, by
hexadecimal values, or by RGB values. You can explore a wide range of
colors and conversions between these systems on this
<a href="http://cloford.com/resources/colours/500col.htm">website</a>.</p>
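
<p>The three notations can be converted into one another with functions from the <code class="language-plaintext highlighter-rouge">grDevices</code> package, which is attached by default. A minimal sketch, using <code class="language-plaintext highlighter-rouge">"darkred"</code> as an arbitrary example color:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Name to RGB: a 3 x 1 matrix with rows red, green, blue (139, 0, 0)
col2rgb("darkred")

# RGB to hexadecimal; maxColorValue sets the scale of the inputs
rgb(139, 0, 0, maxColorValue = 255)    # "#8B0000"

# Hexadecimal back to RGB
col2rgb("#8B0000")

# By default rgb() expects values in [0, 1]; alpha adds transparency
rgb(1, 0.05, 0.02, alpha = 0.2)
</code></pre></div></div>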

<p>For this tutorial, I use a palette of high-contrast colors, defined in the following vector:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">colors37</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"#466791"</span><span class="p">,</span><span class="s2">"#60bf37"</span><span class="p">,</span><span class="s2">"#953ada"</span><span class="p">,</span><span class="s2">"#4fbe6c"</span><span class="p">,</span><span class="s2">"#ce49d3"</span><span class="p">,</span><span class="s2">"#a7b43d"</span><span class="p">,</span><span class="s2">"#5a51dc"</span><span class="p">,</span><span class="s2">"#d49f36"</span><span class="p">,</span><span class="s2">"#552095"</span><span class="p">,</span><span class="s2">"#507f2d"</span><span class="p">,</span><span class="s2">"#db37aa"</span><span class="p">,</span><span class="s2">"#84b67c"</span><span class="p">,</span><span class="s2">"#a06fda"</span><span class="p">,</span><span class="s2">"#df462a"</span><span class="p">,</span><span class="s2">"#5b83db"</span><span class="p">,</span><span class="s2">"#c76c2d"</span><span class="p">,</span><span class="s2">"#4f49a3"</span><span class="p">,</span><span class="s2">"#82702d"</span><span class="p">,</span><span class="s2">"#dd6bbb"</span><span class="p">,</span><span class="s2">"#334c22"</span><span class="p">,</span><span class="s2">"#d83979"</span><span class="p">,</span><span class="s2">"#55baad"</span><span class="p">,</span><span class="s2">"#dc4555"</span><span class="p">,</span><span class="s2">"#62aad3"</span><span class="p">,</span><span class="s2">"#8c3025"</span><span class="p">,</span><span class="s2">"#417d61"</span><span class="p">,</span><span class="s2">"#862977"</span><span class="p">,</span><span class="s2">"#bba672"</span><span class="p">,</span><span class="s2">"#403367"</span><span class="p">,</span><span class="s2">"#da8a6d"</span><span class="p">,</span><span class="s2">"#a79cd4"</span><span 
class="p">,</span><span class="s2">"#71482c"</span><span class="p">,</span><span class="s2">"#c689d0"</span><span class="p">,</span><span class="s2">"#6b2940"</span><span class="p">,</span><span class="s2">"#d593a7"</span><span class="p">,</span><span class="s2">"#895c8b"</span><span class="p">,</span><span class="s2">"#bd5975"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The following snippet of code lets you see the contrast in the palette clearly:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Example of hexadecimal format</span><span class="w">
</span><span class="c1"># print(head(colors37))</span><span class="w">
</span><span class="c1"># Example of RGB </span><span class="w">
</span><span class="c1"># rgb(red=1, green=0.05, blue=0.02, alpha=.2)</span><span class="w">
</span><span class="c1"># Examples colors by name</span><span class="w">
</span><span class="c1"># head(colors())</span><span class="w">

</span><span class="c1"># Plot using a color space by name</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
  </span><span class="c1"># Values on the x-axis</span><span class="w">
  </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w">
  </span><span class="c1"># Values on the y-axis</span><span class="w">
  </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">1</span><span class="p">,</span><span class="w">
  </span><span class="c1"># Shape of the point</span><span class="w">
  </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">19</span><span class="p">,</span><span class="w">
  </span><span class="c1"># Size of the point</span><span class="w">
  </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
  </span><span class="c1"># Color by name</span><span class="w">
  </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dark red"</span><span class="p">,</span><span class="w">
  </span><span class="c1"># Axis labels</span><span class="w">
  </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
  </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">
  </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
  </span><span class="c1"># Limits for x and y</span><span class="w">
  </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">10.05</span><span class="p">),</span><span class="w">
  </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">11</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1"># Try running these loops again but change the pch value to get different shapes</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="c1"># Plot using the RGB color space (arguments are values between [0,1]) </span><span class="w">
  </span><span class="n">points</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="o">:</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">19</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rgb</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">)))</span><span class="w">
  </span><span class="c1"># Plot using a vector of hexadecimal values</span><span class="w">
  </span><span class="n">points</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="p">(</span><span class="m">11</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="m">10</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="o">:</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">19</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">colors37</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1"># Draw a box</span><span class="w">
</span><span class="n">box</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-24-1.png?raw=true" alt="" /><!-- --></p>

<p>In this section, we explore different ways to define and use colors in
your plots, including by name, RGB values, and hexadecimal values.</p>
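
<p>As a minimal sketch of how the three notations relate, the same color can be written by name, as RGB values, or as a hexadecimal string. The example uses only base R functions from <code class="language-plaintext highlighter-rouge">grDevices</code>; the object name <code class="language-plaintext highlighter-rouge">hex_dark_red</code> is purely illustrative.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Look up the red, green and blue components (0-255) of a named color
col2rgb("darkred")

# Build the same color from its RGB components; the result is a hexadecimal string
hex_dark_red &lt;- rgb(139, 0, 0, maxColorValue = 255)
hex_dark_red

# Make any color specification semi-transparent
adjustcolor(hex_dark_red, alpha.f = 0.5)
</code></pre></div></div>

<p>Any of these forms can be passed to the <code class="language-plaintext highlighter-rouge">col</code> argument of <code class="language-plaintext highlighter-rouge">plot()</code> or <code class="language-plaintext highlighter-rouge">points()</code>.</p>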

<h3 id="histograms">Histograms</h3>

<p>Histograms are a fundamental tool for visualizing the distribution
(spread) of data around central values. To plot a single variable, we can
use the <code class="language-plaintext highlighter-rouge">hist(...)</code> function and pass it the vector that
contains the data, for instance, <code class="language-plaintext highlighter-rouge">hist(attitude[, 1L])</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Plotting a simple histogram ####</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="n">attitude</span><span class="p">[,</span><span class="w"> </span><span class="m">1L</span><span class="p">])</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-25-1.png?raw=true" alt="" /><!-- --></p>
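
<p>To see what the histogram is built from, one option is to compute it without drawing it. The sketch below assumes nothing beyond base R and the built-in <code class="language-plaintext highlighter-rouge">attitude</code> data; the object name <code class="language-plaintext highlighter-rouge">h</code> is illustrative.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Compute the histogram without drawing it
h &lt;- hist(attitude[, 1L], plot = FALSE)

h$breaks  # bin boundaries chosen by hist()
h$counts  # number of observations in each bin

# Every observation falls into exactly one bin
sum(h$counts) == nrow(attitude)

# Suggest a different number of bins (hist() treats this as a hint)
hist(attitude[, 1L], breaks = 10, main = "rating", xlab = "")
</code></pre></div></div>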

<p>However, we are typically interested in visualizing a whole set of
variables in a data frame. Hence, I find it more useful to have a snippet
of code that plots a group of variables in a grid (a group of plots).
The R code below sets up a 3x3 plotting layout using the <code class="language-plaintext highlighter-rouge">par()</code>
function. It then creates a histogram for each of the seven variables in
the <code class="language-plaintext highlighter-rouge">attitude</code> data frame and
arranges them within that layout.</p>

<p>Here’s a step-by-step breakdown:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">par(mfrow = c(3, 3))</code>: This sets up a plotting layout with 3 rows and
3 columns, creating a 3x3 grid for plotting.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">var.names &lt;- colnames(attitude)</code>: Retrieves the column names
(variable names) of the <code class="language-plaintext highlighter-rouge">attitude</code> dataframe.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">invisible(lapply(seq_along(var.names), function(x) {...}))</code>: Iterates
over the column indices of <code class="language-plaintext highlighter-rouge">attitude</code> using <code class="language-plaintext highlighter-rouge">lapply()</code> and draws one
histogram per variable.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">hist(...)</code>: Generates a histogram for each variable. The hist()
function takes parameters such as the data to be plotted
(<code class="language-plaintext highlighter-rouge">attitude[, x]</code> for each variable), the main title (<code class="language-plaintext highlighter-rouge">var.names[x]</code>,
which is the variable name), and the color of the bars
(<code class="language-plaintext highlighter-rouge">col = sample(colors37, 1)</code>).</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">invisible()</code> function is used to suppress the return value of
<code class="language-plaintext highlighter-rouge">lapply()</code> (a list of histogram objects), which would otherwise be printed to the console. The histograms themselves are still drawn.</p>
  </li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">### Using lapply and histograms ###</span><span class="w">

</span><span class="c1"># Set up a 3 x 3 layout</span><span class="w">

</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">

</span><span class="c1"># render the histograms </span><span class="w">
</span><span class="n">var.names</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">attitude</span><span class="p">)</span><span class="w">
</span><span class="nf">invisible</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="nf">seq_along</span><span class="p">(</span><span class="n">var.names</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">hist</span><span class="p">(</span><span class="w">
    </span><span class="n">attitude</span><span class="p">[,</span><span class="w"> </span><span class="n">x</span><span class="p">],</span><span class="w">
    </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var.names</span><span class="p">[</span><span class="n">x</span><span class="p">],</span><span class="w">
    </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">colors37</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
    </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">}))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-26-1.png?raw=true" alt="" /><!-- --></p>

<h3 id="box-plots">Box-Plots:</h3>

<p>Box plots, also known as box-and-whisker plots, are a valuable tool for
visualizing the distribution and spread of data. They provide a concise
summary of a dataset’s central tendency, variability, and potential
outliers. Unlike some other types of plots, box plots focus on
displaying the overall distribution of data rather than showing
individual data points.</p>

<p>They are particularly useful because they show the following key
statistics:</p>

<ul>
  <li>The median (the middle value of the dataset).</li>
  <li>The interquartile range (the range between the first quartile or Q1
and the third quartile or Q3), which contains the central 50% of the
data.</li>
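  <li>The minimum and maximum values within a defined range (by default, the most extreme points lying no more than 1.5 times the interquartile range from the box), drawn as the “whiskers”.</li>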
  <li>Outliers, represented as individual points outside the “whiskers” of
the plot.</li>
</ul>
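
<p>The statistics listed above can also be inspected numerically. The following is a small sketch using the base R function <code class="language-plaintext highlighter-rouge">boxplot.stats()</code> on one variable of <code class="language-plaintext highlighter-rouge">attitude</code>; the object names <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">bs</code> are illustrative.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Numbers behind a box plot of a single variable
x &lt;- attitude$rating
bs &lt;- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Observations drawn as individual points beyond the whiskers (may be empty)
bs$out

# Interquartile range, for comparison with the height of the box
IQR(x)
</code></pre></div></div>

<p>Note that the hinges are close to, but not always identical to, the quartiles returned by <code class="language-plaintext highlighter-rouge">quantile()</code>.</p>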

<p>Similar to the previous chunk, I combine <code class="language-plaintext highlighter-rouge">lapply</code> with the
box-plot function (<code class="language-plaintext highlighter-rouge">boxplot(...)</code>) to efficiently display a plot for
each variable in the <code class="language-plaintext highlighter-rouge">attitude</code> dataset.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
  </span><span class="n">var.names</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">attitude</span><span class="p">)</span><span class="w">
  </span><span class="nf">invisible</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="nf">seq_along</span><span class="p">(</span><span class="n">var.names</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">boxplot</span><span class="p">(</span><span class="n">attitude</span><span class="p">[,</span><span class="w"> </span><span class="n">x</span><span class="p">],</span><span class="w">
  </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var.names</span><span class="p">[</span><span class="n">x</span><span class="p">],</span><span class="w">
  </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
  </span><span class="p">}))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-27-1.png?raw=true" alt="" /><!-- --></p>
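
<p>Because <code class="language-plaintext highlighter-rouge">par(mfrow = ...)</code> stays in effect for later plots, it can help to restore the previous layout once the grid is drawn. A minimal sketch, assuming only base R (the name <code class="language-plaintext highlighter-rouge">old_par</code> is illustrative):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># par() returns the settings it replaces, so they can be restored later
old_par &lt;- par(mfrow = c(3, 3))

# Draw one box plot per variable, iterating over the column names
invisible(lapply(names(attitude), function(v) {
  boxplot(attitude[[v]], main = v, xlab = "")
}))

# Go back to the layout that was active before the grid
par(old_par)
</code></pre></div></div>

<p>Since all variables in <code class="language-plaintext highlighter-rouge">attitude</code> are measured on a similar scale, <code class="language-plaintext highlighter-rouge">boxplot(attitude)</code> is another option that draws them side by side in a single panel.</p>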

<h3 id="scatter-plots">Scatter Plots:</h3>

<p>Scatter plots are a basic form of data visualization that help us see
the relationship between two continuous variables. They differ from
other plots because they allow us to examine how two variables interact,
specifically whether there is a linear or nonlinear relationship between
them.</p>

<p>The importance of scatter plots lies in their ability to reveal
patterns, trends, clusters and outliers in the data. By arranging data
points as individual points on a two-dimensional plane, we can visually
identify relationships, associations, or the lack thereof. Scatter plots
are particularly useful for the following reasons.</p>

<ul>
  <li>
    <p>Finding linear and non-linear relationships: Scatter plots help us
determine whether two variables are positively or negatively linearly
related, or whether the relationship is not linear.</p>
  </li>
  <li>
    <p>Identifying outliers: Outliers, data points that deviate significantly
from the overall pattern, are easily detected in scatter plots.</p>
  </li>
  <li>
    <p>Cluster analysis: Groups of data points can indicate distinct
subpopulations or clusters in the data.</p>
  </li>
  <li>
    <p>Visualizing multivariate data: Scatter plot matrices like the one created
with <code class="language-plaintext highlighter-rouge">pairs()</code> below allow us to visualize relationships between
multiple variables simultaneously, which is important in exploratory
data analysis.</p>
  </li>
</ul>

<p>Using Scatter Plots in R:</p>

<p><code class="language-plaintext highlighter-rouge">pairs(attitude, main = "attitude data", panel = panel.smooth)</code>: This
line generates a scatter plot matrix (a grid of scatter plots) for all
the variables in the “attitude” dataset. The <code class="language-plaintext highlighter-rouge">panel.smooth</code> argument adds
a smoothed (LOWESS) curve to each scatter plot to help visualize trends.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Plot variables: Useful to detect linear or non-linear patterns.</span><span class="w">
  </span><span class="n">pairs</span><span class="p">(</span><span class="n">attitude</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"attitude data"</span><span class="p">,</span><span class="w"> </span><span class="n">panel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">panel.smooth</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-28-1.png?raw=true" alt="" /><!-- --></p>

<p><code class="language-plaintext highlighter-rouge">plot(attitude$rating, attitude$complaints)</code>: This line creates a single
scatter plot between the “rating” and “complaints” variables, providing
a detailed view of the relationship between these two specific
variables.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Single plot</span><span class="w">
  </span><span class="n">plot</span><span class="p">(</span><span class="n">attitude</span><span class="o">$</span><span class="n">rating</span><span class="p">,</span><span class="w"> </span><span class="n">attitude</span><span class="o">$</span><span class="n">complaints</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-29-1.png?raw=true" alt="" /><!-- --></p>

<p><code class="language-plaintext highlighter-rouge">abline(lm(rating ~ complaints, data = attitude), col = 'red')</code>: Here, a
regression line is added to a scatter plot of the two variables. Because
the model predicts “rating” from “complaints”, the plot must place
“complaints” on the x-axis and “rating” on the y-axis; otherwise the line
is drawn in the wrong coordinates. The line represents the best-fit linear
relationship between the two variables and is colored red for visibility.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Scatter plot with the predictor on the x-axis and the outcome on the y-axis</span><span class="w">
  </span><span class="n">plot</span><span class="p">(</span><span class="n">attitude</span><span class="o">$</span><span class="n">complaints</span><span class="p">,</span><span class="w"> </span><span class="n">attitude</span><span class="o">$</span><span class="n">rating</span><span class="p">)</span><span class="w">
  </span><span class="c1"># Draw a regression line</span><span class="w">
  </span><span class="n">abline</span><span class="p">(</span><span class="n">lm</span><span class="p">(</span><span class="n">rating</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">complaints</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">attitude</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'red'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-30-1.png?raw=true" alt="" /><!-- --></p>
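
<p>The strength of the linear relationship can also be summarized numerically. The sketch below uses only base R; the object names <code class="language-plaintext highlighter-rouge">fit</code> and <code class="language-plaintext highlighter-rouge">r</code> are illustrative. It relies on the fact that, in a regression with a single predictor, the R-squared equals the squared Pearson correlation.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Fit the model whose line is drawn by abline()
fit &lt;- lm(rating ~ complaints, data = attitude)

# Intercept and slope of the regression line
coef(fit)

# Pearson correlation between the two variables
r &lt;- cor(attitude$rating, attitude$complaints)
r

# With one predictor, R-squared is the squared correlation
all.equal(summary(fit)$r.squared, r^2)
</code></pre></div></div>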

<h3 id="multiple-regression-model">Multiple Regression model</h3>

<p>Imagine you have a yummy meal, and you want to know what makes it taste
so good. Is it the color, the smell, or maybe the shape? Multiple
regression helps us figure out which of these things, or variables, are
most important in making the meal delicious. It’s like sniffing out the
best part of a treat recipe! The idea of regression is that we have a
series of variables that affect or predict the behavior of an
outcome variable. These explanatory variables are called determinants of
the dependent variable precisely because of their power to predict the outcome.</p>

<ul>
  <li>Simple regression models have only one determinant, but they are
rarely sufficient in practice: it is difficult to expect that an outcome
has only one predictor.</li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="c1"># Single Regression</span><span class="w">
</span><span class="n">m0</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">rating</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">advance</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">attitude</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m0</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 
## Call:
## lm(formula = rating ~ advance, data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.7465  -4.8749   0.5975   7.4232  18.1526 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(&gt;|t|)    
## (Intercept)  56.7558     9.7428   5.825 2.93e-06 ***
## advance       0.1835     0.2209   0.831    0.413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.24 on 28 degrees of freedom
## Multiple R-squared:  0.02405,    Adjusted R-squared:  -0.0108 
## F-statistic:  0.69 on 1 and 28 DF,  p-value: 0.4132
</code></pre></div></div>

<ul>
  <li>A multiple regression model has two or more explanatory variables, and it
is the most frequently used model.</li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">rating</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">attitude</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 
## Call:
## lm(formula = rating ~ ., data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9418  -4.3555   0.3158   5.5425  11.5990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(&gt;|t|)    
## (Intercept) 10.78708   11.58926   0.931 0.361634    
## complaints   0.61319    0.16098   3.809 0.000903 ***
## privileges  -0.07305    0.13572  -0.538 0.595594    
## learning     0.32033    0.16852   1.901 0.069925 .  
## raises       0.08173    0.22148   0.369 0.715480    
## critical     0.03838    0.14700   0.261 0.796334    
## advance     -0.21706    0.17821  -1.218 0.235577    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.068 on 23 degrees of freedom
## Multiple R-squared:  0.7326, Adjusted R-squared:  0.6628 
## F-statistic:  10.5 on 6 and 23 DF,  p-value: 1.24e-05
</code></pre></div></div>
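
<p>The two summaries above can be compared directly: the adjusted R-squared rises from -0.0108 in the single-predictor model to 0.6628 in the full model. As a sketch of how that comparison could be scripted with base R (refitting the same two models so the chunk runs on its own):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Refit the two models from above
m0 &lt;- lm(rating ~ advance, data = attitude)
m1 &lt;- lm(rating ~ ., data = attitude)

# Adjusted R-squared of both models side by side
c(m0 = summary(m0)$adj.r.squared, m1 = summary(m1)$adj.r.squared)

# F-test for nested models: do the extra predictors improve the fit?
anova(m0, m1)

# Coefficient table of the full model as a numeric matrix
round(coef(summary(m1)), 3)
</code></pre></div></div>

<p>The comparison with <code class="language-plaintext highlighter-rouge">anova()</code> is valid here because <code class="language-plaintext highlighter-rouge">m0</code> is nested in <code class="language-plaintext highlighter-rouge">m1</code>: every predictor of the smaller model also appears in the larger one.</p>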

<h1 id="section-2-an-introduction-to-network-analysis-using-igraph">Section 2: An introduction to Network Analysis using igraph</h1>

<h2 id="install-packages">Install packages</h2>

<p>Before you start this section I recommend that you install the following packages:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">igraph</code>: A package for network analysis and visualization (most important).</li>
  <li><code class="language-plaintext highlighter-rouge">tnet</code>: A package for analyzing weighted, two-mode, and longitudinal networks.</li>
  <li><code class="language-plaintext highlighter-rouge">data.table</code>: A package for data manipulation and analysis.</li>
  <li><code class="language-plaintext highlighter-rouge">qgraph</code>: A package for creating and analyzing graphical models (e.g., network models) that we use only to improve the visualization of networks.</li>
  <li><code class="language-plaintext highlighter-rouge">knitr</code>: A package for dynamic report generation in R.</li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">
</span><span class="c1"># packages</span><span class="w">
</span><span class="n">pks</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'knitr'</span><span class="p">,</span><span class="w"> </span><span class="s1">'igraph'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tnet'</span><span class="p">,</span><span class="w"> </span><span class="s1">'data.table'</span><span class="p">,</span><span class="w"> </span><span class="s1">'qgraph'</span><span class="p">)</span><span class="w">
 
</span><span class="c1"># Load the packages and install any that are missing</span><span class="w">
</span><span class="n">to.install</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">pks</span><span class="p">[</span><span class="o">!</span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">pks</span><span class="p">,</span><span class="w"> </span><span class="n">require</span><span class="p">,</span><span class="w"> </span><span class="n">character.only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w"> </span><span class="p">))]</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">to.install</span><span class="p">)</span><span class="o">!=</span><span class="m">0</span><span class="p">){</span><span class="n">install.packages</span><span class="p">(</span><span class="n">to.install</span><span class="p">,</span><span class="w"> </span><span class="n">dependencies</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)}</span><span class="w">

</span></code></pre></div></div>
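
<p>If you prefer to check what is missing without attaching anything, one possible variant is sketched below. The helper <code class="language-plaintext highlighter-rouge">missing_packages()</code> is hypothetical (it is not part of any package); it relies on the base R function <code class="language-plaintext highlighter-rouge">requireNamespace()</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: return the names of packages that are not installed
missing_packages &lt;- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded
  installed &lt;- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Base packages ship with R, so nothing should be reported as missing here
missing_packages(c("stats", "utils"))
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">install.packages(missing_packages(pks))</code> would then install only what is absent, and <code class="language-plaintext highlighter-rouge">library()</code> can be used afterwards to attach each package.</p>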

<h2 id="generate-graphs">Generate Graphs</h2>

<p>To create networks, we have the option of utilizing both R base
functions and functions within the igraph package.</p>

<h3 id="using-adjanceny-matrices">Using Adjacency Matrices</h3>

<p>One approach involves generating a network using an adjacency matrix,
where the rows and columns of the matrix correspond to vertices, and the
values in the matrix indicate connections between these vertices.</p>
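
<p>As a small deterministic illustration of this idea (the matrix <code class="language-plaintext highlighter-rouge">adj</code> and the object <code class="language-plaintext highlighter-rouge">g_small</code> are made up for this sketch), a symmetric 0/1 matrix with four vertices can be turned into an undirected graph and inspected:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

# Illustrative adjacency matrix: 1 means the two vertices are connected
adj &lt;- matrix(c(0, 1, 1, 0,
                1, 0, 0, 1,
                1, 0, 0, 0,
                0, 1, 0, 0), nrow = 4, byrow = TRUE)

# An undirected network needs a symmetric matrix
isSymmetric(adj)

g_small &lt;- graph_from_adjacency_matrix(adj, mode = "undirected")

vcount(g_small)  # number of vertices
ecount(g_small)  # number of edges (1-2, 1-3 and 2-4)
</code></pre></div></div>

<p>The example that follows applies the same approach to a randomly generated matrix.</p>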

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#### Graph from a Matrix ####</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">5</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">val</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">upper.tri</span><span class="p">(</span><span class="n">g</span><span class="p">)),</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">g</span><span class="p">[</span><span class="n">upper.tri</span><span class="p">(</span><span class="n">g</span><span class="p">)]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">val</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">g</span><span class="w">

</span><span class="c1"># Create a graph object</span><span class="w">
</span><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_adjacency_matrix</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"undirected"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h3 id="using-edge-lists">Using Edge lists</h3>

<ul>
  <li>The second most common way to generate a network is from an edge list,
that is, a set of pairs that define the connection between two vertices.</li>
</ul>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#### Graph from Edgelist ####</span><span class="w">
</span><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_edgelist</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">combn</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="m">2</span><span class="p">)))</span><span class="w">
</span></code></pre></div></div>
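
<p>To make the structure of an edge list concrete, the sketch below builds the same kind of two-column matrix and inspects the result. The object names <code class="language-plaintext highlighter-rouge">el</code>, <code class="language-plaintext highlighter-rouge">g_dir</code> and <code class="language-plaintext highlighter-rouge">g_und</code> are illustrative. Note that <code class="language-plaintext highlighter-rouge">graph_from_edgelist()</code> creates a directed graph unless <code class="language-plaintext highlighter-rouge">directed = FALSE</code> is set.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

# All pairs of 5 vertices: one row per edge, two columns (from, to)
el &lt;- t(combn(1:5, 2))
head(el)

# Default: each row becomes a directed edge from column 1 to column 2
g_dir &lt;- graph_from_edgelist(el)
is_directed(g_dir)

# Undirected version of the same complete graph
g_und &lt;- graph_from_edgelist(el, directed = FALSE)
ecount(g_und)  # choose(5, 2) edges
</code></pre></div></div>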

<h3 id="using-formulas">Using formulas</h3>

<ul>
  <li>The third way to create a graph is the
<code class="language-plaintext highlighter-rouge">graph_from_literal</code> function, which builds a network from a
compact formula notation: vertex names are written out and joined by
edge operators that describe the desired structure.</li>
</ul>

<p>The notation relies on the following operators:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">-+</code>: Represents a directed edge between two vertices. For example,
<code class="language-plaintext highlighter-rouge">A -+ B</code> indicates a directed edge from vertex A to vertex B.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">--</code>: Represents an undirected edge between two vertices. For example,
<code class="language-plaintext highlighter-rouge">A -- B</code> indicates an undirected edge between vertex A and vertex B.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">++</code>: Represents a mutual (bidirectional) connection, that is,
one directed edge in each direction. For example,
<code class="language-plaintext highlighter-rouge">A ++ B</code> creates an edge from A to B and another from B to A.
The igraph documentation writes the same operator as <code class="language-plaintext highlighter-rouge">+-+</code>.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">:</code>: Defines a set of vertices. An edge operator applied to a
set connects to every vertex in it. For example, <code class="language-plaintext highlighter-rouge">A - B:C</code> links A
to both B and C; it does not add an edge between B and C.</p>
  </li>
</ul>

<p>These notations allow you to define various types of connections and
structures within your network using concise and expressive formulas.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#### Graph from Formula ####</span><span class="w">

</span><span class="c1">#Undirected</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_literal</span><span class="p">(</span><span class="w"> </span><span class="n">A</span><span class="o">-</span><span class="n">B</span><span class="o">-</span><span class="n">C</span><span class="w"> </span><span class="p">)</span><span class="w">

</span><span class="c1">#Directed</span><span class="w">
</span><span class="n">g2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_literal</span><span class="p">(</span><span class="w"> </span><span class="n">A</span><span class="w"> </span><span class="o">-+</span><span class="w"> </span><span class="n">B</span><span class="w"> </span><span class="o">++</span><span class="w"> </span><span class="n">C</span><span class="w"> </span><span class="p">)</span><span class="w">

</span><span class="c1">#Undirected grouping</span><span class="w">
</span><span class="n">g3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_literal</span><span class="p">(</span><span class="w"> </span><span class="n">A</span><span class="o">-</span><span class="n">B</span><span class="o">:</span><span class="n">C</span><span class="w"> </span><span class="p">)</span><span class="w">

</span><span class="c1">#Directed grouping</span><span class="w">
</span><span class="n">g4</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_literal</span><span class="p">(</span><span class="w"> </span><span class="n">A</span><span class="o">-+</span><span class="n">B</span><span class="o">:</span><span class="n">C</span><span class="w"> </span><span class="p">)</span><span class="w">

</span><span class="c1">#Plot the graphs</span><span class="w">
</span><span class="nf">invisible</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w"> </span><span class="n">g2</span><span class="p">,</span><span class="w"> </span><span class="n">g3</span><span class="p">,</span><span class="w"> </span><span class="n">g4</span><span class="p">),</span><span class="w"> </span><span class="n">plot</span><span class="p">,</span><span class="w"> </span><span class="n">vertex.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="n">edge.arrow.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-35-1.png?raw=true" alt="" /><!-- --></p>
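
<p>To check how the operators expand, it can help to inspect the edges
that a formula produces rather than relying on the plot alone. The
following is a minimal sketch; the object names <code class="language-plaintext highlighter-rouge">g_set</code>,
<code class="language-plaintext highlighter-rouge">g_dir</code>, and <code class="language-plaintext highlighter-rouge">g_mut</code> are illustrative only.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

# Vertex set on the right-hand side: A is linked to B and to C
g_set &lt;- graph_from_literal(A - B:C)
as_edgelist(g_set)
are_adjacent(g_set, "B", "C")  # B and C are not linked to each other

# Same structure with directed edges pointing away from A
g_dir &lt;- graph_from_literal(A -+ B:C)
is_directed(g_dir)
as_edgelist(g_dir)

# Mutual connection: one directed edge in each direction
g_mut &lt;- graph_from_literal(A +-+ B)
ecount(g_mut)
</code></pre></div></div>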

<h3 id="using-igraph-functions">Using igraph-functions</h3>

<p>The <code class="language-plaintext highlighter-rouge">igraph</code> package provides several functions that
generate networks from random-graph algorithms and models. Some of the
most commonly used are:</p>

<p>Erdős-Rényi Model (<code class="language-plaintext highlighter-rouge">erdos.renyi.game</code>): This function generates random
networks following the Erdős-Rényi model. You specify the number of
vertices (n) and, with the default <code class="language-plaintext highlighter-rouge">type = "gnp"</code>, the probability (p)
that an edge forms between any pair of vertices. With
<code class="language-plaintext highlighter-rouge">type = "gnm"</code>, the second argument is instead the total number of
edges. In the probability variant, each edge exists independently of
the others with the same fixed probability.</p>
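
<p>One consequence of independent edges is that the expected number of
edges is p times the number of vertex pairs. The sketch below checks
this by simulation. It assumes the <code class="language-plaintext highlighter-rouge">sample_gnp</code> function, the name
that newer igraph releases use for the same generator; the seed and
the number of replications are arbitrary choices.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

set.seed(1)  # arbitrary seed, only for reproducibility
n &lt;- 25
p &lt;- 0.2

# Expected edge count: p * (number of vertex pairs)
expected_edges &lt;- p * choose(n, 2)

# Edge counts of 200 simulated G(n, p) graphs
edge_counts &lt;- replicate(200, ecount(sample_gnp(n, p)))

c(expected = expected_edges, simulated_mean = mean(edge_counts))
</code></pre></div></div>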

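<p>Watts-Strogatz Model (<code class="language-plaintext highlighter-rouge">watts.strogatz.game</code>): The Watts-Strogatz model
creates small-world networks that combine regularity and randomness.
It starts from a regular lattice, which for <code class="language-plaintext highlighter-rouge">dim = 1</code> is a ring in
which each vertex is connected to every vertex within <code class="language-plaintext highlighter-rouge">nei</code> steps on
either side. Then each edge is rewired with probability <code class="language-plaintext highlighter-rouge">p</code> to
introduce randomness. The arguments are the dimension of the lattice
(<code class="language-plaintext highlighter-rouge">dim</code>), the number of vertices along each dimension (<code class="language-plaintext highlighter-rouge">size</code>, which
is the total number of vertices when <code class="language-plaintext highlighter-rouge">dim = 1</code>), the neighborhood
radius (<code class="language-plaintext highlighter-rouge">nei</code>), and the rewiring probability (<code class="language-plaintext highlighter-rouge">p</code>). The function has
no <code class="language-plaintext highlighter-rouge">n</code> argument, so it is safest to pass all four by name. This model
helps generate networks that exhibit small-world properties.</p>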
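
<p>The effect of rewiring can be seen by comparing an unrewired ring
(<code class="language-plaintext highlighter-rouge">p = 0</code>) with a rewired one. This is a rough sketch that assumes
<code class="language-plaintext highlighter-rouge">sample_smallworld</code>, the newer name for the same generator; the
choice of <code class="language-plaintext highlighter-rouge">nei = 2</code> and the seed are arbitrary.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

set.seed(1)  # arbitrary seed
n &lt;- 25

# p = 0 keeps the regular ring; p = 0.2 rewires a fraction of the edges
ring &lt;- sample_smallworld(dim = 1, size = n, nei = 2, p = 0)
sw   &lt;- sample_smallworld(dim = 1, size = n, nei = 2, p = 0.2)

# Clustering and average path length before and after rewiring
summarise_net &lt;- function(g) {
  c(edges      = ecount(g),
    clustering = transitivity(g),
    mean_dist  = mean_distance(g))
}
rbind(ring = summarise_net(ring), smallworld = summarise_net(sw))
</code></pre></div></div>

<p>In a typical run the rewired network keeps much of the clustering of
the ring while its average path length drops, which is the small-world
signature.</p>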

<p>Barabási-Albert Model (<code class="language-plaintext highlighter-rouge">barabasi.game</code>): This function generates
networks using the Barabási-Albert model of preferential attachment.
Vertices are added one at a time, and each new vertex attaches to m
existing vertices, favoring those that already have many connections.
By default the edges are directed; pass <code class="language-plaintext highlighter-rouge">directed = FALSE</code> for an
undirected network. When <code class="language-plaintext highlighter-rouge">m</code> is not supplied, each new vertex adds a
single edge, which is equivalent to <code class="language-plaintext highlighter-rouge">m = 1</code>.
This model results in scale-free networks with a few highly connected
nodes, which is a common property in many real-world networks.</p>
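
<p>The skewed degree distribution is easy to see on a larger graph. The
sketch below assumes <code class="language-plaintext highlighter-rouge">sample_pa</code>, the newer name for the same
generator; the size of 200 vertices and the seed are arbitrary.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

set.seed(1)  # arbitrary seed
g_pa &lt;- sample_pa(200, m = 1, directed = FALSE)

# With m = 1 every vertex after the first adds exactly one edge
ecount(g_pa)

# Most vertices have degree 1 or 2; a handful act as hubs
table(degree(g_pa))
max(degree(g_pa))
</code></pre></div></div>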

<p>Forest Fire Model (<code class="language-plaintext highlighter-rouge">forest.fire.game</code>): The forest fire model simulates
network growth. Vertices are added one at a time until the network has
n vertices. Each new vertex links to one or more randomly chosen
"ambassador" vertices (<code class="language-plaintext highlighter-rouge">ambs</code>, one by default) and then "burns"
through their neighbors, linking to every vertex the fire reaches.
The <code class="language-plaintext highlighter-rouge">fw.prob</code> argument is the forward burning probability, and
<code class="language-plaintext highlighter-rouge">bw.factor</code> is the ratio of the backward to the forward burning
probability. Higher values let the fire spread further and produce
denser networks.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Set the number of vertices for all networks</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">25</span><span class="w">

</span><span class="c1"># Generate networks using different models</span><span class="w">
</span><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">erdos.renyi.game</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w">
</span><span class="n">g2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">watts.strogatz.game</span><span class="p">(</span><span class="n">dim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">nei</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w">
</span><span class="n">g3</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">barabasi.game</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">g4</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">forest.fire.game</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">fw.prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">bw.factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The following call lists the igraph generators that follow the
<code class="language-plaintext highlighter-rouge">*.game</code> naming convention:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">games</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="s2">"^.*game"</span><span class="p">,</span><span class="w"> </span><span class="n">ls</span><span class="p">(</span><span class="s2">"package:igraph"</span><span class="p">),</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">games</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##  [1] "aging.barabasi.game"         "aging.prefatt.game"         
##  [3] "asymmetric.preference.game"  "ba.game"                    
##  [5] "barabasi.game"               "bipartite.random.game"      
##  [7] "callaway.traits.game"        "cited.type.game"            
##  [9] "citing.cited.type.game"      "degree.sequence.game"       
## [11] "erdos.renyi.game"            "establishment.game"         
## [13] "forest.fire.game"            "grg.game"                   
## [15] "growing.random.game"         "hrg.game"                   
## [17] "interconnected.islands.game" "k.regular.game"             
## [19] "lastcit.game"                "preference.game"            
## [21] "random.graph.game"           "sbm.game"                   
## [23] "static.fitness.game"         "static.power.law.game"      
## [25] "watts.strogatz.game"
</code></pre></div></div>
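
<p>Newer igraph releases also expose these generators under
<code class="language-plaintext highlighter-rouge">sample_*</code> names, and depending on the installed version the dotted
<code class="language-plaintext highlighter-rouge">*.game</code> names may emit deprecation warnings. A minimal sketch for
listing them, assuming igraph 1.0 or later:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

# All exported functions whose name starts with "sample_"
samplers &lt;- grep("^sample_", ls("package:igraph"), value = TRUE)
head(samplers)

# Counterparts of the four generators used above
c("sample_gnp", "sample_smallworld",
  "sample_pa", "sample_forestfire") %in% samplers
</code></pre></div></div>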

<h2 id="visualizing-networks-with-igraph">Visualizing Networks with igraph</h2>

<p>To create insightful network visualizations using the <code class="language-plaintext highlighter-rouge">igraph</code> package
in R, we’ll begin with plotting a simple network. Later on, we’ll
explore customization options and demonstrate how to visualize multiple
networks side by side.</p>

<h3 id="plotting-a-simple-network">Plotting a Simple Network</h3>

<p>Calling <code class="language-plaintext highlighter-rouge">plot</code> with its default settings displays the network:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plot</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-38-1.png?raw=true" alt="" /><!-- --></p>

<p>To improve the visualization, we can pass further arguments to the
<code class="language-plaintext highlighter-rouge">plot</code> function:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">vertex.size</code>: Sets the size of the nodes (vertices) in the
network.</li>
  <li><code class="language-plaintext highlighter-rouge">edge.arrow.size</code>: Sets the size of the arrowheads when the network
contains directed edges.</li>
  <li><code class="language-plaintext highlighter-rouge">vertex.color</code>: Sets the fill color of the nodes (light blue in the example below).</li>
  <li><code class="language-plaintext highlighter-rouge">edge.color</code>: Sets the color of the edges (gray in the example below).</li>
  <li><code class="language-plaintext highlighter-rouge">vertex.label</code>: Sets the node labels; passing <code class="language-plaintext highlighter-rouge">NA</code> removes them for a cleaner plot.</li>
  <li><code class="language-plaintext highlighter-rouge">layout</code>: Specifies the layout algorithm (Fruchterman-Reingold in
the example below).</li>
</ul>

<h3 id="customizing-network-attributes">Customizing network attributes:</h3>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Visualizing multiple networks in a grid</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">networks</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w"> </span><span class="n">g2</span><span class="p">,</span><span class="w"> </span><span class="n">g3</span><span class="p">,</span><span class="w"> </span><span class="n">g4</span><span class="p">)</span><span class="w">
</span><span class="nf">invisible</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">networks</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="p">,</span><span class="w"> 
                </span><span class="n">vertex.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">25</span><span class="p">,</span><span class="w"> 
                </span><span class="n">edge.arrow.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w">
                </span><span class="n">vertex.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lightblue"</span><span class="p">,</span><span class="w"> 
                </span><span class="n">edge.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gray"</span><span class="p">,</span><span class="w">
                </span><span class="n">vertex.label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w">
                </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">layout.fruchterman.reingold</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-39-1.png?raw=true" alt="" /><!-- --></p>

<h3 id="layouts-of-igraph">Layouts of igraph</h3>

<p>igraph provides a range of layout algorithms that decide where each
vertex is drawn. Different layouts emphasize different features of the
same network, such as hubs, clusters, or hierarchical structure.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Generate a graph</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">15</span><span class="w">
</span><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">barabasi.game</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span><span class="c1"># Explore the complete list of layouts.</span><span class="w">
</span><span class="c1"># layouts &lt;- grep("^layout_", ls("package:igraph"), value = TRUE)[-1]</span><span class="w">
</span><span class="c1"># layouts &lt;- layouts[!grepl("bipartite|sugiyama", layouts)]</span><span class="w">
</span><span class="c1"># dput(layouts)</span><span class="w">

</span><span class="n">layouts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"layout_as_star"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_as_tree"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_components"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_in_circle"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"layout_nicely"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_on_grid"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_on_sphere"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_randomly"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"layout_with_dh"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_with_drl"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_with_fr"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_with_gem"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"layout_with_graphopt"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_with_kk"</span><span class="p">,</span><span class="w"> </span><span class="s2">"layout_with_lgl"</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"layout_with_mds"</span><span class="p">)</span><span class="w">

</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="nf">invisible</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">layouts</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">){</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">g1</span><span class="p">,</span><span class="w">
</span><span class="n">vertex.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
</span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">eval</span><span class="p">(</span><span class="n">get</span><span class="p">(</span><span class="n">x</span><span class="p">)),</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">}))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-40-1.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-40-2.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-40-3.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-40-4.png?raw=true" alt="" /><!-- --></p>
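
<p>A layout function can also be called directly, in which case it
returns a matrix of coordinates with one row per vertex. Reusing that
matrix keeps vertices in the same position across plots, which makes
comparisons easier. The sketch below is illustrative; the object names
<code class="language-plaintext highlighter-rouge">g</code> and <code class="language-plaintext highlighter-rouge">coords</code> and the seed are assumptions.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

set.seed(123)  # force-directed layouts are random, so fix the seed
g &lt;- sample_pa(15, directed = FALSE)

# Compute the coordinates once: one row per vertex, columns x and y
coords &lt;- layout_with_fr(g)
dim(coords)

# Reuse the same coordinates so only the styling differs between panels
op &lt;- par(mfrow = c(1, 2))
plot(g, layout = coords, vertex.size = 20, main = "Default style")
plot(g, layout = coords, vertex.size = 20,
     vertex.color = "lightblue", vertex.label = NA, main = "Custom style")
par(op)
</code></pre></div></div>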

<h3 id="setting-shape-of-vertices">Setting shape of vertices</h3>

<p>Vertex shapes in a network graph represent the graphical symbols used to
depict individual vertices or nodes. They are a visual attribute that
allows you to distinguish between nodes based on specific
characteristics or groupings. Vertex shapes are useful in network
visualization as they help convey additional information beyond just the
connections between nodes.</p>

<p>To illustrate, we generate a vertex attribute called <code class="language-plaintext highlighter-rouge">Age</code> from
random draws of a normal distribution with a mean of <code class="language-plaintext highlighter-rouge">30</code> and a
standard deviation of <code class="language-plaintext highlighter-rouge">5</code>, and divide the vertices into three
groups at the 33% and 66% quantiles of this variable. Each group is
assigned a different vertex shape, so that it is clear at a glance
which nodes belong to which category.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## All shape forms</span><span class="w">
</span><span class="n">shapes</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
  </span><span class="s1">'circle'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'square'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'csquare'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'rectangle'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'crectangle'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'vrectangle'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'pie'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'raster'</span><span class="p">,</span><span class="w">
  </span><span class="s1">'sphere'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">V</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="o">$</span><span class="n">Age</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">q</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">V</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="o">$</span><span class="n">Age</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">.33</span><span class="p">,</span><span class="w"> </span><span class="m">.66</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">ind</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cut</span><span class="p">(</span><span class="n">V</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="o">$</span><span class="n">Age</span><span class="p">,</span><span class="w"> </span><span class="n">q</span><span class="p">,</span><span class="w"> </span><span class="n">include.lowest</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">shape</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">ind</span><span class="o">==</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s1">'csquare'</span><span class="p">,</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">ind</span><span class="o">==</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="s1">'circle'</span><span class="p">,</span><span class="w"> </span><span class="s1">'sphere'</span><span class="p">))</span><span class="w">

</span><span class="n">plot</span><span class="p">(</span><span class="w">
  </span><span class="n">g1</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w">
  </span><span class="n">edge.arrow.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">shape</span><span class="p">,</span><span class="w">
  </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">layout_nicely</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-41-1.png?raw=true" style="display: block; margin: auto;" /></p>
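
<p>An alternative to nested <code class="language-plaintext highlighter-rouge">ifelse</code> calls is a named lookup vector,
with the result stored as the <code class="language-plaintext highlighter-rouge">shape</code> vertex attribute so that
<code class="language-plaintext highlighter-rouge">plot</code> picks it up automatically. This is only a sketch: the group
labels <code class="language-plaintext highlighter-rouge">young</code>, <code class="language-plaintext highlighter-rouge">middle</code>, and <code class="language-plaintext highlighter-rouge">older</code> and the object names are
hypothetical.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

set.seed(7)  # arbitrary seed
n &lt;- 15
g &lt;- sample_pa(n, directed = FALSE)
V(g)$Age &lt;- rnorm(n, 30, 5)

# Split Age into three groups at its terciles (labels are hypothetical)
breaks &lt;- quantile(V(g)$Age, c(0, 1/3, 2/3, 1))
age_group &lt;- cut(V(g)$Age, breaks, include.lowest = TRUE,
                 labels = c("young", "middle", "older"))

# Named lookup vector: group label -> igraph vertex shape
shape_map &lt;- c(young = "csquare", middle = "circle", older = "sphere")
V(g)$shape &lt;- unname(shape_map[as.character(age_group)])

table(age_group, V(g)$shape)

# plot() reads the "shape" vertex attribute, so no vertex.shape is needed
plot(g, vertex.size = 15, layout = layout_nicely)
</code></pre></div></div>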

<h3 id="setting-colours-of-vertices-and-adding-legends-to-plots">Setting colours of vertices and adding legends to plots</h3>

<p>Next, we sample with replacement from the set
<code class="language-plaintext highlighter-rouge">Gender = {female, male}</code> and store the result as a vertex attribute
called <code class="language-plaintext highlighter-rouge">gender</code>. The vertices are coloured by this attribute, and the
base R <code class="language-plaintext highlighter-rouge">legend</code> function adds a key to the plot.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">V</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="o">$</span><span class="n">gender</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"male"</span><span class="p">,</span><span class="w"> </span><span class="s2">"female"</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)</span><span class="w">

</span><span class="n">plot</span><span class="p">(</span><span class="w">
  </span><span class="n">g1</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">V</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="o">$</span><span class="n">gender</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"male"</span><span class="p">,</span><span class="w"> </span><span class="s2">"light blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pink"</span><span class="p">),</span><span class="w">
  </span><span class="n">edge.arrow.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">shape</span><span class="p">,</span><span class="w">
  </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">layout_nicely</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="n">legend</span><span class="p">(</span><span class="w">
  </span><span class="c1"># Position</span><span class="w">
  </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-1.5</span><span class="p">,</span><span class="w">
  </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-1.1</span><span class="p">,</span><span class="w">
  </span><span class="c1"># Legends</span><span class="w">
  </span><span class="nf">c</span><span class="p">(</span><span class="s2">"male"</span><span class="p">,</span><span class="w"> </span><span class="s2">"female"</span><span class="p">),</span><span class="w">
  </span><span class="c1"># Mark type (circle)</span><span class="w">
  </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">21</span><span class="p">,</span><span class="w">
  </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
  </span><span class="n">pt.bg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"light blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pink"</span><span class="p">),</span><span class="w">
  </span><span class="n">pt.cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
  </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.8</span><span class="p">,</span><span class="w">
  </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
  </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-42-1.png?raw=true" alt="" /><!-- --></p>

<h3 id="setting-colours-to-groups-of-vertices">Setting colours to groups of vertices</h3>

<p>We can emphasize groups of vertices in the graph. Find the vertex with
the highest degree centrality and mark the adjacent vertices in a group.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">class</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">adjacent_vertices</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w"> </span><span class="n">which.max</span><span class="p">(</span><span class="n">degree</span><span class="p">(</span><span class="n">g1</span><span class="p">)),</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"all"</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
  </span><span class="n">g1</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">15</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">V</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="o">$</span><span class="n">gender</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"male"</span><span class="p">,</span><span class="w"> </span><span class="s2">"light blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pink"</span><span class="p">),</span><span class="w">
  </span><span class="n">edge.arrow.size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w">
  </span><span class="n">vertex.shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">shape</span><span class="p">,</span><span class="w">
  </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">layout_nicely</span><span class="p">,</span><span class="w">
  </span><span class="n">mark.groups</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">class</span><span class="p">,</span><span class="w">
  </span><span class="n">mark.col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rainbow</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">class</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-43-1.png?raw=true" alt="" /><!-- --></p>

<h3 id="set-colours-and-thickness-to-edges">Set colours and thickness to edges</h3>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph.data.frame</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="w">
  </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'F'</span><span class="p">),</span><span class="w">
  </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'F'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">)</span><span class="w">
</span><span class="p">),</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span><span class="c1"># Colour the edges of every 3-clique (triangle) red and plot:</span><span class="w">
</span><span class="n">cl</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cliques</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">c.ed</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">cl</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
  </span><span class="n">E</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">[</span><span class="m">1</span><span class="p">])))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w">
  </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">layout.star</span><span class="p">,</span><span class="w">
  </span><span class="n">edge.width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">edge.betweenness</span><span class="p">(</span><span class="n">g1</span><span class="p">),</span><span class="w">
  </span><span class="n">edge.color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">E</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">c.ed</span><span class="p">),</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="s2">"gray"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-44-1.png?raw=true" alt="" /><!-- --></p>

<h3 id="subgraphs">Subgraphs</h3>

<p>Find the vertices adjacent to the vertex A and then plot a subgraph of
the neighborhood.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph.data.frame</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="w">
  </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'F'</span><span class="p">),</span><span class="w">
  </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'F'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">)</span><span class="w">
</span><span class="p">),</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span><span class="c1"># Find the vertices adjacent to A</span><span class="w">
</span><span class="n">v</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s1">'A'</span><span class="w">
</span><span class="n">neig.a</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">adjacent_vertices</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"all"</span><span class="p">))</span><span class="w">

</span><span class="c1"># Subgraph the neighborhood of A</span><span class="w">
</span><span class="n">g2</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">induced.subgraph</span><span class="p">(</span><span class="n">g1</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">v</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">neig.a</span><span class="p">[[</span><span class="m">1</span><span class="p">]])))</span><span class="w">

</span><span class="c1"># Plot the Graphs</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">g2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-45-1.png?raw=true" alt="" /><!-- --></p>
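
<p>The same neighbourhood can probably be extracted in one step. The sketch below assumes that igraph's <code class="language-plaintext highlighter-rouge">make_ego_graph</code> is available in the installed version, and it uses the newer function names <code class="language-plaintext highlighter-rouge">graph_from_data_frame</code> and <code class="language-plaintext highlighter-rouge">induced_subgraph</code> in place of the dotted ones. The object names <code class="language-plaintext highlighter-rouge">ego_a</code> and <code class="language-plaintext highlighter-rouge">manual_a</code> are only illustrative.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

edges &lt;- data.frame(
  from = c("A", "A", "A", "A", "B", "B", "C", "D", "E", "F"),
  to   = c("B", "C", "D", "E", "C", "E", "D", "E", "F", "G")
)
g1 &lt;- graph_from_data_frame(edges, directed = FALSE)

# Ego graph of order 1: the vertex A plus every vertex one step away from it
ego_a &lt;- make_ego_graph(g1, order = 1, nodes = "A")[[1]]

# Manual construction: A together with its adjacent vertices
neig_a &lt;- adjacent_vertices(g1, "A", mode = "all")
manual_a &lt;- induced_subgraph(g1, c("A", names(neig_a[[1]])))

# Both subgraphs should hold the same vertices and the same number of edges
setequal(V(ego_a)$name, V(manual_a)$name)
ecount(ego_a) == ecount(manual_a)
</code></pre></div></div>

<p>Raising <code class="language-plaintext highlighter-rouge">order</code> would widen the neighbourhood to vertices two or more steps away from A.</p>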

<h2 id="network-statistics">Network Statistics</h2>

<h3 id="local-clustering">Local Clustering</h3>

<p>The local clustering coefficient of a vertex $v$ is the ratio between the number of
3-cliques, or triangles, that contain $v$ and the number of connected
triplets centred on $v$, that is, pairs of edges that are both incident to $v$. For instance, for
vertex $A$, the triangles are
$\Delta_{A}=\{(A,B,C), (A,B,E), (A,C,D), (A,D,E)\}$.
The connected triplets centred on $A$ are these four closed triplets plus two open ones:
$T=\{(B,A,C), (B,A,E), (C,A,D), (D,A,E), (C,A,E), (B,A,D)\}$. The local
clustering coefficient is therefore $C_{A}=\frac{|\Delta_{A}|}{|T|}=\frac{4}{6}=\frac{2}{3}$. In
triangle-free topologies such as <em>stars</em>,
<em>trees</em>, and <em>lattices</em>, the coefficient is zero for every vertex with at least two neighbours and undefined (<code class="language-plaintext highlighter-rouge">NaN</code>) for vertices with fewer than two.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
    </span><span class="n">graph.data.frame</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="w">
    </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">),</span><span class="w">
    </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">)</span><span class="w">
    </span><span class="p">),</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

    </span><span class="c1">#Graph</span><span class="w">
    </span><span class="n">plot</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">layout</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">layout.star</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-46-1.png?raw=true" alt="" /><!-- --></p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="c1">#Transitivity of A</span><span class="w">
    </span><span class="n">transitivity</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"A"</span><span class="p">,</span><span class="w"> </span><span class="s2">"local"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 0.6666667
</code></pre></div></div>
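
<p>The value of $\frac{2}{3}$ can be checked by counting directly. This is a minimal sketch that rebuilds the same graph with <code class="language-plaintext highlighter-rouge">graph_from_data_frame</code> so that it runs on its own; <code class="language-plaintext highlighter-rouge">nb</code>, <code class="language-plaintext highlighter-rouge">triangles</code> and <code class="language-plaintext highlighter-rouge">triplets</code> are illustrative names.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

g &lt;- graph_from_data_frame(data.frame(
  from = c("A", "A", "A", "A", "B", "B", "C", "D"),
  to   = c("B", "C", "D", "E", "C", "E", "D", "E")
), directed = FALSE)

# Neighbours of A
nb &lt;- neighbors(g, "A")

# Every edge between two neighbours of A closes one triangle through A
triangles &lt;- ecount(induced_subgraph(g, nb))

# Every pair of neighbours forms one connected triplet centred on A
triplets &lt;- choose(length(nb), 2)

# Ratio counted by hand, next to the value reported by igraph
triangles / triplets
transitivity(g, type = "local", vids = "A")
</code></pre></div></div>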

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="n">transitivity</span><span class="p">(</span><span class="n">graph.lattice</span><span class="p">(</span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="s2">"local"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] NaN   0   0   0 NaN
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="n">transitivity</span><span class="p">(</span><span class="n">graph.star</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"undirected"</span><span class="p">),</span><span class="w"> </span><span class="s2">"local"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1]   0 NaN NaN NaN NaN
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="n">transitivity</span><span class="p">(</span><span class="n">graph.tree</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"undirected"</span><span class="p">),</span><span class="w"> </span><span class="s2">"local"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1]   0   0 NaN NaN NaN
</code></pre></div></div>
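
<p>The <code class="language-plaintext highlighter-rouge">NaN</code> entries in the outputs above line up with vertices that have fewer than two neighbours, for which there are no triplets to count. A short sketch, using the newer constructor <code class="language-plaintext highlighter-rouge">make_star</code> and the <code class="language-plaintext highlighter-rouge">isolates</code> argument of <code class="language-plaintext highlighter-rouge">transitivity</code> (both assumed to be available in the installed igraph version):</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

s &lt;- make_star(5, mode = "undirected")

deg &lt;- degree(s)
tr &lt;- transitivity(s, type = "local")

# Side by side: the coefficient is NaN wherever the degree is below two
data.frame(degree = deg, transitivity = tr, undefined = is.nan(tr))

# Report zero instead of NaN for those vertices
transitivity(s, type = "local", isolates = "zero")
</code></pre></div></div>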

<h3 id="degree-and-strength">Degree and Strength</h3>

<p>Add two new edges to the graph, from $A$ to $B$ and from $A$ to $A$, and plot the
graph. Compare the <code class="language-plaintext highlighter-rouge">degree</code> and <code class="language-plaintext highlighter-rouge">strength</code> of vertex $A$: the
vertices adjacent to $A$ are $\{B,C,D,E\}$, yet both measurements have a
value of $7$, because the parallel edge and the loop are counted too. Simplify the graph, removing multiple edges and loops, and
compute the <code class="language-plaintext highlighter-rouge">degree</code> centrality again. Finally, use the <code class="language-plaintext highlighter-rouge">count_multiple</code>
function to compute weights for each edge in the graph and calculate
<code class="language-plaintext highlighter-rouge">strength</code> centrality one more time.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph.data.frame</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="w">
      </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">),</span><span class="w">
      </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">)</span><span class="w">
      </span><span class="p">),</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

    </span><span class="c1"># Add edge</span><span class="w">
    </span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">add_edges</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">))</span><span class="w">

    </span><span class="c1"># Plot the graph</span><span class="w">
    </span><span class="n">plot</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-47-1.png?raw=true" style="display: block; margin: auto;" /></p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="c1"># Degree vs Strength</span><span class="w">
    </span><span class="n">degree</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">name</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## A 
## 7
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="n">strength</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">name</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## A 
## 7
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="c1"># Simplify Degree</span><span class="w">
    </span><span class="n">degree</span><span class="p">(</span><span class="n">simplify</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">remove.multiple</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">remove.loops</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">),</span><span class="w">
    </span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">name</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w">
    </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## A 
## 4
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">    </span><span class="c1"># Strength with Weights</span><span class="w">
    </span><span class="n">E</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">weight</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">count_multiple</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
    </span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">simplify</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
    </span><span class="n">strength</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">name</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'all'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## A 
## 7
</code></pre></div></div>
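
<p>One way to see where the unsimplified value of $7$ comes from is to separate the contribution of the loop from that of the parallel edge. The sketch below also shows an alternative weighting, under the assumption that the intended edge weight is the number of parallel edges: it sets every weight to 1 and lets <code class="language-plaintext highlighter-rouge">simplify</code> add them up, which is a different route from the <code class="language-plaintext highlighter-rouge">count_multiple</code> call above. The names <code class="language-plaintext highlighter-rouge">g_multi</code> and <code class="language-plaintext highlighter-rouge">g_simple</code> are illustrative.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

g_multi &lt;- graph_from_data_frame(data.frame(
  from = c("A", "A", "A", "A", "B", "B", "C", "D"),
  to   = c("B", "C", "D", "E", "C", "E", "D", "E")
), directed = FALSE)
g_multi &lt;- add_edges(g_multi, c("A", "B", "A", "A"))

# Degree with the loop counted (a loop touches A with both of its ends)
degree(g_multi, "A")

# Degree with the loop ignored; the parallel A-B edge is still counted
degree(g_multi, "A", loops = FALSE)

# Number of redundant parallel edges and of loops in the graph
sum(which_multiple(g_multi))
sum(which_loop(g_multi))

# Weight = multiplicity: give each edge weight 1 and sum over parallel edges
E(g_multi)$weight &lt;- 1
g_simple &lt;- simplify(g_multi, edge.attr.comb = list(weight = "sum"))
strength(g_simple, "A")
</code></pre></div></div>

<p>With this weighting the strength of $A$ counts the doubled edge to $B$ twice and drops the loop.</p>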

<h3 id="eigenvector-centrality">Eigenvector Centrality</h3>

<p>The intuition behind eigenvector centrality is to capture the importance of the neighborhood of
each vertex. Vertices that are connected to adjacent vertices with higher
degree centrality score higher on this measurement. The interest
lies in finding a vector $w$ that represents a ranking of relative
importance for each vertex. This is similar to finding a solution for
the eigenvalue problem of the adjacency matrix $A$.</p>

\[Aw = \lambda w\]
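
<p>As a quick check of this equation, base R's <code class="language-plaintext highlighter-rouge">eigen</code> can be applied to the adjacency matrix directly. The sketch below assumes the five-vertex example graph from the previous sections and igraph's <code class="language-plaintext highlighter-rouge">as_adjacency_matrix</code> and <code class="language-plaintext highlighter-rouge">eigen_centrality</code>. The names <code class="language-plaintext highlighter-rouge">M</code>, <code class="language-plaintext highlighter-rouge">lambda</code> and <code class="language-plaintext highlighter-rouge">w</code> are illustrative; <code class="language-plaintext highlighter-rouge">M</code> stands for the adjacency matrix to avoid a clash with the vertex $A$.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(igraph)

g &lt;- graph_from_data_frame(data.frame(
  from = c("A", "A", "A", "A", "B", "B", "C", "D"),
  to   = c("B", "C", "D", "E", "C", "E", "D", "E")
), directed = FALSE)

# Dense adjacency matrix of the graph
M &lt;- as_adjacency_matrix(g, sparse = FALSE)

# Leading eigenpair; the sign of the eigenvector is arbitrary, so make it positive
e &lt;- eigen(M)
lambda &lt;- e$values[1]
w &lt;- abs(e$vectors[, 1])

# Residual of M w = lambda w, expected to be numerically zero
max(abs(M %*% w - lambda * w))

# Rescale so that the largest entry is 1 and compare with igraph
round(w / max(w), 4)
round(eigen_centrality(g)$vector, 4)
</code></pre></div></div>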

<p>Suppose that we assign equal weights $w_1$ to each vertex and then
compute $Aw_1=w_2$, which is similar to computing a weighted degree. Then use
$w_2$ as the new weights and repeat for $n$ iterations, rescaling the vector at each step, until the difference
$Aw_{n} - \lambda w_{n}$ gets close to zero. Run an implementation of the
<a href="https://www.sci.unich.it/~francesc/teaching/network/eigenvector.html">eigenvector
centrality</a> algorithm
and compare the results with igraph's built-in function.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># graph: </span><span class="w">
    </span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph.data.frame</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="w">
      </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">),</span><span class="w">
      </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">)</span><span class="w">
      </span><span class="p">),</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">

</span><span class="c1"># Eigenvector centrality algorithm</span><span class="w">
</span><span class="n">eigenvector.centrality</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="o">=</span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">get.adjacency</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
  </span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
  </span><span class="c1">#Degree</span><span class="w">
  </span><span class="c1"># w &lt;- max(A%*%rep(1, n))</span><span class="w">
  </span><span class="n">w</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">n</span><span class="w">
  </span><span class="c1">#Create a vector of weights</span><span class="w">
  </span><span class="n">x1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="n">w</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
  </span><span class="c1">#Create a vector of zeros (for the initial iteration)</span><span class="w">
  </span><span class="n">x0</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
  </span><span class="c1">#Precision of the computation (convergence tolerance 10^-t)</span><span class="w">
  </span><span class="n">pre</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">10</span><span class="o">^</span><span class="n">t</span><span class="w">
  </span><span class="c1">#Iteration counter</span><span class="w">
  </span><span class="n">iter</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">
  </span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">x0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x1</span><span class="p">))</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">pre</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="c1">#Store the current weights for comparison in the next iteration</span><span class="w">
    </span><span class="n">x0</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">x1</span><span class="w">
    </span><span class="c1">#Compute a weighted degree</span><span class="w">
    </span><span class="n">x1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.vector</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">x1</span><span class="p">)</span><span class="w">
    </span><span class="c1">#Get the biggest weight</span><span class="w">
    </span><span class="n">w</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">x1</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">x1</span><span class="p">))]</span><span class="w">
    </span><span class="c1">#Get a new vector of weights</span><span class="w">
    </span><span class="n">x1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">w</span><span class="w">
    </span><span class="c1">#Count the iterations</span><span class="w">
    </span><span class="n">iter</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="nf">return</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">vector</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">w</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">iter</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="c1">#Compute eigenvector centrality scores</span><span class="w">
</span><span class="n">eigen_centrality</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">vector</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##        A        B        C        D        E 
## 1.000000 0.809017 0.809017 0.809017 0.809017
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eigenvector.centrality</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="o">$</span><span class="n">vector</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 1.000000 0.809017 0.809017 0.809017 0.809017
</code></pre></div></div>
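
<p>As a cross-check on the power iteration above, here is a minimal base-R
sketch. It is not the routine used inside <code>eigen_centrality</code>. It assumes the toy
graph's adjacency matrix is typed in by hand (the names <code>v</code>, <code>edges</code>,
<code>lead</code>, and <code>centrality</code> are illustrative) and reads the scores off the leading
eigenvector returned by <code>eigen()</code>, rescaled so that its largest entry is 1.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Vertices and undirected edges of the toy graph (typed in by hand)
v &lt;- c("A", "B", "C", "D", "E")
edges &lt;- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("A", "E"),
               c("B", "C"), c("B", "E"), c("C", "D"), c("D", "E"))

# Symmetric 0/1 adjacency matrix, indexed by vertex name
A &lt;- matrix(0, nrow = 5, ncol = 5, dimnames = list(v, v))
A[edges] &lt;- 1
A[edges[, 2:1]] &lt;- 1

# eigen() sorts eigenvalues in decreasing order,
# so the first column is the leading eigenvector
ev &lt;- eigen(A, symmetric = TRUE)
lead &lt;- abs(ev$vectors[, 1])   # the sign of an eigenvector is arbitrary
centrality &lt;- setNames(lead / max(lead), v)

round(centrality, 6)
ev$values[1]
</code></pre></div></div>

<p>For this graph the leading eigenvalue is $1 + \sqrt{5} \approx 3.236$, and
each outer vertex scores $(1 + \sqrt{5})/4 \approx 0.809$ relative to the hub A,
which agrees with both outputs above.</p>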

<h3 id="weighted-measurements-of-centrality">Weighted Measurements of Centrality</h3>

<p>The <a href="https://cran.r-project.org/web/packages/tnet/tnet.pdf">tnet</a> package
has implementations of weighted versions of degree, betweenness, and
closeness. Let’s generate a weighted graph of $n$ vertices and compare
the weighted and unweighted centrality measurements.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w">  </span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">20</span><span class="w">
    </span><span class="c1"># Generate an undirected weighted graph</span><span class="w">
    </span><span class="n">w</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0L</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
    
    </span><span class="c1"># Skewed distribution</span><span class="w">
    </span><span class="n">val</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rnbinom</span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">upper.tri</span><span class="p">(</span><span class="n">w</span><span class="p">)),</span><span class="w"> </span><span class="n">prob</span><span class="o">=</span><span class="m">1</span><span class="o">/</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
    </span><span class="n">w</span><span class="p">[</span><span class="n">upper.tri</span><span class="p">(</span><span class="n">w</span><span class="p">)]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">val</span><span class="w">
    </span><span class="n">w</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">w</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span class="w">
    </span><span class="n">g_w</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.tnet</span><span class="p">(</span><span class="n">w</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'weighted one-mode tnet'</span><span class="p">)</span><span class="w">
    </span><span class="n">g_w.1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_adjacency_matrix</span><span class="p">(</span><span class="n">w</span><span class="p">)</span><span class="w">
    </span><span class="n">E</span><span class="p">(</span><span class="n">g_w.1</span><span class="p">)</span><span class="o">$</span><span class="n">weight</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">count.multiple</span><span class="p">(</span><span class="n">g_w.1</span><span class="p">)</span><span class="w">

    </span><span class="c1"># Generate an undirected unweighted graph</span><span class="w">
    </span><span class="n">uw</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0L</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
    </span><span class="n">uw</span><span class="p">[</span><span class="n">w</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
    </span><span class="n">g_uw</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_adjacency_matrix</span><span class="p">(</span><span class="n">uw</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'undirected'</span><span class="p">)</span><span class="w">
    </span><span class="c1"># Strength</span><span class="w">
    </span><span class="n">st</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">strength</span><span class="p">(</span><span class="n">g_w.1</span><span class="p">)</span><span class="w">
    </span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">degree</span><span class="p">(</span><span class="n">g_w.1</span><span class="p">)</span><span class="w">
    </span><span class="n">dw</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">degree_w</span><span class="p">(</span><span class="n">g_w</span><span class="p">)</span><span class="w">
      
    </span><span class="c1"># Betweenness</span><span class="w">
    </span><span class="n">bu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">betweenness</span><span class="p">(</span><span class="n">g_uw</span><span class="p">)</span><span class="w">
    </span><span class="n">bw</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">betweenness_w</span><span class="p">(</span><span class="n">g_w</span><span class="p">)</span><span class="w">
    
    </span><span class="c1"># Closeness</span><span class="w">
    </span><span class="n">cu</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">closeness</span><span class="p">(</span><span class="n">g_uw</span><span class="p">)</span><span class="w">
    </span><span class="n">cw</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">closeness_w</span><span class="p">(</span><span class="n">g_w</span><span class="p">)</span><span class="w">
    </span><span class="n">out</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
    </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
    </span><span class="n">vertex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w">
    </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d</span><span class="p">,</span><span class="w">
    </span><span class="n">w.degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dw</span><span class="p">[,</span><span class="m">3</span><span class="p">],</span><span class="w">
    </span><span class="n">strength</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">st</span><span class="p">,</span><span class="w">
    </span><span class="n">betweenness</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bu</span><span class="p">,</span><span class="w">
    </span><span class="n">w.betweenness</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bw</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w">
    </span><span class="n">closeness</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cu</span><span class="p">,</span><span class="w">
    </span><span class="n">w.closeness</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cw</span><span class="p">[,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
    </span><span class="p">)</span><span class="w">
    
   </span><span class="n">kable</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"markdown"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<table>
  <thead>
    <tr>
      <th style="text-align: right">vertex</th>
      <th style="text-align: right">degree</th>
      <th style="text-align: right">w.degree</th>
      <th style="text-align: right">strength</th>
      <th style="text-align: right">betweenness</th>
      <th style="text-align: right">w.betweenness</th>
      <th style="text-align: right">closeness</th>
      <th style="text-align: right">w.closeness</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">1</td>
      <td style="text-align: right">136</td>
      <td style="text-align: right">68</td>
      <td style="text-align: right">1360</td>
      <td style="text-align: right">2.104459</td>
      <td style="text-align: right">7</td>
      <td style="text-align: right">0.0400000</td>
      <td style="text-align: right">0.0027701</td>
    </tr>
    <tr>
      <td style="text-align: right">2</td>
      <td style="text-align: right">206</td>
      <td style="text-align: right">103</td>
      <td style="text-align: right">3390</td>
      <td style="text-align: right">2.850419</td>
      <td style="text-align: right">29</td>
      <td style="text-align: right">0.0400000</td>
      <td style="text-align: right">0.0033045</td>
    </tr>
    <tr>
      <td style="text-align: right">3</td>
      <td style="text-align: right">142</td>
      <td style="text-align: right">71</td>
      <td style="text-align: right">1310</td>
      <td style="text-align: right">3.630675</td>
      <td style="text-align: right">6</td>
      <td style="text-align: right">0.0434783</td>
      <td style="text-align: right">0.0024859</td>
    </tr>
    <tr>
      <td style="text-align: right">4</td>
      <td style="text-align: right">132</td>
      <td style="text-align: right">66</td>
      <td style="text-align: right">1664</td>
      <td style="text-align: right">1.697183</td>
      <td style="text-align: right">7</td>
      <td style="text-align: right">0.0384615</td>
      <td style="text-align: right">0.0029117</td>
    </tr>
    <tr>
      <td style="text-align: right">5</td>
      <td style="text-align: right">144</td>
      <td style="text-align: right">72</td>
      <td style="text-align: right">1240</td>
      <td style="text-align: right">2.774312</td>
      <td style="text-align: right">9</td>
      <td style="text-align: right">0.0434783</td>
      <td style="text-align: right">0.0027380</td>
    </tr>
    <tr>
      <td style="text-align: right">6</td>
      <td style="text-align: right">84</td>
      <td style="text-align: right">42</td>
      <td style="text-align: right">496</td>
      <td style="text-align: right">2.129892</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0.0384615</td>
      <td style="text-align: right">0.0022731</td>
    </tr>
    <tr>
      <td style="text-align: right">7</td>
      <td style="text-align: right">104</td>
      <td style="text-align: right">52</td>
      <td style="text-align: right">476</td>
      <td style="text-align: right">4.233103</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0.0454545</td>
      <td style="text-align: right">0.0020333</td>
    </tr>
    <tr>
      <td style="text-align: right">8</td>
      <td style="text-align: right">148</td>
      <td style="text-align: right">74</td>
      <td style="text-align: right">1496</td>
      <td style="text-align: right">1.338850</td>
      <td style="text-align: right">13</td>
      <td style="text-align: right">0.0370370</td>
      <td style="text-align: right">0.0027492</td>
    </tr>
    <tr>
      <td style="text-align: right">9</td>
      <td style="text-align: right">130</td>
      <td style="text-align: right">65</td>
      <td style="text-align: right">762</td>
      <td style="text-align: right">2.880625</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0.0434783</td>
      <td style="text-align: right">0.0023273</td>
    </tr>
    <tr>
      <td style="text-align: right">10</td>
      <td style="text-align: right">180</td>
      <td style="text-align: right">90</td>
      <td style="text-align: right">3248</td>
      <td style="text-align: right">3.568456</td>
      <td style="text-align: right">18</td>
      <td style="text-align: right">0.0434783</td>
      <td style="text-align: right">0.0031294</td>
    </tr>
    <tr>
      <td style="text-align: right">11</td>
      <td style="text-align: right">194</td>
      <td style="text-align: right">97</td>
      <td style="text-align: right">2246</td>
      <td style="text-align: right">3.685820</td>
      <td style="text-align: right">21</td>
      <td style="text-align: right">0.0454545</td>
      <td style="text-align: right">0.0032645</td>
    </tr>
    <tr>
      <td style="text-align: right">12</td>
      <td style="text-align: right">116</td>
      <td style="text-align: right">58</td>
      <td style="text-align: right">1212</td>
      <td style="text-align: right">2.174228</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">0.0384615</td>
      <td style="text-align: right">0.0022673</td>
    </tr>
    <tr>
      <td style="text-align: right">13</td>
      <td style="text-align: right">86</td>
      <td style="text-align: right">43</td>
      <td style="text-align: right">590</td>
      <td style="text-align: right">1.078691</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0.0370370</td>
      <td style="text-align: right">0.0024054</td>
    </tr>
    <tr>
      <td style="text-align: right">14</td>
      <td style="text-align: right">144</td>
      <td style="text-align: right">72</td>
      <td style="text-align: right">988</td>
      <td style="text-align: right">4.050985</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">0.0434783</td>
      <td style="text-align: right">0.0028601</td>
    </tr>
    <tr>
      <td style="text-align: right">15</td>
      <td style="text-align: right">88</td>
      <td style="text-align: right">44</td>
      <td style="text-align: right">404</td>
      <td style="text-align: right">1.909510</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0.0384615</td>
      <td style="text-align: right">0.0020143</td>
    </tr>
    <tr>
      <td style="text-align: right">16</td>
      <td style="text-align: right">126</td>
      <td style="text-align: right">63</td>
      <td style="text-align: right">822</td>
      <td style="text-align: right">4.247017</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0.0454545</td>
      <td style="text-align: right">0.0025134</td>
    </tr>
    <tr>
      <td style="text-align: right">17</td>
      <td style="text-align: right">162</td>
      <td style="text-align: right">81</td>
      <td style="text-align: right">1314</td>
      <td style="text-align: right">1.615571</td>
      <td style="text-align: right">18</td>
      <td style="text-align: right">0.0384615</td>
      <td style="text-align: right">0.0033866</td>
    </tr>
    <tr>
      <td style="text-align: right">18</td>
      <td style="text-align: right">178</td>
      <td style="text-align: right">89</td>
      <td style="text-align: right">2278</td>
      <td style="text-align: right">2.254387</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">0.0400000</td>
      <td style="text-align: right">0.0032046</td>
    </tr>
    <tr>
      <td style="text-align: right">19</td>
      <td style="text-align: right">104</td>
      <td style="text-align: right">52</td>
      <td style="text-align: right">644</td>
      <td style="text-align: right">1.613647</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">0.0384615</td>
      <td style="text-align: right">0.0025963</td>
    </tr>
    <tr>
      <td style="text-align: right">20</td>
      <td style="text-align: right">196</td>
      <td style="text-align: right">98</td>
      <td style="text-align: right">2500</td>
      <td style="text-align: right">4.162168</td>
      <td style="text-align: right">32</td>
      <td style="text-align: right">0.0454545</td>
      <td style="text-align: right">0.0032223</td>
    </tr>
  </tbody>
</table>
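
<p>The gap between the unweighted and weighted columns is easiest to see on a
tiny example. The sketch below uses base R only and a hypothetical 4-vertex
weight matrix (not the simulated graph above, and not the tnet functions):
degree counts a vertex's neighbours, while strength sums the weights of its
ties.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical symmetric weight matrix: entry [i, j] is the weight of tie i-j
w &lt;- matrix(c(0, 5, 0, 1,
              5, 0, 2, 0,
              0, 2, 0, 0,
              1, 0, 0, 0), nrow = 4, byrow = TRUE)

# Unweighted degree: number of non-zero ties per vertex
degree_uw &lt;- rowSums(w &gt; 0)

# Strength: sum of tie weights per vertex
strength_w &lt;- rowSums(w)

data.frame(vertex = 1:4, degree = degree_uw, strength = strength_w)
</code></pre></div></div>

<p>Vertices 1 and 2 both have degree 2, but their strengths are 6 and 7, so
the weighted measure separates vertices that the unweighted one treats as
equal.</p>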

<h3 id="structural-holes">Structural Holes</h3>

<p>Structural holes are separations between groups, observed as
discontinuities in the network structure. The absence of structural holes
signals saturation in the capacity of individuals (vertices) to create
novel connections outside their group. Saturation occurs when
individuals reach a limit on the number of connections they can create
and maintain. When an individual’s resources are concentrated in a single
group, structural holes are absent or scarce. The scarcity of holes
constrains collaboration outside a single research team, which
leads to redundant information and an inability to capitalize on novel
ideas from different research teams.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph.data.frame</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="w">
  </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'A'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'C'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'F'</span><span class="p">),</span><span class="w">
  </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'B'</span><span class="p">,</span><span class="w"> </span><span class="s1">'E'</span><span class="p">,</span><span class="w"> </span><span class="s1">'F'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">,</span><span class="w"> </span><span class="s1">'D'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">,</span><span class="w"> </span><span class="s1">'G'</span><span class="p">)</span><span class="w">
</span><span class="p">),</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">


</span><span class="n">plot</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-50-1.png?raw=true" alt="" /></p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">A</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">get.adjacency</span><span class="p">(</span><span class="n">g</span><span class="p">))</span><span class="w">
</span><span class="n">kable</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"markdown"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<table>
  <thead>
    <tr>
      <th style="text-align: left"> </th>
      <th style="text-align: right">A</th>
      <th style="text-align: right">B</th>
      <th style="text-align: right">C</th>
      <th style="text-align: right">D</th>
      <th style="text-align: right">E</th>
      <th style="text-align: right">F</th>
      <th style="text-align: right">G</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">A</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: left">B</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: left">C</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: left">D</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: left">E</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: left">F</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: left">G</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
    </tr>
  </tbody>
</table>

<p>To calculate the constraints on bridging structural holes, the first step
is to calculate the proportion of individual $i$’s resources allocated to
each connection $j$.</p>

\[p_{ij} = z_{ij} / z_{iq}\]
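
<p>One common reading of this ratio, following the usual definition in Burt’s
constraint measure, takes the denominator as the sum of $i$’s ties over all
contacts $q$, so that each row of proportions sums to 1. The base-R sketch
below illustrates that reading only and may differ from the step-by-step
computation in this post. The adjacency matrix is rebuilt by hand from the
edge list, and the names <code>v</code>, <code>edges</code>, and <code>P_row</code> are hypothetical.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Vertices and undirected edges of the seven-vertex example graph
v &lt;- c("A", "B", "C", "D", "E", "F", "G")
edges &lt;- cbind(c("A", "A", "A", "A", "B", "B", "C", "D", "E", "F"),
               c("B", "E", "F", "G", "D", "G", "G", "G", "G", "G"))

# Symmetric 0/1 adjacency matrix, indexed by vertex name
A &lt;- matrix(0, nrow = 7, ncol = 7, dimnames = list(v, v))
A[edges] &lt;- 1
A[edges[, 2:1]] &lt;- 1

# Row-normalise: p_ij = z_ij / sum over q of z_iq
# (the vector of row sums is recycled down the columns, so row i is divided by its own sum)
P_row &lt;- A / rowSums(A)

round(P_row["A", ], 2)
rowSums(P_row)
</code></pre></div></div>

<p>Under this assumption vertex A, with four ties, allocates 0.25 to each of
B, E, F, and G, while C, whose only tie is to G, allocates everything to G.</p>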

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Degree for the undirected graph (sum of resources spent on each connection)</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">upper.tri</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">A</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">

</span><span class="c1"># Matrix of degree</span><span class="w">
</span><span class="n">D</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">)),</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">A</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">

</span><span class="c1"># Matrix of time and energy invested in others: z_iq = d - z_ij</span><span class="w">
</span><span class="n">z_iq</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">D</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">upper.tri</span><span class="p">(</span><span class="n">D</span><span class="p">))</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">upper.tri</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">

</span><span class="c1"># Matrix of the proportion of i's time and energy allocated to the connection with j</span><span class="w">
</span><span class="n">P</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">upper.tri</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="o">/</span><span class="n">z_iq</span><span class="w">
</span></code></pre></div></div>
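
<p>As a cross-check on the proportions, here is a minimal sketch that works on
the full symmetric matrix instead of its upper triangle. It assumes the
denominator is read as $i$'s total investment across all contacts $q$,
$\sum_q z_{iq}$, which is how Burt's proportional tie strength is usually
stated and which differs from the <code class="language-plaintext highlighter-rouge">z_iq = d - z_ij</code> used above. The matrix
<code class="language-plaintext highlighter-rouge">A_toy</code> and the other names are hypothetical and are not part of the
original example.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical 4-node undirected network (not the matrix A used above)
A_toy &lt;- matrix(c(0, 1, 1, 0,
                  1, 0, 1, 0,
                  1, 1, 0, 1,
                  0, 0, 1, 0), nrow = 4, byrow = TRUE)

# Total resources i spends across all of its connections (assumed denominator)
z_total &lt;- rowSums(A_toy)

# p_ij: share of i's resources allocated to j. z_total is recycled down the
# rows, so row i is divided by z_total[i]
P_toy &lt;- A_toy / z_total

# Each row sums to 1 because every node here has at least one connection
rowSums(P_toy)
</code></pre></div></div>

<p>Under this reading, a node with a single contact puts its whole budget into
that tie, while a node with three contacts allocates one third to each.</p>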

<h3 id="redundancy-of-centrality-in-complete-graphs">Redundancy of Centrality in Complete graphs</h3>

<p>In some cases, centrality measures do not provide relevant information.
For instance, when an undirected network approaches a complete graph, in
which every pair of distinct vertices is connected by a unique edge,
$\forall {i \neq j}:E(v_i,v_j)=1$, all vertices occupy structurally
identical positions, so the measures cannot distinguish among them.
Consider the following example, where I generate a fully connected
graph:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Creating Adjacency Matrix</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">25</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> 
</span><span class="n">diag</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">G</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_adjacency_matrix</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"undirected"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Creating the table of centrality measures</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
    </span><span class="n">degree</span><span class="p">(</span><span class="n">G</span><span class="p">),</span><span class="w">
    </span><span class="n">closeness</span><span class="p">(</span><span class="n">G</span><span class="p">),</span><span class="w">
    </span><span class="n">constraint</span><span class="p">(</span><span class="n">G</span><span class="p">),</span><span class="w">
    </span><span class="n">transitivity</span><span class="p">(</span><span class="n">G</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'local'</span><span class="p">),</span><span class="w">
    </span><span class="n">eigen_centrality</span><span class="p">(</span><span class="n">G</span><span class="p">)</span><span class="o">$</span><span class="n">vector</span><span class="p">,</span><span class="w">
    </span><span class="n">betweenness</span><span class="p">(</span><span class="n">G</span><span class="p">)</span><span class="w">
  </span><span class="p">)</span><span class="w">


</span><span class="n">colnames</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"\\.G\\..*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">cols</span><span class="p">))</span><span class="w">

</span><span class="n">kable</span><span class="p">(</span><span class="n">cols</span><span class="p">,</span><span class="w">  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"markdown"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<table>
  <thead>
    <tr>
      <th style="text-align: right">degree</th>
      <th style="text-align: right">closeness</th>
      <th style="text-align: right">constraint</th>
      <th style="text-align: right">transitivity</th>
      <th style="text-align: right">eigen_centrality</th>
      <th style="text-align: right">betweenness</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">4</td>
      <td style="text-align: right">0.25</td>
      <td style="text-align: right">0.765625</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
    </tr>
    <tr>
      <td style="text-align: right">4</td>
      <td style="text-align: right">0.25</td>
      <td style="text-align: right">0.765625</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
    </tr>
    <tr>
      <td style="text-align: right">4</td>
      <td style="text-align: right">0.25</td>
      <td style="text-align: right">0.765625</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
    </tr>
    <tr>
      <td style="text-align: right">4</td>
      <td style="text-align: right">0.25</td>
      <td style="text-align: right">0.765625</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
    </tr>
    <tr>
      <td style="text-align: right">4</td>
      <td style="text-align: right">0.25</td>
      <td style="text-align: right">0.765625</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">0</td>
    </tr>
  </tbody>
</table>
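
<p>The constraint value in the table can be reproduced by hand. The following
is a minimal sketch in base R, assuming Burt's constraint in the form
$C_i = \sum_{j \neq i} (p_{ij} + \sum_{q \neq i,j} p_{iq} p_{qj})^2$ with
$p_{ij}$ equal to the tie to $j$ divided by $i$'s degree. The names
<code class="language-plaintext highlighter-rouge">A5</code>, <code class="language-plaintext highlighter-rouge">P5</code>, <code class="language-plaintext highlighter-rouge">indirect</code>, and <code class="language-plaintext highlighter-rouge">constraint_manual</code> are
hypothetical.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Complete graph on 5 vertices, as in the example above
n_k &lt;- 5
A5 &lt;- matrix(1, nrow = n_k, ncol = n_k)
diag(A5) &lt;- 0

# Direct proportions: every tie receives 1/4 of a node's resources
P5 &lt;- A5 / rowSums(A5)

# Indirect proportions: sum over q of p_iq * p_qj. The terms q = i and q = j
# vanish because the diagonal of P5 is zero
indirect &lt;- P5 %*% P5

# Pairwise constraint, then summed over all contacts j != i
C_pair &lt;- (P5 + indirect)^2
diag(C_pair) &lt;- 0
constraint_manual &lt;- rowSums(C_pair)
constraint_manual
</code></pre></div></div>

<p>Each pair contributes $(1/4 + 3/16)^2 = 49/256$, and summing over the four
contacts gives $0.765625$ for every vertex, which matches the
<code class="language-plaintext highlighter-rouge">constraint</code> column.</p>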

<p>To show the redundancy of network measurements more clearly, I have
written a short simulation. The code calculates the different network
statistics while keeping the number of vertices constant and increasing
the number of connections until the network is fully connected. The
number of connections evaluated at each step is stored in the sequence
<code class="language-plaintext highlighter-rouge">conn &lt;- c(seq(from=points, to=triag.matrix, by=round(triag.matrix/points)), triag.matrix)</code>.
A loop iterates over this sequence and, at each step, randomly samples
<code class="language-plaintext highlighter-rouge">conn[i]</code> of the <code class="language-plaintext highlighter-rouge">triag.matrix</code> possible connections as follows: <code class="language-plaintext highlighter-rouge">sample(triag.matrix, conn[i])</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#### Compute the average for each network centrality ####</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">triag.matrix</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="p">((</span><span class="n">n</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">-</span><span class="n">n</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="w">
</span><span class="n">points</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">conn</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="o">=</span><span class="n">points</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="o">=</span><span class="n">triag.matrix</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="nf">round</span><span class="p">(</span><span class="n">triag.matrix</span><span class="o">/</span><span class="n">points</span><span class="p">)),</span><span class="w"> </span><span class="n">triag.matrix</span><span class="p">)</span><span class="w">
</span><span class="n">out.list</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">conn</span><span class="p">)){</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">  
</span><span class="n">val</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">triag.matrix</span><span class="p">,</span><span class="w"> </span><span class="n">conn</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="n">g</span><span class="p">[</span><span class="n">upper.tri</span><span class="p">(</span><span class="n">g</span><span class="p">)][</span><span class="n">val</span><span class="p">]</span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">+</span><span class="n">g</span><span class="w">
</span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_adjacency_matrix</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'undirected'</span><span class="p">)</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">transitivity</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'localundirected'</span><span class="p">)</span><span class="w">
</span><span class="n">t</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">t</span><span class="p">)]</span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">out.list</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
  </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
    </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">degree</span><span class="p">(</span><span class="n">g</span><span class="p">)),</span><span class="w">
    </span><span class="n">closeness</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">closeness</span><span class="p">(</span><span class="n">g</span><span class="p">)),</span><span class="w">
    </span><span class="n">betweennes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">betweenness</span><span class="p">(</span><span class="n">g</span><span class="p">)),</span><span class="w">
    </span><span class="n">transitivity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">t</span><span class="p">),</span><span class="w">
    </span><span class="n">eigen.cent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">eigen_centrality</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">vector</span><span class="p">),</span><span class="w">
    </span><span class="n">struc.holes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">constraint</span><span class="p">(</span><span class="n">g</span><span class="p">))</span><span class="w">
  </span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
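
<p>The edge-sampling step inside the loop can be checked on a small case. This
is a minimal sketch in base R that mirrors the same indexing trick on a
hypothetical 5-vertex network; <code class="language-plaintext highlighter-rouge">n_small</code>, <code class="language-plaintext highlighter-rouge">slots</code>, <code class="language-plaintext highlighter-rouge">k</code>, and
<code class="language-plaintext highlighter-rouge">g_small</code> are illustrative names, and the seed is arbitrary.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)                           # arbitrary seed, only for reproducibility
n_small &lt;- 5                          # hypothetical small network
slots &lt;- (n_small^2 - n_small) / 2    # 10 possible undirected edges
k &lt;- 4                                # number of edges to draw

g_small &lt;- matrix(0, ncol = n_small, nrow = n_small)
val &lt;- sample(slots, k)               # k distinct positions in the upper triangle
g_small[upper.tri(g_small)][val] &lt;- 1 # switch on the sampled edges
g_small &lt;- t(g_small) + g_small       # mirror to obtain a symmetric matrix

sum(g_small) / 2                      # number of edges, equal to k
isSymmetric(g_small)                  # TRUE: a valid undirected adjacency matrix
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">sample()</code> draws without replacement by default, each step of the
simulation produces exactly the requested number of distinct edges and no
self-loops.</p>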

<p>With the network statistics calculated, the next snippet plots them.
Each panel shows how the average of one network statistic changes as the
number of connections in the network increases. The vertical axis
represents the value of the statistic, and the vertical bars span one
standard deviation of the plotted series above and below each point. The
horizontal axis indexes the simulation steps, ordered by increasing
number of connections until they reach the fully connected graph.</p>

<p>As the plots show, when the connectivity level of the graph increases,
the network centrality measurements become more and more similar. The
results of this simulation suggest that network centrality measurements
become redundant as a graph approaches a fully connected network. This
is because all nodes in a fully connected network occupy the same
structural position.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">out.data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbindlist</span><span class="p">(</span><span class="n">out.list</span><span class="p">)</span><span class="w">
</span><span class="n">rm</span><span class="p">(</span><span class="n">out.list</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">

</span><span class="n">x</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">out.data</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">ncol</span><span class="p">(</span><span class="n">out.data</span><span class="p">)){</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">out.data</span><span class="p">[[</span><span class="n">j</span><span class="p">]]</span><span class="w">
</span><span class="n">y</span><span class="p">[</span><span class="nf">is.na</span><span class="p">(</span><span class="n">y</span><span class="p">)]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">st</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="nf">max</span><span class="p">(</span><span class="n">y</span><span class="p">)),</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">out.data</span><span class="p">)[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="p">)</span><span class="w">
</span><span class="n">segments</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="o">-</span><span class="n">st</span><span class="p">,</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="o">+</span><span class="n">st</span><span class="p">)</span><span class="w">
</span><span class="n">epsilon</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.02</span><span class="w">
</span><span class="n">segments</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">epsilon</span><span class="p">,</span><span class="n">y</span><span class="o">-</span><span class="n">st</span><span class="p">,</span><span class="n">x</span><span class="o">+</span><span class="n">epsilon</span><span class="p">,</span><span class="n">y</span><span class="o">-</span><span class="n">st</span><span class="p">)</span><span class="w">
</span><span class="n">segments</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">epsilon</span><span class="p">,</span><span class="n">y</span><span class="o">+</span><span class="n">st</span><span class="p">,</span><span class="n">x</span><span class="o">+</span><span class="n">epsilon</span><span class="p">,</span><span class="n">y</span><span class="o">+</span><span class="n">st</span><span class="p">)</span><span class="w">

</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-54-1.png?raw=true" alt="" /><!-- --></p>

<h2 id="network-analysis-with-data">Network analysis with Data</h2>

<p>In this code snippet, you will learn how to perform a basic network
analysis on a real-world dataset. We start by downloading a network
dataset, specifically the “ca-netscience” dataset, from an online
source. This dataset represents a co-authorship network in the field of
network science.</p>

<p>The code proceeds to unzip and load the dataset into R. We then
construct a graph from the dataset using the igraph package,
representing the relationships between authors. And finally, we
calculate various network metrics such as degree centrality, closeness
centrality, betweenness centrality and other network statistics.</p>

<p>I encourage you to explore similar datasets on the <a href="https://snap.stanford.edu/data/">Stanford Large
Network Dataset Collection</a> and the
<a href="http://networkrepository.com/">NetworkRepository</a>. Both offer a wide
range of empirical sample data for your analyses.</p>
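
<p>The loading code that follows skips the first two lines of the <code class="language-plaintext highlighter-rouge">.mtx</code>
file. Here is a minimal sketch of why, assuming the file follows the Matrix
Market coordinate layout with a one-line banner followed by a size line
and no additional comment lines. The text in <code class="language-plaintext highlighter-rouge">mtx_text</code> is a made-up
stand-in, not the contents of <code class="language-plaintext highlighter-rouge">ca-netscience.mtx</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical stand-in for a small Matrix Market file (not the real data)
mtx_text &lt;- "%%MatrixMarket matrix coordinate pattern symmetric
4 4 3
2 1
3 1
4 3"

# skip = 2 drops the banner line and the size line (rows, columns, entries),
# leaving one edge per row
edges &lt;- read.csv(text = mtx_text, sep = " ", header = FALSE, skip = 2)
edges
</code></pre></div></div>

<p>If a file carries extra comment lines after the banner, the value of
<code class="language-plaintext highlighter-rouge">skip</code> would need to be adjusted, so it is worth inspecting the first few
lines of any new dataset before loading it.</p>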

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Download network data</span><span class="w">
 </span><span class="c1"># setwd('C:/r_tutorial') </span><span class="w">
</span><span class="c1"># Data from: https://arxiv.org/abs/physics/0605087    </span><span class="w">
 </span><span class="n">download.file</span><span class="p">(</span><span class="s1">'http://nrvis.com/download/data/ca/ca-netscience.zip'</span><span class="p">,</span><span class="w">
                  </span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'ca-netscience.zip'</span><span class="p">)</span><span class="w">
 </span><span class="n">unzip</span><span class="p">(</span><span class="s1">'ca-netscience.zip'</span><span class="p">)</span><span class="w">
 </span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'ca-netscience.mtx'</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
 </span><span class="n">g</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">graph_from_data_frame</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
 </span><span class="n">E</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">weight</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">count_multiple</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="w">

 </span><span class="c1"># Perform a basic network analysis (degree, closeness, betweenness, </span><span class="w">
 </span><span class="c1"># transitivity, eigenvector.centrality)</span><span class="w">
 </span><span class="c1"># Store the results in a data.frame for analysis, add a column of ids. </span><span class="w">
 
 </span><span class="n">na</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
    </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">name</span><span class="p">,</span><span class="w">
    </span><span class="n">closeness</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">closeness</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"all"</span><span class="p">,</span><span class="w"> </span><span class="n">normalized</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">),</span><span class="w">
    </span><span class="n">degree</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">degree</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"all"</span><span class="p">,</span><span class="w"> </span><span class="n">normalized</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">loops</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">),</span><span class="w">
    </span><span class="n">strength</span><span class="o">=</span><span class="w"> </span><span class="n">strength</span><span class="p">(</span><span class="n">g</span><span class="p">),</span><span class="w">
    </span><span class="n">betweenness</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">betweenness</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">,</span><span class="w"> </span><span class="n">normalized</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">),</span><span class="w">
    </span><span class="n">struc_hole</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">constraint</span><span class="p">(</span><span class="n">g</span><span class="p">),</span><span class="w">
    </span><span class="n">transitivity</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">transitivity</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="s2">"localundirected"</span><span class="p">),</span><span class="w">
    </span><span class="n">eigen_centrality</span><span class="o">=</span><span class="w"> </span><span class="n">eigen_centrality</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="o">$</span><span class="n">vector</span><span class="p">)</span><span class="w">
 
</span><span class="n">kable</span><span class="p">(</span><span class="n">na</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,],</span><span class="w">  </span><span class="n">format</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"markdown"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<table>
  <thead>
    <tr>
      <th style="text-align: left"> </th>
      <th style="text-align: left">id</th>
      <th style="text-align: right">closeness</th>
      <th style="text-align: right">degree</th>
      <th style="text-align: right">strength</th>
      <th style="text-align: right">betweenness</th>
      <th style="text-align: right">struc_hole</th>
      <th style="text-align: right">transitivity</th>
      <th style="text-align: right">eigen_centrality</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">2</td>
      <td style="text-align: left">2</td>
      <td style="text-align: right">0.0004627</td>
      <td style="text-align: right">2</td>
      <td style="text-align: right">2</td>
      <td style="text-align: right">0.000</td>
      <td style="text-align: right">0.8650000</td>
      <td style="text-align: right">1.0000000</td>
      <td style="text-align: right">0.0148383</td>
    </tr>
    <tr>
      <td style="text-align: left">3</td>
      <td style="text-align: left">3</td>
      <td style="text-align: right">0.0004627</td>
      <td style="text-align: right">2</td>
      <td style="text-align: right">2</td>
      <td style="text-align: right">0.000</td>
      <td style="text-align: right">0.8650000</td>
      <td style="text-align: right">1.0000000</td>
      <td style="text-align: right">0.0148383</td>
    </tr>
    <tr>
      <td style="text-align: left">4</td>
      <td style="text-align: left">4</td>
      <td style="text-align: right">0.0005643</td>
      <td style="text-align: right">34</td>
      <td style="text-align: right">34</td>
      <td style="text-align: right">10834.473</td>
      <td style="text-align: right">0.0955073</td>
      <td style="text-align: right">0.1336898</td>
      <td style="text-align: right">0.4142993</td>
    </tr>
    <tr>
      <td style="text-align: left">5</td>
      <td style="text-align: left">5</td>
      <td style="text-align: right">0.0006075</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">17858.003</td>
      <td style="text-align: right">0.1071098</td>
      <td style="text-align: right">0.1823362</td>
      <td style="text-align: right">0.3562072</td>
    </tr>
    <tr>
      <td style="text-align: left">16</td>
      <td style="text-align: left">16</td>
      <td style="text-align: right">0.0005211</td>
      <td style="text-align: right">21</td>
      <td style="text-align: right">21</td>
      <td style="text-align: right">1131.347</td>
      <td style="text-align: right">0.1690397</td>
      <td style="text-align: right">0.2761905</td>
      <td style="text-align: right">0.3464503</td>
    </tr>
    <tr>
      <td style="text-align: left">44</td>
      <td style="text-align: left">44</td>
      <td style="text-align: right">0.0005882</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">12460.579</td>
      <td style="text-align: right">0.2941086</td>
      <td style="text-align: right">0.5000000</td>
      <td style="text-align: right">0.0885522</td>
    </tr>
    <tr>
      <td style="text-align: left">113</td>
      <td style="text-align: left">113</td>
      <td style="text-align: right">0.0005033</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">4601.692</td>
      <td style="text-align: right">0.1901735</td>
      <td style="text-align: right">0.2571429</td>
      <td style="text-align: right">0.0222461</td>
    </tr>
    <tr>
      <td style="text-align: left">131</td>
      <td style="text-align: left">131</td>
      <td style="text-align: right">0.0004760</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">3238.339</td>
      <td style="text-align: right">0.2118647</td>
      <td style="text-align: right">0.3333333</td>
      <td style="text-align: right">0.0213000</td>
    </tr>
    <tr>
      <td style="text-align: left">250</td>
      <td style="text-align: left">250</td>
      <td style="text-align: right">0.0005089</td>
      <td style="text-align: right">6</td>
      <td style="text-align: right">6</td>
      <td style="text-align: right">13.200</td>
      <td style="text-align: right">0.3374061</td>
      <td style="text-align: right">0.8666667</td>
      <td style="text-align: right">0.1470503</td>
    </tr>
    <tr>
      <td style="text-align: left">259</td>
      <td style="text-align: left">259</td>
      <td style="text-align: right">0.0004735</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">0.000</td>
      <td style="text-align: right">0.4537654</td>
      <td style="text-align: right">1.0000000</td>
      <td style="text-align: right">0.0176052</td>
    </tr>
  </tbody>
</table>

<h2 id="community-detection">Community Detection</h2>

<p>Lastly, I would like to show you how to perform a basic community
detection analysis. Community detection in network analysis is the
process of identifying groups or communities of nodes within a network
that are more densely connected to each other than to nodes outside
their community. These communities represent subsets of nodes which may
possess similar characteristics, functions, or roles within the network.</p>

<p>Here’s a brief explanation of some community detection methods available
in <code class="language-plaintext highlighter-rouge">igraph</code>:</p>

<p><code class="language-plaintext highlighter-rouge">edge.betweenness.community</code>: This method identifies communities based
on edge betweenness centrality. It removes edges with the highest
betweenness values iteratively, eventually breaking the network into
communities.</p>

<p><code class="language-plaintext highlighter-rouge">fastgreedy.community</code>: It uses a greedy optimization approach to find
hierarchical communities by optimizing modularity, a measure of
community structure quality.</p>

<p><code class="language-plaintext highlighter-rouge">infomap.community</code>: This method employs the Infomap algorithm, which
treats the network as a flow of information and detects communities by
minimizing the expected description length of the information flow.</p>

<p><code class="language-plaintext highlighter-rouge">label.propagation.community</code>: Each node starts with its own label and
repeatedly adopts the label most common among its neighbors, so communities
form as labels propagate through the network. It’s a simple and fast method.</p>

<p><code class="language-plaintext highlighter-rouge">leading.eigenvector.community</code>: This approach uses spectral graph
theory and the leading eigenvector of the network’s modularity matrix to
detect communities.</p>

<p><code class="language-plaintext highlighter-rouge">multilevel.community</code>: It’s a multilevel algorithm that optimizes
modularity by moving nodes between communities, iteratively improving
the community structure.</p>

<p><code class="language-plaintext highlighter-rouge">spinglass.community</code>: This method is based on spin glass models from
statistical physics, which find communities by minimizing a Hamiltonian
(energy) function.</p>

<p><code class="language-plaintext highlighter-rouge">walktrap.community</code>: It uses random walks to find communities by
detecting nodes that are more likely to be visited together during
random walks on the network.</p>
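
<p>Several of the methods above optimize modularity, so it may help to see
what that score measures. The following is a minimal sketch in plain Python
(not the R code used in this post, and not how <code class="language-plaintext highlighter-rouge">igraph</code> implements it). The
function name <code class="language-plaintext highlighter-rouge">modularity</code> and the toy graph are assumptions made for
illustration only. It scores a partition of an undirected, unweighted graph
by comparing the share of edges inside each community with the share expected
from node degrees alone.</p>

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    # Q = sum over communities of (internal edge share - expected share)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
good_split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {node: "a" for node in range(6)}

q_good = modularity(edges, good_split)
q_single = modularity(edges, one_block)
print(round(q_good, 4), round(q_single, 4))  # 0.3571 0.0
```

<p>Splitting the graph into its two triangles scores higher than putting
every node in a single community, which is the kind of improvement that
modularity-based methods search for.</p>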

<p>For this example I will use the same data as in the previous snippet.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">comms</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"edge.betweenness.community"</span><span class="p">,</span><span class="w"> </span><span class="s2">"fastgreedy.community"</span><span class="p">,</span><span class="w"> </span><span class="s2">"infomap.community"</span><span class="p">,</span><span class="w">
</span><span class="s2">"label.propagation.community"</span><span class="p">,</span><span class="w"> </span><span class="s2">"leading.eigenvector.community"</span><span class="p">,</span><span class="w">
</span><span class="s2">"multilevel.community"</span><span class="p">,</span><span class="w"> </span><span class="s2">"spinglass.community"</span><span class="p">,</span><span class="w">
</span><span class="s2">"walktrap.community"</span><span class="p">)</span><span class="w">

</span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">frame.color</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"white"</span><span class="w">
</span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">label</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">""</span><span class="w">
</span><span class="n">E</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">arrow.mode</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">

</span><span class="n">plot.comm</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">comm</span><span class="p">){</span><span class="w">
</span><span class="n">V</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">color</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">colors37</span><span class="p">[</span><span class="n">get</span><span class="p">(</span><span class="n">comm</span><span class="p">)(</span><span class="n">g</span><span class="p">)</span><span class="o">$</span><span class="n">membership</span><span class="p">]</span><span class="w">
</span><span class="n">l</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">qgraph.layout.fruchtermanreingold</span><span class="p">(</span><span class="n">get.edgelist</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="w"> </span><span class="n">names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">),</span><span class="w"> </span><span class="n">vcount</span><span class="o">=</span><span class="n">vcount</span><span class="p">(</span><span class="n">g</span><span class="p">),</span><span class="w">
      </span><span class="n">area</span><span class="o">=</span><span class="m">8</span><span class="o">*</span><span class="p">(</span><span class="n">vcount</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="p">),</span><span class="n">repulse.rad</span><span class="o">=</span><span class="p">(</span><span class="n">vcount</span><span class="p">(</span><span class="n">g</span><span class="p">)</span><span class="o">^</span><span class="m">3.1</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">g</span><span class="p">,</span><span class="n">layout</span><span class="o">=</span><span class="n">l</span><span class="p">,</span><span class="n">vertex.size</span><span class="o">=</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="o">=</span><span class="w"> </span><span class="n">comm</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="nf">invisible</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">comms</span><span class="p">,</span><span class="w"> </span><span class="n">plot.comm</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-1.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-2.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-3.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-4.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-5.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-6.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-7.png?raw=true" alt="" /><!-- --><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/2023-09-14-intro_net_anlysis_R/unnamed-chunk-56-8.png?raw=true" alt="" /><!-- --></p>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Backtesting a Crossover Moving Average Strategy Algorithm in the Forex Market.</title><link href="https://mario1084.github.io/blog/2023/09/06/sma_crossover_st.html" rel="alternate" type="text/html" title="Backtesting a Crossover Moving Average Strategy Algorithm in the Forex Market." /><published>2023-09-06T00:00:00+00:00</published><updated>2023-09-06T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2023/09/06/sma_crossover_st</id><content type="html" xml:base="https://mario1084.github.io/blog/2023/09/06/sma_crossover_st.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>This post explores the effectiveness of a straightforward investment
algorithm in the Forex market. Among the many algorithmic possibilities,
I chose a simple one as a stepping stone before delving into
more complex strategies. The inspiration comes from Gurrib’s 2016 paper,
published in the Global Review of Accounting and Finance and available at
<a href="https://ssrn.com/abstract=2578302">SSRN</a>. Gurrib’s study, which
benchmarked a simple moving average (SMA) crossover strategy on daily S&amp;P500
candles between 1993 and 2014, reported a 24% return over
1593 investment days. Here I assess the
potential of a similar approach in the Forex market.</p>

<h2 id="sma-crossover-strategy">SMA Crossover Strategy</h2>

<p>The SMA crossover strategy assumes that the series contains
short-run and long-run trends. The short-run trend follows the series
closely and reacts more rapidly to its variations than the long-run
trend does. The core idea of the strategy is that
we can find buy or sell signals by monitoring the
intersections between the two trends. If
the short-run trend crosses above the long-run trend, the
value of the series is increasing (a golden cross), and
the algorithm sends a buy signal. Conversely, if the short-run
trend crosses below the long-run trend, the price
is decreasing (a dead cross, also called a death cross), and it is time to sell.</p>
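
<p>The rule described above can be sketched in a few lines of plain Python.
This is a simplified illustration under stated assumptions, not the
backtesting code used in this post: the function names <code class="language-plaintext highlighter-rouge">sma</code> and
<code class="language-plaintext highlighter-rouge">crossover_signals</code>, the toy price list, and the window sizes 2 and 4
are all hypothetical.</p>

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 >= window:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
        else:
            out.append(None)
    return out

def crossover_signals(prices, short_window, long_window):
    """Return (index, 'buy' or 'sell') at each crossing of the two averages."""
    short, long_ = sma(prices, short_window), sma(prices, long_window)
    signals = []
    for i in range(1, len(prices)):
        # Skip positions where either average is not yet defined
        if None in (short[i - 1], long_[i - 1], short[i], long_[i]):
            continue
        prev_diff = short[i - 1] - long_[i - 1]
        diff = short[i] - long_[i]
        if diff > 0 and not prev_diff > 0:
            signals.append((i, "buy"))   # short crosses above long: golden cross
        elif not diff > 0 and prev_diff > 0:
            signals.append((i, "sell"))  # short crosses below long: dead cross
    return signals

# Hypothetical toy series: falls, rises, then falls again
prices = [10, 9, 8, 7, 8, 9, 10, 11, 10, 9, 8, 7]
signals = crossover_signals(prices, short_window=2, long_window=4)
print(signals)  # [(5, 'buy'), (9, 'sell')]
```

<p>The signals are emitted only at the crossing itself, not at every step
where one average sits above the other, which is what distinguishes a
crossover rule from a simple comparison of the two trends.</p>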

<p>To better understand the behavior of the algorithm, I have created a
visualization that shows the USD.SEK series, which I downloaded from
Interactive Brokers in 30-second candles (blue), together with the
short-run and long-run trends (red and green, respectively)
that I estimated using SMAs.</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/USD.SEK_SMA_3.gif?raw=true" alt="" /><!-- --></p>

<p>If you are interested, you can recreate the animation with the Python code below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">matplotlib.animation</span> <span class="kn">import</span> <span class="n">FuncAnimation</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">display</span><span class="p">,</span> <span class="n">clear_output</span>

<span class="c1"># change working directory
</span>
<span class="c1"># Specify the target directory
</span><span class="n">new_directory</span> <span class="o">=</span> <span class="sa">r</span><span class="s">'C:\Users\mglez\Documents\PHD\Semester 16\01092023_SMA_blog\material\dynamc_graph'</span>

<span class="c1"># Change the working directory
</span><span class="n">os</span><span class="p">.</span><span class="n">chdir</span><span class="p">(</span><span class="n">new_directory</span><span class="p">)</span>

<span class="c1"># Read the CSV file and extract the first column of the USD.SEK as y1
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'USDSEK_dur_5D_candle_30sec_2023-08-20_2023-08-24_CLOSE.csv'</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">301</span><span class="p">:,</span> <span class="mi">1</span><span class="p">].</span><span class="n">values</span>
<span class="c1"># Loading the simple moving averages
</span><span class="n">y2</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">301</span><span class="p">:,</span> <span class="mi">2</span><span class="p">].</span><span class="n">values</span>
<span class="n">y3</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">301</span><span class="p">:,</span> <span class="mi">3</span><span class="p">].</span><span class="n">values</span>

<span class="c1"># Create x data
</span><span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">y1</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>

<span class="c1"># Initialize the plot
</span><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">line1</span><span class="p">,</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y1</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'USD.SEK'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>
<span class="n">line2</span><span class="p">,</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'SMA 5'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">)</span>
<span class="n">line3</span><span class="p">,</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y3</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'SMA 300'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'green'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'SMA Crossover Strategy: USD.SEK'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>

<span class="c1"># Set the initial y-axis limits to display values between 10.9 and 11
</span><span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mf">10.9</span><span class="p">,</span> <span class="mi">11</span><span class="p">)</span>

<span class="c1"># Initialize text annotation
</span><span class="n">text</span> <span class="o">=</span> <span class="n">ax</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.90</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">transform</span><span class="o">=</span><span class="n">ax</span><span class="p">.</span><span class="n">transAxes</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>

<span class="c1"># Number of values to display on the horizontal axis
</span><span class="n">num_values_to_display</span> <span class="o">=</span> <span class="mi">100</span>

<span class="c1"># Function to update the plot for each frame
</span><span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="n">frame</span><span class="p">):</span>
    <span class="n">x_data</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">]</span>
    <span class="n">line1</span><span class="p">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">x_data</span><span class="p">,</span> <span class="n">y1</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">])</span>
    <span class="n">line2</span><span class="p">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">x_data</span><span class="p">,</span> <span class="n">y2</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">])</span>
    <span class="n">line3</span><span class="p">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">x_data</span><span class="p">,</span> <span class="n">y3</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">])</span>
    
    <span class="c1"># Calculate the x-axis limits dynamically based on the frame and num_values_to_display
</span>    <span class="n">min_x</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">)</span>
    <span class="n">max_x</span> <span class="o">=</span> <span class="n">frame</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="n">min_x</span><span class="p">,</span> <span class="n">max_x</span><span class="p">)</span>

    <span class="c1"># Check if there are enough data points to calculate min and max
</span>    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">x_data</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">num_values_to_display</span><span class="p">:</span>
        <span class="n">min_y</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">y1</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">]),</span> <span class="nb">min</span><span class="p">(</span><span class="n">y2</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">]),</span> <span class="nb">min</span><span class="p">(</span><span class="n">y3</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">]))</span>
        <span class="n">max_y</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">y1</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">]),</span> <span class="nb">max</span><span class="p">(</span><span class="n">y2</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">]),</span> <span class="nb">max</span><span class="p">(</span><span class="n">y3</span><span class="p">[</span><span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">frame</span> <span class="o">-</span> <span class="n">num_values_to_display</span><span class="p">):</span><span class="n">frame</span><span class="p">]))</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="n">min_y</span> <span class="o">-</span> <span class="mf">0.009</span><span class="p">,</span> <span class="n">max_y</span> <span class="o">+</span> <span class="mf">0.009</span><span class="p">)</span>
    
    <span class="k">if</span> <span class="n">y2</span><span class="p">[</span><span class="n">frame</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">y3</span><span class="p">[</span><span class="n">frame</span><span class="p">]:</span>
        <span class="n">text</span><span class="p">.</span><span class="n">set_text</span><span class="p">(</span><span class="s">'Buy'</span><span class="p">)</span>
        <span class="n">text</span><span class="p">.</span><span class="n">set_color</span><span class="p">(</span><span class="s">'red'</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">text</span><span class="p">.</span><span class="n">set_text</span><span class="p">(</span><span class="s">'Sell'</span><span class="p">)</span>
        <span class="n">text</span><span class="p">.</span><span class="n">set_color</span><span class="p">(</span><span class="s">'green'</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">line1</span><span class="p">,</span> <span class="n">line2</span><span class="p">,</span> <span class="n">line3</span><span class="p">,</span> <span class="n">text</span>

<span class="c1"># Create the animation
</span><span class="n">ani</span> <span class="o">=</span> <span class="n">FuncAnimation</span><span class="p">(</span><span class="n">fig</span><span class="p">,</span> <span class="n">update</span><span class="p">,</span> <span class="n">frames</span><span class="o">=</span><span class="n">n</span><span class="p">,</span> <span class="n">interval</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>  <span class="c1"># Update every 300 ms (0.3 seconds)
</span>
<span class="c1"># Display the animation in Jupyter Notebook
</span><span class="n">display</span><span class="p">(</span><span class="n">fig</span><span class="p">)</span>

<span class="k">try</span><span class="p">:</span>
    <span class="c1"># Continuously update the animation
</span>    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">update</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="n">clear_output</span><span class="p">(</span><span class="n">wait</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="n">display</span><span class="p">(</span><span class="n">fig</span><span class="p">)</span>
<span class="k">except</span> <span class="nb">KeyboardInterrupt</span><span class="p">:</span>
    <span class="k">pass</span>
</code></pre></div></div>

<h2 id="requesting-data-from-interactive-brokers">Requesting Data from Interactive Brokers</h2>

<p>To request data from Interactive Brokers, you need a trading account and
a connection between the Trader Workstation API and Python or R. If you are
using R, you need to install the <code class="language-plaintext highlighter-rouge">IBrokers</code> package before you attempt
to download the data. Explaining how to download data from the API could be
a tutorial in itself, but the key elements are a
connection to the API, a contract with the correct symbol for the instrument, a
duration, and a candle size. In this example, I create a connection
that I call <code class="language-plaintext highlighter-rouge">tws</code>, then I create a contract with
<code class="language-plaintext highlighter-rouge">twsContract()</code> for the pair <code class="language-plaintext highlighter-rouge">USD.SEK</code>, and finally I
set a duration of <code class="language-plaintext highlighter-rouge">5 D</code> (5 days) with candles of <code class="language-plaintext highlighter-rouge">30 sec</code> (30 seconds).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="s1">'IBrokers'</span><span class="p">)</span><span class="w">


</span><span class="c1">#### ACCOUNT ####</span><span class="w">
</span><span class="n">tws</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">twsConnect</span><span class="p">(</span><span class="n">port</span><span class="o">=</span><span class="m">7496</span><span class="p">)</span><span class="w">
</span><span class="n">twsConnectionTime</span><span class="p">(</span><span class="n">tws</span><span class="p">)</span><span class="w">
</span><span class="n">ac</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">reqAccountUpdates</span><span class="p">(</span><span class="n">tws</span><span class="p">)</span><span class="w">

</span><span class="c1">#### CONTRACT ####</span><span class="w">
</span><span class="n">contract</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">twsContract</span><span class="p">()</span><span class="w">
</span><span class="n">contract</span><span class="o">$</span><span class="n">symbol</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"USD"</span><span class="w">
</span><span class="n">contract</span><span class="o">$</span><span class="n">sectype</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"CASH"</span><span class="w">
</span><span class="n">contract</span><span class="o">$</span><span class="n">currency</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"SEK"</span><span class="w">
</span><span class="n">contract</span><span class="o">$</span><span class="n">exch</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"IDEALPRO"</span><span class="w">
</span><span class="n">contract</span><span class="o">$</span><span class="n">includeExpired</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="n">is.twsContract</span><span class="p">(</span><span class="n">contract</span><span class="p">)</span><span class="w">


</span><span class="c1">#### REQUEST HIST DATA ####</span><span class="w">

</span><span class="n">duration</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"5 D"</span><span class="w">
</span><span class="n">barSizeSetting</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="s2">"30 sec"</span><span class="w">

</span><span class="n">data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">reqHistoricalData</span><span class="p">(</span><span class="n">conn</span><span class="o">=</span><span class="n">tws</span><span class="p">,</span><span class="w"> </span><span class="n">Contract</span><span class="o">=</span><span class="n">contract</span><span class="p">,</span><span class="w"> </span><span class="n">duration</span><span class="o">=</span><span class="n">duration</span><span class="p">,</span><span class="w"> </span><span class="n">barSize</span><span class="o">=</span><span class="n">barSizeSetting</span><span class="p">,</span><span class="w"> </span><span class="n">whatToShow</span><span class="o">=</span><span class="s2">"MIDPOINT"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The settings that I am using in Trader Workstation are the following:</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/00_set_up_API.gif?raw=true" alt="" /><!-- --></p>

<h2 id="optimizing-the-sma-crossover-bands-short-run-backtesting">Optimizing the SMA Crossover bands (Short-run Backtesting)</h2>

<p>A key requirement for the success of the algorithm is to identify which
pair of bands (short and long-run) best predicts changes in the
behavior of the series. We are interested in benchmarking a series of
pairs of bands so that we can identify which combination is the most
profitable. In other words, the selection criterion is the highest balance
at the end of the testing period.</p>

<p>There is probably a faster, vectorized way to identify the intersections,
but I think the final profit can only be calculated with
a loop, because the margin of profit/loss changes in every transaction
and the accumulation of capital depends on this iterative process.</p>

<p>The range I selected for the SMA is <code class="language-plaintext highlighter-rouge">5:100</code> for the short run and
<code class="language-plaintext highlighter-rouge">10:300</code> for the long run. In each iteration the algorithm selects
one pair of bands, fits the corresponding models, calculates benchmarks
and estimates the capital at the end of the period. In effect, the algorithm
tests <code class="language-plaintext highlighter-rouge">27936</code> combinations (96 short-run values times 291 long-run
values) and saves measurements of performance for later analysis. The
optimization of the bands was conducted on a data set that covers 5 days in
candles of 30 seconds, a total of <code class="language-plaintext highlighter-rouge">94624</code> data points.</p>
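
<p>For reference, the grid of band pairs can also be built explicitly with
<code class="language-plaintext highlighter-rouge">expand.grid</code>. This is only a sketch: the filter that keeps pairs where
the short window is strictly shorter than the long one is an optional
assumption and is not applied by the loop in this post.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># All combinations of short-run (n) and long-run (m) windows
grid &lt;- expand.grid(n = 5:100, m = 10:300)
nrow(grid)

# Optional (assumption): drop pairs where the "short" window is not shorter
valid &lt;- grid[grid$n &lt; grid$m, ]
nrow(valid)
</code></pre></div></div>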

<p>Similar to the study by Gurrib (2016), I assume that:</p>

<ol>
  <li>The frequency of data is set to candles of 30 seconds.</li>
  <li>The effects of discounts, taxes and commissions are ignored.</li>
  <li>All orders occur immediately at market prices.</li>
  <li>Limit and stop order options are not allowed at this stage.</li>
</ol>
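
<p>To give a sense of what the second assumption leaves out, the sketch below
compounds a flat commission over several round trips. The rate <code class="language-plaintext highlighter-rouge">0.00075</code>
is the value that appears commented out in the script. Charging it once on
each buy and once on each sell, with no price movement, is an assumption made
for illustration.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>commission_rate &lt;- 0.00075      # value commented out in the script
round_trips &lt;- c(1, 10, 100)    # illustrative numbers of buy/sell cycles

# Each round trip pays the fee twice: once on the buy and once on the sell
remaining &lt;- (1 - commission_rate)^(2 * round_trips)

# Share of the starting capital left if prices did not move at all
data.frame(round_trips, remaining_share = round(remaining, 4))
</code></pre></div></div>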

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">perf_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">13</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">perf_df</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
  </span><span class="nf">c</span><span class="p">(</span><span class="w">
    </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"m"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"capital"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"numTrades"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"trades_per_min"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"numWinningTrades"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"numLosingTrades"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"mae_short"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"mae_long"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"rmse_short"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"rmse_long"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"corr_short"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"corr_long"</span><span class="w">
  </span><span class="p">)</span><span class="w">

</span><span class="c1"># Set commission rate</span><span class="w">
</span><span class="c1"># commission_rate &lt;- 0.00075</span><span class="w">
</span><span class="n">commission_rate</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">

</span><span class="c1"># Loop over n and m</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">5</span><span class="o">:</span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">m</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">10</span><span class="o">:</span><span class="m">300</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="c1"># Calculate moving averages</span><span class="w">
    </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">SMA</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
    </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">SMA</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="w">
    </span><span class="c1"># data$sma_short[is.na(data$sma_short)] &lt;- 0</span><span class="w">
    </span><span class="c1"># data$sma_long[is.na(data$sma_long)] &lt;- 0</span><span class="w">
    
    </span><span class="c1"># Mean Absolute Error (MAE)</span><span class="w">
    </span><span class="n">mae_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">),</span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
    </span><span class="n">mae_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
    
    </span><span class="c1"># Root Mean Squared Error (RMSE)</span><span class="w">
    </span><span class="c1"># Square each error first, then average; summing before squaring lets errors cancel</span><span class="w">
    </span><span class="n">rmse_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
      </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
    </span><span class="n">rmse_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
    
    </span><span class="c1"># Correlation Coefficient</span><span class="w">
    </span><span class="n">corr_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cor</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"complete.obs"</span><span class="p">)</span><span class="w">
    </span><span class="n">corr_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cor</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"complete.obs"</span><span class="p">)</span><span class="w">
    
    
    </span><span class="c1"># Initialize variables</span><span class="w">
    </span><span class="n">init_capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2000</span><span class="w">
    </span><span class="n">capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2000</span><span class="w">
    </span><span class="n">pos</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
    </span><span class="n">numTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
    </span><span class="n">numWinningTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
    </span><span class="n">numLosingTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
    
    
    
    
    </span><span class="c1"># Backtest strategy</span><span class="w">
    </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">data</span><span class="p">)){</span><span class="w">      </span><span class="c1"># Check for a cross</span><span class="w">
      </span><span class="c1"># c &lt;- c + 1L</span><span class="w">
      </span><span class="c1"># #print cross</span><span class="w">
      </span><span class="c1"># print(paste0("cross: ", c))</span><span class="w">
      </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
          </span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
          </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
          </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">capital</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="c1"># Buy</span><span class="w">
        </span><span class="n">pos</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">capital</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">capital</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">commission_rate</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
        </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"BUY:"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">pos</span><span class="p">))</span><span class="w">
        </span><span class="n">capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
        </span><span class="n">numTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
      </span><span class="p">}</span><span class="w">
      </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
               </span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
               </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
               </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">pos</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="c1"># Sell</span><span class="w">
        </span><span class="n">capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">pos</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pos</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">commission_rate</span><span class="p">)</span><span class="w">
        </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"SELL:"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">capital</span><span class="p">))</span><span class="w">
        </span><span class="n">pos</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
        </span><span class="n">numTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
        
        </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">capital</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">init_capital</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="n">numWinningTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numWinningTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
          
        </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="n">numLosingTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numLosingTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
          
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
      
    </span><span class="p">}</span><span class="w">
    </span><span class="c1"># c &lt;- 0L</span><span class="w">
    </span><span class="c1"># Append performance inside the inner loop so every (n, m) pair is saved</span><span class="w">
    </span><span class="n">perf_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
      </span><span class="n">rbind</span><span class="p">(</span><span class="w">
        </span><span class="n">perf_df</span><span class="p">,</span><span class="w">
        </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
          </span><span class="n">n</span><span class="p">,</span><span class="w">
          </span><span class="n">m</span><span class="p">,</span><span class="w">
          </span><span class="n">capital</span><span class="p">,</span><span class="w">
          </span><span class="n">numTrades</span><span class="p">,</span><span class="w">
          </span><span class="n">trades_per_min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numTrades</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="w"> </span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">30</span><span class="p">)</span><span class="o">/</span><span class="m">60</span><span class="w"> </span><span class="p">),</span><span class="w">
          </span><span class="n">numWinningTrades</span><span class="p">,</span><span class="w">
          </span><span class="n">numLosingTrades</span><span class="p">,</span><span class="w">
          </span><span class="n">mae_short</span><span class="p">,</span><span class="w">
          </span><span class="n">mae_long</span><span class="p">,</span><span class="w">
          </span><span class="n">rmse_short</span><span class="p">,</span><span class="w">
          </span><span class="n">rmse_long</span><span class="p">,</span><span class="w">
          </span><span class="n">corr_short</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corr_short</span><span class="p">,</span><span class="w">
          </span><span class="n">corr_long</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corr_long</span><span class="w">
        </span><span class="p">)</span><span class="w">
      </span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="n">print</span><span class="p">(</span><span class="n">tail</span><span class="p">(</span><span class="n">perf_df</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">

  
</span><span class="n">colnames</span><span class="p">(</span><span class="n">perf_df</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
  </span><span class="nf">c</span><span class="p">(</span><span class="w">
    </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"m"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"capital"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"numTrades"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"trades_per_min"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"numWinningTrades"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"numLosingTrades"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"mae_short"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"mae_long"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"rmse_short"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"rmse_long"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"corr_short"</span><span class="p">,</span><span class="w">
    </span><span class="s2">"corr_long"</span><span class="w">
  </span><span class="p">)</span><span class="w">




</span><span class="c1">#### TOP PERFORMANCE ####</span><span class="w">
</span><span class="c1"># Print final max capital</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">capital</span><span class="p">),])</span><span class="w">

</span><span class="c1"># Print final max numWinningTrades  </span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">numWinningTrades</span><span class="w"> </span><span class="p">),])</span><span class="w">



</span><span class="c1">#### BEST FIT ####</span><span class="w">

</span><span class="c1"># Short-run window with the lowest rmse_short</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.min</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">rmse_short</span><span class="p">),</span><span class="m">1</span><span class="p">])</span><span class="w">

</span><span class="c1"># Short-run window with the lowest mae_short</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.min</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">mae_short</span><span class="p">),</span><span class="m">1</span><span class="p">])</span><span class="w">

</span><span class="c1"># Short-run window with the highest corr_short</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">corr_short</span><span class="p">),</span><span class="m">1</span><span class="p">])</span><span class="w">

</span><span class="c1"># Long-run window with the lowest rmse_long</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.min</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">rmse_long</span><span class="p">),</span><span class="m">2</span><span class="p">])</span><span class="w">

</span><span class="c1"># Long-run window with the lowest mae_long</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.min</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">mae_long</span><span class="p">),</span><span class="m">2</span><span class="p">])</span><span class="w">

</span><span class="c1"># Long-run window with the highest corr_long</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">perf_df</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="n">perf_df</span><span class="o">$</span><span class="n">corr_long</span><span class="p">),</span><span class="m">2</span><span class="p">])</span><span class="w">
</span></code></pre></div></div>

<p>In terms of performance (capital return), the winning pair is
<code class="language-plaintext highlighter-rouge">8, 300</code> (short and long run, respectively), followed closely by
<code class="language-plaintext highlighter-rouge">5, 300</code>.</p>
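
<p>A small sketch of how such a ranking can be read off the results table.
The data frame <code class="language-plaintext highlighter-rouge">perf_toy</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">perf_df</code>,
and its values are made up for illustration; they are not results from
the backtest.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy stand-in for perf_df (made-up values, not backtest results)
perf_toy &lt;- data.frame(
  n = c(10, 20, 30, 40),
  m = c(50, 100, 150, 200),
  capital = c(2001, 2015, 1990, 2010)
)

# Sort by final capital, highest first, and keep the two best pairs
ranked &lt;- perf_toy[order(perf_toy$capital, decreasing = TRUE), ]
head(ranked, 2)
</code></pre></div></div>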

<h2 id="long-run-backtesting">Long Run Backtesting</h2>

<p>To get a better idea of the behavior of the algorithm, I decided to run
it on 6 months of data in candles of 30 seconds, a total
of <code class="language-plaintext highlighter-rouge">2961360</code> data points. Testing the algorithm over six months
gives us a better perspective of how well the SMA bands capture the long
and short-run trends in the data and a better approximation of the
financial return.</p>

<h3 id="improvements">Improvements</h3>

<p>I made some small changes to the previous algorithm. First, I compute
the <code class="language-plaintext highlighter-rouge">grossprofit/loss</code> of each transaction. Second, I estimate the
return on investment (ROI) of each transaction, so that the average and
standard deviation of the returns can be computed at the end of the
exercise to approximate a Sharpe ratio. Third, I corrected a misleading
<code class="language-plaintext highlighter-rouge">numWinningTrades/numLosingTrades</code> indicator. In the previous
algorithm, I counted a trade as winning if the current capital was
higher than the initial capital after the transaction. It is more
accurate to count a trade as winning when
<code class="language-plaintext highlighter-rouge">buyPrice &gt; sellPrice</code>. This is a bit counterintuitive, but remember
that the algorithm buys when the price is increasing, so each USD
invested buys more Swedish kronor (SEK), and those kronor convert back
into more USD when they are sold at a lower price. For instance, imagine
that you invest 10 USD when the price is 12 SEK per USD (buying price),
which gives 120 SEK. If the algorithm then identifies a selling signal
at a price of 10.5 SEK (selling price), you get back 120 / 10.5 = 11.43
USD, a profit of 1.43 USD on this transaction.</p>
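
<p>The arithmetic of this example can be checked with a few lines of R.
This is a minimal sketch rather than part of the backtest: the helper
<code class="language-plaintext highlighter-rouge">trade_pnl</code> and its argument names are hypothetical, and
commissions are assumed to be zero.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: profit in USD of one round trip USD -&gt; SEK -&gt; USD.
# Assumes no commission, as in the worked example.
trade_pnl &lt;- function(usd, buy_price, sell_price) {
  sek &lt;- usd * buy_price        # SEK received when buying
  usd_back &lt;- sek / sell_price  # USD recovered when selling
  usd_back - usd                # profit (positive) or loss (negative)
}

round(trade_pnl(10, buy_price = 12, sell_price = 10.5), 2)  # 1.43
round(trade_pnl(10, buy_price = 10.5, sell_price = 12), 2)  # -1.25
</code></pre></div></div>

<p>The second call swaps the two prices to show that selling above the
buying price produces a loss under this convention.</p>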

<h3 id="first-run-of-the-algorithm">First run of the algorithm</h3>

<p>In the first run of the algorithm I wanted to secure winning
transactions only. I attempted this by adding the rule
<code class="language-plaintext highlighter-rouge">buyPrice &gt; as.numeric(data$USD.SEK.Close[i])</code>, which guarantees
that the selling price is always below the buying price, so every trade
is a winning one. The selling rule was transformed as follows:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
           </span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
           </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
           </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">pos</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">buyPrice</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span><span class="w"> 
</span></code></pre></div></div>

<p>Unfortunately, this change did not perform better than the regular
unconstrained moving average. The issue is that the series eventually
reaches local maximum or minimum values. For instance, if the algorithm
buys at a local minimum of the series, the selling condition
<code class="language-plaintext highlighter-rouge">buyPrice &gt; as.numeric(data$USD.SEK.Close[i])</code> may not be fulfilled
for a long time. The algorithm’s performance suffered because it bought
at a local minimum price and the price rose consistently afterwards.
This makes it unlikely for later prices to fall below the buying price,
so the capital stays locked in the position and overall performance
drops. In a nutshell, it seems that in order to take advantage of the
volatility of the series and make a higher profit, it is necessary to
lose some trades, as long as on average we win more often. This is the
reported performance of the first run:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">n</th>
      <th style="text-align: right">m</th>
      <th style="text-align: right">capital</th>
      <th style="text-align: right">net_profit</th>
      <th style="text-align: right">grossProfit</th>
      <th style="text-align: right">grossLoss</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">8</td>
      <td style="text-align: right">300</td>
      <td style="text-align: right">2086.763</td>
      <td style="text-align: right">86.763</td>
      <td style="text-align: right">86.763</td>
      <td style="text-align: right">0</td>
    </tr>
  </tbody>
</table>

<p>SMA constrained Crossover performance</p>

<p>The total profit over the six months was only 86.763 USD, a return on
investment of only 4.34%. However, as expected, the total number of
trades is low and, more importantly, there are no losing trades.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">buynumTrades</th>
      <th style="text-align: right">sellnumTrades</th>
      <th style="text-align: right">trades_per_min</th>
      <th style="text-align: right">numWinningTrades</th>
      <th style="text-align: right">numLosingTrades</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">56</td>
      <td style="text-align: right">55</td>
      <td style="text-align: right">0.001</td>
      <td style="text-align: right">55</td>
      <td style="text-align: right">0</td>
    </tr>
  </tbody>
</table>
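
<p>The lock-in effect of the constrained rule can be seen on a toy
series. The prices below are made up for illustration and are not taken
from the USD.SEK data: once the buy happens at a local minimum, no later
close satisfies the extra condition, so the position is never closed
within the window.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Made-up closing prices observed after a buy at a local minimum
buyPrice &lt;- 10.40
later_close &lt;- c(10.45, 10.52, 10.48, 10.60, 10.75)

# The constrained rule only allows a sale when buyPrice &gt; close
can_sell &lt;- buyPrice &gt; later_close
any(can_sell)  # FALSE: the position stays open for the whole window
</code></pre></div></div>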

<h3 id="second-run-of-the-algorithm">Second run of the algorithm</h3>

<p>In my second attempt I ran the unconstrained version (original version)
with the additional elements that I discussed previously, as follows:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">8</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">300</span><span class="w">

</span><span class="c1"># Calculate moving averages</span><span class="w">
</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">SMA</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">SMA</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="c1"># data$sma_short[is.na(data$sma_short)] &lt;- 0</span><span class="w">
</span><span class="c1"># data$sma_long[is.na(data$sma_long)] &lt;- 0</span><span class="w">

</span><span class="c1"># Mean Absolute Error (MAE)</span><span class="w">
</span><span class="n">mae_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">),</span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">mae_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">),</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">

</span><span class="c1"># Root Mean Squared Error (RMSE)</span><span class="w">
</span><span class="n">rmse_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">
  </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">rmse_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">mean</span><span class="p">((</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">)</span><span class="w"> </span><span class="o">^</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">na.rm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">

</span><span class="c1"># Correlation Coefficient</span><span class="w">
</span><span class="n">corr_short</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cor</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"complete.obs"</span><span class="p">)</span><span class="w">
</span><span class="n">corr_long</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cor</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">,</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"complete.obs"</span><span class="p">)</span><span class="w">



</span><span class="c1"># Initialize variables</span><span class="w">
</span><span class="n">init_capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2000</span><span class="w">
</span><span class="n">capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2000</span><span class="w">
</span><span class="n">buyCapital</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">pos</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">grossPnL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">buynumTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">sellnumTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">numWinningTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">numLosingTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">grossProfit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">grossLoss</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">commission_rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">buyPrice</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">  </span><span class="c1"># Price at the most recent buy</span><span class="w">
</span><span class="n">sellPrice</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">roi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="s2">"numeric"</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">

</span><span class="c1"># Backtest strategy</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">data</span><span class="p">)){</span><span class="w">      </span><span class="c1"># Check for a cross</span><span class="w">
  </span><span class="c1"># ...</span><span class="w">
  </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
      </span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
      </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;=</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
      </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">capital</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="c1"># Buy</span><span class="w">
    </span><span class="n">buyPrice</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">  </span><span class="c1"># Store the buy price</span><span class="w">
    </span><span class="n">buyCapital</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">capital</span><span class="w">
    </span><span class="n">pos</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">capital</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">capital</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">commission_rate</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">buyPrice</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"BUY:"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">pos</span><span class="p">))</span><span class="w">
    </span><span class="n">capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
    </span><span class="n">buynumTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">buynumTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
  </span><span class="p">}</span><span class="w">
  </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
           </span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
           </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&gt;=</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w">
           </span><span class="n">data</span><span class="o">$</span><span class="n">sma_short</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sma_long</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="n">pos</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">0</span><span class="p">){</span><span class="w"> </span><span class="c1">#&amp;&amp; buyPrice &gt; as.numeric(data$USD.SEK.Close[i])</span><span class="w">
    </span><span class="c1"># Sell</span><span class="w">
    </span><span class="n">sellPrice</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">USD.SEK.Close</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">  </span><span class="c1"># Store the sell price</span><span class="w">
    </span><span class="n">capital</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">pos</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">sellPrice</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pos</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">sellPrice</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">commission_rate</span><span class="p">)</span><span class="w">
    </span><span class="n">grossPnL</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">capital</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">buyCapital</span><span class="w">
    </span><span class="n">print</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"SELL:"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">capital</span><span class="p">))</span><span class="w">
    </span><span class="n">pos</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="w">
    </span><span class="n">sellnumTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sellnumTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
    
    </span><span class="n">roi</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">roi</span><span class="p">,</span><span class="w"> </span><span class="p">((</span><span class="n">buyPrice</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">sellPrice</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">buyPrice</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">buyPrice</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">sellPrice</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="n">numWinningTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numWinningTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
      </span><span class="n">grossProfit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">grossProfit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">grossPnL</span><span class="w">
    </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="n">numLosingTrades</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numLosingTrades</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
      </span><span class="n">grossLoss</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">grossLoss</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">grossPnL</span><span class="p">)</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
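
<p>The loop checks each row for a cross. As a rough alternative sketch,
the same buy and sell signals can be flagged without a loop by comparing
each moving average with its own lagged value. The toy prices, the
window sizes (2 and 4) and the base-R helper
<code class="language-plaintext highlighter-rouge">sma</code> are assumptions for illustration; the post itself uses
<code class="language-plaintext highlighter-rouge">SMA()</code> with <code class="language-plaintext highlighter-rouge">n = 8</code> and <code class="language-plaintext highlighter-rouge">m = 300</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy closing prices, made up for illustration (not the USD.SEK series)
price &lt;- c(10, 10, 10, 11, 12, 13, 12, 11, 10, 9, 8, 9, 10, 11)

# Assumed helper: trailing simple moving average over the last k observations
sma &lt;- function(x, k) as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))

sma_short &lt;- sma(price, 2)
sma_long &lt;- sma(price, 4)

# Previous-row values, aligned with the current row
prev_short &lt;- c(NA, head(sma_short, -1))
prev_long &lt;- c(NA, head(sma_long, -1))

# Same conditions as the loop: previous row on one side, current row on the other
buy_signal &lt;- !is.na(prev_long) &amp; prev_short &lt;= prev_long &amp; sma_short &gt; sma_long
sell_signal &lt;- !is.na(prev_long) &amp; prev_short &gt;= prev_long &amp; sma_short &lt; sma_long

which(buy_signal)   # 13
which(sell_signal)  # 8
</code></pre></div></div>

<p>This only flags candidate rows. The position checks of the loop
(<code class="language-plaintext highlighter-rouge">capital &gt; 0</code> before a buy, <code class="language-plaintext highlighter-rouge">pos &gt; 0</code> before a sell) still have
to be applied when turning the signals into trades.</p>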

<p>The performance over the same six months of data is presented in the
table below. The net profit of the unconstrained simple moving average
was 317.278 USD on an initial investment of 2000 USD, a return on
investment of 15.86%. Not bad at all, considering that we only tested
half a year.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">n</th>
      <th style="text-align: right">m</th>
      <th style="text-align: right">capital</th>
      <th style="text-align: right">net_profit</th>
      <th style="text-align: right">grossProfit</th>
      <th style="text-align: right">grossLoss</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">8</td>
      <td style="text-align: right">300</td>
      <td style="text-align: right">2317.278</td>
      <td style="text-align: right">317.278</td>
      <td style="text-align: right">1700.261</td>
      <td style="text-align: right">1382.983</td>
    </tr>
  </tbody>
</table>

<p>SMA unconstrained Crossover performance</p>
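
<p>The figures in the table above are consistent with each other, which
can be verified with a short sketch. The values are copied from the
table; the profit factor (gross profit divided by gross loss) is an
extra metric added here for illustration and is not reported by the
backtest.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Figures copied from the table (unconstrained run, n = 8, m = 300)
init_capital &lt;- 2000
grossProfit &lt;- 1700.261
grossLoss &lt;- 1382.983

net_profit &lt;- grossProfit - grossLoss          # 317.278 USD
roi_total &lt;- net_profit / init_capital * 100   # about 15.86 %
profit_factor &lt;- grossProfit / grossLoss       # about 1.23 (extra metric)

round(c(net_profit = net_profit, roi_total = roi_total,
        profit_factor = profit_factor), 2)
</code></pre></div></div>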

<p>Notably, the unconstrained variant of the algorithm, without the
condition <code class="language-plaintext highlighter-rouge">buyPrice &gt; as.numeric(data$USD.SEK.Close[i])</code>, closed
approximately 20% of its trades at a loss (413 of 2025). This is quite
high, and it is an area of opportunity for further implementations of
the algorithm. I will start by testing a less restrictive selling
condition that allows selling at a loss, but only within a certain
margin, perhaps the standard deviation of the long-run SMA.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">buynumTrades</th>
      <th style="text-align: right">sellnumTrades</th>
      <th style="text-align: right">trades_per_min</th>
      <th style="text-align: right">numWinningTrades</th>
      <th style="text-align: right">numLosingTrades</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">2025</td>
      <td style="text-align: right">2025</td>
      <td style="text-align: right">0.001</td>
      <td style="text-align: right">1612</td>
      <td style="text-align: right">413</td>
    </tr>
  </tbody>
</table>

<p>Composition of the trades</p>
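
<p>The less restrictive selling condition mentioned above could look
roughly like the sketch below. It is only one possible reading of the
idea: the helper <code class="language-plaintext highlighter-rouge">allow_sell</code>, the argument <code class="language-plaintext highlighter-rouge">tol</code> and the example
prices are hypothetical, and the rule has not been backtested here.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical rule: on a down-cross, sell if the trade is winning, or if it
# loses by no more than `tol` (in price units).
allow_sell &lt;- function(buyPrice, price, tol) {
  buyPrice &gt; price - tol
}

allow_sell(buyPrice = 10.50, price = 10.40, tol = 0.05)  # TRUE: winning trade
allow_sell(buyPrice = 10.50, price = 10.53, tol = 0.05)  # TRUE: small loss within tol
allow_sell(buyPrice = 10.50, price = 10.70, tol = 0.05)  # FALSE: loss too large
</code></pre></div></div>

<p>In the backtest loop this check would take the place of the
commented-out <code class="language-plaintext highlighter-rouge">buyPrice &gt; as.numeric(data$USD.SEK.Close[i])</code> term, with
<code class="language-plaintext highlighter-rouge">tol</code> set, for instance, to <code class="language-plaintext highlighter-rouge">sd(data$sma_long, na.rm = TRUE)</code>.</p>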

<p>The distribution of the return on investment (ROI) of each trade is the
following:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">Min.</th>
      <th style="text-align: right">1st Qu.</th>
      <th style="text-align: right">Median</th>
      <th style="text-align: right">Mean</th>
      <th style="text-align: right">3rd Qu.</th>
      <th style="text-align: right">Max.</th>
      <th style="text-align: right">sd</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">-2.688</td>
      <td style="text-align: right">0.006</td>
      <td style="text-align: right">0.025</td>
      <td style="text-align: right">0.007</td>
      <td style="text-align: right">0.053</td>
      <td style="text-align: right">1.31</td>
      <td style="text-align: right">0.15</td>
    </tr>
  </tbody>
</table>

<p>ROI Summary Statistics</p>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/sma_roi_histogram.png?raw=true" alt="" /><!-- --></p>
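
<p>The per-trade ROI figures allow a rough Sharpe-style ratio. The sketch
below assumes a risk-free rate of zero and works per trade, without
annualising; the inputs are the rounded mean and standard deviation from
the summary table, so the result is only indicative.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rounded values from the ROI summary table (per-trade ROI, in percent)
roi_mean &lt;- 0.007
roi_sd &lt;- 0.15
risk_free &lt;- 0  # assumption: zero risk-free rate per trade

sharpe_per_trade &lt;- (roi_mean - risk_free) / roi_sd
round(sharpe_per_trade, 3)  # 0.047
</code></pre></div></div>

<p>With the full <code class="language-plaintext highlighter-rouge">roi</code> vector from the backtest, the same ratio would be
<code class="language-plaintext highlighter-rouge">mean(roi) / sd(roi)</code>.</p>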

<h2 id="final-remarks-and-areas-of-improvement">Final remarks and areas of improvement.</h2>

<p>The SMA crossover algorithm proved to be successful, achieving a total
return on investment of 15.86% over six months with 30-second candles.
However, it’s important to note that this performance heavily depends on
specific parameter values (bands), candle intervals, and the chosen
asset. In our testing, we explored 27,550 combinations of short- and
long-run bands over five days to identify the winning pair.</p>

<p>While the algorithm showed promise, there are areas for improvement.
First, we observed a relatively high gross loss (1382 USD) compared to
the gross profit (1700 USD), with approximately 20% of trades closed at
a loss. Enhancing the algorithm with additional rules, such as
introducing resistance bands, may help mitigate losses during market
uncertainties. Second, more realistic estimations of transaction
commissions need to be incorporated to provide a more accurate
representation of algorithm performance. It’s worth noting that
Interactive Brokers limits regular trading accounts to one-minute
candles, which may impact trading strategies. Looking ahead, optimizing
and testing the algorithm’s performance in the equity market,
particularly with stocks displaying higher returns and upward trends,
could yield even better results. Finally, the next phase involves
implementing the algorithm using live market data through the
<code class="language-plaintext highlighter-rouge">reqMktData</code> function and testing it in a paper
trading account to assess its real-time performance.</p>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Creating a Dashboard of CPU Benchmarks Using R and Python.</title><link href="https://mario1084.github.io/blog/2023/02/15/cpu_benchmark_plotly_dash.html" rel="alternate" type="text/html" title="Creating a Dashboard of CPU Benchmarks Using R and Python." /><published>2023-02-15T00:00:00+00:00</published><updated>2023-02-15T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2023/02/15/cpu_benchmark_plotly_dash</id><content type="html" xml:base="https://mario1084.github.io/blog/2023/02/15/cpu_benchmark_plotly_dash.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In this post I will show you how to create and deploy a dashboard with
a preview of the dataset alongside useful data visualization tools.
Dashboards are useful for creating interactive and customizable data
visualizations and web applications. They allow you to create a dynamic
user interface that can interact with data and update in real time,
which makes them excellent tools for data exploration, analysis, and sharing.
Dashboards can be used for a wide range of purposes, from monitoring
business metrics to visualizing scientific data.</p>

<p>To create the dashboard, I will combine R and Python to take advantage of
the strengths of each language for web scraping, data cleaning, dashboard
creation, and deployment. My source for CPU information and
benchmarks is the <a href="https://www.cpubenchmark.net/cpu_list.php">CPU
list from cpubenchmark.net</a>. The goal is to scrape this data to create
a dataset of benchmarks for our CPU dashboard example.</p>

<h2 id="web-scrapping-html-tables-with-rvest">Web Scraping HTML Tables with rvest</h2>

<p>The R library <code class="language-plaintext highlighter-rouge">rvest</code> has many useful functions for web
scraping. We are interested in a function that can transform HTML tables
(the content between <code class="language-plaintext highlighter-rouge">&lt;table&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;/table&gt;</code>) into readable data frames.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="s2">"rvest"</span><span class="p">)</span><span class="w">

</span><span class="c1">## Read data</span><span class="w">
</span><span class="n">webpage</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_html</span><span class="p">(</span><span class="s2">"https://www.cpubenchmark.net/cpu_list.php"</span><span class="p">)</span><span class="w">
</span><span class="n">tbls</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">webpage</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"table"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">html_table</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="nf">length</span><span class="p">(</span><span class="n">tbls</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 3
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">tbls</span><span class="p">[[</span><span class="m">2</span><span class="p">]])</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 6 × 5
##   `CPU Name`              `CPU Mark(higher is better)` Rank(lo…¹ CPU V…² Price…³
##   &lt;chr&gt;                   &lt;chr&gt;                            &lt;int&gt;   &lt;dbl&gt; &lt;chr&gt;  
## 1 AArch64 rev 2 (aarch64) 2,246                             2187      NA &lt;NA&gt;   
## 2 AArch64 rev 4 (aarch64) 1,797                             2439      NA &lt;NA&gt;   
## 3 AC8257V/WAB             774                               3269      NA &lt;NA&gt;   
## 4 AMD 3015Ce              2,088                             2263      NA &lt;NA&gt;   
## 5 AMD 3015e               2,691                             1969      NA &lt;NA&gt;   
## 6 AMD 3020e               2,446                             2069      NA &lt;NA&gt;   
## # … with abbreviated variable names ¹​`Rank(lower is better)`,
## #   ²​`CPU Value(higher is better)`, ³​`Price(USD)`
</code></pre></div></div>

<p>The next step is to store this table as a data frame and perform
some basic data cleaning. We are going to use regular expressions to
strip currency symbols and thousands separators, and then convert the
character strings into numeric values.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cpus_bench</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tbls</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w">
</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`CPU Mark(higher is better)`</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s2">","</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`CPU Mark(higher is better)`</span><span class="p">))</span><span class="w">
</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Rank(lower is better)`</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Rank(lower is better)`</span><span class="p">)</span><span class="w">
</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`CPU Value(higher is better)`</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`CPU Value(higher is better)`</span><span class="p">)</span><span class="w">
</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Price(USD)`</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">  </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"(^\\$)|(\\*$)"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Price(USD)`</span><span class="p">)</span><span class="w">
</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Price(USD)`</span><span class="w"> </span><span class="o">&lt;-</span><span class="w">  </span><span class="n">gsub</span><span class="p">(</span><span class="s2">","</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Price(USD)`</span><span class="p">)</span><span class="w">
</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Price(USD)`</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`Price(USD)`</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">cpus_bench</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 6 × 5
##   `CPU Name`              `CPU Mark(higher is better)` Rank(lo…¹ CPU V…² Price…³
##   &lt;chr&gt;                                          &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt;
## 1 AArch64 rev 2 (aarch64)                         2246      2187      NA      NA
## 2 AArch64 rev 4 (aarch64)                         1797      2439      NA      NA
## 3 AC8257V/WAB                                      774      3269      NA      NA
## 4 AMD 3015Ce                                      2088      2263      NA      NA
## 5 AMD 3015e                                       2691      1969      NA      NA
## 6 AMD 3020e                                       2446      2069      NA      NA
## # … with abbreviated variable names ¹​`Rank(lower is better)`,
## #   ²​`CPU Value(higher is better)`, ³​`Price(USD)`
</code></pre></div></div>
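<p>For readers who prefer to do the cleaning on the Python side, the same idea can be sketched with the standard <code class="language-plaintext highlighter-rouge">re</code> module. The helper name <code class="language-plaintext highlighter-rouge">to_number</code> and the sample strings are assumptions for illustration; the price format with a leading <code class="language-plaintext highlighter-rouge">$</code> and an optional trailing <code class="language-plaintext highlighter-rouge">*</code> is inferred from the regular expression in the R code above, not checked against the live page.</p>

```python
import re


def to_number(text):
    """Strip '$', a trailing '*', and thousands separators; return a float or None."""
    if text is None:
        return None
    cleaned = re.sub(r"(^\$)|(\*$)", "", text.strip())  # same pattern as the R gsub() call
    cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. "NA" or an empty cell


# Hypothetical raw cells in the style of the scraped table.
samples = ["2,246", "$11,805.00", "$6,645.00*", "NA", None]
print([to_number(s) for s in samples])
```

<p>Returning <code class="language-plaintext highlighter-rouge">None</code> for unparseable cells mirrors the <code class="language-plaintext highlighter-rouge">NA</code> values that <code class="language-plaintext highlighter-rouge">as.numeric</code> produces in R.</p>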

<p>Now that we have the data in good shape, it is time to retrieve more
information on CPU benchmarks. The <code class="language-plaintext highlighter-rouge">cpus_bench</code> table contains information on
4080 CPUs; however, to make the dashboard more effective, I want to
concentrate on the top 1000 CPUs according to the CPU Mark.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Sort according to CPU Mark</span><span class="w">
</span><span class="n">cpus_bench</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cpus_bench</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">cpus_bench</span><span class="o">$</span><span class="n">`CPU Mark(higher is better)`</span><span class="p">,</span><span class="w"> </span><span class="n">decreasing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">cpus_bench_1000</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cpus_bench</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">1000</span><span class="p">,]</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">cpus_bench_1000</span><span class="p">,</span><span class="w"> </span><span class="m">10L</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## # A tibble: 10 × 5
##    `CPU Name`                        CPU Mark(higher i…¹ Rank(…² CPU V…³ Price…⁴
##    &lt;chr&gt;                                           &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt;
##  1 AMD EPYC 9654                                  124119       1    10.5  11805 
##  2 AMD Ryzen Threadripper PRO 5995WX               96237       2    14.5   6645.
##  3 AMD EPYC 7773X                                  90731       3    21.4   4249 
##  4 AMD EPYC 7763                                   85944       4    23.4   3665 
##  5 AMD EPYC 7J13                                   85661       5    NA       NA 
##  6 AMD EPYC 7713                                   85521       6    23.1   3700.
##  7 AMD EPYC 7713P                                  83439       7    18.3   4550 
##  8 AMD Ryzen Threadripper PRO 3995WX               83097       8    13.3   6267.
##  9 AMD EPYC 7V13                                   82878       9    NA       NA 
## 10 AMD Ryzen Threadripper 3990X                    81109      10    11.5   7069 
## # … with abbreviated variable names ¹​`CPU Mark(higher is better)`,
## #   ²​`Rank(lower is better)`, ³​`CPU Value(higher is better)`, ⁴​`Price(USD)`
</code></pre></div></div>

<p>Very interesting, in the top-10 we find only AMD processors…</p>

<p>To add the rest of the CPU benchmarks, we are going to take advantage of
a simple expression that takes the name of a CPU and builds the URL of
its detail page. The web scraping loop will use this URL to retrieve the CPU
benchmarks.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1L</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"https://www.cpubenchmark.net/cpu.php?cpu="</span><span class="p">,</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">" "</span><span class="p">,</span><span class="w"> </span><span class="s2">"\\+"</span><span class="p">,</span><span class="w"> </span><span class="n">cpus_bench_1000</span><span class="o">$</span><span class="n">`CPU Name`</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "https://www.cpubenchmark.net/cpu.php?cpu=AMD+EPYC+9654"
</code></pre></div></div>
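<p>Replacing spaces with <code class="language-plaintext highlighter-rouge">+</code> is enough for a name like the one above, but names that contain other reserved characters would need percent-encoding as well. A minimal Python sketch of the same URL construction, using <code class="language-plaintext highlighter-rouge">urllib.parse.quote_plus</code> from the standard library, is shown below. The helper name <code class="language-plaintext highlighter-rouge">cpu_url</code> and the second CPU name are made up for the example, and I have not verified how the site treats encoded characters.</p>

```python
from urllib.parse import quote_plus

BASE_URL = "https://www.cpubenchmark.net/cpu.php?cpu="


def cpu_url(cpu_name):
    """Build the detail-page URL for a CPU name, percent-encoding unsafe characters."""
    return BASE_URL + quote_plus(cpu_name)


print(cpu_url("AMD EPYC 9654"))
# A made-up name containing '@', which a plain space-to-plus swap would leave as-is.
print(cpu_url("Example CPU @ 3.40GHz"))
```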

<p>The idea is to write a simple loop that iterates over the
top-1000 CPUs and gathers information on benchmarks such as
“integer_math(MOps/Sec)”, “floating_point_math(MOps/Sec)”, “find_prime_numbers(Million
Primes/Sec)”, etc.</p>

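<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="s2">"data.table"</span><span class="p">)</span><span class="w"> </span><span class="c1"># provides data.table(), setnames() and rbindlist()</span><span class="w">
</span><span class="c1"># read in HTML data</span><span class="w">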
</span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1L</span><span class="w">
</span><span class="n">df_bind</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1L</span><span class="o">:</span><span class="m">1000L</span><span class="p">){</span><span class="w">
</span><span class="n">webpage</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">read_html</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"https://www.cpubenchmark.net/cpu.php?cpu="</span><span class="p">,</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">" "</span><span class="p">,</span><span class="w"> </span><span class="s2">"\\+"</span><span class="p">,</span><span class="w"> </span><span class="n">cpus_bench_1000</span><span class="o">$</span><span class="n">`CPU Name`</span><span class="p">[</span><span class="n">i</span><span class="p">])))</span><span class="w">
</span><span class="n">tbls</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">webpage</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">html_nodes</span><span class="p">(</span><span class="s2">"table"</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w">
  </span><span class="n">html_table</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">data.table</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">tbls</span><span class="p">[[</span><span class="m">2</span><span class="p">]][</span><span class="m">2</span><span class="p">])))</span><span class="w">
</span><span class="k">if</span><span class="p">(</span><span class="nf">any</span><span class="p">(</span><span class="nf">class</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">%in%</span><span class="s2">"data.table"</span><span class="p">)){</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="n">ncol</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">9</span><span class="p">){</span><span class="w">
    </span><span class="n">setnames</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">tbls</span><span class="p">[[</span><span class="m">2</span><span class="p">]][</span><span class="m">1</span><span class="p">]))</span><span class="w">
    </span><span class="n">year</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">sub</span><span class="p">(</span><span class="s2">"^.*\\s"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="n">trimws</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s1">'&lt;/p&gt;.*$'</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s1">'^.*&lt;strong class="bg-table-row"&gt;CPU First Seen on Charts:&lt;/strong&gt;'</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">webpage</span><span class="p">)))))</span><span class="w">
    </span><span class="n">gz</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">sub</span><span class="p">(</span><span class="s2">"\\s.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="n">trimws</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s1">'&lt;/p&gt;.*$'</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s1">'^.*&lt;strong&gt;Clockspeed:&lt;/strong&gt;'</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">webpage</span><span class="p">)))))</span><span class="w">
    </span><span class="n">cores</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">trimws</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s1">'&lt;strong&gt;.*$'</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s1">'^.*&lt;strong&gt;Cores:&lt;/strong&gt;'</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">webpage</span><span class="p">))))</span><span class="w">
    </span><span class="n">threads</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">trimws</span><span class="p">(</span><span class="n">gsub</span><span class="p">(</span><span class="s1">'&lt;/p&gt;.*$'</span><span class="p">,</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s1">'^.*&lt;strong&gt;Threads:&lt;/strong&gt;'</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">webpage</span><span class="p">))))</span><span class="w">
    </span><span class="n">df_bind</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">cbind.data.frame</span><span class="p">(</span><span class="n">cpus_bench_1000</span><span class="p">[</span><span class="n">i</span><span class="p">,],</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">gz</span><span class="p">,</span><span class="w"> </span><span class="n">cores</span><span class="p">,</span><span class="w"> </span><span class="n">threads</span><span class="p">,</span><span class="w"> </span><span class="n">year</span><span class="p">)</span><span class="w">  
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">cpus_bench_full</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbindlist</span><span class="p">(</span><span class="n">df_bind</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">cpus_bench_full</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The algorithm may seem a bit intimidating, but it is actually quite
simple. It gathers pieces of information from specific parts of the
HTML code. For instance, to gather the CPU benchmarks
we always scrape the second table on the page, as
<code class="language-plaintext highlighter-rouge">tbls[[2]][2]</code>. Then we match the beginning and end of the HTML tags
that contain useful information, such as <code class="language-plaintext highlighter-rouge">gz</code>, <code class="language-plaintext highlighter-rouge">cores</code>, and <code class="language-plaintext highlighter-rouge">threads</code>,
in the HTML source code. The final data set looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##   X                          cpu_name cpu_mark.higher_is_better.
## 1 1                     AMD EPYC 9654                     124119
## 2 2 AMD Ryzen Threadripper PRO 5995WX                      95829
## 3 3                    AMD EPYC 7773X                      90731
## 4 4                     AMD EPYC 7763                      85944
## 5 5                     AMD EPYC 7J13                      85661
## 6 6                     AMD EPYC 7713                      85521
##   cpu_value.higher_is_better. price_usd integer_math.MOps.Sec.
## 1                       10.51  11805.00                 978227
## 2                       14.98   6399.00                 631867
## 3                       21.11   4299.00                 533457
## 4                       23.26   3695.00                 547840
## 5                          NA        NA                 555507
## 6                       23.11   3699.99                 533785
##   floating_point_math.MOps.Sec. find_prime_numbers.Million.Primes.Sec.
## 1                        522611                                     NA
## 2                        343904                                    676
## 3                        301129                                     NA
## 4                        299973                                    665
## 5                        300486                                    686
## 6                        272582                                    621
##   random_string_sorting.Thousand.Strings.Sec. data_encryption.MBytes.Sec.
## 1                                          NA                      187949
## 2                                         676                      132563
## 3                                          NA                      135770
## 4                                         665                      124591
## 5                                         686                      123954
## 6                                         621                      107100
##   data_compression.MBytes.Sec. physics.Frames.Sec.
## 1                           NA                  NA
## 2                           NA                  NA
## 3                           NA                  NA
## 4                           NA                  NA
## 5                           NA                  NA
## 6                           NA                  NA
##   extended_instructions.Million.Matrices.Sec. single_thread.MOps.Sec. ghz cores
## 1                                      200277                    2893   2    96
## 2                                      123388                    3302   2    64
## 3                                       91298                    2513   2    64
## 4                                       98801                    2576   2    64
## 5                                       99971                    2449   2    64
## 6                                       94897                    2718   2    64
##   threads year
## 1     192 2022
## 2     128 2022
## 3     128 2022
## 4     128 2021
## 5     128 2021
## 6     128 2021
</code></pre></div></div>
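<p>The field extraction described above can also be written with a capture group instead of two nested substitutions. The sketch below does this in Python with the standard <code class="language-plaintext highlighter-rouge">re</code> module. The HTML fragment is made up, modelled only on the tag patterns that the R loop matches, and the helper name <code class="language-plaintext highlighter-rouge">field</code> is hypothetical; the real page may be laid out differently.</p>

```python
import re

# Made-up fragment, modelled on the tag patterns matched in the R loop.
html = (
    '<p><strong>Clockspeed:</strong> 2.4 GHz</p>'
    '<p><strong>Cores:</strong> 96 <strong>Threads:</strong> 192</p>'
    '<p><strong class="bg-table-row">CPU First Seen on Charts:</strong> Q4 2022</p>'
)


def field(label, source, pattern=r"([\d.]+)"):
    """Return the first match of `pattern` right after 'label:</strong>', else None."""
    match = re.search(re.escape(label) + r":</strong>\s*" + pattern, source)
    return match.group(1) if match else None


clock_ghz = float(field("Clockspeed", html))
cores = int(field("Cores", html))
threads = int(field("Threads", html))
# Skip any non-tag text (such as a quarter label) before the four-digit year.
year = int(field("CPU First Seen on Charts", html, pattern=r"[^<]*?(\d{4})"))
print(clock_ghz, cores, threads, year)
```

<p>Parsing the clock speed as a float keeps the fractional part. In the R loop, <code class="language-plaintext highlighter-rouge">as.integer</code> keeps only the whole number, which would explain why <code class="language-plaintext highlighter-rouge">ghz</code> shows as 2 for every row in the preview above.</p>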

<h2 id="dasboard-in-plotly-dash-from-python">Dashboard in Plotly Dash from Python</h2>

<p>The library that I am going to use to create the dashboard is called
<code class="language-plaintext highlighter-rouge">Plotly Dash</code>. Plotly Dash has several advantages for deploying a
web application compared to other libraries. First, it offers a high level
of interactivity: users can play around with the
data, apply filters, and perform various operations. Second, in my
view, it is flexible, as it allows users to create and customize
different plots and layouts with relatively little effort.
Third, it integrates well with Pandas and
NumPy, the main libraries commonly used for data science
in Python. Finally, the library is relatively easy to deploy at zero
cost as a web application that can be embedded in a page or used as a
stand-alone service.</p>

<p>Don’t let the code overwhelm you! The structure of the dashboard Python
code is simple. We start by loading the packages that we are going to
use. The first line imports the top-level Dash package for use in the
script, and the next three lines import specific modules from the
<code class="language-plaintext highlighter-rouge">Dash</code> library, which are <code class="language-plaintext highlighter-rouge">dcc</code>, <code class="language-plaintext highlighter-rouge">html</code>, and <code class="language-plaintext highlighter-rouge">dash_table</code>. These modules
are needed to create the visual components of the dashboard, such as
tables, dropdown menus, graphs, and other HTML elements. <code class="language-plaintext highlighter-rouge">Pandas</code> is
imported to read and manipulate the dataset, which is stored in
a CSV file. Then, the Plotly graph objects (<code class="language-plaintext highlighter-rouge">graph_objs</code>) are imported
to create a pie chart, and Plotly Express (<code class="language-plaintext highlighter-rouge">px</code>) is imported to
create a scatter plot. After we have loaded the libraries and modules,
we load the dataset using the <code class="language-plaintext highlighter-rouge">read_csv</code> function from Pandas, either from a
local file or from a remote URL; in this case I am loading the data
from my GitHub repository.</p>

<p>The <code class="language-plaintext highlighter-rouge">app.layout</code> is where the components of the dashboard are defined,
such as tables, dropdowns, graphs, and other HTML elements. Here, you
can be creative and write a layout that is both visually appealing and
functional. I am going for a simple design with a header,
<code class="language-plaintext highlighter-rouge">html.H1('CPU Benchmark Data'),</code> and a thin line that works as a separator
between the sections of the dashboard, <code class="language-plaintext highlighter-rouge">html.Hr(),</code>. I start the
dashboard by presenting a preview of the first 10 rows of the dataset using
the function <code class="language-plaintext highlighter-rouge">dash_table.DataTable()</code>. The function has several
arguments, but perhaps the most important one is the data source,
<code class="language-plaintext highlighter-rouge">data=df.head(10).to_dict('records'),</code> which passes only the first 10
rows of the data.</p>
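<p>The <code class="language-plaintext highlighter-rouge">'records'</code> orientation is a list with one dictionary per row, keyed by column name. The sketch below reproduces that shape with the standard <code class="language-plaintext highlighter-rouge">csv</code> module instead of Pandas, so that the structure is easy to inspect. The inline CSV is a stand-in for the real file, with a few values copied from the preview above, and <code class="language-plaintext highlighter-rouge">first_records</code> is a hypothetical helper.</p>

```python
import csv
import io
from itertools import islice

# Tiny inline CSV standing in for test_cpus.csv; values taken from the preview table.
raw = """cpu_name,cores,threads
AMD EPYC 9654,96,192
AMD Ryzen Threadripper PRO 5995WX,64,128
AMD EPYC 7773X,64,128
"""


def first_records(text, n):
    """Return the first n rows as a list of {column: value} dicts."""
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, n))


records = first_records(raw, 2)
print(records)
```

<p>One difference worth keeping in mind: the <code class="language-plaintext highlighter-rouge">csv</code> module leaves every value as a string, whereas Pandas infers numeric column types when it reads the file.</p>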

<p>After the data preview, I add a line break, <code class="language-plaintext highlighter-rouge">html.Br()</code>, to mark the
beginning of another section of the web application, followed by
<code class="language-plaintext highlighter-rouge">html.H4('Histogram variable:')</code>, which displays the title of
the histogram. Next, I define a dropdown menu to select among the
columns of the dataset:</p>

<p><code class="language-plaintext highlighter-rouge">dcc.Dropdown(         id='variable-selector',         options=[{'label': i, 'value': i} for i in df.columns],         value='cpu_value(higher_is_better)'     )</code></p>

<p>This is followed by two radio buttons that are used to sort the table in
ascending or descending order according to the selected variable. The
code snippet is the following:</p>

<p><code class="language-plaintext highlighter-rouge">dcc.RadioItems(         id='sort-order',         options=[{'label': i, 'value': i} for i in ['Ascending', 'Descending']],         value='Ascending',         labelStyle={'display': 'inline-block'}     )</code></p>

<p>And finally, we display the histogram using the following function:</p>

<p><code class="language-plaintext highlighter-rouge">dcc.Graph(         id='histogram',         figure={}     )</code></p>

<p>The rest of the layout follows the same mechanics. I define a dropdown
menu, <code class="language-plaintext highlighter-rouge">dcc.Dropdown()</code>, to select a variable for the next plot, and then I
render the plot using the same <code class="language-plaintext highlighter-rouge">dcc.Graph()</code> function.</p>

<p>After defining the <code class="language-plaintext highlighter-rouge">app.layout</code>, we have to write the <code class="language-plaintext highlighter-rouge">app.callback</code>
decorators that bind the inputs and outputs of the interactive
components (i.e., the dropdown menus) to the graphs. The
<code class="language-plaintext highlighter-rouge">update_histogram_and_table</code>, <code class="language-plaintext highlighter-rouge">update_pie_chart</code>, and
<code class="language-plaintext highlighter-rouge">update_scatter_plot</code> functions are the callback functions that
update the graphs based on the user input.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">dash</span>
<span class="kn">from</span> <span class="nn">dash</span> <span class="kn">import</span> <span class="n">dcc</span>
<span class="kn">from</span> <span class="nn">dash</span> <span class="kn">import</span> <span class="n">html</span>
<span class="kn">from</span> <span class="nn">dash</span> <span class="kn">import</span> <span class="n">dash_table</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">plotly.graph_objs</span> <span class="k">as</span> <span class="n">go</span> <span class="c1"># for the pie chart
</span><span class="kn">import</span> <span class="nn">plotly.express</span> <span class="k">as</span> <span class="n">px</span> <span class="c1"># for the scatter plot
</span>
<span class="n">external_stylesheets</span> <span class="o">=</span> <span class="p">[</span><span class="s">'https://codepen.io/chriddyp/pen/bWLwgP.css'</span><span class="p">]</span>

<span class="n">app</span> <span class="o">=</span> <span class="n">dash</span><span class="p">.</span><span class="n">Dash</span><span class="p">(</span><span class="n">__name__</span><span class="p">,</span> <span class="n">external_stylesheets</span><span class="o">=</span><span class="n">external_stylesheets</span><span class="p">)</span>

<span class="c1"># Load the dataset
#df = pd.read_csv("test_cpus.csv")
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"https://raw.githubusercontent.com/Wario84/blog/main/assets/data/test_cpus.csv"</span><span class="p">)</span>

<span class="n">app</span><span class="p">.</span><span class="n">layout</span> <span class="o">=</span> <span class="n">html</span><span class="p">.</span><span class="n">Div</span><span class="p">([</span>
    <span class="n">html</span><span class="p">.</span><span class="n">H1</span><span class="p">(</span><span class="s">'CPU Benchmark Data'</span><span class="p">),</span>
    <span class="n">html</span><span class="p">.</span><span class="n">Hr</span><span class="p">(),</span>
    <span class="n">html</span><span class="p">.</span><span class="n">H3</span><span class="p">(</span><span class="s">'Data Preview:'</span><span class="p">),</span>
    <span class="n">dash_table</span><span class="p">.</span><span class="n">DataTable</span><span class="p">(</span>
        <span class="nb">id</span><span class="o">=</span><span class="s">'table'</span><span class="p">,</span>
        <span class="n">columns</span><span class="o">=</span><span class="p">[{</span><span class="s">"name"</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">"id"</span><span class="p">:</span> <span class="n">i</span><span class="p">}</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">],</span>
        <span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">).</span><span class="n">to_dict</span><span class="p">(</span><span class="s">'records'</span><span class="p">),</span>
        <span class="n">style_table</span><span class="o">=</span><span class="p">{</span><span class="s">'overflowX'</span><span class="p">:</span> <span class="s">'auto'</span><span class="p">},</span>
        <span class="n">style_cell</span><span class="o">=</span><span class="p">{</span><span class="s">'textAlign'</span><span class="p">:</span> <span class="s">'left'</span><span class="p">},</span>
        <span class="n">sort_action</span><span class="o">=</span><span class="s">'native'</span><span class="p">,</span>
        <span class="n">page_action</span><span class="o">=</span><span class="s">'none'</span><span class="p">,</span>
        <span class="n">style_data_conditional</span><span class="o">=</span><span class="p">[{</span>
            <span class="s">'if'</span><span class="p">:</span> <span class="p">{</span><span class="s">'row_index'</span><span class="p">:</span> <span class="s">'odd'</span><span class="p">},</span>
            <span class="s">'backgroundColor'</span><span class="p">:</span> <span class="s">'rgb(248, 248, 248)'</span>
        <span class="p">}]</span>
    <span class="p">),</span>
    <span class="n">html</span><span class="p">.</span><span class="n">Br</span><span class="p">(),</span>
    <span class="n">html</span><span class="p">.</span><span class="n">H4</span><span class="p">(</span><span class="s">'Histogram variable:'</span><span class="p">),</span>
    <span class="n">dcc</span><span class="p">.</span><span class="n">Dropdown</span><span class="p">(</span>
        <span class="nb">id</span><span class="o">=</span><span class="s">'variable-selector'</span><span class="p">,</span>
        <span class="n">options</span><span class="o">=</span><span class="p">[{</span><span class="s">'label'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'value'</span><span class="p">:</span> <span class="n">i</span><span class="p">}</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">],</span>
        <span class="n">value</span><span class="o">=</span><span class="s">'cpu_value(higher_is_better)'</span>
    <span class="p">),</span>
    <span class="n">dcc</span><span class="p">.</span><span class="n">RadioItems</span><span class="p">(</span>
        <span class="nb">id</span><span class="o">=</span><span class="s">'sort-order'</span><span class="p">,</span>
        <span class="n">options</span><span class="o">=</span><span class="p">[{</span><span class="s">'label'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'value'</span><span class="p">:</span> <span class="n">i</span><span class="p">}</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'Ascending'</span><span class="p">,</span> <span class="s">'Descending'</span><span class="p">]],</span>
        <span class="n">value</span><span class="o">=</span><span class="s">'Ascending'</span><span class="p">,</span>
        <span class="n">labelStyle</span><span class="o">=</span><span class="p">{</span><span class="s">'display'</span><span class="p">:</span> <span class="s">'inline-block'</span><span class="p">}</span>
    <span class="p">),</span>
    <span class="n">dcc</span><span class="p">.</span><span class="n">Graph</span><span class="p">(</span>
        <span class="nb">id</span><span class="o">=</span><span class="s">'histogram'</span><span class="p">,</span>
        <span class="n">figure</span><span class="o">=</span><span class="p">{}</span>
    <span class="p">),</span>
        <span class="n">html</span><span class="p">.</span><span class="n">Br</span><span class="p">(),</span>
    <span class="n">html</span><span class="p">.</span><span class="n">H4</span><span class="p">(</span><span class="s">'Pie-chart variable:'</span><span class="p">),</span>
     <span class="n">dcc</span><span class="p">.</span><span class="n">Dropdown</span><span class="p">(</span>
                <span class="nb">id</span><span class="o">=</span><span class="s">"variable-selector-2"</span><span class="p">,</span>
                <span class="n">options</span><span class="o">=</span><span class="p">[</span>
                    <span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"Ghz"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">:</span> <span class="s">"ghz"</span><span class="p">},</span>
                    <span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"Cores"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">:</span> <span class="s">"cores"</span><span class="p">},</span>
                    <span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"Threads"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">:</span> <span class="s">"threads"</span><span class="p">},</span>
                    <span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"Year"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">:</span> <span class="s">"year"</span><span class="p">},</span>
                    <span class="c1">#"ghz","cores","threads"
</span>
                <span class="p">],</span>
                <span class="c1">#style={"width": "45%"}
</span>                <span class="n">value</span><span class="o">=</span><span class="s">"cores"</span>
                
            <span class="p">),</span>
             <span class="n">dcc</span><span class="p">.</span><span class="n">Graph</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"pie-chart"</span><span class="p">),</span>
             <span class="n">html</span><span class="p">.</span><span class="n">Br</span><span class="p">(),</span>
    <span class="n">html</span><span class="p">.</span><span class="n">H4</span><span class="p">(</span><span class="s">'Scatter-Plot variable:'</span><span class="p">),</span>
    <span class="n">dcc</span><span class="p">.</span><span class="n">Dropdown</span><span class="p">(</span>
        <span class="nb">id</span><span class="o">=</span><span class="s">'variable-selector-3'</span><span class="p">,</span>
        <span class="n">options</span><span class="o">=</span><span class="p">[{</span><span class="s">'label'</span><span class="p">:</span> <span class="n">i</span><span class="p">,</span> <span class="s">'value'</span><span class="p">:</span> <span class="n">i</span><span class="p">}</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">],</span>
        <span class="n">value</span><span class="o">=</span><span class="s">'cpu_name'</span>
    <span class="p">),</span>
             
             <span class="n">dcc</span><span class="p">.</span><span class="n">Graph</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s">"scatter-plot"</span><span class="p">),</span>
<span class="p">])</span>

<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">callback</span><span class="p">(</span>
    <span class="p">[</span><span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Output</span><span class="p">(</span><span class="s">'histogram'</span><span class="p">,</span> <span class="s">'figure'</span><span class="p">),</span>
     <span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Output</span><span class="p">(</span><span class="s">'table'</span><span class="p">,</span> <span class="s">'data'</span><span class="p">)],</span>
    <span class="p">[</span><span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Input</span><span class="p">(</span><span class="s">'variable-selector'</span><span class="p">,</span> <span class="s">'value'</span><span class="p">),</span>
     <span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Input</span><span class="p">(</span><span class="s">'sort-order'</span><span class="p">,</span> <span class="s">'value'</span><span class="p">)]</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">update_histogram_and_table</span><span class="p">(</span><span class="n">variable</span><span class="p">,</span> <span class="n">sort_order</span><span class="p">):</span>
    <span class="n">df_sorted</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="n">variable</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">sort_order</span> <span class="o">==</span> <span class="s">'Ascending'</span><span class="p">:</span>
        <span class="n">df_sorted</span> <span class="o">=</span> <span class="n">df_sorted</span><span class="p">.</span><span class="n">iloc</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">data_table</span> <span class="o">=</span> <span class="n">df_sorted</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">).</span><span class="n">to_dict</span><span class="p">(</span><span class="s">'records'</span><span class="p">)</span>

    <span class="n">fig</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">'data'</span><span class="p">:</span> <span class="p">[{</span>
            <span class="s">'x'</span><span class="p">:</span> <span class="n">df</span><span class="p">[</span><span class="n">variable</span><span class="p">],</span>
            <span class="s">'type'</span><span class="p">:</span> <span class="s">'histogram'</span>
        <span class="p">}],</span>
        <span class="s">'layout'</span><span class="p">:</span> <span class="p">{</span>
            <span class="s">'title'</span><span class="p">:</span> <span class="s">'Histogram of '</span> <span class="o">+</span> <span class="n">variable</span><span class="p">,</span>
            <span class="s">'xaxis'</span><span class="p">:</span> <span class="p">{</span><span class="s">'title'</span><span class="p">:</span> <span class="n">variable</span><span class="p">},</span>
            <span class="s">'yaxis'</span><span class="p">:</span> <span class="p">{</span><span class="s">'title'</span><span class="p">:</span> <span class="s">'Count'</span><span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">fig</span><span class="p">,</span> <span class="n">data_table</span>

<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">callback</span><span class="p">(</span>
    <span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Output</span><span class="p">(</span><span class="s">"pie-chart"</span><span class="p">,</span> <span class="s">"figure"</span><span class="p">),</span>
    <span class="p">[</span><span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Input</span><span class="p">(</span><span class="s">"variable-selector-2"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">)]</span>
<span class="p">)</span>

<span class="k">def</span> <span class="nf">update_pie_chart</span><span class="p">(</span><span class="n">selected_column</span><span class="p">):</span>
    <span class="c1">#filtered_df = df[df['year'] == selected_column]
</span>    <span class="n">values</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">selected_column</span><span class="p">].</span><span class="n">value_counts</span><span class="p">().</span><span class="n">values</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">selected_column</span><span class="p">].</span><span class="n">value_counts</span><span class="p">().</span><span class="n">index</span>
    <span class="n">fig</span> <span class="o">=</span> <span class="n">go</span><span class="p">.</span><span class="n">Figure</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">[</span><span class="n">go</span><span class="p">.</span><span class="n">Pie</span><span class="p">(</span><span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="n">values</span><span class="p">)])</span>
    <span class="c1">#fig.update_layout(title=f"{selected_column} distribution in {selected_column}")
</span>    <span class="k">return</span> <span class="n">fig</span>

<span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">callback</span><span class="p">(</span>
    <span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Output</span><span class="p">(</span><span class="s">"scatter-plot"</span><span class="p">,</span> <span class="s">"figure"</span><span class="p">),</span>
    <span class="p">[</span><span class="n">dash</span><span class="p">.</span><span class="n">dependencies</span><span class="p">.</span><span class="n">Input</span><span class="p">(</span><span class="s">"variable-selector-3"</span><span class="p">,</span> <span class="s">"value"</span><span class="p">)]</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">update_scatter_plot</span><span class="p">(</span><span class="n">variable</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">px</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="s">"price_usd"</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">variable</span><span class="p">).</span><span class="n">update_layout</span><span class="p">(</span>
        <span class="n">xaxis</span><span class="o">=</span><span class="p">{</span><span class="s">"title"</span><span class="p">:</span> <span class="s">"Price (USD)"</span><span class="p">},</span>
        <span class="n">yaxis</span><span class="o">=</span><span class="p">{</span><span class="s">"title"</span><span class="p">:</span> <span class="n">variable</span><span class="p">.</span><span class="n">capitalize</span><span class="p">()},</span>
        <span class="n">margin</span><span class="o">=</span><span class="p">{</span><span class="s">"l"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span> <span class="s">"b"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span> <span class="s">"t"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span> <span class="s">"r"</span><span class="p">:</span> <span class="mi">10</span><span class="p">},</span>
        <span class="n">height</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span>
    <span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">app</span><span class="p">.</span><span class="n">run_server</span><span class="p">(</span><span class="n">debug</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="deploying-as-a-static-website">Deploying as a static website</h2>

<p>To deploy the dashboard as a web application, I rely
on this video put forward by the people from Plotly:
<a href="https://www.youtube.com/watch?v=H16dZMYmvqo">Deploy your Python
Data App to the Web for Free - Dash</a>. The procedure is simple and laid
out step by step: first, we put the <code class="language-plaintext highlighter-rouge">.py</code> Python script in a public
GitHub repository; then we open an account on <code class="language-plaintext highlighter-rouge">render.com</code> and follow the
setup steps shown in the video.</p>

<h2 id="the-final-cpu-dashboard">The final CPU Dashboard</h2>

<p>Finally, here is the CPU benchmark dashboard. For a better viewing experience, I invite you to check out
the hosted version at <a href="https://cpu-benchmark-plotly-dash.onrender.com/">cpu-benchmark-plotly-dash.onrender.com</a></p>

<iframe src="https://cpu-benchmark-plotly-dash.onrender.com/" title="w" width="800px" height="600px" frameborder="50">
</iframe>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Dynamic network of collaboration in Machine Learning using R and Python.</title><link href="https://mario1084.github.io/blog/2022/10/07/net_ml.html" rel="alternate" type="text/html" title="Dynamic network of collaboration in Machine Learning using R and Python." /><published>2022-10-07T00:00:00+00:00</published><updated>2022-10-07T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2022/10/07/net_ml</id><content type="html" xml:base="https://mario1084.github.io/blog/2022/10/07/net_ml.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In this blog entry, I use data from Web of Science to draw a
network of collaboration in the field of Machine Learning. I
identify the core universities behind the <code class="language-plaintext highlighter-rouge">2878</code> most highly
cited articles about Machine Learning in Web of Science. These
are among the most influential scientific contributions to the field, as
downloaded in October 2022.</p>

<p>I use data from Web of Science, one of the most widely used databases of
research publications and citations. Most universities have a license to
use this database for research purposes. My query is simple: I search the
multidisciplinary Web of Science Core Collection for the term “Machine
Learning” in the publication’s Title, Abstract, or Keywords.
I then filter the results, keeping only the highly cited publications in the
field.</p>

<h2 id="libraries">Libraries</h2>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">data.table</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">svglite</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## Warning: package 'svglite' was built under R version 4.2.1
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">igraph</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union
</code></pre></div></div>

<h2 id="data">Data</h2>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">csvs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">dir</span><span class="p">(</span><span class="n">pattern</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"savedrecs.*csv$"</span><span class="p">)</span><span class="w">
</span><span class="n">csvs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">csvs</span><span class="p">,</span><span class="w"> </span><span class="n">fread</span><span class="p">)</span><span class="w">
</span><span class="n">csvs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbindlist</span><span class="p">(</span><span class="n">csvs</span><span class="p">)</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">csvs</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 2878   72
</code></pre></div></div>

<h3 id="general-approach">General Approach</h3>

<p>I will use <code class="language-plaintext highlighter-rouge">R</code> and <code class="language-plaintext highlighter-rouge">Regex</code> (regular expressions) to clean the address
field of each publication and extract the affiliations of the authors.
Then I will use the <code class="language-plaintext highlighter-rouge">d3graph</code> library for <code class="language-plaintext highlighter-rouge">Python</code> to produce a dynamic
network of university collaboration.</p>

<h3 id="use-regex-to-clean-the-data">Use Regex to clean the data</h3>

<p>Now that we have put the files together, it’s time to extract the
data from the <code class="language-plaintext highlighter-rouge">Addresses</code> field. Looking closely at this column:</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">csvs</span><span class="o">$</span><span class="n">Addresses</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">]</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "[Muehlematter, Urs J.] Univ Zurich, Univ Hosp Zurich, Inst Diagnost &amp; Intervent Radiol, Zurich, Switzerland; [Daniore, Paola; Vokinger, Kerstin N.] Univ Zurich, Inst Law, CH-8001 Zurich, Switzerland"                                      
## [2] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci &amp; Elect Engn, Changsha 410082, Hunan, Peoples R China; [Zou, Quan] Univ Elect Sci &amp; Technol China, Inst Fundamental &amp; Frontier Sci, Chengdu 610054, Peoples R China"
## [3] "[Raissi, Maziar; Karniadakis, George Em] Brown Univ, Div Appl Math, Providence, RI 02912 USA"
</code></pre></div></div>

<p>The string follows a clear pattern: the authors are
enclosed in square brackets, for instance <code class="language-plaintext highlighter-rouge">[Muehlematter, Urs J.]</code>,
and the affiliation follows immediately after, starting with the university,
<code class="language-plaintext highlighter-rouge">Univ Zurich</code>. When an article has authors from two or more
institutions, the affiliation groups are separated by a <code class="language-plaintext highlighter-rouge">;</code> semicolon.
Because semicolons also separate the author names inside the brackets, I
split on a semicolon followed by an opening bracket.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First, split the record into one chunk per bracketed author group:</span><span class="w">
</span><span class="n">samp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">strsplit</span><span class="p">(</span><span class="n">csvs</span><span class="o">$</span><span class="n">Addresses</span><span class="p">[</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="s2">"; \\["</span><span class="w"> </span><span class="p">))</span><span class="w">
</span><span class="n">samp</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "[Fu, Xiangzheng; Cai, Lijun; Zeng, Xiangxiang] Hunan Univ, Coll Comp Sci &amp; Elect Engn, Changsha 410082, Hunan, Peoples R China"
## [2] "Zou, Quan] Univ Elect Sci &amp; Technol China, Inst Fundamental &amp; Frontier Sci, Chengdu 610054, Peoples R China"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Then we aim to extract the universities</span><span class="w">
</span><span class="n">samp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">".*\\] "</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">  </span><span class="n">samp</span><span class="p">)</span><span class="w">
</span><span class="n">samp</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Hunan Univ, Coll Comp Sci &amp; Elect Engn, Changsha 410082, Hunan, Peoples R China"                 
## [2] "Univ Elect Sci &amp; Technol China, Inst Fundamental &amp; Frontier Sci, Chengdu 610054, Peoples R China"
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Drop everything after the first comma, keeping only the institution</span><span class="w">
</span><span class="n">samp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">",.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">  </span><span class="n">samp</span><span class="p">)</span><span class="w">
</span><span class="n">samp</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] "Hunan Univ"                     "Univ Elect Sci &amp; Technol China"
</code></pre></div></div>

<h3 id="build-the-dataframe-publication-university">Build the dataframe publication-university</h3>

<p>Perfect, now we apply this to the whole dataset. Each publication gets
a unique identifier, so that universities collaborating on the same
publication share the same <code class="language-plaintext highlighter-rouge">id</code>.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ml_data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">i</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1L</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1L</span><span class="o">:</span><span class="n">nrow</span><span class="p">(</span><span class="n">csvs</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">strsplit</span><span class="p">(</span><span class="n">csvs</span><span class="o">$</span><span class="n">Addresses</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="s2">"; \\["</span><span class="w"> </span><span class="p">))</span><span class="w">
  </span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">".*\\] "</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">  </span><span class="n">temp</span><span class="p">)</span><span class="w">
  </span><span class="n">temp</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">",.*"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w">  </span><span class="n">temp</span><span class="p">)</span><span class="w">
  </span><span class="k">if</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">temp</span><span class="p">)</span><span class="o">&gt;</span><span class="m">0</span><span class="p">){</span><span class="w">
    </span><span class="n">ml_data</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">id</span><span class="o">=</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">univ</span><span class="o">=</span><span class="n">temp</span><span class="p">)</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">

</span><span class="n">ml_data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">rbindlist</span><span class="p">(</span><span class="n">ml_data</span><span class="p">)</span><span class="w">

</span><span class="c1"># We have 13074 publication-university pairs in total</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">ml_data</span><span class="p">)</span><span class="w"> 
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 13074     2
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Subset only to the top 50 universities</span><span class="w">
</span><span class="n">ml_data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ml_data</span><span class="p">[</span><span class="n">univ</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">sort</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">ml_data</span><span class="o">$</span><span class="n">univ</span><span class="p">),</span><span class="w"> </span><span class="n">decreasing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">50</span><span class="p">]),</span><span class="w">  </span><span class="p">]</span><span class="w"> 
</span></code></pre></div></div>
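
<p>For readers who prefer Python, the address-parsing loop above can be
sketched with the standard library alone. This is a minimal sketch, not
the code used for this post: the function name <code class="language-plaintext highlighter-rouge">parse_universities</code>
and the sample address are hypothetical, and the regular expressions are
direct translations of the <code class="language-plaintext highlighter-rouge">strsplit</code> and <code class="language-plaintext highlighter-rouge">gsub</code> calls.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re


def parse_universities(address):
    """Return the university names found in one address string."""
    # Split the record into one chunk per author group, as strsplit does with "; \\["
    parts = re.split(r"; \[", address)
    # Drop the author list that ends with "] "
    parts = [re.sub(r".*\] ", "", part) for part in parts]
    # Keep only the text before the first comma (the university name)
    return [re.sub(r",.*", "", part) for part in parts]


# Hypothetical address in the layout implied by the regular expressions above
sample = ("[Doe, J.] Hunan Univ, Coll Comp Sci, Changsha, Peoples R China; "
          "[Roe, A.] MIT, Dept EECS, Cambridge, MA USA")
universities = parse_universities(sample)
print(universities)
</code></pre></div></div>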

<h2 id="the-edgelist">The edgelist</h2>

<p>The edgelist is the key input that we need to plot the network. However,
we have to perform additional data manipulation before we have a list of
pairs of universities. The data so far contains pairs of
<code class="language-plaintext highlighter-rouge">c(publication, university)</code>, but what we need is a list
of pairs of universities that worked together on a publication,
<code class="language-plaintext highlighter-rouge">c(university, university)</code>. RPubs (2022) describes this
type of conversion from a theoretical angle. From the data science
perspective, I discuss several ways to perform this transformation in
Gonzalez-Sauri (2022).</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ml_data</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>##         id                           univ
##    1:    2 Univ Elect Sci &amp; Technol China
##    2:    4                            MIT
##    3:    4              Northwestern Univ
##    4:    7               Chinese Acad Sci
##    5:    8 Univ Elect Sci &amp; Technol China
##   ---                                    
## 2661: 2873                   Tianjin Univ
## 2662: 2873            Natl Univ Singapore
## 2663: 2875                     Wuhan Univ
## 2664: 2875                 Univ Cambridge
## 2665: 2877            Univ Calif Berkeley
</code></pre></div></div>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">edge_lst</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">merge</span><span class="p">(</span><span class="n">ml_data</span><span class="p">,</span><span class="w"> </span><span class="n">ml_data</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"id"</span><span class="p">,</span><span class="w"> </span><span class="n">allow.cartesian</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">edge_lst</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">edge_lst</span><span class="p">[</span><span class="n">edge_lst</span><span class="o">$</span><span class="n">univ.x</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">edge_lst</span><span class="o">$</span><span class="n">univ.y</span><span class="p">,</span><span class="w"> </span><span class="m">-1L</span><span class="p">]</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">edge_lst</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## [1] 6594    2
</code></pre></div></div>
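
<p>The same conversion from publication-university pairs to
university-university pairs can also be sketched in plain Python. This is
an illustrative alternative, not the method used above: the variable names
are hypothetical and the rows are a small subset copied from the
<code class="language-plaintext highlighter-rouge">ml_data</code> printout. Unlike the self-merge, which returns both
<code class="language-plaintext highlighter-rouge">(A, B)</code> and <code class="language-plaintext highlighter-rouge">(B, A)</code>, this sketch emits each unordered
pair once per publication and collapses repeated rows for the same
university.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict
from itertools import combinations

# (publication id, university) rows taken from the ml_data printout above
rows = [
    (4, "MIT"),
    (4, "Northwestern Univ"),
    (7, "Chinese Acad Sci"),
    (2873, "Tianjin Univ"),
    (2873, "Natl Univ Singapore"),
]

# Group the universities by publication id
by_publication = defaultdict(set)
for pub_id, univ in rows:
    by_publication[pub_id].add(univ)

# Every pair of distinct universities on the same publication is one edge
edges = []
for univs in by_publication.values():
    edges.extend(combinations(sorted(univs), 2))

print(edges)
</code></pre></div></div>

<p>Publications with a single university, such as id 7 here, contribute no
edges.</p>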

<p>To differentiate the strength of each link (edge), I calculate the
edge betweenness centrality, append it to the dataset as a weight, and
export the result to a CSV file.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">g1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">igraph</span><span class="o">::</span><span class="n">graph_from_data_frame</span><span class="p">(</span><span class="n">edge_lst</span><span class="p">,</span><span class="w"> </span><span class="n">directed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">edge_lst</span><span class="p">[,</span><span class="w"> </span><span class="n">weight</span><span class="o">:=</span><span class="w"> </span><span class="n">igraph</span><span class="o">::</span><span class="n">edge_betweenness</span><span class="p">(</span><span class="n">g1</span><span class="p">)]</span><span class="w"> 
</span><span class="n">setnames</span><span class="p">(</span><span class="n">edge_lst</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"source"</span><span class="p">,</span><span class="w"> </span><span class="s2">"target"</span><span class="p">,</span><span class="w"> </span><span class="s2">"weight"</span><span class="p">))</span><span class="w"> 
</span><span class="n">fwrite</span><span class="p">(</span><span class="n">edge_lst</span><span class="p">,</span><span class="w"> </span><span class="s2">"edge_lst.csv"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<h2 id="top-universities-in-machine-learning">Top Universities in Machine Learning</h2>

<p>Just out of curiosity, let's look at the top 20 universities working in
the field of Machine Learning.</p>

<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_ml</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sort</span><span class="p">(</span><span class="n">table</span><span class="p">(</span><span class="n">ml_data</span><span class="o">$</span><span class="n">univ</span><span class="p">),</span><span class="w"> </span><span class="n">decreasing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">T</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">]</span><span class="w">
</span><span class="n">top_ml</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">top_ml</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">top_ml</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"univ"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pubs"</span><span class="p">)</span><span class="w">

</span><span class="n">p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">top_ml</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">univ</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pubs</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pubs</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_bar</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme_minimal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="w">
    </span><span class="n">angle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w">
    </span><span class="c1">#vjust = 0.5,</span><span class="w">
    </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
  </span><span class="p">))</span><span class="w">

</span><span class="c1"># save the picture</span><span class="w">
</span><span class="n">ggsave</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="s2">"top_ml_univ.svg"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="o">=</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="o">=</span><span class="m">10</span><span class="p">)</span><span class="w">

</span><span class="n">p</span><span class="w">
</span></code></pre></div></div>

<p><img src="https://github.com/Wario84/blog/raw/main/assets/imgs/10-07-2022-net_ml/top_ml_univ.svg?raw=true" alt="" /></p>

<h2 id="python-dynamic-network">Python Dynamic Network</h2>

<p>For the network, I will use the <code class="language-plaintext highlighter-rouge">d3graph</code> library created by Taskesen
(2022).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">d3graph</span> <span class="kn">import</span> <span class="n">d3graph</span><span class="p">,</span> <span class="n">vec2adjmat</span>

<span class="c1"># Import data
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"./blog/edge_lst.csv"</span><span class="p">)</span>

<span class="c1"># Show the input data
</span><span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>

<span class="c1"># Create an adjacency matrix
</span><span class="n">adjmat</span> <span class="o">=</span> <span class="n">vec2adjmat</span><span class="p">(</span><span class="n">source</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">"source"</span><span class="p">].</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">target</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">"target"</span><span class="p">].</span><span class="n">to_list</span><span class="p">())</span>

<span class="c1"># Initialize
</span><span class="n">d3</span> <span class="o">=</span> <span class="n">d3graph</span><span class="p">()</span>

<span class="c1"># Build force-directed graph with default settings
</span><span class="n">d3</span><span class="p">.</span><span class="n">graph</span><span class="p">(</span><span class="n">adjmat</span><span class="p">)</span>

<span class="c1"># Show graph
</span><span class="n">d3</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>

</code></pre></div></div>

<h2 id="static-networks">Static Networks</h2>

<p>The results are quite nice. First, here is the network filtered to
universities that have 20 or more edges (co-authored publications).</p>

<iframe src="https://wario84.github.io/blog/assets/imgs/10-07-2022-net_ml/edge_20.html" title="w" width="775px" height="775px" frameborder="0"></iframe>

<p>Next is the network filtered to universities with more than 45 connections.</p>

<iframe src="https://wario84.github.io/blog/assets/imgs/10-07-2022-net_ml/edge_45.html" title="w" width="775px" height="775px" frameborder="0"></iframe>

<h2 id="dynamic-network">Dynamic Network</h2>

<p>Finally, here is the main dynamic network, which can be used to explore several thresholds of network connections.</p>

<iframe src="https://wario84.github.io/blog/assets/imgs/10-07-2022-net_ml/edge_700pp.html" title="w" width="800px" height="600px" frameborder="50"></iframe>

<h2 id="references">References</h2>

<div id="refs" class="references csl-bib-body hanging-indent">

<div id="ref-BibEntry2022Oct5" class="csl-entry">

<p>Gonzalez-Sauri. 2022. “<span class="nocase">What its the most
efficient method to create an edgelist/adjacency matrix from two sets of
IDs?</span>” <em>Stack Overflow</em>. <a href="https://stackoverflow.com/questions/42764954/what-its-the-most-efficient-method-to-create-an-edgelist-adjacency-matrix-from-t">https://stackoverflow.com/questions/42764954/what-its-the-most-efficient-method-to-create-an-edgelist-adjacency-matrix-from-t</a>.</p>
</div>

<div id="ref-BibEntry2022Oct3" class="csl-entry">

<p>RPubs. 2022. “<span class="nocase">RPubs - Bipartite/Two-Mode
Networks in igraph</span>.” <a href="https://rpubs.com/pjmurphy/317838">https://rpubs.com/pjmurphy/317838</a>.</p>
</div>

<div id="ref-erdogant2022Oct" class="csl-entry">

<p>Taskesen, Erdogan. 2022. “<span class="nocase">d3graph</span>.”
<em>GitHub</em>. <a href="https://github.com/erdogant/d3graph">https://github.com/erdogant/d3graph</a>.</p>
</div>

</div>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Introduction to Data Transformation with Python and Pandas.</title><link href="https://mario1084.github.io/blog/2022/10/04/intro_pandas.html" rel="alternate" type="text/html" title="Introduction to Data Transformation with Python and Pandas." /><published>2022-10-04T00:00:00+00:00</published><updated>2022-10-04T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2022/10/04/intro_pandas</id><content type="html" xml:base="https://mario1084.github.io/blog/2022/10/04/intro_pandas.html"><![CDATA[<h2 id="importing-modules">Importing Modules</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
</code></pre></div></div>

<h2 id="loading-data">Loading Data</h2>

<p>For this tutorial you need to download the following datasets:</p>

<ul>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">avocados</code>, <code class="language-plaintext highlighter-rouge">homelessness.csv</code>, <code class="language-plaintext highlighter-rouge">sales_subset.csv</code>, and
<code class="language-plaintext highlighter-rouge">temperatures.csv</code> datasets from <a href="https://www.kaggle.com/code/kakamana/datacamp-data-manipulation-with-pandas/data">Kaggle’s
website</a>.</p>
  </li>
  <li>
    <p>The <code class="language-plaintext highlighter-rouge">COVID19 Daily Updates</code> data from <a href="https://www.kaggle.com/datasets/gpreda/coronavirus-2019ncov?select=covid-19-all.csv">Kaggle’s
website</a>.</p>
  </li>
</ul>

<h2 id="creating-dataframes">Creating Dataframes</h2>

<h3 id="a-dataframe-from-a-list">A Dataframe from a list</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="p">[[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'1.2'</span><span class="p">,</span> <span class="s">'4.2'</span><span class="p">],</span> <span class="p">[</span><span class="s">'b'</span><span class="p">,</span> <span class="s">'70'</span><span class="p">,</span> <span class="s">'0.03'</span><span class="p">],</span> <span class="p">[</span><span class="s">'x'</span><span class="p">,</span> <span class="s">'5'</span><span class="p">,</span> <span class="s">'0'</span><span class="p">]]</span>
<span class="n">df11</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'one'</span><span class="p">,</span> <span class="s">'two'</span><span class="p">,</span> <span class="s">'three'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">df11</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0
</code></pre></div></div>
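
<p>Note that every value in the list above is a string, so the columns
hold text rather than numbers. A minimal sketch of converting them,
assuming the goal is numeric columns for <code class="language-plaintext highlighter-rouge">two</code> and
<code class="language-plaintext highlighter-rouge">three</code> (the name <code class="language-plaintext highlighter-rouge">numeric_cols</code> is only illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df11 = pd.DataFrame(a, columns=['one', 'two', 'three'])

# Convert the text columns to numbers; column 'one' stays as text
numeric_cols = ['two', 'three']
df11[numeric_cols] = df11[numeric_cols].apply(pd.to_numeric)

print(df11.dtypes)
</code></pre></div></div>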

<h3 id="a-dataframe-from-array">A Dataframe from Array</h3>

<p>Example 1</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dates</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">date_range</span><span class="p">(</span><span class="s">"20130101"</span><span class="p">,</span> <span class="n">periods</span><span class="o">=</span><span class="mi">6</span><span class="p">)</span> <span class="c1"># pandas indexes
</span><span class="n">r_num</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span> <span class="c1"># array
</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">r_num</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">dates</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="s">"ABCD"</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                   A         B         C         D
2013-01-01  0.817994 -0.924007 -1.515711 -0.198598
2013-01-02  0.673364 -1.914110 -0.126208 -0.282033
2013-01-03  1.312579  0.340656 -0.300397 -0.838614
2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784
2013-01-05 -2.045106 -0.929131 -0.029660  0.529883
2013-01-06 -1.343257 -0.250821 -0.046303  0.944569
</code></pre></div></div>

<p>Example 2</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">r_num</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">26</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
<span class="c1"># Convert 1D array to a 2D numpy array of 5 rows and 5 columns
</span><span class="n">arr_2d</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">r_num</span><span class="p">,</span> <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">df4</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">arr_2d</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">df4</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    0   1   2   3   4
0   1   2   3   4   5
1   6   7   8   9  10
2  11  12  13  14  15
3  16  17  18  19  20
4  21  22  23  24  25
&lt;class 'pandas.core.frame.DataFrame'&gt;
</code></pre></div></div>

<h3 id="a-dataframe-from-a-dictionary">A Dataframe from a dictionary</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"A"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
        <span class="s">"B"</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Timestamp</span><span class="p">(</span><span class="s">"20130102"</span><span class="p">),</span>
        <span class="s">"C"</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="s">"float32"</span><span class="p">),</span>
        <span class="s">"D"</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">3</span><span class="p">]</span> <span class="o">*</span> <span class="mi">4</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">"int32"</span><span class="p">),</span>
        <span class="s">"E"</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Categorical</span><span class="p">([</span><span class="s">"test"</span><span class="p">,</span> <span class="s">"train"</span><span class="p">,</span> <span class="s">"test"</span><span class="p">,</span> <span class="s">"train"</span><span class="p">]),</span>
        <span class="s">"F"</span><span class="p">:</span> <span class="s">"foo"</span><span class="p">,</span>
    <span class="p">}</span>

<span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">df2</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
</code></pre></div></div>

<h3 id="a-dataframe-froma-list-of-dictionaries">A Dataframe from a list of dictionaries</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a list of dictionaries with new data
</span><span class="n">df9</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span><span class="s">"date"</span><span class="p">:</span> <span class="s">"2019-11-03"</span><span class="p">,</span> <span class="s">"small_sold"</span><span class="p">:</span> <span class="mi">10376832</span><span class="p">,</span> <span class="s">"large_sold"</span><span class="p">:</span> <span class="mi">7835071</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"date"</span><span class="p">:</span> <span class="s">"2019-11-10"</span><span class="p">,</span> <span class="s">"small_sold"</span><span class="p">:</span> <span class="mi">10717154</span><span class="p">,</span> <span class="s">"large_sold"</span><span class="p">:</span> <span class="mi">8561348</span><span class="p">},</span>
<span class="p">]</span>

<span class="c1"># Convert list into DataFrame
</span><span class="n">df9</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">df9</span><span class="p">)</span>

<span class="c1"># Print the new DataFrame
</span><span class="k">print</span><span class="p">(</span><span class="n">df9</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         date  small_sold  large_sold
0  2019-11-03    10376832     7835071
1  2019-11-10    10717154     8561348
</code></pre></div></div>

<h3 id="a-dataframe-from-csv">A Dataframe from CSV</h3>

<p>To inspect <code class="language-plaintext highlighter-rouge">df3</code>, I use the <code class="language-plaintext highlighter-rouge">head()</code> method, which returns only the first five rows by default.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Read the csv from working directory
</span><span class="n">df3</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"homelessness.csv"</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Unnamed: 0              region  ... family_members  state_pop
0           0  East South Central  ...          864.0    4887681
1           1             Pacific  ...          582.0     735139
2           2            Mountain  ...         2606.0    7158024
3           3  West South Central  ...          432.0    3009733
4           4             Pacific  ...        20964.0   39461588

[5 rows x 6 columns]
</code></pre></div></div>

<p>Another example of importing from csv.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df6</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"sales_subset.csv"</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">df6</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Unnamed: 0  store type  ...  temperature_c fuel_price_usd_per_l  unemployment
0           0      1    A  ...       5.727778             0.679451         8.106
1           1      1    A  ...       8.055556             0.693452         8.106
2           2      1    A  ...      16.816667             0.718284         7.808
3           3      1    A  ...      22.527778             0.748928         7.808
4           4      1    A  ...      27.050000             0.714586         7.808

[5 rows x 10 columns]
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"temperatures.csv"</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">df7</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df8</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"covid-19-all.csv"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">df8</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;string&gt;:1: DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False.
  Country/Region Province/State  Latitude  ...  Recovered  Deaths        Date
0            NaN            NaN       NaN  ...    41727.0  2191.0  2021-01-01
1            NaN            NaN       NaN  ...    33634.0  1181.0  2021-01-01
2            NaN            NaN       NaN  ...    67395.0  2762.0  2021-01-01
3            NaN            NaN       NaN  ...     7463.0    84.0  2021-01-01
4            NaN            NaN       NaN  ...    11146.0   405.0  2021-01-01

[5 rows x 8 columns]
</code></pre></div></div>

<h2 id="methods-and-attributes-of-a-dataframe">Methods and Attributes of a DataFrame</h2>

<p>In Python, each data structure has specific <code class="language-plaintext highlighter-rouge">methods</code>, the operations
that can be performed on it, and specific <code class="language-plaintext highlighter-rouge">attributes</code>, the
characteristics that describe it. We can list all the methods and
attributes associated with a data structure using the following lines:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">att_meth</span> <span class="o">=</span> <span class="nb">dir</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">att_meth</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['A', 'B', 'C', 'D', 'T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '__xor__', '_accessors', '_accum_func', '_add_numeric_operations', '_agg_by_level', '_agg_examples_doc', '_agg_summary_and_see_also_doc', '_align_frame', '_align_series', '_append', '_arith_method', '_as_manager', '_attrs', '_box_col_values', '_can_fast_transpose', '_check_inplace_and_allows_duplicate_labels', '_check_inplace_setting', '_check_is_chained_assignment_possible', '_check_label_or_level_ambiguity', '_check_setitem_copy', '_clear_item_cache', '_clip_with_one_bound', '_clip_with_scalar', '_cmp_method', '_combine_frame', '_consolidate', '_consolidate_inplace', '_construct_axes_dict', '_construct_axes_from_arguments', '_construct_result', '_constructor', 
'_constructor_sliced', '_convert', '_count_level', '_data', '_dir_additions', '_dir_deletions', '_dispatch_frame_op', '_drop_axis', '_drop_labels_or_levels', '_ensure_valid_index', '_find_valid_index', '_flags', '_from_arrays', '_from_mgr', '_get_agg_axis', '_get_axis', '_get_axis_name', '_get_axis_number', '_get_axis_resolvers', '_get_block_manager_axis', '_get_bool_data', '_get_cleaned_column_resolvers', '_get_column_array', '_get_index_resolvers', '_get_item_cache', '_get_label_or_level_values', '_get_numeric_data', '_get_value', '_getitem_bool_array', '_getitem_multilevel', '_gotitem', '_hidden_attrs', '_indexed_same', '_info_axis', '_info_axis_name', '_info_axis_number', '_info_repr', '_init_mgr', '_inplace_method', '_internal_names', '_internal_names_set', '_is_copy', '_is_homogeneous_type', '_is_label_or_level_reference', '_is_label_reference', '_is_level_reference', '_is_mixed_type', '_is_view', '_iset_item', '_iset_item_mgr', '_iset_not_inplace', '_item_cache', '_iter_column_arrays', '_ixs', '_join_compat', '_logical_func', '_logical_method', '_maybe_cache_changed', '_maybe_update_cacher', '_metadata', '_mgr', '_min_count_stat_function', '_needs_reindex_multi', '_protect_consolidate', '_reduce', '_reduce_axis1', '_reindex_axes', '_reindex_columns', '_reindex_index', '_reindex_multi', '_reindex_with_indexers', '_rename', '_replace_columnwise', '_repr_data_resource_', '_repr_fits_horizontal_', '_repr_fits_vertical_', '_repr_html_', '_repr_latex_', '_reset_cache', '_reset_cacher', '_sanitize_column', '_series', '_set_axis', '_set_axis_name', '_set_axis_nocheck', '_set_is_copy', '_set_item', '_set_item_frame_value', '_set_item_mgr', '_set_value', '_setitem_array', '_setitem_frame', '_setitem_slice', '_slice', '_stat_axis', '_stat_axis_name', '_stat_axis_number', '_stat_function', '_stat_function_ddof', '_take_with_is_copy', '_to_dict_of_blocks', '_typ', '_update_inplace', '_validate_dtype', '_values', '_where', 'abs', 'add', 'add_prefix', 'add_suffix', 'agg', 
'aggregate', 'align', 'all', 'any', 'append', 'apply', 'applymap', 'asfreq', 'asof', 'assign', 'astype', 'at', 'at_time', 'attrs', 'axes', 'backfill', 'between_time', 'bfill', 'bool', 'boxplot', 'clip', 'columns', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'copy', 'corr', 'corrwith', 'count', 'cov', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'div', 'divide', 'dot', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'dtypes', 'duplicated', 'empty', 'eq', 'equals', 'eval', 'ewm', 'expanding', 'explode', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'flags', 'floordiv', 'from_dict', 'from_records', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'iat', 'idxmax', 'idxmin', 'iloc', 'index', 'infer_objects', 'info', 'insert', 'interpolate', 'isin', 'isna', 'isnull', 'items', 'iteritems', 'iterrows', 'itertuples', 'join', 'keys', 'kurt', 'kurtosis', 'last', 'last_valid_index', 'le', 'loc', 'lookup', 'lt', 'mad', 'mask', 'max', 'mean', 'median', 'melt', 'memory_usage', 'merge', 'min', 'mod', 'mode', 'mul', 'multiply', 'ndim', 'ne', 'nlargest', 'notna', 'notnull', 'nsmallest', 'nunique', 'pad', 'pct_change', 'pipe', 'pivot', 'pivot_table', 'plot', 'pop', 'pow', 'prod', 'product', 'quantile', 'query', 'radd', 'rank', 'rdiv', 'reindex', 'reindex_like', 'rename', 'rename_axis', 'reorder_levels', 'replace', 'resample', 'reset_index', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'round', 'rpow', 'rsub', 'rtruediv', 'sample', 'select_dtypes', 'sem', 'set_axis', 'set_flags', 'set_index', 'shape', 'shift', 'size', 'skew', 'slice_shift', 'sort_index', 'sort_values', 'squeeze', 'stack', 'std', 'style', 'sub', 'subtract', 'sum', 'swapaxes', 'swaplevel', 'tail', 'take', 'to_clipboard', 'to_csv', 'to_dict', 'to_excel', 'to_feather', 'to_gbq', 'to_hdf', 'to_html', 'to_json', 'to_latex', 'to_markdown', 'to_numpy', 'to_parquet', 'to_period', 'to_pickle', 'to_records', 'to_sql', 'to_stata', 'to_string', 'to_timestamp', 'to_xarray', 'to_xml', 'transform', 
'transpose', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'unstack', 'update', 'value_counts', 'values', 'var', 'where', 'xs']
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">methods</code> are called with parentheses and typically take arguments,
whereas <code class="language-plaintext highlighter-rouge">attributes</code> are accessed without parentheses. Both are
reached using the <code class="language-plaintext highlighter-rouge">.</code> notation.</p>
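<p>As a minimal sketch of this distinction, the snippet below separates the public names returned by <code class="language-plaintext highlighter-rouge">dir()</code> into methods and attributes using the built-in <code class="language-plaintext highlighter-rouge">callable()</code>. The dataframe <code class="language-plaintext highlighter-rouge">toy</code> and the list names are hypothetical and only serve the illustration.</p>

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# Public names only (skip those starting with an underscore)
public = [name for name in dir(toy) if not name.startswith("_")]

# Look each name up on the class, so that no property is evaluated
methods = [n for n in public if callable(getattr(type(toy), n, None))]

print("head" in methods)   # head is a method: call it with parentheses
print("shape" in methods)  # shape is an attribute: no parentheses
print(toy.shape)           # attribute access
print(toy.head(2))         # method call with an argument
```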

<h3 id="dataframe-attributes">DataFrame Attributes</h3>

<h3 id="dimension-and-data-types">Dimension and data types</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print the dimension of the dataframe
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>

<span class="c1"># Print the dataframe column types
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">dtypes</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(51, 6)
Unnamed: 0          int64
region             object
state              object
individuals       float64
family_members    float64
state_pop           int64
dtype: object
</code></pre></div></div>

<h3 id="columns-and-rows">Columns and Rows</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print the column index of dataframe
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>

<span class="c1"># Print the row index of dataframe
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">index</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
       'state_pop'],
      dtype='object')
RangeIndex(start=0, stop=51, step=1)
</code></pre></div></div>

<h3 id="dataframe-methods">DataFrame Methods</h3>

<h3 id="describe-the-data-frame">Describe the Data Frame</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First Rows
</span><span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>

<span class="c1"># Last Rows
</span><span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">tail</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                   A         B         C         D
2013-01-01  0.817994 -0.924007 -1.515711 -0.198598
2013-01-02  0.673364 -1.914110 -0.126208 -0.282033
2013-01-03  1.312579  0.340656 -0.300397 -0.838614
2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784
2013-01-05 -2.045106 -0.929131 -0.029660  0.529883
                   A         B         C         D
2013-01-02  0.673364 -1.914110 -0.126208 -0.282033
2013-01-03  1.312579  0.340656 -0.300397 -0.838614
2013-01-04 -0.732977 -0.560867 -0.515910 -0.768784
2013-01-05 -2.045106 -0.929131 -0.029660  0.529883
2013-01-06 -1.343257 -0.250821 -0.046303  0.944569
</code></pre></div></div>

<h3 id="information">Information</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">info</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;class 'pandas.core.frame.DataFrame'&gt;
DatetimeIndex: 6 entries, 2013-01-01 to 2013-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes
None
</code></pre></div></div>

<h3 id="sort-a-dataframe">Sort a DataFrame</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Sort df3 by individuals (ascending)
</span><span class="n">df3_ind</span> <span class="o">=</span> <span class="n">df3</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"individuals"</span><span class="p">)</span> 

<span class="c1"># Print the top few rows
</span><span class="k">print</span><span class="p">(</span><span class="n">df3_ind</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>

<span class="c1"># Sort df3 by individuals (descending)
</span><span class="n">df3_ind</span> <span class="o">=</span> <span class="n">df3</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"individuals"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> 

<span class="c1"># Print the top few rows
</span><span class="k">print</span><span class="p">(</span><span class="n">df3_ind</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>


<span class="c1"># Sort df3 by region ascending, then by family_members descending
</span><span class="n">df3_reg_fam</span> <span class="o">=</span> <span class="n">df3</span><span class="p">.</span><span class="n">sort_values</span><span class="p">([</span><span class="s">'region'</span><span class="p">,</span> <span class="s">'family_members'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">])</span>

<span class="c1"># Print the top few rows
</span><span class="k">print</span><span class="p">(</span><span class="n">df3_reg_fam</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    Unnamed: 0              region  ... family_members  state_pop
50          50            Mountain  ...          205.0     577601
34          34  West North Central  ...           75.0     758080
7            7      South Atlantic  ...          374.0     965479
39          39         New England  ...          354.0    1058287
45          45         New England  ...          511.0     624358

[5 rows x 6 columns]
    Unnamed: 0              region  ... family_members  state_pop
4            4             Pacific  ...        20964.0   39461588
32          32        Mid-Atlantic  ...        52070.0   19530351
9            9      South Atlantic  ...         9587.0   21244317
43          43  West South Central  ...         6111.0   28628666
47          47             Pacific  ...         5880.0    7523869

[5 rows x 6 columns]
    Unnamed: 0              region  ... family_members  state_pop
13          13  East North Central  ...         3891.0   12723071
35          35  East North Central  ...         3320.0   11676341
22          22  East North Central  ...         3142.0    9984072
49          49  East North Central  ...         2167.0    5807406
14          14  East North Central  ...         1482.0    6695497

[5 rows x 6 columns]
</code></pre></div></div>

<h2 id="indexing">Indexing</h2>

<p>By default, when we create a dataframe from scratch without labels, Pandas
assigns integer indexes to both rows and columns, starting at zero. An
example is <code class="language-plaintext highlighter-rouge">df4</code>, which was created from a 2D array.</p>

<ul>
  <li>Column index</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df4</span><span class="p">.</span><span class="n">columns</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RangeIndex(start=0, stop=5, step=1)
</code></pre></div></div>

<ul>
  <li>Row index</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df4</span><span class="p">.</span><span class="n">index</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RangeIndex(start=0, stop=5, step=1)
</code></pre></div></div>

<p>However, dataframes created from a <code class="language-plaintext highlighter-rouge">csv</code> take their column names from
the header row of the file by default, while the rows are still indexed by whole
numbers. For instance, the dataframes <code class="language-plaintext highlighter-rouge">df6</code> and <code class="language-plaintext highlighter-rouge">df7</code>.</p>

<ul>
  <li>Column index</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Examples of column indexes
</span><span class="n">df6</span><span class="p">.</span><span class="n">columns</span>

<span class="n">df7</span><span class="p">.</span><span class="n">columns</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Unnamed: 0', 'store', 'type', 'department', 'date', 'weekly_sales',
       'is_holiday', 'temperature_c', 'fuel_price_usd_per_l', 'unemployment'],
      dtype='object')
Index(['Unnamed: 0', 'date', 'city', 'country', 'avg_temp_c'], dtype='object')
</code></pre></div></div>

<ul>
  <li>Row index</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Examples of row indexes
</span><span class="n">df6</span><span class="p">.</span><span class="n">index</span>

<span class="n">df7</span><span class="p">.</span><span class="n">index</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RangeIndex(start=0, stop=10774, step=1)
RangeIndex(start=0, stop=16500, step=1)
</code></pre></div></div>
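<p>The <code class="language-plaintext highlighter-rouge">Unnamed: 0</code> column in <code class="language-plaintext highlighter-rouge">df6</code> and <code class="language-plaintext highlighter-rouge">df7</code> typically appears when a file was saved together with its row index. A minimal sketch, assuming a small in-memory CSV (the text in <code class="language-plaintext highlighter-rouge">csv_text</code> is hypothetical), shows how the <code class="language-plaintext highlighter-rouge">index_col</code> argument of <code class="language-plaintext highlighter-rouge">read_csv</code> reads that column as the row index instead:</p>

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file exported with its row index
csv_text = ",city,avg_temp_c\n0,Abidjan,27.293\n1,Abidjan,27.685\n"

# Default read: the unnamed first column becomes "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```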

<p>Changing column names is straightforward:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Old
</span><span class="k">print</span><span class="p">(</span><span class="n">df9</span><span class="p">)</span>


<span class="n">df9</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'A'</span><span class="p">,</span> <span class="s">'B'</span><span class="p">,</span> <span class="s">'C'</span><span class="p">]</span>
<span class="c1"># New
</span><span class="k">print</span><span class="p">(</span><span class="n">df9</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         date  small_sold  large_sold
0  2019-11-03    10376832     7835071
1  2019-11-10    10717154     8561348
            A         B        C
0  2019-11-03  10376832  7835071
1  2019-11-10  10717154  8561348
</code></pre></div></div>
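<p>Assigning to <code class="language-plaintext highlighter-rouge">columns</code> replaces every name at once and modifies the dataframe in place. When only some columns need new names, the <code class="language-plaintext highlighter-rouge">rename</code> method with a mapping is a possible alternative. The sketch below rebuilds a small frame with the same columns as <code class="language-plaintext highlighter-rouge">df9</code>; the names <code class="language-plaintext highlighter-rouge">sales</code> and <code class="language-plaintext highlighter-rouge">renamed</code> are hypothetical.</p>

```python
import pandas as pd

# Toy frame mirroring the columns of df9 (values copied from the output above)
sales = pd.DataFrame({
    "date": ["2019-11-03", "2019-11-10"],
    "small_sold": [10376832, 10717154],
    "large_sold": [7835071, 8561348],
})

# rename returns a new frame; columns missing from the mapping keep their names
renamed = sales.rename(columns={"small_sold": "small", "large_sold": "large"})

print(renamed.columns.tolist())
print(sales.columns.tolist())  # the original frame is unchanged
```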

<p>It is possible to use one column as the row index, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7_1</span> <span class="o">=</span> <span class="n">df7</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"city"</span><span class="p">)</span>
<span class="n">df7_1</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         Unnamed: 0        date        country  avg_temp_c
city                                                      
Abidjan           0  2000-01-01  Côte D'Ivoire      27.293
Abidjan           1  2000-02-01  Côte D'Ivoire      27.685
Abidjan           2  2000-03-01  Côte D'Ivoire      29.061
Abidjan           3  2000-04-01  Côte D'Ivoire      28.162
Abidjan           4  2000-05-01  Côte D'Ivoire      27.547
</code></pre></div></div>

<p>Resetting the index moves it back into a regular column and restores the default integer index:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7_1</span><span class="p">.</span><span class="n">reset_index</span><span class="p">().</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      city  Unnamed: 0        date        country  avg_temp_c
0  Abidjan           0  2000-01-01  Côte D'Ivoire      27.293
1  Abidjan           1  2000-02-01  Côte D'Ivoire      27.685
2  Abidjan           2  2000-03-01  Côte D'Ivoire      29.061
3  Abidjan           3  2000-04-01  Côte D'Ivoire      28.162
4  Abidjan           4  2000-05-01  Côte D'Ivoire      27.547
</code></pre></div></div>
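<p>If the index values are not needed at all, <code class="language-plaintext highlighter-rouge">reset_index</code> also accepts <code class="language-plaintext highlighter-rouge">drop=True</code>, which discards the index instead of turning it into a column. A small sketch with a hypothetical frame <code class="language-plaintext highlighter-rouge">temps</code>:</p>

```python
import pandas as pd

# Hypothetical miniature of the temperatures data, indexed by city
temps = pd.DataFrame(
    {"city": ["Abidjan", "Abidjan", "Xian"], "avg_temp_c": [27.293, 27.685, 17.889]}
).set_index("city")

kept = temps.reset_index()              # city returns as a regular column
dropped = temps.reset_index(drop=True)  # city is discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
```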

<p>Once a row index is set, it is usually convenient to sort the dataframe
by that index.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df7_1</span><span class="p">.</span><span class="n">sort_index</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         Unnamed: 0        date        country  avg_temp_c
city                                                      
Abidjan           0  2000-01-01  Côte D'Ivoire      27.293
Abidjan         106  2008-11-01  Côte D'Ivoire      27.302
Abidjan         107  2008-12-01  Côte D'Ivoire      27.472
Abidjan         108  2009-01-01  Côte D'Ivoire      26.912
Abidjan         109  2009-02-01  Côte D'Ivoire      28.224
...             ...         ...            ...         ...
Xian          16391  2004-09-01          China      17.889
Xian          16392  2004-10-01          China      11.229
Xian          16393  2004-11-01          China       5.720
Xian          16395  2005-01-01          China      -2.209
Xian          16499  2013-09-01          China         NaN

[16500 rows x 4 columns]
</code></pre></div></div>

<h2 id="subsetting-rows">Subsetting rows</h2>

<p>To subset specific rows of the dataframe we can filter on columns using
relational operators. The comparison returns a Boolean value (<code class="language-plaintext highlighter-rouge">True</code> or <code class="language-plaintext highlighter-rouge">False</code>)
for every row, and only the rows marked <code class="language-plaintext highlighter-rouge">True</code> are kept. There are at least two
common ways to achieve this. Imagine you want to subset the rows from <code class="language-plaintext highlighter-rouge">df3</code>
where the column <code class="language-plaintext highlighter-rouge">individuals</code> is greater than 10000.</p>

<p>For example, passing the column as an attribute:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df3</span><span class="p">[</span><span class="n">df3</span><span class="p">.</span><span class="n">individuals</span><span class="o">&gt;</span><span class="mi">10000</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    Unnamed: 0              region  ... family_members  state_pop
4            4             Pacific  ...        20964.0   39461588
9            9      South Atlantic  ...         9587.0   21244317
32          32        Mid-Atlantic  ...        52070.0   19530351
37          37             Pacific  ...         3337.0    4181886
43          43  West South Central  ...         6111.0   28628666
47          47             Pacific  ...         5880.0    7523869

[6 rows x 6 columns]
</code></pre></div></div>

<p>Alternatively, we can select the column with brackets. The comparison
<code class="language-plaintext highlighter-rouge">df3["individuals"]&gt;10000</code> evaluates to a Boolean object of
<code class="language-plaintext highlighter-rouge">&lt;class 'pandas.core.series.Series'&gt;</code>, which is then used to filter the rows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">df3</span><span class="p">[</span><span class="s">"individuals"</span><span class="p">]</span><span class="o">&gt;</span><span class="mi">10000</span><span class="p">))</span>

<span class="n">df3</span><span class="p">[</span><span class="n">df3</span><span class="p">[</span><span class="s">"individuals"</span><span class="p">]</span><span class="o">&gt;</span><span class="mi">10000</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;class 'pandas.core.series.Series'&gt;
    Unnamed: 0              region  ... family_members  state_pop
4            4             Pacific  ...        20964.0   39461588
9            9      South Atlantic  ...         9587.0   21244317
32          32        Mid-Atlantic  ...        52070.0   19530351
37          37             Pacific  ...         3337.0    4181886
43          43  West South Central  ...         6111.0   28628666
47          47             Pacific  ...         5880.0    7523869

[6 rows x 6 columns]
</code></pre></div></div>
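<p>Boolean conditions can also be combined. The sketch below uses a hypothetical miniature of <code class="language-plaintext highlighter-rouge">df3</code> (the values are made up for illustration) to show the <code class="language-plaintext highlighter-rouge">&amp;</code> and <code class="language-plaintext highlighter-rouge">isin</code> patterns:</p>

```python
import pandas as pd

# Hypothetical miniature of df3; the values are made up for illustration
toy = pd.DataFrame({
    "region": ["Pacific", "Mountain", "Pacific", "New England"],
    "individuals": [109008.0, 434.0, 16424.0, 2280.0],
})

# Each condition needs its own parentheses; & means "and", | means "or"
big_pacific = toy[(toy["individuals"] > 10000) & (toy["region"] == "Pacific")]

# isin checks membership in a list of values
selected = toy[toy["region"].isin(["Mountain", "New England"])]

print(big_pacific)
print(selected)
```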

<p>Another way to subset rows is by using the <code class="language-plaintext highlighter-rouge">loc</code> indexer, which
takes advantage of the row and column indexes.</p>

<h3 id="subsetting-rows-using-the-loc">Subsetting rows using <code class="language-plaintext highlighter-rouge">loc</code></h3>

<ul>
  <li>
    <p>We use <code class="language-plaintext highlighter-rouge">[rows, columns]</code> brackets after <code class="language-plaintext highlighter-rouge">loc</code> to
distinguish between rows and columns.</p>
  </li>
  <li>
    <p>Row and column labels are typically <code class="language-plaintext highlighter-rouge">strings</code>, and several labels are passed together in a <code class="language-plaintext highlighter-rouge">list</code>
object.</p>
  </li>
  <li>
    <p>Subsetting a range of rows or columns is easy with the <code class="language-plaintext highlighter-rouge">:</code> slicing
operator.</p>
  </li>
</ul>
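<p>The three points above can be sketched as follows. The frame <code class="language-plaintext highlighter-rouge">temps</code> and its values are hypothetical; note that, unlike ordinary Python slices, label slices with <code class="language-plaintext highlighter-rouge">loc</code> include both endpoints.</p>

```python
import pandas as pd

# Hypothetical frame with a sorted, unique row index (values are illustrative)
temps = pd.DataFrame(
    {
        "city": ["Abidjan", "Kabul", "Lahore", "Xian"],
        "country": ["Côte D'Ivoire", "Afghanistan", "Pakistan", "China"],
        "avg_temp_c": [27.3, 3.3, 12.8, 7.0],
    }
).set_index("city")

# A list of row labels and a list of column labels
by_lists = temps.loc[["Abidjan", "Xian"], ["avg_temp_c"]]

# Label slices for rows and columns; both endpoints are included
by_slices = temps.loc["Kabul":"Lahore", "country":"avg_temp_c"]

print(by_lists)
print(by_slices)
```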

<p>Let's look at <code class="language-plaintext highlighter-rouge">df7</code>. First we set the column <code class="language-plaintext highlighter-rouge">city</code> as the row index (this is <code class="language-plaintext highlighter-rouge">df7_1</code>),
and then we pass the list <code class="language-plaintext highlighter-rouge">cities</code> to select all the rows of the cities
<code class="language-plaintext highlighter-rouge">"Abidjan", "Xian"</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cities</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Abidjan"</span><span class="p">,</span> <span class="s">"Xian"</span><span class="p">]</span>
<span class="n">df7_1</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">cities</span><span class="p">].</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         Unnamed: 0        date        country  avg_temp_c
city                                                      
Abidjan           0  2000-01-01  Côte D'Ivoire      27.293
Abidjan           1  2000-02-01  Côte D'Ivoire      27.685
Abidjan           2  2000-03-01  Côte D'Ivoire      29.061
Abidjan           3  2000-04-01  Côte D'Ivoire      28.162
Abidjan           4  2000-05-01  Côte D'Ivoire      27.547
</code></pre></div></div>

<p>Going further, we can specify multilevel indexes, that is, indexes nested
inside other indexes. Let's give an example where we nest the column
<code class="language-plaintext highlighter-rouge">city</code> inside the index of <code class="language-plaintext highlighter-rouge">country</code> from the dataset <code class="language-plaintext highlighter-rouge">df7</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Index df7 by country &amp; city
</span><span class="n">df7_1</span> <span class="o">=</span> <span class="n">df7</span><span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">"country"</span><span class="p">,</span> <span class="s">"city"</span><span class="p">])</span>

<span class="c1"># List of tuples: Brazil, Rio De Janeiro &amp; Pakistan, Lahore
</span><span class="n">rows_to_keep</span> <span class="o">=</span> <span class="p">[(</span><span class="s">"Brazil"</span><span class="p">,</span> <span class="s">"Rio De Janeiro"</span><span class="p">),</span> <span class="p">(</span><span class="s">"Pakistan"</span><span class="p">,</span> <span class="s">"Lahore"</span><span class="p">)]</span> <span class="c1"># this is a list
</span><span class="k">print</span><span class="p">(</span><span class="n">df7_1</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">rows_to_keep</span><span class="p">].</span><span class="n">head</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                        Unnamed: 0        date  avg_temp_c
country city                                              
Brazil  Rio De Janeiro       12540  2000-01-01      25.974
        Rio De Janeiro       12541  2000-02-01      26.699
        Rio De Janeiro       12542  2000-03-01      26.270
        Rio De Janeiro       12543  2000-04-01      25.750
        Rio De Janeiro       12544  2000-05-01      24.356
</code></pre></div></div>

<p>Now that we have two index levels, it is possible to sort the data by each of them:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7_1</span><span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="p">[</span><span class="s">"country"</span><span class="p">,</span> <span class="s">"city"</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">]).</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                   Unnamed: 0        date  avg_temp_c
country     city                                     
Afghanistan Kabul        7260  2000-01-01       3.326
            Kabul        7261  2000-02-01       3.454
            Kabul        7262  2000-03-01       9.612
            Kabul        7263  2000-04-01      17.925
            Kabul        7264  2000-05-01      24.658
</code></pre></div></div>

<p>With the <code class="language-plaintext highlighter-rouge">loc</code> indexer it is easy to slice blocks of rows by index
label. For instance, once <code class="language-plaintext highlighter-rouge">df7_1</code> is indexed and sorted, we
can pass the range <code class="language-plaintext highlighter-rouge">"Pakistan":"Russia"</code>, which returns every row
between those two labels, end points included. It is important to sort the
dataframe before slicing by label, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7_1</span> <span class="o">=</span> <span class="n">df7_1</span><span class="p">.</span><span class="n">sort_index</span><span class="p">()</span>
<span class="n">df7_1</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">"Pakistan"</span><span class="p">:</span><span class="s">"Russia"</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                           Unnamed: 0        date  avg_temp_c
country  city                                                
Pakistan Faisalabad              4785  2000-01-01      12.792
         Faisalabad              4786  2000-02-01      14.339
         Faisalabad              4787  2000-03-01      20.309
         Faisalabad              4788  2000-04-01      29.072
         Faisalabad              4789  2000-05-01      34.845
...                               ...         ...         ...
Russia   Saint Petersburg       13360  2013-05-01      12.355
         Saint Petersburg       13361  2013-06-01      17.185
         Saint Petersburg       13362  2013-07-01      17.234
         Saint Petersburg       13363  2013-08-01      17.153
         Saint Petersburg       13364  2013-09-01         NaN

[1155 rows x 3 columns]
</code></pre></div></div>

<p>Finally, given that we have indexed <code class="language-plaintext highlighter-rouge">df7_1</code> by two
columns, we can slice between pairs of index values, passed as tuples, and
return the rows in between.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7_1</span><span class="p">.</span><span class="n">loc</span><span class="p">[(</span><span class="s">"Pakistan"</span><span class="p">,</span><span class="s">"Lahore"</span><span class="p">):(</span><span class="s">"Russia"</span><span class="p">,</span> <span class="s">"Moscow"</span><span class="p">)].</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                 Unnamed: 0        date  avg_temp_c
country  city                                      
Pakistan Lahore        8415  2000-01-01      12.792
         Lahore        8416  2000-02-01      14.339
         Lahore        8417  2000-03-01      20.309
         Lahore        8418  2000-04-01      29.072
         Lahore        8419  2000-05-01      34.845
</code></pre></div></div>
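
<p>The two kinds of label slice can be compared on a small frame. This is only a sketch: <code class="language-plaintext highlighter-rouge">toy</code> and its temperature values are made up for illustration and are not taken from the temperatures dataset.</p>

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the temperatures dataset)
toy = pd.DataFrame(
    {
        "country": ["Pakistan", "Pakistan", "Russia", "Russia", "Brazil"],
        "city": ["Lahore", "Faisalabad", "Moscow", "Saint Petersburg", "Rio De Janeiro"],
        "avg_temp_c": [12.8, 12.7, -7.1, -5.2, 26.0],
    }
)

# Build a two-level index and sort it; label slicing relies on a sorted index
toy_idx = toy.set_index(["country", "city"]).sort_index()

# Slice on the outer level only: every row from Pakistan through Russia
outer = toy_idx.loc["Pakistan":"Russia"]

# Slice between two (country, city) pairs; both end points are included
pairs = toy_idx.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

print(outer)
print(pairs)
```

<p>The outer slice keeps all four Pakistan and Russia rows, while the tuple slice starts at Lahore and stops at Moscow, so Faisalabad and Saint Petersburg are left out.</p>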

<h3 id="subsetting-rows-using-the-iloc">Subsetting rows using the <code class="language-plaintext highlighter-rouge">iloc</code></h3>

<p>Using <code class="language-plaintext highlighter-rouge">iloc</code> with a dataframe is similar to using <code class="language-plaintext highlighter-rouge">loc</code>:</p>

<ul>
  <li>
    <p>We use the <code class="language-plaintext highlighter-rouge">[rows, columns]</code> brackets after the dataframe to
distinguish between rows and columns.</p>
  </li>
  <li>
    <p>Rows and columns are referenced by integer position, starting from
zero.</p>
  </li>
  <li>
    <p>A range of rows or columns can be selected with the <code class="language-plaintext highlighter-rouge">:</code> slicing
operator; the end position is excluded.</p>
  </li>
</ul>

<p>Subset the first five rows</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:</span><span class="mi">5</span><span class="p">,</span> <span class="p">:]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
</code></pre></div></div>

<h3 id="subsetting-rows-in-two-columns-or-more">Subsetting rows in two columns or more</h3>

<p>A more complex example filters on two columns at once. Each condition
goes inside its own <code class="language-plaintext highlighter-rouge">()</code> parentheses, and the conditions are combined with
the <code class="language-plaintext highlighter-rouge">&amp;</code> operator, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">[(</span><span class="n">df3</span><span class="p">[</span><span class="s">"state_pop"</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">5000000</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">df3</span><span class="p">[</span><span class="s">"individuals"</span><span class="p">]</span><span class="o">&gt;</span> <span class="mi">5000</span><span class="p">)])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    Unnamed: 0              region  ... family_members  state_pop
2            2            Mountain  ...         2606.0    7158024
4            4             Pacific  ...        20964.0   39461588
5            5            Mountain  ...         3250.0    5691287
9            9      South Atlantic  ...         9587.0   21244317
10          10      South Atlantic  ...         2556.0   10511131
13          13  East North Central  ...         3891.0   12723071
21          21         New England  ...        13257.0    6882635
22          22  East North Central  ...         3142.0    9984072
30          30        Mid-Atlantic  ...         3350.0    8886025
32          32        Mid-Atlantic  ...        52070.0   19530351
33          33      South Atlantic  ...         2817.0   10381615
35          35  East North Central  ...         3320.0   11676341
38          38        Mid-Atlantic  ...         5349.0   12800922
42          42  East South Central  ...         1744.0    6771631
43          43  West South Central  ...         6111.0   28628666
47          47             Pacific  ...         5880.0    7523869

[16 rows x 6 columns]
</code></pre></div></div>

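<p>As another example, filter <code class="language-plaintext highlighter-rouge">df3</code> for rows where <code class="language-plaintext highlighter-rouge">family_members</code> is less
than 1000 and <code class="language-plaintext highlighter-rouge">region</code> is <code class="language-plaintext highlighter-rouge">"Pacific"</code>.</p>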

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df3</span><span class="p">[(</span><span class="n">df3</span><span class="p">.</span><span class="n">family_members</span> <span class="o">&lt;</span> <span class="mi">1000</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">region</span> <span class="o">==</span> <span class="s">"Pacific"</span><span class="p">)]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Unnamed: 0   region   state  individuals  family_members  state_pop
1           1  Pacific  Alaska       1434.0           582.0     735139
</code></pre></div></div>

<p>Finally, we can subset from a list of options using the method <code class="language-plaintext highlighter-rouge">isin()</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The Mojave Desert states
</span><span class="n">canu</span> <span class="o">=</span> <span class="p">[</span><span class="s">"California"</span><span class="p">,</span> <span class="s">"Arizona"</span><span class="p">,</span> <span class="s">"Nevada"</span><span class="p">,</span> <span class="s">"Utah"</span><span class="p">]</span>

<span class="c1"># Filter for rows in the Mojave Desert states
</span><span class="n">df3</span><span class="p">[</span><span class="n">df3</span><span class="p">.</span><span class="n">state</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">canu</span><span class="p">)].</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    Unnamed: 0    region       state  individuals  family_members  state_pop
2            2  Mountain     Arizona       7259.0          2606.0    7158024
4            4   Pacific  California     109008.0         20964.0   39461588
28          28  Mountain      Nevada       7058.0           486.0    3027341
44          44  Mountain        Utah       1904.0           972.0    3153550
</code></pre></div></div>
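
<p>The boolean mask returned by <code class="language-plaintext highlighter-rouge">isin()</code> can also be negated with <code class="language-plaintext highlighter-rouge">~</code> to keep the rows that are not in the list. A minimal sketch, assuming a hypothetical miniature of <code class="language-plaintext highlighter-rouge">df3</code> with made-up counts:</p>

```python
import pandas as pd

# Hypothetical miniature of df3; the counts are invented for illustration
toy = pd.DataFrame(
    {
        "state": ["California", "Arizona", "Texas", "Nevada", "Ohio"],
        "individuals": [100, 20, 50, 10, 30],
    }
)

canu = ["California", "Arizona", "Nevada", "Utah"]

# One True/False value per row: is the state in the list?
mask = toy["state"].isin(canu)

inside = toy[mask]    # rows whose state is in the list
outside = toy[~mask]  # rows whose state is NOT in the list

print(inside)
print(outside)
```

<p>Storing the mask in a variable keeps both subsets consistent, because they are derived from the same condition.</p>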

<h3 id="subsetting-columns">Subsetting columns</h3>

<p>To subset, it is necessary to know the column names of the dataframe.
For instance, the column names of the dataframe <code class="language-plaintext highlighter-rouge">df4</code> can be retrieved
using the attribute <code class="language-plaintext highlighter-rouge">df4.columns.values</code>, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[0 1 2 3 4]
</code></pre></div></div>

<p>Knowing the column labels, it is then easy to subset the dataframe:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First column
</span><span class="n">df4</span><span class="p">[[</span><span class="mi">0</span><span class="p">]]</span>

<span class="c1"># Last column
</span><span class="n">df4</span><span class="p">[[</span><span class="mi">4</span><span class="p">]]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    0
0   1
1   6
2  11
3  16
4  21
    4
0   5
1  10
2  15
3  20
4  25
</code></pre></div></div>

<p>Another example uses the dataframe <code class="language-plaintext highlighter-rouge">df3</code>, whose column names are:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['Unnamed: 0' 'region' 'state' 'individuals' 'family_members' 'state_pop']
</code></pre></div></div>

<p>We can pass a list of column names to subset the <code class="language-plaintext highlighter-rouge">state</code> and
<code class="language-plaintext highlighter-rouge">family_members</code> columns as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># select two columns
</span><span class="n">df3</span><span class="p">[[</span><span class="s">"state"</span><span class="p">,</span> <span class="s">"family_members"</span><span class="p">]].</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        state  family_members
0     Alabama           864.0
1      Alaska           582.0
2     Arizona          2606.0
3    Arkansas           432.0
4  California         20964.0
</code></pre></div></div>

<p>To subset columns with the <code class="language-plaintext highlighter-rouge">loc</code> indexer, we have to specify both the rows and
the columns. If we want all the rows and are only subsetting columns, we
pass the <code class="language-plaintext highlighter-rouge">:</code> slice operator in the row position.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df3</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="p">[</span><span class="s">"state"</span><span class="p">,</span> <span class="s">"family_members"</span><span class="p">]].</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>        state  family_members
0     Alabama           864.0
1      Alaska           582.0
2     Arizona          2606.0
3    Arkansas           432.0
4  California         20964.0
</code></pre></div></div>

<p>Another example using the <code class="language-plaintext highlighter-rouge">df7</code></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="p">[</span><span class="s">"city"</span><span class="p">,</span> <span class="s">"country"</span><span class="p">]].</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      city        country
0  Abidjan  Côte D'Ivoire
1  Abidjan  Côte D'Ivoire
2  Abidjan  Côte D'Ivoire
3  Abidjan  Côte D'Ivoire
4  Abidjan  Côte D'Ivoire
</code></pre></div></div>

<p>Similarly to <code class="language-plaintext highlighter-rouge">loc</code>, subsetting columns with <code class="language-plaintext highlighter-rouge">iloc</code>
uses integer positions instead of names. For instance, the positions
<code class="language-plaintext highlighter-rouge">[1, 3]</code> select the columns <code class="language-plaintext highlighter-rouge">date</code> and <code class="language-plaintext highlighter-rouge">country</code> from <code class="language-plaintext highlighter-rouge">df7</code>. The slice operator <code class="language-plaintext highlighter-rouge">:</code> in the
row position of <code class="language-plaintext highlighter-rouge">iloc</code> states that we are selecting all the rows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7</span><span class="p">.</span><span class="n">columns</span>
<span class="n">df7</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">]].</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['Unnamed: 0', 'date', 'city', 'country', 'avg_temp_c'], dtype='object')
         date        country
0  2000-01-01  Côte D'Ivoire
1  2000-02-01  Côte D'Ivoire
2  2000-03-01  Côte D'Ivoire
3  2000-04-01  Côte D'Ivoire
4  2000-05-01  Côte D'Ivoire
</code></pre></div></div>
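
<p>Counting column positions by hand is error-prone, so one option is to look them up from the names. The sketch below uses <code class="language-plaintext highlighter-rouge">Index.get_indexer</code> on a two-row frame, <code class="language-plaintext highlighter-rouge">toy</code>, built here only for illustration:</p>

```python
import pandas as pd

# Two-row illustration frame with some of the columns that df7 has
toy = pd.DataFrame(
    {
        "date": ["2000-01-01", "2000-02-01"],
        "city": ["Abidjan", "Abidjan"],
        "country": ["Côte D'Ivoire", "Côte D'Ivoire"],
        "avg_temp_c": [27.293, 27.685],
    }
)

# Translate column names into integer positions
positions = toy.columns.get_indexer(["city", "country"])

# Use those positions with iloc; ":" keeps all the rows
subset = toy.iloc[:, positions]

print(list(positions))
print(subset)
```

<p>This keeps the code readable (the names are visible) while still selecting by position.</p>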

<h2 id="subsetting-rows-and-columns">Subsetting rows and columns</h2>

<p>First, locate the rows of the second column that comply with the rule.
In this example, we locate elements that are divisible by <code class="language-plaintext highlighter-rouge">2</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Subset elements of the second column that are divisible by two
</span><span class="n">rows</span> <span class="o">=</span> <span class="n">df4</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span>
<span class="k">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">rows</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;class 'pandas.core.series.Series'&gt;
</code></pre></div></div>

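<p>Notice that the comparison returns a boolean
<code class="language-plaintext highlighter-rouge">pandas.core.series.Series</code>, with one <code class="language-plaintext highlighter-rouge">True</code> or <code class="language-plaintext highlighter-rouge">False</code> value per row. Next, we use this object to subset the
elements of the second column in the following way:</p>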

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df5</span> <span class="o">=</span> <span class="n">df4</span><span class="p">[[</span><span class="mi">1</span><span class="p">]][</span><span class="n">rows</span><span class="p">]</span>

<span class="k">print</span><span class="p">(</span><span class="n">df5</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">df5</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">df5</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(3, 1)
&lt;class 'pandas.core.frame.DataFrame'&gt;
    1
0   2
2  12
4  22
</code></pre></div></div>
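
<p>The same selection can be written in a single step by passing the boolean Series and the column list to <code class="language-plaintext highlighter-rouge">loc</code>. In this sketch, <code class="language-plaintext highlighter-rouge">toy4</code> is a stand-in rebuilt here as a 5 by 5 frame holding the numbers 1 to 25; how the original <code class="language-plaintext highlighter-rouge">df4</code> was created is an assumption.</p>

```python
import numpy as np
import pandas as pd

# Stand-in for df4: a 5x5 frame with the integers 1..25 (assumed construction)
toy4 = pd.DataFrame(np.arange(1, 26).reshape(5, 5))

# Boolean Series: True where the second column is divisible by two
rows = toy4[1] % 2 == 0

# Two-step version: select the column first, then filter the rows
chained = toy4[[1]][rows]

# One-step version: rows and columns in a single loc call
single = toy4.loc[rows, [1]]

print(single)
print(single.equals(chained))
```

<p>Both versions return the same dataframe; the single <code class="language-plaintext highlighter-rouge">loc</code> call states the row rule and the column choice in one place.</p>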

<p>Another method of subsetting rows and columns is by using indexes and
<code class="language-plaintext highlighter-rouge">loc</code>. Recall that slicing by label is only reliable when the dataframe
has its index set and sorted. To refresh these steps,
let’s perform the following operations:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7_1</span> <span class="o">=</span> <span class="n">df7</span><span class="p">.</span><span class="n">set_index</span><span class="p">([</span><span class="s">"country"</span><span class="p">,</span> <span class="s">"city"</span><span class="p">])</span> <span class="c1"># set the indexes
</span><span class="n">df7_1</span> <span class="o">=</span> <span class="n">df7_1</span><span class="p">.</span><span class="n">sort_index</span><span class="p">()</span> <span class="c1"># sort the index in ascending order (the default)
</span></code></pre></div></div>

<p>Now we can subset all rows for the country <code class="language-plaintext highlighter-rouge">Zimbabwe</code>, keeping only the
column <code class="language-plaintext highlighter-rouge">date</code>, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7_1</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="s">"Zimbabwe"</span><span class="p">,</span> <span class="s">"date"</span><span class="p">].</span><span class="n">head</span><span class="p">()</span>

<span class="nb">type</span><span class="p">(</span><span class="n">df7_1</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>city
Harare    2000-01-01
Harare    2000-02-01
Harare    2000-03-01
Harare    2000-04-01
Harare    2000-05-01
Name: date, dtype: object
&lt;class 'numpy.ndarray'&gt;
</code></pre></div></div>

<h3 id="subsetting-rows-and-columns-using-the-iloc">Subsetting rows and columns using the <code class="language-plaintext highlighter-rouge">iloc</code></h3>

<p>This works like <code class="language-plaintext highlighter-rouge">loc</code>, but instead of using
labels as indexes, we use integer positions to call rows and columns. We
follow these two rules:</p>

<ul>
  <li>
    <p>We use the <code class="language-plaintext highlighter-rouge">[rows, columns]</code> brackets to distinguish between rows
and columns.</p>
  </li>
  <li>
    <p>Positions of rows and columns are integers that start at <code class="language-plaintext highlighter-rouge">0</code>.</p>
  </li>
</ul>

<p>Recall <code class="language-plaintext highlighter-rouge">df4</code></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    0   1   2   3   4
0   1   2   3   4   5
1   6   7   8   9  10
2  11  12  13  14  15
3  16  17  18  19  20
4  21  22  23  24  25
</code></pre></div></div>

<p>Subset the first and last element of <code class="language-plaintext highlighter-rouge">df4</code></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Subset the first element of df4
</span><span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">])</span>

<span class="c1"># Subset the last element of df4
</span><span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1
25
</code></pre></div></div>

<p>Extract the first column</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df4</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,[</span><span class="mi">0</span><span class="p">]]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    0
0   1
1   6
2  11
3  16
4  21
</code></pre></div></div>

<p>Extract the first row</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df4</span><span class="p">.</span><span class="n">iloc</span><span class="p">[[</span><span class="mi">0</span><span class="p">],:]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   0  1  2  3  4
0  1  2  3  4  5
</code></pre></div></div>

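<p>Extract ten elements of the second column, ending just before the last row. The stop position <code class="language-plaintext highlighter-rouge">df7.shape[0]-1</code> is excluded from the slice, so the final row of the dataframe is not returned:</p>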

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7</span><span class="p">.</span><span class="n">iloc</span><span class="p">[(</span><span class="n">df7</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-</span><span class="mi">11</span><span class="p">):(</span><span class="n">df7</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="p">[</span><span class="mi">1</span><span class="p">]]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>             date
16489  2012-11-01
16490  2012-12-01
16491  2013-01-01
16492  2013-02-01
16493  2013-03-01
16494  2013-04-01
16495  2013-05-01
16496  2013-06-01
16497  2013-07-01
16498  2013-08-01
</code></pre></div></div>
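
<p>A negative start position counts from the end of the dataframe, which avoids computing the bounds from <code class="language-plaintext highlighter-rouge">shape</code>. A minimal sketch on a hypothetical 100-row frame, <code class="language-plaintext highlighter-rouge">toy</code>:</p>

```python
import pandas as pd

# Hypothetical 100-row frame used only for illustration
toy = pd.DataFrame({"n": range(100), "label": [f"row{i}" for i in range(100)]})

# Last ten rows of the second column, final row included
last_ten = toy.iloc[-10:, [1]]

# Equivalent slice with explicit bounds: start at n-10, stop at n (excluded)
explicit = toy.iloc[toy.shape[0] - 10 : toy.shape[0], [1]]

# tail() is a third way to get the same rows
tail_ten = toy[["label"]].tail(10)

print(last_ten)
```

<p>All three expressions return rows 90 to 99, because the stop position of an <code class="language-plaintext highlighter-rouge">iloc</code> slice is never included.</p>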

<p>Get the first 4 rows of the third and fourth columns (positions 2 and 3).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">:</span><span class="mi">4</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      city        country
0  Abidjan  Côte D'Ivoire
1  Abidjan  Côte D'Ivoire
2  Abidjan  Côte D'Ivoire
3  Abidjan  Côte D'Ivoire
</code></pre></div></div>

<p>Extract all elements in the first row that comply with a condition. For
this example we first build the condition that selects the values in the
first row greater than two, <code class="language-plaintext highlighter-rouge">df4.iloc[0,:]&gt;2</code>. Then we use the
<code class="language-plaintext highlighter-rouge">np.where</code> function to locate the positions of the elements that are
<code class="language-plaintext highlighter-rouge">True</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df4</span><span class="p">.</span><span class="n">iloc</span><span class="p">[[</span><span class="mi">0</span><span class="p">],</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">df4</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">,:]</span><span class="o">&gt;</span><span class="mi">2</span><span class="p">)[</span><span class="mi">0</span><span class="p">]]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   2  3  4
0  3  4  5
</code></pre></div></div>
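
<p>To see what each piece returns, the condition, the positions, and the final selection can be computed separately. In this sketch, <code class="language-plaintext highlighter-rouge">toy4</code> is a stand-in rebuilt here as a 5 by 5 frame holding the numbers 1 to 25 (an assumed construction), and the label-based <code class="language-plaintext highlighter-rouge">loc</code> variant is shown as an alternative.</p>

```python
import numpy as np
import pandas as pd

# Stand-in for df4: a 5x5 frame with the integers 1..25 (assumed construction)
toy4 = pd.DataFrame(np.arange(1, 26).reshape(5, 5))

# Boolean Series over the column labels: which values in row 0 exceed two?
cond = toy4.iloc[0, :] > 2

# np.where returns a tuple; its first item holds the positions of the True values
positions = np.where(cond)[0]

# Select row 0 at those positions
by_position = toy4.iloc[[0], positions]

# Alternative: loc accepts the boolean Series directly on the column axis
by_label = toy4.loc[[0], cond]

print(positions)
print(by_position)
```

<p>Both selections return the same one-row dataframe, so <code class="language-plaintext highlighter-rouge">np.where</code> is only needed when the integer positions themselves are of interest.</p>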

<h2 id="transforming">Transforming</h2>

<h3 id="describe-summary-statistics">Describe (Summary Statistics)</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># mean
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span>

<span class="c1"># median
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="n">median</span><span class="p">())</span>

<span class="c1"># variance
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="n">var</span><span class="p">())</span>

<span class="c1"># standard deviation
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="n">std</span><span class="p">())</span>

<span class="c1"># min value
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="nb">min</span><span class="p">())</span>

<span class="c1"># max value
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>

<span class="c1"># all together
</span><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">describe</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6405637.274509804
4461153.0
53688706994844.23
7327257.808678784
577601
39461588
       Unnamed: 0    individuals  family_members     state_pop
count   51.000000      51.000000       51.000000  5.100000e+01
mean    25.000000    7225.784314     3504.882353  6.405637e+06
std     14.866069   15991.025083     7805.411811  7.327258e+06
min      0.000000     434.000000       75.000000  5.776010e+05
25%     12.500000    1446.500000      592.000000  1.777414e+06
50%     25.000000    3082.000000     1482.000000  4.461153e+06
75%     37.500000    6781.500000     3196.000000  7.340946e+06
max     50.000000  109008.000000    52070.000000  3.946159e+07
</code></pre></div></div>

<p>If the dataframe contains only integers or real numbers, it is easy to
apply these operations across the columns or the rows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Mean of rows across columns
</span><span class="n">df4</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># Mean of columns across columns
</span><span class="n">df4</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0     3.0
1     8.0
2    13.0
3    18.0
4    23.0
dtype: float64
0    11.0
1    12.0
2    13.0
3    14.0
4    15.0
dtype: float64
</code></pre></div></div>
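
<p>The <code class="language-plaintext highlighter-rouge">axis</code> argument also accepts the names <code class="language-plaintext highlighter-rouge">"index"</code> and <code class="language-plaintext highlighter-rouge">"columns"</code>, which some readers find easier to remember than 0 and 1. A short sketch, again on a stand-in frame <code class="language-plaintext highlighter-rouge">toy4</code> holding the numbers 1 to 25 (an assumed construction):</p>

```python
import numpy as np
import pandas as pd

# Stand-in for df4: a 5x5 frame with the integers 1..25 (assumed construction)
toy4 = pd.DataFrame(np.arange(1, 26).reshape(5, 5))

# axis=1 (or "columns") collapses the columns: one mean per row
row_means = toy4.mean(axis=1)

# axis=0 (or "index") collapses the rows: one mean per column
col_means = toy4.mean(axis=0)

# The named form gives the same result as the numeric form
same_rows = toy4.mean(axis="columns").equals(row_means)
same_cols = toy4.mean(axis="index").equals(col_means)

print(row_means.tolist())
print(col_means.tolist())
```

<p>A useful way to read it: the axis named in the call is the one that disappears from the result.</p>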

<h3 id="column-operations">Column operations</h3>

<p>The main column operations are:</p>

<ul>
  <li>Sum</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="nb">sum</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>326687501
</code></pre></div></div>

<ul>
  <li>Cumulative sum</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="n">cumsum</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0       4887681
1       5622820
2      12780844
3      15790577
4      55252165
5      60943452
6      64514972
7      65480451
8      66181998
9      87426315
10     97937446
11     99358039
12    101108575
13    113831646
14    120527143
15    123675761
16    126587120
17    131048273
18    135707963
19    137047020
20    143082822
21    149965457
22    159949529
23    165555778
24    168536798
25    174658421
26    175719086
27    177644700
28    180672041
29    182025506
30    190911531
31    193004272
32    212534623
33    222916238
34    223674318
35    235350659
36    239290894
37    243472780
38    256273702
39    257331989
40    262416145
41    263294843
42    270066474
43    298695140
44    301848690
45    302473048
46    310974334
47    318498203
48    320302494
49    326109900
50    326687501
Name: state_pop, dtype: int64
</code></pre></div></div>

<ul>
  <li>Cumulative product</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span><span class="p">.</span><span class="n">cumprod</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0                 4887681
1           3593124922659
2     7272930357681714200
3     7903566051232904824
4    -6717292260295177376
5     8873061169631614368
6     8148294484792166400
7     5819023669836478464
8    -1205241372944291840
9    -6292177132517226496
10    6622373097463437312
11    4962005595153618944
12    3470577629245915136
13   -8423188867997155328
14   -2225371326546493440
15    3946997816422858752
16    5096266778097451008
17    1136688303390556160
18    1494541717077688320
19    7730232222543446016
20    8113663438137196544
21   -1654939752842264576
22    6166687450565443584
23    1500558237656678400
24    6253697334853500928
25   -8790508571491041280
26   -7109132568401805312
27   -5744006173422518272
28   -2284299832833081344
29   -8673003558171312128
30   -3504656614122586112
31   -5377308965807849472
32   -4878392185126387712
33    1709610911523667968
34    8941421250245296128
35    9215048314536329216
36   -8334418405679431680
37    4198662944055099392
38   -8723413785341067264
39     581225513859678208
40    1910243011917250560
41    2130586611002376192
42   -2684937009104945152
43   -2708850407756529664
44    7509713552135946240
45   -6781182851288137728
46   -6382118816838582272
47   -4484145143506534400
48    2332118313460563968
49   -4719128645426216960
50   -8733419210157326336
Name: state_pop, dtype: int64
</code></pre></div></div>
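
<p>The negative values in the output above are not real products. The running product exceeds the largest 64-bit integer after a few rows, and <code class="language-plaintext highlighter-rouge">int64</code> arithmetic then wraps around without raising an error. A small sketch of the effect, using hypothetical population figures rather than the data in <code class="language-plaintext highlighter-rouge">df3</code>:</p>

```python
import math
import pandas as pd

# Hypothetical populations, large enough that their product exceeds the int64 range
pop = pd.Series([4_000_000, 5_000_000, 6_000_000, 7_000_000], name="state_pop")

wrapped = pop.cumprod()                     # int64: wraps around on overflow
as_float = pop.astype("float64").cumprod()  # float64: approximate, but no wrap-around
exact = math.prod(int(v) for v in pop)      # Python ints have arbitrary precision

print(int(wrapped.iloc[-1]) == exact)  # False: the int64 result overflowed
print(as_float.iloc[-1])
```

<p>Converting to <code class="language-plaintext highlighter-rouge">float64</code> trades exactness for range, which is usually acceptable when only the order of magnitude matters.</p>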

<h3 id="adding-new-columns">Adding new columns</h3>

<p>To add a new column, take a dataframe, for instance <code class="language-plaintext highlighter-rouge">df4</code>, write
the new column label (index or name) on the left of
<code class="language-plaintext highlighter-rouge">=</code>, and pass the values on the right, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Add the third and fourth columns (labels 2 and 3) and save the result in a new column
</span><span class="n">df4</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="n">df4</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">df4</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>

<span class="c1"># Express the third column (label 2) as a proportion of the new column 5 and store it in another new column
</span><span class="n">df4</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span> <span class="o">=</span> <span class="n">df4</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">/</span> <span class="n">df4</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span>

<span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    0   1   2   3   4   5         6
0   1   2   3   4   5   7  0.428571
1   6   7   8   9  10  17  0.470588
2  11  12  13  14  15  27  0.481481
3  16  17  18  19  20  37  0.486486
4  21  22  23  24  25  47  0.489362
</code></pre></div></div>

<p>If the columns are labelled with strings rather than integers, add a
new column by passing its name as a string, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create indiv_per_10k col as homeless individuals per 10k state pop
</span><span class="n">df3</span><span class="p">[</span><span class="s">"indiv_per_10k"</span><span class="p">]</span> <span class="o">=</span> <span class="mi">10000</span> <span class="o">*</span> <span class="n">df3</span><span class="p">.</span><span class="n">individuals</span> <span class="o">/</span> <span class="n">df3</span><span class="p">.</span><span class="n">state_pop</span>   
</code></pre></div></div>
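
<p>The same derived column can also be built with <code class="language-plaintext highlighter-rouge">DataFrame.assign</code>, which returns a new dataframe instead of modifying the original one. The sketch below uses made-up figures; only the column names come from the example above, and the names <code class="language-plaintext highlighter-rouge">homeless</code> and <code class="language-plaintext highlighter-rouge">with_rate</code> are illustrative.</p>

```python
import pandas as pd

# Made-up figures; only the column names mirror df3
homeless = pd.DataFrame({
    "state": ["X", "Y"],
    "individuals": [2_000, 500],
    "state_pop": [4_000_000, 500_000],
})

# assign() returns a copy with the extra column and leaves `homeless` unchanged
with_rate = homeless.assign(
    indiv_per_10k=lambda d: 10_000 * d["individuals"] / d["state_pop"]
)

print(with_rate["indiv_per_10k"].tolist())  # [5.0, 10.0]
```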

<h3 id="converting-column-types">Converting column types</h3>

<p>Transform the <code class="language-plaintext highlighter-rouge">date</code> column into <code class="language-plaintext highlighter-rouge">datetime64[ns]</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print class of object
</span><span class="k">print</span><span class="p">(</span><span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">].</span><span class="n">dtypes</span><span class="p">)</span> <span class="c1"># column date is an object
</span>
<span class="c1"># transform 
</span><span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">])</span>

<span class="c1"># Print the new class of object
</span><span class="k">print</span><span class="p">(</span><span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">].</span><span class="n">dtypes</span><span class="p">)</span>  <span class="c1"># column date is now datetime64[ns]
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>object
datetime64[ns]
</code></pre></div></div>

<p>Now that <code class="language-plaintext highlighter-rouge">date</code> is of class <code class="language-plaintext highlighter-rouge">datetime64[ns]</code>, we can create new columns
for the year, month and day:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Add year, month and day columns to temperatures
</span><span class="n">df7</span><span class="p">[</span><span class="s">"year"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">year</span>
<span class="n">df7</span><span class="p">[</span><span class="s">"month"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">month</span>
<span class="n">df7</span><span class="p">[</span><span class="s">"day"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">day</span>

<span class="n">df7</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">df7</span><span class="p">.</span><span class="n">dtypes</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   Unnamed: 0       date     city        country  avg_temp_c  year  month  day
0           0 2000-01-01  Abidjan  Côte D'Ivoire      27.293  2000      1    1
1           1 2000-02-01  Abidjan  Côte D'Ivoire      27.685  2000      2    1
2           2 2000-03-01  Abidjan  Côte D'Ivoire      29.061  2000      3    1
3           3 2000-04-01  Abidjan  Côte D'Ivoire      28.162  2000      4    1
4           4 2000-05-01  Abidjan  Côte D'Ivoire      27.547  2000      5    1
Unnamed: 0             int64
date          datetime64[ns]
city                  object
country               object
avg_temp_c           float64
year                   int64
month                  int64
day                    int64
dtype: object
</code></pre></div></div>
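
<p>A self-contained sketch of the same conversion, with three made-up rows in place of <code class="language-plaintext highlighter-rouge">df7</code>. The explicit <code class="language-plaintext highlighter-rouge">format</code> argument is an addition that the original example does not use:</p>

```python
import pandas as pd

# Made-up rows; the column names follow the temperatures example
temps = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2001-03-15"],
    "avg_temp_c": [27.3, 27.7, 29.1],
})

# Stating the format avoids ambiguity between day-first and month-first dates
temps["date"] = pd.to_datetime(temps["date"], format="%Y-%m-%d")

# The .dt accessor exposes the components of a datetime column
temps["year"] = temps["date"].dt.year
temps["month"] = temps["date"].dt.month
temps["day"] = temps["date"].dt.day

print(temps[["year", "month", "day"]])
```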

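<p>Here is another example using <code class="language-plaintext highlighter-rouge">df11</code>, where the columns <code class="language-plaintext highlighter-rouge">two</code> and <code class="language-plaintext highlighter-rouge">three</code> hold strings
that can be converted into numeric values.</p>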

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df11</span><span class="p">.</span><span class="n">dtypes</span><span class="p">)</span>


<span class="n">df11</span><span class="p">[[</span><span class="s">'two'</span><span class="p">,</span> <span class="s">'three'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">df11</span><span class="p">[[</span><span class="s">'two'</span><span class="p">,</span> <span class="s">'three'</span><span class="p">]].</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>one      object
two      object
three    object
dtype: object
</code></pre></div></div>
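
<p><code class="language-plaintext highlighter-rouge">astype(float)</code> raises an error as soon as one value cannot be parsed. When a column may contain stray text, <code class="language-plaintext highlighter-rouge">pd.to_numeric</code> with <code class="language-plaintext highlighter-rouge">errors="coerce"</code> is a more forgiving option. A sketch with made-up values (the frame <code class="language-plaintext highlighter-rouge">raw</code> is hypothetical and only borrows the column names of <code class="language-plaintext highlighter-rouge">df11</code>):</p>

```python
import pandas as pd

# Made-up frame whose columns are all stored as strings
raw = pd.DataFrame({
    "one": ["a", "b", "c"],
    "two": ["1.5", "2", "n/a"],
    "three": ["10", "20", "30"],
})

# errors="coerce" turns unparseable entries such as "n/a" into NaN instead of raising
for col in ["two", "three"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

print(raw.dtypes)
print(raw["two"].isna().sum())  # 1
```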

<h3 id="aggregating">Aggregating</h3>

<p>The aggregate method works by defining a function that is later
called on each column of a dataframe through the method <code class="language-plaintext highlighter-rouge">agg()</code>.</p>

<p>For instance, consider the following function, which computes the
interquartile range, that is, the difference between the 75% and the 25% quantiles:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">iqr</span><span class="p">(</span><span class="n">column</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">column</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.75</span><span class="p">)</span> <span class="o">-</span> <span class="n">column</span><span class="p">.</span><span class="n">quantile</span><span class="p">(</span><span class="mf">0.25</span><span class="p">)</span>
</code></pre></div></div>

<p>We can use this function on the first and third columns (labels 0 and 2) of the <code class="language-plaintext highlighter-rouge">df4</code> dataframe
as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">[[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">]].</span><span class="n">agg</span><span class="p">(</span><span class="n">iqr</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0    10.0
2    10.0
dtype: float64
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">agg()</code> can take more than one function, for instance:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df4</span><span class="p">[[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">]].</span><span class="n">agg</span><span class="p">([</span><span class="n">iqr</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">median</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">]))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>           0     2
iqr     10.0  10.0
median  11.0  13.0
mean    11.0  13.0
</code></pre></div></div>
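
<p>Built-in aggregations can also be requested by name, as strings. Newer pandas releases may emit a warning when NumPy functions such as <code class="language-plaintext highlighter-rouge">np.mean</code> are passed to <code class="language-plaintext highlighter-rouge">agg()</code>, and the string form avoids it. A sketch, again assuming that <code class="language-plaintext highlighter-rouge">df4</code> holds the integers 1 to 25 in a 5×5 layout:</p>

```python
import numpy as np
import pandas as pd

def iqr(column):
    """Interquartile range: 75% quantile minus 25% quantile."""
    return column.quantile(0.75) - column.quantile(0.25)

# Assumed stand-in for df4: integers 1..25 in a 5x5 layout
frame = pd.DataFrame(np.arange(1, 26).reshape(5, 5))

# Custom functions are passed as objects, built-in aggregations as strings
summary = frame[[0, 2]].agg([iqr, "median", "mean"])
print(summary)
```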

<h3 id="grouping">Grouping</h3>

<p>For grouping, we use the method <code class="language-plaintext highlighter-rouge">.groupby()</code> from the <code class="language-plaintext highlighter-rouge">pandas</code> library,
which splits a dataframe according to a categorical variable. The
following snippet splits the dataframe by the variable <code class="language-plaintext highlighter-rouge">type</code>, which has two
categories, and then computes the <code class="language-plaintext highlighter-rouge">sum()</code> of the column <code class="language-plaintext highlighter-rouge">weekly_sales</code> for each of them.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df6</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"type"</span><span class="p">)[</span><span class="s">"weekly_sales"</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type
A    2.337163e+08
B    2.317840e+07
Name: weekly_sales, dtype: float64
</code></pre></div></div>

<p>We can also group by more than one categorical variable:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df6</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">"type"</span><span class="p">,</span> <span class="s">"is_holiday"</span><span class="p">])[</span><span class="s">"weekly_sales"</span><span class="p">].</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type  is_holiday
A     False         2.336927e+08
      True          2.360181e+04
B     False         2.317678e+07
      True          1.621410e+03
Name: weekly_sales, dtype: float64
</code></pre></div></div>

<p>Now we can use <code class="language-plaintext highlighter-rouge">.groupby</code> together with <code class="language-plaintext highlighter-rouge">agg</code> to
calculate the mean, median, max and min of several variables.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df6</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"type"</span><span class="p">)[[</span><span class="s">"unemployment"</span><span class="p">,</span> <span class="s">"fuel_price_usd_per_l"</span><span class="p">]].</span><span class="n">agg</span><span class="p">([</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">median</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="nb">min</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     unemployment                ... fuel_price_usd_per_l                    
             mean median   amax  ...               median      amax      amin
type                             ...                                         
A        7.972611  8.067  8.992  ...             0.735455  1.107410  0.664129
B        9.279323  9.199  9.765  ...             0.803348  1.107674  0.760023

[2 rows x 8 columns]
</code></pre></div></div>
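
<p>Named aggregation gives every output column an explicit label, which avoids generated names such as <code class="language-plaintext highlighter-rouge">amax</code> and <code class="language-plaintext highlighter-rouge">amin</code>. A sketch with made-up rows that only borrow the column names <code class="language-plaintext highlighter-rouge">type</code> and <code class="language-plaintext highlighter-rouge">unemployment</code> (the output labels are illustrative):</p>

```python
import pandas as pd

# Made-up rows; only the column names follow the sales example
stores = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "unemployment": [8.0, 7.0, 9.5, 9.0],
})

# Each keyword becomes an output column: name=(input column, aggregation)
stats = stores.groupby("type").agg(
    unemp_mean=("unemployment", "mean"),
    unemp_min=("unemployment", "min"),
    unemp_max=("unemployment", "max"),
)
print(stats)
```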

<h3 id="dealing-with-nan-values">Dealing with NaN values</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Initial number of rows
</span><span class="n">na_rows</span> <span class="o">=</span> <span class="n">df8</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># Count the NaN values in each column
</span><span class="n">df8</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span>

<span class="c1"># Report the NaN values in each column as a proportion of all rows
</span><span class="n">df8</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span><span class="o">/</span><span class="n">df8</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Country/Region    171061
Province/State    223208
Latitude          171062
Longitude         171062
Confirmed             19
Recovered            386
Deaths               432
Date                   0
dtype: int64
Country/Region    0.137736
Province/State    0.179724
Latitude          0.137736
Longitude         0.137736
Confirmed         0.000015
Recovered         0.000311
Deaths            0.000348
Date              0.000000
dtype: float64
</code></pre></div></div>

<p>Imputing <code class="language-plaintext highlighter-rouge">NaN</code> values with zero</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Impute NaN with Zero
</span><span class="n">df8_1</span> <span class="o">=</span> <span class="n">df8</span><span class="p">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

<span class="c1">#Display no NaNs
</span><span class="n">df8_1</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">any</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Country/Region    False
Province/State    False
Latitude          False
Longitude         False
Confirmed         False
Recovered         False
Deaths            False
Date              False
dtype: bool
</code></pre></div></div>

<p>Removing rows with missing values</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Drop Na values
</span><span class="n">df8</span> <span class="o">=</span> <span class="n">df8</span><span class="p">.</span><span class="n">dropna</span><span class="p">()</span>

<span class="c1"># How many rows were dropped?
</span><span class="n">na_rows</span> <span class="o">-</span> <span class="n">df8</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>223572
</code></pre></div></div>
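
<p>Filling every gap with zero and dropping every incomplete row are the two extremes. Both methods also accept column-level choices: <code class="language-plaintext highlighter-rouge">fillna</code> takes a dictionary of per-column values, and <code class="language-plaintext highlighter-rouge">dropna</code> takes a <code class="language-plaintext highlighter-rouge">subset</code> of columns to check. A sketch with made-up rows that borrow some column names from the output above:</p>

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow the example above
cases = pd.DataFrame({
    "Province/State": ["Hubei", np.nan, np.nan],
    "Confirmed": [10.0, np.nan, 7.0],
    "Deaths": [1.0, 0.0, np.nan],
})

# Fill each listed column with a value that makes sense for it
filled = cases.fillna({"Province/State": "Unknown", "Deaths": 0})

# Drop only the rows where Confirmed is missing; other gaps are kept
kept = cases.dropna(subset=["Confirmed"])

print(filled.isna().sum())
print(len(cases) - len(kept))  # 1
```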

<h3 id="pivot-tables">Pivot Tables</h3>

<p>A pivot table is another way to summarise data, using categorical
variables to split the dataframe. By default, the <code class="language-plaintext highlighter-rouge">.pivot_table</code> method
computes the mean of a variable.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Pivot for mean weekly_sales for each store type
</span><span class="n">df6</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">"weekly_sales"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"type"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      weekly_sales
type              
A     23674.667242
B     25696.678370
</code></pre></div></div>

<p>An example selecting two variables</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Pivot for mean weekly_sales and unemployment for each store type
</span><span class="n">df6</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="p">[</span><span class="s">"weekly_sales"</span><span class="p">,</span> <span class="s">"unemployment"</span><span class="p">],</span> <span class="n">index</span><span class="o">=</span><span class="s">"type"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      unemployment  weekly_sales
type                            
A         7.972611  23674.667242
B         9.279323  25696.678370
</code></pre></div></div>

<p>We can extend the pivot table capabilities by passing two functions
instead of one through the <code class="language-plaintext highlighter-rouge">aggfunc</code> argument, as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Pivot for mean and median weekly_sales for each store type
</span><span class="n">df6</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">"weekly_sales"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"type"</span><span class="p">,</span> <span class="n">aggfunc</span><span class="o">=</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">median</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              mean       median
      weekly_sales weekly_sales
type                           
A     23674.667242     11943.92
B     25696.678370     13336.08
</code></pre></div></div>

<p>We can split the dataframe by passing one variable as <code class="language-plaintext highlighter-rouge">index</code> and another
as <code class="language-plaintext highlighter-rouge">columns</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df6</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">"weekly_sales"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"type"</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">"is_holiday"</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>is_holiday         False       True
type                               
A           23768.583523  590.04525
B           25751.980533  810.70500
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">year</code> column created below is what a pivot table of the <code class="language-plaintext highlighter-rouge">avg_temp_c</code> column from <code class="language-plaintext highlighter-rouge">df7</code> needs in order to use
<code class="language-plaintext highlighter-rouge">country</code> and <code class="language-plaintext highlighter-rouge">city</code> as rows and <code class="language-plaintext highlighter-rouge">year</code> as columns.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df7</span><span class="p">[</span><span class="s">"year"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df7</span><span class="p">[</span><span class="s">"date"</span><span class="p">].</span><span class="n">dt</span><span class="p">.</span><span class="n">year</span>
</code></pre></div></div>
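
<p>The snippet above only prepares the <code class="language-plaintext highlighter-rouge">year</code> column. A minimal sketch of the pivot itself, assuming the intended call is <code class="language-plaintext highlighter-rouge">pivot_table</code> with <code class="language-plaintext highlighter-rouge">country</code> and <code class="language-plaintext highlighter-rouge">city</code> as the index and <code class="language-plaintext highlighter-rouge">year</code> as the columns, and using made-up rows in place of <code class="language-plaintext highlighter-rouge">df7</code>:</p>

```python
import pandas as pd

# Made-up rows; only the column names follow df7
temps = pd.DataFrame({
    "country": ["Côte D'Ivoire"] * 4,
    "city": ["Abidjan"] * 4,
    "year": [2000, 2000, 2001, 2001],
    "avg_temp_c": [27.0, 29.0, 26.0, 28.0],
})

# Rows: one per (country, city) pair. Columns: one per year. Cells: mean temperature.
by_year = temps.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")
print(by_year)
```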

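<p>The next pivot splits the columns by two variables, replaces the <code class="language-plaintext highlighter-rouge">NaN</code> values with <code class="language-plaintext highlighter-rouge">fill_value=0</code>, and adds an <code class="language-plaintext highlighter-rouge">All</code>
row and column with the argument <code class="language-plaintext highlighter-rouge">margins=True</code>. Because the default aggregation is the mean, these margins hold means rather than sums.</p>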

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df6</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="s">"weekly_sales"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"department"</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">"store"</span><span class="p">,</span> <span class="s">"type"</span><span class="p">],</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">margins</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>store                  1             2  ...            39           All
type                   A             A  ...             A              
department                              ...                            
1           23491.755000  32392.588333  ...  21423.068333  32052.467153
2           47421.124167  68156.664167  ...  60768.638333  71380.022778
3           12872.590000  17012.000000  ...  16847.852500  18278.390625
4           38382.255833  47650.447500  ...  41670.077500  44863.253681
5           23761.120000  30331.717500  ...  23466.439167  37189.000000
...                  ...           ...  ...           ...           ...
96          27897.153333  33841.960833  ...  24947.875833  20337.607681
97          33771.761667  40757.997500  ...  23002.670000  26584.400833
98          10853.782500  14009.203333  ...   9089.097500  11820.590278
99            466.364545    455.516364  ...    317.189091    379.123659
All         20896.941787  26517.435162  ...  18414.938423  23843.950149

[81 rows x 13 columns]
</code></pre></div></div>
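
<p>To turn the <code class="language-plaintext highlighter-rouge">All</code> margins into totals, the aggregation can be set to <code class="language-plaintext highlighter-rouge">sum</code> explicitly. A sketch with made-up sales figures, contrasting the default mean with <code class="language-plaintext highlighter-rouge">aggfunc="sum"</code>:</p>

```python
import pandas as pd

# Made-up rows; only the column names follow the sales example
sales = pd.DataFrame({
    "department": [1, 1, 2],
    "type": ["A", "B", "A"],
    "weekly_sales": [100.0, 300.0, 50.0],
})

# Default aggfunc (mean): the "All" margins are means
means = sales.pivot_table(values="weekly_sales", index="department",
                          columns="type", fill_value=0, margins=True)

# aggfunc="sum": the "All" margins are totals
totals = sales.pivot_table(values="weekly_sales", index="department",
                           columns="type", aggfunc="sum", fill_value=0, margins=True)

print(means)
print(totals)
```

<p>Department 2 has no type B rows in this toy data, so that cell is filled with 0 in both tables.</p>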

<h2 id="exporting">Exporting</h2>

<h3 id="dataframe-to-csv">DataFrame to CSV</h3>

<p>The most common way to export data is to save the dataframe to a
<code class="language-plaintext highlighter-rouge">csv</code> file:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df9</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"df9.csv"</span><span class="p">)</span>
</code></pre></div></div>
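
<p>By default, <code class="language-plaintext highlighter-rouge">to_csv</code> also writes the row index as an unnamed first column, which appears as <code class="language-plaintext highlighter-rouge">Unnamed: 0</code> when the file is read back. Passing <code class="language-plaintext highlighter-rouge">index=False</code> leaves it out. A sketch that writes to an in-memory string instead of a file, with a made-up frame standing in for <code class="language-plaintext highlighter-rouge">df9</code>:</p>

```python
import io
import pandas as pd

# Made-up frame standing in for df9
table = pd.DataFrame({"city": ["Abidjan", "Lima"], "avg_temp_c": [27.3, 19.1]})

# With the default index=True, reading the text back yields an "Unnamed: 0" column
with_index = pd.read_csv(io.StringIO(table.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(table.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'city', 'avg_temp_c']
print(list(without_index.columns))  # ['city', 'avg_temp_c']
```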

<h3 id="dataframe-to-latex">DataFrame to LaTeX</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">style</span><span class="p">.</span><span class="n">to_latex</span><span class="p">())</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>\begin{tabular}{lrrrr}
 &amp; A &amp; B &amp; C &amp; D \\
2013-01-01 00:00:00 &amp; 0.817994 &amp; -0.924007 &amp; -1.515711 &amp; -0.198598 \\
2013-01-02 00:00:00 &amp; 0.673364 &amp; -1.914110 &amp; -0.126208 &amp; -0.282033 \\
2013-01-03 00:00:00 &amp; 1.312579 &amp; 0.340656 &amp; -0.300397 &amp; -0.838614 \\
2013-01-04 00:00:00 &amp; -0.732977 &amp; -0.560867 &amp; -0.515910 &amp; -0.768784 \\
2013-01-05 00:00:00 &amp; -2.045106 &amp; -0.929131 &amp; -0.029660 &amp; 0.529883 \\
2013-01-06 00:00:00 &amp; -1.343257 &amp; -0.250821 &amp; -0.046303 &amp; 0.944569 \\
\end{tabular}
</code></pre></div></div>

<h2 id="references">References</h2>

<ul>
  <li>
    <p>Stack Overflow (2022)</p>
  </li>
  <li>
    <p>Data Camp (2022)</p>
  </li>
</ul>

<div id="refs" class="references csl-bib-body hanging-indent">

<div id="ref-datacamp1" class="csl-entry">

Data Camp. 2022. “<span class="nocase">Data Manipulation with pandas -
DataCamp Learn</span>.”
&lt;https://app.datacamp.com/learn/courses/data-manipulation-with-pandas&gt;.

</div>

<div id="ref-stackover1" class="csl-entry">

Stack Overflow. 2022. “<span class="nocase">Change column type in
pandas</span>.”
&lt;https://stackoverflow.com/questions/15891038/change-column-type-in-pandas&gt;.

</div>

</div>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Importing Modules]]></summary></entry><entry><title type="html">Introduction to Algorithms</title><link href="https://mario1084.github.io/blog/2022/03/27/intro_algorithms.html" rel="alternate" type="text/html" title="Introduction to Algorithms" /><published>2022-03-27T00:00:00+00:00</published><updated>2022-03-27T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2022/03/27/intro_algorithms</id><content type="html" xml:base="https://mario1084.github.io/blog/2022/03/27/intro_algorithms.html"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>In lectures one to four, we set the stage for the heart of Applied Data Science: algorithmic thinking for problem-solving. In lecture one, we learned about the scope of Data Science and the rise of Big Data. Lecture two is an introduction to inductive reasoning applied to Data Science. Lectures three and four are an overview of basic estimation using statistics and econometrics with <strong>R</strong> programming. In this lecture, I cover another pillar of Data Science: algorithm programming using <em>control flow structures</em> and <em>functions</em>.</p>

<p>Functions and control flow structures are the building blocks of algorithm programming. So far, we have used <code class="language-plaintext highlighter-rouge">r-packages</code> and, more specifically, <code class="language-plaintext highlighter-rouge">FUN(X)</code> functions that take arguments and perform a certain action. In this lecture, however, we will learn the elementary building blocks of algorithmic programming. Learning them has two main advantages for your training in Data Science. Firstly, algorithmic programming allows you to understand in detail how functions work. I’m sure that by now you know that if we pass a numeric vector <code class="language-plaintext highlighter-rouge">x</code> to the function <code class="language-plaintext highlighter-rouge">mean(x)</code>, <strong>R</strong> somehow computes the mean. After learning the building blocks of algorithmic programming, we will be able to understand how such functions work: what series of steps lies behind the computation, and how and in which order the arguments of the function are used. In a nutshell, algorithmic programming enables you to deeply understand functions and packages in <strong>R</strong>.</p>

<p>The second advantage of algorithmic programming is that it enables you to go beyond the “off-the-shelf” functions from <strong>R-base</strong> and other packages. Instead of being bound to existing functions, algorithmic programming gives you the tools to create your own. In general, it is recommended to first check whether a function already exists that performs the action you want. But knowing algorithmic programming removes the constraint of only using available tools and gives you the freedom to develop tools that fulfil your particular needs. We would like to build our own functions and algorithms in two cases. Firstly, when we can’t find a similar function in <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html">The R Base</a> or in the packages maintained by <a href="https://cran.r-project.org/web/packages/available_packages_by_name.html">The Comprehensive R Archive Network (CRAN)</a>. Especially if this is your first course in Data Science, you should verify whether there is a function on CRAN that fulfils your needs before investing time in building your own. Using a function from CRAN is typically a better option, not only because we save time, but also because the code is audited by leading experts in their corresponding fields. Secondly, we may opt to build our own function when we often perform the same sequence of functions or repetitive steps. For instance, I typically use the function <code class="language-plaintext highlighter-rouge">lapply</code> combined with the function <code class="language-plaintext highlighter-rouge">class</code> to verify the class of each column of a <code class="language-plaintext highlighter-rouge">df</code> (<code class="language-plaintext highlighter-rouge">data.frame</code>) in the following manner: <code class="language-plaintext highlighter-rouge">lapply(df, class)</code>.</p>
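
<p>A repeated step like this one is a natural candidate for a small function of our own. The sketch below is only an illustration: the name <code class="language-plaintext highlighter-rouge">col_classes</code> and the toy <code class="language-plaintext highlighter-rouge">df</code> are assumptions, and <code class="language-plaintext highlighter-rouge">vapply</code> is swapped in for <code class="language-plaintext highlighter-rouge">lapply</code> so that the result is a named character vector.</p>

<pre><code class="language-{r}">
# Hypothetical wrapper around the repeated class-checking step
col_classes &lt;- function(df) {
    stopifnot(is.data.frame(df))
    # keep only the first class of each column so every result has length one
    vapply(df, function(col) class(col)[1], character(1))
}

# Toy data frame, invented for the example
df &lt;- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(1.5, 2, 3.5),
                 stringsAsFactors = FALSE)
col_classes(df)
</code></pre>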

<h1 id="what-is-an-algorithm">What is an algorithm?</h1>

<p>An algorithm is simply a “well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output” (Cormen et al. 2022). Indeed, an algorithm serves a specific purpose and follows a specific procedure designed to solve a problem. Before we learn the syntax needed to write algorithms in <strong>R</strong>, we are going to understand the structure (procedure) of some example algorithms using pseudo-code or flow diagrams. Then, in a second step, we will review the specific R code that we need to implement the algorithm in <strong>R</strong>.</p>

<h2 id="example-1-fibonacci-sequence">Example 1: Fibonacci Sequence</h2>

<blockquote>
  <p>Input: A variable <code class="language-plaintext highlighter-rouge">x</code> that is the number of values to generate in the Fibonacci series.</p>
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">z</code> that is the Fibonacci series of length <code class="language-plaintext highlighter-rouge">x</code>.</p>
</blockquote>


<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Fibonacci Sequence Generator:
    <div class="mermaid">
    graph TD
    A[Input x: length of sequence to generate] --&gt;|Start| B(Sum the last 2 numbers)
    B--&gt; C(Add the number to the series z)
    C--&gt; D(Count the y of generated numbers)
    D--&gt; Z{Is x = y?}
    Z--&gt; |NO| B
    Z--&gt; |Yes| R(Return the sequence z)
    </div>
</div>
<p><br /></p>

<h1 id="flow-controls-while-and-if">Flow controls: <code class="language-plaintext highlighter-rouge">while</code> and <code class="language-plaintext highlighter-rouge">if</code></h1>

<p>To implement the algorithm in Example 1, we need to expand our knowledge of <strong>R</strong> operators. The operators <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">while</code> are always followed by parentheses <code class="language-plaintext highlighter-rouge">(...)</code> that assess a logical condition and by braces <code class="language-plaintext highlighter-rouge">{...}</code> that contain the set of actions <code class="language-plaintext highlighter-rouge">...</code> performed if the condition is fulfilled. In Example 1, the input of the algorithm is a variable <code class="language-plaintext highlighter-rouge">x</code> that defines the number of elements to generate in the sequence. The output of the algorithm is a sequence <code class="language-plaintext highlighter-rouge">z</code>, and <code class="language-plaintext highlighter-rouge">y</code> tracks the number of elements that <code class="language-plaintext highlighter-rouge">z</code> currently has. To generate the series, we repeatedly append to <code class="language-plaintext highlighter-rouge">z</code> the sum of its last two elements until its total length is <code class="language-plaintext highlighter-rouge">x</code>. To start the algorithm we need the variable <code class="language-plaintext highlighter-rouge">x</code> that inputs the number of values to generate, the first two values of the series <code class="language-plaintext highlighter-rouge">z</code>, and <code class="language-plaintext highlighter-rouge">y</code>, which is the initial value of the length of <code class="language-plaintext highlighter-rouge">z</code>. We assume that the user requests more than two numbers in the series, so <code class="language-plaintext highlighter-rouge">x&gt;2</code> (requesting two numbers or fewer makes the algorithm redundant, because the first two values are given).</p>

<p>Next, the algorithm needs to be programmed to keep repeating a series of steps until the goal is reached. Remember that the goal is to produce a series <code class="language-plaintext highlighter-rouge">z</code> that has a total length of <code class="language-plaintext highlighter-rouge">x</code>. Notice that <code class="language-plaintext highlighter-rouge">y</code> is the actual number of elements in the series <code class="language-plaintext highlighter-rouge">z</code> at each iteration of the process of adding one element to the series. To operationalize the algorithm we are going to use <code class="language-plaintext highlighter-rouge">while(y&lt;x){...}</code>, which assesses the condition <code class="language-plaintext highlighter-rouge">y&lt;x</code>. The operator performs the set of actions <code class="language-plaintext highlighter-rouge">...</code> only while the condition <code class="language-plaintext highlighter-rouge">y&lt;x</code> is satisfied. That means that the algorithm using <code class="language-plaintext highlighter-rouge">while</code> stops when <code class="language-plaintext highlighter-rouge">y&gt;=x</code> (when the corresponding series <code class="language-plaintext highlighter-rouge">z</code> has a total length of <code class="language-plaintext highlighter-rouge">x</code> or more).</p>

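<pre><code class="language-{r}">
x &lt;- 20L          # number of values requested
z &lt;- c(0L, 1L)    # the first two Fibonacci numbers
y &lt;- length(z)    # current length of the series

while (y &lt; x) {
    # append the sum of the last two elements of z
    z[length(z) + 1L] &lt;- sum(z[c(length(z) - 1L, length(z))])
    y &lt;- length(z)
}
z

</code></pre>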
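
<p>The same loop can be wrapped in a function, with an <code class="language-plaintext highlighter-rouge">if</code> that handles requests of two values or fewer. This is only a sketch: the function name <code class="language-plaintext highlighter-rouge">fibonacci</code> is an assumption, and the series is stored as ordinary numeric values rather than integers so that long series do not overflow the integer type.</p>

<pre><code class="language-{r}">
# Hypothetical wrapper around the while loop of Example 1
fibonacci &lt;- function(x) {
    z &lt;- c(0, 1)                     # the first two values are given
    if (x &lt;= 2) {
        return(z[seq_len(x)])        # short requests need no loop
    }
    while (length(z) &lt; x) {
        z[length(z) + 1L] &lt;- sum(tail(z, 2L))  # sum of the last two values
    }
    z
}

fibonacci(10)  # 0 1 1 2 3 5 8 13 21 34
</code></pre>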

<h2 id="example-2-sorthing-algorithm">Example 2: Sorting Algorithm</h2>

<blockquote>
  <p>Input: A variable \(x=\{a_1, a_2, \dots, a_n\}\) of <code class="language-plaintext highlighter-rouge">n</code> rational numbers.</p>
</blockquote>

<blockquote>
  <p>Output: A permutation (reordering) of <code class="language-plaintext highlighter-rouge">x</code> called <code class="language-plaintext highlighter-rouge">y</code> such that \(y=\{a_1^*, a_2^*, \dots, a_n^*\}\) with \(a_1^* \le a_2^* \le \dots \le a_n^*\).</p>
</blockquote>

<p><em>Source: (Cormen, Leiserson, Rivest, and Stein, 2022).</em></p>


<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Sorting Algorithm
    <div class="mermaid">
    graph TD
    A[Input x: a series of rational numbers length n] --&gt;|Start| B(Take 'n&gt;=j&gt;1' from 'x' and store it in 'key')
    B --&gt; C(Location of the previous number: 'i=j-1')
    C --&gt; Z{Is x_i -previous- &gt; key -next- ? }
    Z --&gt; |Yes| Y1(Swap x_i -previous with key -next- )
    Y1--&gt;Y2(Swap key-next with x_i -previous- )
    Z --&gt; |NO| B
    </div>
</div>
<p><br /></p>

<h2 id="for-loop"><code class="language-plaintext highlighter-rouge">for</code> loop</h2>

<p>The sorting algorithm of Example 2 takes a series <code class="language-plaintext highlighter-rouge">x</code> of unsorted rational numbers and, using an iterative procedure (<code class="language-plaintext highlighter-rouge">for</code> loop), compares each value in the series with the values that precede it. Using an index <code class="language-plaintext highlighter-rouge">i</code> for the previous value, an index <code class="language-plaintext highlighter-rouge">j</code> for the current value, and a storing variable <code class="language-plaintext highlighter-rouge">key</code>, the algorithm swaps places whenever a previous number <code class="language-plaintext highlighter-rouge">x[i]</code> is greater than the number currently being compared (<code class="language-plaintext highlighter-rouge">key</code>). This algorithm performs the same action as the function <code class="language-plaintext highlighter-rouge">sort(x)</code> with the argument <code class="language-plaintext highlighter-rouge">decreasing = FALSE</code>. Therefore, the algorithm itself has no purpose other than showing how the <code class="language-plaintext highlighter-rouge">for</code> loop is used in <strong>R</strong>. The most fundamental aspect of the <code class="language-plaintext highlighter-rouge">for</code> loop is that it takes each of the <code class="language-plaintext highlighter-rouge">n</code> values of a series in turn and performs a list of steps in each iteration. In this case, the algorithm evaluates each number in the series <code class="language-plaintext highlighter-rouge">x</code> to verify whether a previous number is bigger than the current one, <code class="language-plaintext highlighter-rouge">x[i]&gt;x[j]</code>. If that is the case, the algorithm swaps the previous number <code class="language-plaintext highlighter-rouge">x[i]</code> with the current number being evaluated <code class="language-plaintext highlighter-rouge">x[j]</code>.</p>

<pre><code class="language-{r}">
# Unsorted
x &lt;- sample(1L:99L, 15)
x

# Sorted with the built-in function
sort(x, decreasing = FALSE)

# Iterative sorting algorithm
for (j in 2L:length(x)) {
    key &lt;- x[j]   # current number being compared
    i &lt;- j - 1    # index of the previous number
    while (i &gt; 0 &amp;&amp; x[i] &gt; key) { # previous number (x[i]) is greater than key
        x[i + 1] &lt;- x[i]  # move the previous number one place to the right
        i &lt;- i - 1
        x[i + 1] &lt;- key   # put key where the previous number was (completes the swap)
    }
}
# Sorted
x
</code></pre>
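
<p>For comparison, the textbook form of insertion sort (Cormen et al. 2022) shifts the larger values to the right and places <code class="language-plaintext highlighter-rouge">key</code> only once, after the <code class="language-plaintext highlighter-rouge">while</code> loop ends. The sketch below wraps that variant in a function, hypothetically named <code class="language-plaintext highlighter-rouge">insertion_sort</code>, and checks it against <code class="language-plaintext highlighter-rouge">sort()</code>.</p>

<pre><code class="language-{r}">
# Hypothetical function: insertion sort that shifts values, then inserts the key once
insertion_sort &lt;- function(x) {
    for (j in seq_along(x)[-1]) {       # start at the second element
        key &lt;- x[j]
        i &lt;- j - 1L
        while (i &gt; 0L &amp;&amp; x[i] &gt; key) {
            x[i + 1L] &lt;- x[i]           # shift the larger value to the right
            i &lt;- i - 1L
        }
        x[i + 1L] &lt;- key                # drop the key into the gap
    }
    x
}

x &lt;- sample(1L:99L, 15)
identical(insertion_sort(x), sort(x))   # TRUE when both orderings agree
</code></pre>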

<h2 id="example-3-odds-and-even-numbers">Example 3: Odds and Even numbers</h2>

<p>This example may have only a pedagogical application. The algorithm repeatedly samples one random number with <code class="language-plaintext highlighter-rouge">sample(..., 1)</code> between one and <code class="language-plaintext highlighter-rouge">lim</code> to generate a numeric series <code class="language-plaintext highlighter-rouge">x</code> of length <code class="language-plaintext highlighter-rouge">y</code> of <em>even</em> or <em>odd</em> numbers.</p>

<blockquote>
  <p>Input: A variable <code class="language-plaintext highlighter-rouge">lim</code> that defines the range of numbers to sample \([1, lim]\) and the length of the series to generate. Also, we need a binary variable to switch the series from  <em>even</em> to <em>odd</em> numbers.</p>
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">x</code> of <em>even</em> or <em>odd</em> numbers.</p>
</blockquote>

<h2 id="if-else"><code class="language-plaintext highlighter-rouge">if</code>, <code class="language-plaintext highlighter-rouge">else</code></h2>

<p>A good analogy to understand the dynamics of <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">else</code> operators is a choice or selection between a set of possible categories.</p>
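
<p>A minimal sketch of such a choice between categories follows. The function name <code class="language-plaintext highlighter-rouge">sign_label</code> is invented for the illustration; each condition is checked in order, and only the first one that holds is executed.</p>

<pre><code class="language-{r}">
# Hypothetical example: pick one of three categories for a number
sign_label &lt;- function(n) {
    if (n &lt; 0) {
        "negative"
    } else if (n == 0) {
        "zero"
    } else {
        "positive"   # reached only when the two conditions above fail
    }
}

sign_label(-3)  # "negative"
sign_label(0)   # "zero"
sign_label(7)   # "positive"
</code></pre>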

<h2 id="example-5-choice-algorithm">Example 5: Choice Algorithm</h2>

<p>Suppose you have a bag of candies with the following flavours: <code class="language-plaintext highlighter-rouge">c("orange", "lemon", "strawberry", "mango")</code>. You prefer <code class="language-plaintext highlighter-rouge">lemon</code> above all and <code class="language-plaintext highlighter-rouge">strawberry</code> over <code class="language-plaintext highlighter-rouge">orange</code>, and you dislike <code class="language-plaintext highlighter-rouge">mango</code>. Suppose that the bag contains <code class="language-plaintext highlighter-rouge">100</code> candies, and you are interested in how many candies you would need to take (by chance) before getting <code class="language-plaintext highlighter-rouge">3 lemon</code> candies in total.</p>

<blockquote>
  <p>Input: A random sample <code class="language-plaintext highlighter-rouge">candy_bag</code> with repetition of size 100 of candies (the bag).</p>
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">picks</code> of the candies taken, which ends once it contains three lemon candies.</p>
</blockquote>

<pre><code class="language-{r}">
library(ggplot2)

candies &lt;- c("orange", "lemon", "strawberry", "mango")

# The bag: 100 candies sampled with replacement
candy_bag &lt;- sample(candies, 100L, replace = TRUE)

picks &lt;- character(0)  # candies kept so far
c &lt;- 1L                # position of the next pick

while (sum(picks == "lemon") &lt; 3) {
  pick &lt;- sample(candy_bag, 1L)
  if (pick == "lemon") {
    picks[c] &lt;- pick
    c &lt;- c + 1L
  } else if (pick == "strawberry") {
    picks[c] &lt;- pick
    c &lt;- c + 1L
  } else if (pick == "orange") {
    picks[c] &lt;- pick
    c &lt;- c + 1L
  }
  # mango matches none of the conditions, so it is not recorded
}

# Number of picks
length(picks)

# Distribution of picks
ggplot(as.data.frame(table(picks)), aes(x = picks, y = Freq)) +
  geom_bar(stat = "identity")
</code></pre>
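
<p>A single run gives only one possible answer to the question. As a rough sketch, the experiment can be wrapped in a function and repeated many times to see how the number of picks varies. The function name <code class="language-plaintext highlighter-rouge">picks_until_lemons</code>, the seed, and the 1000 repetitions are all assumptions made for this illustration.</p>

<pre><code class="language-{r}">
set.seed(123)  # arbitrary seed so the run can be repeated

# Hypothetical helper: number of recorded picks until `target` lemons are drawn
picks_until_lemons &lt;- function(candy_bag, target = 3L) {
    stopifnot("lemon" %in% candy_bag)   # otherwise the loop would never end
    picks &lt;- character(0)
    while (sum(picks == "lemon") &lt; target) {
        pick &lt;- sample(candy_bag, 1L)
        if (pick != "mango") {          # mango candies are not recorded
            picks &lt;- c(picks, pick)
        }
    }
    length(picks)
}

candies &lt;- c("orange", "lemon", "strawberry", "mango")
candy_bag &lt;- sample(candies, 100L, replace = TRUE)

n_picks &lt;- replicate(1000L, picks_until_lemons(candy_bag))
mean(n_picks)     # average number of recorded picks
range(n_picks)    # best and worst case observed
</code></pre>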

<h2 id="functions"><code class="language-plaintext highlighter-rouge">Functions</code></h2>

<p><code class="language-plaintext highlighter-rouge">FUN(...)</code> functions make explicit the kind of input that the algorithm needs in the form of <strong>arguments</strong>. Functions can take <code class="language-plaintext highlighter-rouge">n=...</code> arguments, as many as our implementation may require. As we mentioned before, the operator <code class="language-plaintext highlighter-rouge">if(...)</code> evaluates a logical condition and is the gatekeeper of a set of operations grouped within <code class="language-plaintext highlighter-rouge">{}</code> braces. Finally, the <code class="language-plaintext highlighter-rouge">else if</code> and the <code class="language-plaintext highlighter-rouge">else</code> operators evaluate further logical conditions, always after the first condition stated in the <code class="language-plaintext highlighter-rouge">if</code> operator.</p>
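
<p>As a small illustration of arguments and default values, the sketch below defines a hypothetical function <code class="language-plaintext highlighter-rouge">power_sum</code>. Its argument names and defaults are invented for the example; an <code class="language-plaintext highlighter-rouge">if</code> guards the optional step of dropping missing values.</p>

<pre><code class="language-{r}">
# Hypothetical function: one required argument and two with default values
power_sum &lt;- function(x, power = 2, na.rm = FALSE) {
    if (na.rm) {
        x &lt;- x[!is.na(x)]   # optional step, run only when requested
    }
    sum(x^power)
}

power_sum(c(1, 2, 3))                    # default power = 2, returns 14
power_sum(c(1, 2, 3), power = 3)         # overrides the default, returns 36
power_sum(c(1, NA, 3), na.rm = TRUE)     # drops the NA first, returns 10
</code></pre>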

<h2 id="example-6-even-or-odd-numbers">Example 6: Even or Odd numbers</h2>
<p>In the implementation of Example 6, the algorithm employs <code class="language-plaintext highlighter-rouge">if</code> and <code class="language-plaintext highlighter-rouge">else if</code> to select between an <em>odd</em> or <em>even</em> number. Instead of a <code class="language-plaintext highlighter-rouge">for</code> loop, which has a fixed number of iterations, the example uses a <code class="language-plaintext highlighter-rouge">while</code> loop to prevent the algorithm from stopping before it has generated a series of length <code class="language-plaintext highlighter-rouge">y</code> of even or odd numbers.</p>

<blockquote>
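  <p>Input: An upper limit <code class="language-plaintext highlighter-rouge">lim</code> for the range of numbers to sample \([1, lim]\), the length <code class="language-plaintext highlighter-rouge">y</code> of the series to generate, and a logical variable <code class="language-plaintext highlighter-rouge">even</code> that switches between <em>even</em> and <em>odd</em> numbers.</p>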
</blockquote>

<blockquote>
  <p>Output: A series <code class="language-plaintext highlighter-rouge">x</code> of length <code class="language-plaintext highlighter-rouge">y</code> of <em>even</em> or <em>odd</em> numbers.</p>
</blockquote>

<pre><code class="language-{r}">
odd_even &lt;- function(lim = 100L, y = 25L, even = TRUE) {
    x &lt;- vector(mode = "numeric")
    i &lt;- 1L
    while (length(x) &lt; y) {
        n &lt;- sample(1L:lim, 1L)
        if (even &amp;&amp; n %% 2 == 0) {            # keep even numbers
            x[i] &lt;- n
            i &lt;- i + 1L
        } else if (!even &amp;&amp; n %% 2 != 0) {    # keep odd numbers
            x[i] &lt;- n
            i &lt;- i + 1L
        }
    }
    x
}

# Generate 20 odd numbers between 1 and 1000
odd_even(lim = 1000L, y = 20L, even = FALSE)

# Generate 20 even numbers between 1 and 1000
odd_even(lim = 1000L, y = 20L, even = TRUE)

</code></pre>

<h2 id="example-7--randomized-hire-assistant">Example 7:  Randomized Hire-Assistant</h2>

<p>Finally, Example 7 combines the previous control flows. It starts by assuming that there is a fixed supply of assistants in the data science labor market. Further, it assumes that the process of selection and interview makes their ability explicit. Candidates arrive at the interview in random order, and the goal is to select the top candidate within a fixed number of interviews.</p>

<blockquote>
  <p>Input: The number of candidates <code class="language-plaintext highlighter-rouge">supply</code> with a vector <code class="language-plaintext highlighter-rouge">a</code> of their abilities. Additionally, a vector <code class="language-plaintext highlighter-rouge">interviews</code> that contains the maximum number of interviews in each experiment.</p>
</blockquote>

<blockquote>
  <p>Output: A matrix <code class="language-plaintext highlighter-rouge">H</code> that stores the ability of each hired assistant in rows <code class="language-plaintext highlighter-rouge">i</code>, with one column <code class="language-plaintext highlighter-rouge">j</code> per round of interviews. From this matrix, we are interested in estimating the total number of hires for each round of interviews, <code class="language-plaintext highlighter-rouge">c(150L, 250L, 1000L, 3000L, 5000L, 10000L, 15000L)</code>.</p>
</blockquote>

<p>The algorithm uses a <code class="language-plaintext highlighter-rouge">for</code> loop to perform an iterative selection of candidates over the vector <code class="language-plaintext highlighter-rouge">interviews</code>. In each round, it draws a random sample of candidates <code class="language-plaintext highlighter-rouge">selected</code> for the interviews. Using a <code class="language-plaintext highlighter-rouge">while</code> operator, each round continues to run until <code class="language-plaintext highlighter-rouge">length(selected)==0</code>. In each run, I sample one candidate for <code class="language-plaintext highlighter-rouge">interview</code> and remove it from the <code class="language-plaintext highlighter-rouge">selected</code> vector. Finally, using an <code class="language-plaintext highlighter-rouge">if(best&lt;interview)</code> operator, the algorithm hires a candidate if their ability is higher than that of the current best candidate.</p>

<pre><code class="language-{r}">supply &lt;- 20000L
a &lt;- runif(supply)   # ability of every candidate in the market
interviews &lt;- c(150L, 250L, 1000L, 3000L, 5000L, 10000L, 15000L)

# One column per round of interviews; rows hold the ability of each hire
H &lt;- matrix(NA, nrow = supply, ncol = length(interviews))

for (j in seq_along(interviews)) {
    selected &lt;- sample(a, interviews[j])  # candidates invited to interview
    h &lt;- 1L      # row for the next hire
    best &lt;- 0    # ability of the best candidate so far
    while (length(selected) != 0) {
        i &lt;- sample(1L:length(selected), 1)
        interview &lt;- selected[i]
        selected &lt;- selected[-i]

        if (best &lt; interview) {  # hire when the candidate beats the current best
            best &lt;- interview
            H[h, j] &lt;- interview
            h &lt;- h + 1L
        }
    }
}

# Total Hires
hires &lt;- colSums(!is.na(H))
names(hires) &lt;- paste0(interviews)
hires

# Average Hiring Ability
ability &lt;- colSums(H, na.rm = TRUE)
avg_ability &lt;- ability / hires
avg_ability

# Plot
plot(x = interviews, y = avg_ability)
</code></pre>
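
<p>When candidates arrive in random order, the expected number of hires among \(n\) interviews is the harmonic number \(H_n = \sum_{k=1}^{n} 1/k \approx \ln n\) (Cormen et al. 2022). The sketch below compares that value with a simulation. It is a vectorised alternative and not the loop used above: a hire is counted whenever a candidate equals the running maximum given by <code class="language-plaintext highlighter-rouge">cummax</code>. The function name <code class="language-plaintext highlighter-rouge">count_hires</code>, the seed, and the 200 repetitions are assumptions for illustration.</p>

<pre><code class="language-{r}">
set.seed(42)  # arbitrary seed so the run can be repeated

# Hypothetical helper: number of hires among n candidates in random order
count_hires &lt;- function(n) {
    ability &lt;- runif(n)                # abilities in order of arrival
    sum(ability == cummax(ability))    # a hire happens at every new running maximum
}

interviews &lt;- c(150L, 250L, 1000L, 3000L, 5000L, 10000L, 15000L)

# Average number of hires over 200 simulated rounds per interview size
simulated &lt;- sapply(interviews, function(n) mean(replicate(200L, count_hires(n))))

# Theoretical expectation: the n-th harmonic number
expected &lt;- sapply(interviews, function(n) sum(1 / seq_len(n)))

round(cbind(interviews, simulated, expected), 2)
</code></pre>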

<h1 id="references">References</h1>

<div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-cormen2022introduction" class="csl-entry">
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein. 2022.
<em>Introduction to Algorithms, Fourth Edition</em>. MIT Press. <a href="https://books.google.nl/books?id=RSMuEAAAQBAJ">https://books.google.nl/books?id=RSMuEAAAQBAJ</a>.
</div>
</div>]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Inferential Statistics: Causation or Correlation?</title><link href="https://mario1084.github.io/blog/2022/03/21/corr_caus.html" rel="alternate" type="text/html" title="Inferential Statistics: Causation or Correlation?" /><published>2022-03-21T00:00:00+00:00</published><updated>2022-03-21T00:00:00+00:00</updated><id>https://mario1084.github.io/blog/2022/03/21/corr_caus</id><content type="html" xml:base="https://mario1084.github.io/blog/2022/03/21/corr_caus.html"><![CDATA[<!--  FORMAT: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet -->
<h1 id="introduction">Introduction</h1>


<p>In lecture three, we reviewed the use of descriptive statistics to answer questions such as: “What is the current state of affairs?”; “How often, how many, when?” I also introduced the use of the correlation coefficient to assess “what is the association between two variables?” However, in many cases, showing that two variables have an association is not enough. Associations only measure how the set of variables change together, but they do not say anything regarding the direction or magnitude of the relationship. To say something about the direction means to discover whether one variable is the cause or determinant of another. Here, there is a clear order in the relationship between two variables; for instance, \(X \rightarrow Y\) represents that \(X\) is the cause or determinant of \(Y\). The magnitude of the relationship refers to the measurement of the effect of \(X\) on \(Y\); for instance, if \(X\) changes by one unit, how much would \(Y\) vary?</p>

<p>The distinction between a correlation and a causal relationship between two variables is not only important but necessary in many applications. Imagine, for instance, the development of a vaccine or an important policy prescription. Obviously, the research that backs up these developments will impact the lives of many people. Therefore, we would like to make a precise inference so that we can robustly claim the magnitude and the direction of the relationship that exists between variables. An association between two variables is not strong enough to draw conclusions about the population of interest. In many cases we would then like to move from showing a correlation between variables to finding which variable is the cause or determinant of the other. This kind of research is the central quest of econometrics and Data Science and has a special place in empirical economics.</p>

<h1 id="causality-and-correlation">Causality and Correlation</h1>

<p>As it turns out, the kind of relationship between two variables is not so clear. As Data Scientists, we should proceed with scientific skepticism when we analyze the relationship between two variables. When we measure a correlation between two variables, we are merely assessing their association, which is not the same as causation. To help you draw a line between an association or correlation and a causal relationship between two variables, I elaborate on some properties that causal relationships must have:</p>

<h2 id="a-causal-mechanism">A Causal Mechanism</h2>

<p>The development of Machine Learning and Big Data is pushing the boundaries between <strong>data-driven</strong> and <strong>theory-driven</strong> research <a href="https://aisel.aisnet.org/jais/vol19/iss12/1/">(Maass, Parsons, et al., 2018)</a>. Indeed, there is nowadays a real debate on the power of Data Science to replace the scientific method:</p>

<p><br /></p>

<blockquote>
  <blockquote>
    <p>Very large databases are a major opportunity for science and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed… <a href="https://link.springer.com/article/10.1007/s10699-016-9489-4#Sec9">Calude &amp; Longo, 2017</a>.</p>
  </blockquote>
</blockquote>

<p><br /></p>

<p>However, in Economics, we are skeptical about whether data-driven methods can really be a substitute for theory-driven research. The advent of Information and Communication Technologies (ICT) and now Big Data is generating large volumes of information. The availability of all sorts of data also poses the challenge of identifying meaningful relationships between variables. The issue is that, more often than before, we can find by chance pairs of variables that seem to be related but are in fact completely disconnected from each other. In Economics, there is a long-standing concern about this kind of problem, called a <strong>spurious</strong> relationship between variables. In layman’s terms, a spurious relationship occurs when a set of variables seem to have a relationship but are in fact completely unrelated.</p>

<p>So then, what is the solution to avoid the trap of the <em>spurious</em> relationship between two variables? The answer is a well-defined and coherent theoretical framework. In fact, the mainstream of methodology in economics has always been about finding methods to prove economic theory and not the other way around. Although that trend is changing, and some Data Scientists would argue that research is becoming more data-driven, the fact is that in economics there is no substitute for a well-defined and coherent theory. The seminal work of <a href="https://www.cambridge.org/core/books/methodology-of-economics/A02870A52E4F457D4EFBAA3242BAE541">Blaug (1992)</a> takes a closer look at the developments of methodology in economics and argues that:</p>
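
<p>A small simulation makes the point concrete. The sketch below generates variables that are independent by construction and then looks at the correlations that appear among them by chance. The seed, the sample size <code class="language-plaintext highlighter-rouge">n_obs</code>, and the number of variables <code class="language-plaintext highlighter-rouge">n_vars</code> are arbitrary choices made for this illustration.</p>

<pre><code class="language-{r}">
set.seed(2022)   # arbitrary seed so the run can be repeated
n_obs &lt;- 30L     # observations per variable
n_vars &lt;- 200L   # number of unrelated variables

# Every column is drawn independently, so no true relationship exists
X &lt;- matrix(rnorm(n_obs * n_vars), nrow = n_obs)

R &lt;- cor(X)      # all pairwise correlation coefficients
diag(R) &lt;- NA    # ignore the correlation of each variable with itself

# Strongest correlation found purely by chance
max(abs(R), na.rm = TRUE)

# Share of pairs that would look "significant" at the 5% level
t_crit &lt;- qt(0.975, df = n_obs - 2L)
r_crit &lt;- t_crit / sqrt(t_crit^2 + n_obs - 2L)
share &lt;- mean(abs(R) &gt; r_crit, na.rm = TRUE)
share
</code></pre>

<p>With almost twenty thousand pairs to choose from, some correlations are large even though every variable is pure noise, which is why a theoretical reason to expect a relationship matters.</p>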

<p><br /></p>
<blockquote>
  <blockquote>
    <p>Methodology is study of the relationship between <em>theoretical concepts</em> and warranted conclusions about the real world; in particular, methodology is that branch of economics where we examine the ways in which economists justify their theories and the reasons they offer for preferring one theory over another.
<br /></p>
  </blockquote>
</blockquote>

<h2 id="an-exogenous-model">An Exogenous Model</h2>

<p>To argue that two or more variables hold a causal relationship, we must ensure that our models are exogenous. What does that mean? Well, to say that \(X \rightarrow Y\) requires that we control in the estimation for all other factors \(Z\) that affect our dependent variable (\(Z \rightarrow Y\)). If our theory suggests that \(X\) causes \(Y\), we must ensure that our estimation isolates the causal mechanism well. In other words, we must account, jointly with \(X\), for all the other determinants \(Z\) of \(Y\). If we fail to include all the variables that systematically affect \(Y\), we fall into the <strong>omitted variable bias (OVB)</strong> trap. OVB is common because, when there are variables (\(Z\)) that remain confounded or unobservable, it is hard to distinguish whether \(X\) determines \(Y\) or perhaps \(Z\) does. The threat that OVB poses to a causal estimation is represented graphically in the following diagram. Here we can see that our variable of interest \(X\) is indeed causing \(Y\); however, there is another variable \(Z\) (in the yellow region) that is jointly affecting \(Y\). Failing to control for \(Z\) induces a discrepancy between the population parameter(s) and our estimate(s) called <strong>bias</strong>.</p>

<center>
<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Omitted Variable Bias (OVB).
    <div class="mermaid">
  graph TD
    X --&gt;  Y
    subgraph OVB;
    Z --&gt; Y
    classDef red fill:#fdc
    class AN red
    end
    </div>
</div>

Biased model:

$$Y=\beta_1 X+\epsilon$$
<br />

Unbiased model:

$$Y=\beta_1 X+ \beta_2 Z+\epsilon$$
<br />

</center>

<p>The classic example is the estimation of the effect of years of <em>education</em> (\(X\)) on <em>income</em> (\(Y\)). The problem is that we can measure years of education really well, but other determinants like ability, motivation and number of hours of study are very hard to measure. Even if we have psychometric measurements of IQ, these metrics are only proxies of the latent ability at the individual level. A proxy is just an approximation of the real variable, which remains confounded or unobserved. The book by <a href="https://www.pearson.com/us/higher-education/program/Stock-Introduction-to-Econometrics-Plus-My-Lab-Economics-with-Pearson-e-Text-Access-Card-Package-4th-Edition/PGM2416966.html">Stock and Watson (2019)</a> offers another example from the study of school grades (\(Y\)) and the student-teacher ratio. The intuition of the study is that if the student-teacher ratio is high, then the grades are low. The causal mechanism that explains this negative relation is the lack of capacity of teachers to properly tutor many students. However, the estimation suffers from OVB, because it does not account for the percentage of English learners in some schools. This is a problem because migrant children might require additional tutoring, given that they have not yet mastered the language. Another potential source of OVB is the lack of control for the time of the test. As it turns out, the time of the test can impact the scores, because alertness may be lower in the early morning and late in the evening.</p>
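
<p>The biased and unbiased models above can be compared on simulated data, where the true parameters are known. The sketch below is only an illustration under stated assumptions: the coefficients, the sample size, and the way \(X\) depends on \(Z\) are invented so that the true effect of \(X\) on \(Y\) equals 1.</p>

<pre><code class="language-{r}">
set.seed(1)      # arbitrary seed so the run can be repeated
n &lt;- 5000L

z &lt;- rnorm(n)                     # determinant that may be omitted (e.g. ability)
x &lt;- 0.8 * z + rnorm(n)           # X is correlated with Z
y &lt;- 1 * x + 2 * z + rnorm(n)     # true effect of X on Y is 1

biased   &lt;- unname(coef(lm(y ~ x))["x"])      # Z omitted
unbiased &lt;- unname(coef(lm(y ~ x + z))["x"])  # Z controlled for

c(biased = biased, unbiased = unbiased)
</code></pre>

<p>Under these assumptions, the estimate that omits \(Z\) drifts towards \(\beta_1 + \beta_2\,\mathrm{Cov}(X,Z)/\mathrm{Var}(X) = 1 + 2 \times 0.8/1.64 \approx 1.98\), while the model that includes \(Z\) recovers a value close to 1.</p>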

<p>A special type of OVB is called <strong>self-selection</strong>. Self-selection appears in an estimation when there are inherent characteristics of the unit of observation that affect the outcome variable (\(Y\)) but remain confounded or latent. This perhaps sounds quite abstract, so let’s give some examples to clarify the concept. Imagine that you are interested in estimating the effect of education quality (\(X\)) on career success (\(Y\)), measured in monthly income. So you run a model and control for the different schools; in the sample you have graduates from Oxford, Harvard, Stanford and so on. Then, in your estimation, it appears that indeed the higher the rank of the university (according to the <a href="https://www.timeshighereducation.com/world-university-rankings">World University Ranking</a>), the higher the measured salary of the graduates \(Y\). But wait, aren’t more able students also more likely to enroll in highly ranked universities? Indeed, variables such as ability and motivation are very difficult to observe, and hence it is hard to determine whether schooling at highly ranked universities causes later career success. Although the association between university prestige and career success intuitively makes sense, most of the time we are only able to describe a simple correlation between the variables (Gonzalez-Sauri and Rossello 2022). Another example, from studies of science and technology, is whether research collaboration causes higher research productivity. Similarly to the previous example, it intuitively makes sense that collaboration brings gains in human capital, division of labor and pooling of resources, and that these beneficial exchanges between researchers increase the overall productivity of authors.
However, this disregards that these positive externalities from collaboration depend on the individual self-selection into networks or teams (Ductor 2015). The self-selection takes place because researchers do not connect or form partnerships with everybody at random. Most of the time, a researcher’s own preferences in terms of discipline and research interest, and other individual characteristics such as personality, are the reason behind membership in different networks. Thus, it is hard to tell whether collaboration is the determinant of productivity or whether some other individual characteristics help some researchers to be part of prolific networks.</p>

<p>A second source of <strong>endogeneity</strong> (the opposite of exogeneity) is called <strong>reverse-causality</strong>. Reverse causality is a real problem in many datasets because there is some feedback mechanism \(Y \rightarrow X\) in which the dependent variable also affects the explanatory variable. A classic example of this issue in economics is present in the functions of supply and demand. The issue is that supply \(S\) varies depending on the selling price \(P\), but simultaneously the price also changes according to demand \(D\). This is an issue of feedback, in which the dependent variable (supply or demand) affects the explanatory variable (price) under equilibrium. This system of equations has a problem of <strong>reverse causality</strong> that is not so straightforward to solve. In graphical form, the problem of reverse causality is represented in the following way:</p>

<center>
<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Reverse Causality.
    <div class="mermaid">
  graph TD
    P[Price] --&gt;  S[Supply]
    subgraph REV-CAUSAL;
    D[Demand] --&gt; P
    classDef red fill:#fdc
    class AN red
    end
    S ---|Equilibrium: =| D
    </div>
</div>

Biased model:

$$S=\beta_1 P+\epsilon$$
<br />

Unbiased model:

$$S=\beta_1 P +\epsilon_1$$
$$D=\beta_2 P +\epsilon_2$$
<br />

</center>
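
<p>A short derivation shows why the single-equation model is biased. This is a sketch under stated assumptions: write the supply shock as \(\epsilon_1\) and the demand shock as \(\epsilon_2\), assume they are uncorrelated with variances \(\sigma_1^2\) and \(\sigma_2^2\), and assume \(\beta_1 \neq \beta_2\). Imposing the equilibrium condition \(S=D\) and solving for the price gives:</p>

<p>$$\beta_1 P+\epsilon_1=\beta_2 P+\epsilon_2 \quad\Rightarrow\quad P=\frac{\epsilon_2-\epsilon_1}{\beta_1-\beta_2}$$</p>

<p>The price therefore contains the supply shock, so it cannot be exogenous in the supply equation:</p>

<p>$$Cov(P,\epsilon_1)=-\frac{\sigma_1^2}{\beta_1-\beta_2}\neq 0$$</p>

<p>Under these assumptions, a regression of \(S\) on \(P\) alone converges to \(\beta_1-\frac{\sigma_1^2(\beta_1-\beta_2)}{\sigma_1^2+\sigma_2^2}\) rather than to \(\beta_1\).</p>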

<p>A similar problem that poses a threat to exogeneity is called <strong>circularity</strong>, and it happens when past realizations of a dependent variable have an impact on its contemporaneous values. There are many examples of this problem in finance and time series econometrics. For instance, in macroeconomic estimations of \(GDP_{t}\) it is crucial to include the previous state of affairs \(GDP_{t-y}\), where \(t&gt;y\) and \(t-y\) stands for a previous period, such that the current or contemporaneous \(GDP\) depends on the state of affairs of the last year. The problem of circularity in graphical form is described as follows:</p>

<center>
<div>
 <script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
    Circularity.
    <div class="mermaid">
  graph TD
    X --&gt;  GDP_t2
    subgraph CIRCULARITY;
    GDP_t1 --&gt; GDP_t2
    classDef red fill:#fdc
    class AN red
    end
    </div>
</div>

Biased model:

$$GDP_{t+1}=\beta_1 X+\epsilon$$
<br />

Unbiased model:

$$GDP_{t+1}=\beta_1 X + \beta_2 GDP_{t} +\epsilon$$

<br />

</center>
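
<p>The bias from leaving out the lag can be written down with the omitted variable formula. As a sketch, assume that the correct model uses the previous value \(GDP_{t}\) as the lagged regressor and that \(X\) is uncorrelated with \(\epsilon\) (both are assumptions made for illustration). Estimating the model without the lag pushes \(\beta_2 GDP_{t}\) into the error term, so the estimate of \(\beta_1\) converges to:</p>

<p>$$\hat{\beta}_1 \rightarrow \beta_1+\beta_2\frac{Cov(X,GDP_{t})}{Var(X)}$$</p>

<p>Under these assumptions, the estimate is free of this bias only when \(\beta_2=0\) or when \(X\) is uncorrelated with past \(GDP\).</p>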

<h2 id="nature-of-the-data-observational-vs-experimental">Nature of the data: Observational vs Experimental</h2>

<p>If we think deeply about the threats to exogeneity discussed in the last section (OVB, self-selection, reverse causality and circularity) we may see a common problem. Indeed, at the heart of the issue of endogeneity (the opposite of exogeneity) lies the problem of confounding factors. A confounding factor, in layman’s terms, is simply a variable that we can’t get our hands on, either because we do not have the data or can’t measure it (OVB), because of a problem of self-selection, or because our dependent variable has a form of feedback (reverse causality or circularity). All the listed problems induce bias and yield an unreliable inference, because at the backbone of the estimation there is a problem of confounding variables. Having confounding factors in an estimation is like cooking with an incomplete recipe, or like having a jigsaw puzzle with missing pieces.</p>

<div style="text-align:center;line-height:150%">
<a href="https://www.zazzle.com/one_missing_puzzle_piece_black_tile-227261345801427234"><img src="https://rlv.zcache.com/one_missing_puzzle_piece_black_tile-rd6c9e5c34175418fa4d8d833eb958540_agtk1_8byvr_1024.jpg?max_dim=325" alt="One Missing Puzzle Piece - Black Tile" style="border:0;" /></a>
<br />
<a href="https://www.zazzle.com/one_missing_puzzle_piece_black_tile-227261345801427234">One Missing Puzzle Piece - Black Tile</a>
<br />by <a href="https://www.zazzle.com/store/flowstonegraphics">FlowstoneGraphics</a>
</div>

<p>This general issue of confounding variables is not easy to solve with the vast majority of the datasets that we can get our hands on. Indeed, the aforementioned threats to exogeneity may persist even in the most tidy and organized data from relational databases (SQL, for instance), surveys or administrative records. Unfortunately, the issue of confounding factors is not solved by increasing the quantity of data at our disposal. Even if we could collect Big Data with millions of records from web scraping algorithms, a large company or a government agency, the problems may persist.</p>

<p>One way to solve the problem of confounding factors is to employ what has become the gold standard in Economics and the Social Sciences: <strong>Randomized Control Trials (RCTs)</strong>. Data that comes from RCTs is called <strong>experimental data</strong> and is different from the data we collect from other sources, generally called <strong>observational data</strong>. An RCT typically has a well-defined causal mechanism that directs the process of data collection so as to eradicate, by design, the problem of confounding variables. Indeed, the power of RCTs derives from their ability to isolate the causal mechanism by virtue of random assignment. In simple terms, the ideal RCT design starts by selecting at least two groups of similar units (individuals, firms, regions). These two groups must be similar in all characteristics, such that any difference between them becomes insignificant on average. The comparison is then “apples to apples” and not “apples to oranges”.</p>

<p><img src="https://noushinn.github.io/experimentation_course/fig/RCT-graphic.png" alt="" />
 <em>Source: <a href="https://noushinn.github.io/experimentation_course/defining-the-problem.html">Initiating an Experiment, Ch.4</a></em></p>

<p>The heart of the RCT is to randomly change the circumstances that surround the causal mechanism in one of the two groups, namely the <strong>treatment-group</strong>. The randomized assignment has two main virtues. First, we eradicate the problem of self-selection by controlling which of the two identical groups receives the treatment. Keep in mind that the treatment embodies the causal mechanism that we are aiming to showcase, \(X \rightarrow Y\). Second, by changing the circumstances randomly, the variable of interest is most likely disconnected from, or unrelated to, any other factor \(Z\) affecting the outcome variable \(Y\) of our research.</p>
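
<p>The two virtues of random assignment can be illustrated with a simulation. This is a minimal sketch under assumed values: the true treatment effect of 2, the latent factor \(Z\) and the selection rule are all hypothetical. In the first scenario units select themselves into the treatment according to \(Z\); in the second scenario the treatment is assigned by a coin flip.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical setup: z is a latent factor, true_effect is the causal effect
set.seed(1)
n &lt;- 20000
true_effect &lt;- 2
z &lt;- rnorm(n)

# Difference in mean outcomes between treated and untreated units
diff_in_means &lt;- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# Scenario 1: self-selection, units with high z are more likely to be treated
d_self &lt;- rbinom(n, 1, plogis(2 * z))
y_self &lt;- true_effect * d_self + 3 * z + rnorm(n)

# Scenario 2: random assignment, treatment is independent of z
d_rand &lt;- rbinom(n, 1, 0.5)
y_rand &lt;- true_effect * d_rand + 3 * z + rnorm(n)

c(self_selected = diff_in_means(y_self, d_self),
  randomized    = diff_in_means(y_rand, d_rand))
</code></pre></div></div>

<p>Under random assignment the difference in means should land close to the assumed effect of 2, whereas under self-selection it also absorbs the difference in \(Z\) between the two groups and overstates the effect.</p>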

<p><!-- The Joshua Angrist, on of the 2021 Nobel Prize winner in his book Mastering Metrics, described the key and simple power fool advantages of RCTs. First, the RCTS --></p>

<h1 id="examples">Examples</h1>

<h2 id="chocolate-consumption-and-noble-laureates">Chocolate consumption and Noble laureates</h2>

<p>The study of <a href="https://www.sciencedirect.com/science/article/pii/S2590291120300711">Aloys Leo Prinz (2020)</a> examines the well-known association between Nobel laureates and chocolate consumption. At first glance, when we compare the consumption of coffee and chocolate with the number of Nobel laureates, there appears to be a positive relationship. Using descriptive statistics, we can assess this easily with a scatter plot or a correlation table.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Plotting, plot layout and data manipulation
library(ggplot2)
library(gridExtra)
library(dplyr)

# Country-level data on consumption and Nobel laureates
choc_lauretes &lt;- readRDS("choc_lauretes.rds")

# Scatter plots with a linear fit, shown side by side
grid.arrange(
  choc_lauretes %&gt;%
    ggplot(aes(cholate_per_cap, no_nobel_lau)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE),
  choc_lauretes %&gt;%
    ggplot(aes(coffee_per_cap, no_nobel_lau)) +
    geom_point() +
    geom_smooth(method = "lm", se = TRUE),
  nrow = 1
)

# Pairwise correlations of columns 2 to 4, using complete cases only
cor(choc_lauretes[, 2:4], use = "complete.obs")
</code></pre></div></div>

<table border="1">
<caption align="top"> Table 1: Correlation matrix </caption>
<tr> <th>  </th> <th> cholate_per_cap </th> <th> coffee_per_cap </th> <th> no_nobel_lau </th>  </tr>
  <tr> <td align="right"> cholate_per_cap </td> <td align="right"> 1.00 </td> <td align="right">  </td> <td align="right">  </td> </tr>
  <tr> <td align="right"> coffee_per_cap </td> <td align="right"> 0.45 </td> <td align="right"> 1.00 </td> <td align="right">  </td> </tr>
  <tr> <td align="right"> no_nobel_lau </td> <td align="right"> 0.17 </td> <td align="right"> -0.12 </td> <td align="right"> 1.00 </td> </tr>
   </table>

<p><br /></p>

<p>The correlation matrix shows that chocolate consumption has a weak positive association with the number of Nobel laureates (0.17), whereas coffee consumption has a weak negative one (-0.12). The correlation of chocolate with the Nobel winners is stronger, in absolute terms, than that of coffee. Looking at the scatter plot, the trend indicates almost no relation between coffee consumption and Nobel winners; however, we can observe a positive trend between chocolate consumption and Nobel Prize winners. Does chocolate consumption cause people to become smarter?</p>

<p><br /></p>

<p><img src="https://github.com/Wario84/idsc_mgs/raw/master/assets/imgs/choc_coffee.svg?raw=true" alt="" /></p>

<p>As you probably suspect, a simple correlation and a trend analysis are not enough to show a causal link between chocolate and human cognition. But what is missing?</p>

<ul>
  <li>Causal Mechanism</li>
</ul>

<p>In fact, the study of <a href="https://www.sciencedirect.com/science/article/pii/S2590291120300711">Aloys Leo Prinz (2020)</a> provides a compelling theory, which claims that flavonoids and caffeine have a positive effect on cognition and on the dopaminergic reward system of the human brain. However, the paper does not explain how flavonoids and caffeine interact with particular areas of the brain to yield that effect. Moreover, the empirical study does not provide any biological evidence supporting that claim.</p>

<ul>
  <li>Data</li>
</ul>

<p>The study uses observational data and does not solve the problems of self-selection and confounding variables. Moreover, the unit of observation (countries) is quite disconnected from the unit of analysis (Nobel laureates). That is, the study attempts to describe a causal mechanism that occurs at the micro level, namely in the brains of Nobel Prize winners. In other words, the author uses macro data at the country level to draw conclusions about the brains of researchers.</p>

<ul>
  <li>Endogeneity</li>
</ul>

<p>The study does not control for important confounding variables such as the natural ability and the level of education of individuals. There is also no account of motivation or of the number of weekly hours that researchers invest in their work. The lack of these controls casts doubt on the estimation, because they are important determinants of research productivity. Furthermore, the data has a problem of self-selection, because individuals choose whether to consume chocolate or coffee. Hence, we cannot observe the outcome of individuals with similar characteristics who do not consume coffee or chocolate (the control or counterfactual group).</p>

<h2 id="social-norms-and-energy-consumption">Social Norms and Energy Consumption.</h2>

<p>The study of <a href="https://journals.sagepub.com/doi/10.1111/j.1467-9280.2007.01917.x">Schultz, Nolan, et al. (2007)</a> conducts an experiment on 290 households in San Marcos, CA, USA. The experiment was designed to analyze the effect of two different kinds of social norms. One group was treated with an intervention that induces a “descriptive norm”: the group was given information on its energy consumption compared to the average consumption of the neighborhood. The average consumption implicitly carries a social norm, given that individuals tend to conform to the behavior of their peers. The second group was treated with another norm, called the “injunctive norm”, that embodies perceptions of what is commonly considered right or wrong in a given situation. The core of the analysis is then to measure the effect of the two norms by comparing behavior before and after the treatment.</p>

<ul>
  <li>Causal Mechanism</li>
</ul>

<p>The study uses the theoretical framework of “Focus theory”, which predicts that if only one of the two types of norms is prominent in an individual’s consciousness, it will exert the stronger influence on behavior <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev.psych.55.090902.142015?casa_token=ulIwMSo3LUQAAAAA:wRxVgEx3GoCzpwZv_2AQIWnVcLoKBs9ii-0fLxK1-8w86QqjIAarB5zkmGvyZiZXp0ISjMZkEY8F9g">(Cialdini &amp; Goldstein, 2004)</a>. The theory thus predicts that the group treated with a “descriptive norm” will increase or decrease its energy consumption towards the mean. In contrast, the group treated with an “injunctive norm” should change its behavior only if it receives a negative signal (a sad face) when its consumption is above the mean, but not the other way around (boomerang effect).</p>

<ul>
  <li>Data</li>
</ul>

<p>The study uses experimental data because the treatment (social norm) is allocated randomly. Experimental data has the advantage of removing the problem of self-selection. In addition, the variable of interest \(X\), the social norm, is, by virtue of the random assignment, uncorrelated with other determinants \(Z\) of the energy consumption \(Y\).</p>

<ul>
  <li>Endogeneity</li>
</ul>

<p>The study derives its conclusions only from a difference in means, and does not assess the effects of other factors that might drive the change of behavior, for instance unemployment during the period of observation, or absence from the household due to holidays or work. Further, it is not clear that the two groups were completely isolated from one another. The causal mechanism depends on the prominence of one of the norms in the mind of the individuals. However, neighbors are well known to communicate and interact with one another; hence, it is not unlikely that a norm affected more than one household.</p>

<h1 id="references">References</h1>

<div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-1992Blaug10.1017/CBO9780511528224" class="csl-entry">
Blaug, Mark. 1992. <em>The Methodology of Economics: Or, How Economists
Explain</em>. 2nd ed. Cambridge Surveys of Economic Literature.
Cambridge University Press. <a href="https://doi.org/10.1017/CBO9780511528224">https://doi.org/10.1017/CBO9780511528224</a>.
</div>
<div id="ref-2017Calude10.1007/s1069901694894" class="csl-entry">
Calude, Cristian S., and Giuseppe Longo. 2017. <span>"The Deluge of
Spurious Correlations in Big Data."</span> <em>Foundations of
Science</em> 22 (3): 595-612. <a href="https://doi.org/10.1007/s10699-016-9489-4">https://doi.org/10.1007/s10699-016-9489-4</a>.
</div>
<div id="ref-2004Cialdini10.1146/annurev.psych.55.090902.142015" class="csl-entry">
Cialdini, Robert B., and Noah J. Goldstein. 2004. <span>"Social
Influence: Compliance and Conformity."</span> <em>Annual Review of
Psychology</em> 55 (1): 591-621. <a href="https://doi.org/10.1146/annurev.psych.55.090902.142015">https://doi.org/10.1146/annurev.psych.55.090902.142015</a>.
</div>
<div id="ref-2015Ductorhttps//doi.org/10.1111/obes.12070" class="csl-entry">
Ductor, Lorenzo. 2015. <span>"Does Co-Authorship Lead to Higher Academic
Productivity?"</span> <em>Oxford Bulletin of Economics and
Statistics</em> 77 (3): 385-407. <a href="https://doi.org/10.1111/obes.12070">https://doi.org/10.1111/obes.12070</a>.
</div>
<div id="ref-2022GonzalezSauri10.1007/s11162022096797" class="csl-entry">
Gonzalez-Sauri, Mario, and Giulia Rossello. 2022. <span>"The Role of
Early-Career University Prestige Stratification on the Future Academic
Performance of Scholars."</span> <em>Research in Higher Education</em>,
April. <a href="https://doi.org/10.1007/s11162-022-09679-7">https://doi.org/10.1007/s11162-022-09679-7</a>.
</div>
<div id="ref-2018Maass10.17705/1jais.00526" class="csl-entry">
Maass, Wolfgang, Jeffrey Parsons, Sandeep Purao, Veda C Storey, and
Carson Woo. 2018. <span>"Data-Driven Meets Theory-Driven Research in the
Era of Big Data: Opportunities and Challenges for Information Systems
Research."</span> <em>Journal of the Association for Information
Systems</em> 19 (12): 1. <a href="https://doi.org/10.17705/1jais.00526">https://doi.org/10.17705/1jais.00526</a>.
</div>
<div id="ref-2020Nabavi" class="csl-entry">
Nabavi, Noushin. 2020. <em><span class="nocase">Chapter 4 Defining the
problem</span></em>. <a href="https://noushinn.github.io/experimentation_course/defining-the-problem.html">https://noushinn.github.io/experimentation_course/defining-the-problem.html</a>.
</div>
<div id="ref-2020Prinzhttps//doi.org/10.1016/j.ssaho.2020.100082" class="csl-entry">
Prinz, Aloys Leo. 2020. <span>"Chocolate Consumption and Noble
Laureates."</span> <em>Social Sciences &amp; Humanities Open</em>. <a href="https://doi.org/10.1016/j.ssaho.2020.100082">https://doi.org/10.1016/j.ssaho.2020.100082</a>.
</div>
<div id="ref-2007Schultz10.1111/j.14679280.2007.01917.x" class="csl-entry">
Schultz, P. Wesley, Jessica M. Nolan, Robert B. Cialdini, Noah J.
Goldstein, and Vladas Griskevicius. 2007. <span>"The Constructive,
Destructive, and Reconstructive Power of Social Norms."</span>
<em>Psychological Science</em> 18 (5): 429-34. <a href="https://doi.org/10.1111/j.1467-9280.2007.01917.x">https://doi.org/10.1111/j.1467-9280.2007.01917.x</a>.
</div>
<div id="ref-2003Stock" class="csl-entry">
Stock, James, and Mark W. Watson. 2003. <em>Introduction to
Econometrics</em>. New York: Prentice Hall.
</div>
<div id="ref-2022THE" class="csl-entry">
Times Higher Education. 2022. <span>"<span>World University
Rankings</span>."</span> <a href="https://www.timeshighereducation.com/world-university-rankings">https://www.timeshighereducation.com/world-university-rankings</a>.
</div>
</div>
]]></content><author><name>Mario H. Gonzalez-Sauri</name></author><summary type="html"><![CDATA[Introduction]]></summary></entry></feed>