
Matomo

Need help deploying these sources, or figuring out how to run them in your data stack?
Join our Slack community or book a call with a dltHub Solutions Engineer.

Matomo is a free and open-source web analytics platform that provides detailed insights into website and application performance with features like visitor maps, site search analytics, real-time visitor tracking, and custom reports.

This Matomo dlt verified source and pipeline example loads data using the Matomo API to the destination of your choice.

The endpoints that this verified source supports are:

Name              Description
matomo_reports    Detailed analytics summaries of website traffic, visitor behavior, and more
matomo_visits     Individual user sessions on your website, pages viewed, visit duration, and more

Setup Guide

Grab credentials

  1. Sign in to Matomo.
  2. Click the Administration (⚙) icon in the top right.
  3. Navigate to "Personal > Security" on the left menu.
  4. Find and select "Auth Tokens > Create a New Token."
  5. Verify with your password.
  6. Add a descriptive label for your new token.
  7. Click "Create New Token."
  8. Your token is displayed.
  9. Copy the access token and update it in the .dlt/secrets.toml file.
  10. Your Matomo URL is the web address in your browser when logged into Matomo, typically "https://mycompany.matomo.cloud/". Update it in the .dlt/config.toml.
  11. The site_id is a unique ID for each monitored site in Matomo, found in the URL or via Administration > Measurables > Manage under ID.

Note: The Matomo UI described here may change. The full guide is available at this link.

Initialize the verified source

To get started with your data pipeline, follow these steps:

  1. Enter the following command:

    dlt init matomo duckdb

    This command will initialize the pipeline example with Matomo as the source and duckdb as the destination.

  2. If you'd like to use a different destination, simply replace duckdb with the name of your preferred destination.

  3. After running this command, a new directory will be created with the necessary files and configuration settings to get started.

For more information, read the guide on how to add a verified source.

Add credentials

  1. Inside the .dlt folder, you'll find a file called secrets.toml, which is where you can securely store your access tokens and other sensitive information. It's important to handle this file with care and keep it safe. Here's what the file looks like:

    # put your secret values and credentials here
    # do not share this file and do not push it to github
    [sources.matomo]
    api_token= "access_token" # please set me up!"
  2. Replace the api_token value with the previously copied one to ensure secure access to your Matomo resources.

  3. Next, follow the destination documentation instructions to add credentials for your chosen destination, ensuring proper routing of your data to the final destination.

  4. Next, store your pipeline configuration details in the .dlt/config.toml.

    Here's what the config.toml looks like:

    [sources.matomo]
    url = "Please set me up !" # please set me up!
    queries = ["a", "b", "c"] # please set me up!
    site_id = 0 # please set me up!
    live_events_site_id = 0 # please set me up!
  5. Replace the values of url and site_id with the ones you copied above.

  6. To monitor live events on a website, set live_events_site_id (usually the same as site_id). Alternatively, these values can be passed directly in Python, as sketched below.
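As an alternative to the TOML files, the same values can be passed directly to the source functions in Python. A minimal sketch, assuming the matomo folder created by dlt init and using placeholder values; prefer keeping the real token in .dlt/secrets.toml:

    import dlt
    from matomo import matomo_reports, matomo_visits  # assumes the folder created by "dlt init matomo <destination>"

    # Placeholder values for illustration only
    reports = matomo_reports(
        api_token="access_token",
        url="https://mycompany.matomo.cloud/",
        site_id=1,  # queries is omitted here, so it is read from .dlt/config.toml
    )
    visits = matomo_visits(
        api_token="access_token",
        url="https://mycompany.matomo.cloud/",
        live_events_site_id=1,
    )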

For more information, read General Usage: Credentials.

Run the pipeline

  1. Before running the pipeline, ensure that you have installed all the necessary dependencies by running the command:
    pip install -r requirements.txt
  2. You're now ready to run the pipeline! To get started, run the following command:
    python matomo_pipeline.py
  3. Once the pipeline has finished running, you can verify that everything loaded correctly by using the following command:
    dlt pipeline <pipeline_name> show
    For example, the pipeline_name for the above pipeline example is matomo; you may also use any custom name instead.

For more information, read the guide on how to run a pipeline.

Sources and resources

dlt works on the principle of sources and resources.

Source matomo_reports

This function executes and loads a set of reports defined in "queries" for a specific Matomo site identified by "site_id".

@dlt.source(max_table_nesting=2)
def matomo_reports(
    api_token: str = dlt.secrets.value,
    url: str = dlt.config.value,
    queries: List[DictStrAny] = dlt.config.value,
    site_id: int = dlt.config.value,
) -> Iterable[DltResource]:
    ...

api_token: API access token for Matomo server authentication, defaults to the value in ".dlt/secrets.toml"

url: Matomo server URL, defaults to the value in ".dlt/config.toml"

queries: List of dictionaries specifying what data to retrieve from the Matomo API (see the example below).

site_id: Website's Site ID as per Matomo account.

Note: This is an incremental source method and loads the "last_date" from the state of the last pipeline run.
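For reference, here is one possible shape of a single queries entry, following the structure used in the customization examples later in this guide. All values are illustrative; VisitsSummary.get is one of Matomo's standard reporting API methods:

    # Illustrative only: one entry in "queries" and how it is passed to the source
    queries = [
        {
            "resource_name": "visits_summary",  # name of the resource/table created for this report
            "methods": ["VisitsSummary.get"],  # Matomo reporting API method(s) to call
            "date": "2023-01-01",  # start date of the report
            "period": "day",  # report granularity (day, week, month, ...)
            "extra_params": {},  # any additional Matomo API parameters
        }
    ]

    data_reports = matomo_reports(queries=queries, site_id=1)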

Source matomo_visits

On the first run, this function loads visits from the current day and the past initial_load_past_days days. On subsequent runs, it continues from the last load and skips visits that are still active until they close.

def matomo_visits(
    api_token: str = dlt.secrets.value,
    url: str = dlt.config.value,
    live_events_site_id: int = dlt.config.value,
    initial_load_past_days: int = 10,
    visit_timeout_seconds: int = 1800,
    visit_max_duration_seconds: int = 3600,
    get_live_event_visitors: bool = False,
) -> List[DltResource]:
    ...

api_token: API token for authentication, defaulting to ".dlt/secrets.toml".

url: Matomo server URL, defaulting to ".dlt/config.toml".

live_events_site_id: Website ID for live events.

initial_load_past_days: Days to load initially, defaulting to 10.

visit_timeout_seconds: Session timeout (in seconds) before a visit closes, defaulting to 1800.

visit_max_duration_seconds: Max visit duration (in seconds) before a visit closes, defaulting to 3600.

get_live_event_visitors: Retrieve unique visitor data, defaulting to False.

Note: This is an incremental source method and loads the "last_date" from the state of the last pipeline run.
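The session-handling parameters can be tuned when the source is created. A minimal sketch with illustrative values:

    # Illustrative values: shorter session timeout and a one-week initial backfill
    data_visits = matomo_visits(
        initial_load_past_days=7,  # backfill the last 7 days on the first run
        visit_timeout_seconds=900,  # treat a visit as closed after 15 minutes of inactivity
        visit_max_duration_seconds=3600,  # never keep a visit open longer than one hour
        get_live_event_visitors=True,  # also load unique visitor data via the "visitors" transformer
    )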

Resource get_last_visits

This function retrieves site visits within a specified timeframe. If a start date is given, it begins from that date. If not, it retrieves all visits up until now.

@dlt.resource(
    name="visits", write_disposition="append", primary_key="idVisit", selected=True
)
def get_last_visits(
    client: MatomoAPIClient,
    site_id: int,
    last_date: dlt.sources.incremental[float],
    visit_timeout_seconds: int = 1800,
    visit_max_duration_seconds: int = 3600,
    rows_per_page: int = 2000,
) -> Iterator[TDataItem]:
    ...

site_id: Unique ID for each Matomo site.

last_date: Date of the last resource load, if it exists.

visit_timeout_seconds: Time (in seconds) until a session is inactive and deemed closed. Default: 1800.

visit_max_duration_seconds: Maximum duration (in seconds) of a visit before closure. Default: 3600.

rows_per_page: Number of rows fetched per page. Default: 2000.

Note: This is an incremental resource method and loads the "last_date" from the state of the last pipeline run.
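For background, this is roughly how dlt.sources.incremental keeps a cursor between runs. The resource below is a generic, illustrative sketch (fetch_rows is a hypothetical helper), not the actual Matomo resource:

    import dlt

    def fetch_rows(since: float):
        # Hypothetical helper standing in for a real API call; returns rows newer than "since"
        return [{"id": 1, "timestamp": since + 1.0}]

    @dlt.resource(write_disposition="append", primary_key="id")
    def events(
        # dlt stores the highest "timestamp" seen so far in the pipeline state and restores it on the next run
        last_timestamp=dlt.sources.incremental("timestamp", initial_value=0.0),
    ):
        # Only yield rows newer than the cursor restored from the previous run
        yield from fetch_rows(since=last_timestamp.last_value)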

Transformer visitors

This function retrieves unique visitor information from get_last_visits.

@dlt.transformer(
    data_from=get_last_visits,
    write_disposition="merge",
    name="visitors",
    primary_key="visitorId",
)
def get_unique_visitors(
    visits: List[DictStrAny], client: MatomoAPIClient, site_id: int
) -> Iterator[TDataItem]:
    ...

visits: Recent visit data within the specified timeframe.

client: Interface for Matomo API calls.

site_id: Unique Matomo site identifier.

Customization

Create your own pipeline

If you wish to create your own pipelines, you can leverage source and resource methods from this verified source.

  1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows:

    pipeline = dlt.pipeline(
        pipeline_name="matomo",  # Use a custom name if desired
        destination="duckdb",  # Choose the appropriate destination (e.g., duckdb, redshift, postgres)
        dataset_name="matomo_data",  # Use a custom name if desired
    )

    To read more about pipeline configuration, please refer to our documentation.

  2. To load the data from reports:

    data_reports = matomo_reports()
    load_info = pipeline.run(data_reports)
    print(load_info)

    "site_id" defined in ".dlt/config.toml"

  3. To load custom data from reports using queries:

    queries = [
        {
            "resource_name": "custom_report_name",
            "methods": ["CustomReports.getCustomReport"],
            "date": "2023-01-01",
            "period": "day",
            "extra_params": {"idCustomReport": 1},  # ID of the custom report
        },
    ]

    site_id = 1  # ID of the site for which reports are being loaded

    load_data = matomo_reports(queries=queries, site_id=site_id)
    load_info = pipeline.run(load_data)
    print(load_info)

    You can pass queries and site_id in the ".dlt/config.toml" as well.

  4. To load data from reports and visits:

    data_reports = matomo_reports()
    data_events = matomo_visits()
    load_info = pipeline.run([data_reports, data_events])
    print(load_info)
  5. To load data on live visits and visitors, retrieving data only from today:

    load_data = matomo_visits(initial_load_past_days=1, get_live_event_visitors=True)
    load_info = pipeline.run(load_data)
    print(load_info)

