R and Python for determining dataframe equality in ETL

Greetings Domo Community,

I have recently been refactoring Magic ETL tiles and have been using Python and R scripts to test for equality before finalizing the refactored tile components. With both Python and R I have had significant difficulties completing this task, which I will outline here.

Python:

I began by using domomagic to load both datasets into the Python tile and used Pandas for the comparison. I sort all of the rows by their column values and reset the index. I print the shapes to ensure they are the same (in each case they were), and then I print the column names and test their equality.

Here comes the issue. I do an outer join to find what exists only in the left dataframe and what exists only in the right one. I additionally call assert_frame_equal from pandas.testing and test equality with .equals(). I consistently find that the left-only and right-only dataframes look exactly equal upon inspection, assert_frame_equal never raises (and I have seen it raise before), and .equals() returns False, so the checks completely contradict each other. I have gone through every debugging technique I can think of and have not made any progress. It is worth noting that even though the rows are sorted and the index is reset before testing equality, the left-only and right-only dataframes show different indexes when printed.

If anyone has any suggestions or insights about the internals of Python in an ETL, I would be grateful for your feedback. The Python code is attached below. Please continue reading for the Ruby issue.

# Import the domomagic package into the script
from domomagic import *
import pandas as pd
from pandas.testing import assert_frame_equal
import numpy as np

pd.set_option('display.max_columns', None)

# read data from inputs into a data frame
input1 = read_dataframe('Total Formulas')
input2 = read_dataframe('Address Nulls from SUM joins 4')

input1 = input1.sort_values(by=input1.columns.tolist()).reset_index(drop=True)
input2 = input2.sort_values(by=input2.columns.tolist()).reset_index(drop=True)

print("Data frame 1 Shape: " + str(input1.shape))
print("Data frame 2 Shape: " + str(input2.shape))
print("Dataframe 1 Columns: " + str(input1.columns))
print("Dataframe 2 Columns: " + str(input2.columns))
print(input1.columns == input2.columns)

dfLeft = input1.merge(input2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
dfRight = input1.merge(input2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'right_only']

print("\n\n\n\n")
print("Left only shape")
print(dfLeft.shape)
print(dfLeft)

print("\n\n\n\n")
print("Right only shape")
print(dfRight.shape)
print(dfRight)

print("\n\n\n\n")
assert_frame_equal(input1, input2, check_dtype=True)
print(input1.info())
print(input2.info())
print(input1.equals(input2))

# This additionally returns False, even being exactly the same upon inspection.
print(dfLeft.equals(dfRight))
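
One possible explanation for the contradiction, offered as a sketch and not as a diagnosis of this particular data: assert_frame_equal compares floats within a tolerance by default, while .equals() and merge require exact values. Two floats that print identically can therefore pass the first check and fail the other two. Separately, dfLeft.equals(dfRight) cannot return True when both frames are non-empty, because their _merge columns hold different labels ('left_only' versus 'right_only') and their index labels come from different rows of the merged frame. The frames and values below are made up for illustration, and strip_merge is a hypothetical helper.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical data: 0.1 + 0.2 is 0.30000000000000004, which prints as 0.3
left = pd.DataFrame({"Project": ["A", "B"], "Amount": [0.3, 1.5]})
right = pd.DataFrame({"Project": ["A", "B"], "Amount": [0.1 + 0.2, 1.5]})

# Tolerant comparison: does not raise, the gap is far below the default tolerance
assert_frame_equal(left, right)

# Exact comparison: False, because one float differs in its last bits
exact_equal = left.equals(right)

# The outer join also matches on exact values, so row "A" appears twice
merged = left.merge(right, how="outer", indicator=True)
left_only = merged[merged["_merge"] == "left_only"]
right_only = merged[merged["_merge"] == "right_only"]


def strip_merge(df):
    """Drop the indicator column and the merged index before comparing."""
    return df.drop(columns="_merge").reset_index(drop=True)


# Size of the difference that the printed output hides
gap = (strip_merge(left_only)["Amount"] - strip_merge(right_only)["Amount"]).abs().max()

print(exact_equal)
print(left_only)
print(right_only)
print(gap)
```

If this is what is happening, printing the difference of the numeric columns (as with gap above) or raising the display precision should reveal it, and rounding both inputs to a fixed number of decimals before the merge is one way to make the exact checks agree.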

Ruby:

Now for the Ruby issue. Keep in mind that I use this script with the exact same two datasets that I use with Python. This script works far better. However, the anti_join call, which just tells what rows are in right and not in left (and vice versa), has for me printed at most 1 row on any sample run with large input.

On this specific sample run, it says that there is one row that is only in the left and one that is only in the right. However, upon manual inspection, these rows are present in both and are exactly the same. Furthermore, the left-only row is the first row of the left-hand dataset and the right-only row is the second row. For the right-hand dataset it is the exact same thing reversed: left-only is row 2 and right-only is row 1. all.equal does not report the two dataframes as equal.

My debugging process began by making sure that the column names were read correctly by R, since it might have been reading each first row as the column header, which would make sense. After printing the column headers, this was not the case. As for sorting, even if the sorting was done wrong in the R code, many of the other rows (which I verified exist in both) are not in any specific order on input to the R tile. It is worth noting that the data was preceded by a grouping on Project and FY Period, so there is no reason there would be any duplicates.

Final remarks: there are numerous cases with different input sizes where Python says True and R says false, with nothing more than their respective equality calls. However, I have found R to be far more accurate on manual inspection.

# Import the domomagic library into the script.
library('domomagic')
library('dplyr')

# read data from inputs into a data frame
# Theirs
input1 <- read.dataframe('Total Formulas')
# Mine
input2 <- read.dataframe('Address Nulls from SUM joins 4')

input1 <- input1[order("Project", "FY Period"),]
#input1 <- arrange_all(input1)
input2 <- input2[order("Project", "FY Period"),]
#input2 <- arrange_all(input2)

print(all.equal(input1,input2))

Diff <- anti_join(input1, input2, by = c("Project", "FY Period"))
print(Diff)

Diff1 <- anti_join(input2, input1, by = c("Project", "FY Period"))
print(Diff1)

# write a data frame so it's available to the next action
write.dataframe(input1)

I appreciate any help. Thanks, Will

bdavis

I'm not sure if you're trying to determine whether they have the same number of columns and the same names, or whether the data needs to be exactly the same in both dataframes. A sample of your data and output would help a bunch as well. Also R is not Ruby, they're different languages. ;)

Now, I don't use R in Domo, but I use it outside of Domo. In base R you can use identical() to determine if two dataframes are exactly the same.

identical(x, y, num.eq = TRUE, single.NA = TRUE, attrib.as.set = TRUE,
          ignore.bytecode = TRUE, ignore.environment = FALSE,
          ignore.srcref = TRUE)
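
Before comparing with identical() or all.equal(), the sort step in the R script in the question may deserve a second look. order("Project", "FY Period") receives two string literals instead of the columns themselves, so it returns 1 and the subset keeps only the first row of each input, which would be consistent with anti_join printing at most one row. The same sort-then-compare idea is sketched here in pandas, with the keys passed as column names. frames_match is a hypothetical helper, the sample data is invented, and the rtol argument assumes a pandas version that supports it (1.1 or later).

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right, keys, rtol=1e-9):
    """Return True if both frames hold the same rows, ignoring row order.

    Assumes `keys` uniquely identify a row, as a prior group-by would ensure.
    """
    if list(left.columns) != list(right.columns):
        return False
    # Sort by the key columns themselves, then discard the old index labels
    a = left.sort_values(by=keys).reset_index(drop=True)
    b = right.sort_values(by=keys).reset_index(drop=True)
    try:
        # Floats are compared within rtol; other values must match exactly
        assert_frame_equal(a, b, check_exact=False, rtol=rtol)
    except AssertionError:
        return False
    return True


# Invented sample data: same rows, different order
theirs = pd.DataFrame(
    {"Project": ["P2", "P1"], "FY Period": [1, 1], "Total": [20.0, 10.0]}
)
mine = pd.DataFrame(
    {"Project": ["P1", "P2"], "FY Period": [1, 1], "Total": [10.0, 20.0]}
)
keys = ["Project", "FY Period"]

same = frames_match(theirs, mine, keys)
different = frames_match(theirs, mine.assign(Total=[10.0, 21.0]), keys)

print(same)       # True
print(different)  # False
```

Sorting only by the grouping keys keeps the row order independent of any float noise in the value columns, which is why the helper takes keys explicitly.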
