How to filter column by a specific language
Hi,
I’m currently working on a dataset that contains a “Page Title” column with rows in both English and other languages. I want to set up a filter to exclude all non-English titles. I use the following REGEX formula:
CASE
WHEN `Page Title` REGEXP '^[A-Za-z0-9 .,!?-]*$' THEN 'English'
ELSE 'Other'
END
This formula flags titles that contain non-Latin scripts or special characters so I can filter them out. However, a few Portuguese and Spanish titles are still present.
Is there a way to exclude all non-English titles using a SQL query or a more advanced Beast Mode formula than what I’m currently using?
Thanks!
Best Answers
-
If you are trying to capture other titles that include accented characters, the following might work.
CASE
WHEN `Page Title` REGEXP '^[A-Za-zÀ-ÿ0-9 .,!?-]*$' THEN 'English'
ELSE 'Other'
END
This would cover most accented Latin characters used in European languages. The character class allows:
Basic Latin alphabet characters (A-Za-z)
Accented characters used in European languages (À-ÿ)
Numbers (0-9)
Common punctuation (.,!?-)
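To see what each character class accepts, here is a minimal sketch using Python's `re` module rather than Beast Mode. The two patterns are copied from the formulas in this thread; the sample titles and the `classify` helper are made up for illustration, and Beast Mode's regex engine may differ in details.

```python
import re

# Character classes copied from the two formulas in this thread.
ASCII_ONLY = re.compile(r"^[A-Za-z0-9 .,!?-]*$")
WITH_ACCENTS = re.compile(r"^[A-Za-zÀ-ÿ0-9 .,!?-]*$")

# Hypothetical sample titles, invented for this demo.
titles = [
    "How to install the product",  # English
    "Guía de instalación",         # Spanish, accented letters
    "Como instalar o produto",     # Portuguese, plain ASCII letters only
    "製品のインストール",             # Japanese
]

def classify(title, pattern):
    """Mimic the CASE expression: 'English' if every character is allowed."""
    return "English" if pattern.match(title) else "Other"

# One row per title: (title, label with ASCII class, label with accented class).
results = [(t, classify(t, ASCII_ONLY), classify(t, WITH_ACCENTS)) for t in titles]
for row in results:
    print(row)
```

Two things stand out. The accented class labels the Spanish title as 'English', so if the goal is to exclude such titles, the ASCII-only class is the stricter choice. And the Portuguese title written with plain letters is labeled 'English' by both patterns, because a character class checks which characters appear, not which language they spell. Note also that the range À-ÿ spans U+00C0 to U+00FF, which includes the symbols × and ÷.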
-
Hello @Mansa_TCC,
If a Portuguese or Spanish title contains only standard English letters, the regex cannot tell it apart from an English title, which might be why those titles still get through.
This can be done with Domo Jupyter, if you have access to that feature.
Here is the script:
# Run once in its own notebook cell to install the package
%pip install langdetect

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# Function to detect the language of a single title
def detect_language(title):
    try:
        return detect(title)
    except LangDetectException:
        # Raised when the text has nothing to analyse (e.g. empty or digits only)
        return 'Unknown'

# Apply the function to create the 'Language' column
sample['Language'] = sample['title'].apply(detect_language)

# Display the DataFrame with the new column
sample
The result is the DataFrame with the new 'Language' column. You can then write it back to the dataset.
I can't think of any other solution. Unfortunately, I don't see 'langdetect' among the packages available to Python in ETL, so you can't do it from there. If you need, I can help you with the rest of the steps.
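Once each title has a language label, the remaining step is to keep only the English rows. The sketch below shows that filtering step in plain Python. The names `english_mask` and `fake_detect` are hypothetical; `fake_detect` is only a stand-in so the example runs without installing anything, and it is not a real language detector.

```python
from typing import Callable, Iterable, List

def english_mask(titles: Iterable[str],
                 detect_fn: Callable[[str], str],
                 target: str = "en") -> List[bool]:
    """Return True for each title that detect_fn labels as the target language."""
    mask = []
    for title in titles:
        try:
            mask.append(detect_fn(title) == target)
        except Exception:
            # Detectors can fail on empty or non-text input; treat those as non-English.
            mask.append(False)
    return mask

def fake_detect(title: str) -> str:
    """Stand-in detector for this demo only; NOT a real language model."""
    if not title.strip():
        raise ValueError("no text to analyse")
    return "pt" if "produto" in title.lower() else "en"

# Hypothetical sample titles, invented for this demo.
titles = ["How to install the product", "Como instalar o produto", ""]
mask = english_mask(titles, fake_detect)
english_titles = [t for t, keep in zip(titles, mask) if keep]
print(english_titles)
```

With langdetect installed, passing its `detect` function as `detect_fn` should work the same way, and with the `sample` DataFrame from the script, something like `sample[english_mask(sample['title'], detect)]` should keep only the rows labeled `en`. Two caveats worth knowing: detection on very short titles tends to be unreliable, and langdetect is non-deterministic by default, so its README suggests setting `DetectorFactory.seed = 0` before the first call for repeatable results.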