
Making a Report Bilingual: The full story (Part 1)

Hello hello,

It’s been quite a few months since I last published. As I was going down the rabbit hole at work I thought it would be a good idea to blog my way through the tunnel. It turns out it was not. The sheer complexity of the task and some big gotchas rendered the approach unfeasible.

Now that the report is in production, it is time I try to write up the approach in a way that makes sense.

No it’s not about metadata translation

First things first. Many people may think, «man, that’s a long blog post just to talk about the fact that you can have more than one culture in your model». I’m not here to talk about model metadata translation, I’m here to talk about *report* translation. In a report, actual model metadata is barely visible, if visible at all.

To make my point, using the Metadata Translator tool, which I learned about from this awesome post by Marc Lelijveld, I added the Catalan translation to the model and it did a fair job.

Then I created a rather simple report, using a few measures, a couple of field parameters and a few visuals. I was *very* careful not to change any of the names once I dragged them into the visual objects. And then I tried to change the display language and see the effects:

I highlighted the small bits that are actually translated, because they are just parts of the model that are visible in the report. It’s obvious, but let’s say it: this approach is very limited, because the names used in the model may not be the ones you want in the report (impossible to tell order date apart from delivery date, for example) and, even more important, very relevant bits of the report are not translated at all because they are data, not model metadata. And last but not least, there are elements that have nothing to do with the model, like the title of the report, that might also need translation.

If you think this solution is poor, in the real world it is even worse. The reason is that in a complex report you will have many measures with long names, and these are almost always modified when dragged into visual objects. Titles and subtitles will be modified too (and they should be!), so when you change the display language while watching a report, you will see almost no change at all.

The takeaway here is that the moment you modify the name of a field in a visual object, that name is not going to change, no matter how many translations you have in your model or which display language the user chooses. The report is now better because the titles are easier to understand, but we see no Catalan words at all when Catalan is the display language.

The real deal:

If your customer wants to see the report translated in the real meaning of the word, you need to think about everything. At least everything that is technically possible. And you want to do it in a way that is scalable and maintainable. The idea is to extract metadata from the report to get all the titles, display names and so on, as well as the fields actually used in each visual. For the data we’ll create mapping tables saying which field needs to be used in each language, for each visual, and then we’ll bring it all back to the semantic model with a lot (A LOT) of field parameters and a few measures. But as Takeshi Kagata, an MVP from Japan, told me while I was rambling through my rehearsal of a presentation in Japanese on calc groups: «don’t tell me, show me» (I ditched the PowerPoint and did everything in Power BI Desktop, best idea ever). Anyway, here’s the thing moving.

The main caveat at this point is that field parameters are still in preview, and one of the remaining gaps is that visuals do not remember the expansion state of each field. So when you swap new fields into the column chart, it will always collapse all levels, which sucks a bit, but apparently it’s somewhere in the backlog, so some day this will not happen. Another caveat that may not go away, though, is that not all visual objects allow field parameters. If you look closely you will see that in the video I’m no longer using the new KPI card and I’m using a matrix instead; I’m not sure why it’s like this. There should be an «I know what I’m doing» setting that would let you use field parameters anywhere, as long as you put measures where it expects measures and columns where it expects columns, and if you mess up you get the gray box of death. But that setting is not there, so we need to get creative. I also saw that the «All» part is indeed shown in the display language, so I guess it’s not all lost after all.

Now that we know the destination, let’s build the path to get there!

Scope:

On the report metadata side I will not attempt to cover all the possible use cases. The recent improvements in visuals, such as reference labels in KPI cards, bar charts and the like, pose quite a challenge that might not be worth the extra effort. Since I did not have this requirement in my report, we’ll let it be for the time being. Thus we’ll cover fields used in the data wells (is that the right term?) but not fields used in formatting and so on. As for text, we’ll cover titles, subtitles and text boxes. That should be enough. With a little more stretch you might be able to include the slicer header (which is not the title). And I hope you are not using custom axis titles, because they can’t be dynamic. Also, the label for Total can’t be tied to a measure for now, so hopefully you don’t have a requirement there.

For the data itself we’ll cover the challenges I found myself. Not all fields require translation; sometimes translation is available in the source data, sometimes it is not. We’ll be covering a rather «easy» use case, English to Catalan (beautiful language!). I mean it’s easy because we will not need to translate employee names, for example. Well, I said translate, but I meant transliterate, something that you will need to do if you are translating from, say, Japanese to English, or Arabic to Chinese. You get the idea.

Technology:

I have developed the whole solution with Dataflows Gen 2. I know first hand that this is not the best approach, as CU consumption is pretty high. But first build something that works, then improve! The beauty of Fabric is that you can then swap little pieces, replacing one dataflow with a notebook if you want. To read from the report I «export» the pbix to a pbip folder with PBIR enabled, as I’m still too shy to go all in on this format (and we do need deployment pipelines, one of its remaining limitations), but it’s easy to imagine that we could read from the git folder storing all the report files once we get this part going. But let’s not get ahead of ourselves.

For storing the different tables we’ll use three different lakehouses: «landing», «enriched», «curated». I’m probably breaking most rules of the medallion architecture. I will basically put the ingested tables in landing, intermediate tables like translations in enriched, and in curated I will model the data in a way that is convenient to load into the model, even though we will still do some extra transformations there as well.

Oh, while we are at it: if you are about to create a Dataflow Gen2 that will write all tables to the same lakehouse, create the dataflow from the lakehouse! This will give all queries a default destination pointing to that lakehouse. You might need to clear the data destination for intermediate queries, but that is much faster than *adding* a data destination. Although the speed of the process has improved, it’s still a multi-click operation with several small delays.

 

Part 1: The Data

For reasons that will become clear later, it’s best to start translating the data and then move on to translate the report metadata.

Let’s make our goal clear. We need a way to get our data tables with extra columns holding translations for the columns that need them and don’t have them at the source. Then we’ll think about how we can systematically swap columns by language, but at any rate we need these extra columns first. To be very clear: we need the same tables, with the same field names, but with some extra fields containing the translations. Of course, we don’t want to translate everything every day (expensive, slow, and inaccurate, as automatic translations are definitely not fool-proof). We will also need some way of introducing translation corrections. So let’s go step by step.

Step 1: The landing

The tables, as we have them in our model, will become the «source» of our translation ETL. So yes, we need to move all the Power Query logic (and ideally any DAX columns too) upstream. If we can, we should create a view or materialized table with all the logic baked in, but if that’s impossible we’ll just replicate the code as the initial query in a Dataflow Gen 2. Remember that the names need to match exactly what we have in the model, with spaces, uppercase letters and so on. But of course we will not land the data as-is into a lakehouse. Lakehouses hate spaces, and remember you might have just about any crazy thing in the model, like Japanese column names (don’t ask). So even if we’ll have to use these names in the semantic model in the end, at least during the ETL process let’s use a sanitized version of them, which will let us automate a great deal of our ETL:

– First we’ll make sure that all fields follow snake case: all lowercase and underscores instead of spaces. No parentheses, periods or any special chars. Lakehouses like it this way.

– Check all the text fields of the table and, for those that contain actual words with meaning that need translation, add a suffix that signals it. For instance «color_english». If you are lucky enough to have translated fields in the source, also label them with a suffix for that language, for instance «month_short_catalan». Yes, in particular for the date table, you want to take care of the translations manually in the source or, at the latest, in the source query. I assure you that no automatic translator will correctly translate your month abbreviations into another language.

While we are at it, we will need to create a mapping table to be able to restore the original field names once we load the data into the semantic model. With some Power Query (or PySpark, if that’s your thing) this can be easily achieved. If you are doing it in Power Query, you can use this nice function I did.
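Just to give you the idea, here’s a minimal sketch of such a function, following the snake_case rules described above (the output names sanitized_table, field_mapping, source_name and destination_name are simply the convention I’ll assume for the rest of the post):

(input_table as table) as record =>
let
  original_names = Table.ColumnNames(input_table),
  // lowercase, spaces to underscores, drop anything that is not a-z, 0-9 or underscore
  // (non-Latin characters such as Japanese column names would need extra handling)
  to_snake = (name as text) as text =>
    Text.Select(Text.Replace(Text.Lower(name), " ", "_"), {"a".."z", "0".."9", "_"}),
  // mapping table used later to restore the original names in the semantic model
  mapping = Table.FromColumns(
    {original_names, List.Transform(original_names, to_snake)},
    {"source_name", "destination_name"}),
  renamed = Table.TransformColumnNames(input_table, to_snake)
in
  [sanitized_table = renamed, field_mapping = mapping]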

Both types of tables will be created into our landing lakehouse.

Step 2: Translation Table Mapping

This is something I did not think I needed until I was deep in the mud. In the process of analyzing which fields need translations and which already have them, we will need a mapping table that pairs each English field with its Catalan counterpart. This seems quite straightforward until you realize that this «field» needs to be written exactly as fields are defined in Power BI, meaning ‘Table Name'[Column Name] (even casing matters at this point, although it seems this will be fixed at some point). Thus, we will need a way to get the original table name in the model, but not just that. Sometimes we will have role-playing dimensions for date or employee or things like that. So a single table of our ETL may be more than one table in our model, and we need to be able to translate all the fields used in the report. Thus, we need a Translation Table Mapping. It sounds fancy, but it’s quite easy. It’s just a list with the name of each table we are translating, paired with the name of the table in the model, along with the model name. And as we just said, a table of the ETL might appear in two or more rows due to role-playing dimensions.
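To make it concrete, a few rows of such a mapping could look like this (all names here are made up; note how the same ETL table appears twice because of role-playing dates):

#table(
  type table [etl_table_name = text, model_table_name = text, model_name = text],
  {
    {"dim_date",     "Order Date",    "Sales Model"},
    {"dim_date",     "Delivery Date", "Sales Model"},
    {"dim_employee", "Employee",      "Sales Model"}
  })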

This mapping will most likely be defined in an Excel sheet in SharePoint, and with a Dataflow Gen 2 it is quite easy to replicate it in our landing lakehouse.

Step 3: The values

This is the step that generates the most tables: for each table in the ETL it will generate three tables. With Dataflows Gen2 this is pushing it a bit, because you are not supposed to go over 40 queries per dataflow, and you reach that number pretty quickly in a real-world project. As we’ll see in a minute, only the first of the three tables needs to be refreshed every day; the other two only need to be refreshed if the schema changes, and that’s it. If I ever implement this with notebooks, hopefully it will be as easy as an optional parameter «update_all = true» (I’m looking at you, dataflows…).

The values table

This table needs to contain all the field-value pairs that require a translation. It will then have just two fields: field name and value.
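As a rough sketch, assuming the «_english» suffix convention from step 1 (and ignoring, for simplicity, whether a Catalan column already exists in the source), the values table can be built by unpivoting the flagged columns; field_name_english and value_english are the column names I’ll keep using later on:

(source_table as table) as table =>
let
  // columns flagged for translation by the "_english" suffix
  english_columns = List.Select(Table.ColumnNames(source_table), each Text.EndsWith(_, "_english")),
  selected = Table.SelectColumns(source_table, english_columns),
  // one row per field-value pair, duplicates removed
  unpivoted = Table.UnpivotOtherColumns(selected, {}, "field_name_english", "value_english"),
  distinct_pairs = Table.Distinct(unpivoted)
in
  distinct_pairs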

The fields table

This table is required for later steps; it’s just the list of fields that require translation and whose translation is not available in the source (i.e. fields that we’ll have to translate!). It’s not strictly necessary to materialize it, but I chose to do so as I don’t like to calculate the same thing twice.
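A sketch of that list, again leaning on the suffix convention (the column name english_fields is the one the final function of step 6 relies on):

(source_table as table) as table =>
let
  names = Table.ColumnNames(source_table),
  english_fields = List.Select(names, each Text.EndsWith(_, "_english")),
  // keep only the fields without a "_catalan" counterpart already present in the source
  missing_translation = List.Select(english_fields,
    each not List.Contains(names, Text.Replace(_, "_english", "_catalan"))),
  as_table = Table.FromList(missing_translation, Splitter.SplitByNothing(), {"english_fields"})
in
  as_table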

The translation field mapping table

This table is one of the key tables we’ll use to translate the report metadata. It contains the mapping between the field names in the two languages for this table: all the fields to translate, including those that already have translations in the source. Using this table it will be possible to «translate» which fields need to be used in each visual. We’ll see that in part 2 of this series, once we get into the report metadata.
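For now, a minimal sketch of how it could be built, still in snake_case at this stage (the pairing by suffix is my assumption):

(source_table as table) as table =>
let
  names = Table.ColumnNames(source_table),
  english_fields = List.Select(names, each Text.EndsWith(_, "_english")),
  // all the fields to translate, including those already translated in the source
  catalan_fields = List.Transform(english_fields, each Text.Replace(_, "_english", "_catalan")),
  mapping = Table.FromColumns({english_fields, catalan_fields}, {"field_name_english", "field_name_catalan"})
in
  mapping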

All together it looks a bit like this:

Step 4: Translate the values

This is when you actually translate the values. It looks like a straightforward thing to do, but to do things properly you first need to join the values of the previous step with all the existing translations from previous runs, then check if there’s any new value that you need to translate, and if so, translate it with Microsoft Translator, DeepL or the API of your choice. I went with Microsoft Translator. And last but not least you need to fix those translations! (We’ll see this in the next step.)
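Just to give an idea of the translation call itself, here’s a rough sketch of hitting the Microsoft Translator Text API (v3.0) from Power Query; TranslatorKey and TranslatorRegion are placeholder parameters you’d have to provide, and in a real dataflow you’d batch the values rather than call it value by value:

(text_to_translate as text) as text =>
let
  // POST to the Translator endpoint asking for an English-to-Catalan translation
  response = Web.Contents(
    "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&from=en&to=ca",
    [
      Headers = [
        #"Ocp-Apim-Subscription-Key" = TranslatorKey,       // placeholder: your resource key
        #"Ocp-Apim-Subscription-Region" = TranslatorRegion, // placeholder: e.g. "westeurope"
        #"Content-Type" = "application/json"
      ],
      Content = Json.FromValue({[Text = text_to_translate]})
    ]),
  parsed = Json.Document(response),
  // the API returns a list with one item per input text
  translated = parsed{0}[translations]{0}[text]
in
  translated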

This part was tricky to deploy in different workspaces, because you try to deploy something that already works to a workspace where the lakehouse does not have the tables yet. How am I supposed to bring in previous translations when the translation table does not even exist?? Well, I built a function that fetches a table from a lakehouse with four parameters: workspace, lakehouse name, table name and «table to return if not found». So I create a table out of thin air with no rows but the right fields, and if the table does not exist yet I use that one. This way everything works from the start. Even if I refactor in PySpark I’ll have to replicate this logic, as it is so convenient! You even have the function in two flavors, lakehouse and data warehouse. To create the empty table you can also leverage this function of mine.
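The gist of it, sketched on top of the fxGetLakehouseTable helper (the table name here is just an example):

let
  // wrap the lakehouse call so that a missing table falls back to a default
  // instead of blowing up on the first deployment
  fxGetLakehouseTableOrDefault =
    (workspace as text, lakehouse as text, table_name as text, default_table as table) as table =>
      try fxGetLakehouseTable(workspace, lakehouse, table_name) otherwise default_table,
  // an empty translations table with the right fields, created out of thin air
  empty_translations = #table(
    type table [field_name_english = text, value_english = text, value_catalan = text],
    {}),
  // on the very first run the real table does not exist yet, so we get the empty one
  previous_translations = fxGetLakehouseTableOrDefault(
    WorkspaceName, enrichedLakehouseName, "product_translations", empty_translations)
in
  previous_translations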

 

 

In this step I found it convenient to create a timestamp query that makes it easy for me to determine when this dataflow was last refreshed. We’ll see why in the next step.
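The timestamp query itself is as small as it gets; something like this (the column name is up to you):

let
  // one-row table recording when this dataflow last ran
  last_refresh = #table(
    type table [last_dataflow_refresh = datetimezone],
    {{DateTimeZone.FixedUtcNow()}})
in
  last_refresh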

Step 5: Translation Corrections

Yes, correcting translations is not an afterthought. You will need to do it. The good news is that it is fairly straightforward. Just create a query that appends all the translation tables, along with a column holding their table name without the «_translations» suffix. Since I followed the same naming convention for all of them and they share the same field names, at least in Power Query it was quite easy to achieve. Of course, as this will be facing the business user, switch the snake case back to spaces with proper casing. Throughout this project I refined the functions I use to switch between casing styles for headers.
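A rough sketch of that append, with a hard-coded placeholder list of tables just for illustration:

let
  // placeholder list: in a real dataflow you would enumerate the *_translations tables
  translation_tables = {"product_translations", "employee_translations"},
  // fetch each table and tag every row with its base table name (suffix removed)
  tagged = List.Transform(
    translation_tables,
    (t) => Table.AddColumn(
      fxGetLakehouseTable(WorkspaceName, enrichedLakehouseName, t),
      "table_name",
      each Text.Replace(t, "_translations", ""),
      type text)),
  appended = Table.Combine(tagged)
in
  appended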

 

The reload button refreshes the table, but with some safety built in. Every time you type a new translation, the «Last manual modification» timestamp is updated through some nifty VBA code (I used to write VBA 8 hours a day for about 5 years). The «last reload» brings the time of the last refresh of the translation dataflow; that’s the extra table with the timestamp we mentioned in the previous step. Just be a bit careful with time zones and so on, but there are always solutions to problems. When you click reload, it checks that the last dataflow refresh happened after the last manual modification, and only then refreshes the table. Each refresh brings the translations table with an empty extra column in which to write the new translations. That’s why it’s so important to make sure all previous corrections have been baked into our translation repository; otherwise all the carefully hand-written translations would be lost. Since the file is in SharePoint you could probably recover a version where they still exist, but still.

Of course the Excel file is only half the way; now we need to get this data back into Fabric. This is fairly straightforward: just bring back the snake_case and keep only the rows that do have a new translation. We can bring it all as a single table into the landing lakehouse. A function will do the magic of picking up only the rows relevant to each table.

Oh, and since the input is just an Excel file… make sure to remove possible duplicates that someone may have knowingly or unknowingly added to the file! Why do I know this? Don’t ask.
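That picking-up function could be as simple as this (the column names are just the ones I’m assuming for the corrections table):

(corrections as table, target_table as text) as table =>
let
  // keep the corrections that belong to this table
  this_table = Table.SelectRows(corrections, each [table_name] = target_table),
  // drop rows where no new translation was actually typed in
  filled_in = Table.SelectRows(this_table, each [new_translation] <> null and [new_translation] <> ""),
  // and guard against duplicates someone added to the Excel file
  no_duplicates = Table.Distinct(filled_in, {"field_name_english", "value_english"})
in
  no_duplicates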

Step 6: Prepare tables for the semantic model

In this step we’ll rebuild the tables of the landing dataflow (step 1) but adding all the extra columns needed (which we have in the fields table of step 3), using our fancy translations table (step 4). This proved a challenge, since the number of joins depends on the number of fields that need translation. I’m quite proud of this bit, so let’s have a look at the Power Query:

The translations_field function returns the rows of the translations table for a certain field:

(translations_table as table, field as text) =>
let
  Source = translations_table,
  #"Filtered rows" = Table.SelectRows(Source, each ([field_name_english] = field))
in
  #"Filtered rows"

Since each table has its own translations table, the field name is enough to limit the translations to use.

The next function, translations_field_name, takes the table returned by the previous function and drills down to get the field name in its Catalan version (i.e. the name of the new column):

(translations_table as table, field as text) =>
let
  Source = translations_field(translations_table, field),
  #"Drill down" = Source{0}[field_name_catalan]
in
  #"Drill down"

Now we can build another function fx_table_with_translated_field that adds a single given column to our table using the above functions:

(source_table as table, translations_table as table, field as text) =>
let
  Source = source_table,
  translations_field_table = translations_field(translations_table, field),
  #"Merged queries" = 
    if Table.RowCount(translations_field_table) > 0 then 
      Table.NestedJoin(Source, {field}, translations_field_table, {"value_english"}, "join_field", JoinKind.LeftOuter)
    else
      Source,
  #"Expanded translation" = 
    if Table.RowCount(translations_field_table) > 0 then 
      Table.ExpandTableColumn(#"Merged queries", "join_field", {"value_catalan"}, {translations_field_name(translations_table, field)})
    else
      #"Merged queries"
in
  #"Expanded translation"

As you can see, I had to make this function robust in case there were no translations for the given table, which may happen for tables that already have all their translations in the source. (Of course we could choose not to add them to this dataflow, but I wanted everything to follow the same ETL.)

This is all good, but how can we make this process iterate through all the fields that require translation and add as many columns as necessary, each with the right field name?

Have a look at fx_table_with_translated_fields (notice the s):

(baseTableName as text) as table =>
let
  source_table = 
    fxGetLakehouseTable(WorkspaceName, landingLakehouseName, baseTableName),
  
  translations_table = 
    fxGetLakehouseTable(WorkspaceName, enrichedLakehouseName, baseTableName & "_translations"),

  source_table_fields = 
    fxGetLakehouseTable(WorkspaceName, enrichedLakehouseName, baseTableName & "_fields"),

  Source = 
    List.Accumulate(
      source_table_fields[english_fields],
      source_table,
      (output_table,current_field) => 
        fx_table_with_translated_field(output_table,translations_table,current_field))
in
  Source

Notice that inside the List.Accumulate we convert the single-column fields table into a list of fields over which we iterate. The seed of our accumulation is our source table, the one from the landing dataflow. And then we use this table and the current field of the iteration as arguments for fx_table_with_translated_field. This is one of those things you see in tutorials and think «that’s cool, but I’ll never manage it myself». When I saw it working I almost cried with joy.

That’s all we’ll do in dataflows, but let’s not forget we need to restore the original field names, as we don’t want to break anything in our original report. Let’s dive into that in the next step.

Step 7: Restore original field names when loading to the semantic model

Lakehouses don’t like spaces, but semantic models do, and we should use them! That’s why it’s now time to restore the original field names. But wait, for that we’ll need the field mapping that we created in our landing lakehouse. It’s not good manners to grab tables from the landing lakehouse when loading a semantic model, but at the same time we don’t need any transformation on it that would justify yet more dataflows. Well, fear not, because there’s something called «shortcuts». Basically, in our curated lakehouse we can create a shortcut to the tables in the landing lakehouse, and it will be just as if the tables were there, but without any data movement! Cool, isn’t it?

With this in place we just need a function that retrieves both the main table and the mapping table from our curated lakehouse and restores the «source» field names. This sounds very tricky but it’s actually quite straightforward with the right functions. The first step is to understand how Power Query does field renames. It does it through the Table.RenameColumns function, providing a table and a list of lists, with the inner lists being just pairs of old name-new name. So the question is how we can build such a thing from the field_mapping table. We know we can get the values of a column as a list by just calling #"table name"[column name]. But how could we create this list of little lists, pairing the first item of the first list with the first item of the second list and so on? Well, this is not in the UI, but there’s a function called List.Zip that does just this. With all this knowledge in place, let’s see what the function looks like!

fxGetRemappedTable:

(TableName as text) =>
let
    mapping = fxGetLakehouseTable(WorkspaceName,curatedLakehouseName,TableName & "_field_mapping"),
    fieldPairs = List.Zip({mapping[destination_name],mapping[source_name]}),
    translatedTable = fxGetLakehouseTable(WorkspaceName,curatedLakehouseName,TableName),
    #"Renamed Columns" = Table.RenameColumns(translatedTable,fieldPairs)
in
    #"Renamed Columns"

Just four steps! Ok, with a couple of calls to the fxGetLakehouseTable function (much nicer than having all those GUIDs in your code, if you ask me).
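And in case List.Zip is new to you, here’s a tiny illustration of what it returns (made-up names):

let
  // List.Zip pairs the n-th items of its input lists, producing exactly the
  // list of {old name, new name} pairs that Table.RenameColumns expects
  pairs = List.Zip({{"product_name", "unit_price"}, {"Product Name", "Unit Price"}})
  // result: {{"product_name", "Product Name"}, {"unit_price", "Unit Price"}}
in
  pairs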

Ok, but there’s also the business of replacing the original query pointing to the database with the new query pointing to the curated lakehouse. There are a couple of approaches. The fearless approach is to create the new query pointing to the lakehouse, grab all its code, open the advanced editor on the old query and replace it in full. This is a bit scary, in particular if for some reason you want to go back to the database code. Call me chicken, but what I do is as follows:

Let’s say our product table is called «Product Catalog» in our semantic model:

  1. Create a new query «Product Catalog (Lakehouse)» using the fxGetRemappedTable function.
  2. Check that columns and types match for the fields contained in both this table and the original «Product Catalog» table.
  3. Add a custom step at the end of the «Product Catalog» query.
  4. Right-click on the new step and select «Extract previous».
  5. Name the new query «Product Catalog (Database)».
  6. Disable load on this new query and place it right before the original one (I wish it would do this by default).
  7. On the original query, remove the custom step. You now have only one step, pointing to «Product Catalog (Database)».
  8. Edit the step formula, replacing the word «Database» with «Lakehouse» so that it points to «Product Catalog (Lakehouse)» (see the sketch below).
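After step 8, the original «Product Catalog» query is reduced to a single step that simply points at the lakehouse version, something like this:

// "Product Catalog" after the swap
let
  Source = #"Product Catalog (Lakehouse)"
in
  Source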

This might not be for everyone, as you end up with quite a few more queries than you had originally, but it gives me peace of mind that I can switch back if things go wrong. Of course you can do a cleanup at some point by placing the code in the final query and removing the two intermediate queries.

Anyway, what does the final result look like? Well, it looks almost perfect!

As you can see, the new fields do not use proper casing, but to be honest I just let it be. For reasons that we’ll see in the second part of this series, *do not rename the Catalan fields at this point*, as it will break the magic. If you want to apply some automatic transformations, these will need to be replicated in the translation field mapping table as well.

Well and that’s all for now!

Hopefully I can put together the second part of this series sometime soon.

Thank you for reading, and now let’s continue the conversation on LinkedIn and Bluesky. (I will not be posting on Twitter for the time being.)