Jump to content

I'll try to explain the situation. We need to extract some data from a DB we don't have access to. We were provided an API in python that returns the data in a .csv file.

 

This data is then inserted in a table of a DB (Oracle) to which we have access to. When I perform a select on it, a character in particular is not recognized, “—” (U+2014). I was able to figure out the correct character set (WE8MSWIN1252), but it breakes all other accented characters (áéíóèòì)

Those are the character sets I've tried:

  • US7ASCII: US 7-bit ASCII character set

  • WE8ISO8859P1: ISO 8859-1 West European 8-bit character set

  • EE8MSWIN1250: Microsoft Windows East European Code Page 1250

  • WE8MSWIN1252: Microsoft Windows West European Code Page 1252

  • WE8EBCDIC1047: IBM West European EBCDIC Code Page 1047

  • JA16SJISTILDE: Japanese Shift-JIS Character Set, compatible with MS Code Page 932

  • ZHT16MSWIN950: Microsoft Windows Traditional Chinese Code Page 950

  • UTF8: Unicode 3.0 Universal character set CESU-8 encoding form

  • AL32UTF8: Unicode 5.0 Universal character set UTF-8 encoding form

(yes, I tried chinese and japanese, in an act of desperation)

 

The database is using UTF8 for NCHAR and AL32UTF8 as character sets.

 

Btw, I'm using CONVERT in the outer select of the query. Tried both with and without the source character set argument.

 

Are there better ways to do it? What am I missing?

MOTHERBOARD: ASRock H97 Pro4 CPU: Intel Core i5-4460 @3.30 Ghz Intel Xeon E3-1271v3 @4.00 Ghz RAM: 32Gb (4x8Gb) Kingstone HyperX Fury DDR3@1600 Mhz (9-9-9-27)

GPU: MSI 390 8Gb Gaming Edition PSU: XFX TS 650w Bronze Enermax Revolution D.F. 650w 80+ Gold MOUSE: Logitech G502 Proteus Spectrum KEYBOARD: Monokey Standard Suave Blue

STORAGE: SSD Samsung EVO 850 250Gb // HDD WD Green 1Tb // HDD WD Blue 4Tb // HDD WD Blue 160Gb CASE: Fractal Design Define R5 Windowed OS: Windows 11 Pro x64 Bit

MONITORS: Samsung CFG7 C24FG7xFQ @144hz // Samsung SyncMaster TA350 LT23A350 @60hz Samsung Odyssey G7 COOLER: Noctua NH-D15

 

Link to comment
https://linustechtips.com/topic/1257247-character-set-problems/
Share on other sites

Link to post
Share on other sites

In Windows 1252 , the emdash character is 0x97 , a single byte.

It UTF-8, it's 0x2014, which is encoded in 3 bytes : 0xe2 0x80 0x94

 

You have to convert anything you insert into the database into utf-8, or set the table to binary and include the codepage with every string

 

<?php

function dumpstring($text) {
	echo strlen($text);
	for ($i=0;$i<strlen($text);$i++) { 
		$code = ord(substr($text,$i,1));
		echo ' 0x'.str_pad(dechex($code),2,'0',STR_PAD_LEFT);
	}
	echo '<br/>';
}

$text = chr(0x97);
$text2 = mb_convert_encoding($text,'utf-8','Windows-1252');
dumpstring($text);
dumpstring($text2);
die();
?>
1 0x97
3 0xe2 0x80 0x94

 

 

// edit: also, why don't you just open the csv file with a hexadecimal editor / viewer to see if the csv file has some BOM header at the beginning and to see how exactly is that emdash encoded.

Link to comment
https://linustechtips.com/topic/1257247-character-set-problems/#findComment-14113149
Share on other sites

Link to post
Share on other sites

2 hours ago, Sauron said:

If you know the standard you could substitute the offending character with the correct one for the rest of the string

Like, with a replace? That could work. I'll definitely keep this in mind for short-term. But I don't know if there are any more screwed up characters tho.

 

@mariushm  I will check tomorrow, didn't know that you could check the encoding of the csv by looking at the header. I don't think binary is an option in our case.

 

MOTHERBOARD: ASRock H97 Pro4 CPU: Intel Core i5-4460 @3.30 Ghz Intel Xeon E3-1271v3 @4.00 Ghz RAM: 32Gb (4x8Gb) Kingstone HyperX Fury DDR3@1600 Mhz (9-9-9-27)

GPU: MSI 390 8Gb Gaming Edition PSU: XFX TS 650w Bronze Enermax Revolution D.F. 650w 80+ Gold MOUSE: Logitech G502 Proteus Spectrum KEYBOARD: Monokey Standard Suave Blue

STORAGE: SSD Samsung EVO 850 250Gb // HDD WD Green 1Tb // HDD WD Blue 4Tb // HDD WD Blue 160Gb CASE: Fractal Design Define R5 Windowed OS: Windows 11 Pro x64 Bit

MONITORS: Samsung CFG7 C24FG7xFQ @144hz // Samsung SyncMaster TA350 LT23A350 @60hz Samsung Odyssey G7 COOLER: Noctua NH-D15

 

Link to comment
https://linustechtips.com/topic/1257247-character-set-problems/#findComment-14113457
Share on other sites

Link to post
Share on other sites

11 hours ago, Parideboy said:

Like, with a replace? That could work. I'll definitely keep this in mind for short-term. But I don't know if there are any more screwed up characters tho.

 

Unfortunately there's no clean way to figure out if something is in the wrong encoding, it's just a bunch of numbers 🤷‍♂️ you'll have to add exceptions as they come up

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

Link to comment
https://linustechtips.com/topic/1257247-character-set-problems/#findComment-14114859
Share on other sites

Link to post
Share on other sites

Good news everyone!

 

The csv file is encoded in UTF8, but apparently the problem was the sql loader that was using some other character set for some reason.

Now everything is showing as it should, thanks for your help!

MOTHERBOARD: ASRock H97 Pro4 CPU: Intel Core i5-4460 @3.30 Ghz Intel Xeon E3-1271v3 @4.00 Ghz RAM: 32Gb (4x8Gb) Kingstone HyperX Fury DDR3@1600 Mhz (9-9-9-27)

GPU: MSI 390 8Gb Gaming Edition PSU: XFX TS 650w Bronze Enermax Revolution D.F. 650w 80+ Gold MOUSE: Logitech G502 Proteus Spectrum KEYBOARD: Monokey Standard Suave Blue

STORAGE: SSD Samsung EVO 850 250Gb // HDD WD Green 1Tb // HDD WD Blue 4Tb // HDD WD Blue 160Gb CASE: Fractal Design Define R5 Windowed OS: Windows 11 Pro x64 Bit

MONITORS: Samsung CFG7 C24FG7xFQ @144hz // Samsung SyncMaster TA350 LT23A350 @60hz Samsung Odyssey G7 COOLER: Noctua NH-D15

 

Link to comment
https://linustechtips.com/topic/1257247-character-set-problems/#findComment-14115238
Share on other sites

Link to post
Share on other sites

1 hour ago, Sakuriru said:

I was just about to post that it could be a bug on their end lol

 

I'm glad it's all worked out!

Yeah, that was going to be our next step lol

MOTHERBOARD: ASRock H97 Pro4 CPU: Intel Core i5-4460 @3.30 Ghz Intel Xeon E3-1271v3 @4.00 Ghz RAM: 32Gb (4x8Gb) Kingstone HyperX Fury DDR3@1600 Mhz (9-9-9-27)

GPU: MSI 390 8Gb Gaming Edition PSU: XFX TS 650w Bronze Enermax Revolution D.F. 650w 80+ Gold MOUSE: Logitech G502 Proteus Spectrum KEYBOARD: Monokey Standard Suave Blue

STORAGE: SSD Samsung EVO 850 250Gb // HDD WD Green 1Tb // HDD WD Blue 4Tb // HDD WD Blue 160Gb CASE: Fractal Design Define R5 Windowed OS: Windows 11 Pro x64 Bit

MONITORS: Samsung CFG7 C24FG7xFQ @144hz // Samsung SyncMaster TA350 LT23A350 @60hz Samsung Odyssey G7 COOLER: Noctua NH-D15

 

Link to comment
https://linustechtips.com/topic/1257247-character-set-problems/#findComment-14115451
Share on other sites

Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×